Train LLMs with reinforcement learning using SkyRL
This tutorial shows you how to set up and run reinforcement learning training for LLMs using SkyRL on Anyscale.
SkyRL is a modular reinforcement learning (RL) library designed for training large language models (LLMs). Developed by the Berkeley Sky Computing Lab in collaboration with Anyscale, it provides a flexible framework for implementing RL algorithms (such as PPO, GRPO, and DAPO), tool-use tasks, and multi-turn agentic workflows.
Anyscale supports a number of post-training libraries for LLM fine-tuning. See Choose a framework for LLM post-training.
Configure your workspace
To create and configure your workspace, see Workspaces. The following steps are essential for setting up the correct environment to run SkyRL training workloads.
Set up the Docker image
Use the pre-built SkyRL Docker image for your workspace:
novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8
This image contains all necessary dependencies for SkyRL training with Ray 2.48.0, Python 3.12, and CUDA 12.8.
Configure compute
This workload uses a single node with 4xL4 GPUs. Configure your compute with the following settings:
- Select a CPU instance similar to m5.2xlarge with 8 vCPU and 32 GiB memory.
- Select the 4xL4 GPU type for the worker node. If 4xL4 isn't available, choose an equivalent GPU configuration.
- Set autoscaling parameters to Min nodes: 0 and Max nodes: 1 to allow the node to scale down when idle and ensure only one node is used.
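After the workspace starts, you can optionally verify that Ray sees the expected hardware. This is a minimal sanity check; because Min nodes is 0, the GPU worker (and its 4 GPUs) only appears in the output after the node has scaled up.

```bash
# Print the cluster's resource summary, including CPU and GPU counts.
# The 4x L4 worker shows up only after it has scaled up from zero.
ray status
```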
Set up the SkyRL repository
- In your Anyscale workspace terminal, clone the SkyRL repository:
  git clone https://github.com/NovaSky-AI/SkyRL.git
- Change to the SkyRL training directory:
  cd SkyRL/skyrl-train
For more detailed setup information, see the SkyRL Quickstart documentation.
Prepare the dataset
SkyRL requires datasets in a specific Parquet format. This example uses the GSM8K math word-problem dataset.
In your workspace terminal, from the SkyRL/skyrl-train directory, run the following command:
uv run --isolated examples/gsm8k/gsm8k_dataset.py --output_dir /mnt/cluster_storage/data/gsm8k
You must use storage that's shared across nodes, such as /mnt/cluster_storage, to ensure all Ray workers can access the data.
This script converts the GSM8K dataset from Hugging Face into two Parquet files with the schema required for instruction-tuning or RL-style training:
- train.parquet - Training data.
- validation.parquet - Validation data.
The dataset schema must follow the SkyRL dataset preparation format.
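To confirm the dataset prepared correctly, you can list the output directory and optionally peek at the schema. The Python one-liner below assumes pandas with a Parquet reader is available in the SkyRL image; the printed columns, not this tutorial, are the source of truth for the schema.

```bash
# Confirm both Parquet files landed in shared storage.
ls -lh /mnt/cluster_storage/data/gsm8k

# Optional: print the column names and the first record.
# Assumes pandas (with a Parquet engine) is available in the image.
python -c "import pandas as pd; df = pd.read_parquet('/mnt/cluster_storage/data/gsm8k/train.parquet'); print(df.columns.tolist()); print(df.iloc[0])"
```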
Configure the training parameters
Training configuration
Edit the configuration file at examples/gsm8k/run_gsm8k.sh with the following settings:
set -x
# Colocated GRPO training+generation for Qwen2.5-1.5B-Instruct on GSM8K.
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/gsm8k/run_gsm8k.sh
# NOTE: `micro_train_batch_size_per_gpu` and `micro_forward_batch_size_per_gpu` can be tuned
DATA_DIR="/mnt/cluster_storage/data/gsm8k"
NUM_GPUS=4
LOGGER="console"
INFERENCE_BACKEND="vllm"
uv run --isolated --extra $INFERENCE_BACKEND -m skyrl_train.entrypoints.main_base \
data.train_data="['$DATA_DIR/train.parquet']" \
data.val_data="['$DATA_DIR/validation.parquet']" \
trainer.algorithm.advantage_estimator="grpo" \
trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
trainer.placement.colocate_all=true \
trainer.strategy=fsdp2 \
trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
generator.num_inference_engines=$NUM_GPUS \
generator.inference_engine_tensor_parallel_size=1 \
trainer.epochs=2 \
trainer.eval_batch_size=512 \
trainer.eval_before_train=true \
trainer.eval_interval=5 \
trainer.update_epochs_per_batch=1 \
trainer.train_batch_size=256 \
trainer.policy_mini_batch_size=64 \
trainer.micro_forward_batch_size_per_gpu=16 \
trainer.micro_train_batch_size_per_gpu=16 \
trainer.ckpt_interval=10 \
trainer.max_prompt_length=512 \
generator.sampling_params.max_generate_length=1024 \
trainer.policy.optimizer_config.lr=1.0e-6 \
trainer.algorithm.use_kl_loss=true \
generator.backend=$INFERENCE_BACKEND \
generator.run_engines_locally=true \
generator.weight_sync_backend=nccl \
generator.async_engine=true \
generator.batched=true \
environment.env_class=gsm8k \
generator.n_samples_per_prompt=5 \
generator.gpu_memory_utilization=0.8 \
trainer.logger="$LOGGER" \
trainer.project_name="gsm8k" \
trainer.run_name="gsm8k_test" \
trainer.resume_mode=null \
trainer.ckpt_path="/mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt" \
$@
The following are the key parameters you should modify from the default example:
- Data directory: Set DATA_DIR="/mnt/cluster_storage/data/gsm8k" to use the shared storage location where you prepared the dataset.
  Note: If you launch the training workload as an Anyscale job, use a shared location such as /mnt/shared_storage or /mnt/user_storage. Jobs spawn a separate Ray cluster from your workspace, so /mnt/cluster_storage isn't shared between them. See Create and manage jobs.
- Training epochs: Set trainer.epochs=2 (reduced from 20) to decrease training time for this demo.
- Experiment tracking: Set LOGGER="wandb" if you have W&B access and configure your WANDB_API_KEY in the workspace environment variables. Otherwise, set LOGGER="console" to print training logs to stdout.
- Checkpoint directory: Set trainer.ckpt_path="/mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt" to save checkpoints to cluster storage.
  Tip: For this demo, checkpoints are stored in /mnt/cluster_storage/. For larger LLMs, it's recommended to save checkpoints to local storage during training and then upload them to artifact storage for long-term persistence and easier access.
- Batch size parameters: The configuration uses trainer.micro_forward_batch_size_per_gpu=16 and trainer.micro_train_batch_size_per_gpu=16. If you encounter GPU out-of-memory errors, reduce these batch size parameters, as shown in the example after this list.
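Because run_gsm8k.sh forwards any extra arguments through $@, you can also apply these kinds of overrides at launch time. The following sketch assumes the same key=value override syntax used inside the script; if your build rejects duplicate overrides for keys already set in the script, edit the values in the script instead.

```bash
# Extra key=value overrides are passed through "$@" to the training entrypoint.
bash examples/gsm8k/run_gsm8k.sh \
  trainer.micro_train_batch_size_per_gpu=8 \
  trainer.micro_forward_batch_size_per_gpu=8
```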
For more information about GRPO configuration on the GSM8K dataset, see the SkyRL quickstart documentation.
Launch the training process
In your workspace terminal, from the SkyRL/skyrl-train directory, start the training process:
bash examples/gsm8k/run_gsm8k.sh
The training process completes the following steps:
- Load the prepared GSM8K dataset.
- Initialize the model and training configuration.
- Run the GRPO training loop.
- Save checkpoints and model artifacts.
Monitor your training progress through the Ray dashboard or your configured experiment tracking system (W&B or console logs).
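With trainer.ckpt_interval=10 and the checkpoint path configured above, you can confirm that checkpoints are being written to cluster storage once training has progressed far enough. The exact directory layout inside the checkpoint path depends on SkyRL's checkpointing format, so treat this as a quick existence check:

```bash
# List checkpoints written to the path configured by trainer.ckpt_path.
ls -lh /mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt
```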
Next steps
After completing this tutorial, you can:
- Experiment with different datasets by following the dataset preparation schema.
- Adjust hyperparameters in the configuration file for better performance.
- Scale up training by modifying the GPU configuration.
- Deploy your trained LLM using Ray Serve LLM on Anyscale. See Serve LLMs with Anyscale services.
- Explore other SkyRL examples for PPO training, code generation, and search tasks.
Resources
- GitHub repository: github.com/NovaSky-AI/SkyRL
- Documentation: skyrl.readthedocs.io
- Discord community: NovaSky @ Berkeley