Train LLMs with reinforcement learning using SkyRL

This tutorial shows you how to set up and run reinforcement learning training for LLMs using SkyRL on Anyscale.

SkyRL is a modular reinforcement learning (RL) library designed for training large language models (LLMs). Developed by the Berkeley Sky Computing Lab in collaboration with Anyscale, it provides a flexible framework for implementing RL algorithms (such as PPO, GRPO, and DAPO), tool-use tasks, and multi-turn agentic workflows.

Anyscale supports a number of post-training libraries for LLM fine-tuning. See Choose a framework for LLM post-training.

Configure your workspace

To create and configure your workspace, see Workspaces. The following steps are essential for setting up the correct environment to run SkyRL training workloads.

Set up the Docker image

Use the pre-built SkyRL Docker image for your workspace:

novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8
note

This image contains all necessary dependencies for SkyRL training with Ray 2.48.0, Python 3.12, and CUDA 12.8.

Configure compute

This workload uses a single node with 4xL4 GPUs. Configure your compute with the following settings:

  1. Select a CPU instance type similar to m5.2xlarge, with 8 vCPUs and 32 GiB of memory.
  2. Select a 4xL4 GPU type for the worker node. If 4xL4 isn't available, choose an equivalent GPU configuration. Set the autoscaling parameters to Min nodes: 0 and Max nodes: 1 so the node scales down when idle and only one node is used.
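
Optionally, confirm that Ray sees the expected resources before launching anything GPU-heavy. This is a quick sanity check from the workspace terminal; with Min nodes set to 0, the GPU worker may only scale up once a workload requests GPUs:

# Summarize the cluster resources Ray can schedule on (CPUs, GPUs, memory).
ray status

The output lists total and used CPUs and GPUs, so you can confirm the 4xL4 worker node is available (or being requested) before you start training.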

Set up the SkyRL repository

  1. In your Anyscale workspace terminal, clone the SkyRL repository:

    git clone https://github.com/NovaSky-AI/SkyRL.git
  2. Change to the SkyRL training directory:

    cd SkyRL/skyrl-train

For more detailed setup information, see the SkyRL Quickstart documentation.

Prepare the dataset

SkyRL requires datasets in a specific Parquet format. This example uses the GSM8K math word-problem dataset.

In your workspace terminal from the SkyRL/skyrl-train directory, run the following command:

uv run --isolated examples/gsm8k/gsm8k_dataset.py --output_dir /mnt/cluster_storage/data/gsm8k
important

You must use shared storage across nodes such as /mnt/cluster_storage to ensure all Ray workers can access the data.

This script converts the GSM8K dataset from Hugging Face into two Parquet files with the schema required for instruction-tuning or RL-style training:

  • train.parquet - Training data.
  • validation.parquet - Validation data.

The dataset schema must follow the SkyRL dataset preparation format.
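
To sanity-check the output, you can list the files and print each Parquet file's row count and schema. This step is optional; it assumes pyarrow is importable in the project environment (the dataset tooling already depends on it), and the actual column names are defined by the SkyRL dataset format rather than this snippet:

# Confirm both Parquet files exist in shared storage.
ls -lh /mnt/cluster_storage/data/gsm8k/

# Print row counts and column schemas for both splits.
uv run --isolated python -c "
import pyarrow.parquet as pq
for split in ('train', 'validation'):
    table = pq.read_table(f'/mnt/cluster_storage/data/gsm8k/{split}.parquet')
    print(split, table.num_rows, 'rows')
    print(table.schema)
"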

Configure the training parameters

Training configuration

Edit the configuration file at examples/gsm8k/run_gsm8k.sh with the following settings:

set -x

# Colocated GRPO training+generation for Qwen2.5-1.5B-Instruct on GSM8K.

# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/gsm8k/run_gsm8k.sh

# NOTE: `micro_train_batch_size_per_gpu` and `micro_forward_batch_size_per_gpu` can be tuned

DATA_DIR="/mnt/cluster_storage/data/gsm8k"
NUM_GPUS=4
LOGGER="console"

INFERENCE_BACKEND="vllm"

uv run --isolated --extra $INFERENCE_BACKEND -m skyrl_train.entrypoints.main_base \
data.train_data="['$DATA_DIR/train.parquet']" \
data.val_data="['$DATA_DIR/validation.parquet']" \
trainer.algorithm.advantage_estimator="grpo" \
trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
trainer.placement.colocate_all=true \
trainer.strategy=fsdp2 \
trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
generator.num_inference_engines=$NUM_GPUS \
generator.inference_engine_tensor_parallel_size=1 \
trainer.epochs=2 \
trainer.eval_batch_size=512 \
trainer.eval_before_train=true \
trainer.eval_interval=5 \
trainer.update_epochs_per_batch=1 \
trainer.train_batch_size=256 \
trainer.policy_mini_batch_size=64 \
trainer.micro_forward_batch_size_per_gpu=16 \
trainer.micro_train_batch_size_per_gpu=16 \
trainer.ckpt_interval=10 \
trainer.max_prompt_length=512 \
generator.sampling_params.max_generate_length=1024 \
trainer.policy.optimizer_config.lr=1.0e-6 \
trainer.algorithm.use_kl_loss=true \
generator.backend=$INFERENCE_BACKEND \
generator.run_engines_locally=true \
generator.weight_sync_backend=nccl \
generator.async_engine=true \
generator.batched=true \
environment.env_class=gsm8k \
generator.n_samples_per_prompt=5 \
generator.gpu_memory_utilization=0.8 \
trainer.logger="$LOGGER" \
trainer.project_name="gsm8k" \
trainer.run_name="gsm8k_test" \
trainer.resume_mode=null \
trainer.ckpt_path="/mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt" \
$@

The following are the key parameters you should modify from the default example:

  • Data directory: Set DATA_DIR="/mnt/cluster_storage/data/gsm8k" to use the shared storage location where you prepared the dataset.

    note

    If you launch the training workload as an Anyscale job, use a shared location such as /mnt/shared_storage or /mnt/user_storage. Jobs spawn a separate Ray cluster from your workspace, so /mnt/cluster_storage isn't shared between them. See Create and manage jobs.

  • Training epochs: Set trainer.epochs=2 (reduced from 20) to decrease training time for this demo.

  • Experiment tracking: Set LOGGER="wandb" if you have W&B access and configure your WANDB_API_KEY in the workspace environment variables. Otherwise, set LOGGER="console" to print training logs to stdout.

  • Checkpoint directory: Set trainer.ckpt_path="/mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt" to save checkpoints to cluster storage.

    tip

    For this demo, checkpoints are stored in /mnt/cluster_storage/. For larger LLMs, it's recommended to save checkpoints to local storage during training and then upload them to artifact storage for long-term persistence and easier access; a rough upload sketch follows this list.

  • Batch size parameters: The configuration uses trainer.micro_forward_batch_size_per_gpu=16 and trainer.micro_train_batch_size_per_gpu=16. If you encounter GPU out-of-memory errors, reduce these batch sizes, as in the sketch after this list.

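For example, if training fails with CUDA out-of-memory errors on the L4s, a reasonable first step is to halve the micro-batch sizes in run_gsm8k.sh. The right values depend on your GPUs and sequence lengths, so treat these numbers as a starting point:

trainer.micro_forward_batch_size_per_gpu=8 \
trainer.micro_train_batch_size_per_gpu=8 \

Lowering generator.gpu_memory_utilization (set to 0.8 above) is another knob worth trying if the memory pressure comes from the inference engine.
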
For more information about GRPO configuration on the GSM8K dataset, see the SkyRL quickstart documentation.
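
For the larger-model case mentioned in the checkpoint tip above, the upload step could look roughly like the following sketch. The local path and destination prefix are placeholders, the $ANYSCALE_ARTIFACT_STORAGE variable is typically set in Anyscale workspaces but should be verified in your environment, and the aws CLI assumes an AWS-backed cloud (use the equivalent tool, such as gsutil, elsewhere):

# Hypothetical local checkpoint directory used during training.
LOCAL_CKPT_DIR="/mnt/local_storage/ckpts/gsm8k_1.5B_ckpt"

# Copy the checkpoints to artifact storage for long-term persistence.
aws s3 cp --recursive "$LOCAL_CKPT_DIR" "$ANYSCALE_ARTIFACT_STORAGE/skyrl/gsm8k_1.5B_ckpt/"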

Launch the training process

In your workspace terminal from the SkyRL/skyrl-train directory, start the training process:

bash examples/gsm8k/run_gsm8k.sh

The training process completes the following steps:

  1. Load the prepared GSM8K dataset.
  2. Initialize the model and training configuration.
  3. Run the GRPO training loop.
  4. Save checkpoints and model artifacts.
tip

Monitor your training progress through the Ray dashboard or your configured experiment tracking system (W&B or console logs).
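
After the first checkpoint interval (or once training completes), you can confirm that checkpoints were written to the configured path. The directory layout inside each checkpoint is determined by SkyRL, so this is just a quick existence check:

# Checkpoints are written every trainer.ckpt_interval steps to trainer.ckpt_path.
ls -lh /mnt/cluster_storage/ckpts/gsm8k_1.5B_ckpt/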

Next steps

After completing this tutorial, you can adapt this workflow to other datasets, other RL algorithms such as PPO or DAPO, or the tool-use and multi-turn agentic tasks that SkyRL supports.
