Train LLMs with reinforcement learning using verl
This tutorial shows you how to set up and run reinforcement learning training for LLMs using verl on Anyscale.
verl is a flexible and efficient reinforcement learning framework for training large language models, developed by Volcengine. It provides a modular architecture for implementing RL algorithms such as PPO and GRPO, with optimized support for distributed training and inference.
Anyscale supports a number of post-training libraries for LLM fine-tuning. See Choose a framework for LLM post-training.
Configure your workspace
To create and configure your workspace, see Workspaces. The following steps are essential for setting up the correct environment to run verl training workloads.
Set up the Docker image
Use the pre-built verl Docker image for your workspace:
```
novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8
```
This Docker image is also used for SkyRL training and comes pre-installed with all the dependencies required for verl, including Ray 2.48.0, Python 3.12, and CUDA 12.8. You do not need to install any additional system packages.
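To confirm the workspace picked up the image, you can optionally run a quick check in the workspace terminal (exact versions depend on the image build):

```bash
# Print the Ray, PyTorch, and CUDA versions bundled with the image.
python -c "import ray, torch; print(ray.__version__, torch.__version__, torch.version.cuda)"
```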
Configure compute
This workload uses a single node with 4xL4 GPUs. Configure your compute with the following settings:
- Select a CPU instance similar to `m5.2xlarge`, with 8 vCPU and 32 GiB memory.
- Select the `4xL4` GPU type for the worker node. If `4xL4` isn't available, choose an equivalent GPU configuration.
- Set autoscaling parameters to `Min nodes: 0` and `Max nodes: 1` to allow the node to scale down when idle and to ensure only one node is used.
- (Optional) Enable Cross zone autoscaling for better resource availability.
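Once the worker node has scaled up (with `Min nodes: 0`, it may not start until a workload requests GPUs), you can confirm that the cluster sees the expected accelerators:

```bash
# Check cluster resources from the workspace terminal.
# With a 4xL4 worker up, the resource summary should include 4 GPUs.
ray status
```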
Set up the verl repository
In your workspace terminal, clone the verl repository:
```bash
git clone https://github.com/volcengine/verl.git
```
Update project dependencies
Replace the original `pyproject.toml` file with a compatible version for Anyscale. In your workspace terminal, do the following:
- Navigate to the verl directory:

```bash
cd verl
```
- Create a new `pyproject.toml` file with the following content:

```toml
# -------------------------------
# build-system
# -------------------------------
[build-system]
requires = [
"setuptools>=61.0",
"wheel"
]
build-backend = "setuptools.build_meta"
# -------------------------------
# project (PEP 621 metadata)
# -------------------------------
[project]
name = "verl"
# We'll mark the version as "dynamic" because it's read from the file "verl/version/version"
# (PEP 621 calls this "dynamic version").
# The actual version is specified in the [tool.setuptools.dynamic] section below.
dynamic = ["version", "authors", "urls"]
description = "verl: Volcano Engine Reinforcement Learning for LLM"
license = {file = "LICENSE"} # or "Apache-2.0", if you prefer an SPDX identifier
readme = {file = "README.md", content-type = "text/markdown"}
requires-python = ">=3.12"
dependencies=[
"flash-attn@https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiFALSE-cp312-cp312-linux_x86_64.whl",
"accelerate",
"codetiming",
"datasets",
"dill",
"hydra-core",
"numpy",
"pandas",
"peft",
"pyarrow>=19.0.0",
"pybind11",
"pylatexenc",
"ray==2.48.0",
"torchdata",
"tensordict<=0.6.2",
"transformers>=4.51.3, <4.54.0",
"wandb",
"packaging>=20.0",
"tensordict<=0.6.2",
"vllm==0.9.2",
"flashinfer-python@https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl",
"torch==2.7.0",
"torchvision",
"debugpy>=1.8.0",
]
[tool.uv]
override-dependencies = ["ray==2.48.0", "xgrammar==0.1.17"]
[project.optional-dependencies]
test = ['pytest', 'yapf', 'py-spy']
# -------------------------------
# tool.setuptools - Additional config
# -------------------------------
[tool.setuptools]
# True means `setuptools` will attempt to include all relevant files in package_data automatically.
# This corresponds to `include_package_data=True` in setup.py.
include-package-data = true
# We read the version from a file in 'verl/version/version'
[tool.setuptools.dynamic]
version = {file = "verl/version/version"}
# If you need to mimic `package_dir={'': '.'}`:
[tool.setuptools.package-dir]
"" = "."
# If you need to include specific non-Python data (like YAML files or version file):
# This is the rough equivalent of package_data={'': ['version/*'], 'verl': ['trainer/config/*.yaml']}
[tool.setuptools.package-data]
verl = [
"version/*",
"trainer/config/*.yaml"
]
```

- Resolve and lock the dependencies with uv:

```bash
uv lock
```
This step ensures all dependencies are properly resolved and compatible with the Anyscale environment.
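To verify that the locked environment resolves cleanly, you can optionally import the core packages. Note that the first `uv run` builds the environment and can take several minutes, and `vllm` may emit GPU-related warnings on a CPU-only head node:

```bash
# Sanity-check the resolved environment by importing the heavyweight dependencies.
uv run --isolated python -c "import ray, torch, vllm; print(ray.__version__, torch.__version__, vllm.__version__)"
```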
Prepare the dataset
verl requires datasets in a specific Parquet format. This example uses the GSM8K math word-problem dataset.
From the `verl` directory, run the following command in your workspace terminal:

```bash
uv run --isolated examples/data_preprocess/gsm8k.py --local_save_dir /mnt/cluster_storage/data/gsm8k
```
Use shared storage across nodes, such as `/mnt/cluster_storage`, so that all Ray workers can access the data.
This script converts the GSM8K dataset from Hugging Face into two Parquet files with the schema required for instruction-tuning or RL-style training:

- `train.parquet`: Training data.
- `test.parquet`: Validation data.
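To sanity-check the output, you can inspect the generated files with pandas (a sketch; the exact column names depend on the version of the preprocessing script):

```bash
# Peek at the schema and first row of the generated training file.
uv run --isolated python - <<'EOF'
import pandas as pd

df = pd.read_parquet("/mnt/cluster_storage/data/gsm8k/train.parquet")
print(df.columns.tolist())
print(df.iloc[0])
EOF
```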
If you launch the training workload as an Anyscale job, store the datasets in a shared location such as `/mnt/shared_storage` or `/mnt/user_storage`. Anyscale Jobs spawn a separate Ray cluster from your workspace, so `/mnt/cluster_storage` isn't shared between them. See Create and manage jobs.
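As a hypothetical sketch of that path (the exact CLI flags and defaults depend on your Anyscale setup; see Create and manage jobs for the supported options), a job submission from the `verl` directory could look like the following:

```bash
# Illustrative only: regenerate the dataset to shared storage first and point
# the training script's data paths at it, for example /mnt/user_storage/data/gsm8k.
anyscale job submit --name verl-grpo-gsm8k --working-dir . -- bash examples/grpo_trainer/run_qwen2.5-3b.sh
```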
Configure Ray runtime environment
verl requires specific Ray runtime environment settings to function correctly in Anyscale workspaces. To configure the environment, do the following:
- Create a `.env` file in the `verl` directory (the same location as `pyproject.toml`) with the following content:

```
RAY_JOB_CONFIG_JSON_ENV_VAR="{\"runtime_env\": {\"working_dir\": \"./\"}}"
```
- Modify the `.anyscaleignore` file to remove the following line:

```
**/.venv/
```
- Modify the `.gitignore` file to remove the following line:

```
.env
```
These modifications ensure that Ray can detect and use the `.env` file. By default, Anyscale excludes files matching patterns in `.gitignore` and `.anyscaleignore` from being synced to worker nodes. For more information, see Exclude files with `.gitignore` and Exclude files with `.anyscaleignore`. You can also apply these edits from the shell, as shown below.
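The following commands are equivalent to the three manual edits, assuming GNU sed and that the lines appear in the files exactly as shown above:

```bash
# Create the .env file with the Ray runtime environment setting.
cat > .env <<'EOF'
RAY_JOB_CONFIG_JSON_ENV_VAR="{\"runtime_env\": {\"working_dir\": \"./\"}}"
EOF

# Remove the **/.venv/ pattern from .anyscaleignore.
sed -i '\#^\*\*/\.venv/$#d' .anyscaleignore

# Remove the .env entry from .gitignore.
sed -i '/^\.env$/d' .gitignore
```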
Configure the training parameters
Create the training script
Create a new bash script at `examples/grpo_trainer/run_qwen2.5-3b.sh` with the following content:
```bash
#!/usr/bin/env bash
set -x

# The --env-file flag points uv at the .env file created earlier.
uv run --isolated --env-file .env python -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
trainer.val_before_train=False \
data.train_files=/mnt/cluster_storage/data/gsm8k/train.parquet \
data.val_files=/mnt/cluster_storage/data/gsm8k/test.parquet \
data.train_batch_size=16 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.shuffle=False \
actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
actor_rollout_ref.model.lora_rank=64 \
actor_rollout_ref.model.lora_alpha=32 \
actor_rollout_ref.actor.optim.lr=3e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=16 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.rollout.load_format=safetensors \
actor_rollout_ref.rollout.layered_summon=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console"]' \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2.5_3b_grpo_lora' \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.default_local_dir='/mnt/cluster_storage/verl_ckpts/${trainer.project_name}/${trainer.experiment_name}' \
trainer.save_freq=20 \
trainer.test_freq=5 \
    trainer.total_epochs=1 "$@"
```
Key configuration parameters
This configuration includes several important settings:
- Model configuration: Uses `Qwen2.5-3B-Instruct` with LoRA (rank 64, alpha 32) for parameter-efficient training.
- Data paths: Points to the previously prepared GSM8K dataset on cluster storage.
- GRPO algorithm: Configured with KL loss for regularization.
- vLLM rollout: Uses tensor parallelism (size 2) with 60% GPU memory utilization for efficient inference.
- Checkpoint directory: Saves checkpoints to `/mnt/cluster_storage/verl_ckpts` for persistence across nodes.
- Training duration: Set to 1 epoch for demonstration purposes.

Adjust `trainer.total_epochs`, `trainer.save_freq`, and `trainer.test_freq` based on your training requirements. For production training, increase the number of epochs and adjust the checkpoint frequency.

For this demo, checkpoints are stored in `/mnt/cluster_storage/`. For larger LLMs, save checkpoints to local storage during training and then upload them to artifact storage for long-term persistence and easier access, as sketched below.
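For example, on an AWS-backed cloud where `$ANYSCALE_ARTIFACT_STORAGE` points at an S3 URI and the AWS CLI is available in the image (both assumptions; use the matching tool for your cloud, such as `gcloud storage` on GCP), an upload after training might look like this:

```bash
# Copy the checkpoint directory to artifact storage for long-term persistence.
aws s3 cp --recursive \
  /mnt/cluster_storage/verl_ckpts/ \
  "$ANYSCALE_ARTIFACT_STORAGE/verl_ckpts/"
```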
Launch the training process
Start the training process by running the following command from the `verl` directory in your workspace terminal:

```bash
bash examples/grpo_trainer/run_qwen2.5-3b.sh
```
The training process does the following:
- Loads the prepared GSM8K dataset from cluster storage.
- Initializes the Qwen2.5-3B model with LoRA adapters.
- Sets up the vLLM inference engine for rollout generation.
- Runs the GRPO training loop with KL regularization.
- Saves checkpoints at the specified frequency.
Monitor your training progress through the Ray dashboard or console logs. Check GPU utilization and memory consumption to confirm the cluster resources are used efficiently.
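To confirm that checkpoints were written, list the configured output directory (verl typically writes one `global_step_*` directory per save, though the exact layout can vary by version):

```bash
# List saved checkpoints under the default_local_dir configured in the script.
ls /mnt/cluster_storage/verl_ckpts/verl_grpo_example_gsm8k/qwen2.5_3b_grpo_lora
```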
Resources
- GitHub repository: volcengine/verl
- verl documentation: verl.readthedocs.io
- Model used in this tutorial: Qwen2.5-3B-Instruct