Train LLMs with reinforcement learning using verl
This tutorial shows you how to set up and run reinforcement learning training for LLMs using verl on Anyscale.
verl is a flexible and efficient reinforcement learning framework for training large language models, developed by Volcengine. It provides a modular architecture for implementing RL algorithms such as PPO and GRPO, with optimized support for distributed training and inference.
Anyscale supports a number of post-training libraries for LLM fine-tuning. See Choose a framework for LLM post-training.
Configure your workspace
To create and configure your workspace, see Workspaces. The following steps are essential for setting up the correct environment to run verl training workloads.
Set up the Docker image
Use the following pre-built Docker image for your workspace:
novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8
This Docker image is also used for SkyRL training and comes pre-installed with all the dependencies required for verl, including Ray 2.48.0, Python 3.12, and CUDA 12.8. You do not need to install any additional system packages.
Configure compute
This workload uses a single node with 4xL4 GPUs. Configure your compute with the following settings:
- Select a CPU instance similar to m5.2xlarge with 8 vCPU and 32 GiB memory.
- Select the GPU type 4xL4 for the worker node. If 4xL4 isn't available, choose an equivalent GPU configuration. Set the autoscaling parameters to Min nodes: 0 and Max nodes: 1 to allow the node to scale down when idle and ensure only one node is used.
- (Optional) Enable Cross zone autoscaling for better resource availability.
Set up the verl repository
In your workspace terminal, clone the verl repository:
git clone https://github.com/volcengine/verl.git
Update project dependencies
Replace the original pyproject.toml file with a compatible version for Anyscale. In your workspace terminal, do the following:
- Navigate to the verl directory:
cd verl
- Create the pyproject.toml file with the following content:
# -------------------------------
# build-system
# -------------------------------
[build-system]
requires = [
"setuptools>=61.0",
"wheel"
]
build-backend = "setuptools.build_meta"
# -------------------------------
# project (PEP 621 metadata)
# -------------------------------
[project]
name = "verl"
# We'll mark the version as "dynamic" because it's read from the file "verl/version/version"
# (PEP 621 calls this "dynamic version").
# The actual version is specified in the [tool.setuptools.dynamic] section below.
dynamic = ["version", "authors", "urls"]
description = "verl: Volcano Engine Reinforcement Learning for LLM"
license = {file = "LICENSE"} # or "Apache-2.0", if you prefer an SPDX identifier
readme = {file = "README.md", content-type = "text/markdown"}
requires-python = ">=3.12"
dependencies=[
"flash-attn@https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiFALSE-cp312-cp312-linux_x86_64.whl",
"accelerate",
"codetiming",
"datasets",
"dill",
"hydra-core",
"numpy",
"pandas",
"peft",
"pyarrow>=19.0.0",
"pybind11",
"pylatexenc",
"ray==2.48.0",
"torchdata",
"tensordict<=0.6.2",
"transformers>=4.51.3, <4.54.0",
"wandb",
"packaging>=20.0",
"tensordict<=0.6.2",
"vllm==0.9.2",
"flashinfer-python@https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl",
"torch==2.7.0",
"torchvision",
"debugpy>=1.8.0",
]
[tool.uv]
override-dependencies = ["ray==2.48.0", "xgrammar==0.1.17"]
[project.optional-dependencies]
test = ['pytest', 'yapf', 'py-spy']
# -------------------------------
# tool.setuptools - Additional config
# -------------------------------
[tool.setuptools]
# True means `setuptools` will attempt to include all relevant files in package_data automatically.
# This corresponds to `include_package_data=True` in setup.py.
include-package-data = true
# We read the version from a file in 'verl/version/version'
[tool.setuptools.dynamic]
version = {file = "verl/version/version"}
# If you need to mimic `package_dir={'': '.'}`:
[tool.setuptools.package-dir]
"" = "."
# If you need to include specific non-Python data (like YAML files or version file):
# This is the rough equivalent of package_data={'': ['version/*'], 'verl': ['trainer/config/*.yaml']}
[tool.setuptools.package-data]
verl = [
"version/*",
"trainer/config/*.yaml"
]
- Resolve and lock the project dependencies with uv:
uv lock
This step ensures all dependencies are properly resolved and compatible with the Anyscale environment.
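As an optional sanity check, you can confirm that the locked environment resolves and that the core libraries import before moving on. This is a verification sketch, not part of the verl setup; it should report the versions pinned in pyproject.toml (Ray 2.48.0, torch 2.7.0, vLLM 0.9.2).
uv run --isolated python -c "import ray, torch, vllm; print(ray.__version__, torch.__version__, vllm.__version__)"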
Prepare the dataset
verl requires datasets in a specific Parquet format. This example uses the GSM8K math word-problem dataset.
From the verl directory, run the following command in your workspace terminal:
uv run --isolated examples/data_preprocess/gsm8k.py --local_save_dir /mnt/cluster_storage/data/gsm8k
Use shared storage across nodes such as /mnt/cluster_storage to ensure all Ray workers can access the data.
This script converts the GSM8K dataset from Hugging Face into two Parquet files with the schema required for instruction-tuning or RL-style training:
- train.parquet: Training data.
- test.parquet: Validation data.
If you launch the training workload as an Anyscale job, use a shared location to store the datasets such as /mnt/shared_storage or /mnt/user_storage. Anyscale Jobs spawn a separate Ray cluster from your workspace, so /mnt/cluster_storage isn't shared between them. See Create and manage jobs.
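To confirm the conversion, you can inspect the generated files with pandas, which is already a project dependency. This optional check only prints the column names and the first row; the exact schema is whatever the gsm8k.py preprocessing script produces.
uv run --isolated python -c "import pandas as pd; df = pd.read_parquet('/mnt/cluster_storage/data/gsm8k/train.parquet'); print(df.columns.tolist()); print(df.iloc[0])"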
Configure Ray runtime environment
verl requires specific Ray runtime environment settings to function correctly in Anyscale workspaces. To configure the environment, do the following:
- Create a .env file in the verl directory (same location as pyproject.toml) with the following content:
RAY_JOB_CONFIG_JSON_ENV_VAR="{\"runtime_env\": {\"working_dir\": \"./\"}}"
- Modify the .anyscaleignore file to remove the following line:
**/.venv/
- Modify the .gitignore file to remove the following line:
.env
These modifications ensure that Ray can detect and use the .env file. By default, Anyscale excludes files matching patterns in .gitignore and .anyscaleignore from being synced to worker nodes. For more information, see Exclude files with .gitignore and Exclude files with .anyscaleignore.
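Because Ray treats the value of RAY_JOB_CONFIG_JSON_ENV_VAR as JSON, a quick optional check is to load the .env file and parse the variable. If the file is read as expected, this prints the runtime_env dictionary with working_dir set to ./ :
uv run --isolated --env-file .env python -c "import json, os; print(json.loads(os.environ['RAY_JOB_CONFIG_JSON_ENV_VAR']))"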
Configure the training parameters
Create the training script
Create a new bash script file at examples/grpo_trainer/run_qwen2.5-3b.sh with the following content:
set -x
# Pass the .env file so the Ray runtime environment settings are applied.
uv run --isolated --env-file .env python -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
trainer.val_before_train=False \
data.train_files=/mnt/cluster_storage/data/gsm8k/train.parquet \
data.val_files=/mnt/cluster_storage/data/gsm8k/test.parquet \
data.train_batch_size=16 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.shuffle=False \
actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
actor_rollout_ref.model.lora_rank=64 \
actor_rollout_ref.model.lora_alpha=32 \
actor_rollout_ref.actor.optim.lr=3e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=16 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.rollout.load_format=safetensors \
actor_rollout_ref.rollout.layered_summon=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console"]' \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2.5_3b_grpo_lora' \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.default_local_dir='/mnt/cluster_storage/verl_ckpts/${trainer.project_name}/${trainer.experiment_name}' \
trainer.save_freq=20 \
trainer.test_freq=5 \
    trainer.total_epochs=1 "$@"
Key configuration parameters
This configuration includes several important settings:
- Model configuration: Uses Qwen2.5-3B-Instruct with LoRA (rank 64, alpha 32) for parameter-efficient training.
- Data paths: Points to the previously prepared GSM8K dataset on cluster storage.
- GRPO algorithm: Configured with KL loss for regularization.
- vLLM rollout: Uses tensor parallelism (size 2) with 60% GPU memory utilization for efficient inference.
- Checkpoint directory: Saves checkpoints to /mnt/cluster_storage/verl_ckpts for persistence across nodes.
- Training duration: Set to 1 epoch for demonstration purposes.
- Adjust trainer.total_epochs, trainer.save_freq, and trainer.test_freq based on your training requirements. For production training, increase the number of epochs and adjust the checkpoint frequency.
- For this demo, checkpoints are stored in /mnt/cluster_storage/. For larger LLMs, save checkpoints to local storage during training and then upload them to artifact storage for long-term persistence and easier access, as shown in the sketch after this list.
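One way to copy checkpoints to artifact storage after a run is a simple sync from the checkpoint directory. The following is a minimal sketch that assumes AWS-backed storage with the AWS CLI available and the ANYSCALE_ARTIFACT_STORAGE environment variable pointing at your artifact storage bucket; adjust the paths and tooling for your setup.
# Upload saved checkpoints to artifact storage (assumes AWS S3 and the AWS CLI).
aws s3 sync /mnt/cluster_storage/verl_ckpts "$ANYSCALE_ARTIFACT_STORAGE/verl_ckpts"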
Launch the training process
Start the training process by running the following command from the verl directory in your workspace terminal:
bash examples/grpo_trainer/run_qwen2.5-3b.sh
The training process does the following:
- Loads the prepared GSM8K dataset from cluster storage.
- Initializes the Qwen2.5-3B model with LoRA adapters.
- Sets up the vLLM inference engine for rollout generation.
- Runs the GRPO training loop with KL regularization.
- Saves checkpoints at the specified frequency.
Monitor your training progress through the Ray dashboard or console logs. Check GPU utilization and memory to confirm the node's resources are being used efficiently.
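From the workspace terminal, you can also check the cluster and GPUs directly. The following commands are standard Ray and NVIDIA tools, not verl-specific:
# Show Ray cluster resources and autoscaler status.
ray status
# Refresh GPU utilization and memory every 5 seconds (press Ctrl+C to stop).
watch -n 5 nvidia-smi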
Resources
- GitHub repository: volcengine/verl
- verl documentation: verl.readthedocs.io
- Model used in this tutorial: Qwen2.5-3B-Instruct