Train LLMs with reinforcement learning using verl

This tutorial shows you how to set up and run reinforcement learning training for LLMs using verl on Anyscale.

verl is a flexible and efficient reinforcement learning framework for training large language models, developed by Volcengine. It provides a modular architecture for implementing RL algorithms such as PPO and GRPO, with optimized support for distributed training and inference.

Anyscale supports a number of post-training libraries for LLM fine-tuning. See Choose a framework for LLM post-training.

Configure your workspace

To create and configure your workspace, see Workspaces. The following steps are essential for setting up the correct environment to run verl training workloads.

Set up the Docker image

Use the pre-built verl Docker image for your workspace:

novaskyai/skyrl-train-ray-2.48.0-py3.12-cu12.8
note

This Docker image is also used for SkyRL training and comes pre-installed with all the dependencies required for verl, including Ray 2.48.0, Python 3.12, and CUDA 12.8. You do not need to install any additional system packages.

Configure compute

This workload uses a single node with 4xL4 GPUs. Configure your compute with the following settings:

  1. Select a CPU instance type similar to m5.2xlarge, with 8 vCPUs and 32 GiB of memory.
  2. Select the 4xL4 GPU type for the worker node, or an equivalent GPU configuration if 4xL4 isn't available. Set the autoscaling parameters to Min nodes: 0 and Max nodes: 1 so the node scales down when idle and only one worker node is ever used.
  3. (Optional) Enable Cross zone autoscaling for better resource availability.

Set up the verl repository

In your workspace terminal, clone the verl repository:

git clone https://github.com/volcengine/verl.git

Update project dependencies

Replace the original pyproject.toml file with a compatible version for Anyscale. In your workspace terminal, do the following:

  1. Navigate to the verl directory and create a new pyproject.toml file:

    cd verl
  2. Create the pyproject.toml file with the following content:

    # -------------------------------
    # build-system
    # -------------------------------
    [build-system]
    requires = [
    "setuptools>=61.0",
    "wheel"
    ]
    build-backend = "setuptools.build_meta"

    # -------------------------------
    # project (PEP 621 metadata)
    # -------------------------------
    [project]
    name = "verl"
    # We'll mark the version as "dynamic" because it's read from the file "verl/version/version"
    # (PEP 621 calls this "dynamic version").
    # The actual version is specified in the [tool.setuptools.dynamic] section below.
    dynamic = ["version", "authors", "urls"]

    description = "verl: Volcano Engine Reinforcement Learning for LLM"
    license = {file = "LICENSE"} # or "Apache-2.0", if you prefer an SPDX identifier
    readme = {file = "README.md", content-type = "text/markdown"}
    requires-python = ">=3.12"

    dependencies=[
    "flash-attn@https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiFALSE-cp312-cp312-linux_x86_64.whl",
    "accelerate",
    "codetiming",
    "datasets",
    "dill",
    "hydra-core",
    "numpy",
    "pandas",
    "peft",
    "pyarrow>=19.0.0",
    "pybind11",
    "pylatexenc",
    "ray==2.48.0",
    "torchdata",
    "tensordict<=0.6.2",
    "transformers>=4.51.3, <4.54.0",
    "wandb",
    "packaging>=20.0",
    "tensordict<=0.6.2",
    "vllm==0.9.2",
    "flashinfer-python@https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl",
    "torch==2.7.0",
    "torchvision",
    "debugpy>=1.8.0",
    ]

    [tool.uv]
    override-dependencies = ["ray==2.48.0", "xgrammar==0.1.17"]

    [project.optional-dependencies]
    test = ['pytest', 'yapf', 'py-spy']

    # -------------------------------
    # tool.setuptools - Additional config
    # -------------------------------
    [tool.setuptools]
    # True means `setuptools` will attempt to include all relevant files in package_data automatically.
    # This corresponds to `include_package_data=True` in setup.py.
    include-package-data = true

    # We read the version from a file in 'verl/version/version'
    [tool.setuptools.dynamic]
    version = {file = "verl/version/version"}

    # If you need to mimic `package_dir={'': '.'}`:
    [tool.setuptools.package-dir]
    "" = "."

    # If you need to include specific non-Python data (like YAML files or version file):
    # This is the rough equivalent of package_data={'': ['version/*'], 'verl': ['trainer/config/*.yaml']}
    [tool.setuptools.package-data]
    verl = [
    "version/*",
    "trainer/config/*.yaml"
    ]
  3. Resolve and lock the dependencies using uv:

    uv lock

This step ensures all dependencies are properly resolved and compatible with the Anyscale environment.
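
Optionally, you can sanity-check the resolved environment before moving on. The following is a minimal check, assuming the lock step completed without errors; it imports a few of the pinned packages inside the isolated project environment and prints their versions:

uv run --isolated python -c "import ray, torch, transformers; print(ray.__version__, torch.__version__, transformers.__version__)"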

Prepare the dataset

verl requires datasets in a specific Parquet format. This example uses the GSM8K math word-problem dataset.

From the verl directory, run the following command in your workspace terminal:

uv run --isolated examples/data_preprocess/gsm8k.py --local_save_dir /mnt/cluster_storage/data/gsm8k
caution

Use shared storage across nodes such as /mnt/cluster_storage to ensure all Ray workers can access the data.

This script converts the GSM8K dataset from Hugging Face into two Parquet files with the schema required for instruction-tuning or RL-style training:

  • train.parquet - Training data.
  • test.parquet - Validation data.
note

If you launch the training workload as an Anyscale job, use a shared location to store the datasets such as /mnt/shared_storage or /mnt/user_storage. Anyscale Jobs spawn a separate Ray cluster from your workspace, so /mnt/cluster_storage isn't shared between them. See Create and manage jobs.
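
If you want to verify the preprocessing output before training, the following is a quick, optional check. It assumes the script above wrote to /mnt/cluster_storage/data/gsm8k; it lists the generated files and prints the schema and row count of the training split:

ls /mnt/cluster_storage/data/gsm8k
uv run --isolated python -c "import pandas as pd; df = pd.read_parquet('/mnt/cluster_storage/data/gsm8k/train.parquet'); print(df.columns.tolist(), len(df))"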

Configure Ray runtime environment

verl requires specific Ray runtime environment settings to function correctly in Anyscale workspaces. To configure the environment, do the following:

  1. Create a .env file in the verl directory (same location as pyproject.toml) with the following content:

    RAY_JOB_CONFIG_JSON_ENV_VAR="{\"runtime_env\": {\"working_dir\": \"./\"}}"
  2. Modify the .anyscaleignore file to remove the following line:

    **/.venv/
  3. Modify the .gitignore file to remove the following line:

    .env

These modifications ensure that Ray can detect and use the .env file. By default, Anyscale excludes files matching patterns in .gitignore and .anyscaleignore from being synced to worker nodes. For more information, see Exclude files with .gitignore and Exclude files with .anyscaleignore.
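
To confirm that the .env file is being loaded, you can run a quick check from the verl directory. This is a minimal sketch that assumes the .env file above exists; it loads the file through uv and prints the parsed runtime environment JSON:

uv run --isolated --env-file .env python -c "import os, json; print(json.loads(os.environ['RAY_JOB_CONFIG_JSON_ENV_VAR']))"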

Configure the training parameters

Create the training script

Create a new bash script file at examples/grpo_trainer/run_qwen2.5-3b.sh with the following content:

set -x

# Run the trainer through uv so the .env file (Ray runtime environment config) is loaded.
uv run --isolated --env-file .env python -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
trainer.val_before_train=False \
data.train_files=/mnt/cluster_storage/data/gsm8k/train.parquet \
data.val_files=/mnt/cluster_storage/data/gsm8k/test.parquet \
data.train_batch_size=16 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.shuffle=False \
actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
actor_rollout_ref.model.lora_rank=64 \
actor_rollout_ref.model.lora_alpha=32 \
actor_rollout_ref.actor.optim.lr=3e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=16 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.rollout.load_format=safetensors \
actor_rollout_ref.rollout.layered_summon=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console"]' \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2.5_3b_grpo_lora' \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.default_local_dir='/mnt/cluster_storage/verl_ckpts/${trainer.project_name}/${trainer.experiment_name}' \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=1 "$@"

Key configuration parameters

This configuration includes several important settings:

  • Model configuration: Uses Qwen2.5-3B-Instruct with LoRA (rank 64, alpha 32) for parameter-efficient training.
  • Data paths: Points to the previously prepared GSM8K dataset on cluster storage.
  • GRPO algorithm: Configured with KL loss for regularization.
  • vLLM rollout: Uses tensor parallelism (size 2) with 60% GPU memory utilization for efficient inference.
  • Checkpoint directory: Saves checkpoints to /mnt/cluster_storage/verl_ckpts for persistence across nodes.
  • Training duration: Set to 1 epoch for demonstration purposes.
tip
  1. Adjust trainer.total_epochs, trainer.save_freq, and trainer.test_freq based on your training requirements. For production training, increase the number of epochs and adjust the checkpoint frequency.
  2. For this demo, checkpoints are stored in /mnt/cluster_storage/. For larger LLMs, save checkpoints to local storage during training and then upload them to artifact storage for long-term persistence and easier access; see the sketch after this tip.
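
For example, once training finishes, the following is a minimal sketch of copying checkpoints to artifact storage. It assumes an S3-backed Anyscale cloud where $ANYSCALE_ARTIFACT_STORAGE points at your artifact bucket and the AWS CLI is available; adjust the paths and tooling for your setup:

# Inspect the checkpoints written under trainer.default_local_dir.
ls -R /mnt/cluster_storage/verl_ckpts
# Copy them to artifact storage for long-term persistence (assumes an S3-backed cloud).
aws s3 sync /mnt/cluster_storage/verl_ckpts "$ANYSCALE_ARTIFACT_STORAGE/verl_ckpts"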

Launch the training process

Start the training process by running the following command from the verl directory in your workspace terminal:

bash examples/grpo_trainer/run_qwen2.5-3b.sh
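
Because the script forwards its command-line arguments ("$@") to the trainer, you can also append Hydra overrides at launch time instead of editing the script. The values below are only illustrative:

bash examples/grpo_trainer/run_qwen2.5-3b.sh trainer.total_epochs=3 trainer.save_freq=50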

The training process does the following:

  1. Loads the prepared GSM8K dataset from cluster storage.
  2. Initializes the Qwen2.5-3B model with LoRA adapters.
  3. Sets up the vLLM inference engine for rollout generation.
  4. Runs the GRPO training loop with KL regularization.
  5. Saves checkpoints at the specified frequency.
tip

Monitor your training progress through the Ray dashboard or console logs, and check GPU utilization and memory to confirm resources are being used efficiently.
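
For a quick command-line view, the following is one way to check resource usage from a terminal (run nvidia-smi on a node that has GPUs):

# Show cluster-wide resource usage as seen by Ray.
ray status
# Watch per-GPU utilization and memory on the current node.
watch -n 5 nvidia-smi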

Resources