Run LLaMA-Factory fine-tuning as an Anyscale job

Run LLaMA-Factory fine-tuning workloads as Anyscale jobs for greater stability and automatic retry behavior.

Anyscale jobs run independently of your interactive workspace session. This independence gives you greater stability, automatic retries, and centralized logging, which suits long-running or critical fine-tuning tasks. You can submit an Anyscale job either from your local machine, as described in Create and manage jobs, or directly from a workspace.

Step 1: Build a custom container image

To run LLaMA-Factory as an Anyscale job, first build a custom image in the console with LLaMA-Factory pre-installed. See Custom images on Anyscale for more details on using container images.

Specify the required packages in a Dockerfile. For example, if you're following the Supervised Fine-Tuning (SFT) at Scale with DeepSpeed template, create a Dockerfile using the anyscale/ray-llm:2.48.0-py311-cu128 base image:

# Dockerfile

# Start with a recommended Anyscale base image
FROM anyscale/ray-llm:2.48.0-py311-cu128

WORKDIR /app

# Add your pip dependencies
RUN pip install --no-cache-dir --upgrade \
    llamafactory@git+https://github.com/hiyouga/LLaMA-Factory.git@v0.9.3 \
    deepspeed==0.16.9 \
    wandb==0.21.3 \
    hf_transfer==0.1.9

Step 2: Prepare the LLaMA-Factory training configuration

Prepare the training YAML file.

Use shared storage for jobs

The LLaMA-Factory training configuration (for example, qwen2.5_deepspeed_lora_sft.yaml) and every file path it references must reside on storage that's shared across all worker nodes.

For example, in the YAML configuration below, the paths for deepspeed, dataset_dir, and ray_storage_path must all resolve to a shared location such as /mnt/shared_storage or /mnt/user_storage.

caution

Don't use /mnt/cluster_storage, even if you launch the job from an Anyscale workspace. A workspace runs on its own cluster, and a job typically runs on a separate execution cluster, so files in the workspace's cluster storage aren't available to the job. See Storage shared across nodes for more details.

# qwen2.5_deepspeed_lora_sft.yaml

### Deepspeed config
deepspeed: /mnt/user_storage/ds_z3_config.json

### Dataset config
# Local dataset on shared storage
dataset: my_glaive_toolcall_en_demo
dataset_dir: /mnt/shared_storage/<user>

### Ray config
ray_run_name: qwen2.5_deepspeed_lora_sft
ray_storage_path: /mnt/user_storage

ray_init_kwargs:
  runtime_env:
    env_vars:
        # Only add environment variables here if they aren't defined in the Anyscale job configuration.
        # Don't define the same variable in both places.

        # Example for Weights & Biases tracking
        # WANDB_API_KEY: <your_wandb_token>

        # Example for gated models like meta-llama/Llama-3.1-8B-Instruct
        # HF_TOKEN: <your_huggingface_token>

        # Enable faster downloads with hf_transfer
        HF_HUB_ENABLE_HF_TRANSFER: '1'
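
A quick pre-submit check can catch paths that aren't on shared storage before the job starts. The following is a minimal sketch, not part of LLaMA-Factory or the Anyscale CLI: SHARED_ROOTS, is_on_shared_storage, and find_unshared_paths are hypothetical names, and the sketch assumes you've already loaded the training YAML into a dictionary (for example with PyYAML's yaml.safe_load).

```python
from pathlib import PurePosixPath

# Assumption: these are the shared mounts named in this guide.
SHARED_ROOTS = ("/mnt/shared_storage", "/mnt/user_storage")


def is_on_shared_storage(path: str) -> bool:
    """Return True if the path is a shared root or lives beneath one."""
    candidate = PurePosixPath(path)
    for root in SHARED_ROOTS:
        root_path = PurePosixPath(root)
        if candidate == root_path or root_path in candidate.parents:
            return True
    return False


def find_unshared_paths(config: dict) -> dict:
    """Return top-level entries whose value is an absolute path outside the shared roots."""
    return {
        key: value
        for key, value in config.items()
        if isinstance(value, str)
        and value.startswith("/")
        and not is_on_shared_storage(value)
    }


if __name__ == "__main__":
    # Hypothetical config with one deliberately misplaced path.
    config = {
        "deepspeed": "/mnt/user_storage/ds_z3_config.json",
        "dataset": "my_glaive_toolcall_en_demo",
        "dataset_dir": "/mnt/shared_storage/<user>",
        "ray_storage_path": "/mnt/cluster_storage",
    }
    print(find_unshared_paths(config))
    # prints {'ray_storage_path': '/mnt/cluster_storage'}
```

The check only inspects top-level string values that start with a slash, which covers deepspeed, dataset_dir, and ray_storage_path in the configuration above. Nested keys would need a recursive walk.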

Step 3: Create the Anyscale job configuration

Create a job configuration file (for example, qwen-job.yaml) to define the job parameters. This is the only file that doesn't need to reside on shared storage. However, every path it references, including the training configuration passed to the entrypoint, must be on shared storage.

# qwen-job.yaml

name: qwen2.5_deepspeed_lora_sft_job
image_uri: <your_image_uri>:<version> # Your custom image from Step 1
cloud: <your-cloud-name>
ray_version: 2.48.0
max_retries: 1

env_vars:
    # Set this to enable Ray Train in LLaMA-Factory
    USE_RAY: '1'
    # Add environment variables here (e.g., API keys)
    # WANDB_API_KEY: <your_wandb_token>
    # HF_TOKEN: <your_huggingface_token>

entrypoint: llamafactory-cli train /mnt/user_storage/qwen2.5_deepspeed_lora_sft.yaml
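
The comments in the training configuration warn against defining the same environment variable in both files. A minimal sketch of a check for that overlap follows. duplicated_env_vars is a hypothetical helper, and both configurations are assumed to be already parsed into dictionaries.

```python
from __future__ import annotations


def duplicated_env_vars(job_config: dict, train_config: dict) -> set[str]:
    """Return the names of variables set in both the job and training configs."""
    # A YAML key with only comments beneath it parses to None, hence the `or {}`.
    job_vars = job_config.get("env_vars") or {}
    ray_kwargs = train_config.get("ray_init_kwargs") or {}
    runtime_env = ray_kwargs.get("runtime_env") or {}
    train_vars = runtime_env.get("env_vars") or {}
    return set(job_vars) & set(train_vars)


if __name__ == "__main__":
    # Hypothetical parsed configs where HF_TOKEN is defined twice by mistake.
    job_config = {
        "env_vars": {"USE_RAY": "1", "HF_TOKEN": "<your_huggingface_token>"},
    }
    train_config = {
        "ray_init_kwargs": {
            "runtime_env": {
                "env_vars": {
                    "HF_HUB_ENABLE_HF_TRANSFER": "1",
                    "HF_TOKEN": "<your_huggingface_token>",
                }
            }
        }
    }
    print(sorted(duplicated_env_vars(job_config, train_config)))
    # prints ['HF_TOKEN']
```

An empty result means each variable is defined in only one place.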

Step 4: Submit and monitor the job

You can submit the job from either an Anyscale workspace or your local machine's terminal.

note

When launching jobs from a workspace, use a clean environment. Jobs launched from a workspace inherit its pip packages and environment variables, so avoid extra installs that could cause version conflicts, and verify that only the intended environment variables (API keys, tracking tokens, and so on) are set before you submit.

Submit your job using the Anyscale CLI:

anyscale job submit --wait --config-file qwen-job.yaml
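
If you submit from a script rather than a terminal, you can wrap the same command with Python's subprocess module. This is a sketch under the assumption that the anyscale CLI is installed and authenticated on the machine running it. build_submit_command and submit_job are hypothetical helpers, and they use only the --wait and --config-file flags from the command above.

```python
from __future__ import annotations

import subprocess


def build_submit_command(config_file: str, wait: bool = True) -> list[str]:
    """Assemble the argument list for `anyscale job submit`."""
    command = ["anyscale", "job", "submit"]
    if wait:
        command.append("--wait")
    command += ["--config-file", config_file]
    return command


def submit_job(config_file: str, wait: bool = True) -> int:
    """Run the submit command and return the CLI's exit code."""
    completed = subprocess.run(build_submit_command(config_file, wait), check=False)
    return completed.returncode


# Example (left as a comment because running it would submit a real job):
#   exit_code = submit_job("qwen-job.yaml")
```

Passing the arguments as a list avoids shell quoting issues with the config file path. Treating a non-zero exit code as a failure is the usual CLI convention, and this sketch assumes the Anyscale CLI follows it.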

Monitor your job's logs, status, and hardware metrics on the Anyscale jobs page. When the job completes successfully, its status changes to Succeeded.

Terminate a job early

If you need to stop a running job before it completes, use the following commands:

# First, list all jobs to find the ID of the one you want to stop
anyscale job list

# Then, terminate the job using its ID
anyscale job terminate --id <prodjob_...>
