Deprecated

LLMForge is being deprecated: The Ray Team is consolidating around open source fine-tuning solutions. Llama Factory and Axolotl provide enhanced functionality (quantization, advanced algorithms) and native Ray support for scaling. See the migration guide for transitioning your workflows.

Running LLM fine-tuning template as an Anyscale Job

For developer velocity, use workspaces to run Python scripts. For automation, or to launch jobs from your laptop without spinning up a workspace, run the fine-tuning workloads as isolated Anyscale Jobs. You may also want to launch long-running production jobs as Anyscale Jobs, because your workspace setup might be ephemeral.

To specify a job, put the command to run (for example, [COMMAND] [ARGS]) and its requirements (for example, the Docker image and additional pip packages) in a job YAML file, then call anyscale job submit to launch the job on Anyscale.

Assume the following files in your local setup:

.
├── config
│   ├── llama-3-8b.yaml
│   └── zero_3_offload_optim+param.json
└── job_config.yaml

Here is example content of job_config.yaml for submitting a job:

From a workspace:

name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
max_retries: 0

From a laptop:

name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
image_uri: <replace_with_llmforge_image_uri_value>
max_retries: 0
working_dir: "."

note

Running an Anyscale Job from within a workspace ensures that files in the current working directory are available to the job (unless excluded with --exclude). You can also load files from elsewhere (for example, a GitHub repo or S3) if you want to launch a job from anywhere.

You can find the available settings in the Anyscale Jobs API docs. A few notes:

entrypoint: The command to run. Pay attention to the relative file location (config/llama-3-8b.yaml) and the working_dir. llama-3-8b.yaml itself references a relative path, config/zero_3_offload_optim+param.json. This works because working_dir is set to the current directory (.) when you submit the job from the client side. When you submit from a workspace, the ~/default directory is treated as the working_dir.
image_uri: The image that has LLMForge installed. The fine-tuning template automatically lists the latest released image. For the full list of versions and their URIs, see LLMForge releases. If you run this job from a workspace, the job inherits the image_uri from the workspace image.
max_retries: Setting this to zero ensures that the job isn't retried if it fails. Retry only when the job is flaky (for example, due to resource constraints).
working_dir: Setting working_dir to the current directory is necessary when you submit this job from outside a workspace, for example, from your laptop or a CI/CD pipeline.
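
To make the relative-path point concrete, here is a minimal sketch of the part of config/llama-3-8b.yaml that points at the DeepSpeed file. The deepspeed.config_path key is an assumption about the LLMForge config schema and doesn't come from this page, so check it against your own config. All other fields are omitted.

```yaml
# Fragment of config/llama-3-8b.yaml (illustrative only; other fields omitted).
# ASSUMPTION: the DeepSpeed file is referenced through deepspeed.config_path.
deepspeed:
  # Resolved relative to working_dir, not relative to this YAML file.
  config_path: config/zero_3_offload_optim+param.json
```

With working_dir set to the directory that contains config/, both the entrypoint argument and this nested path resolve against the same root.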

Then submit the job:

anyscale job submit --config-file job_config.yaml

As the job runs, you can go to the provided URL (console.anyscale.com/jobs/prod_job...) to monitor the job's logs and metrics.

tip

To provide WANDB_API_KEY you can use env_vars in the job specification YAML.
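
As a sketch of what that might look like, the laptop variant of job_config.yaml could be extended as follows. This assumes env_vars takes a mapping of variable names to string values. The key value is a placeholder to replace with your own.

```yaml
# job_config.yaml with an environment variable added (illustrative sketch).
name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
image_uri: <replace_with_llmforge_image_uri_value>
max_retries: 0
working_dir: "."
env_vars:
  # Placeholder value; substitute your own Weights & Biases API key.
  WANDB_API_KEY: <replace_with_your_wandb_api_key>
```

Avoid committing a real key to version control. Keeping the value out of the checked-in file and filling it in at submit time is one way to handle this.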