Running LLM fine-tuning template as an Anyscale Job

For developer velocity, use workspaces to run Python scripts. For automation, or to launch jobs from your laptop without spinning up a workspace, run the fine-tuning workloads as isolated Anyscale Jobs. You may also want to launch long-running production jobs as Anyscale Jobs from a workspace, because the workspace itself may be set up as ephemeral.

To specify a job, put the command to run (for example, [COMMAND] [ARGS]) and its requirements (for example, the Docker image and any additional pip packages) in a job YAML file, then call anyscale job submit to launch the job on Anyscale.

Assume the following files in your local setup:

.
├── config
│ ├── llama-3-8b.yaml
│ └── zero_3_offload_optim+param.json
└── job_config.yaml

Here is an example job_config.yaml for submitting a job:

name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
max_retries: 0
note

Running an Anyscale Job from within a workspace ensures that files in the current working directory are available to the job (unless excluded with --exclude). Files can also be loaded from elsewhere (for example, a GitHub repo or S3), which makes it possible to launch a job from anywhere.

The available settings are listed in the Anyscale Jobs API docs. A few notes:

  • entrypoint is the command to run. Pay attention to the relative file location (config/llama-3-8b.yaml) and the working_dir. llama-3-8b.yaml itself references a relative path, config/zero_3_offload_optim+param.json. This works because working_dir is set to the current directory (.) when submitting the job from the client side. When submitting from a workspace, the ~/default directory is treated as the working_dir.
  • image_uri refers to the image that has LLMforge installed. The fine-tuning template automatically lists the latest released image. For the full list of versions and their URIs, visit llmforge versions. If you run this job from a workspace, the job inherits the image_uri from the workspace image.
  • max_retries: Setting this to zero ensures the job is not retried if it fails. Retry only when failures are expected to be transient (for example, due to resource constraints).
  • working_dir: Setting working_dir to the current directory is necessary when submitting the job from outside a workspace, for example, from your laptop or a CI/CD pipeline.
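
The notes above can be put together in one file. The following is a minimal sketch of a job_config.yaml for submitting from outside a workspace, using only the fields discussed above. The image URI is a placeholder, not a real value; replace it with a URI from the llmforge versions list.

```yaml
# Sketch of job_config.yaml for submission from a laptop or CI/CD pipeline.
name: "llmforge-job"

# Relative paths in the entrypoint (and inside llama-3-8b.yaml) are
# resolved against working_dir.
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"

# Placeholder: substitute an image that has LLMforge installed.
image_uri: "<LLMFORGE_IMAGE_URI>"

# Upload the current directory so config/ is available to the job.
working_dir: "."

# Do not retry on failure.
max_retries: 0
```

When submitting from a workspace, image_uri and working_dir can be left out, since the job inherits the workspace image and treats ~/default as the working directory.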

Submit the job with:

anyscale job submit --config-file ./job_config.yaml

As the job runs, go to the provided URL (console.anyscale.com/jobs/prod_job...) to monitor its logs and metrics.

tip

To provide WANDB_API_KEY you can use env_vars in the job specification YAML.
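
As a rough illustration of the tip above, env_vars can be added to the job YAML as a mapping of variable names to values. The key value shown here is a placeholder, and the rest of the file is assumed to match the earlier example.

```yaml
name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
max_retries: 0

# Environment variables made available to the job at runtime.
env_vars:
  WANDB_API_KEY: "<YOUR_WANDB_API_KEY>"  # placeholder, do not commit real keys
```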