Running LLM fine-tuning template as an Anyscale Job
For developer velocity, use workspaces to run Python scripts. For automation, or to launch jobs from your laptop without having to spin up a workspace, run the fine-tuning workloads as isolated Anyscale Jobs. You may also want to launch long-running production jobs as Anyscale Jobs from a workspace, because your workspace setup might be ephemeral.
To specify a job, put the command that needs to run, for example, [COMMAND][ARGS], along with its requirements, for example, a Docker image or additional pip packages, in a job YAML, and then call anyscale job submit to launch the job on Anyscale.
Assume the following files in your local setup:
.
├── config
│ ├── llama-3-8b.yaml
│ └── zero_3_offload_optim+param.json
└── job_config.yaml
Here is example content of job_config.yaml for submitting a job, depending on where you submit it from:

From Workspace:

name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
max_retries: 0

From laptop:

name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
image_uri: <replace_with_llmforge_image_uri_value>
max_retries: 0
working_dir: "."
Executing an Anyscale Job within a Workspace ensures that files in the current working directory are available to the Job (unless excluded with --exclude). We can also load files from elsewhere (for example, a GitHub repo or S3) if we want to launch a Job from anywhere.
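As a rough sketch of keeping local files out of the upload, the job YAML can carry exclusion patterns instead of passing --exclude on the command line. This assumes the job config accepts an excludes list of glob patterns. Check the Anyscale jobs API docs before relying on it. The patterns shown are hypothetical.

```yaml
# Hypothetical job_config.yaml for a submission from a laptop.
name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
image_uri: <replace_with_llmforge_image_uri_value>
max_retries: 0
working_dir: "."
# Assumed field: glob patterns for local files to leave out of the upload.
excludes:
  - "data/raw/*"   # hypothetical large local dataset
  - "*.ckpt"       # hypothetical checkpoint files
```

Everything under config/ is still uploaded, so the relative paths in the entrypoint and in llama-3-8b.yaml continue to resolve.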
The available settings can be found in the Anyscale jobs API docs. A few notes:
- entrypoint is the command we want to run. Pay attention to the relative file location (config/llama-3-8b.yaml) and the working_dir. Inside llama-3-8b.yaml we are also referencing a relative path to config/zero_3_offload_optim+param.json. This works because we specify the working_dir to be the current directory when submitting the job from the client side. If submitting from the workspace, the ~/default directory is treated as working_dir.
- image_uri refers to the image that has LLMforge installed. The fine-tuning template automatically lists the latest released image. For the full list of versions and their URIs, visit llmforge versions. If you run this job from a workspace, the job inherits the image_uri from the workspace image.
- max_retries: setting this to zero makes sure we do not keep retrying if the job fails. We should retry only when the job is flaky (for example, due to resource constraints).
- working_dir: setting working_dir to the current directory is necessary when you're submitting this job from outside of a workspace, for example, from your laptop or a CI/CD pipeline.
Submit the job, pointing --config-file at the job YAML from the layout above:
anyscale job submit --config-file job_config.yaml
As the job runs, we can go to the provided URL (console.anyscale.com/jobs/prod_job...) to monitor the logs and metrics related to the job.
To provide WANDB_API_KEY, you can use env_vars in the job specification YAML.
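A minimal sketch of what that might look like, assuming env_vars is a mapping from variable names to string values. The placeholder value is hypothetical and should be replaced with your own key.

```yaml
name: "llmforge-job"
entrypoint: "llmforge anyscale finetune config/llama-3-8b.yaml"
max_retries: 0
# Environment variables made available to the job at runtime.
env_vars:
  WANDB_API_KEY: <replace_with_your_wandb_api_key>
```

Because the key sits in plain text in the YAML, you may prefer to keep this file out of version control or to generate it at submission time.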