Note: LLMForge APIs are in Beta.

Fine-tuning Open-weight LLMs with Anyscale

Fine-tuning LLMs is an easy and cost-effective way to tailor their capabilities to niche applications with high accuracy. While Ray and Ray Train offer generic primitives for building such workloads, at Anyscale we have created a higher-level library called LLMForge that builds on top of Ray and other open-source libraries to provide an easy-to-use interface for fine-tuning and training LLMs.

What is LLMForge?

LLMForge is a library that implements a collection of design patterns that use Ray, Ray Train, and Ray Data in combination with other open-source libraries (for example DeepSpeed, 🤗 Hugging Face Accelerate, Transformers, and so on) to provide an easy-to-use library for fine-tuning LLMs. In addition to these design patterns, it offers tight integrations with the Anyscale platform, such as the model registry, streamlined deployment, observability, and Anyscale's job submission.

Configurations

LLMForge workloads are specified using YAML configurations (documentation here). The library offers two main modes: default and custom.

Similar to OpenAI's fine-tuning experience, the default mode provides a minimal and efficient setup. It allows you to quickly start a fine-tuning job by setting just a few parameters (model_id and train_path). All other settings are optional and Anyscale automatically selects them based on dataset statistics and predefined configurations.

Here's a comparison of the two modes:

| Feature | Default Mode | Custom Mode |
| --- | --- | --- |
| Ideal for | Prototyping what's possible; focusing on the dataset cleaning, fine-tuning, and evaluation pipeline | Optimizing model quality by controlling more parameters; hardware control |
| Command | llmforge anyscale finetune config.yaml --default | llmforge anyscale finetune config.yaml |
| Model support | Popular models with their prompt format (for example meta-llama/Meta-Llama-3-8B-Instruct)* | Any Hugging Face model, any prompt format (for example meta-llama/Meta-Llama-Guard-2-8B) |
| Task support | Instruction tuning for multi-turn chat | Causal language modeling, instruction tuning, classification |
| Data format | Chat-style datasets, with a fixed prompt format per model | Chat-style datasets, with flexible prompt formats |
| Hardware | Automatically selected (limited by availability) | User-configurable |
| Fine-tuning type | LoRA only (rank 8, all linear layers) | User-defined LoRA and full-parameter |

*Note: older models may be deprecated over time.

Choose the mode that best fits your project requirements and level of customization needed.

Note:

  • Cluster type for all models: 8xA100-80G
  • Supported context lengths per model: from 512 up to the model's maximum context length, in powers of 2.
Models Supported in Default Mode

Default mode supports a select list of models, with a fixed cluster type of 8xA100-80G. For each model, we only support context lengths from 512 up to the model's maximum context length, doubling at each step (that is, 512, 1024, ...). Here are the supported models and their configurations:
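As a quick illustration of the doubling rule described above, the valid context lengths for a model can be enumerated like this (a sketch; the function name is illustrative, not part of LLMForge):

```python
def supported_context_lengths(max_context_length: int, start: int = 512) -> list[int]:
    """Enumerate context lengths starting at `start`, doubling each step,
    up to and including the model's maximum context length."""
    lengths = []
    length = start
    while length <= max_context_length:
        lengths.append(length)
        length *= 2
    return lengths

# For the models below, all of which cap out at 4096:
print(supported_context_lengths(4096))  # [512, 1024, 2048, 4096]
```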

| Model family | model_id | Max. context length |
| --- | --- | --- |
| Llama-3.1 | meta-llama/Meta-Llama-3.1-8B-Instruct | 4096 |
| Llama-3.1 | meta-llama/Meta-Llama-3.1-70B-Instruct | 4096 |
| Llama-3 | meta-llama/Meta-Llama-3-8B-Instruct | 4096 |
| Llama-3 | meta-llama/Meta-Llama-3-70B-Instruct | 4096 |
| Mistral | mistralai/Mistral-Nemo-Instruct-2407 | 4096 |
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 | 4096 |
| Mixtral | mistralai/Mixtral-8x7B-Instruct-v0.1 | 4096 |

Summary of Features in Custom Mode

✅ Support for both full-parameter and LoRA fine-tuning

  • LoRA with different configurations, ranks, layers, etc. (Anything supported by Hugging Face transformers)
  • Full-parameter with multi-node training support
  • Gradient checkpointing
  • Mixed precision training
  • Flash attention v2
  • DeepSpeed support (ZeRO data-parallel sharding)

✅ Unified chat data format with flexible prompt format support, enabling fine-tuning for the following use cases:

Use-case: Multi-turn chat, instruction tuning, classification

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Howdy!"},
    {"role": "user", "content": "What is the type of this model?"},
    {"role": "assistant", "content": "[[1]]"}
  ]
}
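Datasets in this chat format are typically stored as JSON Lines, with one conversation object per line. A minimal sketch for producing and sanity-checking such a file (the file name and validation logic are illustrative, not an LLMForge API):

```python
import json

conversation = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Howdy!"},
    ]
}

# Write one JSON object per line (JSON Lines format).
with open("train.jsonl", "w") as f:
    f.write(json.dumps(conversation) + "\n")

# Sanity-check: every message needs a known role and string content.
with open("train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for msg in record["messages"]:
            assert msg["role"] in {"system", "user", "assistant"}
            assert isinstance(msg["content"], str)
```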
Use-case: Causal language modeling (aka continued pre-training), custom prompt formats (for example Llama-guard)

Example: continued pre-training, one JSON object per line:

{
  "messages": [
    {"role": "user", "content": "Once upon a time ..."}
  ]
}
{
  "messages": [
    {"role": "user", "content": "..."}
  ]
}

A prompt format that does nothing except concatenate message contents:

system: "{instruction}"
user: "{instruction}"
assistant: "{instruction}"
system_in_user: False
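To make concrete what this pass-through format does, here is a hypothetical sketch of applying such a template: each message's content is substituted into its role's template, and the rendered pieces are concatenated. (The function is illustrative, not LLMForge's implementation.)

```python
def apply_prompt_format(messages, templates):
    """Render each message with its role's template and concatenate.
    With identity templates ("{instruction}"), this is plain concatenation."""
    return "".join(
        templates[m["role"]].format(instruction=m["content"]) for m in messages
    )

# Identity templates, matching the format above.
identity = {"system": "{instruction}", "user": "{instruction}", "assistant": "{instruction}"}

messages = [
    {"role": "user", "content": "Once upon a time "},
    {"role": "user", "content": "there was a model."},
]
print(apply_prompt_format(messages, identity))
# Once upon a time there was a model.
```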

✅ Flexible task support:

  • Causal language modeling: each token is predicted based on all past tokens.
  • Instruction tuning: only assistant tokens are predicted based on past tokens.
  • Classification: only special tokens in the assistant message are predicted based on past tokens.
  • Preference tuning: use the contrast between chosen and rejected messages to improve the model.
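The difference between these tasks comes down to which token positions contribute to the loss. A simplified sketch of instruction-tuning-style label masking (the -100 ignore index follows the common PyTorch/Hugging Face convention; the token values are toy placeholders, not a real tokenization):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_labels(token_ids, is_assistant):
    """For instruction tuning, only assistant tokens are predicted:
    labels for all other positions are set to the ignore index."""
    return [tok if assistant else IGNORE_INDEX
            for tok, assistant in zip(token_ids, is_assistant)]

token_ids    = [11, 12, 13, 14, 15]
is_assistant = [False, False, True, True, False]
print(mask_labels(token_ids, is_assistant))  # [-100, -100, 13, 14, -100]
```

Causal language modeling corresponds to masking nothing, and classification to keeping only the special label tokens in the assistant message.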

✅ Support for multi-stage continuous fine-tuning

  • Fine-tune on one dataset, then continue fine-tuning on another dataset, for iterative improvements.
  • Do continued pre-training on one dataset, then chat-style fine-tuning on another dataset.
  • Do continued pre-training on one dataset followed by iterations of supervised-finetuning and preference tuning on independent datasets.

✅ Support for context length extension

  • Extend the context length of the model using methods like RoPE scaling.
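As a rough illustration of the idea behind linear RoPE scaling (position interpolation) — not LLMForge's actual implementation — positions beyond the original training window are compressed back into it by dividing by a scaling factor:

```python
def rope_scaled_positions(seq_len, original_max_len):
    """Linear RoPE scaling sketch: divide positions by
    seq_len / original_max_len so they stay within the trained range."""
    factor = max(1.0, seq_len / original_max_len)
    return [pos / factor for pos in range(seq_len)]

# Extending a model trained on 4 positions to 8 squeezes positions into [0, 4).
print(rope_scaled_positions(8, 4))  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
```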

✅ Configurability of hyper-parameters

  • Full control over learning hyperparameters such as learning rate, number of epochs, batch size, etc.

✅ Anyscale and third-party integrations

  • (Coming soon) Model registry:
    • SDK for accessing fine-tuned models for creating automated pipelines
    • More streamlined deployment flow when you fine-tune on Anyscale
  • Monitoring and observability:
    • Take advantage of standard logging frameworks such as Weights and Biases
    • Use of Ray dashboard and Anyscale loggers for debugging and monitoring the training process
  • Anyscale jobs integration: Use Anyscale's job submission API to programmatically submit long-running jobs through LLMForge

Example Configs

Here are some examples for default mode and custom mode. To run these examples, open up the fine-tuning template as a workspace on Anyscale and run the commands in the terminal. The example configs can be found under ./training_configs. Outside of a workspace, you can also find them here.

Fine-tune llama-3-8b-instruct in default mode (LoRA rank 8), providing only the dataset.

Command:

llmforge anyscale finetune training_configs/default/meta-llama/Meta-Llama-3-8B-Instruct-simple.yaml --default

Config:

model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://...
Fine-tune llama-3-8b-instruct in custom mode to control additional parameters, such as context_length, DeepSpeed settings, and worker resources.

Command:

llmforge anyscale finetune training_configs/custom/meta-llama--Meta-Llama-3-8B-Instruct/lora/16xA10-512.yaml

Config:

model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://...
valid_path: s3://...
context_length: 512
deepspeed:
  config_path: deepspeed_configs/zero_3_offload_optim+param.json
worker_resources:
  accelerator_type:A10G: 0.001