LLMForge APIs are in Beta.
Fine-tuning Open-weight LLMs with Anyscale
Fine-tuning LLMs is an easy and cost-effective way to tailor their capabilities toward niche applications with high accuracy. While Ray and Ray Train offer generic primitives for building such workloads, at Anyscale we have created a higher-level library called LLMForge that builds on top of Ray and other open-source libraries to provide an easy-to-use interface for fine-tuning and training LLMs.
What is LLMForge?
LLMForge is a library that implements a collection of design patterns that use Ray, Ray Train, and Ray Data in combination with other open-source libraries (for example, DeepSpeed, 🤗 Hugging Face Accelerate, Transformers, etc.) to provide an easy-to-use library for fine-tuning LLMs. In addition to these design patterns, it offers tight integrations with the Anyscale platform, such as the model registry, streamlined deployment, observability, and Anyscale's job submission.
Configurations
LLMForge workloads are specified using YAML configurations (documentation here). The library offers two main modes: `default` and `custom`.
Similar to OpenAI's fine-tuning experience, the `default` mode provides a minimal and efficient setup. It allows you to quickly start a fine-tuning job by setting just a few parameters (`model_id` and `train_path`). All other settings are optional, and Anyscale automatically selects them based on dataset statistics and predefined configurations.
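For example, a complete default-mode config can be as small as the following (this mirrors the default-mode example at the end of this page):

```yaml
model_id: meta-llama/Meta-Llama-3-8B-Instruct  # one of the models supported in default mode
train_path: s3://...                           # path to your training data
```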
The `custom` mode offers more flexibility and control over the fine-tuning process, allowing for advanced optimizations and customizations. You need to provide more configuration to set up this mode (for example, prompt format, hardware, batch size, etc.).
Here's a comparison of the two modes:
| Feature | Default Mode | Custom Mode |
|---|---|---|
| Ideal For | Prototyping what's possible; focusing on the dataset cleaning, fine-tuning, and evaluation pipeline | Optimizing model quality by controlling more parameters; hardware control |
| Command | `llmforge anyscale finetune config.yaml --default` | `llmforge anyscale finetune config.yaml` |
| Model Support | Popular models with their prompt format (for example, `meta-llama/Meta-Llama-3-8B-Instruct`)* | Any Hugging Face model, any prompt format (for example, `meta-llama/Meta-Llama-Guard-2-8B`) |
| Task Support | Instruction tuning for multi-turn chat | Causal language modeling, instruction tuning, classification |
| Data Format | Chat-style datasets with a fixed prompt format per model | Chat-style datasets with a flexible prompt format |
| Hardware | Automatically selected (limited by availability) | User-configurable |
| Fine-tuning type | Only supports LoRA (rank 8, all linear layers) | User-defined LoRA and full-parameter |
*NOTE: Older models may be deprecated over time.
Choose the mode that best fits your project's requirements and the level of customization you need.
Note for default mode:
- Cluster type for all models: 8xA100-80G
- Supported context lengths: 512 up to each model's maximum context length, in powers of 2
Models Supported in Default Mode
Default mode supports a select list of models, with a fixed cluster type of 8xA100-80G. For each model, only context lengths from 512 up to the model's maximum context length are supported, in increments of 2x (that is, 512, 1024, and so on). Here are the supported models and their configurations:
| Model family | model_id | Max. context length |
|---|---|---|
| Llama-3.1 | `meta-llama/Meta-Llama-3.1-8B-Instruct` | 4096 |
| Llama-3.1 | `meta-llama/Meta-Llama-3.1-70B-Instruct` | 4096 |
| Llama-3 | `meta-llama/Meta-Llama-3-8B-Instruct` | 4096 |
| Llama-3 | `meta-llama/Meta-Llama-3-70B-Instruct` | 4096 |
| Mistral | `mistralai/Mistral-Nemo-Instruct-2407` | 4096 |
| Mistral | `mistralai/Mistral-7B-Instruct-v0.3` | 4096 |
| Mixtral | `mistralai/Mixtral-8x7B-Instruct-v0.1` | 4096 |
Summary of Features in Custom Mode
✅ Supports both full-parameter and LoRA fine-tuning
- LoRA with different configurations, ranks, layers, etc. (anything supported by Hugging Face Transformers)
- Full-parameter fine-tuning with multi-node training support
✅ State-of-the-art performance features (see the config sketch after this list):
- Gradient checkpointing
- Mixed-precision training
- Flash Attention v2
- DeepSpeed support (ZeRO sharding)
- Liger Kernel integration
- `torch.compile` support
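As a rough sketch of how some of these features surface in a config: `liger_kernel` and `deepspeed.config_path` appear in the example configs later on this page, while the remaining key names below are illustrative assumptions that may differ across llmforge versions:

```yaml
# Sketch only. liger_kernel and deepspeed appear in the examples below;
# the other keys are assumed names, not confirmed llmforge options.
deepspeed:
  config_path: deepspeed_configs/zero_3_offload_optim+param.json  # ZeRO sharding config
liger_kernel: True
flash_attention_2: True        # assumed key name
gradient_checkpointing: True   # assumed key name
torch_compile: True            # assumed key name
```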
✅ Unified chat data format with flexible prompt format support, enabling fine-tuning for:

Use case: Multi-turn chat, instruction tuning, classification

Data format (JSON):
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Howdy!"},
    {"role": "user", "content": "What is the type of this model?"},
    {"role": "assistant", "content": "[[1]]"}
  ]
}
```
Prompt format for `llama-3-instruct`:
system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
system_in_user: False
Use case: Causal language modeling (aka continued pre-training), custom prompt formats (for example, Llama Guard)
Example continued pre-training data (one JSON object per line):

```json
{"messages": [{"role": "user", "content": "Once upon a time ..."}]}
{"messages": [{"role": "user", "content": "..."}]}
```
Prompt format for doing nothing except concatenation:

```yaml
system: "{instruction}"
user: "{instruction}"
assistant: "{instruction}"
system_in_user: False
```
✅ Flexible task support:
- Causal language modeling: each token is predicted based on all past tokens.
- Instruction tuning: only assistant tokens are predicted based on past tokens.
- Classification: only special tokens in the assistant message are predicted based on past tokens.
- Preference tuning: uses the contrast between chosen and rejected messages to improve the model (a data sketch follows below).
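For illustration, a preference-tuning row typically pairs a chosen response with a rejected one. The sketch below assumes a chosen/rejected schema; check the llmforge docs for the exact format:

```json
{
  "chosen": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"}
  ],
  "rejected": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "5"}
  ]
}
```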
✅ Support for multi-stage continuous fine-tuning
- Fine-tune on one dataset, then continue fine-tuning on another dataset, for iterative improvements.
- Do continued pre-training on one dataset, then chat-style fine-tuning on another dataset.
- Do continued pre-training on one dataset, followed by iterations of supervised fine-tuning and preference tuning on independent datasets.
✅ Support for context length extension
- Extend the context length of the model using methods like RoPE scaling (a hypothetical sketch follows below).
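As a purely hypothetical sketch: RoPE scaling in Hugging Face models is configured with a scaling type and factor, but the key names below, and whether llmforge exposes them at this level, are assumptions:

```yaml
# Hypothetical sketch: key names are assumptions, not confirmed llmforge options.
context_length: 8192  # target (extended) context length
rope_scaling:
  type: linear  # Hugging Face RoPE scaling strategy
  factor: 2.0   # e.g., 4096 -> 8192
```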
✅ Configurability of hyperparameters
- Full control over learning hyperparameters such as learning rate, number of epochs, batch size, etc. (an example follows below).
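For instance (a sketch: `num_epochs` and `learning_rate` also appear in the default-mode example below, while the batch-size key name is an assumption):

```yaml
num_epochs: 3
learning_rate: 1e-4
train_batch_size_per_device: 4  # assumed key name; may differ by llmforge version
```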
✅ Anyscale and third-party integrations
- (Coming soon) Model registry:
  - SDK for accessing fine-tuned models for creating automated pipelines
  - More streamlined deployment flow when you fine-tune on Anyscale
- Monitoring and observability:
  - Take advantage of standard logging frameworks such as Weights & Biases and MLflow
  - Use the Ray dashboard and Anyscale loggers for debugging and monitoring the training process
- Anyscale jobs integration: use Anyscale's job submission API to programmatically submit long-running LLMForge jobs (a sketch follows below)
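As a sketch, a long-running fine-tune could be wrapped in an Anyscale job config and submitted with the Anyscale CLI (for example, `anyscale job submit`; check your CLI version for the exact invocation). The fields below follow the Anyscale jobs schema, and the entrypoint path is taken from the examples that follow:

```yaml
# job.yaml: a minimal sketch of an Anyscale job running an LLMForge fine-tune.
name: llmforge-finetune
entrypoint: llmforge anyscale finetune training_configs/default/meta-llama/Meta-Llama-3-8B-Instruct-simple.yaml --default
```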
Example Configs
Here are some examples for default mode and custom mode. To run these examples, open up the fine-tuning template as a workspace on Anyscale and run the commands in the terminal. The example configs can be found under `./training_configs`. Outside of a workspace, you can also find them here.
Fine-tune llama-3-8b-instruct in default mode (LoRA rank 8), providing just the dataset.
Command:
```bash
llmforge anyscale finetune training_configs/default/meta-llama/Meta-Llama-3-8B-Instruct-simple.yaml --default
```
Config:
```yaml
model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://...
```
Fine-tune llama-3-8b-instruct in default mode while also controlling parameters like `learning_rate` and `num_epochs`.
Command:

```bash
llmforge anyscale finetune config.yaml --default
```

Config:

```yaml
model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://...
valid_path: s3://...
num_epochs: 3        # illustrative value
learning_rate: 1e-4  # illustrative value
```
Fine-tune llama-3-8b-instruct in custom mode (a model that's also supported in default mode) on 16xA10G GPUs (default mode uses 8xA100-80G) with a context length of 512.
Command:
```bash
llmforge anyscale finetune training_configs/custom/meta-llama--Meta-Llama-3-8B-Instruct/lora/16xA10-512.yaml
```
Config:

Note: the `liger_kernel` flag requires `llmforge` >=0.5.6.

```yaml
model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://...
valid_path: s3://...
context_length: 512
num_devices: 16  # 16 x A10G, per the description above
deepspeed:
  config_path: deepspeed_configs/zero_3_offload_optim+param.json
liger_kernel: True
worker_resources:
  accelerator_type:A10G: 0.001
```
Fine-tune gemma-2-27b in custom mode (a model that's not supported in default mode) on 8xA100-80G.
Command:
```bash
llmforge anyscale finetune training_configs/custom/google--gemma-2-27b-it/lora/8xA100-80G-512.yaml
```
Config:

Note: the `liger_kernel` flag requires `llmforge` >=0.5.6.

```yaml
model_id: google/gemma-2-27b-it
train_path: s3://...
valid_path: s3://...
num_devices: 8
worker_resources:
  accelerator_type:A100-80G: 0.001
liger_kernel: True
generation_config:
  prompt_format:
    system: "{instruction} + "
    assistant: "<start_of_turn>model\n{instruction}<end_of_turn>\n"
    trailing_assistant: "<start_of_turn>model\n"
    user: "<start_of_turn>user\n{system}{instruction}<end_of_turn>\n"
    system_in_user: True
    bos: "<bos>"
    default_system_message: ""
  stopping_sequences: ["<end_of_turn>"]
```