Save and resume checkpoints and finalize models
This guide covers the lifecycle of fine-tuning or post-training artifacts when using LLaMA-Factory on Anyscale. You learn how to locate training artifacts and checkpoints, resume a paused or failed run, and merge a LoRA adapter with a pretrained model into a single, low-latency inference model.
Understand the training artifacts directory
Given this configuration:
# qwen_lora_sft.yaml
### output
save_strategy: steps # Or 'epoch' to save at the end of every epoch.
save_steps: 100
### Train
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
num_train_epochs: 3
### Ray
ray_run_name: qwen2_7b_sft_lora # Save directory name.
ray_storage_path: /mnt/cluster_storage/ # Shared storage between head and worker nodes.
ray_num_workers: 4
Training writes checkpoints, configs, and logs to the location defined by the ray_storage_path and ray_run_name that you specify in the YAML config.
Locate checkpoint paths
Checkpoints are snapshots of training state that let you resume runs without starting over. These checkpoints usually include optimizer state, schedulers, and weight tensors.
The weight tensors depend on the fine-tuning method:
- LoRA or QLoRA → adapter weights only (adapter_model.safetensors), about 1% of the model weights.
- Full or freeze tuning → full model weights (model.safetensors).
  - Full fine-tuning: Saves all parameters in a uniform, high-precision format (FP32).
  - Freeze tuning: Saves a mix of data types, with updated layers in FP32 and frozen layers in their original precision (for example, BF16).
For both methods, consider applying post-training quantization (PTQ). It unifies precision, reduces model size, and accelerates inference. See Post-training quantization (PTQ).
The output directory contains a checkpoint folder at {ray_storage_path}/{ray_run_name}/TorchTrainer_{x}/checkpoint_{x}/checkpoint with the following structure:
checkpoint/
├── adapter_model.safetensors # LoRA adapter weights for LoRA / QLoRA
├── model.safetensors # Full model weights for Full fine-tuning or Freeze tuning
├── optimizer.pt # Optimizer state
├── scheduler.pt # Learning rate scheduler state
├── trainer_state.json # Trainer state for resuming training, including the current step, epoch, and metric history
├── tokenizer.json # Tokenizer data
└── ... and other files
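If you need the most recent checkpoint path programmatically, for example to pass it to resume_from_checkpoint later, you can glob the run directory. The following is a minimal sketch that assumes the ray_storage_path and ray_run_name values from the example config:
import glob
import os

# Assumed values from the example config above.
storage_path = "/mnt/cluster_storage"
run_name = "qwen2_7b_sft_lora"

# Each run writes TorchTrainer_{x}/checkpoint_{x}/checkpoint directories;
# pick the most recently modified one.
pattern = os.path.join(storage_path, run_name, "TorchTrainer_*", "checkpoint_*", "checkpoint")
checkpoints = sorted(glob.glob(pattern), key=os.path.getmtime)
if checkpoints:
    print("Latest checkpoint:", checkpoints[-1])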
Choose checkpoint frequency
You can control checkpoint frequency using the save_strategy parameter, which accepts two options: 'epoch' or 'steps'.
Save by epoch
When you set save_strategy: 'epoch', the system saves a checkpoint after each complete pass through the training data. This is straightforward but can be infrequent for very large datasets. The system ignores the save_steps parameter with this strategy.
Save by step
Setting save_strategy: 'steps' provides more granular control: the system saves a checkpoint every save_steps optimizer updates. Note that a "step" isn't a single batch, but one optimizer update.
Each update processes a batch of examples. The total number of examples in one step is the effective batch size, calculated as:
effective_batch_size = per_device_train_batch_size × gradient_accumulation_steps × world_size
In this case, world_size is the total number of GPUs you're using (for example, ray_num_workers × gpus_per_worker).
Example calculation
Using the settings from the configuration:
- per_device_train_batch_size: 2
- gradient_accumulation_steps: 4
- world_size: 4
The effective batch size is 2 × 4 × 4 = 32 examples per step.
With save_steps: 100, the system saves a checkpoint every 100 optimizer steps. This saves progress after approximately every 3,200 examples.
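As a quick sanity check, here is the same arithmetic in Python, using the values from the example config:
# Values from the example config above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
world_size = 4  # ray_num_workers × GPUs per worker
save_steps = 100

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * world_size
examples_per_checkpoint = effective_batch_size * save_steps

print(effective_batch_size)     # 32 examples per optimizer step
print(examples_per_checkpoint)  # 3200 examples between checkpoints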
Resume training from checkpoints
Update the YAML config to point to the saved checkpoint path:
# qwen_lora_sft_resume.yaml
resume_from_checkpoint: /mnt/cluster_storage/qwen2_7b_sft_lora/TorchTrainer_x/checkpoint_x/checkpoint # Path to the previously trained checkpoints directory; leave empty to start fresh.
Resuming is primarily for recovering from interrupted runs. If you resume from a checkpoint of a completed run to train for more epochs, ensure that you configure the number of epochs and learning rate scheduler correctly.
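Then relaunch training with the same command you used for the original run. For example, assuming the Ray integration is enabled through the USE_RAY environment variable and the resume config shown above:
USE_RAY=1 llamafactory-cli train qwen_lora_sft_resume.yaml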
Increase disk space for checkpoints
If you need more storage capacity for saved checkpoints including model weights, adapters, and logs, increase the default disk size of your instance. For Google Cloud, see Change the default disk size. For AWS, see Change the default disk size.
Finalize the post-trained or fine-tuned model
Merge LoRA adapters
After fine-tuning or post-training with LoRA, merging the adapter into the base model produces a standalone model that serves inference more efficiently than loading the base model and adapter separately.
Merge the adapters when you want a single-purpose deployment with the highest throughput. Keep the adapters and model separate if you have multi-tenant endpoints, A/B tests, or any scenario where you need to swap LoRA specializations on the fly. See Deploy multi-LoRA adapters on LLMs.
Use the LLaMA-Factory CLI to merge a trained LoRA adapter into the base model weights.
Create a YAML file (for example, merge_config.yaml) to define the paths for the merge.
# merge_config.yaml
### model
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
adapter_name_or_path: /mnt/cluster_storage/qwen2_7b_sft_lora/TorchTrainer_x/checkpoint_x/checkpoint # Path to the previously trained checkpoints directory.
template: qwen
finetuning_type: lora
### export
export_dir: /mnt/cluster_storage/models/qwen_lora_sft # Where to save the merged model.
export_size: 2 # GiB per shard.
export_device: cpu
export_legacy_format: false
- model_name_or_path must exist and match the chosen template.
- When merging a LoRA adapter, don't use a quantized base model. Merge with the original, unquantized model.
- Ensure your export_device has enough memory to fit two copies of the pretrained model.
Run this command:
llamafactory-cli export merge_config.yaml
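After the export completes, you can smoke-test the merged model with a short generation. This is a minimal sketch that assumes the export_dir from merge_config.yaml and that the transformers library is installed:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed export_dir from merge_config.yaml above.
MERGED_DIR = "/mnt/cluster_storage/models/qwen_lora_sft"

tokenizer = AutoTokenizer.from_pretrained(MERGED_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MERGED_DIR, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("Briefly introduce yourself.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))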
Post-training quantization (PTQ)
Apply an additional post-training quantization step to your merged model. This makes the final model significantly smaller and can lead to faster inference speeds, which is ideal for production deployment.
FP8 quantization is a recommended option for reducing model size while keeping quality loss minimal.
The specific type of FP8 quantization you can use depends on your GPU architecture:
- Full FP8 (W8A8) is supported on NVIDIA Ada-generation and newer GPUs, such as the L4.
- W8A16 (8-bit weights, 16-bit activations) is used for Ampere-generation GPUs, such as the A10G.
Use this script to quantize your model from 16-bit to 8-bit using llmcompressor in an Anyscale workspace:
import ray

@ray.remote(num_gpus=1, accelerator_type="L4", memory=16*1024**3)
def quantize():
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from llmcompressor.transformers import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    MODEL_ID = "Qwen/Qwen2.5-14B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Configure simple PTQ quantization.
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )

    # Apply the quantization algorithm.
    oneshot(model=model, recipe=recipe)

    # Save the model: {MODEL_ID}-FP8-Dynamic under shared storage.
    SAVE_DIR = "/mnt/cluster_storage/" + MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)

ray.get(quantize.remote())
When you enable Auto-select worker nodes, the @ray.remote() decorator provisions and runs the quantize function on a dedicated worker node that matches your specified hardware requirements.
- GPU requirement: The script requests a single GPU per worker (num_gpus=1). Distributed quantization across multiple GPUs isn't supported.
- Memory requirement: The entire model must fit into a combination of GPU memory and host (CPU) memory. When GPU VRAM is insufficient, the framework offloads tensors to CPU RAM. Request a node with sufficient total memory (memory=16*1024**3 ensures at least 16 GB of CPU RAM). As a rule of thumb, allocate at least 2× the model's parameter size in GB. For example, a 14B parameter model (about 14 GB) requires at least 28 GB of memory.
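To confirm the size reduction, you can compare the on-disk footprint of the original and quantized checkpoints. A minimal sketch, assuming the SAVE_DIR produced by the script above:
import os

# Assumed output directory from the quantization script above.
SAVE_DIR = "/mnt/cluster_storage/Qwen2.5-14B-Instruct-FP8-Dynamic"

total_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, names in os.walk(SAVE_DIR)
    for name in names
)
print(f"Quantized model size: {total_bytes / 1024**3:.1f} GiB")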
For a detailed, step-by-step guide on how to create an FP8-quantized model, follow the official documentation: