Fine-tune LLMs on Anyscale
Post-training adapts a general-purpose foundation model to excel at your specific application, domain, or behavioral requirements. This guide provides a comprehensive overview of the core methodologies, from supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to newer, more efficient alignment techniques. By understanding these approaches, you can select the right strategy to build a powerful and specialized LLM on Anyscale.
LLM post-training capabilities on Anyscale
- Various open-source models: A curated collection of models ready for training, including Llama-3, Mistral, Qwen, Gemma, and multimodal models such as LLaVA-Next.
- Integrated training methods: Easily switch between supervised fine-tuning (SFT) and advanced alignment algorithms such as PPO, DPO, KTO, and ORPO with a single flag.
- Scalable compute: Support for both full fine-tuning and parameter-efficient methods (LoRA, QLoRA, freeze-tuning).
- Distributed GPU acceleration: Memory-efficient scaling with DeepSpeed.
- Monitoring and observability: Integrations with Weights & Biases, MLflow, and TensorBoard for tracking performance and debugging.
- Evaluation and serving: Evaluate checkpoints with Ray Data and deploy them for inference with Ray Serve.
Feature availability and exact configurations can vary by model family and framework. Consult the latest Ray and Anyscale docs for specifics.
Understand pre-training vs. post-training
Post-training is the process of adapting a pre-trained Large Language Model (LLM) to align it with your specific domain, tasks, and behavioral goals. It starts with a general-purpose base model and applies specialized training techniques to enhance its performance, safety, and reliability for a particular application.
Unlike pre-training, which builds general knowledge from massive, unlabeled text corpora, post-training efficiently specializes the model using smaller, targeted datasets. This allows you to shape the model's behavior without the immense cost of training from scratch.
Choose the right approach: fine-tuning vs. RAG vs. prompt engineering
Before committing to post-training, it's crucial to select the right tool for your problem.
Approach | When to use | Advantages | Trade-offs |
---|---|---|---|
Prompt engineering | Quick prototypes, simple tasks, single queries. | Zero training cost, instant iteration. | Brittle, prompt length grows, can't inject new knowledge. |
Retrieval-Augmented Generation (RAG) | Answering questions over a changing knowledge base. | Keeps model weights frozen, allows real-time data and updates. | Requires vector store (in-memory or external DB), relies on retrieval accuracy. |
Fine-tuning (post-training) | Adapting to a specific style, domain, or behavior. | Lowest inference latency, strongest control over model. | Requires training data and GPUs; weights are static until retrained. |
Core post-training methodologies
Post-training encompasses several key techniques. The two foundational stages are Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
Supervised fine-tuning (SFT)
SFT trains the model to map prompts to desired responses using labeled examples, minimizing the cross-entropy loss between the model's output and the target text. High-quality, diverse instruction data is the main driver of SFT quality. SFT is often used to establish baseline instruction-following before preference optimization.
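A minimal PyTorch sketch of the SFT objective, assuming a Hugging Face causal LM; the model name is illustrative, and masking prompt tokens with `-100` so that cross-entropy is computed only on the response is a common convention (and an approximation when the prompt and response are tokenized together), not an Anyscale-specific API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model name for illustration; swap in any causal LM you have access to.
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Summarize: Ray is a distributed computing framework."
response = " Ray lets you scale Python and ML workloads across a cluster."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Standard SFT labeling: ignore prompt tokens (-100) so the loss only
# covers the response the model should learn to produce.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=full_ids, labels=labels)
loss = outputs.loss  # token-level cross-entropy over the response tokens
loss.backward()
```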
Traditional RLHF: Reward modeling and PPO
RLHF refines a model's behavior (for example, helpfulness, safety, politeness) based on human preferences. Instead of learning from a single "correct" answer, the model learns from feedback on which of two or more responses is better.
The traditional RLHF process involves three steps:
- Collect preference data: For a given prompt, generate multiple responses and have human labelers indicate which they prefer, producing pairs of chosen and rejected responses.
- Train a reward model (RM): Train a separate model to predict a scalar "reward" score that reflects the human preference. Preferred responses should receive higher scores.
- Optimize the LLM with PPO: Apply a reinforcement learning algorithm such as Proximal Policy Optimization (PPO) with the trained RM to adjust the LLM's weights while regularizing with a KL penalty, guiding it to produce responses that achieve higher predicted reward scores.
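To make step 2 concrete, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry style) loss commonly used to train a reward model; `reward_model` is a placeholder for any model that maps a tokenized prompt-response pair to a scalar score, not a specific Anyscale API.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: push the chosen response's score above the rejected one's."""
    r_chosen = reward_model(chosen_ids)      # scalar reward per sequence, shape [batch]
    r_rejected = reward_model(rejected_ids)  # scalar reward per sequence, shape [batch]
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen responses score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

In step 3, PPO then maximizes this learned reward while a KL penalty against the original model keeps the policy from drifting too far from its starting behavior.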
Other RLHF algorithms
Researchers have developed several simpler and more stable alternatives to the classic RM and PPO-based RLHF pipeline, including:
- Direct Preference Optimization (DPO): Removes the explicit reward model and RL loop by directly optimizing a pairwise loss on chosen versus rejected responses, using a fixed reference model to implicitly regularize the policy.
- Simple Preference Optimization (SimPO): Uses a reference-free objective that treats the (average) log-probability of a sequence as an implicit reward and adds a target-margin term—eliminating the frozen reference model and reducing memory while retaining strong performance.
- Odds-Ratio Preference Optimization (ORPO): A single-stage, reference-free method that integrates preference learning into SFT by adding an odds-ratio penalty to the standard NLL loss—so you fine-tune once while contrasting favored versus disfavored responses.
- Kahneman-Tversky Optimization (KTO): Trains from unary "thumbs-up/thumbs-down" feedback using a prospect-theoretic (Kahneman-Tversky) utility, avoiding pairwise comparisons while matching or exceeding preference-based methods at various scales.
Anyscale supports DPO, SimPO, ORPO, and KTO—providing a flexible suite of preference optimization algorithms that make advanced alignment techniques more accessible to developers.
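As an example of how compact these objectives are, here is a minimal PyTorch sketch of the DPO loss described above. It assumes you have already computed per-sequence log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model; `beta` is the usual temperature hyperparameter, and the value shown is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities with shape [batch].
    """
    # Implicit rewards: how far the policy moves each response relative to the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```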
Compare full vs. freeze vs. parameter-efficient fine-tuning (PEFT)
Once you choose a methodology, you must decide how to apply the weight updates.
Full fine-tuning
Full fine-tuning updates all parameters of the model. It offers maximum control over the model's behavior but is computationally expensive and requires a significant amount of data.
- Use case: Critical applications such as safety alignment for a public model release.
- Requires: Significant GPU memory (typically distributed training with DeepSpeed ZeRO-3, as in the sketch below) and careful evaluation to prevent "catastrophic forgetting" of the model's original capabilities.
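As a rough sketch of the ZeRO-3 side of this, here is a DeepSpeed configuration expressed as a Python dict that you could pass to a Hugging Face `TrainingArguments(deepspeed=...)`; the values are illustrative examples, not Anyscale defaults.

```python
# Illustrative DeepSpeed ZeRO-3 configuration for full-parameter fine-tuning.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                    # shard optimizer state, gradients, and parameters across GPUs
        "overlap_comm": True,          # overlap communication with computation
        "stage3_gather_16bit_weights_on_model_save": True,  # reassemble full weights at checkpoint time
    },
    "gradient_accumulation_steps": "auto",   # let the Hugging Face Trainer fill these in
    "train_micro_batch_size_per_gpu": "auto",
}
```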
Freeze tuning
Freeze tuning (layer freezing) is a straightforward strategy that keeps most of the model's layers frozen and trains only a selected subset. A common pattern is to freeze the lower layers, which capture fundamental language patterns, and adapt the upper layers, which learn more abstract, task-specific representations. The intuition is that freezing helps preserve core capabilities, mitigates catastrophic forgetting, and reduces compute.
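A minimal PyTorch sketch of this pattern, assuming a Hugging Face Llama-style model where the transformer blocks live under `model.model.layers`; the attribute path and the number of trainable layers are illustrative and vary by architecture.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example model

# Freeze everything, then unfreeze only the top N transformer blocks and the LM head.
for param in model.parameters():
    param.requires_grad = False

num_trainable_layers = 4  # illustrative choice
for block in model.model.layers[-num_trainable_layers:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True
```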
Parameter-efficient fine-tuning (PEFT)
PEFT methods freeze the vast majority of the base model's weights and only train a small number of new parameters (often less than 1% of the total). This dramatically reduces memory requirements and training time.
Low-rank adaptation (LoRA)
LoRA is a widely adopted PEFT method. Instead of updating the original weights (W), it learns a low-rank decomposition of the weight update (ΔW): it injects small, trainable matrices (A and B) into the model layers and learns the update as ΔW ≈ A × B.
Why it matters: Only the small A and B matrices receive gradient updates, so gradient and optimizer-state memory shrink dramatically, slashing VRAM usage and compute costs. Because you can insert adapters across layers (including lower ones responsible for basic linguistic features), LoRA can handle large domain shifts or tasks requiring new representations without touching most base weights.
LoRA adapter deployment: You can merge the learned ΔW back into the base weights for zero inference overhead, or load multiple LoRA adapters simultaneously to serve different tasks with a single base model. See Deploy multi-LoRA adapters on LLMs.
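A minimal sketch using the Hugging Face peft library; the model name is illustrative, and the target modules shown (attention projections) are a common choice for Llama-style models, but the right set depends on the architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example model

lora_config = LoraConfig(
    r=16,                     # rank of the A and B matrices
    lora_alpha=32,            # scaling factor applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common choice for Llama-style attention
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```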
Quantized LoRA (QLoRA)
QLoRA extends LoRA by quantizing the frozen base model weights, typically to 4-bit precision. This further reduces the memory footprint, making it possible to fine-tune very large models on fewer GPUs.
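A sketch of loading a 4-bit quantized base model with bitsandbytes before attaching LoRA adapters; NF4 with double quantization follows the setup described in the QLoRA paper, and the model name and hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants as well
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 even though weights are stored in 4-bit
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # example model
    quantization_config=bnb_config,
)
base_model = prepare_model_for_kbit_training(base_model)

# Attach LoRA adapters on top of the frozen, quantized base weights.
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)
```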
Tips for LLM post-training
- Start with prompt engineering. Add RAG when you need up-to-date knowledge and traceable sources. Move to fine-tuning when you want more consistent behavior and lower latency.
- For most teams, begin with supervised fine-tuning (SFT), then add a light preference-optimization pass (such as DPO, ORPO, KTO, or SimPO). Reserve RLHF with PPO for cases that truly need strong behavioral shaping.
- Use parameter-efficient fine-tuning (PEFT, such as LoRA or QLoRA) for cost efficiency. Use full-model fine-tuning only when there's a large distribution shift and you've collected substantial training data.
- Invest in data quality—diverse instructions, hard/negative examples, and safety red-teaming. Good data yields better results than extra epochs.
- Always evaluate task performance alongside safety and robustness, and monitor for regressions and model drift.
Summary
In this guide, you learned about the key concepts and methodologies for LLM post-training, including supervised fine-tuning, RLHF techniques, and parameter-efficient methods. You also learned how to choose between different approaches based on your specific requirements and constraints.