Post-training for LLMs on Anyscale
LLM development typically has two phases: pre-training and post-training. Pre-training uses self-supervised next-token prediction on large, diverse corpora to learn general-purpose capabilities (a "base" model). Post-training adapts that model to excel at your specific application, domain, or behavioral requirements. This guide provides an overview of the core LLM post-training methodologies, from supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to Reinforcement Learning from Verifiable Rewards (RLVR) and agentic tuning. Understanding these approaches helps you select the right strategy to build a powerful, specialized LLM on Anyscale.
LLM post-training capabilities on Anyscale
- Various open-source models: A curated collection of models ready for training, including gpt-oss, Llama, Mistral, Qwen, Gemma, and multimodal models such as LLaVA-Next.
- Integrated training methods: Easily switch between supervised fine-tuning (SFT) and RLHF with algorithms such as PPO, DPO, KTO, and ORPO, or RLVR with GRPO and DAPO.
- Scalable compute: Support for both full fine-tuning and parameter-efficient methods (LoRA, QLoRA, freeze-tuning); see the LoRA sketch after this list.
- Distributed GPU acceleration: Memory-efficient scaling compatible with FSDP, DeepSpeed, and Megatron.
- Monitoring and observability: Integrations with Weights & Biases, MLflow, and TensorBoard for tracking performance and debugging.
- Evaluation and serving: Evaluate checkpoints with Ray Data and deploy them for inference with Ray Serve.
Feature availability and exact configurations can vary by model family and framework. Consult the latest Ray and Anyscale docs for specifics.
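As a concrete example of the parameter-efficient option listed above, the following is a minimal LoRA sketch using Hugging Face Transformers and PEFT rather than any Anyscale-specific API. The checkpoint name, adapter rank, and target modules are illustrative assumptions; adapt them to your model family and framework.

```python
# Minimal LoRA sketch with Hugging Face Transformers + PEFT (not Anyscale-specific).
# The checkpoint name, rank, and target modules below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B-Instruct"  # assumption: any causal-LM checkpoint works here

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# Wrap the frozen base model with small trainable low-rank adapters.
lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Because only the adapter weights train, the same pattern scales from a single GPU to distributed setups with FSDP or DeepSpeed.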
Choose your framework
Anyscale supports multiple frameworks for LLM post-training, including LLaMA-Factory, SkyRL, and Ray Train. Each framework has different strengths for various use cases, from RLHF to RLVR and agentic tuning. To compare frameworks and choose the best fit for your needs, see Choose a framework for LLM post-training.
For hands-on tutorials, see the following:
- Fine-tuning with LLaMA-Factory
- Train LLMs with reinforcement learning using SkyRL
- Train LLMs with reinforcement learning using verl
Understand pre-training vs. post-training
Post-training is the process of adapting a pre-trained LLM to your specific domain, tasks, and behavioral goals. It starts with a general-purpose base model and applies specialized training techniques to improve its performance, safety, and reliability for a particular application.
Pre-training builds general knowledge by training a model from scratch on massive, unlabeled text corpora, often containing trillions of tokens. This process is computationally intensive and costly. In contrast, post-training efficiently specializes the model using smaller, targeted datasets, allowing you to shape the model's behavior without the immense cost of training from scratch.
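As a rough illustration of that difference in scale, a post-training (SFT) dataset is often just a few thousand curated prompt and response pairs. The sketch below uses the common Alpaca-style instruction/input/output layout, which is one of several formats post-training frameworks accept; treat the field names, examples, and file name as illustrative assumptions.

```python
# Illustrative sketch of a small, targeted SFT dataset (Alpaca-style fields).
# Post-training uses thousands of curated examples, not trillions of raw tokens.
import json

sft_examples = [
    {
        "instruction": "Summarize the customer ticket in one sentence.",
        "input": "The app crashes whenever I upload a photo larger than 10 MB.",
        "output": "The customer reports crashes when uploading photos over 10 MB.",
    },
    {
        "instruction": "Classify the sentiment of the review as positive or negative.",
        "input": "Setup took five minutes and support answered right away.",
        "output": "positive",
    },
    # ... typically on the order of 1k-100k examples
]

with open("sft_dataset.json", "w") as f:
    json.dump(sft_examples, f, indent=2)
```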
Choose the right approach: fine-tuning vs. RAG vs. prompt engineering
Before committing to post-training, confirm that it's the right approach for your problem. The following table compares the main options; a toy sketch of the RAG pattern follows the table.
| Approach | When to use | Advantages | Trade-offs |
|---|---|---|---|
| Prompt engineering | Quick prototypes, simple tasks, one-off queries. | Zero training cost, instant iteration. | Brittle; prompts grow long as requirements accumulate; no persistent change to the model. |
| Retrieval-Augmented Generation (RAG) | Answering questions over a changing knowledge base. | Keeps model weights frozen, allows real-time data and updates. | Requires vector store (in-memory or external DB), relies on retrieval accuracy. |
| Fine-tuning (post-training) | Adapting to a specific style, domain, or behavior. | Lowest inference latency, strongest control over model. | Requires training data and GPUs; weights are static until retrained. |
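To make the RAG row concrete, the following toy sketch shows the pattern with a deliberately naive in-memory retriever based on keyword overlap. A production system would use an embedding model and a vector store, and its answer quality would hinge on retrieval accuracy, exactly as the trade-offs column notes. All documents and names here are hypothetical.

```python
# Toy RAG sketch: model weights stay frozen; knowledge is injected at query time.
# The keyword-overlap retriever is a stand-in for a real embedding model + vector store.

KNOWLEDGE_BASE = [
    "Enterprise customers can request refunds within 60 days of purchase.",
    "Self-serve plans are billed monthly and renew automatically.",
    "Support tickets marked 'urgent' receive a response within 4 hours.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    words = set(question.lower().split())
    return sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )[:k]

question = "What is the refund policy for enterprise customers?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nUsing only the context, answer: {question}"
print(prompt)  # a frozen base model would answer from this augmented prompt
```

If the answer depends on style, format, or behavior rather than fresh facts, fine-tuning is usually the better fit; the two approaches also combine well, with a fine-tuned model serving RAG-augmented prompts.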