
Preference tuning with DPO

Alignment of LLMs has traditionally been broken down into two post-training stages: supervised fine-tuning (SFT) followed by preference tuning (also known as RLHF). Preference tuning is a powerful tool that can optimize LLMs towards complex preferences that cannot be easily captured through supervised fine-tuning alone.

This guide shows how to do direct preference optimization (DPO) fine-tuning of open-source language models on Anyscale.

note

LLMForge supports DPO only with llmforge versions >= 0.5.3. See LLMForge Versions for the full list of available versions and images.

Example config

You can see all the config options in the reference docs. For preference tuning in llmforge, specify the preference_tuning_config in the config YAML:

model_id: meta-llama/Meta-Llama-3-8B-Instruct # Any HF model ID.
task: preference_tuning # Optional: LLMForge can infer the task when you provide `preference_tuning_config`.
preference_tuning_config:
  # Beta parameter in DPO, controlling the amount of regularization.
  beta: 0.01
  logprob_processor_scaling_config:
    custom_resources:
      accelerator_type:A10G: 0.001 # Custom resource per worker.
    # Runs the reference model logp calculation on 4 GPUs.
    concurrency: 4
    # Batch size per worker.
    batch_size: 2
...
  • beta : A parameter controlling the amount of regularization, typically set between 0 and 1. A higher beta keeps the trained model's output distribution closer to the reference model's output distribution, while the reference model is effectively ignored as beta tends to zero. See the DPO paper for more details, and the loss sketch after this list.
  • logprob_processor_scaling_config: For the DPO loss, LLMForge calculates log probabilities for the chosen and rejected samples under the reference model. Internally, this is implemented as a ray.data.Dataset.map_batches operation, and logprob_processor_scaling_config is the scaling config for the reference model. The example above specifies a batch size of 2 per reference model instance, a custom resource label indicating that each reference model instance should run on an A10G GPU, and a total of 4 concurrent instances.
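To make the role of beta concrete, here is a minimal PyTorch sketch of the standard DPO loss from the paper. It is illustrative only, not LLMForge's internal implementation; the function name and tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x), shape (B,)
    beta: float = 0.01,
) -> torch.Tensor:
    # Log-ratios of the trained policy against the frozen reference policy.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Higher beta penalizes drifting from the reference model more strongly;
    # as beta tends to zero, the reference model has no effect on the loss.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()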

Our implementation of DPO can overlap the model forward pass with the reference model log probability calculation, while also decoupling the compute configuration of the two. For example, you can run the reference model on smaller, cheaper A10G GPUs (since the reference model is only used for inference) while running training on A100s.
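As a rough illustration of how such a stage maps onto Ray Data (a sketch assuming a recent Ray version, not LLMForge's actual internals), the scaling config fields above correspond naturally to map_batches arguments. The class name, input path, and placeholder log probabilities below are hypothetical.

import numpy as np
import ray

# Illustrative stub: a real implementation would load the frozen reference model
# once per actor and score the chosen/rejected sequences under it.
class RefLogProbCalculator:
    def __call__(self, batch: dict) -> dict:
        num_rows = len(batch["chosen"])
        batch["ref_chosen_logps"] = np.zeros(num_rows)    # placeholder values
        batch["ref_rejected_logps"] = np.zeros(num_rows)  # placeholder values
        return batch

ds = ray.data.read_json("preference_data.jsonl")  # hypothetical input path
ds = ds.map_batches(
    RefLogProbCalculator,
    batch_size=2,   # batch size per reference model instance
    concurrency=4,  # 4 concurrent reference model instances
    num_gpus=1,     # one GPU per instance
    resources={"accelerator_type:A10G": 0.001},  # schedule instances onto A10G nodes
)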


For more on preference tuning, and for complete example configs, see our end-to-end example for DPO with synthetic data on Anyscale. Note that our current implementation requires each reference model instance to fit on one GPU. We're working on supporting a tensor parallel implementation for memory-efficient inference.

tip

With DPO, set num_data_blocks_per_device higher than the default to avoid bottlenecking the reference model log probability calculation. Make sure to go over the DPO end-to-end example and the full llmforge config reference for details.

Example dataset

Each example contains a chosen and a rejected conversation that share the same user prompt:

{
  "chosen": [
    {
      "content": "Given the following text, create a very short summary that is at most 2 sentences.\n\nText:\n...",
      "role": "user"
    },
    {
      "content": "Donny Osmond advises Justin Bieber to be prepared for the hardships that come with fame and offers insight on how to maintain a strong work ethic and genuine connection with fans. He and wife Marie Osmond also discuss their wildly successful career and their recent remarriage. The interview will air Monday on \"Piers Morgan Tonight\" at 9 p.m. ET.",
      "role": "assistant"
    }
  ],
  "rejected": [
    {
      "content": "Given the following text, create a very short summary that is at most 2 sentences.\n\nText:\n...",
      "role": "user"
    },
    {
      "content": "On the upcoming episode of \"Piers Morgan Tonight,\" Donny and Marie Osmond share their experiences and advice for young entertainers like Justin Bieber. They discuss the importance of hard work, humility, and genuine love for one's audience, while also reflecting on their own personal journeys and relationships. The Osmonds' insights are sure to inspire viewers and provide valuable lessons for anyone in the entertainment industry.",
      "role": "assistant"
    }
  ]
}
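Assuming the training file follows the JSON Lines convention used by other llmforge fine-tuning tasks (one such object per line), a small script like the following can produce it; the row contents and file name are illustrative.

import json

# Illustrative only: write preference-tuning rows in the chosen/rejected format
# shown above, one JSON object per line.
rows = [
    {
        "chosen": [
            {"role": "user", "content": "Summarize the text in at most 2 sentences.\n\nText:\n..."},
            {"role": "assistant", "content": "A concise, faithful two-sentence summary."},
        ],
        "rejected": [
            {"role": "user", "content": "Summarize the text in at most 2 sentences.\n\nText:\n..."},
            {"role": "assistant", "content": "A longer, less focused summary."},
        ],
    },
]

with open("preference_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")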