DPO: Direct Preference Optimization
DPO (Direct Preference Optimization) is a simpler, more stable alternative to RLHF that directly optimizes language models from preference data without reward models or reinforcement learning.
The Problem with RLHF
RLHF's complexity creates challenges:
class RLHFvsDPO:
"""
Compare RLHF and DPO approaches.
"""
def compare_pipelines(self):
"""Compare RLHF vs DPO pipelines."""
print("RLHF Pipeline:")
print(" 1. Train SFT model")
print(" 2. Collect preference data")
print(" 3. Train reward model")
print(" 4. Use PPO to optimize policy")
print("\n Challenges:")
print(" - Reward model can be inaccurate")
print(" - PPO training is unstable")
print(" - Requires 3 models in memory")
print(" - Hyperparameter sensitive")
print(" - Reward hacking possible")
print()
print("DPO Pipeline:")
print(" 1. Train SFT model")
print(" 2. Collect preference data")
print(" 3. Directly optimize policy on preferences")
print("\n Advantages:")
print(" - No reward model needed")
print(" - No RL needed")
print(" - Only 2 models in memory (policy + reference)")
print(" - Simpler, more stable")
print(" - Direct optimization")
comparer = RLHFvsDPO()
comparer.compare_pipelines()
Key Insight:
RLHF's reward model is an intermediate step - we use it to train the policy. DPO asks: Can we skip the reward model and directly optimize for preferences?
Answer: Yes! DPO reparameterizes the RLHF objective to enable direct optimization.
DPO Theory
From RLHF to DPO
RLHF objective:
max_θ E_{x~D, y~π_θ}[r(x,y)] - β KL(π_θ || π_ref)
where:
- r(x,y): reward model score for response y to prompt x
- π_θ: policy being optimized
- π_ref: reference policy (SFT model)
- β: KL coefficient
DPO insight: This objective has a closed-form optimal policy!
π*(y|x) = 1/Z(x) * π_ref(y|x) * exp(r(x,y)/β)
where Z(x) is the partition function, a normalizer that depends only on the prompt x
This means we can express the reward in terms of policies:
r(x,y) = β log(π*(y|x) / π_ref(y|x)) + β log Z(x)
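This follows by taking the log of the closed form, log π*(y|x) = log π_ref(y|x) + r(x,y)/β - log Z(x), and solving for r.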
DPO Loss Function
Using the Bradley-Terry preference model:
P(y_w > y_l | x) = σ(r(x,y_w) - r(x,y_l))
where:
- y_w: winning (chosen) response
- y_l: losing (rejected) response
- σ: sigmoid function
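Since σ only sees the difference of the two rewards, any term shared by both responses, in particular β log Z(x), cancels. A tiny numeric sketch (toy values, purely illustrative):
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
log_ratio_w = 2.0   # toy value for log(π*(y_w|x) / π_ref(y_w|x))
log_ratio_l = -1.0  # toy value for log(π*(y_l|x) / π_ref(y_l|x))
log_Z = 5.0         # arbitrary; identical for both responses

# Rewards via the reparameterization, including the partition term
r_w = beta * log_ratio_w + beta * log_Z
r_l = beta * log_ratio_l + beta * log_Z

# Bradley-Terry preference probability: the β log Z terms cancel
print(sigmoid(r_w - r_l))                           # 0.5744...
print(sigmoid(beta * (log_ratio_w - log_ratio_l)))  # identical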
Substituting the reparameterized reward therefore gives a loss that depends only on policy and reference log probabilities:
import torch
import torch.nn as nn
import torch.nn.functional as F
def dpo_loss_formula():
"""
Explain DPO loss mathematically.
"""
print("DPO Loss Formula:")
print()
print("L_DPO(π_θ; π_ref) = -E[(x,y_w,y_l) ~ D] [")
print(" log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))")
print("]")
print()
print("Where:")
print(" - π_θ: policy being optimized")
print(" - π_ref: reference policy (frozen SFT model)")
print(" - y_w: chosen response")
print(" - y_l: rejected response")
print(" - β: temperature parameter")
print(" - σ: sigmoid function")
print()
print("Intuition:")
print(" Increase probability ratio π_θ/π_ref for chosen responses")
print(" Decrease probability ratio π_θ/π_ref for rejected responses")
print(" β controls how much policy can deviate from reference")
dpo_loss_formula()
DPO Loss Intuition:
The loss encourages the policy to:
- Increase likelihood of chosen responses (y_w) relative to reference
- Decrease likelihood of rejected responses (y_l) relative to reference
- Maintain balance controlled by β
All without explicitly computing rewards!
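Stripped of the trainer scaffolding that follows, the loss itself is only a few lines. A minimal sketch (toy values; in practice the log-probability inputs come from the policy and reference models):
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over batches of per-sequence log probabilities."""
    chosen_log_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_log_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    return -F.logsigmoid(logits).mean()

# Toy batch of two preference pairs
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-40.0, -35.0]),
    policy_rejected_logps=torch.tensor([-42.0, -33.0]),
    ref_chosen_logps=torch.tensor([-41.0, -36.0]),
    ref_rejected_logps=torch.tensor([-41.0, -34.0]),
)
print(loss)  # smaller when the policy favors chosen over rejected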
DPO Implementation
Complete DPO Trainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import Dataset, DataLoader
from dataclasses import dataclass
from typing import List, Dict
import torch
import torch.nn.functional as F
@dataclass
class PreferenceExample:
"""Preference data example."""
prompt: str
chosen: str
rejected: str
class DPODataset(Dataset):
"""Dataset for DPO training."""
def __init__(
self,
examples: List[PreferenceExample],
tokenizer,
max_length: int = 512
):
"""
Args:
examples: List of preference examples
tokenizer: Tokenizer
max_length: Maximum sequence length
"""
self.examples = examples
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.examples)
def __getitem__(self, idx):
"""
Tokenize prompt, chosen, and rejected responses.
"""
example = self.examples[idx]
# Combine prompt with responses
chosen_text = f"{example.prompt}\n{example.chosen}"
rejected_text = f"{example.prompt}\n{example.rejected}"
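        # Note: prompt and response are tokenized together here, so the
        # log probs computed in get_log_probs include prompt tokens. The
        # shared prompt cancels in the chosen-vs-rejected logit difference,
        # but reference DPO implementations mask prompt tokens explicitly.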
# Tokenize
chosen_tokens = self.tokenizer(
chosen_text,
max_length=self.max_length,
truncation=True,
padding='max_length',
return_tensors='pt'
)
rejected_tokens = self.tokenizer(
rejected_text,
max_length=self.max_length,
truncation=True,
padding='max_length',
return_tensors='pt'
)
return {
'chosen_input_ids': chosen_tokens['input_ids'].squeeze(),
'chosen_attention_mask': chosen_tokens['attention_mask'].squeeze(),
'rejected_input_ids': rejected_tokens['input_ids'].squeeze(),
'rejected_attention_mask': rejected_tokens['attention_mask'].squeeze(),
}
class DPOTrainer:
"""
Direct Preference Optimization trainer.
Simpler alternative to RLHF - no reward model or RL needed!
"""
def __init__(
self,
model_name: str,
beta: float = 0.1,
use_lora: bool = True,
lora_rank: int = 8
):
"""
Args:
model_name: Base model (should be SFT model)
beta: DPO temperature parameter
use_lora: Whether to use LoRA
lora_rank: LoRA rank if using LoRA
"""
self.beta = beta
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
# Load policy model (will be trained)
self.policy_model = AutoModelForCausalLM.from_pretrained(model_name)
# Apply LoRA if requested
if use_lora:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=lora_rank,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
self.policy_model = get_peft_model(self.policy_model, lora_config)
self.policy_model.print_trainable_parameters()
# Load reference model (frozen)
self.ref_model = AutoModelForCausalLM.from_pretrained(model_name)
for param in self.ref_model.parameters():
param.requires_grad = False
# Move to device
self.policy_model.to(self.device)
self.ref_model.to(self.device)
def get_log_probs(
self,
model,
input_ids: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor:
"""
Get log probabilities of sequences under model.
Args:
model: Language model
input_ids: Token IDs (batch, seq_len)
attention_mask: Attention mask
Returns:
Log probability of each sequence
"""
# Forward pass
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
logits = outputs.logits
# Shift logits and labels for next-token prediction
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = input_ids[:, 1:].contiguous()
shift_attention_mask = attention_mask[:, 1:].contiguous()
# Get log probabilities
log_probs = F.log_softmax(shift_logits, dim=-1)
# Gather log probs for actual tokens
# Shape: (batch, seq_len - 1)
token_log_probs = torch.gather(
log_probs,
dim=2,
index=shift_labels.unsqueeze(-1)
).squeeze(-1)
# Mask padding tokens
token_log_probs = token_log_probs * shift_attention_mask
        # Sum log probs over the sequence (padding contributes 0 via the mask).
        # Note: the DPO loss uses summed (not length-averaged) log probs; with
        # sums, the shared prompt's contribution cancels exactly in the
        # chosen-vs-rejected logit difference computed below.
        seq_log_probs = token_log_probs.sum(dim=1)
        return seq_log_probs
def compute_dpo_loss(
self,
chosen_input_ids: torch.Tensor,
chosen_attention_mask: torch.Tensor,
rejected_input_ids: torch.Tensor,
rejected_attention_mask: torch.Tensor
) -> Dict[str, torch.Tensor]:
"""
Compute DPO loss.
L = -E[log σ(β * log(π/π_ref)_chosen - β * log(π/π_ref)_rejected)]
Args:
chosen_input_ids: Chosen response token IDs
chosen_attention_mask: Chosen response attention mask
rejected_input_ids: Rejected response token IDs
rejected_attention_mask: Rejected response attention mask
Returns:
Dict with loss and metrics
"""
# Get log probs from policy model
policy_chosen_log_probs = self.get_log_probs(
self.policy_model, chosen_input_ids, chosen_attention_mask
)
policy_rejected_log_probs = self.get_log_probs(
self.policy_model, rejected_input_ids, rejected_attention_mask
)
# Get log probs from reference model
with torch.no_grad():
ref_chosen_log_probs = self.get_log_probs(
self.ref_model, chosen_input_ids, chosen_attention_mask
)
ref_rejected_log_probs = self.get_log_probs(
self.ref_model, rejected_input_ids, rejected_attention_mask
)
# Compute log ratios: log(π_θ / π_ref)
chosen_log_ratio = policy_chosen_log_probs - ref_chosen_log_probs
rejected_log_ratio = policy_rejected_log_probs - ref_rejected_log_probs
# DPO loss: -log σ(β * (log_ratio_chosen - log_ratio_rejected))
logits = self.beta * (chosen_log_ratio - rejected_log_ratio)
loss = -F.logsigmoid(logits).mean()
# Compute metrics
with torch.no_grad():
# Implicit reward
chosen_rewards = self.beta * chosen_log_ratio
rejected_rewards = self.beta * rejected_log_ratio
# Accuracy: how often chosen > rejected
accuracy = (chosen_rewards > rejected_rewards).float().mean()
return {
'loss': loss,
'chosen_rewards': chosen_rewards.mean(),
'rejected_rewards': rejected_rewards.mean(),
'accuracy': accuracy
}
def train(
self,
train_examples: List[PreferenceExample],
val_examples: List[PreferenceExample],
epochs: int = 3,
batch_size: int = 4,
learning_rate: float = 5e-7
):
"""
Train policy with DPO.
Args:
train_examples: Training preference data
val_examples: Validation preference data
epochs: Number of epochs
batch_size: Batch size
learning_rate: Learning rate (typically lower than SFT)
"""
# Create datasets
train_dataset = DPODataset(train_examples, self.tokenizer)
val_dataset = DPODataset(val_examples, self.tokenizer)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
# Optimizer
optimizer = torch.optim.AdamW(
[p for p in self.policy_model.parameters() if p.requires_grad],
lr=learning_rate
)
best_val_loss = float('inf')
for epoch in range(epochs):
# Training
self.policy_model.train()
train_metrics = {
'loss': 0,
'chosen_rewards': 0,
'rejected_rewards': 0,
'accuracy': 0
}
num_batches = 0
for batch in train_loader:
# Move to device
chosen_input_ids = batch['chosen_input_ids'].to(self.device)
chosen_attention_mask = batch['chosen_attention_mask'].to(self.device)
rejected_input_ids = batch['rejected_input_ids'].to(self.device)
rejected_attention_mask = batch['rejected_attention_mask'].to(self.device)
# Compute DPO loss
metrics = self.compute_dpo_loss(
chosen_input_ids,
chosen_attention_mask,
rejected_input_ids,
rejected_attention_mask
)
loss = metrics['loss']
# Backward pass
optimizer.zero_grad()
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(
[p for p in self.policy_model.parameters() if p.requires_grad],
max_norm=1.0
)
optimizer.step()
# Accumulate metrics
for key in train_metrics:
train_metrics[key] += metrics[key].item()
num_batches += 1
# Average metrics
for key in train_metrics:
train_metrics[key] /= num_batches
# Validation
val_metrics = self.validate(val_loader)
print(f"\nEpoch {epoch+1}/{epochs}")
print(f" Train Loss: {train_metrics['loss']:.4f}")
print(f" Train Accuracy: {train_metrics['accuracy']:.2%}")
print(f" Train Chosen Rewards: {train_metrics['chosen_rewards']:.4f}")
print(f" Train Rejected Rewards: {train_metrics['rejected_rewards']:.4f}")
print(f" Val Loss: {val_metrics['loss']:.4f}")
print(f" Val Accuracy: {val_metrics['accuracy']:.2%}")
if val_metrics['loss'] < best_val_loss:
best_val_loss = val_metrics['loss']
self.save_model('best_dpo_model')
print(" Saved best model!")
def validate(self, val_loader):
"""Validate the model."""
self.policy_model.eval()
val_metrics = {
'loss': 0,
'chosen_rewards': 0,
'rejected_rewards': 0,
'accuracy': 0
}
num_batches = 0
with torch.no_grad():
for batch in val_loader:
chosen_input_ids = batch['chosen_input_ids'].to(self.device)
chosen_attention_mask = batch['chosen_attention_mask'].to(self.device)
rejected_input_ids = batch['rejected_input_ids'].to(self.device)
rejected_attention_mask = batch['rejected_attention_mask'].to(self.device)
metrics = self.compute_dpo_loss(
chosen_input_ids,
chosen_attention_mask,
rejected_input_ids,
rejected_attention_mask
)
for key in val_metrics:
val_metrics[key] += metrics[key].item()
num_batches += 1
for key in val_metrics:
val_metrics[key] /= num_batches
return val_metrics
def save_model(self, path: str):
"""Save the DPO-trained model."""
self.policy_model.save_pretrained(path)
self.tokenizer.save_pretrained(path)
# Example usage
print("Creating DPO trainer...")
# trainer = DPOTrainer("gpt2", beta=0.1, use_lora=True)
# Example preference data
example_preferences = [
PreferenceExample(
prompt="Explain gravity to a child.",
chosen="Gravity is like an invisible force that pulls things down to Earth! It's why when you drop a ball, it falls to the ground instead of floating away. Everything with mass has gravity - even you have a tiny bit! The Earth is so big and heavy that its gravity is strong enough to keep us and everything else from floating into space.",
rejected="Gravity is a fundamental force described by Einstein's general relativity as the curvature of spacetime caused by mass-energy. The gravitational field strength is proportional to mass and inversely proportional to the square of the distance."
),
# ... more examples
]
# trainer.train(train_examples, val_examples, epochs=3)
DPO Hyperparameters:
- β (beta): Temperature parameter (typical range 0.1 - 0.5)
  - Higher β: Stronger implicit KL constraint; policy stays closer to the reference
  - Lower β: Weaker constraint; more aggressive optimization
  - Start with β = 0.1 (see the sketch after this list)
- Learning rate: Lower than for SFT (1e-7 to 5e-6)
  - Too high: Unstable; policy diverges
  - Too low: Slow convergence
- Batch size: Larger is better (4-16)
  - More stable gradient estimates
  - Better use of preference data
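To make the β direction concrete, here is a small sketch (toy margins, purely illustrative) of the loss gradient scale β·σ(-β·margin), where margin is the chosen-minus-rejected log-ratio difference. The gradient vanishes once the margin reaches a few multiples of 1/β, so a higher β stops optimization earlier and keeps the policy closer to the reference:
import torch

# d/d(margin) [-log σ(β·margin)] has magnitude β·σ(-β·margin)
for beta in [0.05, 0.1, 0.5]:
    for margin in [0.0, 2.0, 10.0]:
        scale = beta * torch.sigmoid(torch.tensor(-beta * margin)).item()
        print(f"beta={beta:<4} margin={margin:>4}: gradient scale = {scale:.4f}")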
DPO vs RLHF Comparison
import pandas as pd
comparison = pd.DataFrame({
'Aspect': [
'Stages',
'Models needed',
'Training stability',
'Implementation complexity',
'Memory usage',
'Training speed',
'Performance',
'Hyperparameter sensitivity'
],
'RLHF (PPO)': [
'3 (SFT, RM, PPO)',
'3 (policy, ref, reward)',
'Unstable',
'High',
'High (3 models)',
'Slow',
'Strong',
'Very sensitive'
],
'DPO': [
'2 (SFT, DPO)',
'2 (policy, ref)',
'Stable',
'Medium',
'Medium (2 models)',
'Fast',
'Comparable',
'Less sensitive'
]
})
print("\nRLHF vs DPO Comparison:")
print(comparison.to_string(index=False))
print("\n" + "="*70)
print("When to use each:")
print("="*70)
print("Use RLHF when:")
print(" - You need explicit reward signals")
print(" - You have complex reward shaping requirements")
print(" - You want to combine multiple reward sources")
print()
print("Use DPO when:")
print(" - You want simplicity and stability")
print(" - You have good preference data")
print(" - You want faster training")
print(" - You have limited compute resources")
Summary
DPO simplifies alignment by:
- Eliminating the reward model: optimizes directly from preference data
- Removing RL: training reduces to a simple supervised loss on preferences
- Maintaining performance: Comparable to RLHF
- Improving stability: More stable training dynamics
DPO is becoming the preferred method for preference-based alignment due to its simplicity and effectiveness.