RLHF: Reinforcement Learning from Human Feedback
RLHF (Reinforcement Learning from Human Feedback) is the secret sauce behind ChatGPT's helpfulness and safety. It aligns models with human preferences through a three-stage process.
The Three Stages of RLHF
import torch
import torch.nn as nn
from typing import List, Tuple
class RLHFPipeline:
"""
Complete RLHF pipeline overview.
Stage 1: Supervised Fine-Tuning (SFT)
Stage 2: Reward Model Training
Stage 3: RL Fine-Tuning with PPO
"""
def __init__(self):
self.stages = {
1: "Supervised Fine-Tuning (SFT)",
2: "Reward Model Training",
3: "PPO Optimization"
}
def describe_pipeline(self):
"""Describe the three-stage RLHF pipeline."""
print("RLHF Three-Stage Pipeline:\n")
print("Stage 1: Supervised Fine-Tuning (SFT)")
print(" - Start with pre-trained base model")
print(" - Fine-tune on high-quality demonstrations")
print(" - Creates initial 'helpful' policy")
print(" - Example: GPT-3 → GPT-3.5-Instruct (SFT)")
print()
print("Stage 2: Reward Model Training")
print(" - Collect human preferences (A vs B comparisons)")
print(" - Train reward model to predict human preferences")
print(" - Reward model assigns scores to responses")
print(" - Example: Given response, output reward score 0-1")
print()
print("Stage 3: RL Fine-Tuning (PPO)")
print(" - Use reward model as environment")
print(" - Optimize policy with PPO to maximize reward")
print(" - Add KL penalty to prevent drift from SFT model")
print(" - Example: ChatGPT = GPT-3.5-Instruct (SFT) + RLHF")
pipeline = RLHFPipeline()
pipeline.describe_pipeline()
Why Three Stages?
- SFT first: Provides good starting point, teaches basic instruction-following
- Reward model: Captures nuanced human preferences (what makes a response better)
- PPO: Optimizes for reward while preventing model from degenerating
You can't skip stages: RL from scratch is unstable and fails to learn useful behavior, while SFT alone cannot capture the nuanced preferences a reward model provides.
Stage 1: Supervised Fine-Tuning (SFT)
Create initial policy from demonstrations:
from transformers import AutoModelForCausalLM, AutoTokenizer
class SFTStage:
"""
Stage 1: Supervised Fine-Tuning on demonstrations.
Creates initial policy π_SFT that follows instructions.
"""
def __init__(self, base_model_name: str):
"""
Args:
base_model_name: Pre-trained model to start from
"""
self.model = AutoModelForCausalLM.from_pretrained(base_model_name)
self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
def create_sft_dataset(self):
"""
Create high-quality demonstration dataset.
        For InstructGPT (ChatGPT's precursor), this was roughly 13k human-written demonstrations.
"""
demonstrations = [
{
"prompt": "Explain quantum computing to a beginner.",
"response": "Quantum computing is a new type of computing that uses quantum mechanics. Unlike regular computers that use bits (0 or 1), quantum computers use qubits that can be 0, 1, or both at once (superposition). This allows them to solve certain problems much faster than classical computers..."
},
{
"prompt": "Write a professional email declining a job offer.",
"response": "Dear [Hiring Manager],\n\nThank you for offering me the [Position] role at [Company]. After careful consideration, I have decided to decline the offer. This was a difficult decision, as I was impressed by the team and the opportunity. However, I have accepted another position that better aligns with my current career goals...\n\nBest regards,\n[Your name]"
}
# ... thousands more high-quality examples
]
return demonstrations
def train_sft(self, demonstrations, epochs=3):
"""
Train SFT model on demonstrations.
Standard supervised fine-tuning (covered in instruction tuning).
"""
print(f"Training SFT model on {len(demonstrations)} demonstrations...")
print("This creates π_SFT: initial instruction-following policy")
# Implementation same as instruction tuning
# (see instruction-tuning.mdx)
# Stage 1 creates π_SFT
# sft = SFTStage("gpt2")
# demonstrations = sft.create_sft_dataset()
# sft.train_sft(demonstrations)
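The SFT loss itself is plain next-token cross-entropy, but it is usually computed only on the response tokens. A minimal sketch of that masking (sft_one_step and the exact prompt/response joining are illustrative assumptions, not the exact recipe used for ChatGPT):
def sft_one_step(model, tokenizer, prompt: str, response: str, optimizer):
    """One SFT step with the prompt tokens masked out of the loss."""
    prompt_ids = tokenizer(prompt + "\n", return_tensors='pt').input_ids
    full_ids = tokenizer(prompt + "\n" + response, return_tensors='pt').input_ids
    # Label -100 on prompt positions so cross-entropy ignores them
    labels = full_ids.clone()
    labels[:, :prompt_ids.shape[1]] = -100
    outputs = model(input_ids=full_ids, labels=labels)  # HF causal LMs shift labels internally
    optimizer.zero_grad()
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()
# Hypothetical usage with the classes above:
# sft = SFTStage("gpt2")
# optimizer = torch.optim.AdamW(sft.model.parameters(), lr=1e-5)
# demo = sft.create_sft_dataset()[0]
# sft_one_step(sft.model, sft.tokenizer, demo["prompt"], demo["response"], optimizer)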
Stage 2: Reward Model Training
The most crucial and novel stage:
Collecting Preference Data
from dataclasses import dataclass
from typing import Tuple
@dataclass
class PreferenceExample:
"""
Single preference comparison.
Human labelers rank multiple responses to same prompt.
"""
prompt: str
response_chosen: str # Preferred response
response_rejected: str # Less preferred response
def collect_preference_data(
sft_model,
prompts: List[str],
k_responses: int = 4
) -> List[PreferenceExample]:
"""
Collect preference data from human labelers.
Process:
1. Sample prompts from dataset
2. Generate k responses per prompt using SFT model
3. Human labelers rank responses
4. Create pairwise comparisons
Args:
sft_model: SFT model to generate responses
prompts: List of prompts
k_responses: Number of responses to generate per prompt
Returns:
List of preference examples
"""
preference_data = []
for prompt in prompts:
# Generate k different responses
responses = []
        for i in range(k_responses):
            # Vary temperature for diversity.
            # (Assumes a text-in/text-out generate() wrapper around the SFT model.)
            response = sft_model.generate(
                prompt,
                temperature=0.7 + i*0.1,
                max_length=256
            )
            responses.append(response)
# Human labelers rank: response_1 > response_2 > response_3 > response_4
# Create pairwise comparisons from ranking
# If ranking is [A, B, C, D], we get:
# - A > B, A > C, A > D
# - B > C, B > D
# - C > D
# Simulated ranking (in practice, done by humans)
ranked_responses = rank_responses_human(prompt, responses)
# Create pairwise examples
for i in range(len(ranked_responses)):
for j in range(i+1, len(ranked_responses)):
preference_data.append(
PreferenceExample(
prompt=prompt,
response_chosen=ranked_responses[i],
response_rejected=ranked_responses[j]
)
)
return preference_data
def rank_responses_human(prompt: str, responses: List[str]) -> List[str]:
"""
Simulate human ranking of responses.
In practice, this is done by human labelers who rank
responses based on helpfulness, harmlessness, and honesty.
"""
# Placeholder: actual ranking done by humans
# For demonstration, return in order
return responses
# For InstructGPT: roughly 33k prompts were used to collect preference comparisons
# preference_data = collect_preference_data(sft_model, prompts, k_responses=4)
Preference Data Collection Challenges:
- Expensive: Requires many human labelers ($100k-$1M+)
- Subjective: Labeler disagreement on what's "better"
- Biased: Reflects labeler demographics and values
- Time-consuming: Weeks to months of labeling
- Quality critical: Bad preferences → bad reward model → bad final model
OpenAI uses detailed labeling guidelines and quality checks.
Reward Model Architecture
class RewardModel(nn.Module):
"""
Reward model: predicts human preference score.
Architecture: LLM backbone → scalar reward head
"""
def __init__(self, base_model_name: str):
"""
Args:
base_model_name: Base model (usually same as SFT model)
"""
super().__init__()
# Load base model (without LM head)
from transformers import AutoModel
self.backbone = AutoModel.from_pretrained(base_model_name)
# Get hidden size
hidden_size = self.backbone.config.hidden_size
# Reward head: projects to scalar score
self.reward_head = nn.Sequential(
nn.Linear(hidden_size, hidden_size),
nn.ReLU(),
nn.Dropout(0.1),
nn.Linear(hidden_size, 1) # Scalar reward
)
def forward(self, input_ids, attention_mask):
"""
Compute reward for a response.
Args:
input_ids: Tokenized prompt + response
attention_mask: Attention mask
Returns:
Scalar reward score
"""
# Get hidden states from backbone
outputs = self.backbone(
input_ids=input_ids,
attention_mask=attention_mask
)
        # Use the last token's hidden state as the sequence summary (like a value head)
        # Index of the last non-padding token (assumes right padding)
        sequence_lengths = attention_mask.sum(dim=1) - 1
batch_size = input_ids.shape[0]
# Get last hidden state for each sequence
last_hidden_states = outputs.last_hidden_state[
torch.arange(batch_size, device=input_ids.device),
sequence_lengths
]
# Project to scalar reward
reward = self.reward_head(last_hidden_states)
return reward.squeeze(-1) # (batch_size,)
# Test reward model
reward_model = RewardModel("gpt2")
print(f"Reward model parameters: {sum(p.numel() for p in reward_model.parameters()):,}")
Training the Reward Model
class RewardModelTrainer:
"""
Train reward model on preference data.
Loss: maximize log probability that chosen response has higher reward.
"""
def __init__(self, model: RewardModel, tokenizer):
"""
Args:
model: Reward model
tokenizer: Tokenizer
"""
self.model = model
self.tokenizer = tokenizer
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
def compute_loss(
self,
reward_chosen: torch.Tensor,
reward_rejected: torch.Tensor
) -> torch.Tensor:
"""
Compute pairwise ranking loss.
Loss = -log(sigmoid(r_chosen - r_rejected))
This maximizes the probability that chosen response
has higher reward than rejected.
Args:
reward_chosen: Rewards for chosen responses
reward_rejected: Rewards for rejected responses
Returns:
Loss scalar
"""
        # Pairwise (Bradley-Terry) ranking loss.
        # We want r_chosen > r_rejected, i.e. sigmoid(r_chosen - r_rejected) → 1.
        # logsigmoid is the numerically stable form of log(sigmoid(x)).
        loss = -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
        return loss
def train(
self,
preference_data: List[PreferenceExample],
epochs: int = 1,
batch_size: int = 4,
learning_rate: float = 1e-5
):
"""
Train reward model on preference comparisons.
Args:
preference_data: List of preference examples
epochs: Number of epochs
batch_size: Batch size
learning_rate: Learning rate
"""
optimizer = torch.optim.AdamW(self.model.parameters(), lr=learning_rate)
for epoch in range(epochs):
total_loss = 0
num_batches = 0
# Process in batches
for i in range(0, len(preference_data), batch_size):
batch = preference_data[i:i+batch_size]
# Tokenize chosen and rejected responses
chosen_texts = [
f"{ex.prompt}\n{ex.response_chosen}" for ex in batch
]
rejected_texts = [
f"{ex.prompt}\n{ex.response_rejected}" for ex in batch
]
chosen_encodings = self.tokenizer(
chosen_texts,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
).to(self.device)
rejected_encodings = self.tokenizer(
rejected_texts,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
).to(self.device)
# Compute rewards
reward_chosen = self.model(
chosen_encodings['input_ids'],
chosen_encodings['attention_mask']
)
reward_rejected = self.model(
rejected_encodings['input_ids'],
rejected_encodings['attention_mask']
)
# Compute loss
loss = self.compute_loss(reward_chosen, reward_rejected)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
num_batches += 1
avg_loss = total_loss / num_batches
print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")
            # Quick accuracy check on a subset of the training preferences;
            # in practice, evaluate on held-out comparisons.
            accuracy = self.evaluate_accuracy(preference_data[:100])
            print(f" Accuracy on preferences: {accuracy:.2%}")
def evaluate_accuracy(self, preference_data: List[PreferenceExample]) -> float:
"""
Evaluate how often reward model correctly ranks preferences.
"""
self.model.eval()
correct = 0
total = 0
with torch.no_grad():
for ex in preference_data:
# Tokenize
chosen_text = f"{ex.prompt}\n{ex.response_chosen}"
rejected_text = f"{ex.prompt}\n{ex.response_rejected}"
chosen_enc = self.tokenizer(
chosen_text, return_tensors='pt', truncation=True, max_length=512
).to(self.device)
rejected_enc = self.tokenizer(
rejected_text, return_tensors='pt', truncation=True, max_length=512
).to(self.device)
# Get rewards
r_chosen = self.model(chosen_enc['input_ids'], chosen_enc['attention_mask'])
r_rejected = self.model(rejected_enc['input_ids'], rejected_enc['attention_mask'])
# Check if chosen > rejected
if r_chosen > r_rejected:
correct += 1
total += 1
self.model.train()
return correct / total if total > 0 else 0.0
# Example training
# trainer = RewardModelTrainer(reward_model, tokenizer)
# trainer.train(preference_data, epochs=1)
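To make the pairwise ranking loss concrete, here is a tiny worked example with hypothetical reward values:
import torch

r_chosen, r_rejected = torch.tensor(1.2), torch.tensor(0.3)

# P(chosen preferred) = sigmoid(1.2 - 0.3) = sigmoid(0.9) ≈ 0.711
prob = torch.sigmoid(r_chosen - r_rejected)
loss = -torch.log(prob)  # ≈ 0.341
print(f"P(chosen > rejected) = {prob.item():.3f}, loss = {loss.item():.3f}")
# A larger reward gap pushes the probability toward 1 and the loss toward 0.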
Reward Model Success Metrics:
- Accuracy: % of preferences correctly ranked (target: >70%)
- Agreement with humans: Inter-rater reliability
- Calibration: Reward magnitudes are meaningful (see the normalization sketch after this list)
- Generalization: Works on out-of-distribution prompts
A good reward model is critical - garbage in, garbage out!
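Calibration matters in particular because PPO reacts to reward differences; a common practical trick (an illustrative sketch, not something the trainer above implements) is to normalize reward-model outputs against a reference set before RL:
import torch

def normalize_rewards(raw_rewards: torch.Tensor, reference_rewards: torch.Tensor) -> torch.Tensor:
    """Standardize rewards using statistics from a reference set
    (e.g. reward-model scores of SFT samples on held-out prompts)."""
    mean = reference_rewards.mean()
    std = reference_rewards.std().clamp_min(1e-6)
    return (raw_rewards - mean) / std

# Hypothetical usage during PPO:
# rewards = normalize_rewards(rewards, reference_rewards)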
Stage 3: PPO Optimization
Use reward model to optimize policy with PPO (Proximal Policy Optimization):
PPO Algorithm
class PPOTrainer:
"""
PPO (Proximal Policy Optimization) for RLHF.
Objective: max E[reward(y|x)] - β * KL(π_θ || π_SFT)
Where:
- π_θ: Current policy (being optimized)
- π_SFT: Reference policy (from Stage 1)
- β: KL penalty coefficient
"""
def __init__(
self,
policy_model, # Model being optimized
ref_model, # Reference model (frozen SFT model)
reward_model, # Reward model (frozen)
tokenizer,
kl_coef: float = 0.1
):
"""
Args:
policy_model: Policy being optimized
ref_model: Reference policy (SFT model, frozen)
reward_model: Reward model (frozen)
tokenizer: Tokenizer
kl_coef: KL divergence penalty coefficient
"""
        self.policy = policy_model
        self.ref_model = ref_model
        self.reward_model = reward_model
        self.tokenizer = tokenizer
        self.kl_coef = kl_coef
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        # Freeze reference and reward models
        for param in self.ref_model.parameters():
            param.requires_grad = False
        for param in self.reward_model.parameters():
            param.requires_grad = False
        # Optimizer for the policy (the only model that is updated)
        self.optimizer = torch.optim.AdamW(self.policy.parameters(), lr=1e-5)
def compute_rewards(
self,
prompts: List[str],
responses: List[str]
) -> torch.Tensor:
"""
Compute rewards for prompt-response pairs.
Args:
prompts: List of prompts
responses: List of generated responses
Returns:
Tensor of rewards
"""
# Combine prompts and responses
texts = [f"{p}\n{r}" for p, r in zip(prompts, responses)]
# Tokenize
encodings = self.tokenizer(
texts,
padding=True,
truncation=True,
return_tensors='pt'
).to(self.device)
# Get rewards from reward model
with torch.no_grad():
rewards = self.reward_model(
encodings['input_ids'],
encodings['attention_mask']
)
return rewards
def compute_kl_penalty(
self,
input_ids: torch.Tensor,
attention_mask: torch.Tensor
) -> torch.Tensor:
"""
Compute KL divergence between policy and reference.
KL(π_θ || π_ref) prevents policy from drifting too far from SFT.
Args:
input_ids: Token IDs
attention_mask: Attention mask
Returns:
KL divergence
"""
# Get logits from both models
policy_logits = self.policy(
input_ids=input_ids,
attention_mask=attention_mask
).logits
with torch.no_grad():
ref_logits = self.ref_model(
input_ids=input_ids,
attention_mask=attention_mask
).logits
# Convert to log probabilities
policy_log_probs = torch.log_softmax(policy_logits, dim=-1)
ref_log_probs = torch.log_softmax(ref_logits, dim=-1)
# KL divergence: KL(π || π_ref) = E_π[log π - log π_ref]
# Compute per-token KL
kl_div = (
torch.exp(policy_log_probs) *
(policy_log_probs - ref_log_probs)
).sum(dim=-1)
# Average over sequence (only non-padding tokens)
kl_div = (kl_div * attention_mask).sum(dim=1) / attention_mask.sum(dim=1)
return kl_div.mean()
def ppo_step(
self,
prompts: List[str],
batch_size: int = 4,
ppo_epochs: int = 4,
clip_range: float = 0.2
):
"""
Single PPO update step.
Args:
prompts: List of prompts
batch_size: Batch size
ppo_epochs: Number of PPO epochs per batch
clip_range: PPO clipping range
"""
# Generate responses from current policy
responses = []
for prompt in prompts:
# Generate response
input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)
output_ids = self.policy.generate(
input_ids,
max_new_tokens=128,
temperature=0.7,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id
)
# Decode response (remove prompt)
response = self.tokenizer.decode(
output_ids[0][input_ids.shape[1]:],
skip_special_tokens=True
)
responses.append(response)
# Compute rewards
rewards = self.compute_rewards(prompts, responses)
# Prepare full sequences for training
texts = [f"{p}\n{r}" for p, r in zip(prompts, responses)]
encodings = self.tokenizer(
texts,
padding=True,
truncation=True,
return_tensors='pt'
).to(self.device)
        # Policy updates (simplified)
        # NOTE: this is a schematic sketch. The rewards above were computed under
        # no_grad, so they are constants w.r.t. the policy; only the KL term carries
        # gradient here. Real PPO maximizes a clipped surrogate over per-token
        # log-prob ratios and advantages, which is where clip_range comes in
        # (a sketch of that clipped objective follows the example below).
        for epoch in range(ppo_epochs):
            # KL penalty between current policy and frozen reference policy
            kl_penalty = self.compute_kl_penalty(
                encodings['input_ids'],
                encodings['attention_mask']
            )
            # Objective: reward - β * KL (maximize), so minimize the negative
            total_reward = rewards.mean() - self.kl_coef * kl_penalty
            loss = -total_reward
            # Gradient step on the policy
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            print(f" Epoch {epoch+1}/{ppo_epochs}: Reward={rewards.mean():.4f}, KL={kl_penalty:.4f}, Loss={loss:.4f}")
return {
'reward': rewards.mean().item(),
'kl': kl_penalty.item(),
'responses': responses
}
# Example PPO training
# ppo_trainer = PPOTrainer(policy_model, ref_model, reward_model, tokenizer)
# prompts = ["Explain machine learning", "Write a poem about nature"]
# results = ppo_trainer.ppo_step(prompts)
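The ppo_step above collapses the objective into mean reward minus a KL penalty for readability, and since the rewards are computed under no_grad only the KL term actually carries gradient. Full PPO instead maximizes a clipped surrogate over per-token log-probability ratios and advantage estimates. A minimal sketch of that core loss (all tensor names are illustrative; computing old log-probs and advantages, e.g. via GAE with a value head, is assumed to happen elsewhere):
import torch

def ppo_clipped_loss(new_logprobs: torch.Tensor,
                     old_logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_range: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate loss over response tokens."""
    # Probability ratio between current policy and the policy that generated the rollout
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Pessimistic (minimum) objective, negated to obtain a loss to minimize
    return -torch.min(unclipped, clipped).mean()
Libraries such as TRL combine this surrogate with a value-function loss and fold the per-token KL penalty into the reward signal.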
RLHF Training Challenges:
- Instability: RL training can be unstable, requires careful tuning
- Reward hacking: Model exploits weaknesses in reward model
- Mode collapse: Model learns to generate safe but boring responses
- Computational cost: Requires keeping multiple models in memory (policy, frozen reference, and reward model; full PPO adds a value/critic model)
- Hyperparameter sensitivity: KL coefficient, learning rate critical
Tips:
- Start with small KL coefficient (0.01-0.1)
- Use LoRA for the policy to reduce memory (see the sketch after this list)
- Monitor both reward and KL closely
- Validate outputs frequently
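On the LoRA tip: a minimal sketch with the peft library, assuming a GPT-2-style policy whose attention projection modules are named c_attn (target module names differ across architectures):
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

policy_model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(
    r=8,                        # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection; differs for other models
    task_type="CAUSAL_LM",
)
policy_model = get_peft_model(policy_model, lora_config)
policy_model.print_trainable_parameters()  # only the LoRA adapters are trainable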
Complete RLHF Pipeline
def run_complete_rlhf_pipeline():
"""
Full RLHF pipeline from start to finish.
"""
print("="*70)
print("Complete RLHF Pipeline")
print("="*70)
# Stage 1: SFT
print("\nStage 1: Supervised Fine-Tuning")
print(" - Training on ~13k demonstrations")
print(" - Creates π_SFT: helpful instruction-following model")
print(" - Duration: ~1 day on 8 GPUs")
# Stage 2: Reward Model
print("\nStage 2: Reward Model Training")
print(" - Collecting ~33k preference comparisons")
print(" - Training reward model to predict preferences")
print(" - Duration: ~1 week (data collection) + 1 day (training)")
# Stage 3: PPO
print("\nStage 3: PPO Optimization")
print(" - Optimizing policy with PPO")
print(" - Using reward model as environment")
print(" - Duration: ~1 day on 8 GPUs")
print("\n" + "="*70)
print("Result: ChatGPT-style model!")
print("="*70)
# Comparison
print("\nBefore RLHF (SFT only):")
print(" Prompt: 'How do I make a bomb?'")
print(" Response: 'To make a bomb, you need...' [continues]")
print("\nAfter RLHF:")
print(" Prompt: 'How do I make a bomb?'")
print(" Response: 'I cannot and will not provide information on creating weapons or explosives. This is dangerous and illegal. If you're interested in chemistry or engineering, I'd be happy to suggest safe, legal resources.'")
run_complete_rlhf_pipeline()
Summary
RLHF aligns LLMs with human values through three stages:
- SFT: Teach basic instruction-following
- Reward Model: Learn human preferences from comparisons
- PPO: Optimize for reward while preventing drift
This created the helpful, harmless, honest assistants we use today.