Model Scaling Laws: The Science of Size
Why is GPT-4 better than GPT-3? Why do companies spend millions training ever-larger models? The answer lies in scaling laws - mathematical relationships that predict how model performance improves with scale.
The Scaling Law Revolution
In 2020, OpenAI published groundbreaking research showing that language model performance follows predictable power laws. This transformed LLM development from art to science.
The Core Insight
Model performance (measured by loss) improves predictably as you scale three factors:
- Model size (N): Number of parameters
- Dataset size (D): Number of training tokens
- Compute (C): Total FLOPs used for training
Power Law Behavior:
Performance follows a power law:
Loss ∝ 1/Scale^α
Understanding the Three Axes
1. Model Size (Parameters N)
def count_transformer_parameters(num_layers, d_model, num_heads, d_ff, vocab_size):
    """
    Calculate parameters in a transformer model.

    Args:
        num_layers: Number of transformer layers
        d_model: Model dimension
        num_heads: Number of attention heads (does not affect the count:
            the heads partition d_model)
        d_ff: Feed-forward hidden dimension
        vocab_size: Size of token vocabulary
    """
    # Embedding layer (positional embeddings omitted for simplicity)
    embedding_params = vocab_size * d_model

    # Single transformer layer:
    # - Multi-head attention: 4 * (d_model * d_model) for Q, K, V, O projections
    # - Layer norm 1: 2 * d_model (gamma and beta)
    # - Feed-forward: 2 * (d_model * d_ff) + d_ff + d_model (weights and biases)
    # - Layer norm 2: 2 * d_model
    attn_params = 4 * (d_model ** 2)
    norm1_params = 2 * d_model
    ff_params = (d_model * d_ff) + d_ff + (d_ff * d_model) + d_model
    norm2_params = 2 * d_model
    layer_params = attn_params + norm1_params + ff_params + norm2_params

    transformer_params = num_layers * layer_params

    # Output layer (usually shares weights with the embedding)
    output_params = 0

    total_params = embedding_params + transformer_params + output_params
    return total_params
# GPT-2 Small (124M parameters)
gpt2_small = count_transformer_parameters(
    num_layers=12,
    d_model=768,
    num_heads=12,
    d_ff=3072,
    vocab_size=50257,
)
print(f"GPT-2 Small: {gpt2_small:,} parameters")

# GPT-2 Large (774M parameters)
gpt2_large = count_transformer_parameters(
    num_layers=36,
    d_model=1280,
    num_heads=20,
    d_ff=5120,
    vocab_size=50257,
)
print(f"GPT-2 Large: {gpt2_large:,} parameters")

# GPT-3 (175B parameters - approximate)
gpt3 = count_transformer_parameters(
    num_layers=96,
    d_model=12288,
    num_heads=96,
    d_ff=49152,
    vocab_size=50257,
)
print(f"GPT-3: {gpt3:,} parameters")
print(f"GPT-3 / GPT-2 Small ratio: {gpt3 / gpt2_small:.1f}x")
2. Dataset Size (Tokens D)
def estimate_training_tokens(parameters, compute_optimal=True):
    """
    Estimate optimal training tokens based on model size.

    Chinchilla scaling laws suggest tokens ≈ 20 × parameters
    for compute-optimal training.

    Args:
        parameters: Number of model parameters
        compute_optimal: If True, use the Chinchilla ratio (20 tokens
            per parameter); if False, use common practice (300B tokens)
    """
    if compute_optimal:
        # Chinchilla: tokens ≈ 20 × parameters
        return 20 * parameters
    # Common practice: fixed dataset (e.g., 300B tokens)
    return 300_000_000_000
# GPT-3 175B
gpt3_params = 175_000_000_000
chinchilla_tokens = estimate_training_tokens(gpt3_params, compute_optimal=True)
actual_tokens = 300_000_000_000 # GPT-3 was trained on 300B tokens
print(f"GPT-3 parameters: {gpt3_params:,}")
print(f"Actual training tokens: {actual_tokens:,}")
print(f"Chinchilla optimal tokens: {chinchilla_tokens:,}")
print(f"GPT-3 was {'under' if actual_tokens < chinchilla_tokens else 'over'}trained")
3. Compute Budget (FLOPs C)
def estimate_training_compute(parameters, tokens, forward_pass_only=False):
    """
    Estimate training compute in FLOPs.

    Args:
        parameters: Number of model parameters (N)
        tokens: Number of training tokens (D)
        forward_pass_only: If True, count only the forward pass;
            otherwise include the backward pass as well

    Returns:
        Total FLOPs
    """
    # Approximate FLOPs per token ≈ 6N (forward + backward):
    #   forward pass:  ~2N FLOPs
    #   backward pass: ~4N FLOPs (about 2x the forward pass)
    if forward_pass_only:
        flops_per_token = 2 * parameters
    else:
        flops_per_token = 6 * parameters
    return flops_per_token * tokens
# GPT-3 training compute
gpt3_compute = estimate_training_compute(
    parameters=175_000_000_000,
    tokens=300_000_000_000,
)
print(f"GPT-3 training compute: {gpt3_compute:.2e} FLOPs")
print(f"GPT-3 training compute: ~{gpt3_compute / 1e23:.1f} × 10^23 FLOPs")

# Compare compute budgets
models = {
    "GPT-2": (1.5e9, 10e9),        # 1.5B params, 10B tokens
    "GPT-3": (175e9, 300e9),       # 175B params, 300B tokens
    "Chinchilla": (70e9, 1.4e12),  # 70B params, 1.4T tokens
}
for name, (params, tokens) in models.items():
    compute = estimate_training_compute(params, tokens)
    print(f"{name:12s}: {compute:.2e} FLOPs ({params/1e9:.0f}B params, {tokens/1e9:.0f}B tokens)")
Compute Costs:
Training GPT-3 cost an estimated $4.6M in compute. GPT-4 likely cost tens of millions. The compute budget is often the limiting factor in modern LLM development.
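Dollar figures like these can be sanity-checked by converting a FLOP budget into GPU-hours. The GPU throughput, utilization, and hourly rate below are illustrative assumptions (loosely A100-class numbers), not published figures:

```python
def estimate_training_cost(total_flops, gpu_flops_per_sec=312e12,
                           utilization=0.35, dollars_per_gpu_hour=2.0):
    """Rough dollar cost of a training run (all defaults are assumptions).

    gpu_flops_per_sec: assumed peak throughput of one GPU (312 TFLOP/s)
    utilization: assumed fraction of peak actually sustained in training
    dollars_per_gpu_hour: assumed cloud price per GPU-hour
    """
    effective_flops_per_sec = gpu_flops_per_sec * utilization
    gpu_hours = total_flops / effective_flops_per_sec / 3600
    return gpu_hours * dollars_per_gpu_hour

# GPT-3-scale run: C ≈ 6 × 175e9 × 300e9 ≈ 3.15e23 FLOPs
cost = estimate_training_cost(6 * 175e9 * 300e9)
print(f"Estimated cost: ${cost:,.0f}")
```

With these assumptions the estimate lands in the low millions of dollars; the exact figure swings by several-fold with hardware generation and achieved utilization, which is why published cost estimates vary so widely.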
The Scaling Laws
OpenAI Scaling Laws (2020)
Performance (measured by cross-entropy loss L) scales as:
1. With Model Size (N):
L(N) ≈ (Nc / N)^αN
where:
- N = number of parameters
- Nc ≈ 8.8 × 10^13 (critical parameter count)
- αN ≈ 0.076 (scaling exponent)
2. With Dataset Size (D):
L(D) ≈ (Dc / D)^αD
where:
- D = number of tokens
- Dc ≈ 5.4 × 10^13 (critical dataset size)
- αD ≈ 0.095
3. With Compute (C):
L(C) ≈ (Cc / C)^αC
where:
- C = compute in petaflop-days (PF-days)
- Cc ≈ 3.1 × 10^8 PF-days (critical compute)
- αC ≈ 0.050
Implementation: Predicting Loss
import numpy as np
import matplotlib.pyplot as plt

class ScalingLaws:
    """OpenAI (Kaplan et al., 2020) scaling laws."""

    def __init__(self):
        # Fitted constants from the OpenAI paper
        self.Nc = 8.8e13      # Critical parameter count
        self.Dc = 5.4e13      # Critical dataset size (tokens)
        self.Cc = 3.1e8       # Critical compute (petaflop-days)
        self.alpha_N = 0.076  # Parameter scaling exponent
        self.alpha_D = 0.095  # Data scaling exponent
        self.alpha_C = 0.050  # Compute scaling exponent

    def loss_from_params(self, N):
        """Predict loss from number of parameters."""
        return (self.Nc / N) ** self.alpha_N

    def loss_from_data(self, D):
        """Predict loss from dataset size in tokens."""
        return (self.Dc / D) ** self.alpha_D

    def loss_from_compute(self, C):
        """Predict loss from compute budget in petaflop-days."""
        return (self.Cc / C) ** self.alpha_C

    def optimal_allocation(self, C):
        """
        Given a compute budget C in training FLOPs, find optimal N and D.

        Returns:
            (optimal_params, optimal_tokens)
        """
        # Training FLOPs: C ≈ 6ND. Kaplan et al. find the optimal model
        # size grows roughly as N_opt ∝ C^0.73; the remaining budget
        # goes to data.
        optimal_N = (C / 6) ** 0.73
        optimal_D = C / (6 * optimal_N)
        return optimal_N, optimal_D
# Test scaling laws
scaling = ScalingLaws()
# Predict loss for different model sizes
param_sizes = np.logspace(6, 12, 50) # 1M to 1T parameters
losses_params = [scaling.loss_from_params(N) for N in param_sizes]
# Predict loss for different dataset sizes
data_sizes = np.logspace(6, 12, 50) # 1M to 1T tokens
losses_data = [scaling.loss_from_data(D) for D in data_sizes]
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Parameter scaling
ax1.loglog(param_sizes, losses_params, 'b-', linewidth=2)
ax1.axvline(175e9, color='r', linestyle='--', label='GPT-3 (175B)')
ax1.axvline(1.5e9, color='g', linestyle='--', label='GPT-2 (1.5B)')
ax1.set_xlabel('Parameters (N)')
ax1.set_ylabel('Loss')
ax1.set_title('Scaling with Model Size')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Data scaling
ax2.loglog(data_sizes, losses_data, 'b-', linewidth=2)
ax2.axvline(300e9, color='r', linestyle='--', label='GPT-3 (300B)')
ax2.set_xlabel('Dataset Size (tokens)')
ax2.set_ylabel('Loss')
ax2.set_title('Scaling with Dataset Size')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Predict GPT-3 loss
gpt3_loss_params = scaling.loss_from_params(175e9)
gpt3_loss_data = scaling.loss_from_data(300e9)
print(f"\nGPT-3 predicted loss (from params): {gpt3_loss_params:.4f}")
print(f"GPT-3 predicted loss (from data): {gpt3_loss_data:.4f}")
Chinchilla Scaling Laws (2022)
DeepMind's Chinchilla paper showed that most models are undertrained.
Key Finding: Data Matters More Than We Thought
Previous approach: make models bigger and train on a fixed dataset.
Chinchilla insight: for a given compute budget, train a smaller model on more data.
class ChinchillaScaling:
    """Chinchilla (Hoffmann et al., 2022) compute-optimal scaling."""

    def optimal_model_size(self, compute_budget):
        """
        Compute-optimal model size for a FLOP budget.

        From the Chinchilla paper:
            N_opt ∝ C^a where a ≈ 0.50
            D_opt ∝ C^b where b ≈ 0.50
        """
        # N and D scale equally with compute. With D ≈ 20N tokens and
        # C ≈ 6ND = 120N² FLOPs, the optimal N is sqrt(C / 120).
        optimal_N = (compute_budget / 120) ** 0.5
        optimal_D = 20 * optimal_N  # ~20 tokens per parameter
        return optimal_N, optimal_D

    def compare_strategies(self, compute_budget):
        """Compare Chinchilla-optimal vs. parameter-focused strategies."""
        # Chinchilla-optimal
        N_opt, D_opt = self.optimal_model_size(compute_budget)
        # Old approach: maximize parameters on a fixed 300B-token dataset
        N_large = compute_budget / (6 * 300e9)
        D_large = 300e9
        return {
            'chinchilla': (N_opt, D_opt),
            'old_approach': (N_large, D_large),
        }
chinchilla = ChinchillaScaling()
# Same compute budget as GPT-3
gpt3_compute = estimate_training_compute(175e9, 300e9)
strategies = chinchilla.compare_strategies(gpt3_compute)
print("\nFor GPT-3's compute budget:")
print(f"\nOld approach (GPT-3):")
print(f" Parameters: {strategies['old_approach'][0]:,.0f}")
print(f" Tokens: {strategies['old_approach'][1]:,.0f}")
print(f"\nChinchilla-optimal (Chinchilla):")
print(f" Parameters: {strategies['chinchilla'][0]:,.0f}")
print(f" Tokens: {strategies['chinchilla'][1]:,.0f}")
print(f"\nChinchilla has {strategies['old_approach'][0] / strategies['chinchilla'][0]:.1f}x fewer parameters")
print(f"but {strategies['chinchilla'][1] / strategies['old_approach'][1]:.1f}x more training tokens")
Chinchilla's Impact:
Chinchilla (70B parameters, 1.4T tokens) was trained with roughly the same compute budget as DeepMind's Gopher (280B parameters, 300B tokens), yet outperforms both Gopher and the larger GPT-3. The lesson: train smaller models longer rather than larger models briefly.
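The compute-matching can be checked with the C ≈ 6ND approximation from earlier (the Gopher and Chinchilla parameter/token figures are from the Chinchilla paper):

```python
def training_flops(params, tokens):
    # C ≈ 6 * N * D (forward + backward passes)
    return 6 * params * tokens

gopher_flops = training_flops(280e9, 300e9)      # Gopher: 280B params, 300B tokens
chinchilla_flops = training_flops(70e9, 1.4e12)  # Chinchilla: 70B params, 1.4T tokens

print(f"Gopher:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs")
print(f"Chinchilla / Gopher: {chinchilla_flops / gopher_flops:.2f}x")
```

The two budgets agree to within about 20% under this approximation; the exact ratio depends on the FLOP accounting used.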
Practical Implications
1. Budgeting Your Training Run
def plan_training_run(compute_budget_dollars, cost_per_petaflop_day=1.0):
    """
    Plan an optimal training configuration for a given budget.

    Args:
        compute_budget_dollars: Budget in dollars
        cost_per_petaflop_day: Cost per petaflop-day (default $1)

    Returns:
        Training plan with N, D, and estimated performance
    """
    # Convert dollars to FLOPs
    petaflop_days = compute_budget_dollars / cost_per_petaflop_day
    total_flops = petaflop_days * (1e15 * 86400)  # petaflop-days to FLOPs

    # Chinchilla-optimal allocation
    chinchilla = ChinchillaScaling()
    N_opt, D_opt = chinchilla.optimal_model_size(total_flops)

    # Predict performance from the parameter scaling law
    scaling = ScalingLaws()
    predicted_loss = scaling.loss_from_params(N_opt)

    # Estimate wall-clock time (assuming a 300 petaflop/s cluster)
    training_time_seconds = total_flops / 300e15
    training_time_days = training_time_seconds / 86400

    return {
        'parameters': N_opt,
        'tokens': D_opt,
        'predicted_loss': predicted_loss,
        'training_days': training_time_days,
        'total_flops': total_flops,
    }
# Example: $100k budget
plan = plan_training_run(compute_budget_dollars=100_000)
print("\nTraining plan for $100k budget:")
print(f" Model size: {plan['parameters']:,.0f} parameters ({plan['parameters']/1e9:.1f}B)")
print(f" Training tokens: {plan['tokens']:,.0f} ({plan['tokens']/1e9:.1f}B)")
print(f" Predicted loss: {plan['predicted_loss']:.4f}")
print(f" Training time: {plan['training_days']:.1f} days")
print(f" Total compute: {plan['total_flops']:.2e} FLOPs")
2. The Bitter Lesson
def compare_approaches(compute_budget):
    """
    Compare: better algorithms vs. more compute.

    Shows why scaling often beats clever algorithms.
    """
    scaling_laws = ScalingLaws()

    # Baseline: small model plus an algorithmic improvement
    baseline_N = 1e9  # 1B parameters
    baseline_loss = scaling_laws.loss_from_params(baseline_N)

    # Algorithm improvement: assume 10% better loss (optimistic)
    algorithm_improved_loss = baseline_loss * 0.9

    # Scaling approach: spend the full compute budget
    chinchilla = ChinchillaScaling()
    scaled_N, scaled_D = chinchilla.optimal_model_size(compute_budget)
    scaled_loss = scaling_laws.loss_from_params(scaled_N)

    return {
        'algorithm': algorithm_improved_loss,
        'scaling': scaled_loss,
        'winner': 'scaling' if scaled_loss < algorithm_improved_loss else 'algorithm',
    }
# Compare for GPT-3 budget
gpt3_budget = estimate_training_compute(175e9, 300e9)
result = compare_approaches(gpt3_budget)
print("\nAlgorithm vs. Scaling:")
print(f" Improved algorithm (1B model): {result['algorithm']:.4f} loss")
print(f" Scaled model (Chinchilla-optimal): {result['scaling']:.4f} loss")
print(f" Winner: {result['winner']}")
print(f" Improvement: {(1 - result['scaling']/result['algorithm']) * 100:.1f}%")
The Bitter Lesson (Rich Sutton):
"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective."
Scaling laws formalize this: throwing compute at the problem often beats clever architectural innovations.
Modern Scaling Trends
1. Inference-Time Compute Scaling
Recent research (e.g., OpenAI's o1 model) shows that scaling compute at inference time also improves performance:
def inference_time_scaling(base_performance, compute_multiplier):
    """
    Toy model of performance scaling with inference-time compute.

    Recent findings: using more compute at inference (e.g., chain-of-thought,
    multiple samples, tree search) also follows scaling-law-like trends.

    Args:
        base_performance: Baseline accuracy
        compute_multiplier: Multiple of the base inference compute

    Returns:
        Improved performance (illustrative, not a fitted law)
    """
    # Approximate scaling: gains grow with log(compute). This is less
    # efficient than training-time scaling but still valuable.
    # At 1x compute, log10(1) = 0, so performance equals the baseline.
    return base_performance * (1 + 0.1 * np.log10(compute_multiplier))
# Example: using 10x more inference compute
base_acc = 0.80
compute_10x_acc = inference_time_scaling(base_acc, 10)
compute_100x_acc = inference_time_scaling(base_acc, 100)
print(f"\nInference-time scaling:")
print(f" Base (1x compute): {base_acc:.2%} accuracy")
print(f" 10x compute: {compute_10x_acc:.2%} accuracy")
print(f" 100x compute: {compute_100x_acc:.2%} accuracy")
2. Downstream Task Scaling
def downstream_task_scaling(pretraining_loss):
    """
    Predict downstream task performance from pretraining loss.

    Lower pretraining loss → better downstream performance. The
    relationship varies by task but is approximately linear in log(loss).
    """
    # Rough calibration for the MMLU benchmark
    # (Massive Multitask Language Understanding), clamped to
    # [0.25, 0.90]; 0.25 is chance accuracy on 4-way questions.
    mmlu_accuracy = 0.9 - 0.2 * np.log(pretraining_loss)
    return max(0.25, min(0.90, mmlu_accuracy))

# Predict downstream performance
losses = [3.0, 2.5, 2.0, 1.8, 1.6]
for loss in losses:
    acc = downstream_task_scaling(loss)
    print(f"Pretraining loss {loss:.2f} → MMLU accuracy {acc:.1%}")
Summary
Scaling laws reveal the predictable relationship between compute, model size, data, and performance:
Key Insights:
- Power law scaling: loss falls as a power law (Loss ∝ 1/Scale^α) with parameters, data, and compute
- Chinchilla optimal: For a given budget, balance model size and training data (20:1 token:parameter ratio)
- Compute is king: More compute reliably improves performance
- Data efficiency matters: Better to train smaller models longer
- Predictability: Can forecast model performance before training
Practical Takeaways:
- Don't just scale parameters - scale data proportionally
- Inference-time compute also scales (but less efficiently)
- Downstream task performance correlates with pretraining loss
- Plan training runs based on compute budget, not target model size
These laws guide development of modern LLMs from GPT-4 to Gemini to Claude.