
Model Scaling Laws: The Science of Size

Understand how model performance scales with compute, parameters, and data. Learn the principles that guide modern LLM development from GPT-3 to GPT-4.

20 min read · Scaling Laws · Training · Compute · Parameters


Why is GPT-4 better than GPT-3? Why do companies spend millions training ever-larger models? The answer lies in scaling laws - mathematical relationships that predict how model performance improves with scale.

The Scaling Law Revolution

In 2020, OpenAI published "Scaling Laws for Neural Language Models" (Kaplan et al.), groundbreaking research showing that language model performance follows predictable power laws. This shifted LLM development from art toward science.

The Core Insight

Model performance (measured by loss) improves predictably as you scale three factors:

  1. Model size (N): Number of parameters
  2. Dataset size (D): Number of training tokens
  3. Compute (C): Total FLOPs used for training

Power Law Behavior:

Performance follows a power law:

Loss ∝ 1 / Scale^α

where α depends on which axis you're scaling. This means doubling your compute (or parameters, or data) buys a predictable improvement, not a random one; the sketch below makes this concrete.
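
As a quick illustration, here is a minimal sketch of what a power law implies in practice. The exponent used below (α ≈ 0.05, roughly the compute exponent from the OpenAI fits discussed later) is illustrative only:

python
# Minimal sketch: how much does loss drop when scale is multiplied by k
# under Loss ∝ Scale^(-alpha)? The exponent here is illustrative.
alpha = 0.05

for factor in (2, 10, 100):
    loss_ratio = factor ** (-alpha)  # new loss as a fraction of the old loss
    print(f"{factor:>4}x scale -> loss falls to {loss_ratio:.3f} of its "
          f"previous value ({(1 - loss_ratio) * 100:.1f}% lower)")

The per-doubling gain looks small, but it compounds across the many orders of magnitude separating, say, GPT-2 from GPT-3.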

Understanding the Three Axes

1. Model Size (Parameters N)

python
def count_transformer_parameters(num_layers, d_model, num_heads, d_ff, vocab_size):
    """
    Calculate parameters in a transformer model.

    Args:
        num_layers: Number of transformer layers
        d_model: Model dimension
        num_heads: Number of attention heads
        d_ff: Feed-forward hidden dimension
        vocab_size: Size of token vocabulary
    """
    # Embedding layer
    embedding_params = vocab_size * d_model

    # Single transformer layer:
    # - Multi-head attention: 4 * (d_model * d_model) for Q, K, V, O projections
    # - Layer norm 1: 2 * d_model (gamma and beta)
    # - Feed-forward: 2 * (d_model * d_ff) + d_ff + d_model (weights and biases)
    # - Layer norm 2: 2 * d_model

    attn_params = 4 * (d_model ** 2)
    norm1_params = 2 * d_model
    ff_params = (d_model * d_ff) + d_ff + (d_ff * d_model) + d_model
    norm2_params = 2 * d_model

    layer_params = attn_params + norm1_params + ff_params + norm2_params
    transformer_params = num_layers * layer_params

    # Output layer (usually shares weights with embedding)
    output_params = 0  # Shared with embedding

    total_params = embedding_params + transformer_params + output_params

    return total_params


# GPT-2 Small (124M parameters)
gpt2_small = count_transformer_parameters(
    num_layers=12,
    d_model=768,
    num_heads=12,
    d_ff=3072,
    vocab_size=50257
)
print(f"GPT-2 Small: {gpt2_small:,} parameters")

# GPT-2 Large (774M parameters)
gpt2_large = count_transformer_parameters(
    num_layers=36,
    d_model=1280,
    num_heads=20,
    d_ff=5120,
    vocab_size=50257
)
print(f"GPT-2 Large: {gpt2_large:,} parameters")

# GPT-3 (175B parameters - approximate)
gpt3 = count_transformer_parameters(
    num_layers=96,
    d_model=12288,
    num_heads=96,
    d_ff=49152,
    vocab_size=50257
)
print(f"GPT-3: {gpt3:,} parameters")
print(f"GPT-3 / GPT-2 Small ratio: {gpt3 / gpt2_small:.1f}x")

2. Dataset Size (Tokens D)

python
def estimate_training_tokens(parameters, compute_optimal=True):
    """
    Estimate optimal training tokens based on model size.

    Chinchilla scaling laws suggest tokens ≈ 20 × parameters
    for compute-optimal training.

    Args:
        parameters: Number of model parameters
        compute_optimal: If True, use Chinchilla ratio (20:1)
                        If False, use common practice (300B tokens)
    """
    if compute_optimal:
        # Chinchilla: tokens ≈ 20 × parameters
        optimal_tokens = 20 * parameters
        return optimal_tokens
    else:
        # Common practice: fixed dataset (e.g., 300B tokens)
        return 300_000_000_000


# GPT-3 175B
gpt3_params = 175_000_000_000

chinchilla_tokens = estimate_training_tokens(gpt3_params, compute_optimal=True)
actual_tokens = 300_000_000_000  # GPT-3 was trained on 300B tokens

print(f"GPT-3 parameters: {gpt3_params:,}")
print(f"Actual training tokens: {actual_tokens:,}")
print(f"Chinchilla optimal tokens: {chinchilla_tokens:,}")
print(f"GPT-3 was {'under' if actual_tokens < chinchilla_tokens else 'over'}trained")

3. Compute Budget (FLOPs C)

python
def estimate_training_compute(parameters, tokens, forward_pass_only=False):
    """
    Estimate training compute in FLOPs.

    Args:
        parameters: Number of model parameters (N)
        tokens: Number of training tokens (D)
        forward_pass_only: If False, includes backward pass (~2x forward)

    Returns:
        Total FLOPs
    """
    # Approximate FLOPs per token ≈ 6N (forward + backward)
    # Forward pass: ~2N FLOPs
    # Backward pass: ~4N FLOPs (2x forward for gradients)

    if forward_pass_only:
        flops_per_token = 2 * parameters
    else:
        flops_per_token = 6 * parameters

    total_flops = flops_per_token * tokens

    return total_flops


# GPT-3 training compute
gpt3_compute = estimate_training_compute(
    parameters=175_000_000_000,
    tokens=300_000_000_000
)

print(f"GPT-3 training compute: {gpt3_compute:.2e} FLOPs")
print(f"GPT-3 training compute: ~{gpt3_compute / 1e23:.1f} × 10^23 FLOPs")

# Compare compute budgets
models = {
    "GPT-2": (1.5e9, 10e9),      # 1.5B params, 10B tokens
    "GPT-3": (175e9, 300e9),     # 175B params, 300B tokens
    "Chinchilla": (70e9, 1.4e12), # 70B params, 1.4T tokens
}

for name, (params, tokens) in models.items():
    compute = estimate_training_compute(params, tokens)
    print(f"{name:12s}: {compute:.2e} FLOPs ({params/1e9:.0f}B params, {tokens/1e9:.0f}B tokens)")

Compute Costs:

Training GPT-3 cost an estimated $4.6M in compute. GPT-4 likely cost tens of millions. The compute budget is often the limiting factor in modern LLM development.
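
As a rough sanity check on figures like these, here is a back-of-envelope sketch. The GPU throughput, utilization, and hourly price are assumptions picked to be loosely representative of 2020-era V100 cloud pricing, not reported values:

python
# Back-of-envelope training cost estimate (all hardware numbers are assumptions).
gpt3_flops = 6 * 175e9 * 300e9        # C ≈ 6ND ≈ 3.15e23 FLOPs

peak_flops_per_gpu = 125e12           # assumed V100 tensor-core peak, FLOP/s
utilization = 0.30                    # assumed fraction of peak actually achieved
dollars_per_gpu_hour = 1.50           # assumed cloud price

gpu_hours = gpt3_flops / (peak_flops_per_gpu * utilization) / 3600
cost = gpu_hours * dollars_per_gpu_hour

print(f"GPU-hours: {gpu_hours:,.0f}")
print(f"Estimated cost: ${cost / 1e6:.1f}M")

Under these assumptions the estimate lands in the same ballpark as the widely cited $4.6M figure; the real number depends heavily on hardware generation, utilization, and how many runs fail along the way.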

The Scaling Laws

OpenAI Scaling Laws (2020)

Performance (measured by cross-entropy loss L) scales as:

1. With Model Size (N):

L(N) ≈ (Nc / N)^αN

where:
- N = number of parameters
- Nc ≈ 8.8 × 10^13 (critical parameter count)
- αN ≈ 0.076 (scaling exponent)

2. With Dataset Size (D):

L(D) ≈ (Dc / D)^αD

where:
- D = number of tokens
- Dc ≈ 5.4 × 10^13 (critical dataset size)
- αD ≈ 0.095

3. With Compute (C):

L(C) ≈ (Cc / C)^αC

where:
- C = total training compute, measured in PF-days (petaflop/s-days)
- Cc ≈ 3.1 × 10^8 PF-days (critical compute)
- αC ≈ 0.050
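
As a minimal worked example (illustrative numbers only), here is GPT-3's approximate training budget plugged into this law, with the FLOPs-to-PF-days conversion made explicit:

python
# Worked example: compute scaling law, with explicit unit conversion.
FLOPS_PER_PF_DAY = 1e15 * 86400            # one petaflop/s-day ≈ 8.64e19 FLOPs

gpt3_flops = 6 * 175e9 * 300e9             # C ≈ 6ND ≈ 3.15e23 FLOPs
gpt3_pf_days = gpt3_flops / FLOPS_PER_PF_DAY

Cc, alpha_C = 3.1e8, 0.050                 # fitted constants (Cc in PF-days)
predicted_loss = (Cc / gpt3_pf_days) ** alpha_C

print(f"GPT-3 compute: {gpt3_pf_days:,.0f} PF-days")
print(f"Predicted loss: {predicted_loss:.2f}")

Strictly speaking this law describes compute-optimally allocated runs, so treat the number as a rough lower bound rather than a forecast of any specific model's loss.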

Implementation: Predicting Loss

python
import numpy as np
import matplotlib.pyplot as plt

class ScalingLaws:
    """OpenAI scaling laws implementation."""

    def __init__(self):
        # Fitted constants from Kaplan et al. (2020)
        self.Nc = 8.8e13  # Critical parameter count
        self.Dc = 5.4e13  # Critical dataset size (tokens)
        self.Cc = 3.1e8   # Critical compute, in PF-days (not raw FLOPs)

        self.alpha_N = 0.076  # Parameter scaling exponent
        self.alpha_D = 0.095  # Data scaling exponent
        self.alpha_C = 0.050  # Compute scaling exponent

    def loss_from_params(self, N):
        """Predict loss from number of parameters."""
        return (self.Nc / N) ** self.alpha_N

    def loss_from_data(self, D):
        """Predict loss from dataset size."""
        return (self.Dc / D) ** self.alpha_D

    def loss_from_compute(self, C):
        """Predict loss from compute budget C, given in PF-days."""
        return (self.Cc / C) ** self.alpha_C

    def optimal_allocation(self, C):
        """
        Given a training compute budget C (in FLOPs), suggest N and D.

        Kaplan et al. find the compute-optimal model size grows roughly as
        N_opt ∝ C^0.73, with the dataset taking up the remainder via
        C ≈ 6ND. The proportionality constant below is a rough calibration,
        not a value from the paper.
        """
        # Convert FLOPs to PF-days, the units used for the fit
        pf_days = C / (1e15 * 86400)

        # Rough calibration: chosen so that a GPT-3-scale budget
        # (~3.6e3 PF-days) maps to a model of roughly GPT-3's size
        optimal_N = 4.4e8 * pf_days ** 0.73
        optimal_D = C / (6 * optimal_N)

        return optimal_N, optimal_D


# Test scaling laws
scaling = ScalingLaws()

# Predict loss for different model sizes
param_sizes = np.logspace(6, 12, 50)  # 1M to 1T parameters
losses_params = [scaling.loss_from_params(N) for N in param_sizes]

# Predict loss for different dataset sizes
data_sizes = np.logspace(6, 12, 50)  # 1M to 1T tokens
losses_data = [scaling.loss_from_data(D) for D in data_sizes]

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Parameter scaling
ax1.loglog(param_sizes, losses_params, 'b-', linewidth=2)
ax1.axvline(175e9, color='r', linestyle='--', label='GPT-3 (175B)')
ax1.axvline(1.5e9, color='g', linestyle='--', label='GPT-2 (1.5B)')
ax1.set_xlabel('Parameters (N)')
ax1.set_ylabel('Loss')
ax1.set_title('Scaling with Model Size')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Data scaling
ax2.loglog(data_sizes, losses_data, 'b-', linewidth=2)
ax2.axvline(300e9, color='r', linestyle='--', label='GPT-3 (300B)')
ax2.set_xlabel('Dataset Size (tokens)')
ax2.set_ylabel('Loss')
ax2.set_title('Scaling with Dataset Size')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Predict GPT-3 loss
gpt3_loss_params = scaling.loss_from_params(175e9)
gpt3_loss_data = scaling.loss_from_data(300e9)
print(f"\nGPT-3 predicted loss (from params): {gpt3_loss_params:.4f}")
print(f"GPT-3 predicted loss (from data): {gpt3_loss_data:.4f}")

Chinchilla Scaling Laws (2022)

DeepMind's 2022 Chinchilla paper (Hoffmann et al.) showed that most large language models of the time were significantly undertrained.

Key Finding: Data Matters More Than We Thought

Previous approach: make models bigger and train them on a roughly fixed dataset.

Chinchilla insight: for a given compute budget, train a smaller model on more data.

python
class ChinchillaScaling:
    """Chinchilla optimal scaling laws."""

    def optimal_model_size(self, compute_budget):
        """
        Compute-optimal model size given a budget in FLOPs.

        From the Chinchilla paper:
        N_opt ∝ C^a where a ≈ 0.50
        D_opt ∝ C^b where b ≈ 0.50
        """
        # With C ≈ 6 * N * D and the Chinchilla ratio D ≈ 20 * N,
        # C ≈ 120 * N^2, so N_opt ≈ sqrt(C / 120).
        optimal_N = (compute_budget / 120) ** 0.5
        optimal_D = 20 * optimal_N  # 20 tokens per parameter

        return optimal_N, optimal_D

    def compare_strategies(self, compute_budget):
        """Compare Chinchilla-optimal vs parameter-focused strategies."""

        # Chinchilla-optimal
        N_opt, D_opt = self.optimal_model_size(compute_budget)

        # Old approach: maximize parameters
        N_large = compute_budget / (6 * 300e9)  # Fixed 300B tokens
        D_large = 300e9

        return {
            'chinchilla': (N_opt, D_opt),
            'old_approach': (N_large, D_large)
        }


chinchilla = ChinchillaScaling()

# Same compute budget as GPT-3
gpt3_compute = estimate_training_compute(175e9, 300e9)

strategies = chinchilla.compare_strategies(gpt3_compute)

print("\nFor GPT-3's compute budget:")
print(f"\nOld approach (GPT-3):")
print(f"  Parameters: {strategies['old_approach'][0]:,.0f}")
print(f"  Tokens: {strategies['old_approach'][1]:,.0f}")

print(f"\nChinchilla-optimal (Chinchilla):")
print(f"  Parameters: {strategies['chinchilla'][0]:,.0f}")
print(f"  Tokens: {strategies['chinchilla'][1]:,.0f}")

print(f"\nChinchilla has {strategies['old_approach'][0] / strategies['chinchilla'][0]:.1f}x fewer parameters")
print(f"but {strategies['chinchilla'][1] / strategies['old_approach'][1]:.1f}x more training tokens")

Chinchilla's Impact:

Chinchilla (70B parameters, 1.4T tokens) was trained with roughly the same compute as DeepMind's Gopher (280B parameters, 300B tokens) and outperforms it, while also beating the larger GPT-3 (175B parameters, 300B tokens). The lesson: train smaller models longer rather than training larger models briefly. A quick check of the compute budgets under the C ≈ 6ND approximation is sketched below.
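
Using the C ≈ 6ND approximation from earlier (parameter and token counts are the publicly reported round numbers), the budgets compare as follows:

python
# Rough compute comparison under C ≈ 6ND (public round-number params/tokens).
model_budgets = {
    "GPT-3":      (175e9, 300e9),
    "Gopher":     (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (params, tokens) in model_budgets.items():
    flops = 6 * params * tokens
    print(f"{name:11s}: {flops:.2e} FLOPs")

Gopher and Chinchilla land within roughly 20% of each other, which is why the Chinchilla paper treats them as compute-matched; GPT-3's budget is somewhat smaller.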

Practical Implications

1. Budgeting Your Training Run

python
def plan_training_run(compute_budget_dollars, cost_per_petaflop=1.0):
    """
    Plan optimal training configuration given budget.

    Args:
        compute_budget_dollars: Budget in dollars
        cost_per_petaflop: Assumed cost per petaflop/s-day of compute
                           (the $1 default is purely illustrative)

    Returns:
        Training plan with N, D, estimated performance
    """
    # Convert dollars to FLOPs
    petaflop_days = compute_budget_dollars / cost_per_petaflop
    total_flops = petaflop_days * (1e15 * 86400)  # petaflop/s-days to FLOPs

    # Chinchilla-optimal allocation
    chinchilla = ChinchillaScaling()
    N_opt, D_opt = chinchilla.optimal_model_size(total_flops)

    # Predict performance
    scaling = ScalingLaws()
    predicted_loss = scaling.loss_from_params(N_opt)

    # Estimate training time (assuming 300 petaflop/s GPU cluster)
    training_time_seconds = total_flops / (300e15)
    training_time_days = training_time_seconds / 86400

    return {
        'parameters': N_opt,
        'tokens': D_opt,
        'predicted_loss': predicted_loss,
        'training_days': training_time_days,
        'total_flops': total_flops
    }


# Example: $100k budget
plan = plan_training_run(compute_budget_dollars=100_000)

print("\nTraining plan for $100k budget:")
print(f"  Model size: {plan['parameters']:,.0f} parameters ({plan['parameters']/1e9:.1f}B)")
print(f"  Training tokens: {plan['tokens']:,.0f} ({plan['tokens']/1e9:.1f}B)")
print(f"  Predicted loss: {plan['predicted_loss']:.4f}")
print(f"  Training time: {plan['training_days']:.1f} days")
print(f"  Total compute: {plan['total_flops']:.2e} FLOPs")

2. The Bitter Lesson

python
def compare_approaches(compute_budget):
    """
    Compare: better algorithms vs. more compute.

    Shows why scaling often beats clever algorithms.
    """
    # Baseline: small model with algorithm improvement
    baseline_N = 1e9  # 1B parameters
    baseline_D = 100e9  # 100B tokens
    baseline_compute = estimate_training_compute(baseline_N, baseline_D)

    scaling_laws = ScalingLaws()
    baseline_loss = scaling_laws.loss_from_params(baseline_N)

    # Algorithm improvement: assume 10% better loss (optimistic)
    algorithm_improved_loss = baseline_loss * 0.9

    # Scaling approach: use full compute budget
    chinchilla = ChinchillaScaling()
    scaled_N, scaled_D = chinchilla.optimal_model_size(compute_budget)
    scaled_loss = scaling_laws.loss_from_params(scaled_N)

    return {
        'algorithm': algorithm_improved_loss,
        'scaling': scaled_loss,
        'winner': 'scaling' if scaled_loss < algorithm_improved_loss else 'algorithm'
    }


# Compare for GPT-3 budget
gpt3_budget = estimate_training_compute(175e9, 300e9)
result = compare_approaches(gpt3_budget)

print("\nAlgorithm vs. Scaling:")
print(f"  Improved algorithm (1B model): {result['algorithm']:.4f} loss")
print(f"  Scaled model (Chinchilla-optimal): {result['scaling']:.4f} loss")
print(f"  Winner: {result['winner']}")
print(f"  Improvement: {(1 - result['scaling']/result['algorithm']) * 100:.1f}%")

The Bitter Lesson (Rich Sutton):

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective."

Scaling laws formalize this: throwing compute at the problem often beats clever architectural innovations.

3. Inference-Time Compute Scaling

Recent research (e.g., OpenAI's o1 model) shows that scaling compute at inference time also improves performance in a predictable way:

python
def inference_time_scaling(base_performance, compute_multiplier):
    """
    Model performance scaling with inference-time compute.

    Recent findings: using more compute at inference (e.g., chain-of-thought,
    multiple samples, tree search) also follows scaling laws.

    Args:
        base_performance: Baseline accuracy
        compute_multiplier: Multiple of base inference compute

    Returns:
        Improved performance
    """
    # Toy model: performance improves roughly with the log of the extra
    # compute. This is less efficient than training-time scaling but still
    # valuable. Capped at 100% accuracy.
    improvement_factor = np.log10(max(compute_multiplier, 1))
    return min(1.0, base_performance * (1 + 0.1 * improvement_factor))


# Example: using 10x more inference compute
base_acc = 0.80
compute_10x_acc = inference_time_scaling(base_acc, 10)
compute_100x_acc = inference_time_scaling(base_acc, 100)

print(f"\nInference-time scaling:")
print(f"  Base (1x compute): {base_acc:.2%} accuracy")
print(f"  10x compute: {compute_10x_acc:.2%} accuracy")
print(f"  100x compute: {compute_100x_acc:.2%} accuracy")

4. Downstream Task Scaling

python
def downstream_task_scaling(pretraining_loss):
    """
    Predict downstream task performance from pretraining loss.

    Lower pretraining loss → better downstream performance.
    Relationship is approximately linear on log scale.
    """
    # Toy relationship: accuracy improves roughly linearly in -log(loss).
    # This varies by task but holds qualitatively.

    # Illustrative calibration (not a fitted result) for MMLU
    # (Massive Multitask Language Understanding), clamped between
    # 4-way chance (25%) and a 90% ceiling.
    mmlu_accuracy = max(0.25, min(0.90, 0.9 - 0.2 * np.log(pretraining_loss)))

    return mmlu_accuracy


# Predict downstream performance
losses = [3.0, 2.5, 2.0, 1.8, 1.6]
for loss in losses:
    acc = downstream_task_scaling(loss)
    print(f"Pretraining loss {loss:.2f} → MMLU accuracy {acc:.1%}")

Summary

Scaling laws reveal the predictable relationship between compute, model size, data, and performance:

Key Insights:

  1. Power law scaling: Loss falls as a power law in parameters, data, and compute (Loss ∝ Scale^-α)
  2. Chinchilla optimal: For a given budget, balance model size and training data (20:1 token:parameter ratio)
  3. Compute is king: More compute reliably improves performance
  4. Data efficiency matters: Better to train smaller models longer
  5. Predictability: Can forecast model performance before training

Practical Takeaways:

  • Don't just scale parameters - scale data proportionally
  • Inference-time compute also scales (but less efficiently)
  • Downstream task performance correlates with pretraining loss
  • Plan training runs based on compute budget, not target model size

These laws guide the development of modern LLMs, from GPT-4 to Gemini to Claude.