LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
In February 2023, Meta AI released LLaMA, challenging the assumption that bigger is always better. The paper showed that smaller models trained on more data with careful optimization can match or exceed much larger models.
Motivation and Context
The Problem with Scale
Before LLaMA, the trend was clear: bigger models = better performance.
- GPT-3: 175B parameters
- Gopher: 280B parameters
- Megatron-Turing NLG: 530B parameters
- PaLM: 540B parameters
Issues:
- Inference cost: Large models expensive to run
- Accessibility: Only big labs can afford training
- Environmental impact: Massive energy consumption
- Deployment: Hard to deploy 100B+ models in production
LLaMA's Hypothesis:
Following the Chinchilla scaling laws, which showed that most large models were undertrained for their size, LLaMA bets that a smaller model trained on many more tokens can win at a fixed inference budget. A 13B model trained on 1T tokens might outperform a 175B model trained on 300B tokens, while being much cheaper to run.
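To make the inference-side claim concrete, here is a rough back-of-the-envelope comparison (my numbers, not from the paper), using the common approximation of ~2 FLOPs per parameter per generated token and ignoring attention/KV-cache costs:
# Rough per-token inference cost comparison (back-of-the-envelope, not from the paper).
def inference_flops_per_token(params):
    # ~2 FLOPs per parameter per generated token (ignores attention overhead)
    return 2 * params

llama_13b = 13e9
gpt3_175b = 175e9

ratio = inference_flops_per_token(gpt3_175b) / inference_flops_per_token(llama_13b)
print(f"LLaMA-13B per-token FLOPs : {inference_flops_per_token(llama_13b):.2e}")
print(f"GPT-3 175B per-token FLOPs: {inference_flops_per_token(gpt3_175b):.2e}")
print(f"GPT-3 is ~{ratio:.1f}x more expensive per generated token")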
Key Contributions
1. Model Sizes and Training Data
LLaMA was released in four sizes: 7B, 13B, 33B, and 65B parameters.
Training data:
- 1.0 trillion tokens for the 7B and 13B models; 1.4 trillion tokens for the 33B and 65B models
- Only publicly available data (no proprietary datasets)
- Diverse sources: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange
# LLaMA training data composition
data_sources = {
'CommonCrawl': 0.670, # 67.0% - web pages
'C4': 0.150, # 15.0% - cleaned CommonCrawl
'GitHub': 0.045, # 4.5% - code
'Wikipedia': 0.045, # 4.5% - encyclopedic
'Books': 0.045, # 4.5% - long-form
'ArXiv': 0.025, # 2.5% - scientific papers
'StackExchange': 0.020, # 2.0% - Q&A
}
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.pie(
data_sources.values(),
labels=[f"{k}\n({v*100:.1f}%)" for k, v in data_sources.items()],
autopct='',
startangle=90
)
plt.title('LLaMA Training Data Composition (1.4T tokens)')
plt.axis('equal')
plt.show()
total_tokens = 1.4e12
print("\nTraining data breakdown:")
for source, fraction in data_sources.items():
tokens = total_tokens * fraction
print(f" {source:15s}: {tokens/1e9:6.1f}B tokens ({fraction*100:4.1f}%)")
2. Architecture Improvements
LLaMA incorporated several modern improvements over the original transformer:
"""
LLaMA architectural choices:
1. Pre-normalization (GPT-3 style)
- Apply normalization BEFORE each sub-layer
- Better training stability
2. RMSNorm instead of LayerNorm
- Simpler, faster normalization
- No mean subtraction, no bias
3. SwiGLU activation (PaLM style)
- Replace ReLU in FFN
- Better performance
4. Rotary Embeddings (GPTNeo/GPT-J style)
- Relative positional encoding
- Better length generalization
5. Context length of 2048 tokens
   - Same as GPT-3 (up from 512/1024 in GPT-1/GPT-2)
   - Enables longer-form understanding
"""
# LLaMA model configurations
llama_configs = {
'7B': {
'dim': 4096,
'n_layers': 32,
'n_heads': 32,
'n_kv_heads': 32, # LLaMA 1 uses standard MHA
'ffn_dim': 11008, # ~2.7 × dim (for SwiGLU)
'vocab_size': 32000,
'context_len': 2048,
},
'13B': {
'dim': 5120,
'n_layers': 40,
'n_heads': 40,
'n_kv_heads': 40,
'ffn_dim': 13824,
'vocab_size': 32000,
'context_len': 2048,
},
'33B': {
'dim': 6656,
'n_layers': 60,
'n_heads': 52,
'n_kv_heads': 52,
'ffn_dim': 17920,
'vocab_size': 32000,
'context_len': 2048,
},
'65B': {
'dim': 8192,
'n_layers': 80,
'n_heads': 64,
'n_kv_heads': 64,
'ffn_dim': 22016,
'vocab_size': 32000,
'context_len': 2048,
},
}
# Display configurations
import pandas as pd
df = pd.DataFrame(llama_configs).T
print("\nLLaMA Model Configurations:")
print(df)
# Calculate actual parameter counts
def calculate_params(config):
"""Estimate parameter count."""
d = config['dim']
n = config['n_layers']
v = config['vocab_size']
ffn = config['ffn_dim']
# Embedding
embed_params = v * d
# Attention per layer: Q, K, V, O projections
attn_params_per_layer = 4 * (d * d)
# FFN per layer (SwiGLU has 3 matrices)
ffn_params_per_layer = 3 * (d * ffn)
# Layer norms (RMSNorm has only weight, no bias)
norm_params_per_layer = 2 * d # 2 norms per layer
# Total per layer
params_per_layer = attn_params_per_layer + ffn_params_per_layer + norm_params_per_layer
    # Output projection (lm_head): LLaMA does not tie input and output embeddings
    output_params = v * d
    # Total model (+d for the final RMSNorm)
    total = embed_params + (n * params_per_layer) + output_params + d
return total
print("\nParameter Counts:")
for name, config in llama_configs.items():
params = calculate_params(config)
print(f" LLaMA-{name}: {params / 1e9:.2f}B parameters")
Training Efficiency:
LLaMA-13B was trained on 1.0T tokens. Using the standard ~6 FLOPs per parameter per token estimate, that is roughly a quarter of the training compute of GPT-3 (175B parameters on 300B tokens), yet LLaMA-13B matches or exceeds GPT-3 on most benchmarks and is far cheaper to run at inference because of its smaller size.
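A quick back-of-the-envelope check of that compute claim, using the common ~6 FLOPs per parameter per training token approximation (my numbers, not the paper's):
# Back-of-the-envelope training compute comparison (6 * params * tokens approximation).
def training_flops(params, tokens):
    return 6 * params * tokens

llama_13b_flops = training_flops(13e9, 1.0e12)  # LLaMA-13B: 1.0T training tokens
gpt3_flops = training_flops(175e9, 300e9)       # GPT-3 175B: 300B training tokens

print(f"LLaMA-13B : {llama_13b_flops:.2e} FLOPs")
print(f"GPT-3 175B: {gpt3_flops:.2e} FLOPs")
print(f"GPT-3 used ~{gpt3_flops / llama_13b_flops:.1f}x more training compute")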
3. Training Details
"""
Training hyperparameters (from paper):
Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- eps = 10⁻⁸
Learning rate schedule:
- Warmup: 2000 steps to max LR
- Cosine decay to 10% of max LR
- Max LR varies by model size:
- 7B: 3 × 10⁻⁴
- 13B: 3 × 10⁻⁴
- 33B: 1.5 × 10⁻⁴
- 65B: 1.5 × 10⁻⁴
Weight decay: 0.1
Gradient clipping: 1.0
Batch size: 4M tokens
Context length: 2048 tokens
Training duration:
- 7B: ~1.0T tokens (~250k steps at 4M tokens per step)
- 13B: ~1T tokens
- 33B: ~1.4T tokens
- 65B: ~1.4T tokens
"""
import numpy as np
import matplotlib.pyplot as plt
def cosine_learning_rate_schedule(
step,
max_steps,
max_lr,
warmup_steps=2000,
min_lr_ratio=0.1
):
"""
Cosine learning rate schedule with warmup (used in LLaMA).
Args:
step: Current training step
max_steps: Total training steps
max_lr: Maximum learning rate
warmup_steps: Number of warmup steps
min_lr_ratio: Minimum LR as ratio of max_lr
Returns:
Current learning rate
"""
if step < warmup_steps:
# Linear warmup
return max_lr * (step / warmup_steps)
else:
# Cosine decay
progress = (step - warmup_steps) / (max_steps - warmup_steps)
cosine_decay = 0.5 * (1 + np.cos(np.pi * progress))
min_lr = max_lr * min_lr_ratio
return min_lr + (max_lr - min_lr) * cosine_decay
# Visualize LR schedule
max_steps = 250000  # LLaMA-7B: ~1.0T tokens / 4M tokens per step
max_lr = 3e-4
steps = np.arange(max_steps)
lrs = [cosine_learning_rate_schedule(s, max_steps, max_lr) for s in steps]
plt.figure(figsize=(12, 6))
plt.plot(steps, lrs, linewidth=2)
plt.xlabel('Training Steps')
plt.ylabel('Learning Rate')
plt.title('LLaMA Learning Rate Schedule (7B model)')
plt.grid(True, alpha=0.3)
plt.axvline(2000, color='r', linestyle='--', alpha=0.5, label='End of warmup')
plt.legend()
plt.show()
print(f"Peak learning rate: {max_lr}")
print(f"Warmup steps: 2000")
print(f"Final learning rate: {max_lr * 0.1}")
4. Training Infrastructure
Compute resources:
- 2048 A100 80GB GPUs
- Training time: ~21 days for 65B model
- Total compute: ~5.5 × 10²³ training FLOPs for 65B (using the 6 × params × tokens approximation)
def estimate_training_cost(params, tokens, flops_per_param_token=6):
"""
Estimate training compute and cost.
Args:
params: Model parameters
tokens: Training tokens
flops_per_param_token: FLOPs per parameter per token (typically 6)
Returns:
Dictionary with compute metrics
"""
# Total FLOPs
total_flops = params * tokens * flops_per_param_token
# Assuming A100 GPU: ~312 TFLOPS peak
# But effective utilization ~50% → ~150 TFLOPS
effective_tflops_per_gpu = 150e12
# GPU-hours needed
gpu_hours = total_flops / (effective_tflops_per_gpu * 3600)
# Cost estimate (AWS p4d.24xlarge: ~$32/hour for 8 A100s)
cost_per_gpu_hour = 32 / 8 # $4 per GPU-hour
total_cost = gpu_hours * cost_per_gpu_hour
return {
'total_flops': total_flops,
'gpu_hours': gpu_hours,
'estimated_cost': total_cost,
'days_on_single_gpu': gpu_hours / 24
}
# Estimate for each LLaMA model
training_tokens = 1.4e12 # 1.4T tokens
print("LLaMA Training Cost Estimates:\n")
for name, config in llama_configs.items():
params = calculate_params(config)
metrics = estimate_training_cost(params, training_tokens)
print(f"LLaMA-{name}:")
print(f" Total FLOPs: {metrics['total_flops']:.2e}")
print(f" GPU-hours: {metrics['gpu_hours']:,.0f}")
print(f" Estimated cost: ${metrics['estimated_cost']:,.0f}")
print(f" Days on 2048 GPUs: {metrics['gpu_hours'] / (24 * 2048):.1f}")
print()
Training Cost Reality:
The paper reports LLaMA-65B took 21 days on 2048 A100 GPUs. At $4/GPU-hour, that's approximately $4.1M in compute costs. This is why open-sourcing the weights was valuable - the community can use the models without repeating this expense.
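That figure follows directly from the reported hardware; a quick sanity check (the $4/GPU-hour rate is the same assumption used above):
# Sanity check of the ~$4.1M figure: 21 days x 24 h x 2048 GPUs x $4/GPU-hour (assumed rate).
gpu_hours_65b = 21 * 24 * 2048
print(f"GPU-hours: {gpu_hours_65b:,}")                 # ~1.03M GPU-hours
print(f"Estimated cost: ${gpu_hours_65b * 4:,.0f}")    # ~$4.1M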
Performance Results
Zero-Shot and Few-Shot Benchmarks
LLaMA compared favorably to much larger models:
# Approximate results from the paper
# (Numbers are illustrative based on paper figures)
benchmark_results = {
'Model': ['GPT-3 175B', 'Gopher 280B', 'Chinchilla 70B', 'LLaMA-65B', 'LLaMA-13B'],
'Params (B)': [175, 280, 70, 65, 13],
'MMLU (5-shot)': [43.9, 60.0, 67.5, 63.4, 46.9],
'HellaSwag (0-shot)': [78.9, 79.2, 80.8, 84.2, 79.2],
'TriviaQA (1-shot)': [64.3, 73.0, 72.3, 68.9, 61.3],
'NaturalQuestions (1-shot)': [29.9, 31.8, 31.5, 33.0, 26.0],
}
import pandas as pd
df_results = pd.DataFrame(benchmark_results).set_index('Model')
print("LLaMA Benchmark Performance:\n")
print(df_results.to_string())
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
benchmarks = ['MMLU (5-shot)', 'HellaSwag (0-shot)', 'TriviaQA (1-shot)', 'NaturalQuestions (1-shot)']
for idx, (ax, bench) in enumerate(zip(axes.flat, benchmarks)):
models = df_results.index
scores = df_results[bench]
colors = ['gray', 'gray', 'lightblue', 'blue', 'darkblue']
ax.barh(models, scores, color=colors)
ax.set_xlabel('Score')
ax.set_title(bench)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
Key observations:
- LLaMA-13B matches or exceeds GPT-3 175B on many tasks (13x smaller!)
- LLaMA-65B competitive with Chinchilla-70B and Gopher-280B
- Smaller models trained longer can compete with much larger models
Efficiency Wins:
LLaMA-13B achieves similar or better performance than GPT-3 while being:
- 13x smaller (13B vs 175B parameters)
- Trained on more than 3x the tokens (1.0T vs 300B), yet with roughly 4x less total training compute
- Much cheaper to run at inference
This supports the Chinchilla finding that earlier large models were undertrained, and goes further by training past the compute-optimal point to buy performance at a fixed inference budget.
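The Chinchilla paper's rule of thumb is roughly 20 training tokens per parameter for compute-optimal training. The sketch below (my arithmetic, not from the paper) shows how far past that point LLaMA trains its smaller models:
# Tokens per parameter: Chinchilla's compute-optimal rule of thumb is ~20 tokens/param.
# LLaMA intentionally overshoots it to get better quality at a fixed inference budget.
token_budgets = {
    'Chinchilla 70B': (70e9, 1.4e12),
    'GPT-3 175B':     (175e9, 300e9),
    'LLaMA-13B':      (13e9, 1.0e12),
    'LLaMA-65B':      (65e9, 1.4e12),
}
for name, (params, tokens) in token_budgets.items():
    print(f"{name:15s}: {tokens / params:6.1f} tokens per parameter")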
Code Generation
# HumanEval benchmark (coding)
# (Approximate numbers based on the paper's reported results)
coding_results = {
    'Model': ['GPT-3 175B', 'PaLM 62B', 'PaLM 540B', 'LLaMA-65B', 'LLaMA-13B'],
    'Params (B)': [175, 62, 540, 65, 13],
    'HumanEval Pass@1': [0.0, 15.9, 26.2, 23.7, 15.8],
    'HumanEval Pass@100': [0.0, 46.3, 76.2, 79.3, 52.5],
}
df_coding = pd.DataFrame(coding_results)
print("\nCode Generation Performance (HumanEval):\n")
print(df_coding.to_string(index=False))
LLaMA shows strong coding ability despite no code-specific tuning (just 4.5% GitHub in training data).
Impact and Legacy
Open Source Revolution
LLaMA's release (initially restricted to researchers, with the weights soon leaked publicly) sparked:
- Alpaca: Stanford instruction-tuned LLaMA-7B ($600 training cost)
- Vicuna: Chatbot fine-tune competitive with GPT-3.5
- WizardLM, Orca, etc.: Many derivatives
- LLaMA 2: Official follow-up with commercial license
# LLaMA derivative timeline (2023)
derivatives = {
    'Feb 24': 'LLaMA announced; weights released to researchers',
    'Mar 3': 'LLaMA weights leaked publicly',
    'Mar 13': 'Alpaca (Stanford)',
    'Mar 30': 'Vicuna (LMSYS)',
    'Apr 3': 'Koala (Berkeley)',
    'Apr 24': 'WizardLM',
    'Jun 5': 'Orca (Microsoft Research)',
    'Jul 18': 'LLaMA 2 official release',
}
print("LLaMA's Impact on Open Source LLMs:\n")
for date, event in derivatives.items():
print(f"{date}: {event}")
Scientific Contributions
1. Validated Chinchilla scaling: Empirically confirmed that smaller models + more data > larger models + less data
2. Architectural best practices: Showed combination of RMSNorm, SwiGLU, and RoPE works well
3. Training efficiency: Demonstrated efficient training with careful hyperparameter tuning
4. Open science: Made state-of-the-art models accessible for research
LLaMA's Lesson:
"Bigger is not always better." With the right data, training recipe, and architecture, a 13B model can match a 175B model while being far more practical to deploy.
This shifted focus from "how big can we go?" to "how efficiently can we train?"
LLaMA 2 Improvements
LLaMA 2 (July 2023) built on the original:
llama2_improvements = {
'Context Length': '2048 → 4096 tokens',
'Training Data': '1.4T → 2.0T tokens',
'Attention': 'MHA → Grouped-Query Attention (GQA)',
'License': 'Research only → Commercial use allowed',
'Variants': 'Base models → Base + Chat models',
'Safety': 'None → Extensive safety training',
}
print("LLaMA 2 Improvements:\n")
for aspect, change in llama2_improvements.items():
print(f" {aspect:15s}: {change}")
# LLaMA 2 configurations (notable change: GQA)
llama2_configs = {
'7B': {'n_heads': 32, 'n_kv_heads': 32}, # No GQA
'13B': {'n_heads': 40, 'n_kv_heads': 40}, # No GQA
'70B': {'n_heads': 64, 'n_kv_heads': 8}, # 8:1 GQA ratio
}
print("\nLLaMA 2 Grouped-Query Attention:")
for size, config in llama2_configs.items():
ratio = config['n_heads'] / config['n_kv_heads']
print(f" {size}: {config['n_heads']} query heads, {config['n_kv_heads']} KV heads (ratio: {ratio:.1f}:1)")
Summary
LLaMA's Main Contributions:
- Efficiency over size: 13B model matching 175B model performance
- Data matters: Training on 1.4T tokens (not just 300B)
- Architectural improvements: RMSNorm, SwiGLU, RoPE combination
- Open science: Released weights enabled research community
- Validated scaling laws: Chinchilla-style training works
Impact:
- Sparked open-source LLM revolution
- Shifted focus to training efficiency
- Made powerful models accessible
- Inspired LLaMA 2, Mistral, and many others
Quote from the paper:
"We focus on training models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used."
This philosophy, optimizing for inference cost rather than for training-compute optimality alone, has become the dominant paradigm in LLM development.