Distributed Training Systems
Large language models (LLMs) do not train on a single GPU; they train across clusters of accelerators.
The engineering challenge is to keep thousands of accelerators busy while moving data, gradients, activations, optimizer states, and checkpoints reliably.
Why one GPU is not enough
During training, device memory must hold:
- model weights
- gradients
- optimizer states
- activations
- attention intermediates (for example, key/value projections and attention scores)
- batches of tokenized data
For very large models, the total exceeds the memory of a single device, so training must be distributed.
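To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with an Adam-style optimizer (BF16 weights and gradients, plus FP32 master weights and two FP32 optimizer moments), which works out to 16 bytes per parameter. The byte counts, the function name `model_state_gib`, and the 80 GiB device capacity are illustrative assumptions, and activations and data batches are left out entirely.

```python
# Assumed per-parameter storage for mixed-precision Adam-style training.
# These numbers are an illustration, not a measurement of any specific stack.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}

ASSUMED_DEVICE_GIB = 80  # hypothetical accelerator memory capacity


def model_state_gib(num_params):
    """Return GiB needed for weights, gradients, and optimizer states only."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 2**30


def fits_on_one_device(num_params, device_gib=ASSUMED_DEVICE_GIB):
    """True if the model states alone (no activations) fit on one device."""
    return model_state_gib(num_params) <= device_gib


if __name__ == "__main__":
    for n in (1_000_000_000, 7_000_000_000, 70_000_000_000):
        print(n, round(model_state_gib(n), 1), fits_on_one_device(n))
```

Under these assumptions a 7-billion-parameter model needs roughly 104 GiB for model states alone, before any activations are stored.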
Types of parallelism
| Technique | What is split |
|---|---|
| Data parallelism | different examples across devices |
| Tensor parallelism | matrix operations across devices |
| Pipeline parallelism | layers across devices |
| Sequence parallelism | long sequences across devices |
| Expert parallelism | MoE experts across devices |
| ZeRO-style sharding | optimizer states, gradients, and parameters |
Real systems combine several of these.
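One way to picture the combination is as a grid: each device gets a coordinate along a data-parallel, a pipeline-parallel, and a tensor-parallel axis. The sketch below is a hypothetical layout, not the scheme of any particular framework. The function name `rank_to_coords` and the choice to make tensor-parallel ranks adjacent (so they tend to share a node and its faster interconnect) are assumptions made for illustration.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_idx, pp_idx, tp_idx).

    Assumed layout: tensor-parallel index varies fastest, then pipeline
    stage, then data-parallel replica. Real frameworks may order these
    axes differently.
    """
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def tensor_parallel_group(rank, tp):
    """Return the global ranks that share one tensor-parallel group."""
    start = (rank // tp) * tp
    return list(range(start, start + tp))


if __name__ == "__main__":
    # 8 devices split as 2-way tensor x 2-way pipeline x 2-way data parallel.
    for r in range(8):
        print(r, rank_to_coords(r, tp=2, pp=2, dp=2))
```

With this layout, ranks in the same tensor-parallel group differ only in `tp_idx`, so the heaviest per-layer communication stays between neighbouring ranks.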
Mixed precision
Training usually uses lower precision math where safe:
- BF16 or FP16 for speed and memory
- FP32 master weights or accumulators where needed
- loss scaling to keep small FP16 gradients from underflowing (BF16 usually does not need it)
- FP8 in newer training/inference stacks where supported
Lower precision saves memory and bandwidth, but it can make training unstable if used carelessly.
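The underflow problem that loss scaling addresses can be shown with the standard library alone, since `struct` supports the IEEE half-precision format (`"e"`). This is a minimal sketch with a static, hypothetical `LOSS_SCALE`; real training stacks often adjust the scale dynamically and also check for overflow, which is not shown here.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALE = 1024.0  # hypothetical static scale, chosen for illustration

tiny_grad = 1e-8  # smaller than the smallest positive FP16 value (about 6e-8)

# Without scaling, the gradient is rounded to zero when cast to FP16.
unscaled = to_fp16(tiny_grad)

# With scaling, the value is multiplied by LOSS_SCALE before the FP16 cast,
# then divided out again in higher precision (Python floats here).
scaled_fp16 = to_fp16(tiny_grad * LOSS_SCALE)
recovered = scaled_fp16 / LOSS_SCALE

if __name__ == "__main__":
    print("unscaled FP16 gradient:", unscaled)
    print("recovered after scaling:", recovered)
```

The unscaled gradient is lost entirely, while the scaled one survives the FP16 cast with a small rounding error.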
Checkpointing
Training runs fail. Hardware fails, jobs get preempted, networks break.
Checkpointing saves:
- model weights
- optimizer states
- scheduler state
- dataloader position
- random seeds and random number generator state
- sharding metadata
Good checkpoint design lets training resume without losing days of compute.
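The resume property can be sketched with a toy, single-process loop. Everything here is a simplified assumption: `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names, the "model" is one float, and `pickle` stands in for a real checkpoint format. The sketch shows two ideas: write to a temporary file and rename it atomically so a crash never leaves a half-written checkpoint, and save the data position and RNG state so a resumed run matches an uninterrupted one.

```python
import os
import pickle
import random
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers see the old file or the new one, never a partial one


def load_checkpoint(path):
    """Load a checkpoint written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def train(num_steps, state=None):
    """Toy training loop whose entire state lives in one dictionary."""
    if state is None:
        state = {"step": 0, "weight": 0.0, "data_pos": 0,
                 "rng_state": random.Random(0).getstate()}
    rng = random.Random()
    rng.setstate(state["rng_state"])
    data = list(range(100))  # stand-in for a tokenized dataset
    weight, pos, step = state["weight"], state["data_pos"], state["step"]
    for _ in range(num_steps):
        example = data[pos % len(data)]
        noise = rng.random()  # stand-in for dropout or shuffling randomness
        weight += 0.01 * (example + noise)
        pos += 1
        step += 1
    return {"step": step, "weight": weight, "data_pos": pos,
            "rng_state": rng.getstate()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.pkl")
        save_checkpoint(ckpt, train(5))
        resumed = train(5, load_checkpoint(ckpt))
    print("resumed run matches uninterrupted run:", resumed == train(10))
```

A sharded, multi-worker checkpoint additionally needs the sharding metadata listed above so that each worker can find its own slice on restart.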
Monitoring training health
Track:
- training loss
- validation loss
- gradient norm
- learning rate
- tokens per second
- GPU utilization
- memory utilization
- data loader stalls
- checkpoint time
- failed workers
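Two of these metrics are simple enough to sketch directly. The helper names below are hypothetical, and the list passed to `global_grad_norm` stands in for what a real job would obtain with a collective sum across workers.

```python
import math


def global_grad_norm(per_worker_sq_sums):
    """Combine each worker's sum of squared gradient entries into one L2 norm.

    In a distributed job the sum would come from a collective operation;
    a plain list is used here as a stand-in.
    """
    return math.sqrt(sum(per_worker_sq_sums))


def tokens_per_second(global_batch_size, seq_len, step_seconds):
    """Throughput for one optimizer step, counting every token in the batch."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return global_batch_size * seq_len / step_seconds


if __name__ == "__main__":
    print(global_grad_norm([9.0, 16.0]))
    print(tokens_per_second(global_batch_size=1024, seq_len=2048, step_seconds=4.0))
```

Logging both per step makes it easier to tell a numerical problem (gradient norm jumps) from a systems problem (throughput drops).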
Common failure modes
- loss spikes
- silent data corruption
- duplicated data shards
- unstable mixed precision
- poor GPU utilization
- network bottlenecks
- checkpoint incompatibility
- training on contaminated eval data
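As an example of catching the first failure mode automatically, the sketch below flags a loss value that jumps well above the recent median. The class name `LossSpikeDetector` and its thresholds (`window`, `ratio`, `min_history`) are hypothetical defaults chosen for illustration; a real run would tune them and would likely combine this check with gradient-norm monitoring.

```python
from collections import deque
from statistics import median


class LossSpikeDetector:
    """Flag losses that exceed a multiple of the recent median loss."""

    def __init__(self, window=50, ratio=1.5, min_history=10):
        self.history = deque(maxlen=window)  # recent non-spike losses
        self.ratio = ratio
        self.min_history = min_history

    def update(self, loss):
        """Record a loss value and return True if it looks like a spike."""
        is_spike = (
            len(self.history) >= self.min_history
            and loss > self.ratio * median(self.history)
        )
        if not is_spike:
            # Spikes are kept out of the history so they do not raise the baseline.
            self.history.append(loss)
        return is_spike


if __name__ == "__main__":
    detector = LossSpikeDetector(window=20, ratio=1.5, min_history=5)
    for step, loss in enumerate([2.0] * 10 + [5.0, 2.1]):
        if detector.update(loss):
            print("possible loss spike at step", step, "loss", loss)
```

What to do after a flag (skip the batch, lower the learning rate, or roll back to a checkpoint) is a policy decision that this sketch leaves open.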
Knowledge check
Q1: Why do large models need multiple parallelism strategies?
Because weights, gradients, activations, optimizer states, and batches together are too large for one device, and no single way of splitting them relieves every bottleneck.
Q2: Why are checkpoints part of training quality?
Without reliable checkpoints, long runs are fragile, and a single failure can destroy days of progress.