Distributed Training Systems

Large language models do not train on a single GPU. They train across clusters of accelerators.

The engineering challenge is to keep thousands of accelerators busy while moving data, gradients, activations, optimizer states, and checkpoints reliably.

Why one GPU is not enough

During training, device memory has to hold:

model weights
gradients
optimizer states
activations
key/value tensors and other attention intermediates
batches of tokenized data

For very large models, the combined footprint exceeds the memory of a single device, so training must be distributed.
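
As a rough illustration of how quickly the footprint outgrows one device, the sketch below estimates memory for weights, gradients, and optimizer states only. It assumes mixed-precision training with an Adam-style optimizer at 16 bytes per parameter (2 for 16-bit weights, 2 for 16-bit gradients, 12 for FP32 master weights plus two FP32 moment estimates). The 7-billion-parameter model and the 80 GB device are hypothetical numbers chosen for the example, and activations and data batches are left out entirely.

```python
# Rough memory estimate for model state in mixed-precision training.
# Assumption: Adam-style optimizer with FP32 master weights (16 bytes/param total).
BYTES_PER_PARAM = {
    "weights_16bit": 2,
    "gradients_16bit": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}

def model_state_gb(num_params: int) -> float:
    """Return GB (10^9 bytes) needed for weights, gradients, and optimizer states."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

num_params = 7_000_000_000   # hypothetical 7B-parameter model
device_gb = 80               # hypothetical accelerator memory
needed_gb = model_state_gb(num_params)

print(f"model state: {needed_gb:.0f} GB, device: {device_gb} GB")
print("fits on one device:", needed_gb <= device_gb)
```

Under these assumptions the model state alone is larger than the device, before a single activation is stored.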

Types of parallelism

Technique	What is split
Data parallelism	the batch: each device holds a full model replica and processes different examples
Tensor parallelism	individual matrix operations within a layer, across devices
Pipeline parallelism	groups of consecutive layers (stages) across devices
Sequence parallelism	the sequence dimension of activations, for long sequences
Expert parallelism	MoE experts across devices
ZeRO-style sharding	optimizer states, gradients, and parameters across data-parallel workers

Real systems combine several of these.
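
To make the first row of the table concrete, here is a minimal sketch of data parallelism on a toy one-parameter model. The "devices" are simulated with plain Python lists and `all_reduce_mean` is a hypothetical stand-in for a real collective operation, so nothing here reflects a particular framework's API. It shows that averaging per-device gradients over equal-sized shards reproduces the gradient of the full batch.

```python
# Data parallelism on a toy model y = w * x with squared-error loss.
# Devices are simulated; no real communication happens.

def grad_w(w, examples):
    """Mean gradient of (w*x - y)^2 with respect to w over the examples."""
    return sum(2 * (w * x - y) * x for x, y in examples) / len(examples)

def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per device."""
    return sum(values) / len(values)

w = 0.5
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Split the batch into equal shards, one per simulated device.
num_devices = 2
shard_size = len(batch) // num_devices
shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_devices)]

local_grads = [grad_w(w, shard) for shard in shards]  # computed independently
synced_grad = all_reduce_mean(local_grads)            # identical on every device
full_grad = grad_w(w, batch)                          # single-device reference

print("per-device:", local_grads)
print("synced:", synced_grad, "full batch:", full_grad)
```

The equality relies on the shards being the same size; with uneven shards the average would need to be weighted by shard length.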

Mixed precision

Training usually uses lower precision math where safe:

BF16 or FP16 for speed and memory
FP32 master weights or accumulators where needed
loss scaling to keep small FP16 gradients from underflowing (rarely needed with BF16)
FP8 in newer training and inference stacks, on hardware that supports it

Lower precision saves memory and bandwidth, but it can make training unstable if used carelessly.
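
One concrete source of that instability is gradient underflow in FP16, which loss scaling is meant to address. The sketch below uses the standard library's `struct` half-precision format to round values to FP16. The gradient value and the fixed scale of 1024 are assumptions picked for illustration; production stacks typically adjust the scale dynamically.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8      # hypothetical small gradient value
loss_scale = 1024.0   # hypothetical fixed loss scale (a power of two)

# Without scaling, the value is too small for FP16 and rounds to zero.
unscaled = to_fp16(true_grad)

# With scaling, the gradient is multiplied before the FP16 cast and divided
# by the same factor in higher precision before the optimizer step.
scaled = to_fp16(true_grad * loss_scale)
recovered = scaled / loss_scale

print("FP16 without scaling:", unscaled)
print("recovered after scale/unscale:", recovered)
```

Without scaling the update is lost entirely; with scaling the recovered value is close to the original gradient, at the cost of FP16 rounding error.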

Checkpointing

Training runs fail. Hardware fails, jobs get preempted, and networks break.

Checkpointing saves:

model weights
optimizer states
scheduler state
dataloader position
random number generator state
sharding metadata

Good checkpoint design lets training resume without losing days of compute.
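
A minimal single-process sketch of these ideas follows. It writes the checkpoint to a temporary file and renames it into place, so a crash mid-write does not corrupt the previous checkpoint, and it restores the random number generator so the resumed run continues identically. Every field name in `state` is hypothetical, and real distributed checkpoints are sharded across workers rather than pickled into one file.

```python
import os
import pickle
import random
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # a crash before this line leaves the old file intact

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy training state; all keys are illustrative assumptions.
rng = random.Random(1234)
state = {
    "step": 100,
    "weights": [0.1, -0.2, 0.3],
    "optimizer_state": {"momentum": [0.0, 0.01, -0.01]},
    "scheduler_state": {"learning_rate": 3e-4},
    "dataloader_position": 6400,   # examples consumed so far
    "rng_state": rng.getstate(),
    "sharding": {"world_size": 1, "rank": 0},
}

ckpt_path = os.path.join(tempfile.mkdtemp(), "step_100.ckpt")
save_checkpoint(ckpt_path, state)
next_draw_before_crash = rng.random()

# "Restart": rebuild the RNG from the checkpoint and draw the next value.
restored = load_checkpoint(ckpt_path)
resumed_rng = random.Random()
resumed_rng.setstate(restored["rng_state"])
next_draw_after_resume = resumed_rng.random()

print("resume matches:", next_draw_after_resume == next_draw_before_crash)
```

Saving the generator state rather than only the original seed is what lets the resumed run pick up mid-stream instead of replaying from step zero.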

Monitoring training health

Track:

training loss
validation loss
gradient norm
learning rate
tokens per second
GPU utilization
memory utilization
data loader stalls
checkpoint time
failed workers
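
Two of these signals are simple to compute per step. The sketch below shows one way to derive a global gradient norm and a throughput figure; the function names, the nested-list gradient layout, and the step size are assumptions for illustration rather than any framework's interface.

```python
import math

def global_grad_norm(grads_per_tensor):
    """L2 norm over all gradient values, treating every tensor as one flat vector."""
    return math.sqrt(sum(g * g for tensor in grads_per_tensor for g in tensor))

def tokens_per_second(tokens_in_step, step_seconds):
    """Throughput for one optimizer step."""
    return tokens_in_step / step_seconds

# Hypothetical gradients for two small parameter tensors.
grads = [[3.0, 0.0], [0.0, 4.0]]

step_metrics = {
    "grad_norm": global_grad_norm(grads),
    "tokens_per_second": tokens_per_second(tokens_in_step=524_288, step_seconds=2.0),
}
print(step_metrics)
```

Logging both on every step makes it easier to tell a numerical problem (gradient norm jumps) from a systems problem (throughput drops).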

Common failure modes

loss spikes
silent data corruption
duplicated data shards
unstable mixed precision
poor GPU utilization
network bottlenecks
checkpoint incompatibility
evaluation data leaking into the training set
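
Some of these can be caught automatically. As one example, the sketch below flags loss spikes by comparing each loss against the mean of recent non-spike values. The window size, the 1.5x threshold, and the loss values are arbitrary assumptions; a real monitor would tune them per run.

```python
import math
from collections import deque

def find_loss_spikes(losses, window=5, ratio=1.5):
    """Return step indices whose loss is non-finite or exceeds
    `ratio` times the mean of the previous `window` non-spike losses."""
    recent = deque(maxlen=window)
    spikes = []
    for step, loss in enumerate(losses):
        baseline_ready = len(recent) == window
        if not math.isfinite(loss) or (
            baseline_ready and loss > ratio * (sum(recent) / window)
        ):
            spikes.append(step)
        else:
            recent.append(loss)  # spikes are kept out of the baseline
    return spikes

# Hypothetical per-step training losses with one spike at step 6.
losses = [2.9, 2.8, 2.8, 2.7, 2.7, 2.6, 6.1, 2.6, 2.5]
print(find_loss_spikes(losses))
```

Keeping flagged values out of the baseline prevents one spike from raising the threshold and hiding the next one.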

Knowledge check

Q1: Why do large models need multiple parallelism strategies?

Because weights, gradients, activations, and optimizer states together are too large for one device, and no single way of splitting them removes every memory and communication bottleneck.

Q2: Why are checkpoints part of training quality?

Without reliable checkpoints, long runs are fragile, and a single failure can destroy days of progress.