Distributed Training Systems

Learn how large models train across GPU clusters using parallelism, mixed precision, checkpointing, and fault tolerance

34 min read · distributed training · GPUs · parallelism · checkpoints

Large language models do not train on one GPU. They train across clusters.

The engineering challenge is to keep thousands of accelerators busy while moving data, gradients, activations, optimizer states, and checkpoints reliably.

Why one GPU is not enough

During training, device memory has to hold:

  • model weights
  • gradients
  • optimizer states
  • activations
  • attention intermediates (keys, values, and attention scores kept for the backward pass)
  • batches of tokenized data

For very large models, the total exceeds the memory of a single device, so training must be distributed.
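To make the scale concrete, here is a rough back-of-envelope sketch. It assumes a common mixed-precision Adam setup: 2 bytes per parameter for BF16 weights, 2 for BF16 gradients, and 12 for FP32 master weights plus the two Adam moments, so about 16 bytes per parameter before any activations. The function name, the 7B model size, and the 80 GB device capacity are illustrative assumptions, not figures from this lesson.

```python
# Rough estimate of model-state memory only (activations are NOT included).
# Assumption: mixed-precision Adam, about 16 bytes per parameter.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def model_state_gb(num_params: int) -> float:
    """Return model-state memory in GB (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    device_gb = 80          # hypothetical accelerator memory
    needed = model_state_gb(params)
    print(f"model state: {needed:.0f} GB, device: {device_gb} GB")
    print("fits on one device:", needed <= device_gb)
```

Under these assumptions the model state alone is larger than the device, before counting activations or data batches.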

Types of parallelism

Technique              What is split
Data parallelism       different examples across devices
Tensor parallelism     matrix operations across devices
Pipeline parallelism   layers across devices
Sequence parallelism   long sequences across devices
Expert parallelism     MoE experts across devices
ZeRO-style sharding    optimizer states, gradients, and parameters

Real systems combine several of these.
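One way to picture a combined layout is as a grid of devices. The sketch below maps a flat global rank to coordinates on a hypothetical 3D mesh (data, pipeline, tensor), with the tensor axis varying fastest so that tensor-parallel peers have adjacent ranks. The axis ordering and the names `mesh_coords` and `tensor_parallel_group` are assumptions for illustration; real frameworks choose their own layouts.

```python
def mesh_coords(rank, dp, pp, tp):
    """Map a flat rank to (data, pipeline, tensor) coordinates."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range for this mesh")
    tp_idx = rank % tp           # tensor axis varies fastest
    pp_idx = (rank // tp) % pp   # then the pipeline stage
    dp_idx = rank // (tp * pp)   # data-parallel replica varies slowest
    return dp_idx, pp_idx, tp_idx


def tensor_parallel_group(rank, dp, pp, tp):
    """Ranks that share this rank's data and pipeline coordinates."""
    dp_idx, pp_idx, _ = mesh_coords(rank, dp, pp, tp)
    base = (dp_idx * pp + pp_idx) * tp
    return list(range(base, base + tp))


if __name__ == "__main__":
    # Hypothetical 16-device job: 2 replicas x 2 pipeline stages x 4-way tensor split.
    print(mesh_coords(13, dp=2, pp=2, tp=4))
    print(tensor_parallel_group(13, dp=2, pp=2, tp=4))
```

Keeping tensor-parallel peers on adjacent ranks is a common choice because that axis tends to communicate most often, so it benefits from the fastest links.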

Mixed precision

Training usually uses lower precision math where safe:

  • BF16 or FP16 for speed and memory
  • FP32 master weights or accumulators where needed
  • loss scaling for stability, mainly with FP16 (BF16 has a wider exponent range and usually does not need it)
  • FP8 in newer training/inference stacks where supported

Lower precision saves memory and bandwidth, but it can make training unstable if used carelessly.
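The instability is easy to reproduce in miniature. The sketch below uses Python's `struct` half-precision format (`"e"`) to round values to FP16, showing a small gradient underflowing to zero and the same gradient surviving when it is multiplied by a loss scale first. The gradient value and the scale of 1024 are hypothetical, and a static scale is a simplification; many training stacks adjust the scale dynamically.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and return it."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8     # hypothetical gradient, below the smallest FP16 value
loss_scale = 1024.0  # hypothetical static loss scale

# Stored directly in FP16, the gradient underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaling before the FP16 cast keeps it representable.
scaled = to_fp16(tiny_grad * loss_scale)

# Unscale in higher precision (a Python float here) before the optimizer step.
recovered = scaled / loss_scale

if __name__ == "__main__":
    print("without scaling:", unscaled)
    print("with scaling:   ", recovered)
```

The recovered value is not exact, since FP16 still rounds the scaled gradient, but it is no longer lost entirely.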

Checkpointing

Training runs fail. Hardware fails, jobs get preempted, and networks break.

Checkpointing saves:

  • model weights
  • optimizer states
  • scheduler state
  • dataloader position
  • random seeds
  • sharding metadata

Good checkpoint design lets training resume without losing days of compute.
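A minimal sketch of two ideas behind that list: write the checkpoint atomically so a crash mid-save never leaves a half-written file, and store the random state so the resumed run continues the same sequence. It uses JSON and Python's `random` module as stand-ins; the function names and the state layout are hypothetical, and a real run would also serialize weights, optimizer states, and sharding metadata in its framework's own format.

```python
import json
import os
import random
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then rename atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # make sure bytes reach the disk
        os.replace(tmp_path, path)  # atomic swap: old or new, never partial
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


def rng_state_to_json(state):
    """Convert random.getstate() output into JSON-friendly lists."""
    version, internal, gauss_next = state
    return [version, list(internal), gauss_next]


def rng_state_from_json(obj):
    """Rebuild the tuple form that random.setstate() expects."""
    version, internal, gauss_next = obj
    return (version, tuple(internal), gauss_next)
```

On resume, the loop would call `load_checkpoint`, restore the step counter and dataloader position, and pass the stored value through `rng_state_from_json` into `random.setstate`.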

Monitoring training health

Track:

  • training loss
  • validation loss
  • gradient norm
  • learning rate
  • tokens per second
  • GPU utilization
  • memory utilization
  • data loader stalls
  • checkpoint time
  • failed workers
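Two of these signals can be sketched in a few lines: throughput in tokens per second, and a simple check that flags a loss value well above the recent median. The class name, the window size, and the 1.5x threshold are hypothetical choices for illustration, not recommended settings.

```python
from collections import deque
from statistics import median


def tokens_per_second(tokens_in_step, step_seconds):
    """Throughput for one optimizer step."""
    return tokens_in_step / step_seconds


class LossSpikeDetector:
    """Flag a loss that exceeds a multiple of the recent median."""

    def __init__(self, window=50, ratio=1.5):
        self.history = deque(maxlen=window)
        self.ratio = ratio

    def update(self, loss):
        """Record a loss value; return True if it looks like a spike."""
        is_spike = (
            len(self.history) == self.history.maxlen  # wait for a full window
            and loss > self.ratio * median(self.history)
        )
        self.history.append(loss)
        return is_spike


if __name__ == "__main__":
    detector = LossSpikeDetector(window=5, ratio=1.5)
    for step, loss in enumerate([2.0, 2.0, 2.0, 2.0, 2.0, 3.5, 2.1]):
        if detector.update(loss):
            print(f"step {step}: possible loss spike ({loss})")
```

Using the median rather than the mean keeps a single spike from shifting the baseline much, so the following normal steps are not flagged.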

Common failure modes

  • loss spikes
  • silent data corruption
  • duplicated data shards
  • unstable mixed precision
  • poor GPU utilization
  • network bottlenecks
  • checkpoint incompatibility
  • training on contaminated eval data
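Some of these can be caught with cheap checks before training starts. As one example, the sketch below finds duplicated data shards by hashing their contents. It assumes shards are available as raw bytes keyed by name, and the function name is hypothetical. It only catches byte-identical copies, not near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_duplicate_shards(shards):
    """Group shard names by content hash; return groups with more than one name.

    `shards` maps a shard name to its raw bytes.
    """
    by_digest = defaultdict(list)
    for name, payload in shards.items():
        by_digest[hashlib.sha256(payload).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


if __name__ == "__main__":
    # Hypothetical shard contents for illustration.
    shards = {
        "shard-000": b"the quick brown fox",
        "shard-001": b"jumps over the lazy dog",
        "shard-002": b"the quick brown fox",
    }
    print(find_duplicate_shards(shards))
```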

Knowledge check

Q1: Why do large models need multiple parallelism strategies?

Because weights, activations, optimizer states, and batches are too large for one device or one simple split.

Q2: Why are checkpoints part of training quality?

Without reliable checkpoints, long runs are fragile, and a single failure can wipe out days of expensive progress.