How Foundation Models Are Trained
Most developers use LLMs through APIs, but serious AI engineers need to understand how those models are made.
As of June 17, 2026, the modern training pipeline is best understood as five stages:
data collection and filtering -> tokenizer training -> pretraining -> post-training -> evaluation and deployment
Stage 1: data collection and filtering
Training starts with a massive mixture of text, code, math, scientific documents, web pages, books, conversations, images, audio, and synthetic data.
The raw data is not used directly. Teams filter and transform it:
- remove duplicates
- remove spam and boilerplate
- filter malware and low-quality code
- redact or remove sensitive personal data
- balance domains and languages
- score examples with classifiers or stronger models
- keep provenance and license metadata where possible
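A few of the filtering steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the `normalize`, `looks_like_boilerplate`, and `filter_corpus` helpers, the five-word threshold, and the `source`/`license` fields are all assumptions made for the example. Real pipelines typically add near-duplicate detection and learned quality classifiers.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def looks_like_boilerplate(text, min_words=5):
    """Hypothetical heuristic: treat very short documents as boilerplate."""
    return len(text.split()) < min_words


def filter_corpus(docs):
    """Drop boilerplate and exact duplicates (after normalization).

    Each kept document retains its original provenance and license fields.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc["text"])
        if looks_like_boilerplate(norm):
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    {"text": "The quick brown fox jumps over the lazy dog.", "source": "web", "license": "cc-by"},
    {"text": "the quick  brown fox jumps over the lazy dog.", "source": "mirror", "license": "unknown"},
    {"text": "Click here to subscribe", "source": "web", "license": "unknown"},
]
clean = filter_corpus(docs)
print([d["source"] for d in clean])  # ['web']
```

The first copy of a duplicate wins here, so the order in which sources are processed decides which provenance record survives.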
Stage 2: tokenizer training
The tokenizer defines how text becomes tokens. It affects cost per request, multilingual quality, code quality, and how much text fits in the context window.
Tokenizer decisions include:
| Decision | Why it matters |
|---|---|
| vocabulary size | a larger vocabulary can reduce token count but enlarges the embedding and output layers |
| byte fallback | lets the tokenizer encode arbitrary Unicode and rare text instead of failing on unknown characters |
| code handling | impacts programming performance |
| multilingual balance | affects non-English efficiency |
| chat template | determines how messages are represented |
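To make the vocabulary-size and byte-fallback rows concrete, here is a toy greedy longest-match tokenizer. It is a sketch under stated assumptions: production tokenizers usually learn their vocabularies (for example with BPE or unigram models), and the `tokenize` function, the tiny vocabularies, and the ID layout (0 to 255 reserved for raw bytes) are invented for illustration.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer with byte fallback.

    Assumed ID layout: 0..255 are raw bytes, vocabulary pieces start at 256.
    """
    max_len = max((len(piece) for piece in vocab), default=0)
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # No piece matched: fall back to the UTF-8 bytes of one character.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


small_vocab = {"the": 256, " ": 257}
large_vocab = {**small_vocab, "the cat": 258, " sat": 259}

text = "the cat sat"
print(len(tokenize(text, small_vocab)), len(tokenize(text, large_vocab)))  # 9 2
print(tokenize("é", small_vocab))  # [195, 169]
```

The larger vocabulary covers the same text in fewer tokens, while byte fallback guarantees that unseen characters such as "é" still get a valid encoding, at the cost of more tokens.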
Stage 3: pretraining
Pretraining teaches the base model to predict the next token.
For decoder-only LLMs, the objective is usually:
given tokens 1..n-1, predict token n
This simple objective produces broad capabilities because the model is trained on an enormous variety of data.
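The objective can be illustrated with a small sketch: a sequence is split into (context, target) pairs, and the loss for each pair is the cross-entropy between the model's predicted distribution and the actual next token. The helper names and the uniform "model" below are assumptions for illustration. A real model would produce logits from a neural network and process all positions in parallel.

```python
import math


def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


vocab_size = 4
tokens = [2, 0, 3, 1]

# Stand-in for a model: an untrained predictor that assigns equal logits to every token.
losses = [cross_entropy([0.0] * vocab_size, target) for _, target in next_token_pairs(tokens)]
mean_loss = sum(losses) / len(losses)
print(round(mean_loss, 3))  # 1.386, i.e. ln(4) for a uniform guess over 4 tokens
```

Training lowers this average loss by adjusting the model so that the correct next token receives a higher probability.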
During pretraining, teams tune or track:
- model size
- data mixture
- learning rate schedule
- batch size
- context length curriculum
- optimizer
- checkpoint frequency
- loss curves and validation sets
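As one example of a learning rate schedule, the sketch below implements linear warmup followed by cosine decay, a commonly used shape. The text does not say which schedule any particular team uses, and the function name and every default value (`max_lr`, `min_lr`, `warmup_steps`, `total_steps`) are placeholders rather than recommendations.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay to min_lr (all values hypothetical)."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past the end.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


for step in (0, 500, 1000, 50_000, 100_000):
    print(step, lr_at(step))
```

The rate peaks at the end of warmup and reaches `min_lr` at `total_steps`. Steps beyond that stay at `min_lr`.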
Stage 4: post-training
A base model completes text. A helpful assistant follows instructions.
Post-training usually adds:
- supervised fine-tuning on instruction examples
- preference tuning such as RLHF, DPO, or related methods
- tool-use examples
- safety refusals
- structured-output behavior
- domain-specific style
- reasoning examples for hard tasks
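To show what preference tuning optimizes, here is a sketch of the per-example DPO loss. It assumes you already have the summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under both the model being trained and a frozen reference model. The function name, argument names, and `beta=0.1` are illustrative choices, and real implementations operate on batched tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    # How much more (in log-probability) the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy has shifted toward the chosen response relative to the reference: lower loss.
print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
# Policy has shifted toward the rejected response: higher loss.
print(dpo_loss(-9.0, -5.0, -8.0, -6.0))
```

When the policy matches the reference exactly, the margin is zero and the loss is ln(2). Training pushes the margin positive.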
Stage 5: evaluation and deployment
Before release, teams test:
- general knowledge
- coding
- math
- reasoning
- multilingual behavior
- safety
- jailbreak resistance
- tool use
- latency and cost
- regression against older versions
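The regression check at the end of this list can be sketched as a comparison of per-benchmark scores between a baseline and a candidate model. The `find_regressions` helper, the benchmark names, the scores, and the tolerance are all made up for illustration, and the sketch assumes higher scores are better.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return benchmarks where the candidate is worse than the baseline by more than `tolerance`.

    Assumes higher is better. A benchmark missing from the candidate is also flagged.
    """
    regressions = {}
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            regressions[name] = (old_score, None)
        elif old_score - new_score > tolerance:
            regressions[name] = (old_score, new_score)
    return regressions


# Hypothetical scores, not real results.
baseline = {"coding": 0.62, "math": 0.48, "safety": 0.91}
candidate = {"coding": 0.66, "math": 0.44, "safety": 0.905}

print(find_regressions(baseline, candidate))  # {'math': (0.48, 0.44)}
```

A release gate built on a check like this would block the candidate until the flagged benchmarks are investigated.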
Then serving teams optimize inference with quantization, batching, caching, routing, and monitoring.
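Of these optimizations, quantization is the easiest to show in a few lines. The sketch below uses symmetric absmax int8 quantization on a plain list of floats. It is one simple scheme among many, the function names are assumptions, and production systems typically quantize per channel or per block with specialized kernels.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.31, -1.0, 0.07, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst-case rounding error
```

Each weight now needs one byte plus a shared scale instead of two or four bytes, and the rounding error per weight is at most about half the scale.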
Important truth
The model is not just the architecture. It is:
architecture + tokenizer + data + training recipe + post-training + evals + serving stack
Change any part and behavior can change.
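One way to make this concrete is to treat a release as a manifest of all seven parts and derive an identifier from it. This is purely an illustrative sketch: the `ReleaseManifest` class, its field values, and the fingerprint scheme are hypothetical and do not describe any particular team's tooling.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReleaseManifest:
    """Hypothetical record of everything that defines a released model."""
    architecture: str
    tokenizer: str
    data: str
    training_recipe: str
    post_training: str
    evals: str
    serving_stack: str

    def fingerprint(self):
        """Short hash over all fields: changing any one part changes the identifier."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


v1 = ReleaseManifest(
    architecture="decoder-only-v1",
    tokenizer="tok-v1",
    data="mixture-a",
    training_recipe="recipe-a",
    post_training="sft+dpo-v1",
    evals="suite-v1",
    serving_stack="int8-serving-v1",
)
v2 = replace(v1, tokenizer="tok-v2")  # same weights recipe, different tokenizer

print(v1.fingerprint() != v2.fingerprint())  # True
```

Tracking releases this way makes it explicit that a tokenizer or serving change is a different model from the user's point of view, even if the architecture is untouched.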
Knowledge check
Q1: What is the difference between pretraining and post-training?
Pretraining teaches next-token prediction over broad data and produces a base model. Post-training teaches useful assistant behavior: instruction following, safety, tool use, and alignment with human preferences.
Q2: Why is data mixture important?
It strongly shapes what the model is good at, where it fails, and which languages/domains it serves well.