How Foundation Models Are Trained

Most developers use LLMs through APIs, but serious AI engineers need to understand how those models are made.

As of June 17, 2026, the modern training pipeline is best understood as five stages:

text

data collection -> pretraining -> post-training -> evaluation -> deployment

Stage 1: data collection and filtering

Training starts with a massive mixture of text, code, math, scientific documents, web pages, books, conversations, images, audio, and synthetic data.

The raw data is not used directly. Teams filter and transform it:

The tokenizer defines how text becomes tokens. It affects cost, multilingual quality, code quality, and context usage.

Tokenizer decisions include:

Decision	Why it matters
vocabulary size	larger vocab can reduce token count but increases embedding size
byte fallback	helps with arbitrary unicode and rare text
code handling	impacts programming performance
multilingual balance	affects non-English efficiency
chat template	determines how messages are represented

Pretraining teaches the base model to predict tokens.

For decoder-only LLMs, the objective is usually:

text

given tokens 1..n-1, predict token n

This simple objective creates broad capabilities because the model sees enormous varied data.

During pretraining, teams tune:

A base model completes text. A helpful assistant follows instructions.

Post-training usually adds:

Before release, teams test:

Then serving teams optimize inference with quantization, batching, caching, routing, and monitoring.

The model is not just the architecture. It is:

text

architecture + tokenizer + data + training recipe + post-training + evals + serving stack

Change any part and behavior can change.

Q1: What is the difference between pretraining and post-training?

Pretraining teaches broad next-token prediction. Post-training teaches useful assistant behavior, safety, tools, and preferences.

Q2: Why is data mixture important?

It strongly shapes what the model is good at, where it fails, and which languages/domains it serves well.