Back
advanced
Foundation Model Training

How Foundation Models Are Trained

Trace the full LLM training pipeline from raw data to base model, post-trained assistant, evals, and deployment

32 min read· pretraining· foundation models· training pipeline· LLM training

How Foundation Models Are Trained

Most developers use LLMs through APIs, but serious AI engineers need to understand how those models are made.

As of June 17, 2026, the modern training pipeline is best understood as five stages:

text
data collection -> pretraining -> post-training -> evaluation -> deployment

Stage 1: data collection and filtering

Training starts with a massive mixture of text, code, math, scientific documents, web pages, books, conversations, images, audio, and synthetic data.

The raw data is not used directly. Teams filter and transform it:

  • remove duplicates
  • remove spam and boilerplate
  • filter malware and low-quality code
  • redact or remove sensitive personal data
  • balance domains and languages
  • score examples with classifiers or stronger models
  • keep provenance and license metadata where possible

Stage 2: tokenizer training

The tokenizer defines how text becomes tokens. It affects cost, multilingual quality, code quality, and context usage.

Tokenizer decisions include:

DecisionWhy it matters
vocabulary sizelarger vocab can reduce token count but increases embedding size
byte fallbackhelps with arbitrary unicode and rare text
code handlingimpacts programming performance
multilingual balanceaffects non-English efficiency
chat templatedetermines how messages are represented

Stage 3: pretraining

Pretraining teaches the base model to predict tokens.

For decoder-only LLMs, the objective is usually:

text
given tokens 1..n-1, predict token n

This simple objective creates broad capabilities because the model sees enormous varied data.

During pretraining, teams tune:

  • model size
  • data mixture
  • learning rate schedule
  • batch size
  • context length curriculum
  • optimizer
  • checkpoint frequency
  • loss curves and validation sets

Stage 4: post-training

A base model completes text. A helpful assistant follows instructions.

Post-training usually adds:

  • supervised fine-tuning on instruction examples
  • preference tuning such as RLHF, DPO, or related methods
  • tool-use examples
  • safety refusals
  • structured-output behavior
  • domain-specific style
  • reasoning examples for hard tasks

Stage 5: evaluation and deployment

Before release, teams test:

  • general knowledge
  • coding
  • math
  • reasoning
  • multilingual behavior
  • safety
  • jailbreak resistance
  • tool use
  • latency and cost
  • regression against older versions

Then serving teams optimize inference with quantization, batching, caching, routing, and monitoring.

Important truth

The model is not just the architecture. It is:

text
architecture + tokenizer + data + training recipe + post-training + evals + serving stack

Change any part and behavior can change.

Knowledge check

Q1: What is the difference between pretraining and post-training?

Pretraining teaches broad next-token prediction. Post-training teaches useful assistant behavior, safety, tools, and preferences.

Q2: Why is data mixture important?

It strongly shapes what the model is good at, where it fails, and which languages/domains it serves well.