Post-Training and Alignment Pipeline
Pretraining creates a base model. Post-training turns it into a useful assistant.
The base model problem
A base model may know a lot, but it is not automatically:
- helpful
- safe
- concise
- tool-aware
- schema-aware
- calibrated
- conversational
- aligned with user intent
Post-training teaches these behaviors.
Stage 1: supervised fine-tuning
Supervised fine-tuning uses high-quality instruction-response examples.
Examples include:
- question answering
- summarization
- coding help
- refusal examples
- tool-call examples
- JSON output examples
- multi-turn conversations
Quality beats quantity. Bad examples teach bad habits.
Stage 2: preference optimization
Preference data says which response is better.
Common methods include:
| Method | Idea |
|---|---|
| RLHF | train reward model, then optimize policy |
| DPO | directly optimize chosen vs rejected responses |
| RLAIF | use AI feedback for preference labels |
| constitutional AI | critique and revise against written principles |
Stage 3: tool and agent training
Modern assistants need to learn:
- when to call a tool
- which tool to call
- how to format arguments
- when not to call tools
- how to use observations
- how to stop after success
Tool traces should include failure cases, not just perfect runs.
Stage 4: safety tuning
Safety tuning covers:
- policy refusals
- privacy boundaries
- self-harm and dangerous content
- prompt injection resilience
- data exfiltration resistance
- high-impact decision escalation
Stage 5: regression evals
Every post-training change can improve one behavior and hurt another.
Use evals for:
- helpfulness
- instruction following
- refusal correctness
- over-refusal
- tool success
- JSON/schema adherence
- hallucination
- jailbreak resistance
Knowledge check
Q1: Why is a base model not enough for a chat assistant?
It predicts text but has not necessarily learned instruction following, safety, tool use, or conversational preferences.
Q2: What is preference optimization trying to teach?
Which answers humans or evaluators prefer for a given prompt.