Post-Training and Alignment Pipeline

Pretraining creates a base model. Post-training turns it into a useful assistant.

The base model problem

A base model may know a lot, but it is not automatically:

Post-training teaches these behaviors.

Supervised fine-tuning uses high-quality instruction-response examples.

Examples include:

Quality beats quantity. Bad examples teach bad habits.

Preference data says which response is better.

Common methods include:

Method	Idea
RLHF	train reward model, then optimize policy
DPO	directly optimize chosen vs rejected responses
RLAIF	use AI feedback for preference labels
constitutional AI	critique and revise against written principles

Modern assistants need to learn:

Tool traces should include failure cases, not just perfect runs.

Safety tuning covers:

Every post-training change can improve one behavior and hurt another.

Use evals for:

Q1: Why is a base model not enough for a chat assistant?

It predicts text but has not necessarily learned instruction following, safety, tool use, or conversational preferences.

Q2: What is preference optimization trying to teach?

Which answers humans or evaluators prefer for a given prompt.