Back
advanced
Foundation Model Training

Post-Training and Alignment Pipeline

Understand how base models become helpful assistants through SFT, preference optimization, tool training, safety, and evals

32 min read· post-training· alignment· SFT· RLHF

Post-Training and Alignment Pipeline

Pretraining creates a base model. Post-training turns it into a useful assistant.

The base model problem

A base model may know a lot, but it is not automatically:

  • helpful
  • safe
  • concise
  • tool-aware
  • schema-aware
  • calibrated
  • conversational
  • aligned with user intent

Post-training teaches these behaviors.

Stage 1: supervised fine-tuning

Supervised fine-tuning uses high-quality instruction-response examples.

Examples include:

  • question answering
  • summarization
  • coding help
  • refusal examples
  • tool-call examples
  • JSON output examples
  • multi-turn conversations

Quality beats quantity. Bad examples teach bad habits.

Stage 2: preference optimization

Preference data says which response is better.

Common methods include:

MethodIdea
RLHFtrain reward model, then optimize policy
DPOdirectly optimize chosen vs rejected responses
RLAIFuse AI feedback for preference labels
constitutional AIcritique and revise against written principles

Stage 3: tool and agent training

Modern assistants need to learn:

  • when to call a tool
  • which tool to call
  • how to format arguments
  • when not to call tools
  • how to use observations
  • how to stop after success

Tool traces should include failure cases, not just perfect runs.

Stage 4: safety tuning

Safety tuning covers:

  • policy refusals
  • privacy boundaries
  • self-harm and dangerous content
  • prompt injection resilience
  • data exfiltration resistance
  • high-impact decision escalation

Stage 5: regression evals

Every post-training change can improve one behavior and hurt another.

Use evals for:

  • helpfulness
  • instruction following
  • refusal correctness
  • over-refusal
  • tool success
  • JSON/schema adherence
  • hallucination
  • jailbreak resistance

Knowledge check

Q1: Why is a base model not enough for a chat assistant?

It predicts text but has not necessarily learned instruction following, safety, tool use, or conversational preferences.

Q2: What is preference optimization trying to teach?

Which answers humans or evaluators prefer for a given prompt.