Distillation and Synthetic Data

Distillation transfers behavior from a stronger teacher model to a smaller, cheaper student model, typically by training the student on the teacher's outputs.

Synthetic data uses models to create examples for training, evaluation, or red-teaming.

Why this matters

Frontier models are powerful but can be expensive for high-volume, narrow tasks. A smaller model trained on high-quality, task-specific examples may be:

faster
cheaper
easier to deploy privately
more consistent for a narrow format
good enough for routine cases

Distillation workflow

collect real task examples
  -> ask teacher model for ideal outputs
  -> filter and deduplicate
  -> add hard negatives and edge cases
  -> fine-tune student model
  -> evaluate against held-out data
  -> route only suitable traffic to student
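
The data-preparation steps of the workflow above can be sketched as follows. This is a minimal illustration, not a definitive pipeline: `build_distillation_set`, `teacher`, `is_acceptable`, and `holdout_every` are hypothetical names, the teacher is a stand-in function rather than a real model call, and the sketch deduplicates prompts before calling the teacher to avoid paying for repeated calls. Adding hard negatives and the fine-tuning step itself are left out.

```python
from typing import Callable


def build_distillation_set(
    prompts: list[str],
    teacher: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
    holdout_every: int = 5,
) -> tuple[list[dict], list[dict]]:
    """Label prompts with the teacher, filter, deduplicate, and split off a holdout."""
    seen: set[str] = set()
    examples: list[dict] = []
    for prompt in prompts:
        key = " ".join(prompt.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # skip duplicate prompts before spending a teacher call
        seen.add(key)
        output = teacher(prompt)
        if not is_acceptable(prompt, output):
            continue  # drop teacher outputs that fail the filter
        examples.append({"prompt": prompt, "completion": output})

    # Deterministic split: every Nth surviving example is held out for evaluation
    # and must never be used for training.
    train = [ex for i, ex in enumerate(examples) if i % holdout_every != 0]
    holdout = [ex for i, ex in enumerate(examples) if i % holdout_every == 0]
    return train, holdout


# Stand-in teacher and filter, for illustration only.
def fake_teacher(prompt: str) -> str:
    return prompt.strip().upper()


def non_empty(prompt: str, output: str) -> bool:
    return bool(output)


train, holdout = build_distillation_set(
    ["reset password", "Reset  Password", "cancel order", "   ", "refund status"],
    fake_teacher,
    non_empty,
    holdout_every=3,
)
print(len(train), "training examples,", len(holdout), "held out")
```

In practice the filter would check more than non-emptiness, for example schema validity or agreement with a second model, and the holdout would be reviewed by a human before it is trusted.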

Synthetic data quality checks

remove duplicates
verify labels
balance classes
include refusals/no-answer cases
include adversarial examples
keep a human-reviewed holdout set
avoid training on test examples
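
Several of these checks can be automated. The sketch below covers three of them: duplicates, class balance, and train/test leakage. It assumes each example is a dict with hypothetical `text` and `label` keys, and it only catches duplicates that match after lowercasing and whitespace normalization, so paraphrased near-duplicates would need a fuzzier comparison.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def quality_report(train: list[dict], test: list[dict]) -> dict:
    """Summarize duplicates, label balance, and train/test overlap."""
    train_keys = [normalize(ex["text"]) for ex in train]
    test_keys = {normalize(ex["text"]) for ex in test}
    key_counts = Counter(train_keys)
    label_counts = Counter(ex["label"] for ex in train)
    # Ratio of the most common to the least common label; None if train is empty.
    ratio = (
        max(label_counts.values()) / min(label_counts.values())
        if label_counts
        else None
    )
    return {
        "duplicates": sum(n - 1 for n in key_counts.values()),
        "leaked_into_test": sorted(set(train_keys) & test_keys),
        "label_counts": dict(label_counts),
        "imbalance_ratio": ratio,
    }


train_set = [
    {"text": "Where is my order?", "label": "shipping"},
    {"text": "where is my  order?", "label": "shipping"},
    {"text": "I want a refund", "label": "billing"},
    {"text": "What is the CEO's home address?", "label": "no_answer"},
]
test_set = [{"text": "I want a refund", "label": "billing"}]

report = quality_report(train_set, test_set)
print(report)
```

Here the report flags one duplicate and one training example that also appears in the test set; both should be resolved before fine-tuning.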

Good uses

Use	Example
Format imitation	convert support tickets into a standard JSON shape
Domain tone	write in a company's support style
Tool calling	learn when to call which function
Edge cases	generate rare failure scenarios for evals
Cost reduction	route easy cases to a smaller model
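
The cost-reduction row can be illustrated with a simple router. This is a sketch under stated assumptions: the student is assumed to return an answer together with a confidence score between 0 and 1, which not every model exposes, and `route`, `threshold`, and the stand-in models are hypothetical names chosen for illustration.

```python
from typing import Callable


def route(
    prompt: str,
    student: Callable[[str], tuple[str, float]],
    teacher: Callable[[str], str],
    threshold: float = 0.8,
) -> tuple[str, str]:
    """Answer with the student when it is confident, otherwise fall back to the teacher."""
    answer, confidence = student(prompt)
    if confidence >= threshold:
        return "student", answer
    return "teacher", teacher(prompt)


# Stand-in models: the fake student is only confident on short prompts.
def fake_student(prompt: str) -> tuple[str, float]:
    confidence = 0.95 if len(prompt.split()) <= 4 else 0.4
    return f"student answer to: {prompt}", confidence


def fake_teacher(prompt: str) -> str:
    return f"teacher answer to: {prompt}"


easy = route("reset my password", fake_student, fake_teacher)
hard = route(
    "compare refund rules across our three regional policies",
    fake_student,
    fake_teacher,
)
print(easy[0], hard[0])
```

The threshold is a tuning knob: it would normally be chosen on held-out data so that the traffic sent to the student meets the same quality bar as the teacher.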

Bad uses

fabricating facts that should come from a database
replacing expert review for high-stakes labels
training on private data without permission
letting synthetic training examples leak into the test set
assuming the teacher is always correct
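
One way to avoid the last mistake is to measure how often the teacher agrees with a human-reviewed sample before training on its labels. The sketch below assumes labels are stored as dicts keyed by example id; `teacher_agreement` and the sample data are hypothetical.

```python
def teacher_agreement(
    teacher_labels: dict[str, str],
    human_labels: dict[str, str],
) -> tuple[float, list[str]]:
    """Return the agreement rate and the ids where teacher and human disagree."""
    shared = sorted(set(teacher_labels) & set(human_labels))
    if not shared:
        raise ValueError("no overlapping example ids to compare")
    disagreements = [i for i in shared if teacher_labels[i] != human_labels[i]]
    return 1 - len(disagreements) / len(shared), disagreements


teacher_out = {"a": "billing", "b": "shipping", "c": "billing", "d": "shipping"}
human_out = {"a": "billing", "b": "shipping", "c": "shipping", "d": "shipping"}

rate, to_review = teacher_agreement(teacher_out, human_out)
print(f"agreement: {rate:.0%}, review: {to_review}")
```

A low agreement rate suggests fixing the teacher prompt or filtering more aggressively, since the student will inherit whatever errors remain in the teacher's labels.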

Knowledge check

Q1: What is the biggest risk of synthetic data?
It can amplify teacher mistakes or create unrealistic examples if not filtered.

Q2: When is distillation useful?
When a smaller, cheaper student model can pass held-out evals on a narrow, repeated task.