Distillation and Synthetic Data
Distillation transfers behavior from a stronger teacher model into a smaller or cheaper student model, usually by training the student on the teacher's outputs.
Synthetic data uses models to create examples for training, evaluation, or red-teaming.
Why this matters
Frontier models are powerful but can be expensive for high-volume narrow tasks. A smaller model trained on excellent examples may be:
- faster
- cheaper
- easier to deploy privately
- more consistent for a narrow format
- good enough for routine cases
Distillation workflow
collect real task examples
-> ask teacher model for ideal outputs
-> filter and deduplicate
-> add hard negatives and edge cases
-> fine-tune student model
-> evaluate against held-out data
-> route only suitable traffic to student
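The data-preparation steps of this workflow (teacher labeling, filtering, deduplication, and the held-out split) can be sketched in Python. This is a minimal illustration, not a definitive pipeline: `build_distillation_set`, `fake_teacher`, and `non_empty` are hypothetical names, the teacher is a stand-in function rather than a real model call, and the fine-tuning step itself is left out.

```python
import hashlib
import random
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs match."""
    return " ".join(text.lower().split())


def build_distillation_set(
    inputs: list[str],
    teacher: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
    holdout_fraction: float = 0.2,
    seed: int = 0,
) -> tuple[list[dict], list[dict]]:
    """Label inputs with the teacher, filter, deduplicate, and split."""
    seen: set[str] = set()
    examples: list[dict] = []
    for text in inputs:
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # skip duplicate inputs before paying for a teacher call
        seen.add(key)
        output = teacher(text)
        if is_acceptable(text, output):  # drop outputs that fail the filter
            examples.append({"input": text, "output": output})

    # Split after deduplication so no input lands in both sets.
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_holdout = max(1, int(len(examples) * holdout_fraction)) if examples else 0
    return examples[n_holdout:], examples[:n_holdout]


# Hypothetical stand-ins for a real teacher model and a real quality filter.
def fake_teacher(text: str) -> str:
    return text.strip().upper()


def non_empty(text: str, output: str) -> bool:
    return bool(output)


raw_inputs = [
    "reset password",
    "Reset  password",  # duplicate after normalization
    "refund order",
    "",                 # teacher returns nothing useful, so it is filtered out
    "cancel plan",
    "update email",
]
train, holdout = build_distillation_set(raw_inputs, fake_teacher, non_empty)
```

Deduplicating before the split is a deliberate choice here: splitting first would let the same input appear in both the training set and the held-out set.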
Synthetic data quality checks
- remove duplicates
- verify labels
- balance classes
- include refusals/no-answer cases
- include adversarial examples
- keep a human-reviewed holdout set
- avoid training on test examples
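Several of these checks can be automated. The sketch below assumes each example is a dict with `text` and `label` keys (an assumed layout, not one prescribed above) and reports duplicates, label balance, and train/test leakage. The name `quality_report` is hypothetical, and label verification still needs human review.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different texts match."""
    return " ".join(text.lower().split())


def quality_report(train: list[dict], test: list[dict]) -> dict:
    """Summarize duplicates, class balance, and train/test overlap."""
    train_texts = [normalize(ex["text"]) for ex in train]
    test_texts = {normalize(ex["text"]) for ex in test}
    label_counts = Counter(ex["label"] for ex in train)
    return {
        # Number of training rows that repeat an earlier row.
        "duplicate_count": len(train_texts) - len(set(train_texts)),
        # Training texts that also appear in the test set.
        "leaked_into_test": sorted(set(train_texts) & test_texts),
        "label_counts": dict(label_counts),
        # Share of the most common label; values near 1.0 signal imbalance.
        "majority_share": max(label_counts.values()) / len(train) if train else 0.0,
    }


train_rows = [
    {"text": "Where is my order?", "label": "shipping"},
    {"text": "where is my  order?", "label": "shipping"},
    {"text": "Refund please", "label": "billing"},
    {"text": "What is the CEO's home address?", "label": "no_answer"},
]
test_rows = [
    {"text": "Refund please", "label": "billing"},
    {"text": "Cancel my plan", "label": "billing"},
]
report = quality_report(train_rows, test_rows)
```

In this toy data the report flags one duplicate and one training text that also appears in the test set. The `no_answer` row shows how a refusal case can be represented as an ordinary label.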
Good uses
| Use | Example |
|---|---|
| Format imitation | convert support tickets into a standard JSON shape |
| Domain tone | write in a company's support style |
| Tool calling | learn when to call which function |
| Edge cases | generate rare failure scenarios for evals |
| Cost reduction | route easy cases to a smaller model |
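The format-imitation and cost-reduction rows combine naturally: let the student draft the JSON, and fall back to the teacher only when the draft fails validation. The following is a sketch under stated assumptions: the ticket schema in `REQUIRED_KEYS` is invented for illustration, and `stub_student` and `stub_teacher` are placeholder functions standing in for real model calls.

```python
import json
from typing import Callable

# Hypothetical schema for the "standard JSON shape" of a support ticket.
REQUIRED_KEYS = {"category", "priority", "summary"}


def is_valid_ticket(raw: str) -> bool:
    """Return True if raw parses as a JSON object with all required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


def route(
    ticket: str,
    student: Callable[[str], str],
    teacher: Callable[[str], str],
) -> tuple[str, str]:
    """Try the cheaper student first; use the teacher if validation fails."""
    draft = student(ticket)
    if is_valid_ticket(draft):
        return "student", draft
    return "teacher", teacher(ticket)


# Placeholder models: the student only copes with short tickets.
def stub_student(ticket: str) -> str:
    if len(ticket) < 40:
        return json.dumps(
            {"category": "billing", "priority": "low", "summary": ticket}
        )
    return "not json"


def stub_teacher(ticket: str) -> str:
    return json.dumps(
        {"category": "other", "priority": "high", "summary": ticket[:40]}
    )


easy_source, easy_output = route("Refund request", stub_student, stub_teacher)
hard_source, hard_output = route("x" * 50, stub_student, stub_teacher)
```

Schema validation only catches malformed output, not wrong content, so a router like this still relies on held-out evals to confirm that the student's valid-looking answers are correct.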
Bad uses
- fabricating facts that should come from a database
- replacing expert review for high-stakes labels
- training on private data without permission
- mixing synthetic examples across the train and test splits
- assuming the teacher is always correct
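One way to avoid the last pitfall is to measure how often the teacher agrees with human-reviewed labels before training on its output. The sketch below assumes labels are stored as dicts keyed by example ID; `teacher_agreement` and the sample data are hypothetical.

```python
def teacher_agreement(
    teacher_labels: dict[str, str],
    human_labels: dict[str, str],
) -> tuple[float, list[str]]:
    """Return the agreement rate and the IDs where the teacher disagrees."""
    shared = sorted(teacher_labels.keys() & human_labels.keys())
    if not shared:
        raise ValueError("no examples have both a teacher and a human label")
    disagreements = [
        ex_id for ex_id in shared if teacher_labels[ex_id] != human_labels[ex_id]
    ]
    return 1 - len(disagreements) / len(shared), disagreements


# Toy labels: the teacher labeled five examples, humans reviewed four of them.
teacher_out = {
    "t1": "billing", "t2": "shipping", "t3": "billing",
    "t4": "no_answer", "t5": "shipping",
}
human_reviewed = {
    "t1": "billing", "t2": "shipping", "t3": "shipping", "t4": "no_answer",
}
rate, to_review = teacher_agreement(teacher_out, human_reviewed)
```

The disagreement list is as useful as the rate: those examples are the ones to re-check by hand or exclude from the training set.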
Knowledge check
Q1: What is the biggest risk of synthetic data?
It can amplify teacher mistakes or create unrealistic examples if not filtered.
Q2: When is distillation useful?
When a narrower, cheaper student model can pass evals for a repeated task.