Training Data Mixtures and Quality
Model quality often comes down to data quality at scale.
Two models with the same architecture can behave very differently if their data mixtures differ.
What goes into a data mixture
Modern LLM training data may include:
- web text
- books and articles
- code repositories
- math and theorem data
- scientific papers
- multilingual corpora
- instructions and conversations
- tool-use traces
- multimodal pairs such as image-text or audio-transcript data
- synthetic examples generated by stronger models
Data mixture is a product choice
| More of this | Improves | Risk |
|---|---|---|
| code | programming | licensing problems and vulnerable code patterns |
| math | reasoning | narrow style or synthetic artifacts |
| multilingual data | global usefulness | uneven quality across languages |
| web text | broad coverage | spam and misinformation |
| instruction data | assistant behavior | overfitting to canned style |
| synthetic data | rare skills and scale | teacher-model mistakes |
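One way to make the trade-offs in the table concrete is to treat the mixture as a set of sampling weights. The sketch below is a minimal illustration: the `MIXTURE` weights and the function names are hypothetical and chosen only for the example, not recommended values.

```python
import random
from collections import Counter

# Hypothetical mixture weights (illustrative only, not a recommendation).
MIXTURE = {
    "web": 0.50,
    "code": 0.20,
    "math": 0.10,
    "multilingual": 0.10,
    "instruction": 0.05,
    "synthetic": 0.05,
}

def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

def empirical_shares(draws):
    """Return the observed share of each source in a list of draws."""
    counts = Counter(draws)
    total = len(draws)
    return {name: counts[name] / total for name in counts}

if __name__ == "__main__":
    draws = sample_sources(MIXTURE, 10_000)
    for name, share in sorted(empirical_shares(draws).items()):
        print(f"{name:13s} {share:.3f}")
```

Raising one weight necessarily lowers the share of the others, which is why each row in the table carries a risk as well as a benefit.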
Deduplication
Duplicate data wastes compute and can cause memorization.
Common dedup strategies:
- exact hash dedup
- near-duplicate detection, typically with MinHash and locality-sensitive hashing (LSH)
- semantic dedup with embeddings
- benchmark contamination scans, which remove training text that overlaps with eval sets
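The first few strategies can be sketched with the standard library alone. The example below assumes whitespace and case normalization, 3-word shingles, and 64 hash functions. These choices and all function names are illustrative. It estimates Jaccard similarity from MinHash signatures but leaves out the LSH bucketing a large corpus would need.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, k=3):
    """Return the set of k-word shingles of a document."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _hash64(item, seed):
    """64-bit hash of a string; the seed selects one of many hash functions."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(text, num_hashes=64, k=3):
    """For each hash function, record the minimum hash over all shingles."""
    items = shingles(text, k)
    return [min(_hash64(s, seed) for s in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

A pipeline would typically drop one document from each pair whose estimated similarity exceeds a chosen threshold. The right threshold depends on the corpus and is not specified here.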
Benchmark contamination
If training data contains test questions, benchmark scores become misleading.
Teams should:
- scan for known benchmark examples
- keep time-based holdouts
- use private eval sets
- test with live user-like tasks
- avoid optimizing only public leaderboards
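Scanning for known benchmark examples is often approximated with n-gram overlap. The sketch below assumes word-level 8-grams and a simple regex tokenizer. Both are illustrative choices, and the function names are hypothetical.

```python
import re

def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_items, n=8):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index

def contamination_score(document, index, n=8):
    """Fraction of the document's n-grams that also appear in the benchmark."""
    doc_grams = ngrams(document, n)
    if not doc_grams:
        return 0.0
    return len(doc_grams & index) / len(doc_grams)

if __name__ == "__main__":
    question = ("If a train travels 60 miles in 1.5 hours, "
                "what is its average speed in miles per hour?")
    index = build_benchmark_index([question])
    leaky = "Forum post: " + question + " Answer: 40"
    print(contamination_score(leaky, index) > 0)
```

Exact n-gram matching misses paraphrased or translated leaks, which is one reason the list above also recommends private eval sets and time-based holdouts.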
Quality filtering
Quality filters can include:
- language ID
- toxicity filters
- code compile/test filters
- document structure checks
- perplexity or classifier scores
- LLM judge scoring
- source reputation
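Two of the cheaper filters above, document structure checks and code compile filters, can be sketched without any model. The thresholds (`min_words`, `min_alpha`, `max_repeat`) and function names below are hypothetical and would need tuning on real data. Only Python source is handled by the parse check.

```python
import ast

def alpha_ratio(text):
    """Share of characters that are alphabetic."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)

def repeated_line_fraction(text):
    """Share of non-empty lines that duplicate another line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 1.0
    return 1 - len(set(lines)) / len(lines)

def check_document(text, min_words=20, min_alpha=0.6, max_repeat=0.3):
    """Return (passed, reasons); the thresholds are illustrative assumptions."""
    reasons = []
    if len(text.split()) < min_words:
        reasons.append("too_short")
    if alpha_ratio(text) < min_alpha:
        reasons.append("low_alpha_ratio")
    if repeated_line_fraction(text) > max_repeat:
        reasons.append("repeated_lines")
    return (not reasons, reasons)

def python_parses(source):
    """Cheap compile-style filter: does the source parse as Python?"""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False
```

Returning the reasons alongside the verdict makes it easier to audit what a filter removes before applying it to a whole corpus.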
Synthetic data
Synthetic data is powerful when it is filtered and verified before use.
Good uses:
- generate rare edge cases
- create step-by-step reasoning examples
- build tool-call traces
- expand multilingual coverage
- create adversarial examples for evals
Bad uses:
- replacing domain experts for high-stakes labels
- mixing synthetic test examples into training
- trusting teacher answers without filtering
- creating a model that imitates teacher mistakes
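Two of the bad uses above can be guarded against mechanically when answers are checkable. The sketch below assumes a toy setting: synthetic arithmetic questions of the form `"a op b"` with a `teacher_answer` field. The data format and function names are hypothetical. Real tasks need a domain-specific verifier.

```python
import operator

# Operators the toy verifier understands.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def verify(example):
    """Recompute the answer independently instead of trusting the teacher."""
    a, op, b = example["question"].split()
    return OPS[op](int(a), int(b)) == example["teacher_answer"]

def filter_synthetic(examples, eval_questions):
    """Drop examples that overlap the eval set or fail verification."""
    kept = []
    for example in examples:
        if example["question"] in eval_questions:
            continue  # never train on questions reserved for evaluation
        if not verify(example):
            continue  # teacher mistake: do not teach the model to imitate it
        kept.append(example)
    return kept

if __name__ == "__main__":
    examples = [
        {"question": "2 + 3", "teacher_answer": 5},
        {"question": "4 * 6", "teacher_answer": 25},  # wrong teacher answer
        {"question": "9 - 4", "teacher_answer": 5},   # also in the eval set
    ]
    print(filter_synthetic(examples, eval_questions={"9 - 4"}))
```

Where no automatic verifier exists, the same structure applies with a human or expert review step in place of `verify`.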
Knowledge check
Q1: Why can public benchmark scores be misleading?
Because examples may leak into training data or teams may over-optimize for benchmark style.
Q2: What does deduplication reduce?
Compute waste, memorization risk, and overrepresentation of repeated sources.