Training Data Mixtures and Quality

Understand how data quality, mixture design, deduplication, synthetic data, and contamination shape model behavior

Model quality is often data quality at scale.

Two models with the same architecture can behave very differently if their data mixtures differ.

What goes into a data mixture

Modern LLM training data may include:

  • web text
  • books and articles
  • code repositories
  • math and theorem data
  • scientific papers
  • multilingual corpora
  • instructions and conversations
  • tool-use traces
  • multimodal pairs such as image-text or audio-transcript data
  • synthetic examples generated by stronger models

Data mixture is a product choice

More of this      | Improves              | Risk
code              | programming           | license and vulnerable code issues
math              | reasoning             | narrow style or synthetic artifacts
multilingual data | global usefulness     | uneven quality across languages
web text          | broad coverage        | spam and misinformation
instruction data  | assistant behavior    | overfitting to canned style
synthetic data    | rare skills and scale | teacher-model mistakes
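
To make the trade-off concrete, a mixture can be written down as sampling weights per source. The following is a minimal sketch, not a recommended recipe: the source names and token counts are made up, and temperature scaling is just one possible way to upsample small sources.

```python
import random

def mixture_weights(token_counts, temperature=1.0):
    """Turn per-source token counts into sampling probabilities.

    temperature=1.0 samples in proportion to source size; values above
    1.0 flatten the distribution, which upsamples small sources.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {name: count ** (1.0 / temperature)
              for name, count in token_counts.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

def sample_sources(weights, n, seed=0):
    """Draw n source names according to the mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# Hypothetical token counts (in billions), for illustration only.
token_counts = {"web": 800, "code": 150, "math": 50}
proportional = mixture_weights(token_counts)
flattened = mixture_weights(token_counts, temperature=2.0)
```

With these assumed counts, raising the temperature gives math a larger share than its raw size would, at the cost of repeating that smaller source more often.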

Deduplication

Duplicate data wastes compute, encourages verbatim memorization, and overweights sources that are repeated many times.

Common dedup strategies:

  • exact hash dedup
  • near-duplicate detection
  • MinHash or locality-sensitive hashing
  • semantic dedup with embeddings
  • benchmark contamination scans
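
The first three strategies can be sketched with the standard library alone. This is an illustrative toy, assuming whitespace tokenization, 3-word shingles, and 64 hash functions; production pipelines typically add locality-sensitive hashing on top so that not every pair of documents has to be compared.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, k=3):
    """Set of overlapping k-word windows."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64, k=3):
    """For each salted hash function, keep the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode("utf-8")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates shingle-set Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

A near-duplicate rule would then drop a document when its estimated similarity to an already kept document exceeds some threshold; the threshold is a tuning choice that this sketch does not fix.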

Benchmark contamination

If training data contains test questions, benchmark scores become misleading.

Teams should:

  • scan for known benchmark examples
  • keep time-based holdouts
  • use private eval sets
  • test with live user-like tasks
  • avoid optimizing only public leaderboards
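
Scanning for known benchmark examples is often done with n-gram overlap. The sketch below assumes a simple word tokenizer and an 8-word window; both are illustrative choices, and the function names are hypothetical rather than taken from any particular tool.

```python
import re

def ngrams(text, n=8):
    """Set of lowercase word n-grams in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_examples, n=8):
    """Union of all n-grams that appear in any benchmark example."""
    index = set()
    for example in benchmark_examples:
        index |= ngrams(example, n)
    return index

def is_contaminated(document, index, n=8):
    """Flag a training document that shares any n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(index)

def scan(documents, benchmark_examples, n=8):
    """Split documents into (clean, flagged) lists."""
    index = build_benchmark_index(benchmark_examples, n)
    clean, flagged = [], []
    for doc in documents:
        (flagged if is_contaminated(doc, index, n) else clean).append(doc)
    return clean, flagged
```

Exact n-gram matching misses paraphrased or translated test items, which is one reason the list above also recommends private eval sets and time-based holdouts.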

Quality filtering

Quality filters can include:

  • language ID
  • toxicity filters
  • code compile/test filters
  • document structure checks
  • perplexity or classifier scores
  • LLM judge scoring
  • source reputation
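
Two of these filters are simple enough to sketch directly: a document structure check and a code filter. The thresholds below are placeholder assumptions, not tuned values, and the code filter only checks that Python source parses; it does not compile dependencies or run tests.

```python
import ast

def passes_structure_checks(text, min_words=20, max_symbol_ratio=0.3,
                            max_dup_line_ratio=0.3):
    """Cheap heuristics: enough words, not mostly symbols, few repeated lines."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    if symbols / len(non_space) > max_symbol_ratio:
        return False
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    dup_ratio = 1 - len(set(lines)) / len(lines)
    return dup_ratio <= max_dup_line_ratio

def python_code_parses(source):
    """Syntax-only check: True if the source parses as Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True

def filter_documents(docs):
    """Keep only documents that pass the structure checks."""
    return [doc for doc in docs if passes_structure_checks(doc)]
```

Heuristics like these are cheap enough to run first, leaving slower filters such as classifier scores or LLM judge scoring for the documents that survive.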

Synthetic data

Synthetic data is powerful when it is filtered and kept separate from evaluation data.

Good uses:

  • generate rare edge cases
  • create step-by-step reasoning examples
  • build tool-call traces
  • expand multilingual coverage
  • create adversarial examples for evaluation

Bad uses:

  • replacing domain experts for high-stakes labels
  • mixing synthetic test examples into training
  • trusting teacher answers without filtering
  • training a model to imitate teacher mistakes
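
Two of the bad uses above, leaking test examples into training and trusting teacher answers unfiltered, can be guarded against mechanically. The sketch below assumes a hypothetical example format (`prompt`, `a`, `b`, `teacher_answer`) and a toy arithmetic verifier; real verifiers depend on the task, for example unit tests for code or a checker for math answers.

```python
def verify_addition(example):
    """Hypothetical verifier: recompute the answer without the teacher."""
    return example["a"] + example["b"] == example["teacher_answer"]

def filter_synthetic(examples, eval_prompts, verifier):
    """Drop examples that overlap with eval prompts or fail verification."""
    eval_set = {prompt.strip().lower() for prompt in eval_prompts}
    kept = []
    for example in examples:
        if example["prompt"].strip().lower() in eval_set:
            continue  # never train on held-out eval prompts
        if not verifier(example):
            continue  # drop answers the verifier cannot confirm
        kept.append(example)
    return kept

# Illustrative, made-up teacher outputs.
synthetic = [
    {"prompt": "What is 2 + 3?", "a": 2, "b": 3, "teacher_answer": 5},
    {"prompt": "What is 7 + 8?", "a": 7, "b": 8, "teacher_answer": 16},
    {"prompt": "What is 1 + 1?", "a": 1, "b": 1, "teacher_answer": 2},
]
eval_prompts = ["what is 1 + 1?"]
train_ready = filter_synthetic(synthetic, eval_prompts, verify_addition)
```

Here only the first example survives: the second has a wrong teacher answer and the third matches a held-out eval prompt.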

Knowledge check

Q1: Why can public benchmark scores be misleading?

Because examples may leak into training data or teams may over-optimize for benchmark style.

Q2: What does deduplication reduce?

Compute waste, memorization risk, and overrepresentation of repeated sources.