Training Data Mixtures and Quality

Model quality is often data quality at scale.

Two models with the same architecture can behave very differently if their data mixtures differ.

What goes into a data mixture

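Modern LLM training data may include the sources below. Adding more of a source tends to improve one capability while introducing a specific risk: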

More of this	Improves	Risk
code	programming	license and vulnerable code issues
math	reasoning	narrow style or synthetic artifacts
multilingual data	global usefulness	uneven quality across languages
web text	broad coverage	spam and misinformation
instruction data	assistant behavior	overfitting to canned style
synthetic data	rare skills and scale	teacher-model mistakes
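
To make the idea of a mixture concrete, here is a minimal sketch that samples training sources in proportion to mixture weights. The weights in `MIXTURE` and the helper names `sample_sources` and `empirical_mixture` are illustrative assumptions, not recommended values or any particular team's pipeline.

```python
import random
from collections import Counter

# Hypothetical mixture weights (illustrative only, not recommended values).
MIXTURE = {
    "web text": 0.50,
    "code": 0.20,
    "math": 0.10,
    "multilingual data": 0.10,
    "instruction data": 0.05,
    "synthetic data": 0.05,
}


def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def empirical_mixture(samples):
    """Return the observed fraction of each source in a list of samples."""
    counts = Counter(samples)
    total = len(samples)
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    draws = sample_sources(MIXTURE, 10_000)
    for name, fraction in sorted(empirical_mixture(draws).items()):
        print(f"{name}: {fraction:.3f}")
```

Changing one weight shifts every other source's share, which is one reason two models with the same architecture can end up behaving differently.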

Duplicate data wastes compute and can cause memorization.

Common dedup strategies:
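
Two of the simplest, sketched here under stated assumptions, are exact-match deduplication on normalized text and near-duplicate detection with Jaccard similarity over word shingles. The normalization rule, the shingle size of 3, and the function names are assumptions made for illustration. Pairwise comparison does not scale to large corpora, where approximate methods such as MinHash are commonly used instead.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Return the set of n-word shingles (at least one, even for short texts)."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a, b, n=3):
    """Jaccard similarity of the shingle sets of two texts, in [0, 1]."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)
```

A near-duplicate rule would then drop a document whose `jaccard` score against an already kept document exceeds some threshold. The right threshold depends on the corpus and is not specified here.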

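If training data contains a benchmark's test questions, scores on that benchmark become misleading: the model may have memorized the answers rather than learned the underlying skill.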

Teams should:
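
One check that can be run before training is n-gram overlap between training documents and benchmark text. The sketch below is a minimal version under assumptions: the 8-word window, the tokenization, and the function names are illustrative choices, and a real pipeline would need to handle paraphrases and very short test items, which this check misses.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_benchmark_index(benchmark_items, n=8):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index


def is_contaminated(doc, index, n=8):
    """Flag a training document that shares at least one n-gram with the index."""
    return not ngrams(doc, n).isdisjoint(index)


if __name__ == "__main__":
    # Hypothetical benchmark item and training documents.
    index = build_benchmark_index(["What is the capital of France?"], n=4)
    print(is_contaminated("Quiz: what is the capital of france? Paris.", index, n=4))
    print(is_contaminated("The weather is nice today in the city", index, n=4))
```

Flagged documents can be removed or at least reported alongside the benchmark score, so readers know how much overlap existed.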

Quality filters can include:
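
Simple heuristics are one option. As a hedged sketch, the code below scores a document on length, share of alphabetic characters, and share of unique words, then applies thresholds. The signals, the threshold values, and the function names are illustrative assumptions. Production filters are usually tuned per source and often combined with learned classifiers.

```python
def quality_signals(text):
    """Compute a few cheap heuristic signals for one document."""
    words = text.split()
    n_words = len(words)
    if n_words == 0:
        return {"n_words": 0, "alpha_ratio": 0.0, "unique_ratio": 0.0}
    alpha = sum(ch.isalpha() for ch in text)
    non_space = sum(not ch.isspace() for ch in text)
    return {
        "n_words": n_words,
        # Share of non-whitespace characters that are letters.
        "alpha_ratio": alpha / non_space,
        # Share of distinct words; low values suggest repetitive spam.
        "unique_ratio": len({w.lower() for w in words}) / n_words,
    }


def passes_filter(text, min_words=5, min_alpha=0.6, min_unique=0.3):
    """Return True if the document clears every (assumed) threshold."""
    s = quality_signals(text)
    return (
        s["n_words"] >= min_words
        and s["alpha_ratio"] >= min_alpha
        and s["unique_ratio"] >= min_unique
    )
```

Thresholds like these trade recall for precision: stricter values remove more spam but also discard legitimate text such as tables, code, or short answers.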

Synthetic data is powerful when used carefully.
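
One way to use it carefully is to keep only synthetic examples that pass an automatic check, which limits the impact of teacher-model mistakes. The sketch below assumes a hypothetical example format, a dict with a `question` such as `"2 + 3"` and an integer `answer`, and verifies each answer by recomputing it. Most synthetic data is not this easy to verify, so treat it as an illustration of the pattern, not a general solution.

```python
import operator

# Supported operators for the toy arithmetic format assumed here.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def verify_arithmetic(example):
    """Return True if the claimed answer matches the recomputed one."""
    a, op, b = example["question"].split()
    return OPS[op](int(a), int(b)) == example["answer"]


def filter_synthetic(examples):
    """Keep only the examples whose answers verify."""
    return [ex for ex in examples if verify_arithmetic(ex)]


if __name__ == "__main__":
    # Hypothetical teacher-model outputs; the second answer is wrong.
    generated = [
        {"question": "2 + 3", "answer": 5},
        {"question": "4 * 6", "answer": 25},
    ]
    print(filter_synthetic(generated))
```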

Good uses:

Bad uses:

Q1: Why can public benchmark scores be misleading?

Because benchmark examples may leak into the training data, or teams may over-optimize for the benchmark's style.

Q2: What does deduplication reduce?

Compute waste, memorization risk, and overrepresentation of repeated sources.