Research Paper Roadmap: From NLP Foundations to Modern LLMs

This is a map, not a replacement for the papers.

Use this lesson to know what to read and why. Open the papers yourself, read the abstract, skim the figures, then read deeply only when the idea matters for what you are building.

How to read papers without getting lost

For each paper, answer four questions:

What problem was the paper solving?
What was the core idea?
What changed after this paper?
What should I use from it as an engineer?

Do not try to memorize every equation. First build the timeline.
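
One way to keep the four questions and the timeline together is a small notes file. The sketch below is only an illustration: `PaperNote`, `build_timeline`, and the field names are hypothetical names chosen for this example, and the sample answers are placeholders you would replace with your own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PaperNote:
    """One timeline entry. The four text fields mirror the four questions."""
    year: int
    title: str
    problem: str = ""       # What problem was the paper solving?
    core_idea: str = ""     # What was the core idea?
    what_changed: str = ""  # What changed after this paper?
    takeaway: str = ""      # What should I use from it as an engineer?

    def is_complete(self) -> bool:
        """True once all four questions have a non-empty answer."""
        return all([self.problem, self.core_idea, self.what_changed, self.takeaway])


def build_timeline(notes: List[PaperNote]) -> List[PaperNote]:
    """Return the notes ordered by year, then title, without mutating the input."""
    return sorted(notes, key=lambda note: (note.year, note.title))


# Sample entries (hypothetical answers; write your own after reading).
notes = [
    PaperNote(2018, "BERT: Pre-training of Deep Bidirectional Transformers"),
    PaperNote(
        2017,
        "Attention Is All You Need",
        problem="Recurrent models process tokens one at a time, which limits parallel training.",
        core_idea="Build the model from attention alone, with no recurrence.",
        what_changed="The Transformer became the base for later models in this roadmap.",
        takeaway="Learn self-attention first; later papers assume it.",
    ),
]

timeline = build_timeline(notes)
for note in timeline:
    status = "answered" if note.is_complete() else "to read"
    print(f"{note.year}  {note.title}  [{status}]")
```

Sorting by year first keeps the focus on the timeline; the empty fields show at a glance which papers you have not yet worked through.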

Era 1: Neural language foundations

Year	Paper	Why it matters
1986	Learning representations by back-propagating errors	Popularized backpropagation for training multi-layer neural networks.
1997	Long Short-Term Memory	Introduced gated memory cells that let recurrent networks learn long-range dependencies.
2003	A Neural Probabilistic Language Model	Early neural LM with learned word representations.
2013	Efficient Estimation of Word Representations in Vector Space	Popularized word2vec embeddings.
2014	GloVe	Used global co-occurrence statistics for word vectors.
2014	Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling	Evaluated the GRU, a simpler gated recurrent unit, against the LSTM.
2014	Sequence to Sequence Learning with Neural Networks	Showed neural encoder-decoder translation at scale.
2014	Neural Machine Translation by Jointly Learning to Align and Translate	Brought attention into sequence modeling.

Era 2: Attention and transformers

Year	Paper	Why it matters
2017	Attention Is All You Need	Introduced the Transformer architecture.
2018	Deep contextualized word representations	ELMo showed contextual embeddings beat static ones.
2018	Improving Language Understanding by Generative Pre-Training	GPT showed that generative pretraining of a Transformer decoder, followed by task fine-tuning, works across many tasks.
2018	BERT: Pre-training of Deep Bidirectional Transformers	Made masked-language-model pretraining dominant for understanding tasks.
2019	Language Models are Unsupervised Multitask Learners	GPT-2 showed that larger language models can perform tasks zero-shot.
2019	RoBERTa	Showed better training recipes can beat architecture changes.
2019	Transformer-XL	Added segment-level recurrence for longer-range transformer context.
2019	Megatron-LM	Made large transformer training across GPUs practical.
2019	T5: Exploring the Limits of Transfer Learning	Unified NLP tasks as text-to-text.

Era 3: Scaling laws, retrieval, and large models

Year	Paper	Why it matters
2020	Scaling Laws for Neural Language Models	Quantified how loss scales with compute/data/model size.
2020	Retrieval-Augmented Generation	Connected generation to retrieved external knowledge.
2020	Dense Passage Retrieval	Made dense vector retrieval central for open-domain QA.
2020	Language Models are Few-Shot Learners	GPT-3 made in-context learning famous.
2021	WebGPT	Combined browsing, citations, and human feedback.
2021	Switch Transformers	Scaled sparse mixture-of-experts training.
2021	RoFormer: Rotary Position Embedding	Introduced RoPE, now common in long-context transformer models.
2021	LoRA	Made parameter-efficient fine-tuning practical.
2021	FLAN	Showed instruction tuning improves zero-shot generalization.
2022	RETRO	Showed retrieval can reduce parameter needs.
2022	Training Compute-Optimal Large Language Models	Chinchilla showed data/model balance matters.
2022	PaLM	Scaled a dense model to 540B parameters and showed strong few-shot reasoning.

Era 4: Instruction following, reasoning, and alignment

Year	Paper	Why it matters
2022	Chain-of-Thought Prompting	Showed that prompting for intermediate reasoning steps improves results on multi-step tasks.
2022	Self-Consistency Improves Chain of Thought	Improved reasoning by sampling multiple paths.
2022	Training language models to follow instructions with human feedback	InstructGPT explained RLHF assistant training.
2022	Constitutional AI	Used written principles and AI feedback for safety tuning.
2023	Direct Preference Optimization	Simplified preference tuning without a separate reward model.
2023	QLoRA	Enabled memory-efficient fine-tuning of large models.
2023	Orca	Explored learning from rich explanation traces.
2023	Llama 2	Documented open chat models and safety tuning.
2024	RewardBench	Benchmarked reward models for alignment quality.
2024	DeepSeekMath	Introduced GRPO and used it to improve mathematical reasoning.
2025	DeepSeek-R1	Showed that reinforcement learning in post-training can elicit strong reasoning, and released open-weight reasoning models.

Era 5: Efficient attention, long context, and serving

Year	Paper	Why it matters
2021	ALiBi	Simple positional bias for length extrapolation.
2022	FlashAttention	Made attention faster and more memory efficient.
2023	FlashAttention-2	Improved attention parallelism and throughput.
2023	YaRN	Extended context length with RoPE interpolation.
2023	PagedAttention / vLLM	Improved serving memory management and throughput.
2023	Mistral 7B	Popularized sliding-window attention and grouped-query attention (GQA) in small models.
2023	Mamba	Revived state-space alternatives to attention.
2024	LongRoPE	Extended RoPE-based context windows beyond 2 million tokens.
2024	Mamba-2	Connected structured state spaces and attention mathematically.
2024	Ring Attention	Distributed long-context attention across devices.

Era 6: Agents, tools, and workflows

Year	Paper or spec	Why it matters
2022	ReAct	Combined reasoning traces with tool actions.
2023	Toolformer	Taught models to decide when and how to call external APIs and tools.
2023	Generative Agents	Simulated believable agents with memory and reflection.
2023	Reflexion	Used verbal feedback for agent self-improvement.
2023	Voyager	Showed lifelong skill learning in an agent environment.
2023	Tree of Thoughts	Explored search over reasoning paths.
2024	The AI Scientist	Automated parts of scientific ideation and experimentation.
2024	Model Context Protocol	Standardized tool/data connections for AI apps.
2025	Agent2Agent Protocol	Standardized agent-to-agent task handoff patterns.
2025	Survey of AI Agent Protocols	Compared emerging agent interoperability protocols.

Era 7: Multimodal, open models, and modern RAG

Year	Paper	Why it matters
2021	CLIP	Connected image and text representations at scale.
2022	Flamingo	Combined frozen language models with visual inputs.
2023	LLaMA	Popularized strong open-weight foundation models.
2023	LLaVA	Made visual instruction tuning accessible.
2023	Mixtral of Experts	Brought high-quality sparse MoE into open-weight use.
2023	Self-RAG	Trained models to retrieve and critique their own evidence.
2024	GraphRAG	Built a knowledge graph with community summaries to answer corpus-wide questions.
2024	ColPali	Improved document retrieval with visual document embeddings.
2024	Llama 3	Documented a modern open foundation-model recipe.
2024	DeepSeek-V3	Advanced efficient MoE training at frontier scale.

What to read first

If you only have time for ten:

Attention Is All You Need
BERT
GPT-3
RAG
Scaling Laws
Chinchilla
InstructGPT
LoRA
ReAct
DeepSeek-R1

How this connects to LearnLLM

Architecture lessons explain Transformer, RoPE, FlashAttention, MoE, and long context.
RAG lessons explain retrieval, chunking, GraphRAG, and evaluation.
Agent lessons explain ReAct, Toolformer-style tool use, MCP, A2A, memory, and control loops.
Fine-tuning lessons explain LoRA, QLoRA, DPO, RLHF, synthetic data, and post-training.

Knowledge check

Q1: Which paper introduced the Transformer?

Attention Is All You Need.

Q2: Which papers should you read for production RAG basics?

RAG, Dense Passage Retrieval, GraphRAG, and the retrieval/evaluation papers linked from those.

Q3: What should you do after reading this roadmap?

Open the papers yourself, read abstracts first, then go deeper only for the ideas you need.