Research Paper Roadmap: From NLP Foundations to Modern LLMs
This is a map, not a replacement for the papers.
Use this lesson to know what to read and why. Open the papers yourself, read the abstract, skim the figures, then read deeply only when the idea matters for what you are building.
How to read papers without getting lost
For each paper, answer four questions:
- What problem was the paper solving?
- What was the core idea?
- What changed after this paper?
- What should I use from it as an engineer?
Do not try to memorize every equation. First build the timeline.
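The four questions and the timeline-first advice above can be kept as small structured notes. Here is a minimal sketch in Python. The names `PaperNote` and `build_timeline` are hypothetical, chosen for illustration, and are not part of this lesson or any library.

```python
from dataclasses import dataclass


@dataclass
class PaperNote:
    """One reading note holding answers to the four questions."""
    year: int
    title: str
    problem: str = ""       # What problem was the paper solving?
    core_idea: str = ""     # What was the core idea?
    what_changed: str = ""  # What changed after this paper?
    takeaway: str = ""      # What should I use from it as an engineer?

    def is_complete(self) -> bool:
        # A note counts as complete once all four answers are non-empty.
        return all([self.problem, self.core_idea, self.what_changed, self.takeaway])


def build_timeline(notes):
    """Return notes ordered by year; sorted() is stable, so ties keep input order."""
    return sorted(notes, key=lambda n: n.year)


# Example entries taken from the Era 1 and Era 2 tables; answers left blank.
notes = [
    PaperNote(2017, "Attention Is All You Need"),
    PaperNote(1997, "Long Short-Term Memory"),
    PaperNote(2013, "Efficient Estimation of Word Representations in Vector Space"),
]

for note in build_timeline(notes):
    status = "done" if note.is_complete() else "todo"
    print(f"{note.year}  {note.title}  [{status}]")
```

Starting with empty answers and filling them in as you read keeps the timeline visible before any paper has been studied in depth.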
Era 1: Neural language foundations
| Year | Paper | Why it matters |
|---|---|---|
| 1986 | Learning representations by back-propagating errors | Made gradient-trained neural networks practical. |
| 1997 | Long Short-Term Memory | Introduced gated memory cells that let recurrent networks carry information across long sequences. |
| 2003 | A Neural Probabilistic Language Model | Early neural LM with learned word representations. |
| 2013 | Efficient Estimation of Word Representations in Vector Space | Popularized word2vec embeddings. |
| 2014 | GloVe | Used global co-occurrence statistics for word vectors. |
| 2014 | GRU sequence modeling | Offered a simpler gated recurrent unit (GRU) as an alternative to the LSTM. |
| 2014 | Sequence to Sequence Learning with Neural Networks | Showed neural encoder-decoder translation at scale. |
| 2014 | Neural Machine Translation by Jointly Learning to Align and Translate | Brought attention into sequence modeling. |
Era 2: Attention and transformers
| Year | Paper | Why it matters |
|---|---|---|
| 2017 | Attention Is All You Need | Introduced the Transformer architecture. |
| 2018 | Deep contextualized word representations | ELMo showed contextual embeddings beat static ones. |
| 2018 | Improving Language Understanding by Generative Pre-Training | GPT showed that generative pretraining of a decoder-only Transformer, followed by task fine-tuning, transfers well. |
| 2018 | BERT: Pre-training of Deep Bidirectional Transformers | Made masked-language-model pretraining dominant for understanding tasks. |
| 2019 | Language Models are Unsupervised Multitask Learners | GPT-2 showed that a larger language model trained on web text can perform tasks zero-shot, without task-specific fine-tuning. |
| 2019 | RoBERTa | Showed better training recipes can beat architecture changes. |
| 2019 | Transformer-XL | Added recurrence for longer-range transformer context. |
| 2019 | Megatron-LM | Made large transformer training across GPUs practical. |
| 2019 | T5: Exploring the Limits of Transfer Learning | Unified NLP tasks as text-to-text. |
Era 3: Scaling laws, retrieval, and large models
| Year | Paper | Why it matters |
|---|---|---|
| 2020 | Scaling Laws for Neural Language Models | Quantified how loss scales with compute/data/model size. |
| 2020 | Retrieval-Augmented Generation | Connected generation to retrieved external knowledge. |
| 2020 | Dense Passage Retrieval | Made dense vector retrieval central for open-domain QA. |
| 2020 | Language Models are Few-Shot Learners | GPT-3 made in-context learning famous. |
| 2021 | WebGPT | Combined browsing, citations, and human feedback. |
| 2021 | Switch Transformers | Scaled sparse mixture-of-experts training. |
| 2021 | RoFormer: Rotary Position Embedding | RoPE became common for long-context transformer models. |
| 2021 | LoRA | Made parameter-efficient fine-tuning practical. |
| 2021 | FLAN | Showed instruction tuning improves zero-shot generalization. |
| 2022 | RETRO | Showed retrieval can reduce parameter needs. |
| 2022 | Training Compute-Optimal Large Language Models | Chinchilla showed data/model balance matters. |
| 2022 | PaLM | Scaled a dense decoder-only model to 540B parameters and reported strong few-shot reasoning results. |
Era 4: Instruction following, reasoning, and alignment
| Year | Paper | Why it matters |
|---|---|---|
| 2022 | Chain-of-Thought Prompting | Showed intermediate reasoning helps hard tasks. |
| 2022 | Self-Consistency Improves Chain of Thought | Improved reasoning by sampling multiple paths. |
| 2022 | Training language models to follow instructions with human feedback | InstructGPT described the RLHF pipeline (supervised fine-tuning, reward model, PPO) for instruction-following assistants. |
| 2022 | Constitutional AI | Used written principles and AI feedback for safety tuning. |
| 2023 | Direct Preference Optimization | Simplified preference tuning without a separate reward model. |
| 2023 | QLoRA | Enabled memory-efficient fine-tuning of large models. |
| 2023 | Orca | Explored learning from rich explanation traces. |
| 2023 | Llama 2 | Documented open chat models and safety tuning. |
| 2024 | RewardBench | Benchmarked reward models for alignment quality. |
| 2024 | DeepSeekMath | Introduced GRPO, a PPO variant without a separate critic model, for mathematical reasoning. |
| 2025 | DeepSeek-R1 | Showed that large-scale reinforcement learning can elicit long-form reasoning, and released open-weight reasoning models. |
Era 5: Efficient attention, long context, and serving
| Year | Paper | Why it matters |
|---|---|---|
| 2021 | ALiBi | Simple positional bias for length extrapolation. |
| 2022 | FlashAttention | Made attention faster and more memory efficient. |
| 2023 | FlashAttention-2 | Improved attention parallelism and throughput. |
| 2023 | YaRN | Extended context length with RoPE interpolation. |
| 2023 | PagedAttention / vLLM | Improved serving memory management and throughput. |
| 2023 | Mistral 7B | Popularized sliding-window attention and grouped-query attention (GQA) in small models. |
| 2023 | Mamba | Revived state-space alternatives to attention. |
| 2024 | LongRoPE | Extended RoPE-based context windows beyond 2 million tokens. |
| 2024 | Mamba-2 | Connected structured state spaces and attention mathematically. |
| 2024 | Ring Attention | Distributed long-context attention across devices. |
Era 6: Agents, tools, and workflows
| Year | Paper or spec | Why it matters |
|---|---|---|
| 2022 | ReAct | Combined reasoning traces with tool actions. |
| 2023 | Toolformer | Taught a model, through self-supervised data, to decide when and how to call APIs. |
| 2023 | Generative Agents | Simulated believable agents with memory and reflection. |
| 2023 | Reflexion | Used verbal feedback for agent self-improvement. |
| 2023 | Voyager | Showed lifelong skill learning in an agent environment. |
| 2023 | Tree of Thoughts | Explored search over reasoning paths. |
| 2024 | The AI Scientist | Automated parts of scientific ideation and experimentation. |
| 2024 | Model Context Protocol | Standardized tool/data connections for AI apps. |
| 2025 | Agent2Agent Protocol | Standardized agent-to-agent task handoff patterns. |
| 2025 | Survey of AI Agent Protocols | Compared emerging agent interoperability protocols. |
Era 7: Multimodal, open models, and modern RAG
| Year | Paper | Why it matters |
|---|---|---|
| 2021 | CLIP | Connected image and text representations at scale. |
| 2022 | Flamingo | Combined frozen language models with visual inputs. |
| 2023 | LLaMA | Popularized strong open-weight foundation models. |
| 2023 | LLaVA | Made visual instruction tuning accessible. |
| 2023 | Mixtral of Experts | Brought high-quality sparse MoE into open-weight use. |
| 2023 | Self-RAG | Trained models to retrieve and critique their own evidence. |
| 2024 | GraphRAG | Built an entity graph with community summaries to answer questions that span a whole corpus. |
| 2024 | ColPali | Improved document retrieval with visual document embeddings. |
| 2024 | Llama 3 | Documented a modern open foundation-model recipe. |
| 2024 | DeepSeek-V3 | Advanced efficient MoE training at frontier scale. |
What to read first
If you only have time for ten:
- Attention Is All You Need
- BERT
- GPT-3
- RAG
- Scaling Laws
- Chinchilla
- InstructGPT
- LoRA
- ReAct
- DeepSeek-R1
How this connects to LearnLLM
- Architecture lessons explain Transformer, RoPE, FlashAttention, MoE, and long context.
- RAG lessons explain retrieval, chunking, GraphRAG, and evaluation.
- Agent lessons explain ReAct, Toolformer-style tool use, MCP, A2A, memory, and control loops.
- Fine-tuning lessons explain LoRA, QLoRA, DPO, RLHF, synthetic data, and post-training.
Knowledge check
Q1: Which paper introduced the Transformer?
Attention Is All You Need.
Q2: Which papers should you read for production RAG basics?
RAG, Dense Passage Retrieval, GraphRAG, and the retrieval/evaluation papers linked from those.
Q3: What should you do after reading this roadmap?
Open the papers yourself, read abstracts first, then go deeper only for the ideas you need.