Research Paper Roadmap: From NLP Foundations to Modern LLMs

A chronological reading path of important NLP, transformer, LLM, RAG, agent, alignment, and reasoning papers with short explanations and links

45 min read · Research Papers · LLM History · Transformers · RAG

This is a map, not a replacement for the papers.

Use this lesson to know what to read and why. Open the papers yourself, read the abstract, skim the figures, then read deeply only when the idea matters for what you are building.

How to read papers without getting lost

For each paper, answer four questions:

  1. What problem was the paper solving?
  2. What was the core idea?
  3. What changed after this paper?
  4. What should I use from it as an engineer?

Do not try to memorize every equation. First build the timeline.
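
One way to keep the four answers comparable across papers is a small note record that you can sort into a timeline. The sketch below is only an illustration: the `PaperNote` class, its field names, and `build_timeline` are hypothetical, and the two example entries paraphrase rows from this roadmap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperNote:
    """One note per paper; the four text fields mirror the four questions."""
    year: int
    title: str
    problem: str       # 1. What problem was the paper solving?
    core_idea: str     # 2. What was the core idea?
    what_changed: str  # 3. What changed after this paper?
    engineer_use: str  # 4. What should I use from it as an engineer?


def build_timeline(notes):
    """Return notes sorted by year, then title."""
    return sorted(notes, key=lambda note: (note.year, note.title))


# Example entries (wording is illustrative, not quoted from the papers).
notes = [
    PaperNote(
        year=2020,
        title="Language Models are Few-Shot Learners",
        problem="Each new task needed its own fine-tuned model.",
        core_idea="Scale a decoder-only model and specify tasks in the prompt.",
        what_changed="In-context learning became a standard way to use LLMs.",
        engineer_use="Try few-shot prompts before committing to fine-tuning.",
    ),
    PaperNote(
        year=2017,
        title="Attention Is All You Need",
        problem="Recurrent models process tokens sequentially, limiting parallelism.",
        core_idea="Build the model from attention layers instead of recurrence.",
        what_changed="The Transformer became the base of later language models.",
        engineer_use="Understand attention cost to reason about context length.",
    ),
]

for note in build_timeline(notes):
    print(note.year, note.title)
```

Sorting by year first keeps the focus on the timeline; the deeper fields can stay short until a paper matters for what you are building.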

Era 1: Neural language foundations

| Year | Paper | Why it matters |
| --- | --- | --- |
| 1986 | Learning representations by back-propagating errors | Made gradient-trained neural networks practical. |
| 1997 | Long Short-Term Memory | Introduced gated memory cells that let recurrent networks carry information across long sequences, well before transformers. |
| 2003 | A Neural Probabilistic Language Model | Early neural LM with learned word representations. |
| 2013 | Efficient Estimation of Word Representations in Vector Space | Popularized word2vec embeddings. |
| 2014 | GloVe | Used global co-occurrence statistics for word vectors. |
| 2014 | GRU sequence modeling | Simplified recurrent gating before transformers. |
| 2014 | Sequence to Sequence Learning with Neural Networks | Showed neural encoder-decoder translation at scale. |
| 2014 | Neural Machine Translation by Jointly Learning to Align and Translate | Brought attention into sequence modeling. |

Era 2: Attention and transformers

| Year | Paper | Why it matters |
| --- | --- | --- |
| 2017 | Attention Is All You Need | Introduced the Transformer architecture. |
| 2018 | Deep contextualized word representations | ELMo showed contextual embeddings beat static ones. |
| 2018 | Improving Language Understanding by Generative Pre-Training | GPT showed that generative pretraining of a decoder, followed by task-specific fine-tuning, transfers well. |
| 2018 | BERT: Pre-training of Deep Bidirectional Transformers | Made masked-language-model pretraining dominant for understanding tasks. |
| 2019 | Language Models are Unsupervised Multitask Learners | GPT-2 showed that larger language models can perform tasks zero-shot. |
| 2019 | RoBERTa | Showed better training recipes can beat architecture changes. |
| 2019 | Transformer-XL | Added recurrence for longer-range transformer context. |
| 2019 | Megatron-LM | Made large transformer training across GPUs practical. |
| 2019 | T5: Exploring the Limits of Transfer Learning | Unified NLP tasks as text-to-text. |
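
The operation at the center of the Transformer is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. The following is a minimal pure-Python sketch of that formula for a single head, with no masking, batching, or learned projections. The function names and the toy matrices are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # One score per key: dot(q, k) scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [
                sum(w * v[col] for w, v in zip(weights, values))
                for col in range(len(values[0]))
            ]
        )
    return outputs


# Toy inputs (illustrative only): 2 queries, 3 key/value pairs, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(Q, K, V)
print(out)
```

Each output row is a weighted average of the value rows, so a query pulls hardest on the values whose keys it matches best.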

Era 3: Scaling laws, retrieval, and large models

| Year | Paper | Why it matters |
| --- | --- | --- |
| 2020 | Scaling Laws for Neural Language Models | Quantified how loss scales with compute, data, and model size. |
| 2020 | Retrieval-Augmented Generation | Connected generation to retrieved external knowledge. |
| 2020 | Dense Passage Retrieval | Made dense vector retrieval central for open-domain QA. |
| 2020 | Language Models are Few-Shot Learners | GPT-3 made in-context learning famous. |
| 2021 | WebGPT | Combined browsing, citations, and human feedback. |
| 2021 | Switch Transformers | Scaled sparse mixture-of-experts training. |
| 2021 | RoFormer: Rotary Position Embedding | RoPE became common for long-context transformer models. |
| 2021 | LoRA | Made parameter-efficient fine-tuning practical. |
| 2021 | FLAN | Showed instruction tuning improves zero-shot generalization. |
| 2022 | RETRO | Showed retrieval can reduce parameter needs. |
| 2022 | Training Compute-Optimal Large Language Models | Chinchilla showed that model size and training data should be scaled together. |
| 2022 | PaLM | Demonstrated strong reasoning performance at 540B-parameter scale. |
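
To make the Chinchilla row concrete, here is a back-of-the-envelope sketch. It rests on two assumptions that are widely quoted but are not exact fits from the paper: training compute is roughly 6 x parameters x tokens, and a compute-optimal run uses about 20 tokens per parameter. The function name and constants are hypothetical.

```python
import math

# Assumption: training FLOPs are approximated as C = 6 * N * D.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: rule-of-thumb ratio of training tokens to parameters.
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(compute_flops):
    """Return (params, tokens) satisfying both assumptions for a FLOP budget."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Budget of a 70B-parameter model trained on 1.4T tokens (Chinchilla's setup).
budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
params, tokens = compute_optimal_split(budget)
print(f"params ~ {params:.3g}, tokens ~ {tokens:.3g}")
```

Treat the result as a starting estimate only; the right ratio shifts once inference cost, data quality, and data availability enter the picture.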

Era 4: Instruction following, reasoning, and alignment

| Year | Paper | Why it matters |
| --- | --- | --- |
| 2022 | Chain-of-Thought Prompting | Showed intermediate reasoning helps hard tasks. |
| 2022 | Self-Consistency Improves Chain of Thought | Improved reasoning by sampling multiple paths. |
| 2022 | Training language models to follow instructions with human feedback | InstructGPT explained RLHF assistant training. |
| 2022 | Constitutional AI | Used written principles and AI feedback for safety tuning. |
| 2023 | Direct Preference Optimization | Simplified preference tuning without a separate reward model. |
| 2023 | QLoRA | Enabled memory-efficient fine-tuning of large models. |
| 2023 | Orca | Explored learning from rich explanation traces. |
| 2023 | Llama 2 | Documented open chat models and safety tuning. |
| 2024 | RewardBench | Benchmarked reward models for alignment quality. |
| 2024 | DeepSeekMath | Advanced mathematical reasoning and introduced GRPO (Group Relative Policy Optimization). |
| 2025 | DeepSeek-R1 | Made reasoning-focused post-training and open reasoning models central. |
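
The DPO row is easier to appreciate once you see how small the objective is. Below is a sketch of the DPO loss for a single preference pair, assuming you already have the summed log-probability of each response under the policy and under a frozen reference model. The function and argument names are hypothetical, and `beta=0.1` is an assumed illustrative value, not a recommendation.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each argument is the summed log-probability of a full response.
    """
    chosen_gain = policy_chosen - ref_chosen        # how much the policy upweights the chosen response
    rejected_gain = policy_rejected - ref_rejected  # how much it upweights the rejected response
    logits = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: no preference signal has been learned yet.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved toward the chosen response and away from the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

The loss falls as the policy raises the chosen response relative to the reference more than it raises the rejected one, which is why no separate reward model is needed.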

Era 5: Efficient attention, long context, and serving

| Year | Paper | Why it matters |
| --- | --- | --- |
| 2021 | ALiBi | Simple positional bias for length extrapolation. |
| 2022 | FlashAttention | Made attention faster and more memory efficient. |
| 2023 | FlashAttention-2 | Improved attention parallelism and throughput. |
| 2023 | YaRN | Extended context length with RoPE interpolation. |
| 2023 | PagedAttention / vLLM | Improved serving memory management and throughput. |
| 2023 | Mistral 7B | Popularized sliding-window attention and grouped-query attention (GQA) in small models. |
| 2023 | Mamba | Revived state-space alternatives to attention. |
| 2024 | LongRoPE | Extended context windows beyond 2 million tokens by rescaling RoPE. |
| 2024 | Mamba-2 | Connected structured state spaces and attention mathematically. |
| 2024 | Ring Attention | Distributed long-context attention across devices. |
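
ALiBi is a good first example from this era because the whole idea fits in a few lines: add a penalty to each attention score that grows linearly with the distance between query and key, using a different slope per head. The sketch below assumes the number of heads is a power of two and folds the causal mask into the same matrix for brevity. The function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads (n a power of two)."""
    return [2.0 ** (-8.0 * (head + 1) / num_heads) for head in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to attention scores before softmax.

    Entry [i][j] is -slope * (i - j) for keys at or before the query,
    and -inf for future keys (the causal mask, merged in for brevity).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])
for row in bias:
    print(row)
```

Because the penalty depends only on distance, no position embedding is learned, which is what lets the model be evaluated on sequences longer than those seen in training.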

Era 6: Agents, tools, and workflows

| Year | Paper or spec | Why it matters |
| --- | --- | --- |
| 2022 | ReAct | Combined reasoning traces with tool actions. |
| 2023 | Toolformer | Taught models to decide API/tool use. |
| 2023 | Generative Agents | Simulated believable agents with memory and reflection. |
| 2023 | Reflexion | Used verbal feedback for agent self-improvement. |
| 2023 | Voyager | Showed lifelong skill learning in an agent environment. |
| 2023 | Tree of Thoughts | Explored search over reasoning paths. |
| 2024 | The AI Scientist | Automated parts of scientific ideation and experimentation. |
| 2024 | Model Context Protocol | Standardized tool/data connections for AI apps. |
| 2025 | Agent2Agent Protocol | Standardized agent-to-agent task handoff patterns. |
| 2025 | Survey of AI Agent Protocols | Compared emerging agent interoperability protocols. |
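
The ReAct pattern of interleaving reasoning with tool actions can be sketched as a short control loop. Everything below is a stand-in for illustration: `scripted_model` replaces a real LLM call, `calculator` is a hypothetical tool, and the `Action: tool[input]` text format is an assumed convention, not a format any particular library requires.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def calculator(expression):
    """Hypothetical tool: evaluates 'a + b' or 'a * b' and returns the result as text."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


def scripted_model(transcript):
    """Stand-in for an LLM call; a real agent would send the transcript to a model."""
    if "Observation: " not in transcript:
        return "Thought: I should multiply the two numbers.\nAction: calculator[6 * 7]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: The tool returned the result.\nFinal Answer: {observation}"


def react_loop(question, model, tools, max_steps=5):
    """Alternate model steps and tool calls until a final answer or the step limit."""
    transcript = f"Question: {question}\n"
    for _step_index in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip(), transcript
        if "Action:" not in step:
            break  # the model produced neither an action nor an answer
        # Parse the "Action: tool_name[tool input]" line.
        action = step.split("Action:", 1)[1].strip()
        name, _sep, arg = action.partition("[")
        observation = tools[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return None, transcript


answer, transcript = react_loop(
    "What is 6 times 7?", scripted_model, {"calculator": calculator}
)
print(transcript)
```

The step limit is the important design choice here: without it, a model that never emits a final answer would loop indefinitely.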

Era 7: Multimodal, open models, and modern RAG

| Year | Paper | Why it matters |
| --- | --- | --- |
| 2021 | CLIP | Connected image and text representations at scale. |
| 2022 | Flamingo | Combined frozen language models with visual inputs. |
| 2023 | LLaMA | Popularized strong open-weight foundation models. |
| 2023 | LLaVA | Made visual instruction tuning accessible. |
| 2023 | Mixtral of Experts | Brought high-quality sparse MoE into open-weight use. |
| 2023 | Self-RAG | Trained models to retrieve and critique their own evidence. |
| 2024 | GraphRAG | Added graph structure for multi-hop retrieval over corpora. |
| 2024 | ColPali | Improved document retrieval with visual document embeddings. |
| 2024 | Llama 3 | Documented a modern open foundation-model recipe. |
| 2024 | DeepSeek-V3 | Advanced efficient MoE training at frontier scale. |
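
The RAG papers in this roadmap all build on a retrieval step: embed the query, score it against document embeddings, and pass the best matches to the generator. Here is a minimal sketch of the scoring part, using cosine similarity as one common choice. The three-dimensional vectors and document ids are made up; a real system would get embeddings from an encoder model and usually search them with a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k=2):
    """Return the k (doc_id, score) pairs with the highest similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings (illustrative only).
corpus = {
    "doc_attention": [0.9, 0.1, 0.0],
    "doc_retrieval": [0.1, 0.9, 0.1],
    "doc_serving": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.0]
hits = top_k(query, corpus, k=2)
for doc_id, score in hits:
    print(doc_id, round(score, 3))
```

The retrieved passages would then be placed in the prompt. Chunking, reranking, and evaluation all sit around this step.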

What to read first

If you only have time for ten:

  1. Attention Is All You Need
  2. BERT
  3. GPT-3
  4. RAG
  5. Scaling Laws
  6. Chinchilla
  7. InstructGPT
  8. LoRA
  9. ReAct
  10. DeepSeek-R1

How this connects to LearnLLM

  • Architecture lessons explain Transformer, RoPE, FlashAttention, MoE, and long context.
  • RAG lessons explain retrieval, chunking, GraphRAG, and evaluation.
  • Agent lessons explain ReAct, Toolformer-style tool use, MCP, A2A, memory, and control loops.
  • Fine-tuning lessons explain LoRA, QLoRA, DPO, RLHF, synthetic data, and post-training.

Knowledge check

Q1: Which paper introduced the Transformer?

Attention Is All You Need.

Q2: Which papers should you read for production RAG basics?

RAG, Dense Passage Retrieval, GraphRAG, and the retrieval/evaluation papers linked from those.

Q3: What should you do after reading this roadmap?

Open the papers yourself, read abstracts first, then go deeper only for the ideas you need.