Reasoning Models: When AI Thinks Before It Answers

Explore the new generation of AI models designed for deep reasoning — System 2 thinking, chain-of-thought, and models like o1, o3, and DeepSeek R1 that think before they answer

40 min read · Reasoning · Chain-of-Thought · o1 · o3

Fast Answers vs. Right Answers

Standard LLMs behave like System 1 thinkers (in Daniel Kahneman's terms): they start answering immediately, producing one token at a time, each predicted from the tokens before it. This is fast, but it means:

  • Multi-step math often contains errors
  • Multi-step logic can break mid-chain
  • The model can't "go back" and fix a mistake once a token has been emitted

Reasoning models add System 2 thinking — a deliberate, step-by-step reasoning process before the final answer.

Standard LLM:        Question → [Generate] → Answer
Reasoning Model:     Question → [Think... think... think...] → Answer
                                 (hidden reasoning trace)

This is similar to how humans solve problems: some questions we answer instinctively (2+2=?), others require careful thought (what's 17×23?).


Chain-of-Thought: The Foundation

The key insight behind reasoning models is Chain-of-Thought (CoT) prompting — making the model show its work.

Zero-Shot CoT

Simply adding "Let's think step by step" to a prompt often substantially improves accuracy on multi-step reasoning tasks, particularly with larger models:

Prompt:  "If a shirt costs $25 after a 20% discount, what was the original price? Let's think step by step."

Model:   Step 1: Let x be the original price.
         Step 2: After 20% discount: x - 0.2x = 0.8x = $25
         Step 3: x = $25 / 0.8 = $31.25
         Answer: $31.25
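
As a minimal sketch, the prompt construction and the arithmetic in the example above can be checked in a few lines of Python. The helper names `add_cot_trigger` and `original_price` are hypothetical, and no model is called here; the code only builds the prompt string and verifies the expected answer.

```python
COT_TRIGGER = "Let's think step by step."

def add_cot_trigger(question: str) -> str:
    """Append the zero-shot CoT trigger phrase to a question (hypothetical helper)."""
    return f"{question.strip()} {COT_TRIGGER}"

def original_price(discounted_price: float, discount_rate: float) -> float:
    """Invert price * (1 - rate) = discounted_price to recover the original price."""
    return discounted_price / (1 - discount_rate)

prompt = add_cot_trigger(
    "If a shirt costs $25 after a 20% discount, what was the original price?"
)
print(prompt)

# Rounded to cents to avoid floating-point noise in the display.
print(round(original_price(25, 0.20), 2))  # 31.25
```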

From Prompting → Training

Reasoning models take this further: they are trained to reason by default and to learn when a problem calls for deeper reasoning:

| Approach | How It Works | Example |
|---|---|---|
| CoT Prompting | Ask model to show work in prompt | "Let's think step by step" |
| CoT Fine-Tuning | Train on reasoning traces | Models trained on math proofs |
| Reinforcement Learning | Reward correct reasoning paths | RL on reasoning chains with automatically checked answers |
| Test-Time Compute | Model decides how long to think | o1/o3 scaling compute at inference |
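
To make "reward correct reasoning paths" concrete, here is a toy outcome-based reward of the kind that can be applied when answers are automatically checkable. It is a sketch under stated assumptions, not any lab's actual reward function: the `Answer:` line format and the function name `outcome_reward` are hypothetical choices for illustration.

```python
import re

# Assumed convention: the trace ends with a line like "Answer: $31.25".
ANSWER_PATTERN = re.compile(r"Answer:\s*\$?(-?\d+(?:\.\d+)?)")

def outcome_reward(trace: str, expected: float, tol: float = 1e-6) -> float:
    """Return 1.0 if the last 'Answer:' value in the trace matches `expected`, else 0.0."""
    matches = ANSWER_PATTERN.findall(trace)
    if not matches:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if abs(float(matches[-1]) - expected) <= tol else 0.0

good_trace = "Step 1: 0.8x = 25\nStep 2: x = 25 / 0.8\nAnswer: $31.25"
bad_trace = "Step 1: 25 * 1.2 = 30\nAnswer: $30"

print(outcome_reward(good_trace, 31.25))  # 1.0
print(outcome_reward(bad_trace, 31.25))   # 0.0
```

Note that this reward only inspects the final answer; it says nothing about whether the intermediate steps were sound.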

Reasoning Models in 2026

OpenAI o1 / o3

Key Innovation: Test-time compute scaling — the model uses more computation during inference for harder problems.

| Feature | o1 | o3 |
|---|---|---|
| Release | Late 2024 | 2025 (announced December 2024) |
| Reasoning | Strong CoT | Enhanced + tool use |
| Best For | Math, science, code | Complex multi-step reasoning |
| Speed | Slower than GPT-4o | Variable (adapts to difficulty) |
| Cost | Higher per query | Higher per query |

How it works: When given a hard problem, o1/o3 generates a hidden "reasoning trace" — potentially thousands of tokens of internal deliberation — before producing the final answer. The harder the problem, the more it thinks.

DeepSeek R1

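Key Innovation: Open-source reasoning model trained primarily with reinforcement learning. Its precursor, R1-Zero, was trained with pure RL (no supervised CoT data); R1 itself adds a small supervised "cold-start" stage before RL.

  • Achieves competitive reasoning performance at a fraction of the cost
  • Open-source: you can fine-tune it, and the smaller distilled variants can run locally
  • R1-Zero demonstrated that reasoning can emerge from RL alone, without human CoT examples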
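
Because R1's released checkpoints emit their reasoning between `<think>` and `</think>` tags before the final answer, the trace can be separated from the answer with a small parser. The sketch below assumes raw text in that format; some serving stacks already return the reasoning in a separate field, in which case this step is unnecessary. The function name `split_reasoning` is hypothetical.

```python
import re

# Matches the first <think>...</think> block, including newlines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(output: str) -> tuple[str, str]:
    """Split raw model text into (reasoning, final_answer)."""
    match = THINK_BLOCK.search(output)
    if match is None:
        # No reasoning block found: treat the whole output as the answer.
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer

raw = "<think>\n0.8x = 25, so x = 31.25\n</think>\nThe original price was $31.25."
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)
print("Answer:", answer)
```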

Other Notable Models

| Model | Lab | Approach |
|---|---|---|
| Claude with extended thinking | Anthropic | Visible reasoning traces |
| Gemini 2.5 Flash (thinking mode) | Google | Fast reasoning that can be toggled |
| Qwen QwQ | Alibaba | Open-source reasoning model |

When to Use Reasoning Models

Use Reasoning Models For:

  • Math and science problems that require multi-step calculation
  • Code generation for complex algorithms or debugging
  • Logic puzzles and constraint satisfaction
  • Research synthesis requiring connecting multiple facts
  • Legal/medical analysis where accuracy matters more than speed

Don't Use Reasoning Models For:

  • Simple lookups or Q&A — waste of compute and money
  • Real-time applications — the thinking latency is too high
  • Creative writing — reasoning doesn't help much with prose
  • Simple classification — a standard model is faster and cheaper

Practical pattern: Route to reasoning models only when needed. Use a fast model (GPT-4o-mini, Claude Haiku) for the bulk of queries (roughly 90%), and escalate to a reasoning model only when the fast model's confidence is low, for example when its answer fails a self-check or its token probabilities are weak.
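
The routing pattern can be sketched as a small function that tries the fast model first and escalates below a confidence threshold. Everything here is an assumption for illustration: the two model callables are stubs rather than real API clients, the `threshold` of 0.7 is arbitrary, and how a confidence score is obtained in practice (self-reported score, token log-probabilities, a separate classifier) is left open.

```python
from typing import Callable

# Hypothetical interfaces: the fast model returns (answer, confidence in [0, 1]).
FastModel = Callable[[str], tuple[str, float]]
ReasoningModel = Callable[[str], str]

def route(
    query: str,
    fast_model: FastModel,
    reasoning_model: ReasoningModel,
    threshold: float = 0.7,
) -> tuple[str, str]:
    """Return (answer, route_taken), escalating when fast-model confidence is low."""
    answer, confidence = fast_model(query)
    if confidence >= threshold:
        return answer, "fast"
    return reasoning_model(query), "reasoning"

# Stub models standing in for real API calls.
def stub_fast(query: str) -> tuple[str, float]:
    hard = any(word in query.lower() for word in ("prove", "optimize", "debug"))
    return "fast answer", 0.3 if hard else 0.95

def stub_reasoning(query: str) -> str:
    return "carefully reasoned answer"

print(route("What is the capital of France?", stub_fast, stub_reasoning))
print(route("Prove that a square maximizes area.", stub_fast, stub_reasoning))
```

One consequence of this design is that escalated queries pay for both calls, so the threshold trades extra cost on hard queries against savings on easy ones.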


Tradeoffs: Accuracy vs. Latency vs. Cost

| Dimension | Standard LLM | Reasoning Model |
|---|---|---|
| Accuracy | Good for easy tasks | Excellent for hard tasks |
| Latency | 1-5 seconds | 10-120+ seconds |
| Cost | $0.01-0.10/query | $0.10-2.00/query |
| Best Use | Breadth (many queries) | Depth (hard problems) |
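
Much of the cost gap comes from the hidden reasoning trace itself. The sketch below estimates per-query cost under the assumption that reasoning tokens are billed like output tokens; the prices and token counts are made-up placeholders, not any provider's actual rates, and the `Pricing` class and `query_cost` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Pricing:
    """Hypothetical prices in dollars per million tokens."""
    input_per_million: float
    output_per_million: float

def query_cost(
    pricing: Pricing,
    input_tokens: int,
    answer_tokens: int,
    reasoning_tokens: int = 0,
) -> float:
    """Estimate cost, assuming reasoning tokens are billed as output tokens."""
    billed_output = answer_tokens + reasoning_tokens
    total = (
        input_tokens * pricing.input_per_million
        + billed_output * pricing.output_per_million
    )
    return total / 1_000_000

# Placeholder numbers chosen only to illustrate the shape of the tradeoff.
standard = Pricing(input_per_million=1.0, output_per_million=4.0)
reasoning = Pricing(input_per_million=10.0, output_per_million=40.0)

cost_standard = query_cost(standard, input_tokens=500, answer_tokens=300)
cost_reasoning = query_cost(
    reasoning, input_tokens=500, answer_tokens=300, reasoning_tokens=5000
)
print(f"standard:  ${cost_standard:.4f}")
print(f"reasoning: ${cost_reasoning:.4f}")
```

With these placeholder values, most of the reasoning model's cost comes from the 5,000 reasoning tokens rather than from the visible answer.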

Try It: Compare System 1 vs System 2

Try this problem with both a standard model and a reasoning model:

Problem: "A farmer has 100 meters of fencing. What is the largest rectangular area they can enclose, and what are the dimensions?"

  1. Ask a standard model (e.g., GPT-4o-mini, Claude Haiku) — note any errors
  2. Ask a reasoning model (e.g., o1, DeepSeek R1) — compare the reasoning trace
  3. Verify: The answer should be 625 m² with dimensions 25m × 25m (a square)
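
The expected answer in step 3 can be checked numerically. With perimeter 100 m, width w and length 50 − w give area w(50 − w), which peaks at w = 25. The sketch below confirms this by a simple grid search; the function name `best_rectangle` and the grid resolution are arbitrary choices for illustration.

```python
def best_rectangle(perimeter: float, steps: int = 10_000) -> tuple[float, float, float]:
    """Grid-search the width that maximizes area for a fixed perimeter.

    Returns (width, length, area).
    """
    half = perimeter / 2  # width + length must equal half the perimeter
    widths = (half * i / steps for i in range(steps + 1))
    width = max(widths, key=lambda w: w * (half - w))
    length = half - width
    return width, length, width * length

width, length, area = best_rectangle(100)
print(width, length, area)  # 25.0 25.0 625.0
```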

Reflection: Did the standard model make a mistake, and if so, where? What did the reasoning model do differently?


Key Takeaways

  • Reasoning models add System 2 thinking — deliberate step-by-step reasoning before answering
  • Chain-of-thought is the foundation, but reasoning models go further with RL and test-time compute
  • Key models: OpenAI o1/o3 (test-time compute), DeepSeek R1 (open-source RL reasoning)
  • Tradeoff: reasoning models are more accurate but slower and more expensive
  • Best practice: route intelligently — use fast models by default, reasoning models for hard problems