Fast Answers vs. Right Answers
Standard LLMs behave like System 1 thinkers: they generate tokens one at a time, each a prediction of the most likely next token given everything written so far. This is fast, but it means:
- Complex math often has errors
- Multi-step logic can break mid-chain
- The model can't "go back" and fix a mistake
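The token-by-token loop behind these failure modes can be sketched with a toy example. The probability table below is entirely hypothetical (it is not taken from any real model) and is rigged so that the wrong answer to 17 x 23 is slightly more likely than the right one; the point is only that greedy decoding commits to each token and never revisits it.

```python
# Toy next-token table. All probabilities are made up for illustration.
NEXT_TOKEN_PROBS = {
    "<start>": {"17": 1.0},
    "17": {"x": 1.0},
    "x": {"23": 1.0},
    "23": {"=": 1.0},
    "=": {"381": 0.6, "391": 0.4},  # wrong answer rigged to be more likely
    "381": {"<end>": 1.0},
    "391": {"<end>": 1.0},
}


def greedy_decode(start="<start>", max_tokens=10):
    """Pick the most likely next token at every step, with no backtracking."""
    tokens = []
    current = start
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[current]
        current = max(candidates, key=candidates.get)
        if current == "<end>":
            break
        tokens.append(current)  # once appended, a token is never revised
    return tokens


print(" ".join(greedy_decode()))  # prints: 17 x 23 = 381
```

Real models sample from much larger distributions, but the same property holds: nothing in this loop checks the emitted answer or goes back to change it.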
Reasoning models add System 2 thinking — a deliberate, step-by-step reasoning process before the final answer.
Standard LLM: Question → [Generate] → Answer
Reasoning Model: Question → [Think... think... think... (hidden reasoning trace)] → Answer
This is similar to how humans solve problems: some questions we answer instinctively (2+2=?), others require careful thought (what's 17×23?).
Chain-of-Thought: The Foundation
The key insight behind reasoning models is Chain-of-Thought (CoT) prompting — making the model show its work.
Zero-Shot CoT
Simply adding "Let's think step by step" to a prompt often improves accuracy substantially on multi-step reasoning tasks:
Prompt: "If a shirt costs $25 after a 20% discount, what was the original price? Let's think step by step."
Model: Step 1: Let x be the original price.
Step 2: After 20% discount: x - 0.2x = 0.8x = $25
Step 3: x = $25 / 0.8 = $31.25
Answer: $31.25
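As a minimal sketch, the prompt construction and the arithmetic in the example above can be reproduced in a few lines of Python. The helper names `with_zero_shot_cot` and `original_price` are made up for this illustration; no model is called here, the code only builds the prompt string and checks the expected answer.

```python
COT_TRIGGER = "Let's think step by step."


def with_zero_shot_cot(question: str) -> str:
    """Append the zero-shot CoT trigger phrase to a question."""
    return f"{question.strip()} {COT_TRIGGER}"


def original_price(discounted_price: float, discount_rate: float) -> float:
    """Invert a percentage discount: discounted = original * (1 - rate)."""
    return discounted_price / (1.0 - discount_rate)


prompt = with_zero_shot_cot(
    "If a shirt costs $25 after a 20% discount, what was the original price?"
)
print(prompt)

# Independent check of the model's answer in the example: 25 / 0.8
print(round(original_price(25.0, 0.20), 2))
```

Having an independent check like `original_price` is useful when comparing prompts, because a fluent step-by-step trace can still end in a wrong number.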
From Prompting → Training
Reasoning models take this further: instead of relying on the prompt, they are trained to reason by default and to learn when a problem calls for deeper reasoning:
| Approach | How It Works | Example |
|---|---|---|
| CoT Prompting | Ask model to show work in prompt | "Let's think step by step" |
| CoT Fine-Tuning | Train on reasoning traces | Models trained on math proofs |
| Reinforcement Learning | Reward reasoning paths that reach correct answers | RL with rule-based accuracy rewards (e.g., DeepSeek R1) |
| Test-Time Compute | Model decides how long to think | o1/o3 scaling compute at inference |
Reasoning Models in 2026
OpenAI o1 / o3
Key Innovation: Test-time compute scaling — the model uses more computation during inference for harder problems.
| Feature | o1 | o3 |
|---|---|---|
| Release | Late 2024 | 2025 |
| Reasoning | Strong CoT | Enhanced + tool use |
| Best For | Math, science, code | Complex multi-step reasoning |
| Speed | Slower than GPT-4o | Variable (adapts to difficulty) |
| Cost | Higher per query | Higher per query |
How it works: When given a hard problem, o1 and o3 generate a hidden "reasoning trace", potentially thousands of tokens of internal deliberation, before producing the final answer. Harder problems generally get longer traces.
DeepSeek R1
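Key Innovation: Open-source reasoning model trained primarily with reinforcement learning. Its R1-Zero variant used pure RL with no supervised CoT data, while R1 itself adds a small supervised "cold-start" stage before RL.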
- Achieves competitive reasoning performance at a fraction of the cost
- Open-source — you can run it locally or fine-tune it
- Demonstrated that reasoning can emerge from RL alone, without human CoT examples
Other Notable Models
| Model | Lab | Approach |
|---|---|---|
| Claude with extended thinking | Anthropic | Visible reasoning traces |
| Gemini 2.5 Flash Thinking | Google | Fast reasoning with toggles |
| Qwen QwQ | Alibaba | Open-source reasoning model |
When to Use Reasoning Models
Use Reasoning Models When:
- Math and science problems that require multi-step calculation
- Code generation for complex algorithms or debugging
- Logic puzzles and constraint satisfaction
- Research synthesis requiring connecting multiple facts
- Legal/medical analysis where accuracy matters more than speed
Don't Use Reasoning Models When:
- Simple lookups or Q&A — waste of compute and money
- Real-time applications — the thinking latency is too high
- Creative writing — reasoning doesn't help much with prose
- Simple classification — a standard model is faster and cheaper
Practical pattern: Route to reasoning models only when needed. Use a fast model (GPT-4o-mini, Claude Haiku) for 90% of queries, and escalate to a reasoning model only when the fast model's confidence is low.
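The routing pattern above might look like the following sketch. Everything here is an assumption made for illustration: `ModelReply`, `route`, the stub models, and the 0.8 threshold are hypothetical, and real APIs do not return a confidence score directly (you would have to derive one, for example from token log-probabilities or a self-rating prompt).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelReply:
    text: str
    confidence: float  # assumed to lie in [0, 1]; how to estimate it is up to you


ModelFn = Callable[[str], ModelReply]


def route(query: str, fast_model: ModelFn, reasoning_model: ModelFn,
          threshold: float = 0.8) -> ModelReply:
    """Try the fast model first; escalate only if its confidence is low."""
    first_try = fast_model(query)
    if first_try.confidence >= threshold:
        return first_try
    return reasoning_model(query)


# Stub models standing in for real API calls (hypothetical behavior).
def stub_fast_model(query: str) -> ModelReply:
    # Pretend short queries are easy and long ones are hard.
    confidence = 0.95 if len(query.split()) <= 6 else 0.40
    return ModelReply(text=f"[fast] {query}", confidence=confidence)


def stub_reasoning_model(query: str) -> ModelReply:
    return ModelReply(text=f"[reasoning] {query}", confidence=0.90)


for q in ["What is the capital of France?",
          "Prove that the sum of two odd integers is always even."]:
    print(route(q, stub_fast_model, stub_reasoning_model).text)
```

With these stubs, the first query stays on the fast model and the second is escalated. In practice the threshold would need tuning against a labeled sample of your own traffic.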
Tradeoffs: Accuracy vs. Latency vs. Cost
| Dimension | Standard LLM | Reasoning Model |
|---|---|---|
| Accuracy | Good for easy tasks | Excellent for hard tasks |
| Latency (rough) | 1-5 seconds | 10-120+ seconds |
| Cost (rough) | $0.01-0.10/query | $0.10-2.00/query |
| Best Use | Breadth (many queries) | Depth (hard problems) |
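To see what these ranges imply for a routed setup, here is a back-of-the-envelope sketch. It assumes every query first pays for a fast-model call and that 10% of queries are then escalated to a reasoning model; the function name and the 10% rate are illustrative assumptions, and the per-query costs are simply the endpoints of the ranges in the table.

```python
def expected_cost_per_query(fast_cost: float, reasoning_cost: float,
                            escalation_rate: float) -> float:
    """Average cost when every query hits the fast model and a fraction
    of them is additionally sent to the reasoning model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return fast_cost + escalation_rate * reasoning_cost


# Low and high ends of the table's cost ranges, with 10% escalation.
low = expected_cost_per_query(0.01, 0.10, 0.10)
high = expected_cost_per_query(0.10, 2.00, 0.10)
print(f"routed: ${low:.2f} to ${high:.2f} per query")
# prints: routed: $0.02 to $0.30 per query
```

Under these assumptions the routed average stays far below the $0.10-2.00 cost of sending every query to a reasoning model, which is the economic argument for routing.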
Try It: Compare System 1 vs System 2
Try this problem with both a standard model and a reasoning model:
Problem: "A farmer has 100 meters of fencing. What is the largest rectangular area they can enclose, and what are the dimensions?"
- Ask a standard model (e.g., GPT-4o-mini, Claude Haiku) — note any errors
- Ask a reasoning model (e.g., o1, DeepSeek R1) — compare the reasoning trace
- Verify: The answer should be 625 m² with dimensions 25m × 25m (a square)
Reflection: Where did the standard model go wrong? What did the reasoning model do differently?
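To check either model's answer independently, note that a rectangle with perimeter 100 m has width + height = 50 m, so its area is A(w) = w(50 - w), which peaks at w = 25. The short brute-force sketch below confirms this numerically; the function name and step count are arbitrary choices for illustration.

```python
def best_rectangle(perimeter: float, steps: int = 1000):
    """Scan candidate widths and return (width, height, area) with the largest area."""
    half = perimeter / 2  # width + height must equal half the perimeter
    best = (0.0, 0.0, 0.0)
    for i in range(steps + 1):
        width = half * i / steps
        height = half - width
        area = width * height
        if area > best[2]:
            best = (width, height, area)
    return best


width, height, area = best_rectangle(100)
print(f"{width} m x {height} m = {area} m^2")
# prints: 25.0 m x 25.0 m = 625.0 m^2
```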
Key Takeaways
- Reasoning models add System 2 thinking — deliberate step-by-step reasoning before answering
- Chain-of-thought is the foundation, but reasoning models go further with RL and test-time compute
- Key models: OpenAI o1/o3 (test-time compute), DeepSeek R1 (open-source RL reasoning)
- Tradeoff: reasoning models are more accurate but slower and more expensive
- Best practice: route intelligently — use fast models by default, reasoning models for hard problems