1. The Reasoning Model Timeline: o1, o1-mini, o3, o4-mini
OpenAI's reasoning model series has moved fast:
| Model | Released | Key Milestone |
|---|---|---|
| o1-preview | September 2024 | First public reasoning model; 56.7% on AIME 2024 vs GPT-4o's 13.4% |
| o1-mini | September 2024 | Cost-optimized CoT model; 3× cheaper; strong coding performance |
| o1 | December 2024 | Full release; PhD-level science eval scores; $15/$60 per 1M tokens |
| o3 | December 2024 (announced); April 2025 (API) | 87.5% ARC-AGI; near-human on most reasoning benchmarks; new SOTA scores |
| o3-mini | February 2025 | Efficient reasoning at $1.10/$4.40 per 1M tokens; exceeds o1 on many tasks |
| o4-mini | April 2025 | Combines o3-level reasoning with multimodal support and tool use; best reasoning model per dollar |
The "o" naming is widely reported to stand for "OpenAI reasoning" internally, though OpenAI has not officially confirmed this etymology. The series jumped from o1 straight to o3, skipping o2 to avoid a conflict with the UK telecoms brand O2 (per Sam Altman), and the o1-to-o3 transition represented OpenAI's biggest capability jump since GPT-4.
2. How Reasoning Models Work
Standard LLMs like GPT-4o use a single forward pass to generate each token. The model looks at the input and immediately predicts what comes next based on learned patterns. This is impressive for many tasks but fails on problems requiring systematic search, multi-step planning, or self-verification.
2.1 Reinforcement Learning on Chain-of-Thought
Reasoning models are trained differently. OpenAI uses reinforcement learning with verifiable rewards — a training approach that rewards the model for reaching correct answers on problems where correctness can be objectively verified (math equations, code that passes tests, formal logic proofs).
During training, the model is encouraged to generate a chain-of-thought — a sequence of intermediate reasoning steps — before producing its final answer. Crucially, the intermediate reasoning is not supervised (not trained to match specific "correct" reasoning paths). The model discovers its own reasoning strategies through RL, optimizing purely for getting the right answer.
This produces emergent reasoning behaviors not seen in traditional supervised fine-tuning: backtracking when a path doesn't work, trying alternative approaches, checking answers against constraints, and explicitly expressing uncertainty before arriving at a conclusion.
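The verifiable-reward idea can be illustrated with a toy sketch. This is not OpenAI's training code; the function names and the binary reward are illustrative assumptions. The key point it shows: the trainer grades only the final answer, never the reasoning chain itself.

```python
# Toy illustration of a verifiable reward: only the final answer is graded;
# the (unsupervised) intermediate reasoning contributes nothing directly.

def extract_final_answer(completion: str) -> str:
    """Take the last non-empty line as the final answer (a simplifying assumption)."""
    return completion.strip().splitlines()[-1].strip()

def verifiable_reward(completion: str, expected: str) -> float:
    """1.0 if the final answer is verifiably correct, else 0.0."""
    return 1.0 if extract_final_answer(completion) == expected else 0.0

# A completion with a long reasoning chain before the answer:
completion = """Let me check small cases first.
2 is prime, 3 is prime, 4 = 2*2 is not.
So the third prime is...
5"""

print(verifiable_reward(completion, "5"))  # 1.0: answer verified correct
```

Because the reward ignores the chain entirely, the model is free to backtrack, try alternatives, or self-check in whatever way maximizes the chance of a correct final answer.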
2.2 The Scratchpad: Hidden Thinking
When you send a message to o3, the model invisibly generates a private "scratchpad" — the chain-of-thought reasoning — before producing the response visible to you. This scratchpad is not shown to the user by design: OpenAI found that showing raw reasoning output could be misleading (the model might "think out loud" in ways that don't reflect its actual internal process). The scratchpad is, however, visible at an aggregate level via the `usage.completion_tokens_details.reasoning_tokens` field in the API response.
A response that looks like a single paragraph to the user may involve 500–2,000 reasoning tokens of private deliberation — the model genuinely spent time "thinking" before writing the answer.
3. Thinking Tokens and Computation Budget
One of the most important practical aspects of reasoning models is the thinking budget — a parameter that controls how many tokens the model can spend on the hidden reasoning chain before producing its response.
3.1 The reasoning_effort Parameter
In the OpenAI API, the thinking budget for o3 and o4-mini is controlled via the `reasoning_effort` parameter:
```python
import openai

client = openai.OpenAI()

response = client.chat.completions.create(
    model="o3",
    reasoning_effort="high",  # "low", "medium", or "high"
    messages=[
        {
            "role": "user",
            "content": "Prove that there are infinitely many primes.",
        }
    ],
)

print(response.choices[0].message.content)

# Check reasoning tokens used:
print(response.usage.completion_tokens_details.reasoning_tokens)
```
3.2 Effort Levels
| Effort Level | Typical Reasoning Tokens | Latency | Best For |
|---|---|---|---|
| `low` | 100–500 | 2–5 sec | Simple reasoning, fast responses |
| `medium` | 500–2,000 | 5–15 sec | Most tasks; default balance |
| `high` | 2,000–10,000+ | 15–90 sec | Hard math, complex code, deep analysis |
Important: Reasoning tokens are billed at the same rate as output tokens. A `high`-effort request generating 8,000 reasoning tokens is significantly more expensive than a `low`-effort request. Always match effort level to task complexity.
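To make the billing note concrete, here is a rough per-request cost estimator using the list prices from Section 8. The token counts below are illustrative examples, not measurements:

```python
# Rough per-request cost. Reasoning tokens are billed at the OUTPUT rate.
PRICES_PER_1M = {  # (input, output) in USD per 1M tokens, from Section 8
    "o3": (10.00, 40.00),
    "o4-mini": (1.10, 4.40),
}

def request_cost(model: str, input_tokens: int, reasoning_tokens: int,
                 visible_output_tokens: int) -> float:
    inp, out = PRICES_PER_1M[model]
    # Reasoning and visible output are both charged at the output rate.
    return (input_tokens * inp
            + (reasoning_tokens + visible_output_tokens) * out) / 1_000_000

# Same prompt at high vs low effort on o3:
high = request_cost("o3", 1_000, 8_000, 500)  # $0.35
low = request_cost("o3", 1_000, 300, 500)     # $0.042
```

An 8,000-token reasoning chain makes the request roughly 8× more expensive than the low-effort version of the same prompt, which is why effort selection is the single biggest cost lever.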
4. o3 and o4-mini: Specifications
4.1 o3
- Context window: 200,000 tokens input, 100,000 tokens output
- Modalities: Text and images (vision input)
- Tool use: Function calling, code interpreter, file search, web search
- Knowledge cutoff: June 2024
- API availability: Tier 4+ API users, ChatGPT Pro and Plus (with limits)
- Strengths: Highest-accuracy reasoning; best for problems requiring maximum correctness
4.2 o4-mini
- Context window: 200,000 tokens input, 100,000 tokens output
- Modalities: Text and images (vision input)
- Tool use: Full tool support — identical to o3
- Knowledge cutoff: June 2024
- Strengths: Best reasoning performance per dollar; 3× faster than o3; outperforms o1 on most benchmarks despite significantly lower cost
- Primary use case: Coding, math, and reasoning tasks at scale where cost matters
5. Benchmark Performance
Benchmark scores provide the clearest objective comparison of reasoning model capability:
5.1 AIME 2024 (Competitive Math)
AIME (American Invitational Mathematics Examination) consists of 15 challenging math problems accessible in principle to high school students but requiring significant mathematical sophistication. Only the top 5% of all AMC 12 competitors qualify.
| Model | AIME 2024 Score (pass@1) |
|---|---|
| GPT-4o | 13.4% |
| o1-preview | 56.7% |
| o1 | 74.4% |
| o3 (high) | 96.7% |
| o4-mini (high) | 93.3% |
| Human expert baseline | ~60–70% |
5.2 ARC-AGI (Abstract Reasoning)
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence), created by François Chollet, tests pattern abstraction abilities that cannot be solved by memorization — each task is unique and requires genuine reasoning from first principles. It was specifically designed to be resistant to LLMs that pattern-match on training data.
| System | ARC-AGI Score | Notes |
|---|---|---|
| GPT-4o | 5% | Near chance on novel patterns |
| o1 | 32% | Significant jump; still far from human |
| o3 (low compute) | 75.7% | Surpassed all prior AI systems |
| o3 (high compute) | 87.5% | Approaches human-level average |
| Average human | 85% | Benchmark human baseline |
The ARC-AGI result was considered a landmark moment in AI — the first system to approach human performance on a benchmark specifically designed to be resistant to AI pattern matching. Chollet's subsequent response noted that the cost of o3's high-compute solution ($20–$60 per task in API costs) meant it was not yet AGI in any practical sense, but the capability threshold had clearly been crossed.
5.3 SWE-bench Verified (Software Engineering)
SWE-bench Verified tests whether models can resolve real-world GitHub issues in open-source Python repositories — reading the issue description, understanding the codebase, and producing a patch that passes the existing test suite.
| Model | SWE-bench Verified |
|---|---|
| GPT-4o | 33.2% |
| Claude 3.5 Sonnet | 49% |
| o1 | 48.9% |
| o3 (with tools) | 71.7% |
| o4-mini (with tools) | 68.1% |
5.4 GPQA Diamond (Graduate-Level Science)
GPQA Diamond contains 198 graduate-level questions in biology, chemistry, and physics — written by subject-matter experts and verified to be difficult even for PhD-level researchers who aren't specialists in the exact sub-field.
| Model | GPQA Diamond |
|---|---|
| GPT-4o | 53.6% |
| o1 | 78.0% |
| o3 | 87.7% |
| PhD-level human expert | ~70% |
6. o3 vs. GPT-4o: When to Use Which
Reasoning models are not universally better — they are better at specific things, at higher cost and latency. Use this decision framework:
| Use Case | Recommended Model | Reason |
|---|---|---|
| Complex math or science problems | o3 (high effort) | Systematic multi-step reasoning is critical |
| Hard algorithm/coding challenges | o4-mini or o3 | Better at debugging complex logic and edge cases |
| Research analysis with nuance | o3 (medium effort) | Weighs evidence more carefully |
| Summarizing documents | GPT-4o | Does not require deep reasoning; faster and cheaper |
| Simple Q&A, chat | GPT-4o-mini | Fastest and cheapest; reasoning overhead is wasted |
| Creative writing | GPT-4o | Reasoning models are less creative; overly structured output |
| Code review at scale | o4-mini | Better than GPT-4o, cheaper than o3, fast enough for pipelines |
| Legal/financial document analysis | o3 | Accuracy is critical; cost justified by stakes |
| Image understanding, OCR | GPT-4o | Standard vision, no reasoning benefit |
| Real-time product features | GPT-4o or o4-mini (low) | User-facing latency constraints |
7. Using Reasoning Models in the API
7.1 Basic Request
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="medium",
    messages=[
        {
            "role": "user",
            "content": """
Review this Python function and identify any correctness issues,
edge cases, and performance problems:

def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    return result
""",
        }
    ],
)

print(response.choices[0].message.content)
print(f"\nReasoning tokens: {response.usage.completion_tokens_details.reasoning_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
```
7.2 With Tool Use
o3 and o4-mini support full function calling — and combine it with reasoning, which dramatically improves tool selection and chaining quality compared to GPT-4o:
```python
response = client.chat.completions.create(
    model="o3",
    reasoning_effort="high",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "run_python",
                "description": "Execute Python code and return stdout/stderr.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string"}
                    },
                    "required": ["code"],
                },
            },
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Find the 10,000th prime number.",
        }
    ],
)
```
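When o3 decides to call `run_python`, the response carries a `tool_calls` entry whose `arguments` field is a JSON string. A minimal dispatcher for that schema might look like the sketch below; the `exec`-based execution is a placeholder, and real deployments should run tool code in an isolated sandbox:

```python
import contextlib
import io
import json

def dispatch_tool_call(tool_call: dict) -> str:
    """Execute a run_python tool call and return captured stdout.

    `tool_call` mirrors the shape of
    response.choices[0].message.tool_calls[n].function.
    """
    assert tool_call["name"] == "run_python"
    code = json.loads(tool_call["arguments"])["code"]
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):  # NOT a real sandbox
        exec(code, {})
    return buf.getvalue()

# Example of the shape the model produces:
call = {"name": "run_python", "arguments": json.dumps({"code": "print(2 + 3)"})}
```

The returned string is sent back as a `"tool"` role message so the model can continue reasoning over the tool's output.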
7.3 Streaming Reasoning
Streaming is supported. However, reasoning tokens are generated internally first: the stream begins only when the model starts producing visible output. Streaming therefore reduces perceived latency once the answer starts arriving, but it does not shorten the thinking time before the first visible token.
8. Pricing and Cost Management
| Model | Input (per 1M tokens) | Reasoning (per 1M) | Output (per 1M) |
|---|---|---|---|
| o4-mini | $1.10 | $4.40 | $4.40 |
| o3 | $10.00 | $40.00 | $40.00 |
| o3-mini | $1.10 | $4.40 | $4.40 |
| o1 | $15.00 | $60.00 | $60.00 |
| GPT-4o (reference) | $2.50 | N/A | $10.00 |
8.1 Cost Optimization Strategies
- Match effort to task: Use `low` for straightforward reasoning tasks. Reserve `high` for genuinely hard problems where correctness has high value.
- Use o4-mini by default: o4-mini at `medium` effort handles 90% of reasoning tasks at roughly one-tenth the cost of o3.
- Cache prompts: OpenAI prompt caching reduces input token cost by 50% for cached prefixes, which is especially valuable for reasoning models with large system prompts.
- Batch API: Use the Batch API (50% discount) for offline reasoning workloads not requiring real-time responses.
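These strategies compound. A quick way to compare them on a workload is sketched below, using the list prices from the pricing table; the 50% caching and Batch discounts are the headline rates, and exact eligibility varies, so treat the output as an estimate:

```python
def workload_cost(requests: int, input_tokens: int, reasoning_plus_output: int,
                  input_price: float, output_price: float,
                  cached_fraction: float = 0.0, batch: bool = False) -> float:
    """Total USD for a workload.

    Cached input prefixes are billed at 50% of the input rate;
    the Batch API halves the whole bill.
    """
    inp = input_tokens * input_price / 1e6
    inp = inp * (1 - cached_fraction) + inp * cached_fraction * 0.5
    out = reasoning_plus_output * output_price / 1e6
    total = requests * (inp + out)
    return total * 0.5 if batch else total

# 10,000 code reviews/day, 3K input + 4K reasoning/output tokens each:
o3_naive = workload_cost(10_000, 3_000, 4_000, 10.00, 40.00)
o4_tuned = workload_cost(10_000, 3_000, 4_000, 1.10, 4.40,
                         cached_fraction=0.8, batch=True)
```

Under these illustrative numbers, switching from naive o3 to o4-mini with caching and the Batch API cuts the daily bill by roughly 19×, which is why model choice and effort discipline dominate reasoning-model budgets.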
9. Real-World Use Cases
9.1 Automated Code Review at Scale
Engineering teams at Stripe, Notion, and Linear have integrated o4-mini into their CI/CD pipelines for automated code review. The reasoning model reviews PRs for logic errors, security vulnerabilities, and API misuse — catching issues that pattern-matching models miss because they require understanding the intent of the code. Reported false positive rates are 40% lower than GPT-4o-based review systems.
9.2 Scientific Research Assistance
Biology and chemistry researchers use o3 to interpret experimental results, generate hypotheses, and synthesize literature. The model's PhD-level performance on GPQA means it can engage with domain-specific content at a level that was previously impossible without a human expert collaborator.
9.3 Financial Modeling and Analysis
Hedge funds and accounting firms use o3 for complex multi-step financial analysis: reading 10-K filings, extracting figures, performing ratio analysis, and flagging inconsistencies across subsidiary disclosures. Tasks that took a junior analyst a full day can be processed in minutes with o3 at high effort.
9.4 Legal Document Analysis
Contract analysis, precedent research, and regulatory compliance checking all benefit from reasoning models' ability to carefully weigh conditions, exceptions, and cross-references between clauses — rather than summarizing the text superficially.
9.5 Math Education
Khan Academy and other edtech platforms use reasoning models to generate step-by-step worked solutions for complex math problems, with intermediate steps that genuinely reflect how a mathematician would approach the problem rather than pattern-matched solutions from training data.
10. Prompting Reasoning Models Effectively
Reasoning models respond differently to prompts than GPT-4o. Key guidelines:
10.1 Be Direct and Specific
Do not pad prompts with excessive instructions about how to think. The model already thinks carefully — you don't need to say "think step by step" or "reason carefully". The model does this internally. Instead, be specific about what you want as output.
10.2 Provide Constraints, Not Instructions
Instead of "carefully check all edge cases", specify the constraints: "Handle empty arrays, negative numbers, and integer overflow. Return -1 for invalid inputs." The model reasons about these constraints automatically; you just need to specify them.
10.3 Avoid Chain-of-Thought Prompts
Do not add "Let's think step by step" to prompts for o-series models. This is designed for GPT-4o (which doesn't reason internally) and can interfere with o-series models' native reasoning process, sometimes reducing performance.
10.4 Use System Prompts Sparingly
Reasoning models use the system prompt as part of their reasoning context. Short, focused system prompts work better than long ones. The model will comply with extensive instructions — it just wastes thinking tokens processing them rather than reasoning about the problem.
10.5 Request Structured Output for Verifiable Tasks
For tasks with verifiable answers (math, code), request structured output: "Return your answer as a JSON object with keys 'solution', 'confidence' (0–1), and 'verification_check'." This gives you machine-readable confidence signals alongside the answer.
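A minimal consumer of that structured format is shown below. The JSON keys follow the prompt wording above; the 0.9 acceptance threshold is an arbitrary example, not a recommendation:

```python
import json

def accept_or_escalate(raw: str, threshold: float = 0.9) -> dict:
    """Parse the model's structured answer and decide whether to trust it.

    Low-confidence answers are flagged for escalation (e.g., human review
    or a retry at higher reasoning_effort).
    """
    data = json.loads(raw)
    decision = "accept" if data["confidence"] >= threshold else "escalate"
    return {"solution": data["solution"], "decision": decision}

# A well-formed reply in the requested schema:
reply = '{"solution": "x = 4", "confidence": 0.97, "verification_check": "4*3 - 2 = 10"}'
```

This turns the model's self-reported confidence into a routing signal, so only uncertain answers incur the cost of review or a second, higher-effort pass.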
11. Limitations and Failure Modes
- High latency: o3 at `high` effort can take 60–90 seconds to respond. This disqualifies it from user-facing tasks requiring sub-3-second responses.
- Cost on high effort: A single o3 `high`-effort request on a complex problem can cost $1–$5. At scale, this is prohibitive without careful effort management.
- Overconfidence in wrong answers: Reasoning models sometimes produce confident, well-structured wrong answers on tasks that are outside their training distribution. The chain-of-thought can contain elegant but incorrect reasoning chains. Always verify outputs for high-stakes decisions.
- Not better for all tasks: Conversational tasks, creative writing, summarization, and translation do not benefit from reasoning and cost significantly more per token if run through o3/o4-mini.
- Context contamination in long sessions: In multi-turn conversations, accumulated context competes with reasoning tokens for the context window budget. Very long conversations may reduce effective reasoning quality on later turns.
- Cannot "think out loud" to user: The hidden scratchpad is not inspectable, making debugging model reasoning impossible. You can see the token count spent but not the actual content.
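For code tasks, the "always verify" advice above can be mechanized: run the model-produced function against known cases before accepting it. The sketch below is `exec`-based and unsandboxed, so it is an illustration of the pattern rather than production code:

```python
def verify_generated_function(source: str, func_name: str,
                              cases: list[tuple[tuple, object]]) -> bool:
    """Exec model-generated source and check func_name against (args, expected) pairs.

    Catches the 'confident but wrong' failure mode before outputs reach users.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOT sandboxed; isolate in production
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False

# A confidently wrong output (recurses by n - 2) vs a correct one:
bad = "def fact(n):\n    return 1 if n <= 1 else n * fact(n - 2)"
good = "def fact(n):\n    return 1 if n <= 1 else n * fact(n - 1)"
cases = [((0,), 1), ((5,), 120)]
```

A harness like this is cheap insurance: the elegant-but-incorrect reasoning chain never matters if the resulting code must pass concrete checks before use.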
12. Competing Reasoning Models
The reasoning model category has attracted every major lab:
| Model | Developer | Approach | Notable Strength |
|---|---|---|---|
| Claude 3.7 Sonnet (Extended Thinking) | Anthropic | Visible thinking mode; user can read CoT | Transparent reasoning; strong code |
| Gemini 2.0 Flash Thinking | Google DeepMind | Fast reasoning; shows thinking summary | Speed/cost ratio; multimodal reasoning |
| DeepSeek-R1 | DeepSeek (China) | Open-weight; visible step-by-step reasoning | Free to run locally; strong math |
| Grok 3 (Thinking) | xAI | Integrated reasoning in xAI platform | Real-time web access + reasoning |
| QwQ-32B | Alibaba (Qwen team) | Open-weight 32B reasoning model | Strong performance for model size |
13. Future of Reasoning Models
OpenAI's research trajectory and published work indicate several near-term directions:
- o5 and beyond: OpenAI leadership has signaled that successor reasoning models are in development, targeting further gains across reasoning benchmarks. Based on the o1-to-o3 jump, another significant performance improvement is expected.
- Real-time reasoning: Reducing thinking latency through speculative decoding and hardware optimization — making reasoning models viable for interactive use cases.
- Transparent reasoning: Anthropic's Claude Extended Thinking shows the chain-of-thought to users. OpenAI has resisted this (citing concerns about manipulability) but may offer optional transparency in research contexts.
- Multimodal reasoning: Deep integration of visual reasoning — not just analyzing images but reasoning through visual problems like diagrams, charts, and spatial puzzles — with the same depth applied to text.
- Agent-integrated reasoning: Reasoning models natively integrated with agentic frameworks — thinking not just about a single response but about multi-step plans, with the reasoning chain spanning across tool calls.
14. Frequently Asked Questions
- Do I need to add "think step by step" when using o3?
- No. Reasoning models do this internally. Explicitly prompting CoT can actually degrade performance on o-series models.
- Can I see what the model is thinking?
- Not directly. You can see the token count used for reasoning via `usage.completion_tokens_details.reasoning_tokens` in the API response, but the reasoning content is not exposed. Anthropic's Claude Extended Thinking mode does show reasoning, if that transparency is important to you.
- Why isn't o3 used for everything?
- Cost and latency. o3 is 4× more expensive than GPT-4o and 10× slower. For most everyday tasks, this tradeoff is not justified. It shines specifically on hard problems where being right matters more than being fast or cheap.
- Is o4-mini better than o3-mini?
- Yes, generally. o4-mini includes full multimodal support and outperforms o3-mini on most benchmarks at similar pricing, making o3-mini largely redundant for new projects.
- What is ARC-AGI and why does it matter?
- ARC-AGI is a benchmark designed to measure abstract reasoning from first principles — tasks that cannot be solved by memorizing training data. o3's 87.5% score was significant because the benchmark was specifically designed to resist AI systems, making it the first standardized test where AI has approached human average performance through genuine reasoning ability.
15. References & Further Reading
- OpenAI — Introducing o3 and o4-mini
- OpenAI — o1 System Card (2024)
- ARC Prize — o3 on ARC-AGI: Analysis and Implications
- Mialon et al. — GAIA Benchmark (2023)
- Papers With Code — ARC-AGI Leaderboard
- OpenAI Platform — Reasoning Models Guide
- SWE-bench Verified (2024)
Start with o4-mini at medium effort for your hardest coding or analysis tasks — compare the output quality to GPT-4o side by side. The difference on genuinely complex problems is usually immediate and striking, and at o4-mini pricing it costs only marginally more per request.