- Test-time compute is a new scaling axis: instead of a bigger model, spend more compute *while answering* — sample more, search more, or think longer.
- Two families: parallel (best-of-N, self-consistency, verifier search) and sequential (long chain-of-thought that self-corrects and backtracks).
- Snell et al. showed that compute-optimal test-time scaling can let a small model beat one 14x larger on math problems within its reach, and can be more than 4x more efficient than naive best-of-N.
- It's the engine behind o1, R1 and the 2025 reasoning boom — but it has limits: diminishing returns, verifier overfitting, and even inverse scaling.
What is test-time compute?
For a decade, the way to make a language model smarter was to make it bigger: more parameters, more data, more training compute. Test-time compute (also called inference-time scaling) is a different lever entirely. You keep the model fixed and instead spend more compute at the moment it answers — generating many candidate solutions, searching over reasoning paths, or letting the model think for longer before it commits to a reply.
The intuition is human: asked a hard question, you do better if you can scribble, try a few approaches, check your work, and revise — rather than blurting the first thing that comes to mind. Test-time compute gives a model that same option, trading wall-clock seconds and tokens for accuracy.
Why it matters: a second scaling axis
Pretraining scaling laws (more params + more data + more train FLOPs → lower loss) are hitting practical walls: high-quality text is finite and frontier training runs cost hundreds of millions of dollars. Test-time compute opened a second, orthogonal axis. As OpenAI’s Noam Brown put it when o1 launched, “We’re no longer bottlenecked by pretraining. We can now scale inference compute too.”
The headline empirical result comes from Snell et al. (2024) (Google DeepMind / UC Berkeley): with a compute-optimal strategy that adapts to question difficulty, a small model using extra inference compute can outperform a model 14x larger on math problems within its reach, while being more than 4x more efficient than a naive best-of-N baseline.
The two families of methods
Almost every test-time technique falls under a proposer–verifier view: a proposer generates candidate solutions (or steps), and a verifier (a vote, a reward model, or a checker) decides what to keep. The compute can be spent in parallel, sequentially, or both.
Best-of-N: sample $N$ independent answers at non-zero temperature, score each with a verifier or reward model, and return the highest-scoring one. It is simple, embarrassingly parallel, and a strong baseline, but it needs a good scorer to pick the winner.
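A minimal sketch of best-of-N selection, to make the sample-score-select loop concrete. The `sample` and `score` callables are hypothetical stand-ins for a stochastic model call and a verifier or reward model; nothing here is tied to a particular library.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],        # hypothetical: one stochastic model call
    score: Callable[[str, str], float],  # hypothetical: verifier or reward model
    n: int = 8,
) -> Tuple[str, float]:
    """Draw n independent candidates and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The n calls are independent, so in practice they can run in parallel.
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    scores = [score(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]
```

The quality of the result is bounded by `score`: if the scorer ranks a wrong answer highest, extra samples only give it more wrong answers to prefer.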
Self-consistency: sample $N$ reasoning chains and return the most common final answer. No verifier is required; the method relies on wisdom-of-crowds across independent reasoning paths. On many benchmarks plain majority voting is surprisingly competitive with verifier-based best-of-N.
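Majority voting can be sketched in a few lines. This assumes each sampled chain ends with a line of the form `Answer: ...`; that convention, and the `sample_chain` callable, are illustrative assumptions rather than part of any specific system.

```python
from collections import Counter
from typing import Callable, Tuple


def extract_final_answer(chain: str) -> str:
    """Assumed convention: the final answer follows the last 'Answer:' marker."""
    return chain.rsplit("Answer:", 1)[-1].strip()


def self_consistency(
    prompt: str,
    sample_chain: Callable[[str], str],  # hypothetical: one sampled reasoning chain
    n: int = 16,
) -> Tuple[str, float]:
    """Return the most common final answer and the share of votes it received."""
    answers = [extract_final_answer(sample_chain(prompt)) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n
```

The vote share is a rough confidence signal, but it only measures agreement: a model that makes the same mistake consistently will still win the vote.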
Verifier-guided search: use a process reward model (PRM) that scores each reasoning step, and search the tree of partial solutions with beam search or lookahead, pruning bad branches early. It is more compute-efficient than best-of-N on hard problems, but it can overfit the verifier on easy ones. See reward models.
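The sketch below shows one simple form of step-level beam search against a PRM. All four callables (`propose_steps`, `prm_score`, `is_complete`) and the default widths are hypothetical; real systems differ in how they score partial solutions and when they stop.

```python
from typing import Callable, List, Tuple

Steps = Tuple[str, ...]


def prm_beam_search(
    question: str,
    propose_steps: Callable[[str, Steps, int], List[str]],  # hypothetical proposer
    prm_score: Callable[[str, Steps], float],               # hypothetical process reward model
    is_complete: Callable[[Steps], bool],                   # hypothetical termination check
    beam_width: int = 2,
    expansions: int = 3,
    max_depth: int = 4,
) -> Steps:
    """Keep the beam_width best partial solutions at each depth, ranked by the PRM."""
    beams: List[Steps] = [()]
    for _ in range(max_depth):
        candidates: List[Steps] = []
        for partial in beams:
            if partial and is_complete(partial):
                candidates.append(partial)  # carry finished solutions forward unchanged
                continue
            for step in propose_steps(question, partial, expansions):
                candidates.append(partial + (step,))
        if not candidates:
            break
        # Pruning: everything outside the top beam_width is dropped here.
        candidates.sort(key=lambda s: prm_score(question, s), reverse=True)
        beams = candidates[:beam_width]
        if all(is_complete(b) for b in beams):
            break
    return max(beams, key=lambda s: prm_score(question, s))
```

Because every pruning decision trusts `prm_score`, any systematic quirk in the PRM is amplified by the search, which is one way to read the verifier-overfitting caveat.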
Long chain-of-thought: let the model produce one long chain of thought that self-corrects by proposing an approach, catching its own mistakes, backtracking, and trying again. This is what o1 and R1 learned to do via RL. Compute scales with thinking length, not sample count.
The deepest result in Snell et al. is that the best method depends on difficulty. On easy questions, the model’s first guesses are usually close, so sequential revision and best-of-N win. On hard questions, the model needs to explore more distinct approaches, so parallel search against a verifier pays off. A compute-optimal policy routes each prompt to the right strategy and the right budget — that’s where the 4x efficiency comes from.
Parallel: many independent attempts, then aggregate (best-of-N, self-consistency, verifier search). Great coverage of the solution space; cheap to parallelize across GPUs; needs a good selector to convert coverage into a correct final answer.
Sequential: one long, self-correcting trajectory (extended chain-of-thought with backtracking and revision). Carries context forward so later steps build on earlier ones; powers reasoning models like o1 and R1, but latency grows with thinking length.
Coverage vs selection: why sampling works
Repeated sampling has its own scaling law. Brown et al. (2024) (“Large Language Monkeys”) showed that coverage, the fraction of problems for which at least one of $N$ samples is correct, grows roughly log-linearly in $N$ and is well modeled by an exponentiated power law. On SWE-bench Lite, DeepSeek-Coder went from 15.9% solved with one sample to 56% with 250 samples, beating far larger single-shot frontier models of the time.
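Coverage at a given sample count can be estimated from a finite batch of samples. The sketch below assumes the commonly used unbiased pass@k estimator (introduced with Codex by Chen et al., 2021): if `c` of `n` samples for a problem are correct, the chance that a random size-`k` subset contains at least one correct sample is `1 - C(n-c, k) / C(n, k)`. The function names are illustrative.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that k samples drawn from n (c of them correct) include a correct one."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are wrong, giving exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n_samples, n_correct) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Evaluating `coverage` at several values of `k` from one large batch gives the kind of coverage-versus-samples curve the paragraph above describes.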
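But coverage is only half the battle. There are two distinct questions: coverage (is a correct answer anywhere among the samples?) and selection (can you identify which sample it is?).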
In domains with a cheap, perfect verifier — unit tests for code, a numeric checker for math — coverage is accuracy: just keep sampling until something passes. This is exactly the regime where RLVR thrives. In open-ended domains there’s no oracle, so a learned reward model or a vote has to do the selection — and the gap between coverage and realized accuracy is where most of the difficulty lives.
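When a cheap programmatic verifier exists, "keep sampling until something passes" is a short loop. This is a sketch under that assumption; `sample` and `verify` are hypothetical callables (for example, a model call and a unit-test runner or numeric checker).

```python
from typing import Callable, Optional, Tuple


def sample_until_verified(
    prompt: str,
    sample: Callable[[str], str],    # hypothetical: one stochastic model call
    verify: Callable[[str], bool],   # hypothetical: programmatic checker
    max_samples: int = 64,
) -> Tuple[Optional[str], int]:
    """Return the first candidate that passes the checker and the samples used."""
    for attempt in range(1, max_samples + 1):
        candidate = sample(prompt)
        if verify(candidate):
            return candidate, attempt
    return None, max_samples  # budget exhausted without a passing candidate
```

With a trustworthy checker, selection is free and accuracy equals coverage at the budget used. Without one, `verify` has to be replaced by a learned scorer or a vote, and its errors become the bottleneck.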
Learning to think longer: o1, R1, and RL
The 2024–25 breakthrough was not a new search algorithm bolted on at inference — it was teaching the model itself to use test-time compute well. OpenAI’s o1 is trained with large-scale reinforcement learning to produce a long private chain of thought: it learns to break problems down, recognize and fix its own errors, and switch strategies when stuck. OpenAI reported that accuracy rose smoothly with both train-time RL compute and test-time thinking compute — two scaling curves, not one.
DeepSeek-R1 showed that the same behavior emerges from RL with verifiable rewards (GRPO on math/code): the model spontaneously learns to generate longer chains, double-check, and backtrack, with response length growing over training. s1 (Muennighoff et al., 2025) demonstrated how cheap the inference-side trick can be: fine-tune on just 1,000 curated reasoning traces, then apply budget forcing, which appends the token “Wait” whenever the model tries to stop so that it keeps thinking. That alone recovered extra accuracy and let s1-32B exceed o1-preview on competition math.
Go deeper: budget forcing and controlling the thinking budget
Budget forcing is the simplest possible test-time controller. To cap compute, you append an end-of-thinking delimiter and force the model to answer. To extend compute, you suppress that delimiter and inject “Wait” (or similar), which prompts the model to re-examine its reasoning — often catching an arithmetic slip or a wrong assumption on the second pass. The s1 paper reports a clean monotonic trade-off: more forced thinking, higher accuracy, up to a point of diminishing returns. It’s a vivid demonstration that the capability to use extra compute can be partly elicited by prompting, not just trained in. See RL for reasoning.
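The control loop described above can be sketched as follows. The `think` interface, the `</think>` delimiter, and the whitespace token count are all assumptions made for illustration; this is not the s1 implementation, only the cap-and-extend logic in miniature.

```python
from typing import Callable, Tuple

# Hypothetical model interface: given the text so far and a token allowance,
# return (new reasoning text, whether the model tried to end its thinking).
ThinkFn = Callable[[str, int], Tuple[str, bool]]


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would replace this."""
    return len(text.split())


def budget_forced_thinking(
    prompt: str,
    think: ThinkFn,
    min_tokens: int,
    max_tokens: int,
    end_marker: str = "</think>",  # assumed end-of-thinking delimiter
    nudge: str = "Wait",
) -> str:
    """Return a reasoning trace held between min_tokens and max_tokens."""
    trace, used = "", 0
    while used < max_tokens:
        chunk, wants_to_stop = think(prompt + trace, max_tokens - used)
        trace += chunk
        used += count_tokens(chunk)
        if not wants_to_stop:
            break  # the allowance ran out mid-thought: cap compute here
        if used >= min_tokens:
            break  # the model stopped on its own after thinking long enough
        # Extend: drop the model's stop and inject the nudge so it re-examines.
        trace += f" {nudge} "
        used += 1
    # Cap: always close the thinking block so the model must answer next.
    return trace.rstrip() + end_marker
```

Raising `min_tokens` is the knob that trades extra thinking for accuracy; `max_tokens` bounds latency and cost.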
This is why test-time compute and RL are now inseparable: RL is how a model learns a policy over its own reasoning — when to explore, when to verify, when to stop. The reasoning trace is the trajectory; the correct final answer is the reward.
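To make "the correct final answer is the reward" concrete, here is a sketch of an outcome reward plus the group-relative advantage used in GRPO-style training, where each sampled trace is compared against the other traces for the same prompt. Exact string matching and the use of the population standard deviation are simplifying assumptions; real pipelines use more careful answer checking and may normalize differently.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def outcome_reward(final_answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled traces."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Traces that beat the group average get positive advantage, the rest negative.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every trace in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason training prompts are usually chosen to be neither trivial nor impossible for the current model.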
A compute-optimal view
Once test-time compute is a real axis, the natural question is allocation. Given a total budget, how much should go to pretraining (a bigger model) versus inference (thinking harder)? Snell et al. frame it directly:
$$\theta^{*}_{q}(C) = \arg\max_{\theta} \; \mathbb{E}_{y \sim \mathrm{Target}(\theta, C, q)}\left[ \mathbb{1}\{ y = y^{*}(q) \} \right]$$
where $\theta$ is the test-time strategy (method and sample count $N$) chosen per question $q$, subject to a compute cap $C$; $\mathrm{Target}(\theta, C, q)$ is the distribution over outputs that the strategy induces under that cap, and $y^{*}(q)$ is the correct answer. Their finding: for easy and medium problems within a model’s reach, shifting budget from parameters to inference wins. For the hardest problems, beyond the base model’s capability, extra thinking can’t conjure knowledge that isn’t there, and a bigger or better-trained model is the only fix. Test-time compute amplifies capability; it doesn’t create it from nothing.
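In practice the per-question choice reduces to an argmax over a small menu of strategies, using success rates estimated per difficulty bin on held-out data. The sketch below assumes such estimates already exist; the `Strategy` class, its fields, and the difficulty labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Strategy:
    name: str
    cost: float                      # inference compute in arbitrary units (assumed)
    est_accuracy: Dict[str, float]   # difficulty bin -> estimated success rate


def choose_strategy(strategies: Sequence[Strategy], difficulty: str, budget: float) -> Strategy:
    """Pick the affordable strategy with the highest estimated accuracy for this bin."""
    affordable = [s for s in strategies if s.cost <= budget]
    if not affordable:
        raise ValueError("no strategy fits within the compute budget")
    # Ties on accuracy are broken in favor of the cheaper strategy.
    return max(affordable, key=lambda s: (s.est_accuracy[difficulty], -s.cost))
```

The hard part, which this sketch leaves out, is estimating a question's difficulty before spending compute on it; here the bin is simply passed in as an argument.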
Where it breaks: diminishing and inverse returns
More thinking is not always better: on some tasks accuracy falls as reasoning gets longer (inverse scaling). Beyond that, the practical limits are diminishing returns (each doubling of samples buys less), verifier overfitting (beam search can do worse than best-of-N on easy problems by exploiting PRM quirks, an instance of Goodhart’s law; see reward models), latency and cost (hours of thinking are unacceptable for interactive use), and the capability ceiling (no amount of search fixes missing knowledge). Knowing when not to spend is as important as scaling up.
How the methods compare
| Method | Compute spent | Needs a verifier? | Best for | Main weakness |
|---|---|---|---|---|
| Self-consistency | parallel ($N$ samples) | no (majority vote) | math/QA with a single answer | no signal beyond agreement |
| Best-of-N | parallel ($N$ samples) | yes (reward model) | open-ended, has a scorer | only as good as the scorer |
| PRM + search | parallel (tree) | yes (process RM) | hard multi-step reasoning | overfits verifier on easy items |
| Long CoT (o1/R1) | sequential (tokens) | learned via RL | frontier reasoning, agents | latency; inverse scaling |
| Budget forcing (s1) | sequential (tokens) | no | cheap control of thinking | crude, model-dependent |
For checkable domains, pair sampling with a programmatic verifier (RLVR); for taste/safety, selection falls back to a reward model trained with RLHF. Most production reasoning stacks combine both an RL-trained long-CoT policy and parallel sampling with a verifier on top.
Frequently asked questions
Is test-time compute the same as chain-of-thought?
Long chain-of-thought is one way to spend test-time compute (the sequential family), but not the only one. Best-of-N sampling, self-consistency voting, and verifier-guided tree search all add inference compute without a single long reasoning trace. Reasoning models like o1 combine RL-trained long CoT with sampling on top.
Does more thinking always improve the answer?
No. Returns diminish, and on some tasks they reverse — Anthropic’s inverse scaling work shows longer reasoning can lower accuracy by amplifying distraction or overfitting the problem framing. There’s usually a knee where most of the gain is captured; spending past it wastes compute and can hurt.
How is this related to reinforcement learning?
RL is how models learn to use test-time compute well. Training with RLVR/GRPO on checkable problems teaches a model to generate longer, self-correcting chains — exploring, verifying, and backtracking. The reasoning trace is the trajectory and the correct answer is the reward, so train-time RL and test-time thinking scale together. See RL for reasoning.
Can test-time compute replace bigger models?
Partly. Compute-optimal scaling can let a small model beat one up to ~14x larger on problems within its reach — but it can’t supply knowledge or skills the base model lacks. On the hardest tasks, beyond the model’s capability, a better-trained model is still required. The two axes are complementary, not interchangeable.
Key papers
- Self-Consistency Improves Chain-of-Thought Reasoning — Wang et al., 2022 — majority voting over sampled reasoning paths.
- Large Language Monkeys — Brown et al., 2024 — inference-time scaling laws via repeated sampling.
- Scaling LLM Test-Time Compute Optimally… — Snell et al., 2024 — compute-optimal allocation; small model beats one 14x larger.
- Learning to Reason with LLMs — OpenAI, 2024 — o1 and the dual train/test scaling curves.
- s1: Simple Test-Time Scaling — Muennighoff et al., 2025 — budget forcing on 1K traces.
- Inverse Scaling in Test-Time Compute — Anthropic, 2025 — when more thinking hurts.
Related
RL for reasoning · RLVR · GRPO · Reward models · Agentic RL · PPO · What is reinforcement learning?