reinforcement-learning.com

Test-Time Compute & Inference-Time Scaling

What test-time compute is, how inference-time scaling works (best-of-N, search, long CoT), the scaling laws behind o1 and R1, and where it breaks in 2026.

Updated 2026-06-08
Key takeaways
  • Test-time compute is a new scaling axis: instead of a bigger model, spend more compute *while answering* — sample more, search more, or think longer.
  • Two families: parallel (best-of-N, self-consistency, verifier search) and sequential (long chain-of-thought that self-corrects and backtracks).
  • Snell et al. showed compute-optimal test-time scaling can let a small model beat one 14x larger on hard reasoning — and is 4x more efficient than naive best-of-N.
  • It's the engine behind o1, R1 and the 2025 reasoning boom — but it has limits: diminishing returns, verifier overfitting, and even inverse scaling.

What is test-time compute?

For a decade, the way to make a language model smarter was to make it bigger: more parameters, more data, more training compute. Test-time compute (also called inference-time scaling) is a different lever entirely. You keep the model fixed and instead spend more compute at the moment it answers — generating many candidate solutions, searching over reasoning paths, or letting the model think for longer before it commits to a reply.

The intuition is human: asked a hard question, you do better if you can scribble, try a few approaches, check your work, and revise — rather than blurting the first thing that comes to mind. Test-time compute gives a model that same option, trading wall-clock seconds and tokens for accuracy.

[Figure: a prompt feeds two paths. Parallel: sample 1, sample 2, …, sample N, then a vote or verifier picks the best. Sequential: think, try approach A, “wait, that’s wrong”, backtrack, try B, final answer.]
Two ways to spend more compute at inference. Parallel scaling samples many independent answers and picks one (best-of-N, majority vote, verifier search). Sequential scaling produces one long, self-correcting chain of thought.

Why it matters: a second scaling axis

Pretraining scaling laws (more params + more data + more train FLOPs → lower loss) are hitting practical walls: high-quality text is finite and frontier training runs cost hundreds of millions of dollars. Test-time compute opened a second, orthogonal axis. As OpenAI’s Noam Brown put it when o1 launched, “We’re no longer bottlenecked by pretraining. We can now scale inference compute too.”

The headline empirical result comes from Snell et al. (2024) (Google DeepMind / UC Berkeley): with a compute-optimal strategy that adapts to question difficulty, a small model using extra inference compute can outperform a model 14x larger on hard math, while being more than 4x more efficient than a naive best-of-N baseline.

  • 14x: larger model a small one can beat with compute-optimal test-time scaling
  • 4x: efficiency gain over naive best-of-N (Snell et al.)
  • 15.9% → 56%: SWE-bench Lite solved, 1 → 250 samples (Large Language Monkeys)

The two families of methods

Almost every test-time technique falls under a proposer–verifier view: a proposer generates candidate solutions (or steps), and a verifier (a vote, a reward model, or a checker) decides what to keep. The compute can be spent in parallel, sequentially, or both.

1. Best-of-N sampling

Sample N independent answers at non-zero temperature, score each with a verifier or reward model, and return the highest-scoring one. Simple, embarrassingly parallel, and a strong baseline, but it needs a good scorer to pick the winner.

2. Self-consistency (majority vote)

Sample N reasoning chains and return the most common final answer. No verifier is required: the method relies on wisdom of crowds across independent reasoning paths. On many benchmarks plain majority voting is competitive with verifier-based best-of-N.

3. Verifier-guided search

Use a process reward model (PRM) that scores each reasoning step, and search the tree of partial solutions with beam search or lookahead, pruning bad branches early. More compute-efficient than best-of-N on hard problems, but it can overfit the verifier on easy ones. See reward models.

4. Sequential revision (long CoT)

Let the model produce one long chain of thought that self-corrects — proposing an approach, catching its own mistakes, backtracking, and trying again. This is what o1 and R1 learned to do via RL. Compute scales with thinking length, not sample count.
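The two simplest methods in the list above, best-of-N and self-consistency, reduce to a few lines once a sampler and a scorer exist. The sketch below is a minimal illustration rather than a reference implementation: `sample` and `score` are hypothetical callables standing in for a model call and a reward model, and the toy versions exist only so the code runs.

```python
import random
from collections import Counter
from typing import Callable


def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Draw n candidates and return the one the scorer rates highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


def self_consistency(sample: Callable[[], str], n: int) -> str:
    """Draw n final answers and return the most common one (majority vote)."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


# Toy proposer (hypothetical): answers "42" about 60% of the time, else a random digit.
_rng = random.Random(0)

def toy_sample() -> str:
    return "42" if _rng.random() < 0.6 else str(_rng.randint(0, 9))

# Toy scorer (hypothetical): a stand-in for a reward model that prefers "42".
def toy_score(answer: str) -> float:
    return 1.0 if answer == "42" else 0.0

print(best_of_n(toy_sample, toy_score, n=8))
print(self_consistency(toy_sample, n=25))
```

Note the asymmetry: `best_of_n` is only as good as `score`, while `self_consistency` needs answers that can be compared for equality, which is why it suits tasks with a single short final answer.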

The deepest result in Snell et al. is that the best method depends on difficulty. On easy questions, the model’s first guesses are usually close, so sequential revision and best-of-N win. On hard questions, the model needs to explore more distinct approaches, so parallel search against a verifier pays off. A compute-optimal policy routes each prompt to the right strategy and the right budget — that’s where the 4x efficiency comes from.
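Verifier-guided search can be sketched as a beam search in which a process reward model ranks partial solutions. This is a toy under stated assumptions, not the procedure from any particular paper: `expand`, `prm_score`, and `is_complete` are hypothetical callables, and the "reasoning steps" are just integers so the example is self-contained.

```python
from typing import Callable, List, Tuple

Path = Tuple[int, ...]  # a partial solution: the sequence of steps taken so far


def prm_beam_search(
    expand: Callable[[Path], List[int]],
    prm_score: Callable[[Path], float],
    is_complete: Callable[[Path], bool],
    beam_width: int,
    max_depth: int,
) -> Path:
    """Keep the beam_width best partial solutions at each depth, as ranked by the PRM."""
    beam: List[Path] = [()]
    for _ in range(max_depth):
        candidates: List[Path] = []
        for path in beam:
            if is_complete(path):
                candidates.append(path)  # finished paths are carried forward unchanged
            else:
                candidates.extend(path + (step,) for step in expand(path))
        if not candidates:
            break  # nothing left to expand
        # Prune: only the highest-scoring branches survive to the next depth.
        beam = sorted(candidates, key=prm_score, reverse=True)[:beam_width]
        if all(is_complete(p) for p in beam):
            break
    return max(beam, key=prm_score)


# Toy task (hypothetical): pick steps from {1, 2, 3} whose sum reaches exactly 6.
TARGET = 6

def toy_expand(path: Path) -> List[int]:
    return [1, 2, 3]

def toy_prm(path: Path) -> float:
    # Stand-in for a learned PRM: closer to the target is better, shorter is slightly better.
    return -abs(TARGET - sum(path)) - 0.01 * len(path)

def toy_complete(path: Path) -> bool:
    return sum(path) >= TARGET

best = prm_beam_search(toy_expand, toy_prm, toy_complete, beam_width=2, max_depth=6)
print(best, sum(best))
```

Because every pruning decision trusts `prm_score`, any systematic quirk in the scorer is amplified at each depth, which is one way to see why search can overfit the verifier.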

Parallel scaling

Many independent attempts, then aggregate: best-of-N, self-consistency, verifier search. Great coverage of the solution space; cheap to parallelize across GPUs; needs a good selector to convert coverage into a correct final answer.

Sequential scaling

One long, self-correcting trajectory: extended chain-of-thought with backtracking and revision. Carries context forward so later steps build on earlier ones; powers reasoning models like o1 and R1, but latency grows with thinking length.

Coverage vs selection: why sampling works

Repeated sampling has its own scaling law. Brown et al. (2024) (“Large Language Monkeys”) showed that coverage, the fraction of problems for which at least one of N samples is correct, grows roughly log-linearly in N and is well modeled by an exponentiated power law. On SWE-bench Lite, DeepSeek-Coder went from 15.9% solved with one sample to 56% with 250 samples, beating far larger single-shot frontier models of the time.
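A common way to estimate coverage from a finite pool of samples is the unbiased pass@k estimator introduced with the HumanEval benchmark: draw n samples per problem, count the c correct ones, and compute the probability that a random subset of k contains at least one correct sample. The sketch below assumes that estimator; whether a given paper used exactly this form is not something the text above states.

```python
from math import comb


def coverage_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from c correct out of n drawn.

    Uses 1 - C(n - c, k) / C(n, k): one minus the chance that all k picks are wrong.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k wrong samples exist, so any k-subset contains a hit
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 3 correct samples out of 100 drawn for one problem.
for k in (1, 10, 50, 100):
    print(k, round(coverage_at_k(100, 3, k), 3))
```

Averaging this quantity over all problems in a benchmark gives the coverage curve as a function of k.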

But coverage is only half the battle. There are two distinct questions:

\[
\text{coverage}@N = \Pr\big[\exists\, i \le N : y_i \text{ is correct}\big] \qquad \text{accuracy} = \Pr\big[\text{selected } \hat{y} \text{ is correct}\big]
\]

In domains with a cheap, perfect verifier — unit tests for code, a numeric checker for math — coverage is accuracy: just keep sampling until something passes. This is exactly the regime where RLVR thrives. In open-ended domains there’s no oracle, so a learned reward model or a vote has to do the selection — and the gap between coverage and realized accuracy is where most of the difficulty lives.
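When a perfect checker exists, "keep sampling until something passes" is literally a loop. The sketch below assumes a hypothetical `sample` callable (the model) and a `verify` callable (unit tests, a numeric checker); the toy verifier checks a square root so the example runs on its own.

```python
from typing import Callable, Optional, Tuple


def sample_until_verified(
    sample: Callable[[], int],
    verify: Callable[[int], bool],
    max_samples: int,
) -> Tuple[Optional[int], int]:
    """Return (first candidate that passes the verifier, samples used), or (None, max_samples)."""
    for attempt in range(1, max_samples + 1):
        candidate = sample()
        if verify(candidate):
            return candidate, attempt
    return None, max_samples


# Toy setup (hypothetical): the "model" proposes guesses for the square root of 1369.
guesses = iter([35, 36, 37, 38])

def toy_sample() -> int:
    return next(guesses)

def toy_verify(x: int) -> bool:
    return x * x == 1369  # cheap and exact, like a unit test

print(sample_until_verified(toy_sample, toy_verify, max_samples=4))
```

With an exact verifier, the first passing sample is correct by construction, so coverage and accuracy coincide. Swap `verify` for a learned reward model and that guarantee disappears.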

Learning to think longer: o1, R1, and RL

The 2024–25 breakthrough was not a new search algorithm bolted on at inference — it was teaching the model itself to use test-time compute well. OpenAI’s o1 is trained with large-scale reinforcement learning to produce a long private chain of thought: it learns to break problems down, recognize and fix its own errors, and switch strategies when stuck. OpenAI reported that accuracy rose smoothly with both train-time RL compute and test-time thinking compute — two scaling curves, not one.

DeepSeek-R1 showed that the same behavior emerges from RL with verifiable rewards (GRPO on math/code): the model spontaneously learns to generate longer chains, double-check, and backtrack, with response length growing over training. s1 (Muennighoff et al., 2025) demonstrated how cheap the inference-side trick can be: fine-tune on just 1,000 curated reasoning traces, then apply budget forcing, which appends the token “Wait” whenever the model tries to stop so that it keeps thinking. That alone recovered extra accuracy and let s1-32B exceed o1-preview on competition math.

Go deeper: budget forcing and controlling the thinking budget

Budget forcing is the simplest possible test-time controller. To cap compute, you append an end-of-thinking delimiter and force the model to answer. To extend compute, you suppress that delimiter and inject “Wait” (or similar), which prompts the model to re-examine its reasoning — often catching an arithmetic slip or a wrong assumption on the second pass. The s1 paper reports a clean monotonic trade-off: more forced thinking, higher accuracy, up to a point of diminishing returns. It’s a vivid demonstration that the capability to use extra compute can be partly elicited by prompting, not just trained in. See RL for reasoning.
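The control loop described above can be sketched in a few lines. Everything model-specific here is an assumption: `think` is a hypothetical callable that returns the reasoning the model produces before it tries to stop, `END_THINK` is a placeholder delimiter (real models define their own), and whitespace-separated words stand in for tokens.

```python
from typing import Callable

END_THINK = "</think>"  # placeholder end-of-thinking delimiter (assumption)


def budget_forced_trace(
    think: Callable[[str], str],
    question: str,
    num_waits: int,
    max_words: int,
) -> str:
    """Build a reasoning trace, extending it with "Wait" and capping it at max_words."""
    trace = ""
    for round_idx in range(num_waits + 1):
        trace += think(question + trace)
        words = trace.split()
        if len(words) >= max_words:
            # Cap compute: cut the trace at the budget and stop extending.
            trace = " ".join(words[:max_words])
            break
        if round_idx < num_waits:
            # Extend compute: suppress the stop and nudge the model to re-examine its work.
            trace += " Wait,"
    # Force the end of thinking so the model must now answer.
    return trace + END_THINK


# Scripted stand-in for a model (hypothetical): first pass slips, second pass corrects.
chunks = iter([" 2+2=5.", " let me recheck: 2+2=4."])
print(budget_forced_trace(lambda ctx: next(chunks), "Q: 2+2?", num_waits=1, max_words=50))
```

The two knobs map onto the two directions of control: `num_waits` extends thinking, `max_words` caps it.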

This is why test-time compute and RL are now inseparable: RL is how a model learns a policy over its own reasoning — when to explore, when to verify, when to stop. The reasoning trace is the trajectory; the correct final answer is the reward.
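"The correct final answer is the reward" can be made concrete with an outcome-only reward function. This is a minimal sketch, assuming the trace marks its answers with a hypothetical `Answer:` prefix and that exact string match is an adequate check; real pipelines use task-specific parsers and checkers.

```python
import re


def outcome_reward(trace: str, gold: str) -> float:
    """Return 1.0 if the last 'Answer: ...' line in the trace matches gold, else 0.0."""
    answers = re.findall(r"Answer:\s*(.+)", trace)
    if not answers:
        return 0.0  # no parseable answer earns no reward
    # Only the final answer counts, so mid-trace mistakes that get corrected are not punished.
    return 1.0 if answers[-1].strip() == gold.strip() else 0.0


trace = "Try A. Answer: 5\nWait, that is wrong. Recompute. Answer: 4"
print(outcome_reward(trace, "4"))
```

Because the reward looks only at the end of the trajectory, the policy is free to discover whatever intermediate behavior (checking, backtracking) raises the chance of ending on a correct answer.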

A compute-optimal view

Once test-time compute is a real axis, the natural question is allocation. Given a total budget, how much should go to pretraining (a bigger model) versus inference (thinking harder)? Snell et al. frame it directly:

\[
\theta^{*}(q) \;=\; \arg\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid q,\, N)}\big[\, \mathbb{1}[\,y = y^{*}(q)\,] \,\big] \quad\text{s.t.}\quad \text{FLOPs} \le C
\]

where θ is the test-time strategy (method and sample count N) chosen per question q, y*(q) is the correct answer to q, and C is the compute cap. Their finding: for easy and medium problems within a model’s reach, shifting budget from parameters to inference wins. For the hardest problems, those beyond the base model’s capability, extra thinking can’t conjure knowledge that isn’t there, and a bigger or better-trained model is the only fix. Test-time compute amplifies capability; it doesn’t create it from nothing.
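Read as code, the objective is an argmax over candidate strategies subject to a FLOPs cap. The sketch below assumes accuracy estimates per strategy are already available for the question's difficulty level (for example, from a held-out set); the `Strategy` class and all numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    flops: float          # estimated inference cost, arbitrary units
    est_accuracy: float   # estimated P(correct) for this difficulty level


def compute_optimal(strategies: Iterable[Strategy], budget: float) -> Optional[Strategy]:
    """Pick the most accurate strategy whose cost fits the budget; break ties by lower cost."""
    feasible = [s for s in strategies if s.flops <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda s: (s.est_accuracy, -s.flops))


# Made-up estimates for one difficulty bin (illustration only).
hard_bin = [
    Strategy("greedy", flops=1, est_accuracy=0.40),
    Strategy("best-of-16", flops=16, est_accuracy=0.58),
    Strategy("beam-search", flops=16, est_accuracy=0.61),
    Strategy("best-of-64", flops=64, est_accuracy=0.66),
]
print(compute_optimal(hard_bin, budget=16))
```

A per-question router would keep one such table per difficulty bin and look up the bin before choosing, which is the sense in which the policy adapts to difficulty.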

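[Figure: accuracy (y-axis) versus test-time compute on a log scale (x-axis), with two curves: a well-posed task showing diminishing returns and a distractor-prone task showing inverse scaling. The knee is marked as the point of most gain, least waste.]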
Accuracy versus log test-time compute. Returns are steep at first, then flatten — and on some tasks they bend back down (inverse scaling). The right budget sits near the knee, not at the maximum.

Where it breaks: diminishing and inverse returns

Inverse scaling, documented by Anthropic, is the sharpest failure: on some tasks longer reasoning lowers accuracy by amplifying distraction or overfitting the problem framing. Beyond that, the practical limits are: diminishing returns (each doubling of samples buys less), verifier overfitting (beam search can do worse than best-of-N on easy problems by exploiting PRM quirks; this is Goodhart’s law again, see reward models), latency and cost (hours of thinking are unacceptable for interactive use), and the capability ceiling (no amount of search fixes missing knowledge). Knowing when not to spend is as important as scaling up.
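One simple way to act on "knowing when not to spend" is to measure accuracy at successive budget doublings and stop at the first doubling that fails to pay for itself. The rule and the threshold below are assumptions chosen for illustration, not a method from the cited papers, and the curve values are invented.

```python
from typing import List, Tuple


def knee_budget(curve: List[Tuple[int, float]], min_gain: float) -> int:
    """Return the largest budget reached before a step gains less than min_gain accuracy.

    curve: (budget, accuracy) pairs sorted by increasing budget, e.g. successive doublings.
    A negative gain (inverse scaling) also stops the walk.
    """
    best_budget, best_acc = curve[0]
    for budget, acc in curve[1:]:
        if acc - best_acc < min_gain:
            break
        best_budget, best_acc = budget, acc
    return best_budget


# Invented measurements: steep gains, then a plateau, then a slight decline.
well_posed = [(1, 0.40), (2, 0.55), (4, 0.63), (8, 0.65), (16, 0.64)]
distractor_prone = [(1, 0.70), (2, 0.60), (4, 0.52)]

print(knee_budget(well_posed, min_gain=0.05))
print(knee_budget(distractor_prone, min_gain=0.05))
```

The same loop handles both failure shapes: on a plateau it stops when gains shrink, and under inverse scaling it never leaves the smallest budget.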

How the methods compare

| Method | Compute spent | Needs a verifier? | Best for | Main weakness |
| --- | --- | --- | --- | --- |
| Self-consistency | parallel (N samples) | no (majority vote) | math/QA with a single answer | no signal beyond agreement |
| Best-of-N | parallel (N samples) | yes (reward model) | open-ended, has a scorer | only as good as the scorer |
| PRM + search | parallel (tree) | yes (process RM) | hard multi-step reasoning | overfits verifier on easy items |
| Long CoT (o1/R1) | sequential (tokens) | learned via RL | frontier reasoning, agents | latency; inverse scaling |
| Budget forcing (s1) | sequential (tokens) | no | cheap control of thinking | crude, model-dependent |

For checkable domains, pair sampling with a programmatic verifier (RLVR); for taste/safety, selection falls back to a reward model trained with RLHF. Production reasoning stacks often combine the two: an RL-trained long-CoT policy with parallel sampling and a verifier on top.

A short history

2022: Chain-of-thought & self-consistency
Wei et al. show prompting for step-by-step reasoning helps; Wang et al. add majority voting over samples, the first cheap form of inference-time scaling.

2023: Verifiers and tree search
Process reward models and tree-of-thought style search formalize spending compute on structured exploration of reasoning steps.

2024: Scaling laws for inference
Brown et al. (“Large Language Monkeys”) and Snell et al. establish inference-time scaling laws and compute-optimal allocation.

2024: o1, learning to think
OpenAI ships a model RL-trained to use a long private chain of thought; accuracy scales with both train- and test-time compute.

2025: R1 & s1, open & cheap
DeepSeek-R1 reproduces long-CoT via RL with verifiable rewards; s1 shows budget forcing on 1K traces rivals o1-preview.

2025: The limits map out
Anthropic documents inverse scaling; large studies (30B+ tokens) chart where extra compute helps and where it doesn’t.

Frequently asked questions

Is test-time compute the same as chain-of-thought?

Long chain-of-thought is one way to spend test-time compute (the sequential family), but not the only one. Best-of-N sampling, self-consistency voting, and verifier-guided tree search all add inference compute without a single long reasoning trace. Reasoning models like o1 combine RL-trained long CoT with sampling on top.

Does more thinking always improve the answer?

No. Returns diminish, and on some tasks they reverse — Anthropic’s inverse scaling work shows longer reasoning can lower accuracy by amplifying distraction or overfitting the problem framing. There’s usually a knee where most of the gain is captured; spending past it wastes compute and can hurt.

How is this related to reinforcement learning?

RL is how models learn to use test-time compute well. Training with RLVR/GRPO on checkable problems teaches a model to generate longer, self-correcting chains — exploring, verifying, and backtracking. The reasoning trace is the trajectory and the correct answer is the reward, so train-time RL and test-time thinking scale together. See RL for reasoning.

Can test-time compute replace bigger models?

Partly. Compute-optimal scaling can let a small model beat one up to ~14x larger on problems within its reach — but it can’t supply knowledge or skills the base model lacks. On the hardest tasks, beyond the model’s capability, a better-trained model is still required. The two axes are complementary, not interchangeable.
