Test-Time Compute & Inference Scaling

Key takeaways

Test-time compute is a new scaling axis: instead of a bigger model, spend more compute *while answering* — sample more, search more, or think longer.
Two families: parallel (best-of-N, self-consistency, verifier search) and sequential (long chain-of-thought that self-corrects and backtracks).
Snell et al. showed that compute-optimal test-time scaling can let a small model beat one 14x larger on reasoning problems within its reach, while being more than 4x more efficient than naive best-of-N.
It's the engine behind o1, R1 and the 2025 reasoning boom — but it has limits: diminishing returns, verifier overfitting, and even inverse scaling.

What is test-time compute?

For a decade, the way to make a language model smarter was to make it bigger: more parameters, more data, more training compute. Test-time compute (also called inference-time scaling) is a different lever entirely. You keep the model fixed and instead spend more compute at the moment it answers — generating many candidate solutions, searching over reasoning paths, or letting the model think for longer before it commits to a reply.

The intuition is human: asked a hard question, you do better if you can scribble, try a few approaches, check your work, and revise — rather than blurting the first thing that comes to mind. Test-time compute gives a model that same option, trading wall-clock seconds and tokens for accuracy.

Two ways to spend more compute at inference. Parallel scaling samples many independent answers and picks one (best-of-N, majority vote, verifier search). Sequential scaling produces one long, self-correcting chain of thought.

Why it matters: a second scaling axis

Pretraining scaling laws (more params + more data + more train FLOPs → lower loss) are hitting practical walls: high-quality text is finite and frontier training runs cost hundreds of millions of dollars. Test-time compute opened a second, orthogonal axis. As OpenAI’s Noam Brown put it when o1 launched, “We’re no longer bottlenecked by pretraining. We can now scale inference compute too.”

The headline empirical result comes from Snell et al. (2024) (Google DeepMind / UC Berkeley): with a compute-optimal strategy that adapts to question difficulty, a small model using extra inference compute can outperform a model 14x larger on math problems within the small model’s reach, while being more than 4x more efficient than a naive best-of-N baseline.

14x: larger model a small one can beat with compute-optimal test-time scaling

4x: efficiency gain over naive best-of-N (Snell et al.)

15.9% → 56%: SWE-bench Lite solved, 1 → 250 samples (Large Language Monkeys)

The two families of methods

Almost every test-time technique falls under a proposer–verifier view: a proposer generates candidate solutions (or steps), and a verifier (a vote, a reward model, or a checker) decides what to keep. The compute can be spent in parallel, sequentially, or both.

Best-of-N sampling

Sample $N$ independent answers at non-zero temperature, score each with a verifier or reward model, and return the highest-scoring one. Simple, embarrassingly parallel, and a strong baseline — but it needs a good scorer to pick the winner.
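A minimal sketch of the selection step, assuming the $N$ samples have already been drawn. The names `best_of_n` and `toy_score` are illustrative, and the scorer is a hypothetical stand-in for a real verifier or reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(answer: str) -> float:
    # Hypothetical scorer for the prompt "What is 17 * 3?".
    # A real system would call a reward model or a checker here.
    return 1.0 if answer.strip() == "51" else 0.0


# Pretend these are N = 4 samples drawn at non-zero temperature.
samples = ["41", "51", "54", "51"]
picked = best_of_n(samples, toy_score)
```

Everything interesting lives in `score`: with a weak scorer, a larger $N$ mostly gives the scorer more chances to be fooled.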

Self-consistency (majority vote)

Sample $N$ reasoning chains and return the most common final answer. No verifier required — the wisdom-of-crowds across independent reasoning paths. On many benchmarks plain majority voting is shockingly competitive with verifier-based best-of-N.
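The vote itself is only a few lines. The sketch below assumes the final answer has already been extracted from each sampled chain; `majority_vote` is an illustrative name, not an API from the paper.

```python
from collections import Counter
from typing import Iterable, Tuple


def majority_vote(final_answers: Iterable[str]) -> Tuple[str, float]:
    """Return the most common final answer and the share of votes it got."""
    counts = Counter(answer.strip() for answer in final_answers)
    if not counts:
        raise ValueError("need at least one answer")
    answer, votes = counts.most_common(1)[0]
    return answer, votes / sum(counts.values())


# Final answers extracted from five sampled reasoning chains (toy data).
answer, agreement = majority_vote(["51", "41", "51", "54", " 51"])
```

The agreement share doubles as a rough confidence signal: a low share suggests the samples disagree and the question may deserve a larger budget.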

Verifier-guided search

Use a process reward model (PRM) that scores each reasoning step, and search the tree of partial solutions with beam search or lookahead — pruning bad branches early. More compute-efficient than best-of-N on hard problems, but it can overfit the verifier on easy ones. See reward models.
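As a rough illustration, here is beam search over partial solutions with a toy scorer standing in for the PRM. The task (reach 14 from 1 using the steps `+3` and `*2`) and the functions `expand`, `value` and `prm_score` are assumptions made for the example; a real PRM is a learned model that scores reasoning steps.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def beam_search(
    expand: Callable[[Path], Sequence[str]],
    prm_score: Callable[[Path], float],
    beam_width: int,
    depth: int,
) -> Path:
    """Keep the `beam_width` best partial paths at each step, pruning the rest."""
    beam: List[Path] = [()]
    for _ in range(depth):
        candidates = [path + (step,) for path in beam for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=prm_score, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=prm_score)


TARGET = 14


def value(path: Path) -> int:
    # Apply each step to the starting value 1.
    x = 1
    for step in path:
        x = x + 3 if step == "+3" else x * 2
    return x


def expand(path: Path) -> Sequence[str]:
    return ["+3", "*2"]


def prm_score(path: Path) -> float:
    # Toy "process reward": closer to the target is better.
    return -abs(TARGET - value(path))


wide = beam_search(expand, prm_score, beam_width=2, depth=3)
greedy = beam_search(expand, prm_score, beam_width=1, depth=3)
```

With a beam of 2 the search reaches 14 exactly, while the greedy beam of 1 commits early to the locally best step and ends at 16. The same mechanism cuts the other way: when the scorer is slightly wrong, a search that optimizes it harder can drift further from the right answer.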

Sequential revision (long CoT)

Let the model produce one long chain of thought that self-corrects — proposing an approach, catching its own mistakes, backtracking, and trying again. This is what o1 and R1 learned to do via RL. Compute scales with thinking length, not sample count.

The deepest result in Snell et al. is that the best method depends on difficulty. On easy questions, the model’s first guesses are usually close, so sequential revision and best-of-N win. On hard questions, the model needs to explore more distinct approaches, so parallel search against a verifier pays off. A compute-optimal policy routes each prompt to the right strategy and the right budget — that’s where the 4x efficiency comes from.

Parallel scaling

Many independent attempts, then aggregate: best-of-N, self-consistency, verifier search. Great coverage of the solution space; cheap to parallelize across GPUs; needs a good selector to convert coverage into a correct final answer.

Sequential scaling

One long, self-correcting trajectory: extended chain-of-thought with backtracking and revision. Carries context forward so later steps build on earlier ones; powers reasoning models like o1 and R1, but latency grows with thinking length.

Coverage vs selection: why sampling works

Repeated sampling has its own scaling law. Brown et al. (2024) (“Large Language Monkeys”) showed that coverage, the fraction of problems for which at least one of $N$ samples is correct, grows roughly log-linearly in $N$ and is well modeled by an exponentiated power law. On SWE-bench Lite, DeepSeek-Coder-V2-Instruct went from 15.9% solved with one sample to 56% with 250 samples, beating far larger single-shot frontier models of the time.

But coverage is only half the battle. There are two distinct questions:

$$
\text{coverage}@N = \Pr\big[\exists\, i \le N : y_i \text{ is correct}\big], \qquad \text{accuracy} = \Pr\big[\text{selected } \hat{y} \text{ is correct}\big]
$$

In domains with a cheap, perfect verifier — unit tests for code, a numeric checker for math — coverage is accuracy: just keep sampling until something passes. This is exactly the regime where RLVR thrives. In open-ended domains there’s no oracle, so a learned reward model or a vote has to do the selection — and the gap between coverage and realized accuracy is where most of the difficulty lives.
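Two small helpers make the coverage side concrete. The first assumes every sample is independently correct with the same probability `p`, which is a simplification. The second is the unbiased pass@k estimator commonly used in the code-generation literature, which estimates coverage at `k` from `n` drawn samples of which `c` passed the checker. Both function names are illustrative.

```python
from math import comb


def coverage_if_independent(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of coverage@k from n samples with c correct ones."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


two_tries = coverage_if_independent(0.5, 2)
estimate = pass_at_k(n=4, c=1, k=2)
```

Neither number says anything about selection. Turning coverage into accuracy still requires a checker or scorer that can recognize the correct sample.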

Learning to think longer: o1, R1, and RL

The 2024–25 breakthrough was not a new search algorithm bolted on at inference — it was teaching the model itself to use test-time compute well. OpenAI’s o1 is trained with large-scale reinforcement learning to produce a long private chain of thought: it learns to break problems down, recognize and fix its own errors, and switch strategies when stuck. OpenAI reported that accuracy rose smoothly with both train-time RL compute and test-time thinking compute — two scaling curves, not one.

DeepSeek-R1 showed that the same behavior emerges from RL with verifiable rewards (GRPO on math/code): the model spontaneously learns to generate longer chains, double-check, and backtrack, with response length growing over training. s1 (Muennighoff et al., 2025) demonstrated how cheap the inference-side trick can be: fine-tune on just 1,000 curated reasoning traces, then apply budget forcing, which appends the token “Wait” whenever the model tries to stop so that it keeps thinking. That alone recovered extra accuracy and let s1-32B exceed o1-preview on competition math.

Go deeper: budget forcing and controlling the thinking budget

Budget forcing is the simplest possible test-time controller. To cap compute, you append an end-of-thinking delimiter and force the model to answer. To extend compute, you suppress that delimiter and inject “Wait” (or similar), which prompts the model to re-examine its reasoning — often catching an arithmetic slip or a wrong assumption on the second pass. The s1 paper reports a clean monotonic trade-off: more forced thinking, higher accuracy, up to a point of diminishing returns. It’s a vivid demonstration that the capability to use extra compute can be partly elicited by prompting, not just trained in. See RL for reasoning.
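The control loop can be sketched as follows. This is not the s1 implementation: the `generate(context, max_new_tokens)` interface, the `</think>` delimiter, the whitespace token counter and the scripted fake model are all assumptions made so the example runs on its own.

```python
from typing import Callable, List

END_THINK = "</think>"  # hypothetical end-of-thinking delimiter


def count_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: whitespace-separated words.
    return len(text.split())


def budget_force(
    generate: Callable[[str, int], str],
    prompt: str,
    max_tokens: int,
    num_waits: int,
) -> str:
    """Extend thinking with "Wait" up to num_waits times, capped at max_tokens."""
    trace = ""
    waits_used = 0
    while True:
        remaining = max_tokens - count_tokens(trace)
        # Assumed contract: generate returns the text the model writes before
        # it tries to stop, truncated to at most `remaining` tokens.
        trace += generate(prompt + trace, remaining)
        out_of_budget = count_tokens(trace) + 1 >= max_tokens
        if waits_used >= num_waits or out_of_budget:
            break
        trace += " Wait"  # suppress the stop and nudge a second look
        waits_used += 1
    # Cap compute: close the thinking block so the model must answer next.
    return trace.strip() + " " + END_THINK


def make_scripted_model(chunks: List[str]) -> Callable[[str, int], str]:
    # Fake model that replays canned thinking chunks, for illustration only.
    it = iter(chunks)

    def generate(context: str, max_new_tokens: int) -> str:
        words = next(it).split()
        return " " + " ".join(words[:max_new_tokens])

    return generate


model = make_scripted_model(["2 + 2 = 5", "that is wrong, 2 + 2 = 4"])
extended = budget_force(model, "Q: 2+2?", max_tokens=50, num_waits=1)
capped = budget_force(make_scripted_model(["a b c d e f"]), "Q", max_tokens=3, num_waits=2)
```

The two knobs map onto the two uses described above: `num_waits` extends thinking, and `max_tokens` caps it.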

This is why test-time compute and RL are now inseparable: RL is how a model learns a policy over its own reasoning — when to explore, when to verify, when to stop. The reasoning trace is the trajectory; the correct final answer is the reward.

A compute-optimal view

Once test-time compute is a real axis, the natural question is allocation. Given a total budget, how much should go to pretraining (a bigger model) versus inference (thinking harder)? Snell et al. frame it directly:

$$
\theta^{*}(q) \;=\; \arg\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid q,\, N)}\big[\, \mathbb{1}[\,y = y^{*}(q)\,] \,\big] \quad\text{s.t.}\quad \text{FLOPs} \le C
$$

where $\theta$ is the test-time strategy (method and sample count $N$) chosen per question $q$, $y^{*}(q)$ is the correct answer, and $C$ is the compute cap. Their finding: for easy and medium problems within a model’s reach, shifting budget from parameters to inference wins. For the hardest problems, those beyond the base model’s capability, extra thinking can’t conjure knowledge that isn’t there, and a bigger or better-trained model is the only fix. Test-time compute amplifies capability; it doesn’t create it from nothing.
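In practice the argmax is taken over a small menu of strategies whose accuracy has been estimated per difficulty level. The sketch below assumes such a menu exists; the strategy names, relative costs and accuracy figures are made up for illustration and are not results from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class Strategy(NamedTuple):
    name: str
    cost: float          # relative inference cost (stand-in for FLOPs)
    est_accuracy: float  # estimated accuracy on this difficulty level


def compute_optimal(menu: Sequence[Strategy], budget: float) -> Optional[Strategy]:
    """Pick the most accurate strategy whose cost fits the budget."""
    affordable = [s for s in menu if s.cost <= budget]
    if not affordable:
        return None
    # Break accuracy ties in favor of the cheaper strategy.
    return max(affordable, key=lambda s: (s.est_accuracy, -s.cost))


# Made-up estimates for one "easy" difficulty level.
easy_menu = [
    Strategy("sequential revision x4", cost=4, est_accuracy=0.90),
    Strategy("best-of-16", cost=16, est_accuracy=0.91),
    Strategy("PRM beam search, width 16", cost=16, est_accuracy=0.88),
]

small_budget_choice = compute_optimal(easy_menu, budget=8)
large_budget_choice = compute_optimal(easy_menu, budget=16)
```

Running the same selection with a separate menu per difficulty level yields a per-question policy, which is the sense in which $\theta^{*}$ depends on $q$.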

Accuracy versus log test-time compute. Returns are steep at first, then flatten — and on some tasks they bend back down (inverse scaling). The right budget sits near the knee, not at the maximum.

Where it breaks: diminishing and inverse returns

Inverse scaling is the sharpest failure: on some tasks, longer reasoning lowers accuracy by amplifying distraction or overfitting the problem framing. Beyond that, the practical limits are diminishing returns (each doubling of samples buys less), verifier overfitting (beam search can do worse than best-of-N on easy problems by exploiting PRM quirks, which is Goodhart again; see reward models), latency and cost (hours of thinking are unacceptable for interactive use), and the capability ceiling (no amount of search fixes missing knowledge). Knowing when not to spend is as important as scaling up.
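One simple way to act on this is to measure accuracy at doubling budgets on a held-out set and stop where the next doubling no longer pays. The helper below is a sketch of that rule; the curve is made-up illustrative data, and the threshold `min_gain` is a hypothetical parameter.

```python
from typing import List, Tuple


def knee_budget(curve: List[Tuple[int, float]], min_gain: float) -> int:
    """Return the last budget before a doubling gains less than min_gain.

    `curve` is a list of (budget, accuracy) pairs sorted by budget.
    """
    if not curve:
        raise ValueError("need at least one measurement")
    for (budget, acc), (_, next_acc) in zip(curve, curve[1:]):
        if next_acc - acc < min_gain:
            return budget
    return curve[-1][0]


# Made-up measurements: steep at first, flat later, then slightly worse.
toy_curve = [(1, 0.40), (2, 0.52), (4, 0.60), (8, 0.64), (16, 0.65), (32, 0.63)]

strict = knee_budget(toy_curve, min_gain=0.05)
lenient = knee_budget(toy_curve, min_gain=0.02)
```

Because the rule looks at the signed gain, it also stops before a stretch where accuracy falls, which covers the inverse-scaling case as well as plain diminishing returns.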

How the methods compare

Method	Compute spent	Needs a verifier?	Best for	Main weakness
Self-consistency	parallel ($N$ samples)	no (majority vote)	math/QA with a single answer	no signal beyond agreement
Best-of-N	parallel ($N$ samples)	yes (reward model)	open-ended, has a scorer	only as good as the scorer
PRM + search	parallel (tree)	yes (process RM)	hard multi-step reasoning	overfits verifier on easy items
Long CoT (o1/R1)	sequential (tokens)	learned via RL	frontier reasoning, agents	latency; inverse scaling
Budget forcing (s1)	sequential (tokens)	no	cheap control of thinking	crude, model-dependent

For checkable domains, pair sampling with a programmatic verifier (RLVR); for taste/safety, selection falls back to a reward model trained with RLHF. Production reasoning stacks can combine both: an RL-trained long-CoT policy with parallel sampling and a verifier on top.

A short history

2022

Chain-of-thought & self-consistency

Wei et al. show prompting for step-by-step reasoning helps; Wang et al. add majority voting over samples — the first cheap inference-time scaling.

2023

Verifiers and tree search

Process reward models and tree-of-thought style search formalize spending compute on structured exploration of reasoning steps.

2024

Scaling laws for inference

Brown et al. (“Large Language Monkeys”) and Snell et al. establish inference-time scaling laws and compute-optimal allocation.

2024

o1 — learning to think

OpenAI ships a model RL-trained to use a long private chain of thought; accuracy scales with both train- and test-time compute.

2025

R1 & s1 — open & cheap

DeepSeek-R1 reproduces long-CoT via RL with verifiable rewards; s1 shows budget forcing on 1K traces rivals o1-preview.

2025

The limits map out

Anthropic documents inverse scaling; large studies (30B+ tokens) chart where extra compute helps and where it doesn’t.

Frequently asked questions

Is test-time compute the same as chain-of-thought?

Long chain-of-thought is one way to spend test-time compute (the sequential family), but not the only one. Best-of-N sampling, self-consistency voting, and verifier-guided tree search all add inference compute without a single long reasoning trace. The two families also compose: an RL-trained long-CoT model such as o1 can still be sampled many times and aggregated with a vote or a verifier.

Does more thinking always improve the answer?

No. Returns diminish, and on some tasks they reverse — Anthropic’s inverse scaling work shows longer reasoning can lower accuracy by amplifying distraction or overfitting the problem framing. There’s usually a knee where most of the gain is captured; spending past it wastes compute and can hurt.

How is this related to reinforcement learning?

RL is how models learn to use test-time compute well. Training with RLVR/GRPO on checkable problems teaches a model to generate longer, self-correcting chains — exploring, verifying, and backtracking. The reasoning trace is the trajectory and the correct answer is the reward, so train-time RL and test-time thinking scale together. See RL for reasoning.

Can test-time compute replace bigger models?

Partly. Compute-optimal scaling can let a small model beat one up to ~14x larger on problems within its reach — but it can’t supply knowledge or skills the base model lacks. On the hardest tasks, beyond the model’s capability, a better-trained model is still required. The two axes are complementary, not interchangeable.

Key papers

Self-Consistency Improves Chain-of-Thought Reasoning — Wang et al., 2022 — majority voting over sampled reasoning paths.
Large Language Monkeys — Brown et al., 2024 — inference-time scaling laws via repeated sampling.
Scaling LLM Test-Time Compute Optimally… — Snell et al., 2024 — compute-optimal allocation; small model beats one 14x larger.
Learning to Reason with LLMs — OpenAI, 2024 — o1 and the dual train/test scaling curves.
s1: Simple Test-Time Scaling — Muennighoff et al., 2025 — budget forcing on 1K traces.
Inverse Scaling in Test-Time Compute — Anthropic, 2025 — when more thinking hurts.

Related

RL for reasoning · RLVR · GRPO · Reward models · Agentic RL · PPO · What is reinforcement learning?
