reinforcement-learning.com

RL for Reasoning (o1 / R1-style)

How reasoning models like o1 and DeepSeek-R1 are trained with RL: chain-of-thought, verifiable rewards (RLVR), GRPO, test-time compute, and the 2026 debates.

Updated 2026-06-07

Key takeaways
  • Reasoning models (o1, o3, DeepSeek-R1) are trained with RL to write a long chain-of-thought before answering — and rewarded mainly when the final answer is verifiably correct.
  • The dominant recipe is RLVR (RL with Verifiable Rewards): a deterministic checker (math answer, passing unit tests) supplies the reward, so it's hard to game and needs no human-labeled reasoning traces.
  • DeepSeek-R1-Zero showed reasoning can emerge from pure RL on a base model — self-checking, backtracking, longer 'thinking' — using GRPO, a critic-free policy-gradient method.
  • It opened a second scaling axis (test-time compute), but 2025–26 work hotly debates whether RL adds new reasoning or just sharpens what the base model already knew (pass@1 up, pass@k down).

What is RL for reasoning?

RL for reasoning is the training technique behind “reasoning models” like OpenAI’s o1/o3 and DeepSeek-R1. Instead of only imitating human-written answers, the model is optimized with reinforcement learning to first generate a long internal chain-of-thought (CoT) — a <think>…</think> scratchpad where it plans, tries approaches, checks itself, and backtracks — and is then rewarded when its final answer is verifiably correct: a math result that checks out, code that passes its tests.

The crucial move is what supplies the reward. In classic RLHF a learned reward model scores style and helpfulness. For reasoning, the reward usually comes from a deterministic verifier — this is Reinforcement Learning with Verifiable Rewards (RLVR). Because the checker is right by construction, the reward is hard to hack, and the model is free to discover whatever reasoning process gets to the correct answer. Nobody hand-writes the chain-of-thought; it emerges from the optimization.

[Figure: problem x → policy LM (π_θ) → <think> … </think> long CoT → answer → verifier (tests / checker) → reward r ∈ {0,1} → policy-gradient update (GRPO / PPO). A group of G traces is sampled per problem.]
RLVR for reasoning: the policy samples a long chain-of-thought ending in an answer; a deterministic verifier checks the answer and emits a 0/1 reward; the policy is updated to make rewarded traces more likely (GRPO compares a group of samples to a shared baseline).
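
To make the verifier concrete, here is a minimal sketch of an answer-matching reward. The `<think>…</think><answer>…</answer>` output format, the `normalize` helper, and the function name `verifiable_reward` are assumptions for illustration; real checkers usually need more careful answer normalization (fractions, units, LaTeX).

```python
import re

# Assumed (hypothetical) output format: <think>...</think><answer>...</answer>
_PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL)


def normalize(answer: str) -> str:
    """Canonicalize an answer string for exact-match comparison."""
    return answer.strip().replace(" ", "").replace(",", "").lower()


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the tagged final answer matches ground truth, else 0.0."""
    match = _PATTERN.search(completion.strip())
    if match is None:  # malformed output earns no reward
        return 0.0
    return 1.0 if normalize(match.group(2)) == normalize(ground_truth) else 0.0


good = "<think>2 + 2 is 4.</think> <answer> 4 </answer>"
bad = "<think>2 + 2 is 5.</think> <answer>5</answer>"
print(verifiable_reward(good, "4"), verifiable_reward(bad, "4"))
```

Note that the reward never inspects the contents of the `<think>` block; only the final answer is checked, which is the point of the recipe.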

Reasoning models vs ordinary LLMs

An ordinary instruction-tuned LLM is optimized to produce a good-looking answer in one pass. A reasoning model (sometimes called a Large Reasoning Model, LRM) is optimized to spend compute thinking before it commits — and that thinking is what RL shapes.

| | Standard instruction-tuned LLM | Reasoning model (o1 / R1-style) |
| --- | --- | --- |
| Post-training signal | RLHF/DPO on human preferences | RLVR on verifiable correctness |
| Output | Direct answer | Long internal CoT, then answer |
| What scales quality | Bigger model / more SFT | More RL and more thinking at inference |
| Strength | Tone, helpfulness, breadth | Math, code, multi-step logic |
| Reward source | Learned reward model | Deterministic verifier |

  • 12% → 74%: GPT-4o vs o1 on AIME 2024 (single sample)
  • ~671B: DeepSeek-V3/R1 total params (≈37B active, MoE)
  • 2 axes: train-time RL + test-time compute both scale accuracy

Why RL and not just more supervised fine-tuning?

You can teach reasoning by SFT on human-written solutions — but it caps out. Demonstrations teach the model to imitate one path a human happened to write; they can’t teach it to explore, notice it’s wrong, and recover, because there’s no signal for “this attempt failed, try another.” RL provides exactly that signal: sample many attempts, reward the ones that land, and let the model find its own strategies — including ones no human wrote down.

The other reason is data. High-quality step-by-step reasoning traces are scarce and expensive. A verifier sidesteps the bottleneck: for math and code you often already have the ground-truth answer or a test suite, so you can generate unlimited training signal automatically. This is why reasoning RL scaled so fast — the reward is essentially free and essentially uncheatable.

Go deeper: the link to classic RL fundamentals

If you come from the RL side, RLVR is a clean instance of policy-gradient learning. The trajectory is the token sequence; the action is each token; the reward is a single sparse terminal scalar from the verifier. The policy gradient pushes up the log-probability of tokens in high-reward trajectories. The only twist versus Atari-era RL is the baseline used to reduce variance: instead of a learned value function (PPO’s critic), GRPO uses the mean reward of a group of samples for the same prompt. A KL penalty to a reference model keeps the policy from drifting into gibberish — the same safety belt RLHF uses.
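
In symbols, the paragraph above corresponds to the standard policy-gradient estimator with a baseline. This is a sketch rather than any paper's exact objective: $b(x)$ is the baseline, $y_t$ is the $t$-th token of a sampled trace $y$, and the KL penalty is left out for brevity.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \Big[ \big( r(x, y) - b(x) \big)
          \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big],
\qquad
b_{\text{GRPO}}(x) = \frac{1}{G} \sum_{i=1}^{G} r_i .
```

PPO would plug a learned value estimate in for $b$; GRPO uses the group mean of the $G$ rewards sampled for the same prompt.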

The core idea: reward the answer, let the reasoning emerge

The recipe is almost suspiciously simple. Give the model a problem with a known answer. Ask it to think inside <think>…</think> and then give a final answer. Sample several attempts. Run a verifier. Reward the attempts that got it right. Update. Repeat.

1
Sample a group of attempts

For each problem $x$, sample a group of $G$ full chain-of-thought traces $\{y_1,\dots,y_G\}$ from the current policy $\pi_\theta$. Each is an independent attempt: some explore, some go straight, some fail.

2
Score with a verifier

A deterministic checker assigns each trace a reward $r_i$: typically $1$ if the final answer matches ground truth (or all unit tests pass), else $0$, often plus small format rewards (did it use the think/answer tags? right language?).

3
Compute a group-relative advantage

Instead of a learned critic, normalize within the group:

$$A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}$$

A trace beats the baseline if it’s better than its siblings on the same problem.

4
Update the policy

Increase the probability of tokens in above-average traces, decrease it for below-average ones (a clipped policy-gradient step, with a KL penalty to a reference model). Over many iterations the model learns to write the kind of reasoning that tends to be correct.
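
As a worked example of step 3, the sketch below computes group-relative advantages for one prompt. The small `eps` in the denominator and the use of the population standard deviation are assumptions added for numerical safety; implementations differ on both.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one prompt's group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std (an assumption; some codebases use sample std)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of G = 8 traces, 2 of which the verifier marked correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

With 2 of 8 correct, the mean is 0.25 and the standard deviation is about 0.433, so correct traces get an advantage near +1.73 and incorrect ones near -0.58. If every trace in a group gets the same reward, all advantages are zero and the group contributes no gradient.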

What’s striking is what this doesn’t contain: no human reasoning labels, no step-by-step supervision, no learned reward model. The verifier is the whole teacher. Longer thinking, self-verification, and backtracking are not programmed — they are simply behaviors that raise the hit rate, so RL amplifies them.

RLVR: why verifiable rewards resist hacking

Reward hacking is the original sin of RLHF: optimize a learned proxy hard enough and the policy finds quirks the reward model loves but humans don’t (Goodhart’s law). RLVR swaps the learnable proxy for a ground-truth oracle. You can’t sweet-talk a unit test; either it passes or it doesn’t.

RLHF — learn the reward

A reward model trained on human preferences scores outputs. Best where “good” is taste — tone, helpfulness, safety — but gameable, because the proxy is itself a model. See reward models.

RLVR — verify the reward

A programmatic checker (answer-matcher, unit tests, a theorem prover) emits the reward. Best for math, code, and checkable logic — hard to hack, no labels needed. See RLVR.

The catch is coverage: RLVR only works where you can write a verifier. Math and code are easy; “write a moving essay” is not. That’s why frontier models use both: RLVR to sharpen reasoning, then a round of preference-based RL to keep the result helpful and safe. The term RLVR was coined and popularized by AI2’s Tülu 3 open post-training report.
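
For code, the programmatic checker can be as simple as running the candidate against its tests. The sketch below is a minimal illustration, not a production grader: the function name `unit_test_reward` is hypothetical, and a subprocess with a timeout is not a security sandbox, so real systems isolate untrusted code much more strictly.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def unit_test_reward(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if the candidate passes every assertion in test_code, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_with_tests.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # runaway or hanging code counts as a failure
    # A failed assert or any other exception gives a non-zero exit code.
    return 1.0 if result.returncode == 0 else 0.0


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
correct = "def add(a, b):\n    return a + b"
buggy = "def add(a, b):\n    return a - b"
print(unit_test_reward(correct, tests), unit_test_reward(buggy, tests))
```

The reward is only as strong as the test suite: a model can learn to pass weak tests without solving the problem, which is why hidden or harder tests are a common defence.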

Case study 1 — OpenAI o1: learning to reason

In September 2024 OpenAI shipped o1, the first widely deployed reasoning model, framed explicitly as learning to reason. Their description: “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought.” The model learns to recognize and correct mistakes, break hard steps into simpler ones, and try a different approach when one isn’t working.

o1’s headline contribution was a new scaling axis: performance improved both with more train-time RL and with more test-time compute (thinking longer). That’s a departure from the pretraining-scaling story — you can now buy accuracy at inference time by letting the model think. OpenAI hid the raw CoT from users (for safety and competitive reasons), which kept the exact recipe closed. See the o1 System Card and OpenAI’s account of competitive programming with large reasoning models.

Case study 2 — DeepSeek-R1: the open recipe and the “aha moment”

DeepSeek’s R1 paper (January 2025; peer-reviewed version in Nature) did for the open world what o1 did for the closed one — and showed the machinery in full.

R1-Zero: reasoning from pure RL

The boldest experiment was R1-Zero: take the DeepSeek-V3 base model, apply no SFT at all, and run RLVR directly with GRPO on math/code with answer-checking rewards. Reasoning emerged — the model spontaneously learned to think longer, verify its work, and backtrack. The paper highlights an “aha moment” where the model learns to stop, re-evaluate its approach, and allocate more thinking time, in language like “wait, let me reconsider.” Nobody trained that behavior in; it raised the reward, so it appeared.

R1-Zero had real warts — language mixing (switching between English and Chinese mid-thought) and messy, hard-to-read CoT. So the production R1 wrapped the pure-RL core in a four-stage pipeline.

The R1 multi-stage recipe

1
Cold-start SFT

Fine-tune the base model on a few thousand curated long-CoT examples to stabilize the early RL phase and fix readability — a gentle on-ramp, not the main teacher.

2
Reasoning-oriented RL

Run large-scale RLVR (GRPO) on math/code, now with an added language-consistency reward to suppress the language-mixing seen in R1-Zero.

3
Rejection sampling + SFT

Once RL converges, sample the checkpoint and keep only correct responses (~600k reasoning samples) plus ~200k general samples, then SFT — distilling the RL-honed behavior back into a clean, broad model.

4
Final RL alignment

A last RL pass over all prompts to align helpfulness and harmlessness — a preference-based stage so the reasoning model is also a usable assistant.
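
Stage 3 of the pipeline above (rejection sampling, then SFT) can be sketched as a filter over sampled completions. Everything here is illustrative: `sample_fn` stands in for sampling from the RL checkpoint and `verify_fn` for the verifier, and both are stubbed so the example runs on its own.

```python
import itertools
from typing import Callable, Iterable


def rejection_sample_sft_data(
    problems: Iterable[tuple[str, str]],
    sample_fn: Callable[[str], str],
    verify_fn: Callable[[str, str], bool],
    samples_per_problem: int = 4,
) -> list[dict[str, str]]:
    """Keep only verified-correct completions as prompt/completion SFT pairs."""
    kept = []
    for prompt, ground_truth in problems:
        for _ in range(samples_per_problem):
            completion = sample_fn(prompt)
            if verify_fn(completion, ground_truth):
                kept.append({"prompt": prompt, "completion": completion})
    return kept


# Stub sampler (hypothetical): alternates between a right and a wrong answer.
_fake_outputs = itertools.cycle(["4", "5"])
data = rejection_sample_sft_data(
    problems=[("What is 2 + 2?", "4")],
    sample_fn=lambda prompt: next(_fake_outputs),
    verify_fn=lambda completion, truth: completion == truth,
)
print(len(data))
```

The same filter is what makes distillation cheap: the kept pairs can be used to fine-tune a smaller model with ordinary SFT.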

DeepSeek also distilled R1’s reasoning into small dense models (1.5B–70B Qwen/Llama), showing that a strong teacher’s traces transfer remarkably well — often beating same-size models trained with RL directly.

Under the hood: the RL algorithms

PPO and its critic-model cost

PPO is the workhorse of RLHF, but for reasoning it’s heavy: it trains a separate value network (critic) the same size as the policy to estimate advantages, roughly doubling memory and adding instability. For long-CoT RL — where a single trajectory can be tens of thousands of tokens — that overhead bites.

GRPO: drop the critic

GRPO (Group Relative Policy Optimization), introduced in DeepSeekMath and used to train R1, removes the critic entirely. As shown above, it samples a group of answers per prompt and uses the group’s mean reward as the baseline: each sample’s advantage is how much it beat its siblings, scaled by the group’s standard deviation. It is cheaper, simpler, and well-suited to verifiable rewards, where you can afford many rollouts per prompt.

[Figure: PPO pairs the policy with a critic (value net), advantage = r − V(s); GRPO pairs the policy with G sampled answers, advantage = rᵢ − mean(r), no critic model.]
PPO learns a value network to baseline each token; GRPO replaces it with the mean reward of a group of sampled answers to the same prompt — no critic, no extra model.
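
For reference, the GRPO objective can be written schematically as follows. The symbols $\epsilon$ (clip range), $\beta$ (KL weight), $\pi_{\text{old}}$ (the policy that generated the samples) and $\pi_{\text{ref}}$ (the reference model) are notation introduced here, and the KL term is shown as a single penalty rather than in its per-token form.

```latex
\mathcal{J}_{\text{GRPO}}(\theta)
  = \mathbb{E}_{x,\ \{y_i\}_{i=1}^{G} \sim \pi_{\text{old}}(\cdot \mid x)}
    \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      \min\!\Big( \rho_{i,t} A_i,\
                  \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i \Big)
    \right]
    - \beta\, D_{\mathrm{KL}}\!\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big),
\qquad
\rho_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\text{old}}(y_{i,t} \mid x, y_{i,<t})},
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)} .
```

Every token in trace $y_i$ shares the same advantage $A_i$, which is what lets the method work with a single terminal reward and no value network.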

DAPO, Kimi k1.5, and the open tricks

GRPO is simple but finicky; a wave of 2025 work stabilized it. DAPO (ByteDance) is a fully open recipe with four tricks:

  • Clip-Higher: decouple the lower and upper clip bounds and raise the upper one, so promising low-probability tokens can still gain probability.
  • Dynamic Sampling: drop prompt-groups where every sample is right or every sample is wrong, since they give zero gradient.
  • Token-Level Policy-Gradient Loss: average over tokens, not sequences, so long CoTs aren’t down-weighted.
  • Overlong Reward Shaping: gently penalize runaway length.

DAPO reported beating R1-Zero-Qwen-32B on AIME 2024 with half the training steps.
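
Three of the DAPO tricks are small enough to sketch directly. The function names and the default bounds `eps_low=0.2`, `eps_high=0.28` are illustrative assumptions, and the code works on plain Python numbers rather than tensors.

```python
def clip_higher_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with decoupled bounds [1 - eps_low, 1 + eps_high]."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def dynamic_sampling_filter(groups):
    """Drop prompt groups whose rewards are all identical (zero advantage)."""
    return [g for g in groups if len(set(g["rewards"])) > 1]


def token_level_loss(per_token_losses):
    """Average over every token in the batch, so long samples count by length."""
    total_tokens = sum(len(seq) for seq in per_token_losses)
    return sum(sum(seq) for seq in per_token_losses) / total_tokens


def sequence_level_loss(per_token_losses):
    """Average per sequence first, then across sequences (for contrast)."""
    return sum(sum(seq) / len(seq) for seq in per_token_losses) / len(per_token_losses)


groups = [{"rewards": [1, 1]}, {"rewards": [1, 0]}, {"rewards": [0, 0]}]
print(len(dynamic_sampling_filter(groups)))           # only the mixed group survives
print(clip_higher_term(1.5, 1.0))                     # capped at 1 + eps_high
print(token_level_loss([[1.0, 1.0, 1.0, 1.0], [4.0]]),
      sequence_level_loss([[1.0, 1.0, 1.0, 1.0], [4.0]]))
```

In the last line the four-token sample dominates the token-level average (1.6) but counts the same as the one-token sample in the sequence-level average (2.5), which is the down-weighting of long traces that the token-level loss is meant to avoid.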

Kimi k1.5 (Moonshot) is a contemporaneous long-context (128K) RL recipe that deliberately avoids MCTS, value functions, and process reward models. It uses partial rollouts for efficiency and bets that scale plus a clean objective beats elaborate search.

| Method | Baseline / advantage | Extra model | Notes |
| --- | --- | --- | --- |
| PPO | learned value network | critic | Stable, controllable, expensive |
| GRPO | group mean reward | none | Critic-free; trained R1 |
| DAPO | GRPO + 4 stabilizers | none | Open, faster convergence at scale |
| DPO | closed-form preference loss | none | Offline; for preference, not verifiable RL |
| RAFT / rejection sampling | keep best-of-N, then SFT | none | Simplest “poor man’s RL” |

Reward design: outcome vs process

Two ways to reward reasoning:

  • Outcome Reward Model (ORM) — score only the final answer. Simple, and for RLVR the “ORM” is literally the verifier. The downside: a correct answer can come from flawed reasoning (lucky guesses), and the signal is sparse.
  • Process Reward Model (PRM) — score each step. OpenAI’s Let’s Verify Step by Step showed step-level supervision beats outcome-only on MATH. But PRMs are expensive (need step labels) and themselves hackable — so the R1/Kimi camp largely skipped them in favor of pure outcome rewards plus light format and language-consistency rewards. See reward models.
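
The difference between the two signals shows up clearly on a "lucky guess". In this sketch the per-step probabilities are made-up numbers standing in for a step scorer's output, and multiplying them is one common way to turn step scores into a solution score (taking the minimum is another).

```python
from math import prod


def outcome_score(final_answer: str, ground_truth: str) -> float:
    """ORM-style signal for RLVR: only the final answer matters."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0


def process_score(step_correct_probs: list[float]) -> float:
    """PRM-style solution score: probability that every step is correct."""
    return prod(step_correct_probs)


# Hypothetical lucky guess: right answer, but the second step looks wrong.
lucky_outcome = outcome_score("42", "42")
lucky_process = process_score([0.9, 0.1, 0.8])
print(lucky_outcome, round(lucky_process, 3))
```

The outcome reward gives this trace full credit, while the process score is dragged down by the one suspicious step. That extra resolution is what a PRM buys, at the cost of step labels and a learned model that can itself be gamed.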

Test-time compute: the second scaling axis

The deepest shift from o1/R1 is that inference is now a place you scale. A reasoning model gets better the longer you let it think — and you can spend that budget several ways:

Longer chains

Let the model emit more reasoning tokens before answering. RL teaches it to use the budget — to verify and backtrack rather than ramble.

Majority voting / self-consistency

Sample many independent answers and take the most common (or, with a verifier, the one that checks out). Cheap, embarrassingly parallel, and surprisingly strong.

Search & re-ranking

Explore a tree of partial solutions and score them with a verifier or PRM. In OpenAI’s o1 report, re-ranking 1,000 samples with a learned scoring function lifted AIME 2024 accuracy from ~74% to ~93%.
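
The two sampling-based strategies above reduce to a few lines once the final answers have been extracted. This is a minimal sketch: `score_fn` is a placeholder for whatever verifier or learned scorer is available.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Re-ranking: return the candidate that the scorer rates highest."""
    return max(candidates, key=score_fn)


sampled_answers = ["7", "9", "7", "3", "7"]
print(majority_vote(sampled_answers))

# Hypothetical scorer: here simply a lookup table of made-up scores.
scores = {"7": 0.4, "9": 0.9, "3": 0.1}
print(best_of_n(["7", "9", "3"], lambda a: scores[a]))
```

Majority voting needs no scorer but can only pick an answer the model produces often; best-of-N can surface a rare correct answer, provided the scorer is good enough to recognize it.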

This complements training-time scaling. See the sibling page on RLVR and the broader discussion of RL for reasoning versus reasoning-via-search.

The big debate: new reasoning, or sharpened base model?

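Here’s where 2025–26 got genuinely contentious. The uncritical story is “RL teaches models to reason.” The data complicate it: pass@k studies suggest RLVR mostly re-weights reasoning paths the base model already had, so pass@1 goes up while coverage at large k can go down.

The follow-up work cuts both ways: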

  • It does add reasoning, measured right. RLVR Implicitly Incentivizes Correct Reasoning in Base LLMs argues that when you score intermediate steps rather than just final answers, RLVR genuinely improves the reasoning, not just the hit rate.
  • Spurious rewards “work” on Qwen. The unsettling Spurious Rewards paper showed that random or even incorrect rewards still boosted Qwen2.5-Math-7B on MATH-500 by ~21 percentage points (vs ~29 for true rewards), but the same spurious rewards failed on Llama and OLMo. The likely mechanism: RL surfaces a latent “think in code” behavior already baked into Qwen by pretraining (possibly including test-set contamination). The lesson: many flashy RLVR results are model-family-dependent, and “RL improved my score” can mean “RL un-hid something pretraining already put there.”

So the honest 2026 summary is: RLVR consistently makes models more reliable on problems within reach of the base model, and it’s the best tool we have for eliciting test-time-scaling behavior. But claims that it creates new reasoning from nothing should be read with the pass@k caveat and the model-family caveat firmly in mind.
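
Since the debate hinges on pass@1 versus pass@k, it helps to see how pass@k is usually estimated: draw n samples per problem, count the c correct ones, and compute the chance that a random subset of k contains at least one. The correct counts below are invented purely to show how the two metrics can move in opposite directions; they are not results from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n = 100 samples on two problems.
base_counts = [30, 2]  # base model: unreliable, but occasionally solves both
rl_counts = [90, 0]    # after RL: reliable on problem 1, never solves problem 2

for k in (1, 64):
    print(k, round(mean_pass_at_k(base_counts, 100, k), 3),
          round(mean_pass_at_k(rl_counts, 100, k), 3))
```

In this toy setup the RL model wins at k = 1 (0.45 vs 0.16) and loses at k = 64 (0.5 vs roughly 0.94), which is the shape of the "sharpening, not expanding" argument.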

Failure modes (and the fixes)

| Failure mode | What happens | Common fix |
| --- | --- | --- |
| Entropy collapse | Policy becomes over-confident, stops exploring, output diversity dies | Clip-Higher (DAPO), entropy bonus, KL control |
| Reward hacking | Gaming weak verifiers / format rewards instead of reasoning | Harder verifiers, hidden tests, length shaping |
| Language mixing | CoT switches languages mid-thought (seen in R1-Zero) | Language-consistency reward |
| Overthinking | Endless CoT that wastes tokens without improving accuracy | Overlong reward shaping, length penalties |
| pass@k regression | Higher pass@1 but worse coverage at large k | Keep some exploration; don’t over-train |

Go deeper: why entropy collapse is the central tension

RL on verifiable rewards is a giant exploitation engine — it relentlessly up-weights paths that already work. Left unchecked, the policy’s per-token entropy crashes: it commits to one strategy and stops sampling alternatives. That raises pass@1 in the short run but kills the diversity that pass@k measures and that long-horizon improvement depends on. Most stabilization tricks — Clip-Higher, dynamic sampling that drops zero-gradient groups, entropy regularization, careful KL — are really exploration-preservation tricks. This is the same explore/exploit dilemma at the heart of all of reinforcement learning, now playing out over token sequences.
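
The quantity being watched here is simply the Shannon entropy of the next-token distribution, averaged over positions. A minimal sketch with hand-written distributions (a real run would read these from the model's softmax outputs):

```python
from math import log


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * log(p) for p in probs if p > 0.0)


def mean_policy_entropy(distributions):
    """Average per-token entropy over the positions of a sampled trace."""
    return sum(token_entropy(p) for p in distributions) / len(distributions)


exploring = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
collapsed = [[0.97, 0.01, 0.01, 0.01], [1.0, 0.0, 0.0, 0.0]]
print(round(mean_policy_entropy(exploring), 3), round(mean_policy_entropy(collapsed), 3))
```

A steadily falling curve of this average during training is the usual early warning that the policy is committing to one strategy and that pass@k is about to suffer.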

How to train your own reasoning model

You need three ingredients: a base model with latent capability, a verifier for your domain, and a framework to run rollouts at scale.

1
Pick a capable base + data with checkable answers

Start from a strong base/instruct model and a dataset where answers are verifiable — GSM8K/MATH/AIME for math, problems with unit tests for code. Quality of the verifier matters more than dataset size.

2
Define the reward

Usually: correctness (1/0 from the checker) + small format rewards (think/answer tags) + optional length and language rewards. Keep it simple; complex rewards invite hacking.

3
Run GRPO/DAPO rollouts

Sample a group per prompt, verify, compute group-relative advantages, update with a clipped policy gradient and a KL leash to the reference. Watch entropy and the gap between the training reward and held-out accuracy (the gold-vs-proxy gap).

4
Distill (optional)

Rejection-sample correct traces from your RL checkpoint and SFT a smaller model on them — often the cheapest way to ship reasoning at low latency.
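
To see the whole loop in one place, here is a deliberately tiny toy: the "policy" is a softmax over four candidate answers to a single problem, and the verifier accepts exactly one of them. It is a sketch of the sample, verify, advantage, update cycle only. There is no language model, and the clipping and KL penalty mentioned in step 3 are omitted because each update uses fresh on-policy samples. All names and hyperparameters are illustrative.

```python
import math
import random
from statistics import fmean, pstdev


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_policy(num_answers=4, correct=2, group_size=8, steps=200, lr=0.5, seed=0):
    """Group-relative policy gradient on a one-problem, single-step toy task."""
    rng = random.Random(seed)
    logits = [0.0] * num_answers  # the entire "policy"
    for _ in range(steps):
        probs = softmax(logits)
        # 1) Sample a group of attempts from the current policy.
        group = rng.choices(range(num_answers), weights=probs, k=group_size)
        # 2) Verify: 0/1 reward from a deterministic checker.
        rewards = [1.0 if a == correct else 0.0 for a in group]
        # 3) Group-relative advantage; skip groups that carry no signal.
        std = pstdev(rewards)
        if std == 0.0:
            continue
        mean = fmean(rewards)
        advantages = [(r - mean) / std for r in rewards]
        # 4) Policy-gradient step: d log pi(a) / d logit_j = 1[j == a] - probs[j].
        for j in range(num_answers):
            grad = fmean([adv * ((1.0 if a == j else 0.0) - probs[j])
                          for a, adv in zip(group, advantages)])
            logits[j] += lr * grad
    return softmax(logits)


final_probs = train_toy_policy()
print([round(p, 3) for p in final_probs])
```

The policy starts uniform and ends with most of its probability on the verified answer. Notice that updates stop whenever a whole group agrees, which is the zero-gradient case that dynamic sampling filters out, and the same exploitation pressure that drives entropy collapse at scale.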

The open frameworks: verl (Volcano Engine RL, scales to large clusters), OpenRLHF, Hugging Face TRL (GRPO built in), and AI2’s open-instruct (the Tülu 3 RLVR stack). Building the verifiers, environments, and graders at production scale is its own emerging industry — see the companies building reasoning RL environments.

A short history

  • 2023: Let's Verify Step by Step. OpenAI shows step-level process reward models beat outcome-only supervision on MATH (the PRM idea).
  • Feb 2024: DeepSeekMath / GRPO. DeepSeek introduces GRPO, the critic-free, group-relative algorithm that will train R1.
  • Sep 2024: OpenAI o1. First deployed reasoning model; reframes the problem as “learning to reason” and reveals test-time-compute scaling.
  • Nov 2024: Tülu 3 names RLVR. AI2’s open post-training report coins and popularizes “Reinforcement Learning with Verifiable Rewards.”
  • Jan 2025: DeepSeek-R1 & Kimi k1.5. Open-weights R1 shows reasoning emerging from pure RL (R1-Zero) and ships the full multi-stage recipe; Kimi k1.5 brings long-context RL.
  • Mar 2025: DAPO. A fully open RL system (clip-higher, dynamic sampling, token-level loss, overlong shaping) beats R1-Zero-Qwen-32B on AIME with fewer steps.
  • 2025–26: The reckoning. pass@1-vs-pass@k, spurious rewards, entropy collapse: the field starts asking what RL actually adds, and for which model families.

Frontiers: beyond math and code

The hard part of reasoning RL now is escaping the verifiable sandbox:

  • Agentic / tool-use RL — reward the model for correctly using tools (search, code execution, browsers) over multi-step tasks; the verifier becomes the environment’s outcome. See agentic RL.
  • Multimodal reasoning — extend verifiable rewards to vision/diagram problems.
  • Unverifiable domains — for writing, strategy, or open-ended analysis there’s no checker, so the frontier is generative verifiers, LLM-as-judge graders, and rubric-based rewards — which reintroduces the reward-hacking risk RLVR was prized for avoiding.

Researcher takes

The OpenAI researcher behind o1’s reasoning work lays out, on launch day, the core thesis for why RL-trained chain-of-thought opens a brand-new axis for scaling.

Karpathy argues the real magic of RL for reasoning is that the solving strategies are emergent and could never come from imitation — a ‘Move 37’ moment for language models.

Frequently asked questions

Is RL-for-reasoning the same as chain-of-thought prompting?

No. CoT prompting just asks a frozen model to show its work. RL-for-reasoning trains the model — via RLVR — to generate reasoning that tends to be correct, rewarding the final answer. Prompting is free and shallow; RL changes the weights and produces models that reason without being asked.

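Do o1 and R1 use RLHF or RLVR?

Both, in sequence, at least for R1, whose recipe is public. The reasoning core is RLVR (verifiable rewards on math/code). Then a final RLHF-style alignment pass makes the model helpful and safe. OpenAI has not published o1’s exact recipe, but describes it as large-scale RL on chain-of-thought. RLVR sharpens correctness; RLHF governs behavior.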

Why GRPO instead of PPO for reasoning?

PPO needs a separate value network (critic) the size of the policy — costly and unstable for very long chains-of-thought. GRPO drops the critic and baselines each sample against a group of siblings on the same prompt, which is cheaper and fits naturally with verifiable rewards where you can sample many attempts.

Does RL actually make models smarter, or just better at sampling?

Genuinely contested. Pass@k studies suggest RLVR mostly re-weights paths the base model already had (better pass@1, worse pass@k), and spurious-reward results show some gains are really pretraining behaviors being un-hidden — and are model-family-dependent. Other work argues the reasoning itself does improve when you measure steps, not just answers. Safest read: RL makes models reliably better within the base model’s reach.
