- Reasoning models (o1, o3, DeepSeek-R1) are trained with RL to write a long chain-of-thought before answering — and rewarded mainly when the final answer is verifiably correct.
- The dominant recipe is RLVR (RL with Verifiable Rewards): a deterministic checker (math answer, passing unit tests) supplies the reward, so it's hard to game and needs no human-labeled reasoning traces.
- DeepSeek-R1-Zero showed reasoning can emerge from pure RL on a base model — self-checking, backtracking, longer 'thinking' — using GRPO, a critic-free policy-gradient method.
- It opened a second scaling axis (test-time compute), but 2025–26 work hotly debates whether RL adds new reasoning or just sharpens what the base model already knew (pass@1 up, pass@k down).
What is RL for reasoning?
RL for reasoning is the training technique behind “reasoning models” like OpenAI’s o1/o3 and DeepSeek-R1. Instead of only imitating human-written answers, the model is optimized with reinforcement learning to first generate a long internal chain-of-thought (CoT) — a <think>…</think> scratchpad where it plans, tries approaches, checks itself, and backtracks — and is then rewarded when its final answer is verifiably correct: a math result that checks out, code that passes its tests.
The crucial move is what supplies the reward. In classic RLHF a learned reward model scores style and helpfulness. For reasoning, the reward usually comes from a deterministic verifier — this is Reinforcement Learning with Verifiable Rewards (RLVR). Because the checker is right by construction, the reward is hard to hack, and the model is free to discover whatever reasoning process gets to the correct answer. Nobody hand-writes the chain-of-thought; it emerges from the optimization.
Reasoning models vs ordinary LLMs
An ordinary instruction-tuned LLM is optimized to produce a good-looking answer in one pass. A reasoning model (sometimes called a Large Reasoning Model, LRM) is optimized to spend compute thinking before it commits — and that thinking is what RL shapes.
| | Standard instruction-tuned LLM | Reasoning model (o1 / R1-style) |
|---|---|---|
| Post-training signal | RLHF/DPO on human preferences | RLVR on verifiable correctness |
| Output | Direct answer | Long internal CoT, then answer |
| What scales quality | Bigger model / more SFT | More RL and more thinking at inference |
| Strength | Tone, helpfulness, breadth | Math, code, multi-step logic |
| Reward source | Learned reward model | Deterministic verifier |
Why RL and not just more supervised fine-tuning?
You can teach reasoning by SFT on human-written solutions — but it caps out. Demonstrations teach the model to imitate one path a human happened to write; they can’t teach it to explore, notice it’s wrong, and recover, because there’s no signal for “this attempt failed, try another.” RL provides exactly that signal: sample many attempts, reward the ones that land, and let the model find its own strategies — including ones no human wrote down.
The other reason is data. High-quality step-by-step reasoning traces are scarce and expensive. A verifier sidesteps the bottleneck: for math and code you often already have the ground-truth answer or a test suite, so you can generate training signal automatically for as many rollouts as you can afford. This is why reasoning RL scaled so fast: wherever a reliable checker exists, the reward is nearly free to compute and hard to cheat.
Go deeper: the link to classic RL fundamentals
If you come from the RL side, RLVR is a clean instance of policy-gradient learning. The trajectory is the token sequence; the action is each token; the reward is a single sparse terminal scalar from the verifier. The policy gradient pushes up the log-probability of tokens in high-reward trajectories. The only twist versus Atari-era RL is the baseline used to reduce variance: instead of a learned value function (PPO’s critic), GRPO uses the mean reward of a group of samples for the same prompt. A KL penalty to a reference model keeps the policy from drifting into gibberish — the same safety belt RLHF uses.
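Written out, the group-baseline policy gradient described here looks roughly as follows. The notation is an assumption made for this sketch: $q$ is the prompt, $o_i$ is the $i$-th of $G$ sampled traces, $o_{i,t}$ is its $t$-th token, and $r_i$ is its verifier reward. Clipping, the KL penalty, and GRPO's division by the group's standard deviation are left out to keep the core idea visible.

$$
\nabla_\theta J(\theta) \approx \frac{1}{G} \sum_{i=1}^{G} \left(r_i - \bar{r}\right) \sum_{t=1}^{|o_i|} \nabla_\theta \log \pi_\theta\left(o_{i,t} \mid q, o_{i,<t}\right), \qquad \bar{r} = \frac{1}{G} \sum_{j=1}^{G} r_j
$$

Every token in a trace shares the same scalar weight $r_i - \bar{r}$, which is what "a single sparse terminal reward" means in practice.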
The core idea: reward the answer, let the reasoning emerge
The recipe is almost suspiciously simple. Give the model a problem with a known answer. Ask it to think inside <think>…</think> and then give a final answer. Sample several attempts. Run a verifier. Reward the attempts that got it right. Update. Repeat.
Sample: For each problem $q$, sample a group of $G$ full chain-of-thought traces $o_1, \dots, o_G$ from the current policy $\pi_\theta$. Each is an independent attempt: some explore, some go straight, some fail.
Score: A deterministic checker assigns each trace $o_i$ a reward $r_i$, typically $1$ if the final answer matches ground truth (or all unit tests pass), else $0$, often plus small format rewards (did it use the think/answer tags? right language?).
Compute advantages: Instead of a learned critic, normalize within the group:
$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$
A trace beats the baseline if it’s better than its siblings on the same problem.
Update: Increase the probability of tokens in above-average traces, decrease it for below-average ones (a clipped policy-gradient step, with a KL penalty to a reference model). Over many iterations the model learns to write the kind of reasoning that tends to be correct.
What’s striking is what this doesn’t contain: no human reasoning labels, no step-by-step supervision, no learned reward model. The verifier is the whole teacher. Longer thinking, self-verification, and backtracking are not programmed — they are simply behaviors that raise the hit rate, so RL amplifies them.
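The group-normalization step of the recipe can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: the function name, the small `eps` guard, and the choice of population standard deviation are all assumptions made here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize verifier rewards within one prompt's group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ
    # eps avoids division by zero when every sample got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four attempts at one problem: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every attempt succeeded carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct traces end up with advantages near +1 and incorrect ones near -1, while the all-correct group gets advantages of exactly zero, so it contributes nothing to the update.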
RLVR: why verifiable rewards resist hacking
Reward hacking is the original sin of RLHF: optimize a learned proxy hard enough and the policy finds quirks the reward model loves but humans don’t (Goodhart’s law). RLVR swaps the learnable proxy for a ground-truth oracle. You can’t sweet-talk a unit test; either it passes or it doesn’t.
Learned reward (RLHF): A reward model trained on human preferences scores outputs. It works best where “good” is a matter of taste (tone, helpfulness, safety) but is gameable, because the proxy is itself a model. See reward models.
Verifiable reward (RLVR): A programmatic checker (answer-matcher, unit tests, a theorem prover) emits the reward. It works best for math, code, and checkable logic: hard to hack, no labels needed. See RLVR.
The catch is coverage: RLVR only works where you can write a verifier. Math and code are easy; “write a moving essay” is not. That’s why frontier models use both: RLVR to sharpen reasoning, then a round of preference-based RL to keep the result helpful and safe. The term RLVR was introduced and popularized by AI2’s Tülu 3 open post-training report.
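To make "programmatic checker" concrete, here is a minimal sketch of a math-answer verifier. Everything specific in it is an assumption for illustration: the `<answer>` tag, the light string normalization, and the 0.1 format bonus are hypothetical choices, and a production verifier would need more robust answer parsing.

```python
import re

# Expect a think block followed by an answer block and nothing else.
ANSWER_RE = re.compile(
    r"<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def normalize(answer: str) -> str:
    """Strip whitespace, thousands separators, and a trailing period."""
    return answer.strip().replace(",", "").rstrip(".").lower()

def verifiable_reward(completion: str, ground_truth: str,
                      format_bonus: float = 0.1) -> float:
    """Return the format bonus plus 1.0 if the final answer matches."""
    match = ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0  # malformed output earns nothing
    correct = normalize(match.group(1)) == normalize(ground_truth)
    return format_bonus + (1.0 if correct else 0.0)

good = verifiable_reward("<think>2 + 2 = 4</think>\n<answer>4</answer>", "4")
wrong = verifiable_reward("<think>guessing</think><answer>5</answer>", "4")
untagged = verifiable_reward("The answer is 4", "4")
```

The reward depends only on the tags and the final answer, never on what the think block says, which is the sense in which the reasoning is left free to emerge.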
Case study 1 — OpenAI o1: learning to reason
In September 2024 OpenAI shipped o1, the first widely deployed reasoning model, framed explicitly as learning to reason. Their description: “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought.” The model learns to recognize and correct mistakes, break hard steps into simpler ones, and try a different approach when one isn’t working.
o1’s headline contribution was a new scaling axis: performance improved both with more train-time RL and with more test-time compute (thinking longer). That’s a departure from the pretraining-scaling story — you can now buy accuracy at inference time by letting the model think. OpenAI hid the raw CoT from users (for safety and competitive reasons), which kept the exact recipe closed. See the o1 System Card and OpenAI’s account of competitive programming with large reasoning models.
Case study 2 — DeepSeek-R1: the open recipe and the “aha moment”
DeepSeek’s R1 paper (January 2025; peer-reviewed version in Nature) did for the open world what o1 did for the closed one — and showed the machinery in full.
R1-Zero: reasoning from pure RL
The boldest experiment was R1-Zero: take the DeepSeek-V3 base model, apply no SFT at all, and run RLVR directly with GRPO on math/code with answer-checking rewards. Reasoning emerged — the model spontaneously learned to think longer, verify its work, and backtrack. The paper highlights an “aha moment” where the model learns to stop, re-evaluate its approach, and allocate more thinking time, in language like “wait, let me reconsider.” Nobody trained that behavior in; it raised the reward, so it appeared.
R1-Zero had real warts — language mixing (switching between English and Chinese mid-thought) and messy, hard-to-read CoT. So the production R1 wrapped the pure-RL core in a four-stage pipeline.
The R1 multi-stage recipe
Stage 1, cold-start SFT: Fine-tune the base model on a few thousand curated long-CoT examples to stabilize the early RL phase and fix readability. This is a gentle on-ramp, not the main teacher.
Stage 2, reasoning-oriented RL: Run large-scale RLVR (GRPO) on math/code, now with an added language-consistency reward to suppress the language mixing seen in R1-Zero.
Stage 3, rejection sampling and SFT: Once RL converges, sample the checkpoint and keep only correct responses (~600k reasoning samples) plus ~200k general samples, then SFT, distilling the RL-honed behavior back into a clean, broad model.
Stage 4, RL for all scenarios: A last RL pass over all prompts to align helpfulness and harmlessness, a preference-based stage so the reasoning model is also a usable assistant.
DeepSeek also distilled R1’s reasoning into small dense models (1.5B–70B Qwen/Llama), showing that a strong teacher’s traces transfer remarkably well — often beating same-size models trained with RL directly.
Under the hood: the RL algorithms
PPO and its critic-model cost
PPO is the workhorse of RLHF, but for reasoning it’s heavy: it trains a separate value network (critic) the same size as the policy to estimate advantages, roughly doubling memory and adding instability. For long-CoT RL — where a single trajectory can be tens of thousands of tokens — that overhead bites.
GRPO: drop the critic
GRPO (Group Relative Policy Optimization), introduced in DeepSeekMath and used to train R1, removes the critic entirely. As shown above, it samples a group of answers per prompt and uses the group’s mean reward as the baseline — the advantage is just how much each sample beat its siblings. Cheaper, simpler, and well-suited to verifiable rewards where you can afford many rollouts per prompt.
DAPO, Kimi k1.5, and the open tricks
GRPO is simple but finicky; a wave of 2025 work stabilized it. DAPO (ByteDance) is a fully open recipe with four tricks: Clip-Higher (decouple the lower and upper clip bounds and raise the upper one, so low-probability tokens have room to grow and exploration survives), Dynamic Sampling (drop prompt groups where every sample is right or every sample is wrong, since they give zero gradient), Token-Level Policy-Gradient Loss (average over all tokens in the batch rather than per sequence, so tokens in long CoTs aren’t down-weighted), and Overlong Reward Shaping (gently penalize runaway length). DAPO reported beating DeepSeek-R1-Zero-Qwen-32B on AIME 2024 with half the training steps.
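Three of these tricks are small enough to sketch directly. The code below is an illustration under assumptions, not DAPO's actual implementation: the function names are hypothetical, the inputs are plain Python floats rather than tensors, and the default bounds of 0.2 and 0.28 are illustrative values in the range the DAPO authors discuss.

```python
def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token, with decoupled bounds.

    ratio is pi_new(token) / pi_old(token). A larger eps_high lets
    low-probability tokens grow further before the clip engages.
    """
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)

def keep_group(rewards):
    """Dynamic sampling: drop groups whose rewards are all identical."""
    return len(set(rewards)) > 1

def token_level_loss(per_sample_token_objectives):
    """Average over every token in the batch, not per sequence first."""
    flat = [obj for sample in per_sample_token_objectives for obj in sample]
    return -sum(flat) / len(flat)

asymmetric = clipped_token_objective(1.5, 1.0)               # capped at 1 + eps_high
symmetric = clipped_token_objective(1.5, 1.0, eps_high=0.2)  # capped at 1 + eps_low
loss = token_level_loss([[1.0, 1.0, 1.0], [0.0]])
```

In the last line the three-token sample dominates the average (the loss is -0.75), whereas averaging each sequence first would give -0.5 and weight the one-token sample equally.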
Kimi k1.5 (Moonshot) is a parallel long-context (128K) RL recipe that deliberately avoids MCTS, value functions, and process reward models. It relies on partial rollouts for efficiency and bets that scale plus a clean objective beats elaborate search.
| Method | Baseline / advantage | Extra model | Notes |
|---|---|---|---|
| PPO | learned value network | critic | Stable, controllable, expensive |
| GRPO | group mean reward | none | Critic-free; trained R1 |
| DAPO | GRPO + 4 stabilizers | none | Open, faster convergence at scale |
| DPO | closed-form preference loss | none | Offline; for preference, not verifiable RL |
| RAFT / rejection sampling | keep best-of-N, then SFT | none | Simplest “poor man’s RL” |
Reward design: outcome vs process
Two ways to reward reasoning:
- Outcome Reward Model (ORM) — score only the final answer. Simple, and for RLVR the “ORM” is literally the verifier. The downside: a correct answer can come from flawed reasoning (lucky guesses), and the signal is sparse.
- Process Reward Model (PRM) — score each step. OpenAI’s Let’s Verify Step by Step showed step-level supervision beats outcome-only on MATH. But PRMs are expensive (need step labels) and themselves hackable — so the R1/Kimi camp largely skipped them in favor of pure outcome rewards plus light format and language-consistency rewards. See reward models.
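The difference between the two reward styles shows up clearly on a "lucky guess" trace. In this minimal sketch the per-step probabilities are made-up stand-ins for what a hypothetical PRM might output, and aggregating them by product is one common choice, assumed here rather than prescribed by the text.

```python
from math import prod

def outcome_reward(final_answer: str, ground_truth: str) -> float:
    """Outcome-only signal: 1.0 for a matching final answer, else 0.0."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def process_score(step_probs):
    """Combine per-step correctness probabilities into one solution score."""
    return prod(step_probs)

# A trace whose middle step is probably wrong but whose answer is right.
lucky_steps = [0.9, 0.2, 0.9]   # hypothetical PRM outputs, one per step
orm = outcome_reward("42", "42")
prm = process_score(lucky_steps)
```

The outcome reward gives this trace full credit, while the process score stays low because one weak step drags the product down.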
Test-time compute: the second scaling axis
The deepest shift from o1/R1 is that inference is now a place you scale. A reasoning model gets better the longer you let it think — and you can spend that budget several ways:
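Longer chains of thought: Let the model emit more reasoning tokens before answering. RL teaches it to use the budget, verifying and backtracking rather than rambling.
Parallel sampling (self-consistency, best-of-N): Sample many independent answers and take the most common or, with a verifier or scorer, the one that checks out. Cheap, embarrassingly parallel, and surprisingly strong: OpenAI reported o1’s AIME 2024 accuracy rising from about 74% with a single sample to about 83% with consensus over 64 samples and about 93% when re-ranking 1,000 samples with a learned scoring function.
Search: Explore a tree of partial solutions and score them with a verifier or PRM.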
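The two parallel-sampling strategies are short enough to show in full. This is a minimal sketch with hypothetical function names; the `score` argument stands in for whatever verifier or learned scorer is available.

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Re-ranking: return the candidate that the scorer rates highest."""
    return max(candidates, key=score)

# Five independent samples for the same question.
sampled_answers = ["42", "41", "42", "17", "42"]
voted = majority_vote(sampled_answers)

# Toy scorer: a verifier that accepts only the known-correct answer.
picked = best_of_n(sampled_answers, score=lambda a: 1.0 if a == "42" else 0.0)
```

Majority voting needs no verifier at all, which is why it is the usual first thing to try; re-ranking only helps as much as the scorer is trustworthy.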
This complements training-time scaling. See the sibling page on RLVR and the broader discussion of RL for reasoning versus reasoning via search.
The big debate: new reasoning, or sharpened base model?
Here’s where 2025–26 got genuinely contentious. The uncritical story is “RL teaches models to reason.” The data complicate it.
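The sharpest critique comes from pass@k studies such as Does RL Really Incentivize Reasoning Beyond the Base Model?: RLVR-trained models score higher at pass@1 but lower than their base models at large k, which suggests RL mostly re-weights reasoning paths the base model could already sample.
The counterpoints are just as sharp: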
- It does add reasoning, measured right. RLVR Implicitly Incentivizes Correct Reasoning in Base LLMs argues that when you score intermediate steps rather than just final answers, RLVR genuinely improves the reasoning, not just the hit rate.
- Spurious rewards “work” on Qwen. The unsettling Spurious Rewards paper showed that random or even incorrect rewards still boosted Qwen2.5-Math-7B on MATH-500 by roughly 21 to 24 absolute points (versus about 29 for ground-truth rewards), but the same spurious rewards failed on Llama and OLMo. The likely mechanism: RL surfaces a latent “think in code” behavior already baked into Qwen by pretraining (possibly including test-set contamination). The lesson: many flashy RLVR results are model-family-dependent, and “RL improved my score” can mean “RL un-hid something pretraining already put there.”
So the honest 2026 summary is: RLVR dependably makes models more reliable at problems within reach of the base model, and it’s the best tool we have for eliciting test-time-scaling behavior. Claims that it creates new reasoning from nothing, though, should be read with the pass@k caveat and the model-family caveat firmly in mind.
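The pass@k caveat is easier to weigh with the metric in hand. The sketch below uses the standard unbiased estimator from the code-generation evaluation literature, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The per-problem counts are invented purely to illustrate the crossover and come from no paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical counts: (samples drawn, samples correct) for two problems.
base_model = [(100, 10), (100, 2)]   # rarely right, but sometimes on both
rl_model = [(100, 60), (100, 0)]     # much sharper on one, never on the other

pass1_base, pass1_rl = mean_pass_at_k(base_model, 1), mean_pass_at_k(rl_model, 1)
pass64_base, pass64_rl = mean_pass_at_k(base_model, 64), mean_pass_at_k(rl_model, 64)
```

With these made-up numbers the RL model wins at pass@1 (0.30 versus 0.06) but loses at pass@64 (0.50 versus about 0.94), because the base model still occasionally solves the second problem and the RL model never does.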
Failure modes (and the fixes)
| Failure mode | What happens | Common fix |
|---|---|---|
| Entropy collapse | Policy becomes over-confident, stops exploring, output diversity dies | Clip-Higher (DAPO), entropy bonus, KL control |
| Reward hacking | Gaming weak verifiers / format rewards instead of reasoning | Harder verifiers, hidden tests, length shaping |
| Language mixing | CoT switches languages mid-thought (seen in R1-Zero) | Language-consistency reward |
| Overthinking | Endless CoT that wastes tokens without improving accuracy | Overlong reward shaping, length penalties |
| pass@k regression | Higher pass@1 but worse coverage at large k | Keep some exploration; don’t over-train |
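As an illustration of the length-shaping fix in the table, here is a minimal sketch of a soft overlong penalty. The piecewise-linear shape follows the general idea of DAPO's overlong reward shaping as commonly described; the function and parameter names are hypothetical, and the limits in the example are arbitrary.

```python
def soft_overlong_penalty(length: int, max_len: int, buffer_len: int) -> float:
    """Length penalty added to the correctness reward.

    Zero up to (max_len - buffer_len), then a linear ramp down to -1
    at max_len, and a flat -1 for anything longer.
    """
    soft_limit = max_len - buffer_len
    if length <= soft_limit:
        return 0.0
    if length <= max_len:
        return (soft_limit - length) / buffer_len
    return -1.0

short = soft_overlong_penalty(1000, max_len=4096, buffer_len=1024)
halfway = soft_overlong_penalty(3584, max_len=4096, buffer_len=1024)
runaway = soft_overlong_penalty(5000, max_len=4096, buffer_len=1024)
```

The ramp matters: a hard cutoff would punish a correct trace that is one token over the limit as severely as one that never stops, while the gradual penalty only nudges traces toward shorter reasoning.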
Go deeper: why entropy collapse is the central tension
RL on verifiable rewards is a giant exploitation engine — it relentlessly up-weights paths that already work. Left unchecked, the policy’s per-token entropy crashes: it commits to one strategy and stops sampling alternatives. That raises pass@1 in the short run but kills the diversity that pass@k measures and that long-horizon improvement depends on. Most stabilization tricks — Clip-Higher, dynamic sampling that drops zero-gradient groups, entropy regularization, careful KL — are really exploration-preservation tricks. This is the same explore/exploit dilemma at the heart of all of reinforcement learning, now playing out over token sequences.
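Per-token entropy is the quantity to watch here, and it is cheap to compute from the next-token distributions. A minimal sketch follows, with toy four-token distributions standing in for real model outputs; the function names are hypothetical.

```python
from math import log

def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token distribution."""
    return -sum(p * log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average per-token entropy over a sequence of distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

exploring = [[0.25, 0.25, 0.25, 0.25]] * 3   # undecided at every step
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 3   # nearly deterministic

h_exploring = mean_entropy(exploring)
h_collapsed = mean_entropy(collapsed)
```

Logging this average over training gives an early warning: a steady slide toward zero suggests the policy is committing to one strategy well before pass@k starts to suffer.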
How to train your own reasoning model
You need three ingredients: a base model with latent capability, a verifier for your domain, and a framework to run rollouts at scale.
Step 1, pick a base model and data: Start from a strong base/instruct model and a dataset where answers are verifiable: GSM8K/MATH/AIME for math, problems with unit tests for code. Quality of the verifier matters more than dataset size.
Step 2, design the reward: Usually correctness (1/0 from the checker) plus small format rewards (think/answer tags) plus optional length and language rewards. Keep it simple; complex rewards invite hacking.
Step 3, run the RL loop: Sample a group per prompt, verify, compute group-relative advantages, update with a clipped policy gradient and a KL leash to the reference. Watch entropy and the gold-vs-proxy gap.
Step 4, distill: Rejection-sample correct traces from your RL checkpoint and SFT a smaller model on them. This is often the cheapest way to ship reasoning at low latency.
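The final distillation step amounts to a filter over sampled traces. The sketch below assumes samples arrive as (prompt, completion, ground truth) tuples and that a checker function is supplied by the caller; the record layout and the toy checker are hypothetical, not any framework's format.

```python
def build_distillation_set(samples, is_correct):
    """Keep verified traces, minus exact duplicates, as SFT records."""
    seen = set()
    records = []
    for prompt, completion, truth in samples:
        key = (prompt, completion)
        if key in seen or not is_correct(completion, truth):
            continue  # skip repeats and anything the checker rejects
        seen.add(key)
        records.append({"prompt": prompt, "completion": completion})
    return records

def ends_with_truth(completion, truth):
    """Toy checker: the completion must end with the expected answer."""
    return completion.strip().endswith(truth)

samples = [
    ("What is 6 * 7?", "6 * 7 = 42, so the answer is 42", "42"),
    ("What is 6 * 7?", "6 * 7 = 42, so the answer is 42", "42"),  # duplicate
    ("What is 6 * 7?", "6 + 7 = 13, so the answer is 13", "42"),  # wrong
]
sft_records = build_distillation_set(samples, ends_with_truth)
```

Only one record survives here. In practice the same filter is where quality rules such as length limits or language checks would be added.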
The open frameworks: verl (Volcano Engine RL, scales to large clusters), OpenRLHF, Hugging Face TRL (GRPO built in), and AI2’s open-instruct (the Tülu 3 RLVR stack). Building the verifiers, environments, and graders at production scale is its own emerging industry — see the companies building reasoning RL environments.
Frontiers: beyond math and code
The hard part of reasoning RL now is escaping the verifiable sandbox:
- Agentic / tool-use RL — reward the model for correctly using tools (search, code execution, browsers) over multi-step tasks; the verifier becomes the environment’s outcome. See agentic RL.
- Multimodal reasoning — extend verifiable rewards to vision/diagram problems.
- Unverifiable domains — for writing, strategy, or open-ended analysis there’s no checker, so the frontier is generative verifiers, LLM-as-judge graders, and rubric-based rewards — which reintroduces the reward-hacking risk RLVR was prized for avoiding.
Researcher takes
The OpenAI researcher behind o1’s reasoning work lays out, on launch day, the core thesis for why RL-trained chain-of-thought opens a brand-new axis for scaling.
Karpathy argues the real magic of RL for reasoning is that the solving strategies are emergent and could never come from imitation — a ‘Move 37’ moment for language models.
Frequently asked questions
Is RL-for-reasoning the same as chain-of-thought prompting?
No. CoT prompting just asks a frozen model to show its work. RL-for-reasoning trains the model — via RLVR — to generate reasoning that tends to be correct, rewarding the final answer. Prompting is free and shallow; RL changes the weights and produces models that reason without being asked.
Does o1/R1 use RLHF or RLVR?
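For R1, both, in sequence: the reasoning core is RLVR (verifiable rewards on math/code), and a final RLHF-style alignment pass then makes the model helpful and safe. OpenAI has not published o1’s exact recipe, so the same split is only a plausible reading there. RLVR sharpens correctness; RLHF governs behavior.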
Why GRPO instead of PPO for reasoning?
PPO needs a separate value network (critic) the size of the policy — costly and unstable for very long chains-of-thought. GRPO drops the critic and baselines each sample against a group of siblings on the same prompt, which is cheaper and fits naturally with verifiable rewards where you can sample many attempts.
Does RL actually make models smarter, or just better at sampling?
Genuinely contested. Pass@k studies suggest RLVR mostly re-weights paths the base model already had (better pass@1, worse pass@k), and spurious-reward results show some gains are really pretraining behaviors being un-hidden — and are model-family-dependent. Other work argues the reasoning itself does improve when you measure steps, not just answers. Safest read: RL makes models reliably better within the base model’s reach.
Key papers
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL — Guo et al., 2025 — R1-Zero (pure RL), the multi-stage recipe, the “aha moment.” Nature version.
- DeepSeekMath / GRPO — Shao et al., 2024 — the critic-free algorithm behind R1.
- Learning to reason with LLMs — OpenAI, 2024 — the o1 framing and test-time scaling. o1 System Card.
- Tülu 3 — Lambert et al., 2024 — names and popularizes RLVR.
- Let’s Verify Step by Step — Lightman et al., 2023 — process reward models.
- Kimi k1.5 — 2025 — long-context RL without search/PRMs.
- DAPO — 2025 — open recipe: clip-higher, dynamic sampling, token-level loss, overlong shaping.
- Does RL Really Incentivize Reasoning Beyond the Base Model? — 2025 — the pass@1-vs-pass@k critique.
- Spurious Rewards — 2025 — random rewards work on Qwen, not Llama/OLMo.
Related
RLVR · GRPO · PPO · RLHF · DPO & preference optimization · Reward models · Agentic RL · RL environments · What is reinforcement learning?