- RLVR fine-tunes a model with RL where the reward comes from an automatic checker — not a learned reward model or human raters: correct answer → 1, otherwise 0.
- Because the signal is objective and cheap to compute, it scales without human labeling. It's the core recipe behind reasoning models like DeepSeek-R1, and the name itself was coined by Ai2's Tulu 3.
- It's almost always run with GRPO: a critic-free RL algorithm that scores a group of sampled answers against each other, which fits binary rewards perfectly.
- The open debate: does RLVR teach new reasoning or just sharpen what the base model already knows? pass@1 rises while pass@k at large k can fall — and even random rewards 'work' on some models.
What is RLVR?
Reinforcement Learning with Verifiable Rewards (RLVR) is a post-training method that fine-tunes a language model using reinforcement learning, where the reward comes from an automatic, rule-based checker instead of a learned reward model or human raters. You ask the model a question with a knowable answer — a math problem, a coding task, a constrained instruction — it generates a response, and a verifier awards a reward of 1 if the answer is confirmed correct and 0 otherwise.
That single change has a huge consequence. RLHF needs humans to label preferences and a reward model to generalize from them — slow, expensive, and game-able. RLVR replaces all of that with a deterministic function you can run a million times for free. No human in the loop, no proxy to over-optimize against in the usual way: the reward is the ground truth.
A worked example: a math problem through RLVR
The whole method fits in one loop. Take a prompt with a known answer, sample several attempts, grade each one, and push the policy toward the attempts that scored a 1.
- Prompt and gold answer. Start from a dataset of (question, ground-truth answer) pairs, e.g. “If 3x + 7 = 22, what is x?” with the gold answer 5. No human labels the response; only the final answer is stored.
- Sample a group. The current policy generates G full chain-of-thought attempts for the prompt (say G = 8). Some will reach x = 5, some won’t. This group of rollouts is the raw material GRPO needs.
- Grade each attempt. A verifier extracts the final answer from each response and checks it against the gold value. Correct → reward 1; wrong → 0. An optional small format reward (e.g. did the model put its answer in \boxed{}?) can be added. No reward model is involved.
- Update the policy. Attempts that scored 1 receive a positive advantage relative to the rest of the group and are made more likely; attempts that scored 0 are made less likely.
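The grading step can be sketched in a few lines. This is a minimal illustration, assuming the model is asked to wrap its final answer in \boxed{} and that answers are compared as stripped strings; `extract_boxed` and `verify` are hypothetical helper names, not functions from any particular RLVR codebase.

```python
import re
from typing import Optional

def extract_boxed(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a response, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def verify(response: str, gold: str) -> int:
    """Binary accuracy reward: 1 if the extracted answer equals the gold answer, else 0."""
    answer = extract_boxed(response)
    return int(answer is not None and answer == gold.strip())

# Three hypothetical rollouts for "If 3x + 7 = 22, what is x?" (gold answer "5").
rollouts = [
    "3x = 15, so x = 5. Final answer: \\boxed{5}",
    "3x = 29, so x = 29/3. Final answer: \\boxed{29/3}",
    "x is 5",  # right value, but no boxed answer, so this checker scores it 0
]
rewards = [verify(r, "5") for r in rollouts]  # [1, 0, 0]
```

The third rollout shows why the extraction rule matters: a strict parser turns a correct answer into a reward of 0.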
Where RLVR came from
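The recipe and the name have slightly different origin stories, and most explainers get this wrong. The name comes from Tulu 3 (Lambert et al., Ai2, 2024), which coined “RLVR” and formalized verifier-function rewards for math and instruction following. The default optimizer, GRPO, was introduced in DeepSeekMath (Shao et al., 2024), and DeepSeek-R1 (2025) showed that rule-based verifiable rewards alone can elicit long chain-of-thought.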
What “verifiable” actually means
A domain is verifiable when correctness can be decided by a program, not a person. Three families dominate:
- Math. Parse the final answer and compare to the gold value, ideally with a symbolic engine (SymPy) so that 1/2, 0.5 and \frac{1}{2} all count as equal. The workhorse domain for RLVR.
- Code. Run the generated program against a hidden test suite. Reward = 1 only if all tests pass. The verifier is literally the test runner: concrete and hard to fake.
- Instruction following. Check programmatically verifiable constraints: “answer in exactly 3 bullet points,” “include the word photosynthesis,” “valid JSON matching this schema.” Tulu 3 used these for instruction following.
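Two of these checker families fit in a short sketch. To stay dependency-free it swaps SymPy for the standard library's exact `Fraction` type, so it only handles plain numeric answers; the function names (`to_fraction`, `math_reward`, `three_bullets`, `has_json_keys`) are made up for illustration. A code verifier would additionally need a sandboxed test runner, which is omitted here.

```python
import json
import re
from fractions import Fraction
from typing import Optional, Set

def to_fraction(text: str) -> Optional[Fraction]:
    """Parse '1/2', '0.5' or '\\frac{1}{2}' into an exact Fraction; None if unparseable."""
    text = text.strip()
    latex = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", text)
    if latex:
        text = f"{latex.group(1)}/{latex.group(2)}"
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None

def math_reward(answer: str, gold: str) -> int:
    """1 if both sides parse and are numerically equal, else 0."""
    a, g = to_fraction(answer), to_fraction(gold)
    return int(a is not None and a == g)

def three_bullets(response: str) -> bool:
    """Constraint check: exactly three lines start with '- '."""
    return sum(line.startswith("- ") for line in response.splitlines()) == 3

def has_json_keys(response: str, required: Set[str]) -> bool:
    """Constraint check: the response is a JSON object containing all required keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())

same = math_reward("\\frac{1}{2}", "0.5")   # equivalent forms of one half
different = math_reward("0.51", "1/2")      # close, but not equal
```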
The catch is that “verifiable” is narrower than it looks. Most of what we want from an assistant — be helpful, be tactful, write well — has no programmatic checker, which is exactly the territory where RLHF and reward models still rule. Pushing RLVR past math and code is an active frontier (see below).
How RLVR works under the hood
The reward function
The defining feature is a binary correctness (accuracy) reward, sometimes with a small additive format reward:
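One way to write this reward down, as a sketch: the symbols $x$ (prompt), $y$ (response), $y^\star$ (gold answer) and the small format weight $\lambda$ are notation chosen here for illustration, not taken from a specific paper.

```latex
r(x, y) \;=\; \mathbb{1}\big[\text{verify}(y, y^\star) = \text{correct}\big]
\;+\; \lambda \cdot \mathbb{1}\big[\,y \text{ follows the required format}\,\big],
\qquad 0 \le \lambda \ll 1
```

With $\lambda = 0$ this reduces to the pure binary accuracy reward.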
There is no scalar “quality” head to over-optimize, so the classic RLHF failure of a policy charming a reward model largely disappears — there’s nothing to charm. (Reward hacking doesn’t vanish, though; it just moves to the verifier — more below.)
The optimization algorithm: GRPO
RLVR is almost always paired with GRPO (Group Relative Policy Optimization). Standard PPO needs a learned value/critic network to estimate the baseline for each token — a second large model to train and store. GRPO throws it out. Instead, for each prompt it samples a group of G responses and uses the group’s own statistics as the baseline:
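In symbols, a common form of the group-relative advantage is the following. It uses the same mean and standard-deviation normalization as the minimal training loop later in this article; the notation $r_i$ for the reward of the $i$-th response in the group is an assumption made here.

```latex
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad i = 1, \dots, G
```

Every token of response $i$ shares the same advantage $\hat{A}_i$, since the reward is only observed once per response.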
This fits verifiable rewards perfectly: with binary rewards, the group advantage is just “how many of my siblings also got it right?” — a free, well-calibrated baseline. The policy is then updated with PPO’s clipped objective using these group-relative advantages, plus a KL penalty toward the reference model.
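A small numeric sketch makes the "siblings" intuition concrete. It assumes the population standard deviation and a small epsilon in the denominator (implementations differ on both); `group_advantages` is a hypothetical helper.

```python
import math
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (r_i - mean) / (std + eps), with the population std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# 2 of 8 rollouts correct: the rare correct answers get a large positive advantage.
hard = group_advantages([1, 1, 0, 0, 0, 0, 0, 0])
# 6 of 8 correct: being right is expected, so the bonus shrinks and mistakes cost more.
easy = group_advantages([1, 1, 1, 1, 1, 1, 0, 0])
# All correct (or all wrong): zero variance, every advantage is 0, no learning signal.
flat = group_advantages([1] * 8)
```

For binary rewards with a fraction p of the group correct, a correct response gets roughly sqrt((1 - p) / p) and an incorrect one roughly -sqrt(p / (1 - p)), so the advantage depends only on how many siblings were also right.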
Go deeper: why critic-free matters for RLVR
A value network in PPO has to learn to predict the expected return at every token position — a hard regression problem, especially over long chains of thought where the reward only arrives at the very end. Training it well roughly doubles memory and adds its own instability. Because RLVR rewards are sparse (one signal per full response) and binary, the Monte-Carlo group baseline GRPO uses is both cheaper and often less biased than a struggling critic. That’s why nearly every open RLVR codebase — open-instruct, TRL, verl — defaults to GRPO or a close relative.
Refinements practitioners use
The basic GRPO recipe has known rough edges, and several variants patch them:
| Variant | What it fixes | One-line summary |
|---|---|---|
| DAPO | Entropy collapse, length bias | Decoupled clipping (“clip-higher”), dynamic sampling that drops all-correct/all-wrong groups, token-level loss. Open recipe from ByteDance. |
| Dr.GRPO | Length & difficulty bias in GRPO’s normalization | Removes the response-length and std-dev normalization terms, which bias updates toward longer incorrect responses and toward very easy or very hard questions. |
| RLOO | Critic cost (pre-GRPO) | REINFORCE with a leave-one-out baseline: each sample’s baseline is the mean reward of the other samples for the same prompt. A simpler ancestor of the same idea. |
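As one concrete piece of the table, the filtering half of DAPO-style dynamic sampling can be sketched as below. This is only the filter, under the assumption of binary rewards; the resampling that refills the batch afterwards is left out, and `keep_informative_groups` is a hypothetical name.

```python
from typing import List

def keep_informative_groups(groups: List[List[int]]) -> List[List[int]]:
    """Drop groups whose binary rewards are all equal (all correct or all wrong)."""
    return [g for g in groups if 0 < sum(g) < len(g)]

# Each inner list holds the 0/1 rewards for one prompt's rollouts.
batch = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
kept = keep_informative_groups(batch)  # [[1, 0, 0, 1], [0, 0, 1, 0]]
```

Groups with identical rewards have zero group-relative advantage everywhere, so dropping them removes prompts that would contribute no gradient.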
A minimal RLVR training loop
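import torch  # pseudocode: policy, verify, grpo_clip, KL, ref, beta and optimizer are placeholders
group = policy.sample(prompt, n=G)  # G rollouts for one prompt
rewards = torch.tensor([float(verify(r, gold)) for r in group])  # 1.0 if correct, else 0.0
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative advantage; epsilon guards zero-variance groups
loss = grpo_clip(policy, group, adv) + beta * KL(policy, ref)  # clipped surrogate plus KL penalty toward the reference
optimizer.zero_grad(); loss.backward(); optimizer.step()  # one gradient step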
The pipeline that wraps it is usually SFT → (optional DPO) → RLVR, with most of the engineering effort going into dataset curation (clean, checkable answers) and the verifier itself.
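To see the loop actually run end to end, here is a deliberately tiny toy version. It is not GRPO proper: there is no language model, no clipping and no KL term. The "policy" is a softmax over four candidate answers, the verifier is an equality check, and the update is plain REINFORCE with group-normalized advantages. All names and hyperparameters are illustrative assumptions.

```python
import math
import random
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_toy_policy(gold: int = 2, n_answers: int = 4, G: int = 8,
                     steps: int = 200, lr: float = 0.5, seed: int = 0) -> List[float]:
    """Toy RLVR: sample a group, grade it with a binary check, push toward the 1s."""
    rng = random.Random(seed)
    logits = [0.0] * n_answers
    for _ in range(steps):
        probs = softmax(logits)
        group = rng.choices(range(n_answers), weights=probs, k=G)  # G rollouts
        rewards = [1.0 if a == gold else 0.0 for a in group]       # binary verifier
        mean = sum(rewards) / G
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
        if std == 0.0:
            continue  # all-correct or all-wrong group carries no signal
        advantages = [(r - mean) / std for r in rewards]
        # Policy-gradient step: d log p(a) / d logit_j = 1[j == a] - p_j
        for a, adv in zip(group, advantages):
            for j in range(n_answers):
                grad = (1.0 if j == a else 0.0) - probs[j]
                logits[j] += lr * adv * grad / G
    return softmax(logits)

final_probs = train_toy_policy()  # probability mass shifts toward answer index 2
```

Note that the toy policy already gives the correct answer 25% of the time before training; if that starting probability were zero, no group would ever contain a reward of 1 and nothing would be learned.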
The open questions and failure modes
RLVR’s results are real, but 2025–26 research has been refreshingly honest about what it doesn’t do. This is where most beginner pages stop — and where the interesting science is.
Does RLVR create new reasoning, or just sharpen the base model?
The headline skeptical result: RLVR reliably improves pass@1 (one shot at the answer) but, on several model families, the base model achieves a higher pass@k at large k — meaning RLVR narrows the set of solutions the model explores rather than discovering genuinely new ones.
The counterpoint camp pushes back: standard pass@k can be gamed by lucky guesses with wrong reasoning, and using a stricter CoT-Pass@K metric (the chain must also be valid) shows RLVR genuinely improving correct reasoning. The honest summary in mid-2026: RLVR clearly improves reliable reasoning within the base model’s reach; whether it can discover fundamentally new capabilities is still contested.
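For readers who want to reproduce this kind of comparison, pass@k is usually computed with the unbiased estimator introduced alongside the HumanEval benchmark (Chen et al., 2021): draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k draws is correct. The sketch below assumes that estimator; whether a given paper uses exactly this form should be checked against its methods section.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k wrong samples, so any k draws include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem the model solves in only 5 of 100 samples:
p1 = pass_at_k(100, 5, 1)    # 0.05: rarely right on a single try
p50 = pass_at_k(100, 5, 50)  # roughly 0.97: almost always right given 50 tries
```

This gap is the crux of the debate: a low-probability but nonzero solution barely registers in pass@1 yet dominates pass@k at large k.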
Go deeper: the “invisible leash”
The Invisible Leash (Wu et al., 2025) gives a theoretical frame: RLVR is a support-constrained optimization — it can only up-weight solutions the base model already assigns nonzero probability to, so it can’t reach answers outside the base model’s support. Strikingly, even as token-level entropy rises during training, answer-level entropy collapses: “more uncertain paths ultimately converge onto a smaller set of distinct answers.” That’s the leash — RLVR makes the model more decisive within its origin distribution, but the distribution itself is the ceiling. See also the pass@k study.
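Answer-level entropy is straightforward to measure on your own rollouts. The sketch below assumes final answers have already been extracted as strings; the two sample lists are invented for illustration and are not results from the paper.

```python
import math
from collections import Counter
from typing import List

def answer_entropy(final_answers: List[str]) -> float:
    """Shannon entropy (bits) of the empirical distribution over distinct final answers."""
    counts = Counter(final_answers)
    n = len(final_answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical final answers from 8 rollouts on one prompt, before and after RLVR.
before = answer_entropy(["5", "7", "5", "29/3", "4", "5", "7", "15"])
after = answer_entropy(["5", "5", "5", "5", "5", "5", "7", "5"])
```

A drop from `before` to `after` means the rollouts converge on fewer distinct answers, regardless of how varied the reasoning text leading to them is.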
Spurious rewards: when even random rewards “work”
The most surprising 2025 result. Spurious Rewards (Shao et al.) found that on Qwen2.5-Math-7B, RLVR with random rewards still improved MATH-500 by ~21 points, and an incorrect-label reward by ~24 points — nearly matching the +29 from ground-truth rewards.
The mechanism: RLVR was surfacing a behavior already latent in Qwen (it “thinks in code” without executing it), and almost any reward nudged it to do that more often. The effect did not transfer to Llama3 or OLMo2. The lesson for the field is methodological: a result on one model family is not a result about RLVR — always validate across diverse base models.
Verifier reliability: the reward hacking moves to the checker
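“Binary reward = no reward hacking” is the most common myth about RLVR. The hacking just relocates from a learned reward model to the verifier, and it fails in two directions: false positives, where the model games a brittle checker (e.g. by printing the expected outputs), and false negatives, where an imperfect verifier rejects correct answers (over 38% in one audited dataset).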
This is why verifier design is the real engineering work in RLVR. Symbolic/semantic checkers (or small “verifier LLMs” like TinyV) reduce false negatives; sandboxed execution and hidden tests reduce false positives. Both noisy-reward theory and LLMs-gaming-verifiers work show that asymmetric verifier noise materially hurts RLVR — so the verifier is not a free, perfect oracle, and treating it like one is the classic beginner mistake.
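A simple way to find out how noisy a verifier is: hand-label a sample of responses and compare. The sketch below assumes you have such an audit set as (verifier verdict, human verdict) pairs; `audit_verifier` and the sample records are hypothetical.

```python
from typing import List, Tuple

def audit_verifier(records: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """records: (verifier_said_correct, human_said_correct).

    Returns (false-negative rate, false-positive rate).
    """
    on_correct = [v for v, human in records if human]      # verdicts on truly correct answers
    on_wrong = [v for v, human in records if not human]    # verdicts on truly wrong answers
    fn_rate = sum(not v for v in on_correct) / len(on_correct) if on_correct else 0.0
    fp_rate = sum(v for v in on_wrong) / len(on_wrong) if on_wrong else 0.0
    return fn_rate, fp_rate

audit = [
    (True, True), (False, True), (True, True), (False, True),  # 2 of 4 correct answers rejected
    (True, False), (False, False),                             # 1 of 2 wrong answers accepted
]
fn_rate, fp_rate = audit_verifier(audit)  # (0.5, 0.5)
```

Tracking the two rates separately matters because they call for different fixes: better equivalence checking for false negatives, stricter sandboxing and hidden tests for false positives.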
RLVR vs RLHF: the comparison
| Dimension | RLHF | RLVR |
|---|---|---|
| Reward source | Learned reward model trained on human preferences | Deterministic verifier (parser, test runner, rule) |
| Reward shape | Continuous scalar score | Binary {0, 1} (+ optional format bonus) |
| Human labeling | Required, expensive, the bottleneck | None — rewards are free to compute |
| Best domain | Open-ended quality, tone, helpfulness, safety | Math, code, checkable reasoning |
| Main failure | Reward-model hacking (sycophancy, length) | Verifier false negatives/positives; base-model ceiling |
| Typical optimizer | PPO, DPO | GRPO (+ DAPO/Dr.GRPO) |
Frontier models use both: RLVR to sharpen reasoning, then RLHF/RLAIF to keep the result helpful and safe. They’re complementary axes — verify the reward where you can, learn the reward where you can’t.
Pushing RLVR beyond math and code
The obvious limit of RLVR is that most useful tasks aren’t programmatically checkable. The 2025–26 frontier is widening the “verifiable” tent:
- Rubrics as rewards. Instead of one binary check, score a response against a structured checklist (“cites a source ✓, addresses the counterargument ✓, no factual errors ✓”). Each criterion is more checkable than “is this essay good?”, turning a fuzzy task into a sum of semi-verifiable parts.
- Generative / model-based verifiers. Use a capable LLM to judge correctness for answers that exact-match misses (algebraic equivalence, free-form proofs) — recovering false negatives at the cost of reintroducing some model judgment.
- LLM-as-judge for open-ended domains. For writing, dialogue and other taste-driven tasks, a judge model supplies the reward — which blurs the line back toward RLHF/RLAIF. The trade-off is honest: the more subjective the domain, the less “verifiable” the reward truly is.
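The rubric idea from the list above can be sketched as a weighted checklist. The keyword checks below are crude stand-ins chosen so the example runs on its own; in practice each criterion would more likely be scored by a judge model. `Criterion`, `rubric_reward` and the weights are assumptions for illustration.

```python
from typing import Callable, List, Tuple

Criterion = Tuple[str, Callable[[str], bool], float]  # (name, check, weight)

def rubric_reward(response: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria the response satisfies, in [0, 1]."""
    total = sum(weight for _, _, weight in rubric)
    earned = sum(weight for _, check, weight in rubric if check(response))
    return earned / total

rubric: List[Criterion] = [
    ("cites a source", lambda r: "[1]" in r or "http" in r, 1.0),
    ("addresses the counterargument", lambda r: "however" in r.lower(), 1.0),
    ("stays under 200 words", lambda r: len(r.split()) <= 200, 0.5),
]
full = rubric_reward("Tests pass [1]. However, coverage is thin.", rubric)  # all criteria met: 1.0
partial = rubric_reward("No.", rubric)  # only the length criterion is met
```

The reward is no longer binary, which gives a denser signal but also gives the policy more surface to game, one criterion at a time.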
When should you use RLVR?
Use it when: you have tasks with checkable answers (math, code, structured extraction, constrained formatting), a base model already competent in the domain, and a verifier you can make robust to false negatives. Start from SFT, use GRPO, and validate on more than one model family.
Avoid it when: the task is open-ended or taste-driven (creative writing, tone, safety nuance) with no programmatic check; the base model can’t do the task at all (RLVR can’t conjure ability outside its support); or your verifier is a brittle string-matcher that will punish correct answers. In those cases, use RLHF/DPO or invest in a better verifier first.
Building the verifiers, datasets and RL environments that make RLVR work at production scale is its own emerging industry — see the companies building verifiable-reward environments and RL environments.
Frequently asked questions
Is RLVR the same as GRPO?
No. RLVR describes where the reward comes from (a verifier). GRPO is the RL algorithm used to optimize against it. RLVR can run with PPO, RLOO or DAPO too — GRPO is just the most common pairing because critic-free, group-relative advantages suit binary rewards so well.
Does RLVR replace RLHF?
No — they solve different problems. RLVR handles objectively checkable tasks (math, code); RLHF handles taste-driven ones (helpfulness, tone, safety). Frontier reasoning models use RLVR to sharpen reasoning and RLHF/RLAIF to stay aligned. See RL for reasoning.
If the reward is just “correct or not,” why isn’t there any reward hacking?
There is — it moves to the verifier. Models can game brittle checkers (e.g. printing expected outputs), and imperfect verifiers wrongly reject correct answers (over 38% false negatives in one audited dataset). “Binary reward = no hacking” is the most common myth about RLVR.
Did RLVR teach DeepSeek-R1 to reason, or was it already there?
Contested. RLVR clearly made R1 reliably produce long chains of thought and self-verification. Whether it created new reasoning or just up-weighted paths already in the base model is the pass@1-vs-pass@k debate — and the “invisible leash” theory argues RLVR is bounded by base-model support.
Key papers
- Tulu 3 — Lambert et al., Ai2, 2024 — coined “RLVR” and formalized verifier-function rewards for math and instruction following.
- DeepSeekMath — Shao et al., 2024 — introduces GRPO, the critic-free optimizer that became RLVR’s default.
- DeepSeek-R1 — DeepSeek, 2025 — rule-based verifiable rewards alone elicit long chain-of-thought; the flagship RLVR reasoning model.
- Does RL Really Incentivize Reasoning Beyond the Base Model? — Yue et al., 2025 — pass@1 up, pass@k down; the skeptics’ case.
- Spurious Rewards — Shao et al., 2025 — random/incorrect rewards still help Qwen-Math; gains are model-dependent.
- The Invisible Leash — Wu et al., 2025 — RLVR as support-constrained reweighting; answer-level entropy collapse.
- Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers — 2025 — how asymmetric verifier noise degrades training.