RLVR: RL with Verifiable Rewards, Explained

Key takeaways

RLVR fine-tunes a model with RL where the reward comes from an automatic checker — not a learned reward model or human raters: correct answer → 1, otherwise 0.
Because the signal is objective and free to compute, it scales without human labeling — it's the core recipe behind reasoning models like DeepSeek-R1 and Ai2's Tulu 3 (which coined the name).
It's almost always run with GRPO: a critic-free RL algorithm that scores a group of sampled answers against each other, which fits binary rewards perfectly.
The open debate: does RLVR teach new reasoning or just sharpen what the base model already knows? pass@1 rises while pass@k at large k can fall — and even random rewards 'work' on some models.

What is RLVR?

Reinforcement Learning with Verifiable Rewards (RLVR) is a post-training method that fine-tunes a language model using reinforcement learning, where the reward comes from an automatic, rule-based checker instead of a learned reward model or human raters. You ask the model a question with a knowable answer — a math problem, a coding task, a constrained instruction — it generates a response, and a verifier awards a reward of 1 if the answer is confirmed correct and 0 otherwise.

That single change has a huge consequence. RLHF needs humans to label preferences and a reward model to generalize from them — slow, expensive, and game-able. RLVR replaces all of that with a deterministic function you can run a million times for free. No human in the loop, no proxy to over-optimize against in the usual way: the reward is the ground truth.

RLHF learns the reward from human preferences; RLVR computes it from a deterministic verifier. Same RL objective, different source of truth.

A worked example: a math problem through RLVR

The whole method fits in one loop. Take a prompt with a known answer, sample several attempts, grade each one, and push the policy toward the attempts that scored a 1.

Prompt with a knowable answer

Start from a dataset of (question, ground-truth answer) pairs — e.g. “If 3x + 7 = 22, what is x?” with the gold answer 5. No human labels the response; only the final answer is stored.

Sample a group of responses

The current policy generates G full chain-of-thought attempts for the prompt (say G = 8). Some will reach x = 5, some won’t. This group of rollouts is the raw material GRPO needs.

Verify each one → binary reward

A verifier extracts the final answer from each response and checks it against the gold value. Correct → reward 1; wrong → 0. An optional small format reward (e.g. did the model put its answer in \boxed{}?) can be added. No reward model is involved.

Compute group-relative advantage and update

Each response’s advantage is its reward normalized by the group’s mean and standard deviation. Responses that beat the group average get pushed up; the rest get pushed down. The policy is updated with GRPO (or PPO) under a KL penalty toward the reference model, then the loop repeats.

▶ DeepSeek R1 Theory Tutorial — Architecture, GRPO, KL Divergence (the verifiable-reward reasoning recipe)

Where RLVR came from

The recipe and the name have slightly different origin stories — and most explainers get this wrong.

Feb 2024

DeepSeekMath introduces GRPO

DeepSeek trains a math model with rule-based correctness rewards optimized by a new critic-free algorithm, GRPO. This is the technical seed of the verifiable-reward recipe.

Nov 2024

Tulu 3 coins 'RLVR'

Ai2’s Tulu 3 post-training paper (Lambert et al.) formalizes the verifier-function approach for math and instruction following and gives it a name — Reinforcement Learning with Verifiable Rewards. (It was nearly called RLGT, “RL with Ground-Truth rewards.”)

Jan 2025

DeepSeek-R1 makes it famous

R1 and R1-Zero show that rule-based verifiable rewards alone — with no SFT warm-up in the Zero variant — can elicit long chain-of-thought, self-verification and “aha” backtracking. The reasoning-model boom begins.

2025–26

The scrutiny phase

A wave of papers probes whether RLVR adds new reasoning or just reweights the base model — pass@k studies, the “spurious rewards” finding, the “invisible leash,” and work on verifier reliability.

What “verifiable” actually means

A domain is verifiable when correctness can be decided by a program, not a person. Three families dominate:

Math — symbolic / exact checks

Parse the final answer and compare to the gold value, ideally with a symbolic engine (SymPy) so that 1/2, 0.5 and \frac{1}{2} all count as equal. The workhorse domain for RLVR.

Code — unit tests / execution

Run the generated program against a hidden test suite. Reward = 1 only if all tests pass. The verifier is literally the test runner — concrete and hard to fake.

Constrained instructions

Check programmatically verifiable constraints: “answer in exactly 3 bullet points,” “include the word photosynthesis,” “valid JSON matching this schema.” Tulu 3 used these for instruction following.

The catch is that “verifiable” is narrower than it looks. Most of what we want from an assistant — be helpful, be tactful, write well — has no programmatic checker, which is exactly the territory where RLHF and reward models still rule. Pushing RLVR past math and code is an active frontier (see below).

How RLVR works under the hood

The reward function

The defining feature is a binary correctness (accuracy) reward, sometimes with a small additive format reward:

r(x, y) = \mathbb{1}[\text{verify}(y) = \text{correct}] \;+\; \lambda\,\mathbb{1}[\text{format ok}]

There is no scalar “quality” head to over-optimize, so the classic RLHF failure of a policy charming a reward model largely disappears — there’s nothing to charm. (Reward hacking doesn’t vanish, though; it just moves to the verifier — more below.)

The optimization algorithm: GRPO

RLVR is almost always paired with GRPO (Group Relative Policy Optimization). Standard PPO needs a learned value/critic network to estimate the baseline for each token — a second large model to train and store. GRPO throws it out. Instead, for each prompt it samples a group of G responses and uses the group’s own statistics as the baseline:

A_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}

This fits verifiable rewards perfectly: with binary rewards, the group advantage is just “how many of my siblings also got it right?” — a free, well-calibrated baseline. The policy is then updated with PPO’s clipped objective using these group-relative advantages, plus a KL penalty toward the reference model.

Go deeper: why critic-free matters for RLVR

A value network in PPO has to learn to predict the expected return at every token position — a hard regression problem, especially over long chains of thought where the reward only arrives at the very end. Training it well roughly doubles memory and adds its own instability. Because RLVR rewards are sparse (one signal per full response) and binary, the Monte-Carlo group baseline GRPO uses is both cheaper and often less biased than a struggling critic. That’s why nearly every open RLVR codebase — open-instruct, TRL, verl — defaults to GRPO or a close relative.

The basic GRPO recipe has known rough edges, and several variants patch them:

Variant	What it fixes	One-line summary
DAPO	Entropy collapse, length bias	Decoupled clipping (“clip-higher”), dynamic sampling that drops all-correct/all-wrong groups, token-level loss. Open recipe from ByteDance.
Dr.GRPO	Length & difficulty bias in GRPO’s normalization	Removes the response-length and std-dev normalization terms that bias updates toward longer/easier answers.
RLOO	Critic cost (pre-GRPO)	REINFORCE-with-leave-one-out baseline; uses other samples in the batch as the baseline. A simpler ancestor of the same idea.

A minimal RLVR training loop

Pseudocodeone RLVR step (GRPO)

for prompt, gold in batch:
  group = policy.sample(prompt, n=G)
  rewards = [verify(r, gold) for r in group]
  adv = (rewards - mean(rewards)) / std(rewards)
  loss = grpo_clip(policy, group, adv) + beta * KL(policy, ref)
  loss.backward(); optimizer.step()

The pipeline that wraps it is usually SFT → (optional DPO) → RLVR, with most of the engineering effort going into dataset curation (clean, checkable answers) and the verifier itself.

The open questions and failure modes

RLVR’s results are real, but 2025–26 research has been refreshingly honest about what it doesn’t do. This is where most beginner pages stop — and where the interesting science is.

Does RLVR create new reasoning, or just sharpen the base model?

The headline skeptical result: RLVR reliably improves pass@1 (one shot at the answer) but, on several model families, the base model achieves a higher pass@k at large k — meaning RLVR narrows the set of solutions the model explores rather than discovering genuinely new ones.

The pass@k debate: RLVR lifts the curve at small k (better single-shot accuracy) but can sit below the base model at large k — it samples known-good paths more reliably, but may explore fewer of them.

The counterpoint camp pushes back: standard pass@k can be gamed by lucky guesses with wrong reasoning, and using a stricter CoT-Pass@K metric (the chain must also be valid) shows RLVR genuinely improving correct reasoning. The honest summary in mid-2026: RLVR clearly improves reliable reasoning within the base model’s reach; whether it can discover fundamentally new capabilities is still contested.

Go deeper: the “invisible leash”

The Invisible Leash (Wu et al., 2025) gives a theoretical frame: RLVR is a support-constrained optimization — it can only up-weight solutions the base model already assigns nonzero probability to, so it can’t reach answers outside the base model’s support. Strikingly, even as token-level entropy rises during training, answer-level entropy collapses: “more uncertain paths ultimately converge onto a smaller set of distinct answers.” That’s the leash — RLVR makes the model more decisive within its origin distribution, but the distribution itself is the ceiling. See also the pass@k study.

Spurious rewards: when even random rewards “work”

The most surprising 2025 result. Spurious Rewards (Shao et al.) found that on Qwen2.5-Math-7B, RLVR with random rewards still improved MATH-500 by ~21 points, and an incorrect-label reward by ~24 points — nearly matching the +29 from ground-truth rewards.

+29 pts

MATH-500 gain from correct rewards (Qwen2.5-Math-7B)

+21 pts

…from RANDOM rewards on the same model

~0 pts

…the same spurious rewards on Llama / OLMo

The mechanism: RLVR was surfacing a behavior already latent in Qwen (it “thinks in code” without executing it), and almost any reward nudged it to do that more often. The effect did not transfer to Llama3 or OLMo2. The lesson for the field is methodological: a result on one model family is not a result about RLVR — always validate across diverse base models.

Verifier reliability: the reward hacking moves to the checker

“Binary reward = no reward hacking” is the most common myth about RLVR. The hacking just relocates from a learned reward model to the verifier:

This is why verifier design is the real engineering work in RLVR. Symbolic/semantic checkers (or small “verifier LLMs” like TinyV) reduce false negatives; sandboxed execution and hidden tests reduce false positives. Both noisy-reward theory and LLMs-gaming-verifiers work show that asymmetric verifier noise materially hurts RLVR — so the verifier is not a free, perfect oracle, and treating it like one is the classic beginner mistake.

RLVR vs RLHF: the comparison

Dimension	RLHF	RLVR
Reward source	Learned reward model trained on human preferences	Deterministic verifier (parser, test runner, rule)
Reward shape	Continuous scalar score	Binary `{0, 1}` (+ optional format bonus)
Human labeling	Required, expensive, the bottleneck	None — rewards are free to compute
Best domain	Open-ended quality, tone, helpfulness, safety	Math, code, checkable reasoning
Main failure	Reward-model hacking (sycophancy, length)	Verifier false negatives/positives; base-model ceiling
Typical optimizer	PPO, DPO	GRPO (+ DAPO/Dr.GRPO)

Frontier models use both: RLVR to sharpen reasoning, then RLHF/RLAIF to keep the result helpful and safe. They’re complementary axes — verify the reward where you can, learn the reward where you can’t.

Pushing RLVR beyond math and code

The obvious limit of RLVR is that most useful tasks aren’t programmatically checkable. The 2025–26 frontier is widening the “verifiable” tent:

Rubrics as rewards. Instead of one binary check, score a response against a structured checklist (“cites a source ✓, addresses the counterargument ✓, no factual errors ✓”). Each criterion is more checkable than “is this essay good?”, turning a fuzzy task into a sum of semi-verifiable parts.
Generative / model-based verifiers. Use a capable LLM to judge correctness for answers that exact-match misses (algebraic equivalence, free-form proofs) — recovering false negatives at the cost of reintroducing some model judgment.
LLM-as-judge for open-ended domains. For writing, dialogue and other taste-driven tasks, a judge model supplies the reward — which blurs the line back toward RLHF/RLAIF. The trade-off is honest: the more subjective the domain, the less “verifiable” the reward truly is.

When should you use RLVR?

Good fit

You have tasks with checkable answers (math, code, structured extraction, constrained formatting), a base model already competent in the domain, and a verifier you can make robust to false negatives. Start from SFT, use GRPO, and validate on more than one model family.

Poor fit

The task is open-ended or taste-driven (creative writing, tone, safety nuance) with no programmatic check; the base model can’t do the task at all (RLVR can’t conjure ability outside its support); or your verifier is a brittle string-matcher that will punish correct answers. Use RLHF/DPO or invest in a better verifier first.

Building the verifiers, datasets and RL environments that make RLVR work at production scale is its own emerging industry — see the companies building verifiable-reward environments and RL environments.

Researcher takes

A widely-discussed result challenging the assumption that the reward signal is what makes RLVR work.

View Stella Li's post on X →

A sharp caveat from one of the spurious-rewards authors on why these surprising RLVR results don’t generalize.

View Rulin Shao's post on X →

Frequently asked questions

Is RLVR the same as GRPO?

No. RLVR describes where the reward comes from (a verifier). GRPO is the RL algorithm used to optimize against it. RLVR can run with PPO, RLOO or DAPO too — GRPO is just the most common pairing because critic-free, group-relative advantages suit binary rewards so well.

Does RLVR replace RLHF?

No — they solve different problems. RLVR handles objectively checkable tasks (math, code); RLHF handles taste-driven ones (helpfulness, tone, safety). Frontier reasoning models use RLVR to sharpen reasoning and RLHF/RLAIF to stay aligned. See RL for reasoning.

If the reward is just “correct or not,” why isn’t there any reward hacking?

There is — it moves to the verifier. Models can game brittle checkers (e.g. printing expected outputs), and imperfect verifiers wrongly reject correct answers (over 38% false negatives in one audited dataset). “Binary reward = no hacking” is the most common myth about RLVR.

Did RLVR teach DeepSeek-R1 to reason, or was it already there?

Contested. RLVR clearly made R1 reliably produce long chains of thought and self-verification. Whether it created new reasoning or just up-weighted paths already in the base model is the pass@1-vs-pass@k debate — and the “invisible leash” theory argues RLVR is bounded by base-model support.

Key papers

Tulu 3 — Lambert et al., Ai2, 2024 — coined “RLVR” and formalized verifier-function rewards for math and instruction following.
DeepSeekMath — Shao et al., 2024 — introduces GRPO, the critic-free optimizer that became RLVR’s default.
DeepSeek-R1 — DeepSeek, 2025 — rule-based verifiable rewards alone elicit long chain-of-thought; the flagship RLVR reasoning model.
Does RL Really Incentivize Reasoning Beyond the Base Model? — Yue et al., 2025 — pass@1 up, pass@k down; the skeptics’ case.
Spurious Rewards — Shao et al., 2025 — random/incorrect rewards still help Qwen-Math; gains are model-dependent.
The Invisible Leash — Wu et al., 2025 — RLVR as support-constrained reweighting; answer-level entropy collapse.
RLVR yet Noisy Rewards under Imperfect Verifiers — 2025 — how asymmetric verifier noise degrades training.

RLHF · GRPO · PPO · DPO & preference optimization · Reward models · RL for reasoning · Agentic RL · What is reinforcement learning?

RLVR: Reinforcement Learning with Verifiable Rewards

What is RLVR?

A worked example: a math problem through RLVR

Where RLVR came from

What “verifiable” actually means

How RLVR works under the hood

The reward function

The optimization algorithm: GRPO

Refinements practitioners use

A minimal RLVR training loop

The open questions and failure modes

Does RLVR create new reasoning, or just sharpen the base model?

Spurious rewards: when even random rewards “work”

Verifier reliability: the reward hacking moves to the checker

RLVR vs RLHF: the comparison

Pushing RLVR beyond math and code

When should you use RLVR?

Researcher takes

Frequently asked questions

Key papers

Related