- GRPO is a reinforcement-learning algorithm for LLMs that drops PPO's value network — it samples a group of answers per prompt and uses the group's own average reward as the baseline.
- Each answer's advantage is just how much it beats its peers: Aᵢ = (rᵢ − mean(r)) / std(r). No critic means ~40% less memory and fewer moving parts.
- Introduced in DeepSeekMath (2024) and made famous by DeepSeek-R1 (2025), it's the default recipe for reasoning models trained with verifiable rewards (RLVR).
- It has known biases (length, difficulty) — the 2025 variants Dr. GRPO, DAPO and GSPO each fix a specific one; most production runs now set the KL term β = 0.
What is GRPO?
Group Relative Policy Optimization (GRPO) is a reinforcement-learning algorithm for fine-tuning large language models. Its one big idea is a study group that grades itself: for each prompt, the model generates a group of answers (say 8 or 16), each answer is scored, and an answer is rewarded for beating the group average — not for hitting some absolute target. Answers above the group’s mean get pushed up; answers below it get pushed down.
That sounds small, but it removes the most expensive, most finicky component of the classic PPO recipe: the value network (the “critic”). PPO trains a second neural network, nearly as large as the policy itself, just to estimate how good a state is so it can compute an advantage. GRPO replaces that learned estimate with a statistic computed from the group of samples — the mean reward becomes the baseline, for free. No critic to train, store, or get wrong.
Where GRPO came from
GRPO was introduced by DeepSeek in the DeepSeekMath paper (Feb 2024), as “a variant of PPO that enhances mathematical reasoning while concurrently optimizing the memory usage of PPO.” It then went mainstream a year later when DeepSeek-R1 (Jan 2025) used it at scale to elicit emergent reasoning — including the now-famous R1-Zero, which applied GRPO with pure RL and no supervised fine-tuning and watched chain-of-thought, self-verification, and “aha moments” emerge on their own.
The core idea: group-relative advantage
How a group of samples replaces the critic
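In any policy-gradient method you update the policy in proportion to an advantage: “how much better than expected was this action?” PPO learns a value function $V(s_t)$ to define “expected,” then estimates each token's advantage $A_t$ against it (in practice with GAE). GRPO instead samples a group of $G$ completions $o_1, \dots, o_G$ for the same prompt, scores each with a reward $r_i$, and normalizes the rewards within the group:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$

Every token of completion $o_i$ shares the same advantage $A_i$ (in the outcome-supervised case). The mean is the baseline; dividing by the standard deviation puts every group's advantages on the same scale, so a prompt whose rewards are widely spread doesn’t dominate one whose rewards are tightly clustered.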
A worked numeric example
Suppose for one math prompt the model samples four answers and a rule-based checker scores them (1 = correct, 0 = wrong):
| Answer | Reward | Reward − mean | Advantage |
|---|---|---|---|
| A (correct) | 1 | +0.5 | +1.0 |
| B (correct) | 1 | +0.5 | +1.0 |
| C (wrong) | 0 | −0.5 | −1.0 |
| D (wrong) | 0 | −0.5 | −1.0 |
Here mean = 0.5 and std = 0.5, so the correct answers get $A = (1 - 0.5)/0.5 = +1.0$ and the wrong ones $A = (0 - 0.5)/0.5 = -1.0$. The gradient pushes the policy up on tokens that produced the correct chains and down on the failed ones: no critic, no learned value, just the group comparing itself. If all four answers were correct (or all wrong), every reward equals the mean and the std is 0, so the advantages are 0/0 on paper and zero in practice (implementations typically add a small constant to the denominator). That prompt teaches nothing, which is itself a useful signal (TRL logs it as frac_reward_zero_std).
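The worked example can be checked in a few lines. This is a minimal sketch, not any library's implementation: the function name `group_advantages` and the small `eps` in the denominator are assumptions, and it uses the population standard deviation so that it matches std = 0.5 above (an implementation using the sample standard deviation would give roughly ±0.87 for the same group).

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r_i - mean) / (std + eps) within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, matching std = 0.5 in the table
    return [(r - mu) / (sigma + eps) for r in rewards]

# The four answers from the table: A and B correct, C and D wrong.
print([round(a, 3) for a in group_advantages([1, 1, 0, 0])])  # [1.0, 1.0, -1.0, -1.0]

# A group that fully agrees carries no learning signal: every advantage is zero.
print(group_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
```

The `eps` term is what turns the undefined 0/0 case into a clean zero.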
GRPO vs PPO: what’s actually different
No value network, lower memory
| | PPO | GRPO |
|---|---|---|
| Models in memory | policy, reference, reward, value | policy, reference, reward |
| Baseline for advantage | learned value function | group mean reward |
| Advantage estimator | GAE over the value net | group normalization |
| Memory / compute | higher (extra ~policy-sized net) | ~40% lower |
| Best fit | general RLHF, dense per-token control | reasoning + verifiable rewards |
Dropping the critic isn’t just a memory win — it removes a whole class of bugs. A miscalibrated value net silently corrupts every advantage; GRPO’s baseline is a transparent arithmetic mean you can inspect.
Why this fits reasoning + verifiable rewards (RLVR)
GRPO pairs naturally with Reinforcement Learning with Verifiable Rewards (RLVR): tasks where a program can check correctness — does the math answer match? do the unit tests pass? does the output match the required format? You write a cheap rule-based reward function instead of training a reward model. That sidesteps reward hacking of a learned RM, and the binary/sparse reward is exactly what group-relative normalization handles gracefully — you don’t need fine-grained per-token value estimates, just “which of these whole answers were right.”
Sample a group, score each answer with a verifier (math checker, test suite, regex on format), baseline against the group. Powers DeepSeek-R1, and most open math/code reasoning models. See RL for reasoning.
You can still feed GRPO a learned reward model instead of a verifier — useful for open-ended quality where “correct” isn’t checkable. You lose RLVR’s hack-resistance but keep the no-critic efficiency.
The math (practitioner depth)
The clipped surrogate objective
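GRPO keeps PPO’s clipped surrogate so each update stays in a trust region, then averages over the group and over each completion's tokens:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\,A_{i,t},\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,A_{i,t}\big) - \beta\,D_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]\Big)\right]$$

where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the per-token probability ratio and $A_{i,t}$ is the advantage of token $t$ in completion $o_i$. The $1-\epsilon$ / $1+\epsilon$ pair caps how far the policy can move in a single step, just like PPO.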
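To make the clipping concrete, here is a rough sketch of the surrogate in plain Python, leaving out the KL term. The function names and the input layout (lists of per-token log-probabilities plus one advantage per completion) are illustrative assumptions; a real trainer would do this on batched tensors.

```python
import math

def clipped_token_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)            # per-token probability ratio
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(ratio, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped * advantage)

def grpo_objective(group, eps=0.2):
    """Average over completions, each averaged over its own tokens (no KL term).

    `group` is a list of (logps_new, logps_old, advantage) triples,
    one triple per sampled completion.
    """
    per_completion = []
    for logps_new, logps_old, adv in group:
        tokens = [clipped_token_objective(n, o, adv, eps)
                  for n, o in zip(logps_new, logps_old)]
        per_completion.append(sum(tokens) / len(tokens))
    return sum(per_completion) / len(per_completion)

# A token whose probability rose 1.5x: the gain is capped at 1 + eps for a good
# answer, while the penalty for a bad answer is left unclipped.
print(round(clipped_token_objective(math.log(1.5), 0.0, +1.0), 3))  # 1.2
print(round(clipped_token_objective(math.log(1.5), 0.0, -1.0), 3))  # -1.5
```

The asymmetry in the two printed values is the point of taking the `min`: the objective never rewards moving further than the clip range, but it never hides a move in the wrong direction either.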
The KL penalty to the reference policy
The $\beta\,D_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]$ term keeps the policy near a frozen reference model $\pi_{\text{ref}}$, the same “safety belt” idea as RLHF. DeepSeekMath used an unbiased KL estimator added directly to the loss (not folded into the reward as in classic PPO-RLHF).
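The estimator DeepSeekMath describes is $\frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1$, evaluated per token. A minimal sketch, assuming you already have the two log-probabilities of the sampled token (the function name `kl_k3` is just a label used here):

```python
import math

def kl_k3(logp_policy, logp_ref):
    """Per-token KL estimate: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0

print(kl_k3(-1.0, -1.0))      # 0.0 when policy and reference agree
print(kl_k3(-0.5, -1.0) > 0)  # True: positive once they disagree
```

Because $e^x - x - 1 \ge 0$ for every $x$, each per-token estimate is non-negative, which a naive log-ratio estimate does not guarantee.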
Outcome vs process supervision
- Outcome supervision (the common case): one reward for the whole answer; every token in $o_i$ gets the same advantage $A_i$. Cheap, and all you need with a final-answer checker.
- Process supervision: a process reward model scores each reasoning step; the advantage at token $t$ is the sum of the normalized step rewards from $t$ onward. Denser signal, better credit assignment on long chains, but you need step-level labels or a PRM.
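The process-supervised case can be sketched at step granularity. This is an illustration under assumptions: `step_rewards[i][j]` stands for a hypothetical PRM score for step `j` of completion `i`, rewards are normalized over every step in the group, and every token inside a step would share that step's advantage.

```python
from statistics import mean, pstdev

def process_advantages(step_rewards, eps=1e-6):
    """Per-step advantage: sum of normalized step rewards from that step to the end."""
    flat = [r for steps in step_rewards for r in steps]
    mu, sigma = mean(flat), pstdev(flat)  # statistics over all steps in the group
    advantages = []
    for steps in step_rewards:
        normed = [(r - mu) / (sigma + eps) for r in steps]
        suffix, running = [], 0.0
        for r in reversed(normed):        # accumulate from the last step backwards
            running += r
            suffix.append(running)
        advantages.append(suffix[::-1])
    return advantages

# Completion 0 has two good steps; completion 1 goes wrong at its second step.
adv = process_advantages([[1.0, 1.0], [1.0, 0.0]])
print([[round(a, 2) for a in steps] for steps in adv])
```

In this toy group the bad second step drags down the advantage of the step before it as well, which is the credit-assignment effect the suffix sum is meant to produce.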
Known biases and failure modes
GRPO works, but its exact normalization introduces subtle optimization biases that 2025 papers diagnosed and fixed.
Length bias and difficulty bias
- Length bias. The loss divides each response's summed token loss by that response's own length $|o_i|$. For a wrong answer (negative advantage), a longer response therefore receives a smaller per-token penalty, so the policy learns to ramble, especially when it is wrong. Dr. GRPO (“Understanding R1-Zero-Like Training”) traced this to the per-response length normalization.
- Difficulty bias. Dividing by $\operatorname{std}(r)$ over-weights questions that are very easy or very hard (low-variance groups) and under-weights medium-difficulty ones, distorting the curriculum. TRL exposes scale_rewards=False to disable std scaling for exactly this reason.
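A quick calculation shows the difficulty bias. Dividing by the std multiplies a whole group's advantages by 1/std, so the sketch below just prints that multiplier for binary rewards in a group of 8. The helper name is hypothetical, and the population standard deviation is an assumption.

```python
from statistics import pstdev

def std_scaling_factor(num_correct, group_size):
    """Multiplier that std-normalization applies to a group's centered rewards."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return 1.0 / pstdev(rewards)

for num_correct in (1, 4, 7):
    print(num_correct, round(std_scaling_factor(num_correct, 8), 2))
# 1 3.02   (hard question: 1 of 8 correct)
# 4 2.0    (medium question: 4 of 8 correct)
# 7 3.02   (easy question: 7 of 8 correct)
```

The hard and easy prompts get about 1.5 times the weight of the medium one, purely because their groups have less reward variance.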
Token-level vs sequence-level aggregation
How you average the loss matters. Averaging per-response (then across responses) gives every answer the same total weight, so each token in a long answer counts for less than a token in a short one; DAPO’s token-level loss averages over all tokens in the batch so every token counts equally regardless of which answer it came from. At the other extreme, GSPO argues the per-token importance ratio is itself the problem (for long sequences the product of token ratios is high-variance) and moves clipping/weighting to the sequence level, which its authors report stabilizes training substantially for Mixture-of-Experts models.
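The two averaging schemes are easy to compare on toy numbers. A small sketch, with made-up per-token loss values and hypothetical function names:

```python
def per_response_mean(token_losses):
    """Average within each response, then across responses."""
    return sum(sum(t) / len(t) for t in token_losses) / len(token_losses)

def token_level_mean(token_losses):
    """Average over every token in the batch, regardless of response."""
    flat = [x for t in token_losses for x in t]
    return sum(flat) / len(flat)

short_resp = [1.0] * 2   # 2-token response, per-token loss +1.0
long_resp = [-1.0] * 8   # 8-token response, per-token loss -1.0

print(per_response_mean([short_resp, long_resp]))  # 0.0
print(token_level_mean([short_resp, long_resp]))   # -0.6
```

Under per-response averaging the two answers cancel exactly, even though one has four times as many tokens; under token-level averaging the long answer's tokens carry their full weight.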
Go deeper: why MoE breaks token-level GRPO
In a Mixture-of-Experts model, routing can send the same token to different experts between the old and new policy, making per-token importance ratios wildly unstable — they no longer measure “how much did this token’s probability change” cleanly. GSPO defines a single sequence-level ratio (the geometric-mean of token ratios) and clips on that, which the Qwen team reports was necessary to scale RL on Qwen3’s MoE checkpoints.
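The sequence-level ratio described above is the geometric mean of the token ratios, which is the same as exponentiating the mean per-token log-ratio. A minimal sketch with invented log-probabilities (function names are assumptions for illustration):

```python
import math

def token_ratios(logps_new, logps_old):
    """One importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(logps_new, logps_old)]

def sequence_ratio(logps_new, logps_old):
    """Geometric mean of token ratios = exp(mean per-token log-ratio)."""
    diffs = [n - o for n, o in zip(logps_new, logps_old)]
    return math.exp(sum(diffs) / len(diffs))

logps_old = [-1.0, -1.0, -1.0, -1.0]
logps_new = [-1.0, -0.2, -1.0, -1.8]   # two tokens swing in opposite directions

print([round(r, 2) for r in token_ratios(logps_new, logps_old)])  # [1.0, 2.23, 1.0, 0.45]
print(round(sequence_ratio(logps_new, logps_old), 2))             # 1.0
```

Token-level clipping would act on the 2.23 and 0.45 outliers individually, while the sequence-level ratio sees that the response as a whole is about as likely as before.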
GRPO variants (2025)
Each major variant maps to a specific bias in vanilla GRPO:
| Variant | Core change | Bias / problem it fixes |
|---|---|---|
| Dr. GRPO | Remove length normalization & std scaling | Length-inflation bias; question-difficulty bias → better token efficiency |
| DAPO | Clip-Higher, dynamic sampling, token-level loss, overlong-reward shaping | Entropy collapse, wasted all-correct/all-wrong groups, length bias |
| GSPO | Sequence-level importance ratio & clipping | High-variance token ratios; instability in MoE / long sequences |
| GMPO | Geometric-mean (not arithmetic-mean) token reward | Sensitivity to outlier tokens |
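Two of the DAPO changes in the table are small enough to sketch directly. The code below is illustrative rather than DAPO's actual implementation: the function names and the dict layout for a group are assumptions, the clip bounds 0.2 and 0.28 are the values reported in the DAPO paper, and the filter shows only the "drop uninformative groups" half of dynamic sampling (the full recipe keeps sampling until the batch is refilled).

```python
def clip_higher(ratio, eps_low=0.2, eps_high=0.28):
    """Asymmetric clip: more room to raise a token's probability than to lower it."""
    return max(1.0 - eps_low, min(ratio, 1.0 + eps_high))

def keep_informative_groups(groups):
    """Drop groups whose rewards are all identical (zero advantage everywhere)."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

groups = [
    {"prompt": "p1", "rewards": [1, 1, 1, 1]},  # all correct: no signal
    {"prompt": "p2", "rewards": [1, 0, 0, 1]},  # mixed: useful
    {"prompt": "p3", "rewards": [0, 0, 0, 0]},  # all wrong: no signal
]
print([g["prompt"] for g in keep_informative_groups(groups)])  # ['p2']
print(clip_higher(1.5), clip_higher(0.5))                      # 1.28 0.8
```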
Implementing GRPO in practice
The fastest path is Hugging Face TRL’s GRPOTrainer. You provide a model, a dataset of prompts, and one or more reward functions (plain Python that returns a score per completion).
A reward function takes the generated completions and returns a list of floats. For RLVR this is a rule-based checker — e.g. parse the boxed answer and compare to ground truth, or run a format regex. You can pass several (correctness + format) and weight them with reward_weights.
The key knobs: num_generations (the group size $G$, e.g. 8–16), epsilon (clip range), and beta (KL weight; defaults to 0.0). Larger $G$ gives a lower-variance baseline at higher rollout cost. Use scale_rewards=False to adopt the Dr. GRPO fix for difficulty bias.
Generation is the bottleneck in any on-policy method. TRL integrates vLLM (use_vllm=True) in colocate or server mode to make sampling the group fast — often the difference between hours and days. TRL also applies truncated importance sampling to correct the train/inference engine mismatch.
Track reward, reward_std, frac_reward_zero_std (prompts where the whole group agreed — wasted signal), kl (only if β > 0), and clip_ratio/* (how often updates hit the trust region). Rising response length with flat accuracy is the classic length-bias warning sign.
Go deeper: a minimal reward function
A correctness reward for math, in spirit, looks like:

    def correctness_reward(completions, ground_truth, **kwargs):
        # extract_boxed is a helper you supply: it pulls the final boxed answer out of a completion.
        scores = []
        for c, gt in zip(completions, ground_truth):
            scores.append(1.0 if extract_boxed(c) == gt else 0.0)
        return scores

Pair it with a format reward (e.g. “uses <think>…</think> then a boxed answer”) and let GRPO normalize both within each group. The verifier is your reward model — see RLVR.
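A format reward in the same shape might look like the sketch below. It assumes completions arrive as plain strings and that the required format is a `<think>…</think>` block followed by a `\boxed{…}` answer; the regex and the name `format_reward` are illustrative choices, not a fixed convention.

```python
import re

# Assumed format: a <think>...</think> block, then an answer containing \boxed{...}.
FORMAT_RE = re.compile(r"^\s*<think>.+?</think>.*\\boxed\{.+?\}", re.DOTALL)

def format_reward(completions, **kwargs):
    """Return 1.0 for completions that follow the assumed format, else 0.0."""
    return [1.0 if FORMAT_RE.search(c) else 0.0 for c in completions]

samples = [
    "<think>2 + 2 = 4</think> The answer is \\boxed{4}",  # reasoning, then boxed answer
    "\\boxed{4}",                                          # right answer, no reasoning block
]
print(format_reward(samples))  # [1.0, 0.0]
```

Keeping the format check separate from the correctness check lets you weight the two independently.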
When to use GRPO vs PPO vs RLOO
| Use… | When |
|---|---|
| GRPO (or DAPO) | Verifiable rewards (math, code, format), reasoning models, limited GPU memory — the default for RLVR. |
| GSPO | Same as GRPO but training a Mixture-of-Experts model or seeing token-ratio instability on long outputs. |
| PPO | You want dense per-token value estimates and maximal control, have memory to spare, or are doing classic RLHF with a learned reward model. |
| RLOO | A lean REINFORCE-with-baseline alternative; baselines each sample against the leave-one-out mean of its group — close cousin of GRPO, sometimes simpler. |
| DPO | You have static preference pairs and want to skip online sampling entirely. |
GRPO, RLOO and DAPO all sit in the same family: on-policy, critic-free, group-baselined policy gradients. They differ mainly in how the baseline and the loss are normalized. PPO is the heavyweight with a critic; DPO leaves the RL loop behind altogether.
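The family resemblance between GRPO and RLOO shows up when you compare the baselines directly. A minimal sketch, ignoring std scaling and using hypothetical function names:

```python
from statistics import mean

def grpo_centered(rewards):
    """GRPO baseline: the mean of the whole group (std scaling left out)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

def rloo_advantages(rewards):
    """RLOO baseline: the mean of the *other* samples in the group."""
    total, n = sum(rewards), len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

rewards = [1, 1, 0, 0]
print(grpo_centered(rewards))                            # [0.5, 0.5, -0.5, -0.5]
print([round(a, 3) for a in rloo_advantages(rewards)])   # [0.667, 0.667, -0.667, -0.667]
```

For a group of size n the leave-one-out advantage is exactly n/(n − 1) times the mean-centered one, so before any std scaling or loss normalization the two differ only by a constant factor.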
Building the environments, verifiers and reward pipelines that GRPO depends on is increasingly its own discipline — see the RL environment and eval providers and our RL environments page.
Researcher takes
Lambert summarizes the now-famous length-normalization bias in DeepSeek’s GRPO.
A contrarian framing arguing GRPO’s length bias might be intentional rather than a defect.
Frequently asked questions
Is GRPO just PPO without the critic?
Essentially yes — that’s the cleanest one-line description. GRPO keeps PPO’s clipped surrogate objective and KL idea but replaces the learned value function with a baseline computed from a group of sampled answers. The practical consequences (no critic to train, ~40% less memory, natural fit for sparse verifiable rewards) are big enough that it behaves like a different recipe.
Does GRPO need a reward model?
No. Its signature use is with rule-based / verifiable rewards (RLVR) — a Python function that checks correctness or format — which avoids training and hacking a learned reward model. You can plug a learned reward model in for open-ended tasks, but the no-RM verifiable setup is what made GRPO famous via DeepSeek-R1.
Why do people set β = 0 (no KL penalty)?
For verifiable-reward reasoning you often want the policy to move far from the base model, and several 2025 studies (Open-Reasoner-Zero, Dr. GRPO, DAPO) found the KL term unnecessary or counter-productive there. TRL now defaults beta=0.0. Keep KL when staying close to a tuned reference (style, safety) matters.
What’s the difference between GRPO and Dr. GRPO / DAPO / GSPO?
They’re all GRPO descendants fixing a specific bias. Dr. GRPO removes length/std normalization (length & difficulty bias). DAPO adds clip-higher, dynamic sampling and a token-level loss (entropy collapse, length, wasted groups). GSPO moves importance ratios to the sequence level (instability on long sequences and MoE models). See the variants table above.
Key papers
- DeepSeekMath — Shao et al., 2024 — introduces GRPO.
- DeepSeek-R1 — DeepSeek-AI, 2025 — GRPO at scale; R1-Zero’s pure-RL reasoning (Nature version).
- Understanding R1-Zero-Like Training (Dr. GRPO) — Liu et al., 2025 — diagnoses length/difficulty bias.
- DAPO — Yu et al., 2025 — open large-scale recipe, 50 on AIME 2024.
- GSPO — Zheng et al., 2025 — sequence-level optimization for MoE RL.
- A Technical Survey of RL for LLMs — 2025 — situates GRPO among PPO and newer methods.
Related
PPO · RLHF · RLVR · DPO & preference optimization · Reward models · RL for reasoning · What is reinforcement learning?