GRPO: Group Relative Policy Optimization

Key takeaways

GRPO is a reinforcement-learning algorithm for LLMs that drops PPO's value network — it samples a group of answers per prompt and uses the group's own average reward as the baseline.
Each answer's advantage is just how much it beats its peers: Aᵢ = (rᵢ − mean(r)) / std(r). No critic means ~40% less memory and fewer moving parts.
Introduced in DeepSeekMath (2024) and made famous by DeepSeek-R1 (2025), it's the default recipe for reasoning models trained with verifiable rewards (RLVR).
It has known biases (length, difficulty). The 2025 variants Dr. GRPO, DAPO and GSPO each fix a specific one, and many recent recipes set the KL weight β = 0.

What is GRPO?

Group Relative Policy Optimization (GRPO) is a reinforcement-learning algorithm for fine-tuning large language models. Its one big idea is a study group that grades itself: for each prompt, the model generates a group of answers (say 8 or 16), each answer is scored, and an answer is rewarded for beating the group average — not for hitting some absolute target. Answers above the group’s mean get pushed up; answers below it get pushed down.

That sounds small, but it removes the most expensive, most finicky component of the classic PPO recipe: the value network (the “critic”). PPO trains a second neural network, nearly as large as the policy itself, just to estimate how good a state is so it can compute an advantage. GRPO replaces that learned estimate with a statistic computed from the group of samples — the mean reward becomes the baseline, for free. No critic to train, store, or get wrong.

GRPO vs PPO at a glance. PPO trains a separate value network to baseline each response; GRPO samples a group and uses the group's own mean reward as the baseline — no critic.


Where GRPO came from

GRPO was introduced by DeepSeek in the DeepSeekMath paper (Feb 2024), as “a variant of PPO that enhances mathematical reasoning while concurrently optimizing the memory usage of PPO.” It then went mainstream a year later when DeepSeek-R1 (Jan 2025) used it at scale to elicit emergent reasoning — including the now-famous R1-Zero, which applied GRPO with pure RL and no supervised fine-tuning and watched chain-of-thought, self-verification, and “aha moments” emerge on their own.

~40%: memory saved by dropping the value network
Feb 2024: GRPO introduced (DeepSeekMath)
Jan 2025: DeepSeek-R1 makes it the default reasoning recipe

Date	Milestone	Details
Feb 2024	DeepSeekMath introduces GRPO	A 7B model hits 51.7% on the competition-level MATH benchmark; GRPO is the RL ingredient that gets it there.
Jan 2025	DeepSeek-R1 & R1-Zero	GRPO scales to a frontier reasoning model; R1-Zero shows reasoning can emerge from pure RL with no SFT.
Mar 2025	Dr. GRPO & DAPO	Two papers diagnose GRPO’s length/difficulty biases and ship fixes; DAPO’s open recipe hits 50 on AIME 2024.
Jul 2025	GSPO	Qwen team moves importance ratios from token-level to sequence-level, stabilizing RL for Mixture-of-Experts models.

The core idea: group-relative advantage

How a group of samples replaces the critic

In any policy-gradient method you update the policy in proportion to an advantage: “how much better than expected was this action?” PPO learns a value function $V(s)$ to define “expected,” then sets advantage $\approx r - V(s)$. GRPO instead samples a group of $G$ completions $\{o_1, \dots, o_G\}$ for the same prompt, scores each, and normalizes the rewards within the group:

\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})}

Every token of completion $o_i$ shares the same advantage $\hat{A}_{i,t}$ (in the outcome-supervised case). The mean is the baseline; dividing by the standard deviation rescales each group’s advantages to unit variance, so the size of the update does not depend on the raw reward scale of a particular prompt.

A worked numeric example

Suppose for one math prompt the model samples four answers and a rule-based checker scores them (1 = correct, 0 = wrong):

Answer	Reward $r_i$	$r_i - \text{mean}$	Advantage $\hat A_i$
A (correct)	1	+0.5	+1.0
B (correct)	1	+0.5	+1.0
C (wrong)	0	−0.5	−1.0
D (wrong)	0	−0.5	−1.0

Here mean = 0.5 and (population) std = 0.5, so the correct answers get $\hat A = +1$ and the wrong ones $\hat A = -1$. The gradient pushes the policy up on tokens that produced the correct chains and down on the failed ones: no critic, no learned value, just the group comparing itself. If all four answers were correct (or all wrong), the std is 0 and the formula is undefined as written; implementations typically add a small constant to the denominator, so every advantage comes out as zero. That prompt teaches nothing, which is itself a useful signal (TRL logs it as frac_reward_zero_std).
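The worked example can be reproduced in a few lines. This is a minimal sketch, not any library's implementation: the function name `group_advantages` and the small `eps` added to the denominator are assumptions made for illustration, and it uses the population standard deviation to match the numbers in the table.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: (r_i - mean) / (std + eps).

    eps is an illustrative safeguard so an all-equal group yields zeros
    instead of a division by zero.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std, as in the worked example
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct, two wrong: advantages come out very close to +1 / -1.
print(group_advantages([1, 1, 0, 0]))

# Everyone correct: std is 0, so every advantage is 0 and the prompt is uninformative.
print(group_advantages([1, 1, 1, 1]))
```

Libraries differ on details such as sample versus population std and the size of the constant, so exact values may differ slightly from this sketch.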

GRPO vs PPO: what’s actually different

No value network, lower memory

	PPO	GRPO
Models in memory	policy, reference, reward, value	policy, reference, reward
Baseline for advantage	learned value function $V(s)$	group mean reward
Advantage estimator	GAE over the value net	group normalization
Memory / compute	higher (extra ~policy-sized net)	~40% lower
Best fit	general RLHF, dense per-token control	reasoning + verifiable rewards

Dropping the critic isn’t just a memory win — it removes a whole class of bugs. A miscalibrated value net silently corrupts every advantage; GRPO’s baseline is a transparent arithmetic mean you can inspect.

Why this fits reasoning + verifiable rewards (RLVR)

GRPO pairs naturally with Reinforcement Learning with Verifiable Rewards (RLVR): tasks where a program can check correctness — does the math answer match? do the unit tests pass? does the output match the required format? You write a cheap rule-based reward function instead of training a reward model. That sidesteps reward hacking of a learned RM, and the binary/sparse reward is exactly what group-relative normalization handles gracefully — you don’t need fine-grained per-token value estimates, just “which of these whole answers were right.”

GRPO + RLVR (the reasoning recipe)

Sample a group, score each answer with a verifier (math checker, test suite, regex on format), baseline against the group. Powers DeepSeek-R1, and most open math/code reasoning models. See RL for reasoning.

GRPO + learned reward model

You can still feed GRPO a learned reward model instead of a verifier — useful for open-ended quality where “correct” isn’t checkable. You lose RLVR’s hack-resistance but keep the no-critic efficiency.

The math (practitioner depth)

The clipped surrogate objective

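GRPO keeps PPO’s clipped surrogate so each update stays in a trust region, then averages over every token in the group. The form below normalizes by the total token count $\sum_i |o_i|$; the original DeepSeekMath objective instead averages within each response (a $1/|o_i|$ factor) and then across the $G$ responses: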

\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{\sum_{i} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\!\Big( r_{i,t}(\theta)\,\hat A_{i,t},\; \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat A_{i,t} \Big) + \beta\,\mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]

where $r_{i,t}(\theta) = \dfrac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})}$ is the per-token probability ratio. The $\min$ / $\operatorname{clip}$ pair caps how far the policy can move in a single step, just like PPO.
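To see how the $\min$ / $\operatorname{clip}$ pair behaves, here is a small pure-Python sketch of the surrogate part of the loss. It assumes per-token log-probabilities are already available as nested lists, normalizes by the total token count as in the equation above, and leaves out the KL term. The function names are hypothetical.

```python
import math

def clipped_token_term(logp_new, logp_old, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped * advantage)

def grpo_surrogate_loss(logps_new, logps_old, advantages, epsilon=0.2):
    """Negative clipped surrogate, averaged over all tokens in the group.

    logps_new / logps_old: one list of per-token log-probs per completion.
    advantages: one scalar per completion, shared by all of its tokens.
    """
    total, n_tokens = 0.0, 0
    for new, old, adv in zip(logps_new, logps_old, advantages):
        for lp_new, lp_old in zip(new, old):
            total += clipped_token_term(lp_new, lp_old, adv, epsilon)
            n_tokens += 1
    return -total / n_tokens

# A ratio of 1.5 with a positive advantage is capped at 1 + epsilon.
print(clipped_token_term(math.log(1.5), 0.0, advantage=1.0))
# With a negative advantage the unclipped (more pessimistic) term is kept.
print(clipped_token_term(math.log(1.5), 0.0, advantage=-1.0))
```

The asymmetry is the point: the objective never rewards moving the ratio beyond the clip range, but it still charges the full penalty when a move in the wrong direction overshoots.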

The KL penalty to the reference policy

The $\beta\,\mathbb{D}_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$ term keeps the policy near a frozen reference model — the same “safety belt” idea as RLHF. DeepSeekMath used an unbiased $k_3$ KL estimator added directly to the loss (not folded into the reward as in classic PPO-RLHF).
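For reference, the $k_3$ estimator is commonly written per token as $\frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1$, evaluated on tokens sampled from the current policy. The sketch below computes it from log-probabilities; the function name `k3_kl` is an assumption for illustration.

```python
import math

def k3_kl(logp_new, logp_ref):
    """Per-token k3 estimate of KL(pi_theta || pi_ref).

    logp_new: log pi_theta(token), logp_ref: log pi_ref(token),
    for a token sampled from pi_theta.
    """
    log_ratio = logp_ref - logp_new          # log(pi_ref / pi_theta)
    return math.exp(log_ratio) - log_ratio - 1.0

print(k3_kl(-1.0, -1.0))  # identical policies on this token: 0.0
print(k3_kl(-1.0, -2.0))  # reference assigns lower probability: positive
```

Because $e^x - x - 1 \ge 0$ for every $x$, each per-token value is non-negative, which is one reason this estimator is convenient to add directly to a loss.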

Outcome vs process supervision

Outcome supervision (the common case): one reward for the whole answer; every token in $o_i$ gets the same advantage. Cheap, and all you need with a final-answer checker.
Process supervision: a process reward model scores each reasoning step; the advantage at token $t$ is the normalized sum of step rewards from $t$ onward. Denser signal, better credit assignment on long chains — but you need step-level labels or a PRM.
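The process-supervised advantage can be sketched as follows, assuming each completion comes with a list of per-step rewards. Step rewards are normalized with the mean and std over every step in the group, and each step's advantage is the sum of normalized rewards from that step onward. Tokens inside a step would all share that step's value. The function name and `eps` are illustrative assumptions.

```python
from statistics import fmean, pstdev

def process_advantages(step_rewards, eps=1e-4):
    """step_rewards: one list of per-step rewards per completion in the group.

    Returns, per completion, one advantage per step: the suffix sum of
    group-normalized step rewards.
    """
    flat = [r for steps in step_rewards for r in steps]
    mean, std = fmean(flat), pstdev(flat)
    result = []
    for steps in step_rewards:
        normed = [(r - mean) / (std + eps) for r in steps]
        suffix, running = [], 0.0
        for value in reversed(normed):   # accumulate from the last step backwards
            running += value
            suffix.append(running)
        result.append(suffix[::-1])
    return result

# Completion 0 gets both steps right; completion 1 fails its second step.
print(process_advantages([[1, 1], [1, 0]]))
```

Note how the early step of the second completion also receives a negative advantage, since it leads into a failed step.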

Known biases and failure modes

GRPO works, but its exact normalization introduces subtle optimization biases that 2025 papers diagnosed and fixed.

Length bias and difficulty bias

Length bias. Vanilla GRPO, as formulated in DeepSeekMath, divides each response’s summed token loss by its own length $|o_i|$, so a long wrong answer is penalized less per token than a short wrong one, and the policy learns to ramble when it is wrong. Dr. GRPO (“Understanding R1-Zero-Like Training”) traced this to the per-response length normalization and removes it, along with the std scaling.
Difficulty bias. Dividing by $\operatorname{std}(\mathbf{r})$ over-weights questions that are very easy or very hard (low-variance groups) and under-weights medium-difficulty ones, distorting the curriculum. TRL exposes scale_rewards=False to disable std scaling for exactly this reason.
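One way to see the difficulty effect is to compare the average advantage magnitude of a lopsided group with that of a balanced one, with and without std scaling. The sketch below is illustrative only; the helper name, the group compositions and `eps` are assumptions.

```python
from statistics import fmean, pstdev

def mean_abs_advantage(rewards, scale_by_std, eps=1e-4):
    """Average |advantage| within one group, with or without std scaling."""
    mean = fmean(rewards)
    denom = (pstdev(rewards) + eps) if scale_by_std else 1.0
    return fmean([abs(r - mean) / denom for r in rewards])

hard = [1, 0, 0, 0, 0, 0, 0, 0]    # 1 of 8 correct: low-variance group
medium = [1, 1, 1, 1, 0, 0, 0, 0]  # 4 of 8 correct: high-variance group

for scale in (False, True):
    ratio = mean_abs_advantage(hard, scale) / mean_abs_advantage(medium, scale)
    print(f"std scaling={scale}: hard/medium weight ratio = {ratio:.2f}")
```

In this toy setup the lopsided group's weight relative to the balanced one rises from roughly 0.44 to roughly 0.66 once std scaling is switched on, which is the over-weighting of low-variance groups described above.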

Token-level vs sequence-level aggregation

How you average the loss matters. Averaging per response (then across responses) gives every response the same total weight, so each token in a long answer counts for less than a token in a short one; DAPO’s token-level loss averages over all tokens in the batch so every token counts equally regardless of which answer it came from. At the other extreme, GSPO argues the per-token importance ratio is itself the problem (for long sequences the product of token ratios is high-variance) and moves clipping/weighting to the sequence level, which the Qwen team reports stabilizes training, particularly for Mixture-of-Experts models.
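To make the two aggregation schemes concrete, this sketch computes the weight a single token receives in the final loss under each one, given only the response lengths. The function names are hypothetical.

```python
def per_response_token_weights(lengths):
    """Weight of one token in each response when averaging within each
    response first, then across the G responses."""
    g = len(lengths)
    return [1.0 / (g * n) for n in lengths]

def token_level_weights(lengths):
    """Weight of one token in each response when averaging over all
    tokens in the group at once."""
    total = sum(lengths)
    return [1.0 / total for _ in lengths]

lengths = [2, 8]  # a short answer and a long answer
print(per_response_token_weights(lengths))
print(token_level_weights(lengths))
```

With lengths 2 and 8, per-response averaging gives a token of the short answer weight 0.25 and a token of the long answer weight 0.0625, while token-level averaging gives every token 0.1.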

Go deeper: why MoE breaks token-level GRPO

In a Mixture-of-Experts model, routing can send the same token to different experts between the old and new policy, making per-token importance ratios $r_{i,t}$ wildly unstable: they no longer cleanly measure “how much did this token’s probability change.” GSPO defines a single sequence-level ratio (the geometric mean of token ratios) and clips on that, which the Qwen team reports was necessary to scale RL on Qwen3’s MoE checkpoints.
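The sequence-level ratio is easy to compute from per-token log-probabilities, since the geometric mean of the token ratios equals the exponential of the mean log-ratio. A minimal sketch, with hypothetical function names:

```python
import math

def token_ratios(logps_new, logps_old):
    """Per-token importance ratios pi_new / pi_old."""
    return [math.exp(n - o) for n, o in zip(logps_new, logps_old)]

def sequence_ratio(logps_new, logps_old):
    """Geometric mean of the token ratios = exp(mean of log-ratios)."""
    diffs = [n - o for n, o in zip(logps_new, logps_old)]
    return math.exp(sum(diffs) / len(diffs))

new = [-1.0, -2.0]
old = [-1.5, -1.5]
print(token_ratios(new, old))    # individual tokens swing well above and below 1
print(sequence_ratio(new, old))  # the sequence as a whole has barely moved
```

In this toy case the token ratios move in opposite directions and could each hit a token-level clip range, while the single sequence-level ratio stays at 1.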

GRPO variants (2025)

Each major variant maps to a specific bias in vanilla GRPO:

Variant	Core change	Bias / problem it fixes
Dr. GRPO	Remove length normalization & std scaling	Length-inflation bias; question-difficulty bias → better token efficiency
DAPO	Clip-Higher, dynamic sampling, token-level loss, overlong-reward shaping	Entropy collapse, wasted all-correct/all-wrong groups, length bias
GSPO	Sequence-level importance ratio & clipping	High-variance token ratios; instability in MoE / long sequences
GMPO	Geometric-mean (not arithmetic-mean) token reward	Sensitivity to outlier tokens
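Two of the DAPO changes in the table are small enough to sketch directly. The code below is an illustration rather than DAPO's implementation: the function names are hypothetical, the default clip bounds of 0.2 and 0.28 follow the values reported in the DAPO paper and should be treated as a starting point, and the filter shows only the "drop uninformative groups" half of dynamic sampling (the full recipe keeps sampling until the batch is refilled).

```python
def clip_higher(ratio, eps_low=0.2, eps_high=0.28):
    """Asymmetric clip: a wider upper bound leaves more room to raise
    the probability of low-probability tokens."""
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)

def keep_informative_groups(groups):
    """Drop groups whose rewards are all identical (all correct or all
    wrong), since their group-relative advantages are zero."""
    return [g for g in groups if len(set(g)) > 1]

print(clip_higher(1.5))   # capped at the upper bound
print(clip_higher(0.5))   # capped at the lower bound
print(keep_informative_groups([[1, 1, 1], [1, 0, 1], [0, 0, 0]]))
```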

Implementing GRPO in practice

The fastest path is Hugging Face TRL’s GRPOTrainer. You provide a model, a dataset of prompts, and one or more reward functions (plain Python that returns a score per completion).

Write reward functions

A reward function takes the generated completions and returns a list of floats. For RLVR this is a rule-based checker — e.g. parse the boxed answer and compare to ground truth, or run a format regex. You can pass several (correctness + format) and weight them with reward_weights.

Set the group and trust-region hyperparameters

The key knobs: num_generations (the group size $G$, e.g. 8–16), epsilon (clip range), and beta (KL weight, which defaults to 0.0). Larger $G$ gives a lower-variance baseline at higher rollout cost. Use scale_rewards=False to adopt the Dr. GRPO fix for difficulty bias.
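As a rough sketch, the knobs above could be collected and passed to TRL like this. The values are illustrative rather than tuned, `GRPO_KWARGS` and `build_trainer` are hypothetical names, the output path is a placeholder, and accepted values and defaults may differ between TRL versions, so check the documentation for the release you have installed.

```python
# Keyword arguments intended for trl.GRPOConfig; values are illustrative.
GRPO_KWARGS = dict(
    output_dir="grpo-sketch",  # placeholder output path
    num_generations=8,         # group size G
    epsilon=0.2,               # clip range for the surrogate
    beta=0.0,                  # KL weight; 0.0 turns the KL term off
    scale_rewards=False,       # skip std scaling (the Dr. GRPO fix)
)

def build_trainer(model, train_dataset, reward_funcs):
    """Build a GRPOTrainer from the settings above.

    TRL is imported lazily so the settings can be inspected without it.
    """
    from trl import GRPOConfig, GRPOTrainer

    return GRPOTrainer(
        model=model,
        reward_funcs=reward_funcs,
        args=GRPOConfig(**GRPO_KWARGS),
        train_dataset=train_dataset,
    )
```

Keeping the settings in a plain dictionary makes it easy to log them alongside a run or to sweep one knob at a time.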

Accelerate rollouts with vLLM

Generation is the bottleneck in any on-policy method. TRL integrates vLLM (use_vllm=True) in colocate or server mode to make sampling the group fast — often the difference between hours and days. TRL also applies truncated importance sampling to correct the train/inference engine mismatch.

Watch the right metrics

Track reward, reward_std, frac_reward_zero_std (prompts where the whole group agreed — wasted signal), kl (only if β > 0), and clip_ratio/* (how often updates hit the trust region). Rising response length with flat accuracy is the classic length-bias warning sign.
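If you want to sanity-check these numbers outside the trainer, the group-level ones are simple to recompute from raw rewards. This sketch reuses the metric names from the text as dictionary keys; it is an illustration of what they mean, not a description of how TRL computes them internally.

```python
from statistics import fmean, pstdev

def group_reward_stats(groups):
    """groups: one list of rewards per prompt (one entry per sampled answer)."""
    stds = [pstdev(g) for g in groups]
    return {
        "reward": fmean([r for g in groups for r in g]),
        "reward_std": fmean(stds),
        # Share of prompts whose whole group got the same reward.
        "frac_reward_zero_std": sum(1 for s in stds if s == 0) / len(groups),
    }

stats = group_reward_stats([[1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]])
print(stats)
```

A value of frac_reward_zero_std close to 1 means most prompts are either too easy or too hard for the current policy and contribute no gradient.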

Go deeper: a minimal reward function

A correctness reward for math, in spirit, looks like:

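# Assumes completions are plain strings and that extract_boxed is a helper you
# supply, returning the contents of the final \boxed{...} (or None if absent).
def correctness_reward(completions, ground_truth, **kwargs):
    scores = []
    for c, gt in zip(completions, ground_truth):
        # 1.0 for an exact match with the reference answer, else 0.0
        scores.append(1.0 if extract_boxed(c) == gt else 0.0)
    return scores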

Pair it with a format reward (e.g. “uses <think>…</think> then a boxed answer”) and let GRPO normalize both within each group. The verifier is your reward model — see RLVR.
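A format reward in the same spirit might look like the sketch below. It assumes completions are plain strings, and the regular expression is only one possible reading of "a think block followed by a boxed answer"; it does not handle nested braces inside the box. The names `FORMAT_RE` and `format_reward` are illustrative.

```python
import re

# A <think>...</think> block at the start, followed somewhere by \boxed{...}.
FORMAT_RE = re.compile(r"\s*<think>.+?</think>.*\\boxed\{[^{}]+\}", re.DOTALL)

def format_reward(completions, **kwargs):
    """1.0 if a completion follows the think-then-boxed format, else 0.0."""
    return [1.0 if FORMAT_RE.match(c) else 0.0 for c in completions]

samples = [
    "<think>2 + 2 = 4</think> The answer is \\boxed{4}.",
    "The answer is \\boxed{4}.",            # no reasoning block
    "<think>hmm</think> I think it is 4.",  # no boxed answer
]
print(format_reward(samples))
```

Keeping format and correctness as separate functions lets you weight them independently and see in the logs which of the two the policy is actually improving on.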

When to use GRPO vs PPO vs RLOO

Use…	When
GRPO (or DAPO)	Verifiable rewards (math, code, format), reasoning models, limited GPU memory — the default for RLVR.
GSPO	Same as GRPO but training a Mixture-of-Experts model or seeing token-ratio instability on long outputs.
PPO	You want dense per-token value estimates and maximal control, have memory to spare, or are doing classic RLHF with a learned reward model.
RLOO	A lean REINFORCE-with-baseline alternative; baselines each sample against the leave-one-out mean of its group — close cousin of GRPO, sometimes simpler.
DPO	You have static preference pairs and want to skip online sampling entirely.

GRPO, RLOO and DAPO all sit in the same family: on-policy, critic-free, group-baselined policy gradients. They differ mainly in how the baseline and the loss are normalized. PPO is the heavyweight with a critic; DPO leaves the RL loop behind altogether.
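The difference between the GRPO and RLOO baselines is easiest to see side by side. This sketch compares the plain group-mean baseline (without GRPO's std scaling) against the leave-one-out mean; the function names are hypothetical.

```python
def group_mean_advantages(rewards):
    """Each reward minus the mean of the whole group (GRPO-style baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def leave_one_out_advantages(rewards):
    """Each reward minus the mean of the *other* samples (RLOO-style baseline)."""
    total, n = sum(rewards), len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

rewards = [1, 1, 0, 0]
print(group_mean_advantages(rewards))
print(leave_one_out_advantages(rewards))
```

For a group of size $n$ the leave-one-out advantage is exactly $\frac{n}{n-1}$ times the group-mean advantage, so the two baselines differ only by a constant scale until normalization choices such as std scaling enter.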

Building the environments, verifiers and reward pipelines that GRPO depends on is increasingly its own discipline — see the RL environment and eval providers and our RL environments page.

Researcher takes

Nathan Lambert summarizes the now-famous length-normalization bias in DeepSeek’s GRPO (post on X).

Max Rumpf offers a contrarian framing, arguing GRPO’s length bias might be intentional rather than a defect (post on X).

Frequently asked questions

Is GRPO just PPO without the critic?

Essentially yes — that’s the cleanest one-line description. GRPO keeps PPO’s clipped surrogate objective and KL idea but replaces the learned value function with a baseline computed from a group of sampled answers. The practical consequences (no critic to train, ~40% less memory, natural fit for sparse verifiable rewards) are big enough that it behaves like a different recipe.

Does GRPO need a reward model?

No. Its signature use is with rule-based / verifiable rewards (RLVR) — a Python function that checks correctness or format — which avoids training and hacking a learned reward model. You can plug a learned reward model in for open-ended tasks, but the no-RM verifiable setup is what made GRPO famous via DeepSeek-R1.

Why do people set β = 0 (no KL penalty)?

For verifiable-reward reasoning you often want the policy to move far from the base model, and several 2025 studies (Open-Reasoner-Zero, Dr. GRPO, DAPO) found the KL term unnecessary or counter-productive there. TRL now defaults to beta=0.0. Keep KL when staying close to a tuned reference (style, safety) matters.

What’s the difference between GRPO and Dr. GRPO / DAPO / GSPO?

They’re all GRPO descendants fixing a specific bias. Dr. GRPO removes length/std normalization (length & difficulty bias). DAPO adds clip-higher, dynamic sampling and a token-level loss (entropy collapse, length, wasted groups). GSPO moves importance ratios to the sequence level (instability on long sequences and MoE models). See the variants table above.

Key papers

DeepSeekMath — Shao et al., 2024 — introduces GRPO.
DeepSeek-R1 — DeepSeek-AI, 2025 — GRPO at scale; R1-Zero’s pure-RL reasoning (Nature version).
Understanding R1-Zero-Like Training (Dr. GRPO) — Liu et al., 2025 — diagnoses length/difficulty bias.
DAPO — Yu et al., 2025 — open large-scale recipe, 50 on AIME 2024.
GSPO — Zheng et al., 2025 — sequence-level optimization for MoE RL.
A Technical Survey of RL for LLMs — 2025 — situates GRPO among PPO and newer methods.
