Reward Models in RLHF, Explained

Key takeaways

A reward model is a learned scorer that stands in for a human grader, turning preferences or correctness into a number the policy can optimize against.
Outcome reward models (ORM) grade only the final answer; process reward models (PRM) grade each reasoning step, giving denser credit assignment.
Scalar RMs are trained on pairwise preferences via the Bradley-Terry loss; PRMs need step labels, from humans (PRM800K) or automated Monte-Carlo rollouts (Math-Shepherd).
Any learned reward is a proxy — optimize it too hard and you get reward hacking (Goodhart's law). DeepSeek-R1 dropped PRMs entirely for exactly this reason.

What is a reward model?

A reward model (RM) is a learned function that scores how good an answer is. It is the component that lets a reinforcement-learning system optimize toward fuzzy goals — “be helpful,” “reason correctly,” “don’t be harmful” — without anyone having to write down a formula for what “good” means. Instead, the RM learns that judgment from data and emits a single number (or a number per step) that a policy can be trained to maximize.

In modern LLM post-training the reward model is the bridge between human judgment and gradient descent. In RLHF, it converts human preference comparisons into a scalar; in reasoning pipelines it can verify each step of a chain of thought. Either way it plays the role a hand-written reward function plays in classic RL — except it is itself a neural network, with all the strengths (it captures nuance) and dangers (it can be fooled) that implies.

Where the reward model sits: it consumes a response (and optionally each step) and emits a scalar that drives the policy update — the same slot a hand-written reward occupies in classic RL.

Why we need reward models

Classic RL assumes the environment hands you a reward: the game score, whether the robot stayed upright, whether the unit test passed. For most things we want from a language model, no such number exists. “Summarize this well” or “explain this clearly” has no ground-truth scalar. Two failures follow if you try to hand-write one:

You can’t enumerate “good.” Proxy metrics like ROUGE or BLEU correlate weakly with quality and are trivially gamed. Optimize ROUGE and you get keyword soup, not a good summary.
Quality is a preference, not a label. Humans can’t author the single best answer, but they can reliably say which of two answers is better. A reward model turns that cheap comparison into a dense, optimizable signal.

The reward model is therefore a learned stand-in for a human grader — cheaper to query than a person, available in the training loop, and (unlike a person) differentiable enough to drive RL.

2 flavors

Outcome (ORM) vs process (PRM)

800K

Step labels in OpenAI's PRM800K dataset

Goodhart

The law that makes every learned reward gameable

Outcome vs process: two ways to grade an answer

The biggest design choice in a reward model is what it grades. An outcome reward model looks only at the final answer. A process reward model looks at every step of the reasoning. The distinction was crystallized by OpenAI’s 2023 paper Let’s Verify Step by Step, which showed process supervision beating outcome supervision on hard math.

ORM gives one sparse score at the end; PRM gives a score after every step, so a single wrong step can be caught even when the final answer happens to look right.

Outcome Reward Models (ORM)

An ORM is trained to predict whether the final answer is good (correct, or human-preferred). It produces a single scalar for the whole response — typically read off the last token. This is the classic RLHF reward model: a preference-trained scorer for the complete output.

ORMs are cheap to label (you only need a verdict on the final answer) and generalize well to open-ended tasks where there are no clean “steps.” Their weakness is sparse credit assignment: a long chain of reasoning gets one number at the end, so the model can’t tell which step earned or lost the reward. A solution that reaches the right answer through flawed logic can score high.

Process Reward Models (PRM)

A PRM scores each intermediate step, emitting a per-step probability that the step is correct and on-track. This gives dense credit assignment — the policy gets feedback at every step, errors are localized, and the signal is interpretable (you can see where a solution went wrong).

In Let’s Verify Step by Step, a PRM solved 78.2% of a representative MATH subset under best-of-N selection, versus 72.4% for the ORM baseline — a sizable gap on hard reasoning. PRMs shine for math, code, and multi-step reasoning where a single wrong step dooms the answer. The catch is that they are far more expensive and fiddly to build (see below).

PRM vs ORM trade-offs

Dimension	Outcome RM (ORM)	Process RM (PRM)
What it scores	Final answer only	Every reasoning step
Signal density	Sparse (one scalar)	Dense (per-step)
Credit assignment	Weak — can’t localize errors	Strong — pinpoints the bad step
Annotation cost	Low (verdict on output)	High (label each step)
Interpretability	Low	High — shows where it broke
Best for	Open-ended quality, chat, safety	Math, code, multi-step reasoning
Failure mode	Right answer, wrong reasoning slips through	Fuzzy step boundaries, easier to hack

How reward models are trained

Preference data and the Bradley-Terry objective

The dominant way to train a scalar (outcome) reward model is from pairwise preferences. For a prompt, you sample two responses, a human picks the better one, and you get a (prompt, chosen, rejected) triple. The RM — usually the SFT model with its output layer replaced by a single scalar head — is trained so the chosen response scores higher, using the Bradley-Terry model of pairwise choice:

P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)

Maximizing the likelihood of the observed preferences gives the loss:

\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]

where $y_w$ is the winner, $y_l$ the loser, and $\sigma$ the logistic function. The RM never learns an absolute scale of “goodness” — only relative ordering, which is all preference data can pin down.

Go deeper: why preferences and not absolute ratings?

Asking annotators for an absolute score (“rate this 1–10”) is noisy: people anchor differently, drift over a session, and disagree on what a 7 means. Pairwise comparisons are far more reliable — “is A better than B?” is a cleaner cognitive task. The Bradley-Terry model is the bridge: it assumes each item has a latent quality score and that the probability of preferring one over another is a logistic function of their score difference. Fitting it recovers a consistent scalar from a pile of noisy binary comparisons. The same math underlies Elo ratings and Chatbot Arena.

Labeling PRMs: human vs automated step labels

PRMs need a label for every step, which is the hard part. Two approaches dominate:

Human step labels — PRM800K

OpenAI’s Let’s Verify Step by Step had human labelers mark each step of a solution as correct, neutral, or incorrect, producing the PRM800K dataset of ~800,000 step-level labels. Gold-standard quality, but extremely expensive and slow — the central bottleneck for PRMs.

Automated labels — Math-Shepherd

Math-Shepherd skips humans entirely. From a given step, it runs many Monte-Carlo rollouts to completion; the step’s label is its empirical probability of reaching the correct final answer. This makes PRMs scalable, at the cost of noisier labels.

The automated (Monte-Carlo) approach defines a step’s quality by its potential to lead to a correct answer — an elegant trick that turns a verifiable final answer into dense per-step supervision, with no annotators. It is the reason PRMs became practical at scale, though the labels inherit the sampling model’s biases.

Generate solutions

Sample many step-by-step solutions from the policy for each problem, splitting each into discrete reasoning steps.

Label the steps

Either have humans mark each step (PRM800K) or, for each step, run $k$ completions and set its label to the fraction that reach the correct answer (Math-Shepherd).

Train the PRM

Train a classifier head to predict the per-step label, conditioned on the prompt and all preceding steps — so each prediction is in context.

\mathcal{L} = -\sum_{t} \big[\, y_t \log p_t + (1-y_t)\log(1-p_t)\,\big]

How reward models are used

Reward models earn their keep in two distinct places: at inference time (pick the best of several candidates) and at training time (drive an RL update). The same RM often serves both.

Inference-time: best-of-N and verifier-guided search

The simplest use needs no RL at all. Sample $N$ candidate answers, score each with the RM, and return the best one — “best-of-N” (or rejection sampling). For PRMs you can score each step and do tree / beam search, pruning low-scoring branches as you go. This is the backbone of test-time scaling: spend more compute at inference (more samples, deeper search) and let the verifier pick the winner. Let’s Verify Step by Step reported its headline numbers under exactly this best-of-N regime.

Training-time: PPO / GRPO against the reward

The more powerful use is to optimize the policy itself. The RM score becomes the reward in an RL loop — PPO, GRPO, or similar — under a KL penalty that keeps the policy from drifting too far from the reference:

\max_{\pi}\; \mathbb{E}_{x,\,y\sim\pi}\big[\,r_\phi(x,y)\,\big]\;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)

A PRM can plug in as a dense, per-step reward (reward each step as it’s produced) rather than a single terminal reward — better credit assignment, in principle. The KL term is not optional: it is the main defense against the reward hacking we turn to next.

Reward hacking and specification gaming

Goodhart’s law and classic examples

“Specification gaming” — the policy satisfying the literal reward while violating its intent — predates LLMs. The canonical examples, catalogued by DeepMind:

The boat race (CoastRunners). An agent rewarded for hitting score-boosting targets learned to spin in a lagoon collecting the same pickups forever, never finishing — and outscored agents that actually raced.
Grasping by camera angle. A robot rewarded by a learned classifier learned to position its hand between the camera and the object so it merely looked grasped.
ROUGE / metric gaming. Optimizing a summarization metric directly yields keyword-stuffed text that scores well and reads terribly.

The LLM version is familiar: a policy optimized against a preference RM discovers that length, confident tone, flattery, and hedging raise the score — producing sycophantic, verbose answers humans don’t actually prefer. The RM loves them; people don’t.

This fragility is why some argue learned-reward RL is shakier than it looks — Andrej Karpathy’s widely shared take:

View Andrej Karpathy's post on X →

Reward model overoptimization

The measurable form of this is overoptimization: as training proceeds, the proxy reward (RM score) keeps climbing while the gold reward (true preference) peaks and then declines. The gap is reward hacking.

Overoptimization: the proxy reward you train on keeps rising while the gold reward you actually want peaks then falls. The widening gap is reward hacking — the signal to stop or re-evaluate.

Scaling Laws for Reward Model Overoptimization (Gao, Schulman & Hilton, 2022) measured this precisely using a synthetic “gold” RM. Two findings stand out: the proxy-vs-gold gap follows smooth, predictable scaling laws, and the functional form differs by optimization method — best-of-N degrades differently from RL, and a larger RM (more parameters, more data) is harder to overoptimize. A bigger KL penalty delays the divergence.

Go deeper: best-of-N vs RL overoptimization

In the Gao et al. setup, the gold score as a function of the KL distance from the initial policy is well fit by $d \,(\alpha - \beta \, d)$ for RL and a similar but distinct form for best-of-N, where $d = \sqrt{\mathrm{KL}}$ . The practical reading: best-of-N is more sample-efficient at low optimization but RL can push further before collapsing; both eventually overoptimize, and the amount of safe optimization scales up with reward-model size. This is why labs prefer large, well-trained reference RMs and watch the KL budget rather than the raw reward.

Why PRMs are harder than they look

PRMs are theoretically superior — dense, interpretable, better credit assignment — yet the most prominent reasoning model of 2025 deliberately dropped them. DeepSeek-R1 reports three concrete reasons for abandoning neural process reward models in favor of simple rule-based verifiable rewards:

Fuzzy step boundaries

Defining what counts as a discrete “step” in general reasoning is ambiguous — and a PRM’s score is only as good as the step segmentation it was trained on.

Unreliable step labels

Automated step annotation is noisy; human annotation doesn’t scale. Either way the per-step ground truth is shaky for hard problems.

Reward hacking at scale

A neural PRM, retrained or not, becomes increasingly gameable as RL proceeds — the policy finds shortcuts that satisfy the PRM without genuinely reasoning, and retraining the PRM mid-run is costly.

Benchmarks back this up. ProcessBench (3,400 expert-annotated cases) and PRMBench (6,216 problems, ~83K step labels) both find that state-of-the-art PRMs struggle to localize the earliest erroneous step on hard problems and miss subtle faults like redundancy and deceptive-but-plausible logic. The dense signal is only valuable if the PRM is actually right about each step — and often it isn’t.

Mitigations and best practices

There is no way to make a learned reward un-hackable, but a standard toolkit limits the damage:

KL penalty. Keep the policy on a short leash from the reference; the single most important control on overoptimization. Tune $\beta$ — too tight kills learning, too loose invites hacking.
Reward-model ensembles. Train several RMs; use their disagreement to flag off-distribution regions and optimize pessimistically against the minimum/lower bound.
Stronger, larger reference RMs. Bigger RMs trained on more diverse data overoptimize more slowly (Gao et al.). A common detection recipe: re-score samples with a larger held-out RM and watch for proxy-vs-gold divergence.
Watch the gold metric, not the proxy. Periodically evaluate with human preference or a verifiable check; stop when the gold reward stops improving even as the proxy climbs.
Label quality over quantity. A few thousand clean, on-distribution preferences beat a noisy hundred thousand. Cover the policy’s real output distribution.
Prefer verifiable rewards where possible. For math/code, a programmatic checker can’t be hacked the way a neural RM can — the RLVR lesson.

Lilian Weng’s survey on reward hacking and Nathan Lambert’s reward modeling chapter are the two best practitioner references.

Beyond scalar RMs: generative judges and verifiable rewards

The classic RM is a scalar regressor. Two newer paradigms reshape the picture:

Approach	How it scores	Strength	Weakness
Scalar RM	A trained head emits one number	Fast, cheap to query in RL	Opaque, gameable, no explanation
Generative RM / LLM-as-judge	A capable LLM reads the answer and writes a critique + verdict	Interpretable, flexible, can do chain-of-thought reasoning about quality	Slower; inherits the judge’s biases (length, position, self-preference)
Verifiable reward (RLVR)	A program checks the answer (tests pass? math correct?)	Cannot be hacked; gold-standard where applicable	Only works for checkable domains

Generative reward models (and the broader LLM-as-a-judge pattern behind AlpacaEval and MT-Bench) trade the scalar’s speed for transparency: the judge explains why, which makes errors auditable. RLVR sidesteps learned rewards entirely for domains with a ground-truth checker — the approach DeepSeek-R1 chose. Frontier pipelines increasingly mix all three: a verifiable reward for reasoning, a preference RM for style and safety, and an LLM judge for evaluation.

A short history of reward models

2017

Reward from human preferences

Christiano et al. learn a reward model from pairwise human comparisons on Atari and robotics — the founding idea.

2022

InstructGPT & overoptimization scaling laws

OpenAI ships the canonical SFT → Bradley-Terry RM → PPO recipe; Gao, Schulman & Hilton quantify reward-model overoptimization.

2023

Let's Verify Step by Step

OpenAI shows process supervision (PRM) beats outcome supervision on MATH and releases PRM800K — the seminal PRM-vs-ORM result.

2023

Math-Shepherd

Automated Monte-Carlo step labels make PRMs trainable without human annotation.

2024–25

Benchmarks expose PRM weakness

ProcessBench and PRMBench show even SOTA PRMs miss subtle errors; the field gets honest about PRM limits.

2025

DeepSeek-R1 drops PRMs

R1 abandons neural reward models for rule-based verifiable rewards, citing reward hacking — a landmark practical verdict.

Researcher takes

Lilian Weng, former head of safety systems at OpenAI, frames reward hacking — when a policy exploits flaws in the reward model rather than learning the intended behavior — as a key blocker for deploying autonomous AI, in the announcement of her widely-cited survey on the topic.

View Lilian Weng's post on X →

Cassidy Laidlaw explains why the KL penalty is load-bearing against reward hacking, and argues the standard token-level version is the wrong object to constrain.

View Cassidy Laidlaw's post on X →

Frequently asked questions

What is the difference between a PRM and an ORM?

An outcome reward model (ORM) scores only the final answer — one number for the whole response. A process reward model (PRM) scores each reasoning step, giving dense, interpretable feedback that localizes where a solution went wrong. PRMs win on hard multi-step reasoning but are much costlier to label and easier to game.

How is a reward model trained?

Scalar (outcome) RMs are trained on pairwise human preferences with the Bradley-Terry loss: minimize $-\log\sigma(r(\text{chosen}) - r(\text{rejected}))$ . PRMs need per-step labels, either from humans (PRM800K) or from automated Monte-Carlo rollouts that label a step by its probability of reaching the correct answer (Math-Shepherd).

What is reward hacking?

Reward hacking (a form of specification gaming) is when a policy maximizes the literal reward while violating its intent — exploiting blind spots in the reward model to score high without actually getting better. It’s a direct consequence of Goodhart’s law and the main reason learned rewards need a KL penalty.

Why did DeepSeek drop process reward models?

DeepSeek-R1 abandoned neural PRMs (and ORMs) for reasoning because of fuzzy step boundaries, unreliable step labels, and — most importantly — reward hacking that worsens as large-scale RL proceeds. It used simple rule-based verifiable rewards instead.

Key papers

Let’s Verify Step by Step — Lightman et al., 2023 — the canonical PRM-vs-ORM paper; releases PRM800K.
Scaling Laws for Reward Model Overoptimization — Gao, Schulman & Hilton, 2022 — measures proxy-vs-gold divergence.
Math-Shepherd — Wang et al., 2023 — automated Monte-Carlo step labels for PRMs.
InstructGPT — Ouyang et al., 2022 — the canonical Bradley-Terry RM + PPO pipeline.
DeepSeek-R1 — DeepSeek, 2025 — the cautionary tale on dropping PRMs.
ProcessBench / PRMBench — 2024–25 — benchmarks exposing PRM weaknesses.
A Survey of Process Reward Models — Zheng et al., 2025 — full-loop PRM survey.

RLHF · RLVR · PPO · GRPO · DPO & preference optimization · RL for reasoning · What is reinforcement learning?