- A reward model is a learned scorer that stands in for a human grader, turning preferences or correctness into a number the policy can optimize against.
- Outcome reward models (ORM) grade only the final answer; process reward models (PRM) grade each reasoning step, giving denser credit assignment.
- Scalar RMs are trained on pairwise preferences via the Bradley-Terry loss; PRMs need step labels, from humans (PRM800K) or automated Monte-Carlo rollouts (Math-Shepherd).
- Any learned reward is a proxy: optimize it too hard and you get reward hacking (Goodhart's law). DeepSeek-R1 dropped neural PRMs in large part for this reason.
What is a reward model?
A reward model (RM) is a learned function that scores how good an answer is. It is the component that lets a reinforcement-learning system optimize toward fuzzy goals — “be helpful,” “reason correctly,” “don’t be harmful” — without anyone having to write down a formula for what “good” means. Instead, the RM learns that judgment from data and emits a single number (or a number per step) that a policy can be trained to maximize.
In modern LLM post-training the reward model is the bridge between human judgment and gradient descent. In RLHF, it converts human preference comparisons into a scalar; in reasoning pipelines it can verify each step of a chain of thought. Either way it plays the role a hand-written reward function plays in classic RL — except it is itself a neural network, with all the strengths (it captures nuance) and dangers (it can be fooled) that implies.
Why we need reward models
Classic RL assumes the environment hands you a reward: the game score, whether the robot stayed upright, whether the unit test passed. For most things we want from a language model, no such number exists. “Summarize this well” or “explain this clearly” has no ground-truth scalar. Two failures follow if you try to hand-write one:
- You can’t enumerate “good.” Proxy metrics like ROUGE or BLEU correlate weakly with quality and are trivially gamed. Optimize ROUGE and you get keyword soup, not a good summary.
- Quality is a preference, not a label. Humans can’t author the single best answer, but they can reliably say which of two answers is better. A reward model turns that cheap comparison into a dense, optimizable signal.
The reward model is therefore a learned stand-in for a human grader: cheaper to query than a person, available inside the training loop, and (unlike a person) fast and consistent enough to score every sample an RL run produces.
Outcome vs process: two ways to grade an answer
The biggest design choice in a reward model is what it grades. An outcome reward model looks only at the final answer. A process reward model looks at every step of the reasoning. The distinction was crystallized by OpenAI’s 2023 paper Let’s Verify Step by Step, which showed process supervision beating outcome supervision on hard math.
Outcome Reward Models (ORM)
An ORM is trained to predict whether the final answer is good (correct, or human-preferred). It produces a single scalar for the whole response — typically read off the last token. This is the classic RLHF reward model: a preference-trained scorer for the complete output.
ORMs are cheap to label (you only need a verdict on the final answer) and generalize well to open-ended tasks where there are no clean “steps.” Their weakness is sparse credit assignment: a long chain of reasoning gets one number at the end, so the model can’t tell which step earned or lost the reward. A solution that reaches the right answer through flawed logic can score high.
Process Reward Models (PRM)
A PRM scores each intermediate step, emitting a per-step probability that the step is correct and on-track. This gives dense credit assignment — the policy gets feedback at every step, errors are localized, and the signal is interpretable (you can see where a solution went wrong).
In Let’s Verify Step by Step, a PRM solved 78.2% of a representative MATH subset under best-of-N selection, versus 72.4% for the ORM baseline — a sizable gap on hard reasoning. PRMs shine for math, code, and multi-step reasoning where a single wrong step dooms the answer. The catch is that they are far more expensive and fiddly to build (see below).
PRM vs ORM trade-offs
| Dimension | Outcome RM (ORM) | Process RM (PRM) |
|---|---|---|
| What it scores | Final answer only | Every reasoning step |
| Signal density | Sparse (one scalar) | Dense (per-step) |
| Credit assignment | Weak — can’t localize errors | Strong — pinpoints the bad step |
| Annotation cost | Low (verdict on output) | High (label each step) |
| Interpretability | Low | High — shows where it broke |
| Best for | Open-ended quality, chat, safety | Math, code, multi-step reasoning |
| Failure mode | Right answer, wrong reasoning slips through | Fuzzy step boundaries, easier to hack |
How reward models are trained
Preference data and the Bradley-Terry objective
The dominant way to train a scalar (outcome) reward model is from pairwise preferences. For a prompt, you sample two responses, a human picks the better one, and you get a (prompt, chosen, rejected) triple. The RM, usually the SFT model with its output layer replaced by a single scalar head, is trained so the chosen response scores higher, using the Bradley-Terry model of pairwise choice:
$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$
Maximizing the likelihood of the observed preferences gives the loss:
$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]$$
where $r_\theta(x, y)$ is the RM's score for response $y$ to prompt $x$, $y_w$ is the winner, $y_l$ the loser, and $\sigma$ the logistic function. The RM never learns an absolute scale of “goodness”, only relative ordering, which is all preference data can pin down.
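As a rough illustration of the pairwise loss described above, here is a minimal pure-Python sketch. It assumes the reward model has already produced one scalar per response; the function names and the example scores are made up for illustration and do not come from any particular library.

```python
import math

def neg_log_sigmoid(z: float) -> float:
    """Compute -log(sigmoid(z)) = log(1 + exp(-z)) in a numerically stable way."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def bradley_terry_loss(chosen_scores, rejected_scores) -> float:
    """Mean negative log-likelihood that each chosen response beats its rejected pair."""
    losses = [neg_log_sigmoid(c - r) for c, r in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)

# Hypothetical RM scores for three (chosen, rejected) pairs.
chosen = [2.0, 0.5, 1.0]
rejected = [1.0, 0.5, 3.0]
loss = bradley_terry_loss(chosen, rejected)

# Adding the same constant to every score leaves the loss unchanged,
# because only score differences enter the objective.
shifted_loss = bradley_terry_loss([c + 10 for c in chosen], [r + 10 for r in rejected])
```

The shift invariance in the last lines is the code-level version of the point that the RM learns only a relative ordering, not an absolute scale.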
Go deeper: why preferences and not absolute ratings?
Asking annotators for an absolute score (“rate this 1–10”) is noisy: people anchor differently, drift over a session, and disagree on what a 7 means. Pairwise comparisons are far more reliable — “is A better than B?” is a cleaner cognitive task. The Bradley-Terry model is the bridge: it assumes each item has a latent quality score and that the probability of preferring one over another is a logistic function of their score difference. Fitting it recovers a consistent scalar from a pile of noisy binary comparisons. The same math underlies Elo ratings and Chatbot Arena.
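To make the "recover a scalar from binary comparisons" idea concrete, the sketch below fits latent Bradley-Terry scores to a small set of pairwise outcomes by plain gradient ascent on the log-likelihood. The comparison counts are invented for illustration, and the optimizer settings (`steps`, `lr`) are arbitrary assumptions rather than recommended values.

```python
import math

def fit_bradley_terry(num_items, comparisons, steps=2000, lr=0.5):
    """Fit one latent score per item from (winner_index, loser_index) pairs."""
    scores = [0.0] * num_items
    for _ in range(steps):
        grad = [0.0] * num_items
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient of log(p_win) with respect to each score.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        scores = [s + lr * g / len(comparisons) for s, g in zip(scores, grad)]
        # Scores are identified only up to a constant shift, so center them.
        mean = sum(scores) / num_items
        scores = [s - mean for s in scores]
    return scores

# Made-up outcomes: item 0 beats item 1 in 8 of 10 comparisons,
# item 1 beats item 2 in 7 of 10, item 0 beats item 2 in 9 of 10.
comparisons = (
    [(0, 1)] * 8 + [(1, 0)] * 2
    + [(1, 2)] * 7 + [(2, 1)] * 3
    + [(0, 2)] * 9 + [(2, 0)] * 1
)
scores = fit_bradley_terry(3, comparisons)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
```

Even though individual comparisons disagree, the fitted scores give a single consistent ordering of the three items.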
Labeling PRMs: human vs automated step labels
PRMs need a label for every step, which is the hard part. Two approaches dominate:
- Human labels (PRM800K). OpenAI’s Let’s Verify Step by Step had human labelers mark each step of a solution as correct, neutral, or incorrect, producing the PRM800K dataset of ~800,000 step-level labels. Gold-standard quality, but extremely expensive and slow: the central bottleneck for PRMs.
- Automated labels (Math-Shepherd). Math-Shepherd skips humans entirely. From a given step, it runs many Monte-Carlo rollouts to completion; the step’s label is its empirical probability of reaching the correct final answer. This makes PRMs scalable, at the cost of noisier labels.
The automated (Monte-Carlo) approach defines a step’s quality by its potential to lead to a correct answer — an elegant trick that turns a verifiable final answer into dense per-step supervision, with no annotators. It is the reason PRMs became practical at scale, though the labels inherit the sampling model’s biases.
1. Sample many step-by-step solutions from the policy for each problem, splitting each into discrete reasoning steps.
2. Either have humans mark each step (PRM800K) or, for each step, run several completions from that step and set its label to the fraction that reach the correct answer (Math-Shepherd).
3. Train a classifier head to predict the per-step label, conditioned on the prompt and all preceding steps, so each prediction is made in context.
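The automated labeling step can be sketched roughly as follows. This is a toy under stated assumptions, not the Math-Shepherd implementation: `complete` stands in for sampling a continuation from a real policy, `toy_complete` is a hypothetical random stub, and the problem (2 + 3 * 4) and its steps are invented.

```python
import random

def monte_carlo_step_labels(steps, complete, is_correct, num_rollouts=8):
    """Label each step with the fraction of rollouts from that step that end correctly."""
    labels = []
    for end in range(1, len(steps) + 1):
        prefix = steps[:end]  # the prompt's solution so far, up to and including this step
        hits = sum(1 for _ in range(num_rollouts) if is_correct(complete(prefix)))
        labels.append(hits / num_rollouts)
    return labels

# Hypothetical stand-in for the policy: it usually recovers the right answer (14)
# unless the prefix already contains a wrong step.
rng = random.Random(0)

def toy_complete(prefix):
    p_correct = 0.1 if any("WRONG" in step for step in prefix) else 0.9
    return "14" if rng.random() < p_correct else "other"

steps = ["3 * 4 = 12", "2 + 12 = 15 (WRONG)", "so the answer is 15"]
labels = monte_carlo_step_labels(steps, toy_complete, lambda answer: answer == "14", num_rollouts=32)
```

The label for a step drops once a wrong step enters the prefix, which is how a checkable final answer becomes per-step supervision. The labels are only as good as the sampler: a weak completer marks recoverable steps as bad.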
How reward models are used
Reward models earn their keep in two distinct places: at inference time (pick the best of several candidates) and at training time (drive an RL update). The same RM often serves both.
Inference-time: best-of-N and verifier-guided search
The simplest use needs no RL at all. Sample N candidate answers, score each with the RM, and return the best one: “best-of-N” (or rejection sampling). For PRMs you can score each step and do tree or beam search, pruning low-scoring branches as you go. This is the backbone of test-time scaling: spend more compute at inference (more samples, deeper search) and let the verifier pick the winner. Let’s Verify Step by Step reported its headline numbers under exactly this best-of-N regime.
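A minimal sketch of best-of-N selection with a PRM might look like this. It assumes the PRM has already assigned a correctness probability to each step, and it scores a whole solution as the product of those probabilities (the aggregation used in Let’s Verify Step by Step; taking the minimum is another common choice). The candidate names and probabilities are invented.

```python
import math

def prm_solution_score(step_probs):
    """Score a solution as the probability that every step is correct."""
    return math.prod(step_probs)

def best_of_n(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

# Hypothetical per-step PRM probabilities for three sampled solutions.
candidates = {
    "solution_a": [0.95, 0.90, 0.92],
    "solution_b": [0.99, 0.40, 0.97],  # one doubtful step sinks the whole solution
    "solution_c": [0.85, 0.88, 0.80],
}
best = best_of_n(candidates, lambda name: prm_solution_score(candidates[name]))
```

With an ORM the same `best_of_n` applies unchanged; only the scoring function becomes a single number per response.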
Training-time: PPO / GRPO against the reward
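The more powerful use is to optimize the policy itself. The RM score becomes the reward in an RL loop (PPO, GRPO, or similar) under a KL penalty that keeps the policy from drifting too far from the reference:
$$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r_\theta(x, y)\big] - \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big]$$
where $r_\theta$ is the reward model, $\pi_{\mathrm{ref}}$ the reference policy, and $\beta$ the strength of the penalty.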
A PRM can plug in as a dense, per-step reward (reward each step as it’s produced) rather than a single terminal reward — better credit assignment, in principle. The KL term is not optional: it is the main defense against the reward hacking we turn to next.
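One common way to implement the KL-penalized reward is to charge a small penalty at every token and add the RM score on the final token. The sketch below assumes that convention, a sample-based KL estimate (policy log-probability minus reference log-probability per token), and made-up log-probabilities. `kl_shaped_rewards` is an illustrative name, not a library function.

```python
def kl_shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards: a KL penalty on every token, plus the RM score on the last one."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need equal-length, non-empty log-probability sequences")
    # log pi(y_t) - log pi_ref(y_t) is a single-sample estimate of the per-token KL.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score  # terminal (outcome) reward from the RM
    return rewards

# Hypothetical log-probabilities of a 3-token response under policy and reference.
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.5, -0.5, -1.0]
rewards = kl_shaped_rewards(2.0, policy_logprobs, ref_logprobs, beta=0.1)
total = sum(rewards)  # RM score minus beta times the summed log-ratio
```

A PRM would replace the single terminal `rm_score` with one reward per step, added at the token that closes each step.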
Reward hacking and specification gaming
Goodhart’s law and classic examples
“Specification gaming” — the policy satisfying the literal reward while violating its intent — predates LLMs. The canonical examples, catalogued by DeepMind:
- The boat race (CoastRunners). An agent rewarded for hitting score-boosting targets learned to spin in a lagoon collecting the same pickups forever, never finishing — and outscored agents that actually raced.
- Grasping by camera angle. A robot rewarded by a learned classifier learned to position its hand between the camera and the object so it merely looked grasped.
- ROUGE / metric gaming. Optimizing a summarization metric directly yields keyword-stuffed text that scores well and reads terribly.
The LLM version is familiar: a policy optimized against a preference RM discovers that length, confident tone, flattery, and hedging raise the score — producing sycophantic, verbose answers humans don’t actually prefer. The RM loves them; people don’t.
This fragility is why some argue learned-reward RL is shakier than it looks; Andrej Karpathy’s widely shared take makes that case.
Reward model overoptimization
The measurable form of this is overoptimization: as training proceeds, the proxy reward (RM score) keeps climbing while the gold reward (true preference) peaks and then declines. The gap is reward hacking.
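The pattern can be checked mechanically if you log both scores per checkpoint. The sketch below uses made-up numbers purely for illustration (they are not results from any paper) and a hypothetical helper, `overoptimization_report`.

```python
def overoptimization_report(proxy_scores, gold_scores):
    """Return the checkpoint with the best gold score and the proxy-minus-gold gap per checkpoint."""
    best_step = max(range(len(gold_scores)), key=lambda i: gold_scores[i])
    gaps = [p - g for p, g in zip(proxy_scores, gold_scores)]
    return best_step, gaps

# Illustrative, invented training curves: proxy keeps rising, gold peaks then falls.
proxy = [0.10, 0.40, 0.70, 0.95, 1.20]
gold = [0.10, 0.35, 0.50, 0.45, 0.30]
best_step, gaps = overoptimization_report(proxy, gold)
```

Here the checkpoint worth keeping is the one where gold peaks, even though the proxy score is still improving afterwards.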
Scaling Laws for Reward Model Overoptimization (Gao, Schulman & Hilton, 2022) measured this precisely using a synthetic “gold” RM. Three findings stand out. First, the proxy-vs-gold gap follows smooth, predictable scaling laws. Second, the functional form differs by optimization method: best-of-N degrades differently from RL. Third, a larger RM (more parameters, more data) is harder to overoptimize. A bigger KL penalty delays the divergence.
Go deeper: best-of-N vs RL overoptimization
In the Gao et al. setup, the gold score as a function of the KL distance from the initial policy is well fit by $R_{\mathrm{RL}}(d) = d\,(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)$ for RL and a similar but distinct form, $R_{\mathrm{bon}}(d) = d\,(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}}\, d)$, for best-of-N, where $d = \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}$. The practical reading: best-of-N is more sample-efficient at low optimization but RL can push further before collapsing; both eventually overoptimize, and the amount of safe optimization scales up with reward-model size. This is why labs prefer large, well-trained reference RMs and watch the KL budget rather than the raw reward.
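As a worked example of those functional forms, the sketch below evaluates the fits reported by Gao et al., gold score d(α − β log d) for RL and d(α − β d) for best-of-N with d the square root of the KL divergence, and solves for where each curve peaks. The coefficient values are arbitrary assumptions chosen for illustration, not fitted numbers from the paper.

```python
import math

def gold_score_rl(d, alpha, beta):
    """RL fit: R(d) = d * (alpha - beta * log d), with d = sqrt(KL)."""
    return d * (alpha - beta * math.log(d))

def gold_score_bon(d, alpha, beta):
    """Best-of-N fit: R(d) = d * (alpha - beta * d), with d = sqrt(KL)."""
    return d * (alpha - beta * d)

def peak_rl(alpha, beta):
    """Setting dR/dd = alpha - beta * log d - beta = 0 gives d = exp(alpha / beta - 1)."""
    return math.exp(alpha / beta - 1.0)

def peak_bon(alpha, beta):
    """Setting dR/dd = alpha - 2 * beta * d = 0 gives d = alpha / (2 * beta)."""
    return alpha / (2.0 * beta)

def bon_kl(n):
    """Analytic KL between the best-of-n policy and the base policy: log n - (n - 1) / n."""
    return math.log(n) - (n - 1) / n

# Arbitrary illustrative coefficients (assumptions, not values from the paper).
alpha, beta = 1.0, 0.5
d_rl, d_bon = peak_rl(alpha, beta), peak_bon(alpha, beta)
```

Past the peak distance, further optimization lowers the gold score under either form, which is the sense in which there is a finite KL budget.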
Why PRMs are harder than they look
PRMs are theoretically superior — dense, interpretable, better credit assignment — yet the most prominent reasoning model of 2025 deliberately dropped them. DeepSeek-R1 reports three concrete reasons for abandoning neural process reward models in favor of simple rule-based verifiable rewards:
- Defining what counts as a discrete “step” in general reasoning is ambiguous, and a PRM’s score is only as good as the step segmentation it was trained on.
- Automated step annotation is noisy; human annotation doesn’t scale. Either way the per-step ground truth is shaky for hard problems.
- A neural PRM, retrained or not, becomes increasingly gameable as RL proceeds: the policy finds shortcuts that satisfy the PRM without genuinely reasoning, and retraining the PRM mid-run is costly.
Benchmarks back this up. ProcessBench (3,400 expert-annotated cases) and PRMBench (6,216 problems, ~83K step labels) both find that state-of-the-art PRMs struggle to localize the earliest erroneous step on hard problems and miss subtle faults like redundancy and deceptive-but-plausible logic. The dense signal is only valuable if the PRM is actually right about each step — and often it isn’t.
Mitigations and best practices
There is no way to make a learned reward un-hackable, but a standard toolkit limits the damage:
- KL penalty. Keep the policy on a short leash from the reference; the single most important control on overoptimization. Tune the penalty coefficient (commonly written β): too tight kills learning, too loose invites hacking.
- Reward-model ensembles. Train several RMs; use their disagreement to flag off-distribution regions and optimize pessimistically against the minimum/lower bound.
- Stronger, larger reference RMs. Bigger RMs trained on more diverse data overoptimize more slowly (Gao et al.). A common detection recipe: re-score samples with a larger held-out RM and watch for proxy-vs-gold divergence.
- Watch the gold metric, not the proxy. Periodically evaluate with human preference or a verifiable check; stop when the gold reward stops improving even as the proxy climbs.
- Label quality over quantity. A few thousand clean, on-distribution preferences beat a noisy hundred thousand. Cover the policy’s real output distribution.
- Prefer verifiable rewards where possible. For math/code, a programmatic checker can’t be hacked the way a neural RM can — the RLVR lesson.
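The ensemble idea from the list above can be sketched in a few lines. This assumes you already have scores for the same response from several independently trained RMs; the two aggregation modes and the example scores are illustrative choices, not a prescribed recipe.

```python
import statistics

def pessimistic_reward(scores, mode="min", k=1.0):
    """Combine ensemble scores conservatively so disagreement lowers the reward."""
    if mode == "min":
        return min(scores)
    if mode == "mean_minus_std":
        return statistics.fmean(scores) - k * statistics.pstdev(scores)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical scores from three RMs for two responses.
in_distribution = [1.9, 2.0, 2.1]   # the RMs agree
off_distribution = [4.5, 0.2, 1.0]  # one RM is being exploited; the others disagree

# A single RM (the first one) would prefer the off-distribution response (4.5 > 1.9);
# the pessimistic ensemble prefers the one the RMs agree on.
safe_choice = max(
    [in_distribution, off_distribution],
    key=pessimistic_reward,
)
```

The spread across the ensemble doubles as a cheap off-distribution flag: large disagreement is a hint that the policy has wandered somewhere the RMs were never trained.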
Lilian Weng’s survey on reward hacking and Nathan Lambert’s reward modeling chapter are the two best practitioner references.
Beyond scalar RMs: generative judges and verifiable rewards
The classic RM is a scalar regressor. Two newer paradigms reshape the picture:
| Approach | How it scores | Strength | Weakness |
|---|---|---|---|
| Scalar RM | A trained head emits one number | Fast, cheap to query in RL | Opaque, gameable, no explanation |
| Generative RM / LLM-as-judge | A capable LLM reads the answer and writes a critique + verdict | Interpretable, flexible, can do chain-of-thought reasoning about quality | Slower; inherits the judge’s biases (length, position, self-preference) |
| Verifiable reward (RLVR) | A program checks the answer (tests pass? math correct?) | Far harder to hack than a learned RM; gold-standard where applicable | Only works for checkable domains |
Generative reward models (and the broader LLM-as-a-judge pattern behind AlpacaEval and MT-Bench) trade the scalar’s speed for transparency: the judge explains why, which makes errors auditable. RLVR sidesteps learned rewards entirely for domains with a ground-truth checker — the approach DeepSeek-R1 chose. Frontier pipelines increasingly mix all three: a verifiable reward for reasoning, a preference RM for style and safety, and an LLM judge for evaluation.
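For contrast with a learned scorer, a rule-based verifiable reward can be as small as the sketch below. It is a hypothetical illustration rather than DeepSeek-R1's actual checker: the `<answer>` tag format, the exact-match comparison, and the 0/1 reward values are all assumptions made for the example.

```python
import re

# Assumed output convention: the final answer is wrapped in <answer>...</answer>.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def verifiable_reward(response: str, reference: str) -> float:
    """Return 1.0 if the tagged answer exactly matches the reference, else 0.0."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0  # no parseable answer counts as a failure
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

good = verifiable_reward("3 * 4 = 12, then 2 + 12 = 14. <answer>14</answer>", "14")
wrong = verifiable_reward("I think it is <answer>20</answer>", "14")
unparseable = verifiable_reward("The answer is 14.", "14")
```

The checker has no parameters for the policy to exploit, but it is only as good as its matching rule: exact string comparison would reject an equivalent answer such as "14.0", so real checkers normalize answers before comparing.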
Researcher takes
Lilian Weng, former head of safety systems at OpenAI, frames reward hacking — when a policy exploits flaws in the reward model rather than learning the intended behavior — as a key blocker for deploying autonomous AI, in the announcement of her widely-cited survey on the topic.
Cassidy Laidlaw explains why the KL penalty is load-bearing against reward hacking, and argues the standard token-level version is the wrong object to constrain.
Frequently asked questions
What is the difference between a PRM and an ORM?
An outcome reward model (ORM) scores only the final answer — one number for the whole response. A process reward model (PRM) scores each reasoning step, giving dense, interpretable feedback that localizes where a solution went wrong. PRMs win on hard multi-step reasoning but are much costlier to label and easier to game.
How is a reward model trained?
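Scalar (outcome) RMs are trained on pairwise human preferences with the Bradley-Terry loss: minimize $-\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$, where $y_w$ is the chosen response, $y_l$ the rejected one, and $\sigma$ the logistic function. PRMs need per-step labels, either from humans (PRM800K) or from automated Monte-Carlo rollouts that label a step by its probability of reaching the correct answer (Math-Shepherd).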
What is reward hacking?
Reward hacking (a form of specification gaming) is when a policy maximizes the literal reward while violating its intent — exploiting blind spots in the reward model to score high without actually getting better. It’s a direct consequence of Goodhart’s law and the main reason learned rewards need a KL penalty.
Why did DeepSeek drop process reward models?
DeepSeek-R1 abandoned neural PRMs (and ORMs) for reasoning because of fuzzy step boundaries, unreliable step labels, and — most importantly — reward hacking that worsens as large-scale RL proceeds. It used simple rule-based verifiable rewards instead.
Key papers
- Let’s Verify Step by Step — Lightman et al., 2023 — the canonical PRM-vs-ORM paper; releases PRM800K.
- Scaling Laws for Reward Model Overoptimization — Gao, Schulman & Hilton, 2022 — measures proxy-vs-gold divergence.
- Math-Shepherd — Wang et al., 2023 — automated Monte-Carlo step labels for PRMs.
- InstructGPT — Ouyang et al., 2022 — the canonical Bradley-Terry RM + PPO pipeline.
- DeepSeek-R1 — DeepSeek, 2025 — the cautionary tale on dropping PRMs.
- ProcessBench / PRMBench — 2024–25 — benchmarks exposing PRM weaknesses.
- A Survey of Process Reward Models — Zheng et al., 2025 — full-loop PRM survey.