reinforcement-learning.com

RLHF: Reinforcement Learning from Human Feedback

What RLHF is, how the SFT → reward model → RL pipeline works, the math behind it, and how it compares to DPO, GRPO, RLAIF and RLVR in 2026.

Updated 2026-06-07
Key takeaways
  • RLHF fine-tunes a model to match human preferences instead of a hand-written reward function.
  • Three stages: supervised fine-tuning (SFT) → train a reward model → optimize with RL (usually PPO) under a KL penalty.
  • It's the step that turned base models like GPT-3 into assistants like ChatGPT and Claude.
  • Since 2024 the field has split: DPO/GRPO simplify the RL step and RLVR (verifiable rewards) powers reasoning models, but RLHF still owns style, helpfulness and safety.

What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) teaches a model what “good” looks like by showing it, not defining it. Writing a reward function for fuzzy goals like “be helpful” or “be honest” is nearly impossible — but a human can easily say which of two answers is better. RLHF turns that cheap comparison signal into a training target: learn a reward model from human preferences, then use reinforcement learning to push the model toward outputs that reward model scores highly.

[Figure: prompt x → policy LM (π_θ) → response y → reward model r_φ → reward → policy update (PPO / GRPO), with a − β · KL penalty against the reference LM (π_ref)]
The RLHF loop: the policy generates responses, a reward model scores them, and PPO updates the policy toward higher reward — while a KL penalty keeps it close to the reference model.
▶ RLHF, Clearly Explained — StatQuest (the plain-English intuition, ~12 min)

Why RLHF exists

A pretrained language model is a brilliant autocomplete with no built-in notion that helpful and truthful beats plausible but useless. Plain supervised fine-tuning (SFT) helps, but hits two walls: you can’t write enough ideal answers, and “good” is a preference, not a label. RLHF reframes the goal around cheap pairwise comparisons.

The landmark result — OpenAI’s InstructGPT — showed a 1.3B model tuned with RLHF was preferred by humans over the 100× larger 175B GPT-3. That paper is the direct precursor to ChatGPT.

1.3B vs. 175B: InstructGPT at 1.3B parameters was preferred by humans over GPT-3, a 100× larger base model
3 stages: SFT → reward model → RL optimization

How RLHF works: the three-stage pipeline

1. Supervised fine-tuning (SFT)

Fine-tune the pretrained model on curated demonstrations (prompt → ideal response) so it already produces reasonable, on-format answers. This SFT model also becomes the reference policy used later to keep training stable.

2. Train the reward model

Sample multiple responses per prompt and have humans pick the better one, yielding (prompt, chosen, rejected) triples. A separate reward model (RM) learns to score any response so that chosen beats rejected, via the Bradley–Terry objective:

$$P(A \succ B) = \sigma\big(r(A) - r(B)\big)$$

Training minimizes −log σ(r(chosen) − r(rejected)) — the RM internalizes human taste.
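The objective above can be sketched in a few lines of plain Python. This is a minimal illustration on scalar scores, not a training loop; the function names are made up for this example, and a real reward model would compute the scores with a neural network and average the loss over a batch.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response A beats response B."""
    return sigmoid(r_a - r_b)


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the human's pick: -log sigma(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    print(preference_probability(1.0, 1.0))  # equal scores: a coin flip
    print(pairwise_loss(2.0, 0.0))           # RM agrees with the human: small loss
    print(pairwise_loss(0.0, 2.0))           # RM disagrees: large loss
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary; that is one reason raw RM scores are hard to compare across models.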

3. Optimize the policy with RL

Treat the LM as a policy and optimize it to produce high-reward outputs, minus a penalty for drifting from the reference:

$$\max_{\pi}\; \mathbb{E}_{x,\,y\sim\pi}\big[\,r(x,y)\,\big]\;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)$$

The KL term is the safety belt; β sets how tight the leash is. The classic optimizer is PPO.
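In practice the KL term is often folded into the reward itself. The sketch below assumes one common convention: each token is penalized by β times the log-probability gap between policy and reference, and the reward-model score is added on the final token. The function name, the default β, and the placement of the score are illustrative assumptions, not a fixed standard.

```python
from typing import List


def kl_shaped_rewards(
    rm_score: float,
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    beta: float = 0.1,
) -> List[float]:
    """Per-token rewards: a KL penalty on every token, plus the RM score at the end.

    policy_logprobs[t] and ref_logprobs[t] are the log-probabilities that the
    policy and the reference model assign to the t-th sampled token.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not policy_logprobs:
        raise ValueError("response must contain at least one token")
    # log pi(y_t) - log pi_ref(y_t) is a single-sample estimate of the KL at step t.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score  # sequence-level score lands on the last token
    return rewards


if __name__ == "__main__":
    print(kl_shaped_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0], beta=0.1))
```

A larger β makes drifting from the reference more expensive, which is the "tighter leash" described above.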

What does a single piece of preference data actually look like? A prompt, two model responses, and a human’s pick:

Prompt: “Explain photosynthesis to a 6-year-old.”
✓ chosen: Plants are like tiny chefs. They take in sunlight, water and air, and cook it into food so they can grow, and the leftover they breathe out is the oxygen we need.
✗ rejected: Photosynthesis is the biochemical process by which chlorophyll-bearing organisms convert photonic energy into glucose via the Calvin–Benson cycle.

Both answers are correct. The reward model learns from thousands of such picks that, for this kind of prompt, the first style is preferred — a judgment no hand-written reward function could easily capture. Good preference data is harder than it looks: annotators frequently disagree, the comparisons have to span the model’s real output distribution, and a few thousand clean picks usually beat a noisy hundred thousand.
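One way to picture such a record, and the annotator-disagreement problem, is the sketch below. The `PreferencePair` class, the vote encoding (0 = first response, 1 = second) and the 0.66 agreement threshold are all hypothetical choices for illustration; real pipelines use their own schemas and filtering rules.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison: a prompt, the preferred response, and the other one."""
    prompt: str
    chosen: str
    rejected: str


def aggregate_votes(votes: List[int]) -> Tuple[int, float]:
    """Return the majority pick (0 or 1) and the fraction of annotators who agreed.

    On an exact tie the option voted first wins, so filter on the agreement
    score rather than trusting the winner of a 50/50 split.
    """
    if not votes or any(v not in (0, 1) for v in votes):
        raise ValueError("votes must be a non-empty list of 0s and 1s")
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)


def build_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    votes: List[int],
    min_agreement: float = 0.66,
) -> Optional[PreferencePair]:
    """Turn raw annotator votes into a training pair, dropping contested items."""
    winner, agreement = aggregate_votes(votes)
    if agreement < min_agreement:
        return None  # too noisy to train on
    if winner == 0:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


if __name__ == "__main__":
    print(build_pair("Explain photosynthesis to a 6-year-old.",
                     "Plants are like tiny chefs...",
                     "Photosynthesis is the biochemical process...",
                     votes=[0, 0, 1]))
```

Dropping low-agreement items trades dataset size for cleanliness, which matches the observation that a smaller clean set can beat a larger noisy one.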

Go deeper: inside the reward model

The reward model is typically the SFT model with its final layer swapped for a single scalar head, trained on the same preference pairs. Two failure modes dominate: it overfits the data (memorizing annotator quirks) and it’s miscalibrated off-distribution (confidently wrong on prompts unlike anything it trained on). Teams mitigate with held-out validation, larger and more diverse preference sets, and sometimes reward-model ensembles whose disagreement flags unreliable regions. A variant — process reward models (PRMs) — score each reasoning step rather than the final answer; see reward models.
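The ensemble idea can be sketched as follows. Each reward model is represented here by a plain callable taking a prompt and a response; that interface, the function names and the 0.5 threshold are assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a reward model maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]


def ensemble_score(
    models: Sequence[RewardFn], prompt: str, response: str
) -> Tuple[float, float]:
    """Return the mean score and the spread (population std) across the ensemble."""
    if not models:
        raise ValueError("need at least one reward model")
    scores = [m(prompt, response) for m in models]
    return mean(scores), pstdev(scores)


def is_unreliable(
    models: Sequence[RewardFn], prompt: str, response: str, max_std: float = 0.5
) -> bool:
    """Flag inputs where ensemble members disagree by more than max_std."""
    _, spread = ensemble_score(models, prompt, response)
    return spread > max_std


if __name__ == "__main__":
    agreeing = [lambda p, r: 1.0, lambda p, r: 1.1, lambda p, r: 0.9]
    split = [lambda p, r: 0.0, lambda p, r: 2.0]
    print(is_unreliable(agreeing, "prompt", "response"))  # members agree
    print(is_unreliable(split, "prompt", "response"))     # members disagree
```

High spread does not prove the score is wrong; it only marks a region where the ensemble has little evidence, so the score deserves less trust there.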

Go deeper: what PPO actually optimizes

PPO maximizes a clipped surrogate objective so each update stays close to the current policy. It scales the advantage by the probability ratio r_t = π_θ(a|s) / π_old(a|s), then clips r_t to [1−ε, 1+ε] so a single update can’t move the policy too far. In RLHF the advantage is driven by the reward-model score minus the per-token KL term. Full derivation on the PPO page.
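A minimal sketch of the clipped surrogate for a single token, in plain Python. It takes log-probabilities rather than probabilities (the usual numerical choice) and uses ε = 0.2 as an illustrative default; the function names are invented for this example.

```python
import math
from typing import Sequence


def clipped_surrogate(
    logp_new: float, logp_old: float, advantage: float, eps: float = 0.2
) -> float:
    """PPO objective for one action: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_old(a|s)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the min makes the objective pessimistic: the policy gets no extra
    # credit for moving the ratio outside the clip range in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_loss(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    eps: float = 0.2,
) -> float:
    """Loss to minimize: the negated mean of the per-token surrogate."""
    terms = [
        clipped_surrogate(n, o, a, eps)
        for n, o, a in zip(logp_new, logp_old, advantages)
    ]
    return -sum(terms) / len(terms)


if __name__ == "__main__":
    # Ratio 1.5 with a positive advantage is capped at 1 + eps.
    print(clipped_surrogate(math.log(1.5), 0.0, 1.0))
    # Ratio 1.5 with a negative advantage is not capped: the penalty applies in full.
    print(clipped_surrogate(math.log(1.5), 0.0, -1.0))
```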

Reward hacking and over-optimization

[Figure: reward (y-axis) vs. training steps (x-axis). Curves: proxy reward (RM score) and gold reward (true preference), with a marker where over-optimization begins.]
Over-optimization: the proxy reward (what you train on) keeps climbing while the gold reward (what you actually want) peaks and then declines. The gap is reward hacking.

Practitioners watch the gold reward (true human preference) diverge from the proxy reward (RM score) as training proceeds. Lilian Weng’s survey on reward hacking is the best single reference.
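Given periodic gold evaluations at saved checkpoints (for example, a human win-rate), the divergence can be detected mechanically. The sketch below is one simple heuristic under stated assumptions: both series are sampled at the same checkpoints, and "divergence" means the proxy rose while gold fell for `patience` consecutive evaluations. The function names and the patience rule are hypothetical.

```python
from typing import Optional, Sequence


def best_checkpoint(gold: Sequence[float]) -> int:
    """Index of the checkpoint with the highest gold reward."""
    if not gold:
        raise ValueError("need at least one evaluation")
    return max(range(len(gold)), key=lambda i: gold[i])


def overoptimization_start(
    proxy: Sequence[float], gold: Sequence[float], patience: int = 2
) -> Optional[int]:
    """First checkpoint of a run where proxy rises while gold falls.

    Returns None unless the divergence persists for `patience` consecutive
    evaluations, which filters out single noisy measurements.
    """
    if len(proxy) != len(gold):
        raise ValueError("proxy and gold must be evaluated at the same checkpoints")
    streak = 0
    for i in range(1, len(gold)):
        if proxy[i] > proxy[i - 1] and gold[i] < gold[i - 1]:
            streak += 1
            if streak >= patience:
                return i - patience + 1
        else:
            streak = 0
    return None


if __name__ == "__main__":
    proxy = [1.0, 2.0, 3.0, 4.0, 5.0]
    gold = [1.0, 2.0, 3.0, 2.5, 2.0]
    print(best_checkpoint(gold))                # checkpoint to keep
    print(overoptimization_start(proxy, gold))  # where the curves split
```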

This fragility is also why some researchers argue RLHF isn’t “real” RL at all; Andrej Karpathy’s widely shared take makes that case.

How do you know it worked? Evaluating RLHF

You can’t just watch the reward climb — that’s the proxy that gets hacked. Real evaluation triangulates across several lenses:

| Method | What it measures | Watch out for |
|---|---|---|
| Held-out reward | RM score on prompts it didn’t train on | still just the proxy |
| Human win-rate | % of times humans prefer the new model over the SFT baseline | slow and expensive, but the gold standard |
| Chatbot Arena (Elo) | crowdsourced head-to-head votes, ranked as Elo | coverage skews to chat-style prompts |
| LLM-as-judge (AlpacaEval, MT-Bench) | automated approximation of human preference | biased toward length and confident style |
| Red-teaming | safety: refusals, jailbreak resistance, harm | adversarial coverage is never complete |

A model can win on automated judges while regressing in real use — which is exactly why labs never trust a single number. This is also the practical flip side of reward hacking: your evaluation has to be harder to game than your reward.
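Two of the quantities in the table are simple to compute. The sketch below shows a win-rate that counts ties as half a win, and the standard Elo update for one head-to-head vote. The tie convention and the K-factor of 32 are illustrative assumptions; public leaderboards may use different conventions or fit ratings with a statistical model instead of online updates.

```python
from typing import Tuple


def win_rate(wins: int, losses: int, ties: int = 0) -> float:
    """Fraction of comparisons won by the new model, counting a tie as half a win."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons recorded")
    return (wins + 0.5 * ties) / total


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Update both ratings after one vote. score_a is 1 (A wins), 0.5 (tie) or 0."""
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(win_rate(wins=6, losses=2, ties=2))
    print(elo_update(1000.0, 1000.0, score_a=1.0))
```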

Beyond PPO: the modern landscape (2024–2026)

The classic PPO recipe works but is heavy — it juggles four models (policy, reference, reward, value) and is finicky. A wave of alternatives reshaped post-training:

| Method | What it does | Drops | Best for |
|---|---|---|---|
| PPO | On-policy RL against a learned reward, KL-regularized |  | The original, most controllable recipe |
| DPO | Turns preferences into a direct classification-style loss on the policy | reward model + RL loop | Simplicity, stability; default for many open models |
| GRPO | Group-relative advantages from sampled answers | value/critic network | Cheaper RL; reasoning models (DeepSeek) |
| Rejection sampling | Keep best-of-N by RM score, fine-tune on those | the RL loop | Simple gains (used in Llama 2) |
| RLAIF | AI feedback from a written “constitution” instead of human labels | human labelers | Scaling the labeling bottleneck |
| RLVR | Reward from a programmatic verifier (tests pass? answer correct?) | the learned reward model | Math, code, checkable reasoning |

Two of these matter most today. DPO showed the RLHF objective can be solved in closed form, collapsing the whole reward-model-plus-PPO machinery into a single classification-style loss on the policy — far simpler and more stable, at some cost in control. GRPO keeps the RL loop but throws out the value network, estimating advantage by comparing a group of sampled answers to each other; it’s cheaper to run and underpins DeepSeek’s reasoning models.
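Both ideas fit in a few lines. The sketch below shows the DPO loss for one preference pair, computed from sequence log-probabilities under the policy and the reference, and a group-relative advantage that normalizes each sampled answer's reward by the group's mean and standard deviation. The function names, the default β, the use of population standard deviation and the small `eps` are assumptions for illustration; implementations differ in these details.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log sigma(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) written as softplus(-m) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each sampled answer relative to its own group (no value network)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log 2, i.e. no preference learned yet.
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Four answers to one prompt; two were rewarded.
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Note what each function does not need: `dpo_loss` never calls a reward model, and `grpo_advantages` never calls a critic. That is the "Drops" column of the table in code form.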

A second axis is where the reward comes from:

RLHF — learn the reward

Reward comes from a model trained on human preferences. Best where “good” is a matter of taste: open-ended quality, tone, helpfulness, safety. See reward models.

RLVR — verify the reward

Reward comes from a programmatic checker (do the unit tests pass? is the math answer correct?). Best for math, code and checkable reasoning — it powers the 2025 reasoning-model boom. See RLVR.
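A toy verifier makes the contrast with a learned reward concrete. The sketch below assumes a deliberately crude convention, that the last number in the response is the model's final answer, and returns a binary reward. Real verifiers parse a structured answer format or run a test suite; the function names and the extraction rule here are hypothetical.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the last number that appears in the response, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None


def verifiable_reward(response: str, expected: float, tol: float = 1e-9) -> float:
    """1.0 if the extracted answer matches the known result, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if abs(float(answer) - expected) <= tol else 0.0


if __name__ == "__main__":
    print(verifiable_reward("17 + 25 = 42", expected=42))
    print(verifiable_reward("I think the answer is 41.", expected=42))
    print(verifiable_reward("I am not sure.", expected=42))
```

Nothing here is learned, so the reward cannot drift or be flattered, but it can still be gamed if the extraction rule is loose, for example by a response that lists many candidate numbers and ends with a lucky one.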

Frontier models increasingly use both: RLVR to sharpen reasoning, RLHF to keep the result helpful and safe. See RL for reasoning.

RLAIF and Constitutional AI

Human labeling is the bottleneck — so why not let an AI do it? RLAIF (RL from AI Feedback) replaces human preference labels with judgments from a capable model, guided by a written set of principles. Anthropic’s Constitutional AI is the canonical example: the model critiques and revises its own answers against a “constitution” (be helpful, avoid harm), generating the preference data that trains the reward model. It scales the labeling step and makes the values explicit and auditable — at the cost of inheriting the judge model’s blind spots. Most modern pipelines blend human and AI feedback rather than choosing one.
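The labeling step can be sketched with the judge model abstracted as a callable. Everything about the interface below is an assumption for illustration: the judge receives the constitution, the prompt and two responses, and returns 0 or 1 for the position it prefers. The sketch also asks twice with the order swapped and discards inconsistent verdicts, a common guard against position bias rather than anything specific to Constitutional AI.

```python
from typing import Callable, Dict, Optional

# Hypothetical judge interface: (constitution, prompt, first, second) -> 0 or 1,
# where the return value is the position of the preferred response.
Judge = Callable[[str, str, str, str], int]


def ai_preference(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Judge,
    constitution: str,
) -> Optional[Dict[str, str]]:
    """Build a (prompt, chosen, rejected) record from AI feedback, or None if unreliable."""
    first = judge(constitution, prompt, response_a, response_b)
    second = judge(constitution, prompt, response_b, response_a)  # order swapped
    if first not in (0, 1) or second not in (0, 1):
        raise ValueError("judge must return 0 or 1")
    # A consistent judge picks the same response both times, which means it
    # picks different positions. Same position twice signals position bias.
    if first == second:
        return None
    if first == 0:
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    return {"prompt": prompt, "chosen": response_b, "rejected": response_a}


if __name__ == "__main__":
    # Toy stand-in for a judge model: prefers the shorter response.
    def shorter_is_better(constitution: str, prompt: str, x: str, y: str) -> int:
        return 0 if len(x) < len(y) else 1

    print(ai_preference("Say hi.", "Hi!", "Greetings, esteemed interlocutor.",
                        shorter_is_better, constitution="Be concise."))
```

The resulting records have the same shape as human preference data, so they can feed the same reward-model training step.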

A short history of RLHF

2017
Deep RL from Human Preferences
Christiano et al. learn a reward from human pairwise comparisons on Atari and simulated robotics — the core idea.
2020
Learning to Summarize
OpenAI scales the full reward-model + PPO pipeline to summarization, building on its 2019 work fine-tuning language models from human preferences (Ziegler et al.).
2022
InstructGPT → ChatGPT
RLHF turns GPT-3 into an instruction-follower; a 1.3B model beats 175B on human preference. ChatGPT launches that November.
2022
Constitutional AI
Anthropic introduces RLAIF, scaling feedback with a written constitution.
2023
Llama 2 & DPO
Meta publishes an open RLHF recipe; Rafailov et al. introduce DPO, removing the separate reward model and RL loop.
2024–25
GRPO & the RLVR turn
DeepSeek’s GRPO and verifiable-reward RL drive reasoning models; the field splits reward-from-taste (RLHF) from reward-from-verifier (RLVR).

Where RLHF is used

Virtually every frontier assistant uses RLHF or a direct-alignment descendant in post-training:

| Model family | Lab | How alignment is done |
|---|---|---|
| ChatGPT / GPT-4o | OpenAI | RLHF with PPO (the original production recipe) |
| Claude | Anthropic | RLHF + RLAIF / Constitutional AI |
| Llama 2 / 3 | Meta | RLHF (rejection sampling + PPO); DPO in later work |
| Gemini | Google DeepMind | RLHF as part of post-training |
| DeepSeek-V3 / R1 | DeepSeek | GRPO + RLVR for reasoning, preference tuning for chat |

Limitations and open problems

  • Human labeling doesn’t scale cheaply — quality preference data is expensive, and annotators disagree.
  • Reward hacking — the proxy can be gamed (above).
  • Sycophancy & homogeneity — optimizing aggregate preference makes models agree too readily and collapses output diversity. The NeurIPS 2025 Best Paper “Artificial Hivemind” documents this and shows reward models are poorly calibrated to the real spread of human ratings.
  • Bias & distribution shift — the RM reflects its labelers and degrades off-distribution.

RLHF in practice

Modern post-training stacks combine these rather than picking one: SFT, then a mix of DPO/PPO/GRPO, rejection sampling, and (for reasoning) RLVR. The open-source TRL library implements SFT, reward modeling, PPO, DPO and GRPO; OpenRLHF and veRL target larger scale. Focused behaviors need on the order of tens of thousands of comparisons; a general assistant needs far more.

Building RL environments, reward pipelines, and human-preference data at production scale is its own industry — see the companies building RLHF environments and data.

▶ RLHF with the full math derivations and PyTorch code — Umar Jamil (the deep version)

Researcher takes

Nathan Lambert frames the evolution of RLHF pipelines as another instance of the bitter lesson: the gains come from scaling and removing humans from the loop.

Frequently asked questions

Is RLHF the same as reinforcement learning?

Not quite. It borrows RL’s machinery — a policy optimized against a reward — but the reward is a learned model of human preference, and training stays on a short KL leash. Some researchers (see Karpathy above) argue it’s “barely RL” next to systems like AlphaGo that optimize a true environment reward.

How is RLHF different from supervised fine-tuning?

SFT imitates demonstrations (“copy these good answers”). RLHF optimizes against a preference signal (“produce answers humans prefer”), capturing quality and safety that demonstrations can’t. RLHF almost always runs after SFT.

Is DPO replacing RLHF?

DPO replaces the PPO step with a simpler direct loss and is now a default for many open models. But RLHF broadly — aligning a model to human preference — is alive and well; PPO, DPO and GRPO are all ways to do it.

Do reasoning models like o1 and R1 use RLHF?

They lean on RLVR (verifiable rewards) to sharpen reasoning, then still use preference-based alignment to stay helpful and safe — both, not either/or. See RL for reasoning.
