RLHF: Reinforcement Learning from Human Feedback

Key takeaways

RLHF fine-tunes a model to match human preferences instead of a hand-written reward function.
Three stages: supervised fine-tuning (SFT) → train a reward model → optimize with RL (usually PPO) under a KL penalty.
It's the step that turned base models like GPT-3 into assistants like ChatGPT and Claude.
Since 2024 the field has split: DPO and GRPO simplify the RL step, and RLVR (verifiable rewards) powers reasoning models, but RLHF still owns style, helpfulness and safety.

What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) teaches a model what “good” looks like by showing it, not defining it. Writing a reward function for fuzzy goals like “be helpful” or “be honest” is nearly impossible — but a human can easily say which of two answers is better. RLHF turns that cheap comparison signal into a training target: learn a reward model from human preferences, then use reinforcement learning to push the model toward outputs that reward model scores highly.

The RLHF loop: the policy generates responses, a reward model scores them, and PPO updates the policy toward higher reward — while a KL penalty keeps it close to the reference model.

Why RLHF exists

A pretrained language model is a brilliant autocomplete with no built-in notion that a helpful, truthful answer beats a plausible but useless one. Plain supervised fine-tuning (SFT) helps, but hits two walls: you can’t write enough ideal answers, and “good” is a preference, not a label. RLHF reframes the goal around cheap pairwise comparisons.

The landmark result — OpenAI’s InstructGPT — showed a 1.3B model tuned with RLHF was preferred by humans over the 100× larger 175B GPT-3. That paper is the direct precursor to ChatGPT.

How RLHF works: the three-stage pipeline

Supervised fine-tuning (SFT)

Fine-tune the pretrained model on curated demonstrations (prompt → ideal response) so it already produces reasonable, on-format answers. This SFT model also becomes the reference policy used later to keep training stable.

Train the reward model

Sample multiple responses per prompt and have humans pick the better one, yielding (prompt, chosen, rejected) triples. A separate reward model (RM) learns to score any response so that chosen beats rejected, via the Bradley–Terry objective:

P(y_A \succ y_B \mid x) = \sigma\big(r(x, y_A) - r(x, y_B)\big)

Training minimizes −log σ(r(x, y_chosen) − r(x, y_rejected)) averaged over the preference pairs, so the RM internalizes human taste.
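As a rough illustration of the objective above, here is a minimal sketch in plain Python that treats the reward model's outputs as two scalar scores. The function names and the example scores are assumptions made for illustration; a real reward model would compute the scores with a neural network and average this loss over a batch.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B."""
    return sigmoid(r_a - r_b)


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the human pick: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # softplus(-margin) equals -log(sigmoid(margin)) and never evaluates log(0).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Hypothetical scalar scores for one (chosen, rejected) pair.
print(preference_prob(2.0, 0.5))  # above 0.5: the RM agrees with the labeler
print(pairwise_loss(2.0, 0.5))    # small loss
print(pairwise_loss(0.5, 2.0))    # larger loss: the RM ranks the pair the wrong way
```

The loss depends only on the difference between the two scores, which is why a reward model's absolute scale is arbitrary and only comparisons between responses to the same prompt are meaningful.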

Optimize the policy with RL

Treat the LM as a policy and optimize it to produce high-reward outputs, minus a penalty for drifting from the reference:

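\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D}}\Big[\,\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[\,r(x,y)\,\big]\;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)\Big]

Here x is a prompt drawn from the training set D, y is a response sampled from the policy, and π_ref is the SFT model from stage one. The KL term is the safety belt; β sets how tight the leash is, with a larger β holding the policy closer to the reference. The classic optimizer is PPO.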
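To make the trade-off concrete, the sketch below computes the penalized reward for one sampled response. It assumes the KL term is estimated from that single sample as the summed per-token difference in log-probabilities, which is a common estimator but only one of several; the function names, the log-prob values, and the β values are illustrative assumptions.

```python
def kl_estimate(policy_logprobs, ref_logprobs):
    """Single-sample estimate of KL(pi || pi_ref) for one response:
    the sum over tokens of log pi(y_t) - log pi_ref(y_t)."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need one log-prob per token from each model")
    return sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))


def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus beta times the estimated drift from the reference."""
    return rm_score - beta * kl_estimate(policy_logprobs, ref_logprobs)


# Hypothetical per-token log-probs of the same sampled response under each model.
policy_lp = [-0.2, -0.1, -0.3]
ref_lp = [-0.9, -0.4, -0.5]

print(penalized_reward(2.0, policy_lp, ref_lp, beta=0.1))  # roughly 1.88
print(penalized_reward(2.0, policy_lp, ref_lp, beta=1.0))  # roughly 0.8: tighter leash
```

With the same reward-model score, raising β makes the same amount of drift cost more, so the optimizer has less incentive to move away from the reference.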

What does a single piece of preference data actually look like? A prompt, two model responses, and a human’s pick:

Prompt: “Explain photosynthesis to a 6-year-old.”

✓ chosen: Plants are like tiny chefs. They take in sunlight, water and air, and cook it into food so they can grow, and the leftover they breathe out is the oxygen we need.

✗ rejected: Photosynthesis is the biochemical process by which chlorophyll-bearing organisms convert photonic energy into glucose via the Calvin–Benson cycle.

Both answers are correct. The reward model learns from thousands of such picks that, for this kind of prompt, the first style is preferred — a judgment no hand-written reward function could easily capture. Good preference data is harder than it looks: annotators frequently disagree, the comparisons have to span the model’s real output distribution, and a few thousand clean picks usually beat a noisy hundred thousand.
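In code, one such record is usually just a small structured object. The sketch below is one possible layout, not a standard schema: the `annotator_id` field and the `agreement_rate` helper are assumptions added to show how annotator disagreement might be tracked, and the response texts are shortened versions of the example above.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    annotator_id: str  # hypothetical field, handy for measuring disagreement


def agreement_rate(votes):
    """Fraction of annotators who sided with the majority pick for one pair."""
    if not votes:
        raise ValueError("need at least one vote")
    return Counter(votes).most_common(1)[0][1] / len(votes)


pair = PreferencePair(
    prompt="Explain photosynthesis to a 6-year-old.",
    chosen="Plants are like tiny chefs that cook sunlight, water and air into food.",
    rejected="Photosynthesis is the biochemical process by which chlorophyll-bearing organisms convert photonic energy into glucose.",
    annotator_id="annotator-001",
)

# One JSON object per line is a common way to store preference datasets.
line = json.dumps(asdict(pair))
print(line)

# Three hypothetical annotators judged the same pair; one disagreed.
print(agreement_rate(["first", "first", "second"]))
```

Pairs with a low agreement rate are natural candidates for relabeling or removal, which is one way to favor a smaller clean dataset over a larger noisy one.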

Go deeper: inside the reward model

The reward model is typically the SFT model with its final layer swapped for a single scalar head, trained on the same preference pairs. Two failure modes dominate: it overfits the data (memorizing annotator quirks) and it’s miscalibrated off-distribution (confidently wrong on prompts unlike anything it trained on). Teams mitigate with held-out validation, larger and more diverse preference sets, and sometimes reward-model ensembles whose disagreement flags unreliable regions. A variant — process reward models (PRMs) — score each reasoning step rather than the final answer; see reward models.
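The ensemble idea mentioned above can be sketched in a few lines: score a response with several reward models and treat a large spread as a warning sign. The function name, the threshold of 0.5, and the example scores are assumptions for illustration; in practice the threshold would be tuned on held-out data.

```python
from statistics import mean, pstdev


def ensemble_score(scores, max_std=0.5):
    """Combine scores from several reward models for one response.

    Returns (mean_score, is_reliable). The response is flagged unreliable
    when the models disagree by more than max_std (population std dev).
    """
    if len(scores) < 2:
        raise ValueError("an ensemble needs at least two scores")
    return mean(scores), pstdev(scores) <= max_std


# Hypothetical scores from three reward models.
print(ensemble_score([1.0, 1.1, 0.9]))   # models agree: treated as reliable
print(ensemble_score([2.0, -1.0, 0.5]))  # models disagree: flagged as unreliable
```

A flagged response can be dropped from the RL batch, given a reduced reward, or routed to a human for labeling.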

Go deeper: what PPO actually optimizes

PPO maximizes a clipped surrogate objective so each update stays close to the current policy. It scales the advantage by the probability ratio r_t = π_θ(a|s) / π_old(a|s), clips r_t to [1−ε, 1+ε], and takes the smaller of the clipped and unclipped terms, so a single update gains nothing from moving the policy too far. Note that r_t here is a probability ratio, not the reward r(x, y). In RLHF the advantage is driven by the reward-model score minus the per-token KL term. Full derivation on the PPO page.
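For a single action, the clipped surrogate can be written out directly. This is a minimal per-sample sketch under stated assumptions: the function name is illustrative, ε = 0.2 is a commonly used default rather than something the text prescribes, and a real implementation averages this quantity over many tokens and maximizes it by gradient ascent.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped objective for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)              # pi_theta / pi_old
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the minimum keeps the pessimistic estimate of the improvement.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action whose probability rose by 50%: the credited gain is capped at 1 + eps.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# Bad action whose probability was halved: the credited gain is capped at 1 - eps.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
# Good action whose probability fell: no clipping, the full penalty applies.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=1.0))
```

The asymmetry in the third case matters: clipping only removes the incentive to overshoot in the helpful direction, while moves in the harmful direction are still fully penalized.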

Reward hacking and over-optimization

Over-optimization: the proxy reward (what you train on) keeps climbing while the gold reward (what you actually want) peaks and then declines. The gap is reward hacking.

Practitioners watch the gold reward (true human preference) diverge from the proxy reward (RM score) as training proceeds. Lilian Weng’s survey on reward hacking is the best single reference.
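One simple way to act on that divergence is to log both rewards per checkpoint and find where the gold reward peaks. The sketch below assumes a gold estimate (for example, a periodic human evaluation) is available at each checkpoint; the function names and the numbers are hypothetical.

```python
def gold_peak(gold):
    """Index of the checkpoint with the highest gold reward."""
    return max(range(len(gold)), key=lambda i: gold[i])


def overoptimized_checkpoints(proxy, gold):
    """Checkpoints after the gold peak where the proxy reward is higher
    than at the peak but the gold reward is lower."""
    if len(proxy) != len(gold):
        raise ValueError("need one proxy and one gold value per checkpoint")
    peak = gold_peak(gold)
    return [
        i
        for i in range(peak + 1, len(gold))
        if proxy[i] > proxy[peak] and gold[i] < gold[peak]
    ]


# Hypothetical training curves: the proxy keeps climbing, the gold reward does not.
proxy = [0.1, 0.4, 0.7, 0.9, 1.1]
gold = [0.1, 0.3, 0.45, 0.4, 0.3]

print(gold_peak(gold))                         # checkpoint to keep
print(overoptimized_checkpoints(proxy, gold))  # checkpoints where the proxy is being gamed
```

In this toy run, the checkpoint to ship is the one at the gold peak, not the last one.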

This fragility is also why some researchers argue RLHF isn’t “real” RL at all; Andrej Karpathy made the case in a widely shared post on X.

How do you know it worked? Evaluating RLHF

You can’t just watch the reward climb — that’s the proxy that gets hacked. Real evaluation triangulates across several lenses:

Method	What it measures	Watch out for
Held-out reward	RM score on prompts it didn’t train on	still just the proxy
Human win-rate	% of times humans prefer the new model over the SFT baseline	slow, expensive — the gold standard
Chatbot Arena (Elo)	crowdsourced head-to-head votes, ranked as Elo	coverage skews to chat-style prompts
LLM-as-judge (AlpacaEval, MT-Bench)	automated approximation of human preference	biased toward length and confident style
Red-teaming	safety: refusals, jailbreak resistance, harm	adversarial coverage is never complete

A model can win on automated judges while regressing in real use — which is exactly why labs never trust a single number. This is also the practical flip side of reward hacking: your evaluation has to be harder to game than your reward.
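Two of the numbers in the table above are easy to compute by hand. The sketch below shows a win-rate that counts ties as half a win and the textbook Elo update for one head-to-head vote. The function names, the tie convention, and K = 32 are assumptions; public leaderboards may fit ratings differently.

```python
def win_rate(outcomes):
    """Share of comparisons the new model wins; a tie counts as half a win."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Textbook Elo update after one match.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Hypothetical human judgments of the new model against the SFT baseline.
print(win_rate(["win", "win", "loss", "tie"]))  # 0.625

# Two equally rated models; A wins one vote.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

A win-rate from a handful of comparisons is noisy, so it is worth reporting the number of comparisons alongside it.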

Beyond PPO: the modern landscape (2024–2026)

The classic PPO recipe works but is heavy — it juggles four models (policy, reference, reward, value) and is finicky. A wave of alternatives reshaped post-training:

Method	What it does	Drops	Best for
PPO	On-policy RL against a learned reward, KL-regularized	—	The original, most controllable recipe
DPO	Turns preferences into a direct classification-style loss on the policy	reward model + RL loop	Simplicity, stability; default for many open models
GRPO	Group-relative advantages from sampled answers	value/critic network	Cheaper RL; reasoning models (DeepSeek)
Rejection sampling	Keep best-of-N by RM score, fine-tune on those	the RL loop	Simple gains (used in Llama 2)
RLAIF	AI feedback from a written “constitution” instead of human labels	human labelers	Scaling the labeling bottleneck
RLVR	Reward from a programmatic verifier (tests pass? answer correct?)	the learned reward model	Math, code, checkable reasoning

Two of these matter most today. DPO showed the RLHF objective can be solved in closed form, collapsing the whole reward-model-plus-PPO machinery into a single classification-style loss on the policy — far simpler and more stable, at some cost in control. GRPO keeps the RL loop but throws out the value network, estimating advantage by comparing a group of sampled answers to each other; it’s cheaper to run and underpins DeepSeek’s reasoning models.
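Both ideas fit in a few lines once the log-probabilities and rewards are given. The sketch below is a per-example illustration, not either paper's reference implementation: the function names, β = 0.1, the use of the population standard deviation, and the small `eps` guard are assumptions.

```python
import math
from statistics import mean, pstdev


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are sequence log-probabilities of the chosen and rejected
    responses under the policy and under the frozen reference model.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # softplus(-margin) equals -log(sigmoid(margin)), computed stably.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward standardized against the
    other answers sampled for the same prompt (no value network)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Policy identical to the reference: loss is log(2), no preference learned yet.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy now favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))

# Four sampled answers to one prompt, rewarded 1 if correct and 0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Note that when every answer in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is one practical reason to sample prompts of mixed difficulty.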

A second axis is where the reward comes from:

RLHF — learn the reward

Reward comes from a model trained on human preferences. Best where “good” is a matter of taste: open-ended quality, tone, helpfulness, safety. See reward models.

RLVR — verify the reward

Reward comes from a programmatic checker (do the unit tests pass? is the math answer correct?). Best for math, code and checkable reasoning — it powers the 2025 reasoning-model boom. See RLVR.

Frontier models increasingly use both: RLVR to sharpen reasoning, RLHF to keep the result helpful and safe. See RL for reasoning.
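A verifiable reward can be as simple as a function that parses the final answer and compares it with a known result. The sketch below assumes a hypothetical convention in which the response ends with a line of the form `Answer: <value>`; the function names and that convention are illustrative, and real verifiers (unit-test runners, symbolic checkers) are considerably more involved.

```python
import re
from fractions import Fraction


def extract_final_answer(response):
    """Return the value after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(\S+)", response)
    return matches[-1] if matches else None


def math_reward(response, expected):
    """1.0 if the final answer equals the expected value exactly, else 0.0."""
    raw = extract_final_answer(response)
    if raw is None:
        return 0.0
    try:
        return 1.0 if Fraction(raw) == Fraction(expected) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn nothing


print(math_reward("Half of one is a half. Answer: 0.5", "1/2"))  # 1.0
print(math_reward("I think it is three. Answer: 3", "4"))        # 0.0
print(math_reward("The result is obviously large.", "4"))        # 0.0
```

Comparing as exact fractions means equivalent forms such as 0.5 and 1/2 earn the same reward, while there is no learned model for the policy to exploit.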

RLAIF and Constitutional AI

Human labeling is the bottleneck, so why not let an AI do it? RLAIF (RL from AI Feedback) replaces human preference labels with judgments from a capable model, guided by a written set of principles. Anthropic’s Constitutional AI is the canonical example. In a supervised phase the model critiques and revises its own answers against a “constitution” (be helpful, avoid harm); in the RL phase a model compares pairs of responses against those principles, generating the preference data that trains the reward model. This scales the labeling step and makes the values explicit and auditable, at the cost of inheriting the judge model’s blind spots. Most modern pipelines blend human and AI feedback rather than choosing one.
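The labeling step itself can be sketched as a function that asks a judge to compare two responses and emits a preference triple. Everything here is hypothetical: the `Judge` signature, the `label_pair` helper, and the toy judges stand in for a call to a capable model. The sketch also queries the judge in both orders and discards inconsistent verdicts, one common guard against a judge that favors whichever answer it sees first.

```python
from typing import Callable, List, Optional

# Hypothetical judge interface: (prompt, first, second, principles) -> "first" or "second".
Judge = Callable[[str, str, str, List[str]], str]


def label_pair(prompt, a, b, principles, judge: Judge) -> Optional[dict]:
    """Build a (prompt, chosen, rejected) record from AI feedback.

    Returns None when the verdict flips with presentation order.
    """
    a_wins_forward = judge(prompt, a, b, principles) == "first"
    a_wins_swapped = judge(prompt, b, a, principles) == "second"
    if a_wins_forward != a_wins_swapped:
        return None  # position-dependent verdict: do not trust it
    chosen, rejected = (a, b) if a_wins_forward else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(prompt, first, second, principles):
    """Stand-in for a model call: prefers the shorter response."""
    return "first" if len(first) <= len(second) else "second"


def biased_judge(prompt, first, second, principles):
    """Stand-in for a judge with pure position bias."""
    return "first"


principles = ["be helpful", "avoid harm"]
print(label_pair("Say hi.", "Hi!", "Greetings, esteemed interlocutor.", principles, toy_judge))
print(label_pair("Say hi.", "Hi!", "Greetings, esteemed interlocutor.", principles, biased_judge))
```

Records that survive this check can feed the same reward-model training used for human labels, which is how human and AI feedback end up blended in one dataset.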

A short history of RLHF

2017

Deep RL from Human Preferences

Christiano et al. learn a reward from human pairwise comparisons on Atari and simulated robotics — the core idea.

2020

Learning to Summarize

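OpenAI applies the full reward-model + PPO pipeline to abstractive summarization at scale, building on its 2019 work on fine-tuning language models from human preferences (Ziegler et al.).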

2022

InstructGPT → ChatGPT

RLHF turns GPT-3 into an instruction-follower; a 1.3B model beats 175B on human preference. ChatGPT launches that November.

2022

Constitutional AI

Anthropic introduces RLAIF, scaling feedback with a written constitution.

2023

Llama 2 & DPO

Meta publishes an open RLHF recipe; Rafailov et al. introduce DPO, removing the separate reward model and RL loop.

2024–25

GRPO & the RLVR turn

DeepSeek’s GRPO and verifiable-reward RL drive reasoning models; the field splits reward-from-taste (RLHF) from reward-from-verifier (RLVR).

Where RLHF is used

Virtually every frontier assistant uses RLHF or a direct-alignment descendant in post-training:

Model family	Lab	How alignment is done
ChatGPT / GPT-4o	OpenAI	RLHF with PPO — the original production recipe
Claude	Anthropic	RLHF + RLAIF / Constitutional AI
Llama 2 / 3	Meta	RLHF (rejection sampling + PPO); DPO in later work
Gemini	Google DeepMind	RLHF as part of post-training
DeepSeek-V3 / R1	DeepSeek	GRPO + RLVR for reasoning, preference tuning for chat

Limitations and open problems

Human labeling doesn’t scale cheaply — quality preference data is expensive, and annotators disagree.
Reward hacking — the proxy can be gamed (above).
Sycophancy & homogeneity — optimizing aggregate preference makes models agree too readily and collapses output diversity. The NeurIPS 2025 Best Paper “Artificial Hivemind” documents this and shows reward models are poorly calibrated to the real spread of human ratings.
Bias & distribution shift — the RM reflects its labelers and degrades off-distribution.

RLHF in practice

Modern post-training stacks combine these methods rather than picking one: SFT, then a mix of DPO/PPO/GRPO, rejection sampling, and (for reasoning) RLVR. The open-source TRL library implements SFT, reward modeling, PPO, DPO and GRPO; OpenRLHF and veRL target larger scale. Focused behaviors need on the order of tens of thousands of comparisons; a general assistant needs far more.

Building RL environments, reward pipelines, and human-preference data at production scale is its own industry — see the companies building RLHF environments and data.

Researcher takes

Nathan Lambert frames the evolution of RLHF pipelines as another instance of the bitter lesson: the gains come from scaling and removing humans from the loop.

Frequently asked questions

Is RLHF the same as reinforcement learning?

Not quite. It borrows RL’s machinery — a policy optimized against a reward — but the reward is a learned model of human preference, and training stays on a short KL leash. Some researchers (see Karpathy above) argue it’s “barely RL” next to systems like AlphaGo that optimize a true environment reward.

How is RLHF different from supervised fine-tuning?

SFT imitates demonstrations (“copy these good answers”). RLHF optimizes against a preference signal (“produce answers humans prefer”), capturing quality and safety that demonstrations can’t. RLHF almost always runs after SFT.

Is DPO replacing RLHF?

DPO replaces the PPO step with a simpler direct loss and is now a default for many open models. But RLHF broadly — aligning a model to human preference — is alive and well; PPO, DPO and GRPO are all ways to do it.

Do reasoning models like o1 and R1 use RLHF?

They lean on RLVR (verifiable rewards) to sharpen reasoning, then still use preference-based alignment to stay helpful and safe — both, not either/or. See RL for reasoning.

Key papers

Deep RL from Human Preferences — Christiano et al., 2017 — the conceptual seed.
Learning to Summarize from Human Feedback — Stiennon et al., 2020 — the RM+PPO pipeline applied to summarization at scale.
InstructGPT — Ouyang et al., 2022 — the canonical RLHF-for-LLMs paper.
Constitutional AI — Bai et al., 2022 — RLAIF.
Llama 2 — Touvron et al., 2023 — a detailed open RLHF recipe.
Direct Preference Optimization — Rafailov et al., 2023 — the RM-free alternative.

Related

Reward models · PPO · DPO & preference optimization · GRPO · RLVR · RL for reasoning · What is reinforcement learning?
