- RLHF fine-tunes a model to match human preferences instead of a hand-written reward function.
- Three stages: supervised fine-tuning (SFT) → train a reward model → optimize with RL (usually PPO) under a KL penalty.
- It's the step that turned base models like GPT-3 into assistants like ChatGPT and Claude.
- Since 2024 the field has split: DPO and GRPO simplify the RL step, and RLVR (verifiable rewards) powers reasoning models, but RLHF still owns style, helpfulness, and safety.
What is RLHF?
Reinforcement Learning from Human Feedback (RLHF) teaches a model what “good” looks like by showing it, not defining it. Writing a reward function for fuzzy goals like “be helpful” or “be honest” is nearly impossible — but a human can easily say which of two answers is better. RLHF turns that cheap comparison signal into a training target: learn a reward model from human preferences, then use reinforcement learning to push the model toward outputs that reward model scores highly.
Why RLHF exists
A pretrained language model is a brilliant autocomplete with no built-in notion that “helpful and truthful” beats “plausible but useless.” Plain supervised fine-tuning (SFT) helps, but hits two walls: you can’t write enough ideal answers, and “good” is a preference, not a label. RLHF reframes the goal around cheap pairwise comparisons.
The landmark result, OpenAI’s InstructGPT, showed that outputs from a 1.3B-parameter model tuned with RLHF were preferred by human raters over those of the 175B GPT-3, a model more than 100× larger. That paper is the direct precursor to ChatGPT.
How RLHF works: the three-stage pipeline
Stage 1: Supervised fine-tuning (SFT). Fine-tune the pretrained model on curated demonstrations (prompt → ideal response) so it already produces reasonable, on-format answers. This SFT model also becomes the reference policy used later to keep training stable.
Stage 2: Reward model training. Sample multiple responses per prompt and have humans pick the better one, yielding (prompt, chosen, rejected) triples. A separate reward model (RM) learns to score any response so that chosen beats rejected, via the Bradley–Terry objective, which models the probability that chosen is preferred as σ(r(chosen) − r(rejected)).
Training minimizes −log σ(r(chosen) − r(rejected)), so the RM internalizes human taste.
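The pairwise loss above is small enough to compute by hand. The following is a minimal sketch in plain Python for a single comparison; the function name `bradley_terry_loss` is illustrative, and a real reward model would compute the same quantity over batches of scalar outputs from a neural network.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)) for one preference pair.

    Uses the identity -log(sigmoid(m)) == log(1 + exp(-m)), evaluated in a
    form that does not overflow for large positive or negative margins.
    """
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Equal scores: the RM is undecided, so the loss equals log(2), about 0.693.
print(bradley_terry_loss(0.0, 0.0))
# Chosen scored above rejected: the loss drops below log(2).
print(bradley_terry_loss(2.0, 0.0))
# Chosen scored below rejected: the loss rises above log(2).
print(bradley_terry_loss(0.0, 2.0))
```

Only the difference between the two scores matters, which is why reward-model outputs have no absolute scale.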
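Stage 3: RL optimization. Treat the LM as a policy and optimize it to produce high-reward outputs, minus a penalty for drifting from the reference: maximize E[r(x, y)] − β · KL(π_θ ‖ π_ref), where r is the reward model’s score for response y to prompt x, π_θ is the policy being trained, and π_ref is the frozen SFT reference.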
The KL term is the safety belt; β sets how tight the leash is. The classic optimizer is PPO.
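To make the role of β concrete, here is a minimal sketch of the KL-penalized reward for one sampled response. It assumes the KL term is estimated from the sample itself as the sum of per-token log-probability differences between policy and reference, which is one common estimator rather than the only one. The function name, the example numbers, and the default `beta=0.1` are illustrative assumptions.

```python
def kl_penalized_reward(
    rm_score: float,
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    beta: float = 0.1,  # illustrative value, not a recommended setting
) -> float:
    """Reward-model score minus beta times a sample-based KL estimate."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need one log-probability per token from each model")
    # Sum over tokens of log pi_theta(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# The policy assigns these tokens more probability than the reference does,
# so part of the reward-model score is taxed away (result is roughly 0.94).
print(kl_penalized_reward(1.0, [-1.0, -0.5], [-1.2, -0.9]))
```

A larger `beta` makes drifting from the reference more expensive; with `beta=0` the policy is free to chase the reward model wherever it leads.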
What does a single piece of preference data actually look like? A prompt, two model responses, and a human’s pick.
Often both answers are correct and differ only in style. The reward model learns from thousands of such picks which style is preferred for a given kind of prompt, a judgment no hand-written reward function could easily capture. Good preference data is harder than it looks: annotators frequently disagree, the comparisons have to span the model’s real output distribution, and a few thousand clean picks usually beat a noisy hundred thousand.
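As a rough illustration of the (prompt, chosen, rejected) format, the sketch below turns one annotator decision into a training record. The class name, the helper `from_annotation`, and the example prompt and responses are all hypothetical and invented for this illustration; they are not taken from any real dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One training example for the reward model."""
    prompt: str
    chosen: str
    rejected: str

def from_annotation(prompt: str, response_a: str, response_b: str, picked: str) -> PreferencePair:
    """Convert an annotator's pick ("a" or "b") into a preference pair."""
    if picked == "a":
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    if picked == "b":
        return PreferencePair(prompt, chosen=response_b, rejected=response_a)
    raise ValueError("picked must be 'a' or 'b'")

# Hypothetical example: both answers are factually correct, but the
# annotator preferred the one phrased in everyday units.
pair = from_annotation(
    prompt="What is the boiling point of water at sea level?",
    response_a="100 °C (212 °F) at standard atmospheric pressure.",
    response_b="373.15 K.",
    picked="a",
)
print(pair.chosen)
```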
Go deeper: inside the reward model
The reward model is typically the SFT model with its final layer swapped for a single scalar head, trained on the same preference pairs. Two failure modes dominate: it overfits the data (memorizing annotator quirks) and it’s miscalibrated off-distribution (confidently wrong on prompts unlike anything it trained on). Teams mitigate with held-out validation, larger and more diverse preference sets, and sometimes reward-model ensembles whose disagreement flags unreliable regions. A variant — process reward models (PRMs) — score each reasoning step rather than the final answer; see reward models.
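The ensemble idea can be sketched in a few lines: score the same response with several reward models and treat a wide spread as a warning sign. This is a minimal sketch under assumptions: the function names and the `max_spread=0.5` threshold are hypothetical, and real systems choose the threshold empirically and may use a different disagreement measure.

```python
import statistics

def ensemble_score(scores: list[float]) -> tuple[float, float]:
    """Return (mean score, population standard deviation) across the ensemble."""
    return statistics.fmean(scores), statistics.pstdev(scores)

def is_unreliable(scores: list[float], max_spread: float = 0.5) -> bool:
    """Flag a response when ensemble members disagree by more than max_spread."""
    _, spread = ensemble_score(scores)
    return spread > max_spread

# Three hypothetical reward models roughly agree: not flagged.
print(is_unreliable([1.0, 1.1, 0.9]))
# The same three models disagree sharply: flagged as unreliable.
print(is_unreliable([2.0, -1.0, 0.5]))
```

A flagged response might be excluded from the RL update or scored conservatively, for example with the minimum rather than the mean.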
Go deeper: what PPO actually optimizes
PPO maximizes a clipped surrogate objective so each update stays close to the current policy. It scales the advantage by the probability ratio r_t = π_θ(a|s) / π_old(a|s), then clips r_t to [1−ε, 1+ε] so a single update can’t move the policy too far. In RLHF the advantage is driven by the reward-model score minus the per-token KL term. Full derivation on the PPO page.
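The clipping rule is easiest to see for a single action. The sketch below computes one term of the clipped surrogate, taking the minimum of the unclipped and clipped products as in the standard PPO formulation. The function name is illustrative, and `eps=0.2` is a commonly used value assumed here for the example.

```python
import math

def ppo_clipped_term(logp_new: float, logp_old: float, advantage: float, eps: float = 0.2) -> float:
    """One sample's contribution to the PPO clipped surrogate (to be maximized)."""
    # Probability ratio r_t = pi_theta(a|s) / pi_old(a|s), from log-probabilities.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # The minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
print(ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0))
# Ratio 0.5 with a negative advantage: the term is held at (1 - eps) * A.
print(ppo_clipped_term(math.log(0.5), 0.0, advantage=-1.0))
```

In both printed cases the ratio has already left the clip range in the direction the advantage favors, so the term is constant in the ratio and contributes no further gradient.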
Reward hacking and over-optimization
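Because the reward model is only a learned proxy for human preference, optimizing hard against it lets the policy exploit the RM’s errors instead of genuinely improving, a textbook case of Goodhart’s law. Practitioners track how the gold reward (true human preference) diverges from the proxy reward (RM score) as training proceeds: the proxy keeps climbing while the gold reward plateaus and then falls. Lilian Weng’s survey on reward hacking is the best single reference.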
This fragility is also why some researchers argue RLHF isn’t “real” RL at all. In a widely shared take, Andrej Karpathy called it “barely RL” next to systems like AlphaGo that optimize a true environment reward.
How do you know it worked? Evaluating RLHF
You can’t just watch the reward climb — that’s the proxy that gets hacked. Real evaluation triangulates across several lenses:
| Method | What it measures | Watch out for |
|---|---|---|
| Held-out reward | RM score on prompts it didn’t train on | still just the proxy |
| Human win-rate | % of times humans prefer the new model over the SFT baseline | slow, expensive — the gold standard |
| Chatbot Arena (Elo) | crowdsourced head-to-head votes, ranked as Elo | coverage skews to chat-style prompts |
| LLM-as-judge (AlpacaEval, MT-Bench) | automated approximation of human preference | biased toward length and confident style |
| Red-teaming | safety: refusals, jailbreak resistance, harm | adversarial coverage is never complete |
A model can win on automated judges while regressing in real use — which is exactly why labs never trust a single number. This is also the practical flip side of reward hacking: your evaluation has to be harder to game than your reward.
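Two of the numbers in the table are simple to compute once the votes are in. The sketch below shows a head-to-head win rate and the classical Elo expected-score and update formulas. It is illustrative only: counting ties as half a win and the update factor `k=32` are assumptions, and public leaderboards may fit ratings with a different statistical model.

```python
def win_rate(wins: int, losses: int, ties: int = 0) -> float:
    """Share of comparisons won by the new model, counting a tie as half a win."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons recorded")
    return (wins + 0.5 * ties) / total

def elo_expected(rating_a: float, rating_b: float) -> float:
    """Classical Elo probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one vote; score_a is 1 (A wins), 0.5 (tie), or 0 (A loses)."""
    delta = k * (score_a - elo_expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# 60 wins, 30 losses, 10 ties against the SFT baseline.
print(win_rate(60, 30, 10))
# Two equally rated models; A wins one vote and gains k / 2 points.
print(elo_update(1000.0, 1000.0, score_a=1.0))
```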
Beyond PPO: the modern landscape (2024–2026)
The classic PPO recipe works but is heavy — it juggles four models (policy, reference, reward, value) and is finicky. A wave of alternatives reshaped post-training:
| Method | What it does | Drops | Best for |
|---|---|---|---|
| PPO | On-policy RL against a learned reward, KL-regularized | — | The original, most controllable recipe |
| DPO | Turns preferences into a direct classification-style loss on the policy | reward model + RL loop | Simplicity, stability; default for many open models |
| GRPO | Group-relative advantages from sampled answers | value/critic network | Cheaper RL; reasoning models (DeepSeek) |
| Rejection sampling | Keep best-of-N by RM score, fine-tune on those | the RL loop | Simple gains (used in Llama 2) |
| RLAIF | AI feedback from a written “constitution” instead of human labels | human labelers | Scaling the labeling bottleneck |
| RLVR | Reward from a programmatic verifier (tests pass? answer correct?) | the learned reward model | Math, code, checkable reasoning |
Two of these matter most today. DPO showed that the KL-regularized RLHF objective has a closed-form optimal policy, so the reward can be expressed through the policy itself. That collapses the whole reward-model-plus-PPO machinery into a single classification-style loss on the policy: far simpler and more stable, at some cost in control. GRPO keeps the RL loop but throws out the value network, estimating each answer’s advantage by comparing it with the other answers sampled for the same prompt; it’s cheaper to run and underpins DeepSeek’s reasoning models.
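Both ideas fit in a few lines of plain Python. The sketch below shows the DPO loss for one preference pair, computed from summed sequence log-probabilities under the policy and the frozen reference, and a GRPO-style group-relative advantage that standardizes rewards within one prompt's group of samples. The function names, `beta=0.1`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; library implementations differ in details.

```python
import math
import statistics

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value
) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Stable evaluation of -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one prompt's group of sampled answers."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Policy identical to the reference: the DPO loss is log(2), about 0.693.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Four sampled answers, two rewarded: winners get positive advantage, losers negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Note that `dpo_loss` has the same shape as the reward model's pairwise loss, with the scaled log-ratio playing the role of the reward, and that `grpo_advantages` needs no learned value network at all.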
A second axis is where the reward comes from:
- RLHF (learned reward): reward comes from a model trained on human preferences. Best where “good” is a matter of taste: open-ended quality, tone, helpfulness, safety. See reward models.
- RLVR (verifiable reward): reward comes from a programmatic checker (do the unit tests pass? is the math answer correct?). Best for math, code and checkable reasoning; it powers the 2025 reasoning-model boom. See RLVR.
Frontier models increasingly use both: RLVR to sharpen reasoning, RLHF to keep the result helpful and safe. See RL for reasoning.
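A programmatic checker of the kind used for verifiable rewards can be very small. The sketch below rewards a math answer only if it is numerically equal to the reference, so `0.5` and `1/2` both count. It assumes the final answer has already been extracted from the model's output as a bare number; the function name and that extraction step are hypothetical, and production verifiers handle far more answer formats.

```python
from fractions import Fraction

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the answer equals the reference exactly as a number, else 0.0."""
    try:
        # Fraction parses integers, decimals, and "a/b" strings without rounding error.
        predicted = Fraction(model_answer.strip().replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn no reward
    return 1.0 if predicted == Fraction(reference_answer) else 0.0

print(verifiable_reward("0.5", "1/2"))      # equivalent forms match
print(verifiable_reward("1,000", "1000"))   # thousands separators are ignored
print(verifiable_reward("about 3", "3"))    # not a bare number, so no reward
```

Unlike a learned reward model, this check cannot be flattered by style, which is a large part of its appeal for math and code.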
RLAIF and Constitutional AI
Human labeling is the bottleneck — so why not let an AI do it? RLAIF (RL from AI Feedback) replaces human preference labels with judgments from a capable model, guided by a written set of principles. Anthropic’s Constitutional AI is the canonical example: the model critiques and revises its own answers against a “constitution” (be helpful, avoid harm), generating the preference data that trains the reward model. It scales the labeling step and makes the values explicit and auditable — at the cost of inheriting the judge model’s blind spots. Most modern pipelines blend human and AI feedback rather than choosing one.
Where RLHF is used
Virtually every frontier assistant uses RLHF or a direct-alignment descendant in post-training:
| Model family | Lab | How alignment is done |
|---|---|---|
| ChatGPT / GPT-4o | OpenAI | RLHF with PPO — the original production recipe |
| Claude | Anthropic | RLHF + RLAIF / Constitutional AI |
| Llama 2 / 3 | Meta | RLHF (rejection sampling + PPO); DPO in later work |
| Gemini | Google DeepMind | RLHF as part of post-training |
| DeepSeek-V3 / R1 | DeepSeek | GRPO + RLVR for reasoning, preference tuning for chat |
Limitations and open problems
- Human labeling doesn’t scale cheaply — quality preference data is expensive, and annotators disagree.
- Reward hacking — the proxy can be gamed (above).
- Sycophancy & homogeneity — optimizing aggregate preference makes models agree too readily and collapses output diversity. The NeurIPS 2025 Best Paper “Artificial Hivemind” documents this and shows reward models are poorly calibrated to the real spread of human ratings.
- Bias & distribution shift — the RM reflects its labelers and degrades off-distribution.
RLHF in practice
Modern post-training stacks combine these rather than picking one: SFT, then a mix of DPO/PPO/GRPO, rejection sampling, and (for reasoning) RLVR. The open-source TRL library implements SFT, reward modeling, PPO, DPO and GRPO; OpenRLHF and veRL target larger scale. Focused behaviors need ~tens of thousands of comparisons; a general assistant needs far more.
Building RL environments, reward pipelines, and human-preference data at production scale is its own industry — see the companies building RLHF environments and data.
Researcher takes
Nathan Lambert frames the evolution of RLHF pipelines as another instance of the bitter lesson: the gains come from scaling and removing humans from the loop.
Frequently asked questions
Is RLHF the same as reinforcement learning?
Not quite. It borrows RL’s machinery — a policy optimized against a reward — but the reward is a learned model of human preference, and training stays on a short KL leash. Some researchers (see Karpathy above) argue it’s “barely RL” next to systems like AlphaGo that optimize a true environment reward.
How is RLHF different from supervised fine-tuning?
SFT imitates demonstrations (“copy these good answers”). RLHF optimizes against a preference signal (“produce answers humans prefer”), capturing quality and safety that demonstrations can’t. RLHF almost always runs after SFT.
Is DPO replacing RLHF?
DPO replaces the PPO step with a simpler direct loss and is now a default for many open models. But RLHF broadly — aligning a model to human preference — is alive and well; PPO, DPO and GRPO are all ways to do it.
Do reasoning models like o1 and R1 use RLHF?
They lean on RLVR (verifiable rewards) to sharpen reasoning, then still use preference-based alignment to stay helpful and safe — both, not either/or. See RL for reasoning.
Key papers
- Deep RL from Human Preferences — Christiano et al., 2017 — the conceptual seed.
- Learning to Summarize from Human Feedback — Stiennon et al., 2020 — first full RM+PPO pipeline for language.
- InstructGPT — Ouyang et al., 2022 — the canonical RLHF-for-LLMs paper.
- Constitutional AI — Bai et al., 2022 — RLAIF.
- Llama 2 — Touvron et al., 2023 — a detailed open RLHF recipe.
- Direct Preference Optimization — Rafailov et al., 2023 — the RM-free alternative.
Related
Reward models · PPO · DPO & preference optimization · GRPO · RLVR · RL for reasoning · What is reinforcement learning?