reinforcement-learning.com

DPO and Preference Optimization

What DPO is, how it replaces the RLHF reward model and PPO loop with one classification loss, the math behind it, and how IPO, KTO, ORPO and SimPO compare in 2026.

Updated 2026-06-07 · 17 min read
Key takeaways
  • Preference optimization tunes a model from comparisons (answer A beats B) instead of a numeric reward — DPO is the breakthrough that does it without a reward model or RL loop.
  • DPO's trick: the optimal RLHF policy has a closed form, so 'your language model is secretly a reward model' and alignment collapses into one classification-style loss over chosen/rejected pairs.
  • The frozen reference model and the β knob bake in an implicit KL constraint — the same leash RLHF uses, without the PPO machinery.
  • Variants fix specific weaknesses: IPO (overfitting), KTO (no pairs needed), ORPO/SimPO (no reference model) — but tuned DPO is still a very strong default in 2026.

What is preference optimization?

Preference optimization is a family of methods for fine-tuning a language model to behave the way humans prefer by training directly on comparisons — “response A is better than response B” — instead of a hand-written numeric score. It’s the same goal as RLHF: turn cheap human judgments of which answer is better into a training signal. The difference is mechanism.

Direct Preference Optimization (DPO) is the technique that made this dramatically simpler. Classic RLHF needs two heavy stages: train a separate reward model, then optimize the policy against it with reinforcement learning (usually PPO). DPO showed that both stages can be folded into a single supervised loss on the policy — no reward model, no sampling, no RL loop. IPO, KTO, ORPO and SimPO are popular descendants that each tweak DPO’s loss to fix a specific weakness.


Where it fits: SFT, then alignment

Post-training a base model is a pipeline. Pretraining gives broad knowledge. Supervised fine-tuning (SFT) teaches the model to follow instructions by imitating good demonstrations. But SFT can only copy answers — it can’t express that a helpful, honest answer should beat a plausible-but-useless one. That’s a preference, not a label, and it’s what the alignment stage handles.

DPO and its variants live in that alignment stage, almost always after SFT. The SFT model becomes the starting point — and, for most variants, also the reference model that keeps training from drifting too far.

[Figure: pipeline diagram. Pretraining (broad knowledge) → SFT (follow instructions) → Preference optimization (align to human taste; DPO / IPO / KTO / ORPO / SimPO) → Deploy (assistant).]
Where DPO sits: pretraining gives knowledge, SFT teaches format, and preference optimization aligns the model to human taste. DPO replaces RLHF's reward-model + PPO stages with a single loss.

The problem DPO solves: classic RLHF is heavy

The original RLHF recipe is powerful but fiddly. To optimize a policy you juggle four models in memory — the policy, a frozen reference, the reward model, and PPO’s value/critic network — and run an on-policy RL loop that samples fresh generations every step.

  • 4 models: PPO RLHF juggles policy, reference, reward, value
  • 2 → 1: DPO collapses RM-training + RL into one loss
  • 0: reward models DPO trains separately

That machinery brings three pains: it’s expensive (more GPUs, more memory), it’s unstable (PPO is notoriously sensitive to hyperparameters), and it’s complex to implement correctly. The reward model can also be reward-hacked — the policy learns to exploit quirks the RM loves but humans don’t. DPO’s pitch: get most of the benefit while deleting the reward model and the RL loop entirely.

How DPO works in plain English

DPO starts from the same data RLHF uses — (prompt, chosen, rejected) triples — but skips straight to tuning the policy. The intuition: make the chosen answer more likely and the rejected answer less likely, but only relative to where the reference model started.

1. Start from the SFT model

Take the supervised-fine-tuned model. Make a frozen copy of it — the reference policy π_ref. The trainable copy is the policy π_θ.

2. Score each completion against the reference

For a chosen and a rejected response, compute the log-ratio of how likely the policy makes it versus how likely the reference makes it. This log-ratio is an implicit reward — no separate reward model needed.

3. Push the margin with a classification loss

Increase the implicit-reward margin between chosen and rejected via a logistic (sigmoid) loss — exactly like training a binary classifier. Gradient descent does the rest; there’s no sampling and no RL.

A subtle but important detail from practice: DPO often achieves the margin mostly by lowering the probability of the rejected response, not by raising the chosen one. That’s fine for ranking, but it’s the seed of the failure modes we’ll see later.
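Steps 2 and 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not TRL's implementation: the per-token log-probabilities are made-up numbers, and `sequence_logprob` and `implicit_reward` are hypothetical helper names.

```python
def sequence_logprob(token_logprobs):
    """Log-probability of a whole completion: the sum of its per-token log-probs."""
    return sum(token_logprobs)


def implicit_reward(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), computed in log space."""
    return beta * (sequence_logprob(policy_token_logprobs)
                   - sequence_logprob(ref_token_logprobs))


# Made-up per-token log-probs for one chosen and one rejected completion.
chosen_policy, chosen_ref = [-0.9, -1.1, -0.7], [-1.0, -1.2, -0.8]
rejected_policy, rejected_ref = [-2.0, -1.9, -2.4], [-1.5, -1.6, -1.8]

r_chosen = implicit_reward(chosen_policy, chosen_ref)        # policy likes it more than ref
r_rejected = implicit_reward(rejected_policy, rejected_ref)  # policy likes it less than ref
margin = r_chosen - r_rejected                               # what the DPO loss pushes up
print(r_chosen, r_rejected, margin)
```

In this made-up example most of the margin (about 0.14 of 0.17) comes from the rejected completion becoming less likely than under the reference, which mirrors the pattern described above.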

Prompt: “Explain photosynthesis to a 6-year-old.”
✓ chosen: Plants are like tiny chefs. They take in sunlight, water and air, and cook it into food so they can grow, and the leftover they breathe out is the oxygen we need.
✗ rejected: Photosynthesis is the biochemical process by which chlorophyll-bearing organisms convert photonic energy into glucose via the Calvin-Benson cycle.

DPO never sees a numeric score for either answer — only the ordering. From thousands of such orderings it learns the policy directly.

Under the hood: Bradley-Terry and the implicit reward

The statistical foundation is the Bradley-Terry model, the same one RLHF reward models use. It says the probability that response $y_1$ beats $y_2$ is a sigmoid of their reward difference:

$$P(y_1 \succ y_2 \mid x) = \sigma\big(r(x,y_1) - r(x,y_2)\big)$$

RLHF fits a reward model $r$ to this, then optimizes the KL-regularized objective

$$\max_{\pi}\; \mathbb{E}_{x,\,y\sim\pi}\big[\,r(x,y)\,\big]\;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)$$

The DPO insight is that this objective has a known closed-form solution. The optimal policy is

$$\pi^{*}(y\mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big)$$

Rearranging for $r$ shows the reward is implied by the policy itself:

$$r(x,y) = \beta \,\log\frac{\pi^{*}(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x)$$

This is the “your language model is secretly a reward model” line. Substitute this expression back into Bradley-Terry and the intractable partition function $Z(x)$ cancels (it’s the same for both responses to a prompt). You’re left with a loss in the policy’s parameters only, with no reward model to train.

Go deeper: why the partition function cancels

$Z(x) = \sum_y \pi_{\text{ref}}(y\mid x)\exp(r(x,y)/\beta)$ is a sum over all possible responses, which is computationally hopeless. But Bradley-Terry only ever uses the difference $r(x, y_{\text{chosen}}) - r(x, y_{\text{rejected}})$. Both terms carry the identical $+\beta\log Z(x)$, so it subtracts away. That cancellation is the whole trick: it turns an intractable normalization into a constant you never have to compute.
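The cancellation can be checked numerically on a toy problem. The sketch below assumes a response space of only three completions with made-up reference probabilities and rewards, so that $Z(x)$ can actually be summed; with a real language model that sum is not computable, which is the whole point.

```python
import math

beta = 0.5
# Toy "response space" of three completions for one prompt (made-up numbers).
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 1.0, "b": 2.0, "c": 0.0}

# Partition function Z(x) = sum_y pi_ref(y|x) * exp(r(x,y) / beta).
Z = sum(pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref)

# Closed-form optimal policy pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x).
pi_star = {y: pi_ref[y] * math.exp(reward[y] / beta) / Z for y in pi_ref}


def log_ratio_reward(y):
    """beta * log(pi*(y|x) / pi_ref(y|x)), i.e. the reward minus beta * log Z."""
    return beta * math.log(pi_star[y] / pi_ref[y])


# Every implicit reward is off from the true reward by the same constant ...
offset = beta * math.log(Z)
# ... so the difference between two responses recovers the true reward gap.
gap_true = reward["b"] - reward["a"]
gap_implicit = log_ratio_reward("b") - log_ratio_reward("a")
print(gap_true, gap_implicit)
```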

The DPO loss, the reference model, and β

Plugging the implicit reward into Bradley-Terry gives the DPO loss in its canonical form (this is exactly what Hugging Face TRL implements as the default sigmoid loss):

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$

where $y_w$ is the chosen (winning) response and $y_l$ the rejected (losing) one.
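For a single pair, the loss above reduces to a few arithmetic steps on four sequence-level log-probabilities. The following is a dependency-free sketch of that formula, not TRL's code; `dpo_loss` and its argument names are illustrative, and a real implementation would operate on batched tensors.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log pi_theta/pi_ref for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(logits)


# At initialization the policy equals the reference, so the loss is log(2).
loss_at_init = dpo_loss(-3.0, -5.0, -3.0, -5.0)
# After the policy separates the pair, the loss drops below log(2).
loss_after = dpo_loss(-2.5, -7.0, -3.0, -5.0)
print(loss_at_init, loss_after)
```

Starting from log 2 is a useful sanity check when wiring up a training run: the first logged loss should sit near 0.693 if the policy and reference really are the same model.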

Two pieces carry the weight of the whole method:

The reference model π_ref

A frozen copy of the SFT model. Every log-probability is measured relative to it, which anchors the policy and supplies an implicit KL constraint — the same leash RLHF gets from its explicit KL penalty, baked into the loss for free. The cost: you hold a second model in memory.

The temperature β

Controls how hard the policy may move from the reference. Small β = a loose leash, the policy can change a lot (and overfit/reward-hack). Large β = a tight leash, conservative updates. Typical values sit around 0.1; it’s the single most important DPO knob.
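The leash metaphor can be made concrete on a toy distribution. The sketch below does not run DPO; it evaluates the closed-form optimum of the KL-regularized objective for several values of β and measures how far that optimum sits from the reference. The three-response distribution and rewards are made-up numbers.

```python
import math


def optimal_policy(pi_ref, reward, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


pi_ref = [0.5, 0.3, 0.2]   # made-up reference probabilities
reward = [1.0, 2.0, 0.0]   # made-up rewards

kl_by_beta = {
    beta: kl_divergence(optimal_policy(pi_ref, reward, beta), pi_ref)
    for beta in (0.1, 0.5, 2.0)
}
for beta, kl in kl_by_beta.items():
    print(f"beta={beta}: KL(pi* || pi_ref) = {kl:.3f}")
```

Smaller β lets the optimum collapse onto the highest-reward response (large KL), while larger β keeps it close to the reference (small KL).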

What the data looks like

DPO’s appeal is partly that its data format is trivially simple — a prompt plus two completions, one marked chosen and one rejected:

# Standard preference example (HF TRL format)
{
  "prompt":   "Explain photosynthesis to a 6-year-old.",
  "chosen":   "Plants are like tiny chefs. They take in sunlight...",
  "rejected": "Photosynthesis is the biochemical process by which...",
}

Public preference datasets in this shape power most open models: UltraFeedback (binarized into chosen/rejected), Anthropic’s HH-RLHF (helpful/harmless), and many task-specific sets. The comparisons can come from humans or, increasingly, from a strong “judge” model (RLAIF style). Quality matters more than quantity — a few tens of thousands of clean, on-distribution pairs typically beat a noisy hundred thousand.
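When the raw data is a set of scored completions per prompt (from raters or a judge model), it has to be reduced to chosen/rejected pairs first. The helper below is a hypothetical sketch of one simple convention, best-scored versus worst-scored; it is not how any particular public dataset was built, and other pairing rules are equally valid.

```python
def to_preference_pair(prompt, scored_completions):
    """Turn [(completion, score), ...] into one prompt/chosen/rejected record.

    Returns None when there is no usable ordering: fewer than two
    completions, or the best and worst scores are tied.
    """
    if len(scored_completions) < 2:
        return None
    ranked = sorted(scored_completions, key=lambda item: item[1], reverse=True)
    (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
    if best_score == worst_score:
        return None  # a tie carries no preference signal
    return {"prompt": prompt, "chosen": best, "rejected": worst}


pair = to_preference_pair(
    "Explain photosynthesis to a 6-year-old.",
    [("Plants are like tiny chefs...", 9.0),
     ("Plants make food from light.", 6.5),
     ("Photosynthesis is the biochemical process...", 2.0)],
)
print(pair)
```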

The variants and what each fixes

DPO works, but it has sharp edges: it can overfit when preferences are near-deterministic, it requires paired data, and it requires a reference model (extra memory). Each major variant targets one of these. The cleanest way to see the differences is to put the five losses side by side. Notice that they only change what goes inside the loss, not the data philosophy.

| Variant | Core change vs DPO | Fixes |
| --- | --- | --- |
| DPO | $-\log\sigma\big(\beta\,(\hat r_w - \hat r_l)\big)$, where $\hat r = \log\frac{\pi_\theta}{\pi_{\text{ref}}}$ | the baseline |
| IPO | Replaces the logsigmoid with a squared loss toward a finite target margin: $\big((\hat r_w - \hat r_l) - \tfrac{1}{2\beta}\big)^2$ | overfitting when preferences are deterministic |
| KTO | Drops pairs; a prospect-theory (HALO) utility on each example labeled desirable/undesirable | needing paired data |
| ORPO | Appends a log-odds-ratio penalty to the plain SFT loss; one stage, no reference | the reference model and the two-stage pipeline |
| SimPO | Reference-free reward = length-normalized average log-prob + a target margin γ | reference memory and length bias |

IPO — stopping DPO from overfitting

IPO (Identity Preference Optimization), from Azar et al., diagnoses a real DPO weakness: when a preference is near-deterministic (chosen always beats rejected), the sigmoid loss keeps pushing the margin toward infinity, ignoring the KL constraint and overfitting. IPO swaps the logsigmoid for a bounded regression target: it drives the margin toward a finite value $\tfrac{1}{2\beta}$ instead of $\infty$, so the implicit KL term can’t be steamrolled. It’s available in TRL as loss_type="ipo".
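A per-pair sketch of the IPO objective makes the finite target visible. This follows the squared-loss form described above and is not TRL's implementation; `ipo_loss` and its arguments are illustrative names, and the log-probabilities are made up.

```python
def ipo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Squared distance between the log-ratio margin and the target 1 / (2 * beta)."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    target = 1.0 / (2.0 * beta)
    return (margin - target) ** 2


# With beta = 0.1 the target margin is 5.
loss_below_target = ipo_loss(-1.0, -3.0, -1.0, -1.0)   # margin 2: still being pushed up
loss_on_target = ipo_loss(-1.0, -6.0, -1.0, -1.0)      # margin 5: loss is zero
loss_overshoot = ipo_loss(-1.0, -11.0, -1.0, -1.0)     # margin 10: now penalized
print(loss_below_target, loss_on_target, loss_overshoot)
```

Unlike the sigmoid loss, overshooting the target is penalized, so there is no incentive to grow the margin without bound.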

KTO — aligning from thumbs-up / thumbs-down

KTO (Kahneman-Tversky Optimization), from Ethayarajh et al., removes DPO’s hardest data requirement: paired comparisons. In production, paired data is scarce, but a stream of binary signals — a thumbs-up or thumbs-down per response — is cheap and abundant. KTO frames alignment as a HALO (Human-Aware Loss) using a Kahneman-Tversky prospect-theory utility, weighting losses against gains asymmetrically (humans feel losses more sharply). It still uses a reference model, but each example only needs a desirable/undesirable label, no partner response.
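A simplified per-example sketch of a KTO-style loss is shown below. It assumes the general shape from the KTO paper, a sigmoid utility around a reference point with separate weights for desirable and undesirable examples. In the paper the reference point is a batch-level KL estimate; here it is passed in as a plain number (`kl_ref_point`), and all names and default values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def kto_example_loss(policy_logp, ref_logp, desirable, kl_ref_point=0.0,
                     beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for ONE unpaired example labeled desirable (True) or undesirable (False)."""
    log_ratio = policy_logp - ref_logp  # implicit reward, up to the factor beta
    if desirable:
        # Gain side: loss shrinks as the policy raises this response above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_ref_point)))
    # Loss side: loss shrinks as the policy lowers this response below the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref_point - log_ratio)))


thumbs_up_raised = kto_example_loss(-10.0, -15.0, desirable=True)
thumbs_down_raised = kto_example_loss(-10.0, -15.0, desirable=False)
print(thumbs_up_raised, thumbs_down_raised)
```

Setting `lambda_u` larger than `lambda_d` would weight undesirable examples more heavily, which is one way to express the loss-aversion asymmetry.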

ORPO — one stage, no reference model

ORPO (Odds-Ratio Preference Optimization), from Hong et al., is the most radical simplification: it fuses SFT and alignment into a single stage and drops the reference model entirely. It appends a small log-odds-ratio penalty — which directly discourages the rejected response — to the ordinary negative-log-likelihood SFT loss. No separate alignment phase, no frozen reference in memory, and no SFT→DPO distribution shift between stages.
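The sketch below shows the shape of that combined objective for one pair: an SFT term on the chosen response plus a weighted log-odds-ratio penalty. It assumes length-averaged log-probabilities as inputs (so that `exp(avg_logp)` is a valid probability below 1); the weight `lam` and the function names are illustrative, not TRL's API.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """SFT negative log-likelihood on the chosen response plus an odds-ratio penalty."""
    sft_nll = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    penalty = math.log1p(math.exp(-log_odds_ratio))  # equals -log(sigmoid(log_odds_ratio))
    return sft_nll + lam * penalty


tied = orpo_loss(-0.5, -0.5)        # no preference expressed yet
separated = orpo_loss(-0.5, -2.0)   # rejected response is now much less likely
print(tied, separated)
```

No reference log-probabilities appear anywhere in the function, which is what makes the method reference-free.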

SimPO and cDPO — reference-free and noise-robust

SimPO (Meng et al., NeurIPS 2024) also goes reference-free: it defines the implicit reward as the length-normalized average log-probability of a sequence and adds a target reward margin γ. Length-normalization is deliberate — it directly attacks DPO’s length bias (below). SimPO reported strong AlpacaEval 2 results and is exposed in TRL as loss_type="sigmoid_norm". cDPO (conservative DPO) takes a different angle: it assumes preference labels are noisy (some fraction are flipped) and applies label smoothing so the loss can’t be dominated by mislabeled pairs — TRL implements the related idea via loss_type="robust" with a label_smoothing parameter.
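Both ideas are small changes to the pairwise sigmoid loss, sketched below for a single pair. These follow the published loss forms rather than any library's code; the function names and the default `beta`, `gamma` and `label_smoothing` values are illustrative.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(chosen_logp, chosen_len, rejected_logp, rejected_len,
               beta=2.0, gamma=0.5):
    """Reference-free: reward is the average per-token log-prob, with target margin gamma."""
    chosen_reward = beta * chosen_logp / chosen_len
    rejected_reward = beta * rejected_logp / rejected_len
    return -log_sigmoid(chosen_reward - rejected_reward - gamma)


def cdpo_loss(dpo_logits, label_smoothing=0.1):
    """DPO loss with label smoothing.

    dpo_logits = beta * (chosen log-ratio - rejected log-ratio), as in plain DPO.
    label_smoothing is the assumed probability that a label is flipped.
    """
    return (-(1.0 - label_smoothing) * log_sigmoid(dpo_logits)
            - label_smoothing * log_sigmoid(-dpo_logits))


# Doubling both lengths and total log-probs leaves the SimPO loss unchanged.
short_pair = simpo_loss(-10.0, 10, -20.0, 10)
long_pair = simpo_loss(-20.0, 20, -40.0, 20)
print(short_pair, long_pair)
# With smoothing, an extreme margin costs more than a moderate one.
print(cdpo_loss(2.0), cdpo_loss(20.0))
```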

Go deeper: the reference-model memory question

DPO, IPO and KTO all need a frozen reference model to compute the denominator log-probs. In practice that means a second full copy of the model in GPU memory (or precomputing reference log-probs once and caching them). ORPO and SimPO drop the reference entirely — ORPO by anchoring to the SFT loss itself, SimPO by using a raw length-normalized log-prob as the reward. That’s a real memory and complexity win, but it removes the explicit anchor that keeps DPO close to a known-good starting point, which is partly why results are mixed.
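A back-of-the-envelope estimate shows what the second copy costs. The sketch counts weights only and assumes a hypothetical 7-billion-parameter model stored in 16-bit precision; gradients, optimizer state and activations for the trainable policy come on top and are not modeled here.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 7e9  # hypothetical 7B-parameter model, 2 bytes per parameter

reference_free = weights_gb(params)       # policy weights only (ORPO / SimPO style)
with_reference = 2 * weights_gb(params)   # policy + frozen reference (DPO / IPO / KTO)
print(reference_free, with_reference)
```

Caching the reference log-probs once, as mentioned above, removes the reference copy from training-time memory at the price of an extra pass over the dataset.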

When to use which

There is no universal winner, and benchmark leaderboards are noisy. The honest 2026 picture: a well-tuned DPO (or IPO when overfitting bites) remains a very strong default, with the variants winning in specific situations rather than across the board.

| Your situation | Reach for |
| --- | --- |
| Standard paired preference data, want a reliable baseline | DPO |
| Preferences are near-deterministic / DPO is overfitting | IPO |
| Only binary thumbs-up/down signals, no pairs | KTO |
| Want to skip a separate alignment stage / save reference memory | ORPO |
| Memory-constrained and fighting length bias | SimPO |
| Labels are known to be noisy | cDPO / robust DPO |

Failure modes to watch

DPO’s simplicity hides some subtle traps that most introductory explainers skip.

[Figure: log-prob versus training steps for the chosen and rejected responses. The margin grows, but the chosen log-prob falls too.]
Likelihood displacement: DPO widens the chosen-minus-rejected margin, but can do so by pushing BOTH probabilities down, sometimes lowering the chosen response's likelihood and even shifting mass to unintended outputs.
  • Likelihood displacement. DPO can lower the probability of the preferred response while widening the margin — and in the worst case shift probability mass to opposite-meaning or even unsafe outputs. Razin et al.’s Unintentional Unalignment (ICLR 2025) documents this; the fix is careful data curation (avoid chosen/rejected pairs that are too similar) and watching the logps/chosen metric, not just the margin.
  • Length / verbosity bias. Because longer sequences accumulate more log-prob terms, DPO tends to reward longer answers regardless of quality. LD-DPO and SimPO’s length-normalization both target this directly.
  • Over-optimization. With too small a β the policy drifts far from the reference and quality degrades — the offline cousin of RLHF reward over-optimization.
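The first failure mode is easy to monitor if the chosen log-prob and the margin are logged during training. The helper below is a hypothetical sketch that scans two such logged series (made-up values here) and flags steps where the margin rose while the chosen log-prob fell; the function name and the `min_drop` threshold are assumptions, not part of any library.

```python
def flag_likelihood_displacement(chosen_logps, margins, min_drop=0.0):
    """Return indices of logged steps where the margin grew but the chosen log-prob fell.

    chosen_logps and margins are equal-length lists of logged values, one per
    evaluation step. min_drop ignores drops smaller than this amount.
    """
    flagged = []
    for step in range(1, len(margins)):
        margin_up = margins[step] > margins[step - 1]
        chosen_down = chosen_logps[step] < chosen_logps[step - 1] - min_drop
        if margin_up and chosen_down:
            flagged.append(step)
    return flagged


# Made-up logs: the margin improves throughout, but from step 2 the chosen log-prob sinks.
chosen_logps = [-50.0, -49.0, -52.0, -55.0]
margins = [0.0, 0.5, 1.0, 1.6]
suspicious_steps = flag_likelihood_displacement(chosen_logps, margins)
print(suspicious_steps)
```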

Online vs offline, and iterative DPO

The deepest difference between DPO and PPO isn’t the loss — it’s where the data comes from.

  • Offline (vanilla DPO): trains on a fixed dataset of pairs. The model never generates fresh samples during training. Simple and cheap, but the model can only learn from a static snapshot of preferences.
  • Online (PPO-style): the policy generates new responses each step, which are scored and learned from. More expensive, but the signal tracks the model’s current behavior — and it’s harder to over-optimize a stale dataset.

Iterative DPO is the bridge: alternate between generating fresh responses from the current policy, labeling them (by humans or a judge/reward model) into new pairs, and running another DPO round. This recovers much of online RL’s benefit while keeping DPO’s simple loss — and it’s how several open models were tuned. Pushed further, you arrive back at full on-policy RL like PPO or GRPO.
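The alternation can be sketched as a short loop. Everything pluggable here is hypothetical: `generate`, `judge` and `dpo_round` stand in for sampling from the current policy, a human or model judge, and an ordinary offline DPO training run. The toy stand-ins exist only so the loop executes end to end.

```python
import itertools


def iterative_dpo(policy, prompts, generate, judge, dpo_round, num_rounds=3):
    """Alternate on-policy data collection with offline DPO rounds."""
    for _ in range(num_rounds):
        pairs = []
        for prompt in prompts:
            first = generate(policy, prompt)
            second = generate(policy, prompt)
            if first == second:
                continue  # identical samples carry no preference signal
            if judge(prompt, first, second):  # True means `first` is preferred
                chosen, rejected = first, second
            else:
                chosen, rejected = second, first
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
        policy = dpo_round(policy, pairs)  # one ordinary offline DPO run
    return policy


# Toy stand-ins; a real setup would sample from an LLM, call a judge or
# reward model, and launch a DPO training job.
_draft_ids = itertools.count()
pairs_per_round = []


def toy_generate(policy, prompt):
    return f"{prompt}: draft {next(_draft_ids)} from policy v{policy}"


def toy_judge(prompt, first, second):
    return first < second  # arbitrary deterministic rule


def toy_dpo_round(policy, pairs):
    pairs_per_round.append(len(pairs))
    return policy + 1  # pretend training produced the next policy version


final_policy = iterative_dpo(0, ["p1", "p2"], toy_generate, toy_judge, toy_dpo_round)
print(final_policy, pairs_per_round)
```

Each round trains on pairs generated by the policy from the previous round, which is what keeps the data close to the model's current behavior.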

Go deeper: DPO vs GRPO in modern pipelines

DPO and GRPO solve different problems. DPO is offline, needs preference pairs, and is ideal for style and helpfulness alignment. GRPO is online, samples a group of responses per prompt and uses their relative scores as advantage — it shines with verifiable rewards (RLVR) for math and code. Many 2025-2026 stacks use both: DPO/IPO for taste, GRPO+RLVR for reasoning. They’re complementary, not competing.
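The "relative scores as advantage" step is simple enough to show directly. The sketch below standardizes the rewards of one group of sampled responses; it uses the population standard deviation and a small `eps` for stability, both of which are choices of this sketch rather than a statement about any specific implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sampled response relative to its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt, scored 1 if the final answer verifies.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every sample gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

No preference pairs and no learned critic are involved: the group itself serves as the baseline, which is why the method pairs naturally with verifiable rewards.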

How to run it in practice

The standard toolkit is Hugging Face TRL, whose DPOTrainer covers DPO and — via loss_type — IPO, SimPO, robust/cDPO and a dozen more, plus dedicated KTOTrainer and ORPOTrainer. A minimal DPO run is genuinely a few lines:

from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen3-0.6B",          # your SFT model
    args=DPOConfig(beta=0.1, loss_type="sigmoid"),
    train_dataset=dataset,            # prompt / chosen / rejected
)
trainer.train()

If ref_model is left out, TRL automatically uses the initial state of model as the frozen reference. Switching to IPO is just loss_type="ipo"; SimPO is loss_type="sigmoid_norm". The Hugging Face Alignment Handbook packages full SFT→DPO recipes (the Zephyr models came out of it), and TRL’s DPO docs list every loss variant.

Building the preference data and reward pipelines behind this — at production scale, with clean comparisons and good coverage — is its own industry; see the preference-data and RL environment companies.

A short history

2022
RLHF goes mainstream
InstructGPT and ChatGPT prove reward-model + PPO alignment; the reward model + RL loop becomes the standard but heavy recipe.
2023
DPO
Rafailov et al. show the optimal RLHF policy has a closed form — alignment becomes one classification loss. “Your language model is secretly a reward model.”
2023
IPO
Azar et al. give a general theoretical paradigm and a bounded objective that stops DPO overfitting near-deterministic preferences.
2024
KTO & ORPO
KTO aligns from binary thumbs-up/down via prospect theory; ORPO fuses SFT and alignment into one reference-free stage.
2024
SimPO & failure-mode research
SimPO drops the reference with a length-normalized reward; likelihood-displacement and length-bias studies expose DPO’s edges.
2025–26
DPO as a default; online resurgence
DPO/iterative DPO post-trains many open models (Llama 3, Zephyr lineage), running alongside GRPO + RLVR for reasoning.

DPO vs RLHF/PPO at a glance

| Dimension | DPO | RLHF with PPO |
| --- | --- | --- |
| Separate reward model | No (implicit) | Yes |
| RL loop / on-policy sampling | No | Yes |
| Models in memory | 2 (policy + reference) | 4 (policy, ref, reward, value) |
| Stability | High (supervised loss) | Lower (PPO is finicky) |
| Data | Offline preference pairs | Online generations + RM |
| Controllability | Less (one β knob) | More (full reward shaping) |
| Best for | Simple, stable open-model alignment | Maximum control; online signal |


Frequently asked questions

Is DPO reinforcement learning?

Not in the usual sense. DPO optimizes the same objective as RLHF (a KL-regularized preference objective), but solves it with a supervised classification loss — no policy rollouts, no reward model, no RL loop. It’s “RL-free” alignment that targets the RL solution analytically.

Does DPO still need a reference model?

Vanilla DPO, IPO and KTO do — a frozen copy of the SFT model whose log-probs anchor the policy and supply the implicit KL constraint. ORPO and SimPO are reference-free, trading that anchor for lower memory use.

DPO or one of the variants — which is best in 2026?

There’s no universal winner. A well-tuned DPO (or IPO if it overfits) is still a strong default. Pick KTO for unpaired binary data, ORPO to skip a stage, SimPO when memory and length bias matter. Tune β first — many “wins” are really tuning differences.

Why does DPO sometimes make the chosen response less likely?

Because the loss only cares about the margin between chosen and rejected, it can widen that gap by pushing the rejected probability down faster than the chosen — sometimes lowering the chosen too. This is likelihood displacement; watch logps/chosen, not just the margin, and avoid near-identical chosen/rejected pairs.
