reinforcement-learning.com

Constitutional AI & RLAIF: Alignment from AI Feedback

How Constitutional AI and RLAIF replace human preference labels with a written constitution and AI self-critique — the two-phase pipeline, the math, results and limits.

Updated 2026-06-08 · 15 min read

Key takeaways
  • Constitutional AI (CAI) aligns a model using a written set of principles — a 'constitution' — instead of large-scale human harm labels.
  • It has two phases: supervised self-critique-and-revision (SL-CAI), then RL from AI Feedback (RLAIF) where an AI judges which response better follows the constitution.
  • RLAIF swaps human preference labelers for a model labeler — scaling the bottleneck and making the values explicit and auditable.
  • It powers Anthropic's Claude and produced a 'Pareto improvement': more harmless without becoming evasive — but it inherits the judge model's blind spots.

What is Constitutional AI?

Constitutional AI (CAI) is a method for training a helpful and harmless assistant without relying on tens of thousands of human labels for what counts as harmful. Instead of asking people to rank thousands of responses, you write down a short list of principles — a constitution — and let the model use those principles to critique and improve its own answers, and later to judge which of two answers is better. The reinforcement-learning step that learns from those AI judgments is called RLAIF: Reinforcement Learning from AI Feedback.

It is the direct descendant of RLHF. RLHF learns a reward model from human preference comparisons; CAI keeps the same machinery but replaces the human labeler — for the harmlessness signal — with an AI labeler guided by written rules. The values move out of the labelers’ heads and into a document you can read, debate and edit.

[Figure: the two-phase pipeline. Phase 1, supervised (SL-CAI): harmful prompt → initial response → critique + revise (constitution) → fine-tune → SL-CAI, repeating the critique-revise loop. Phase 2, RLAIF: SL-CAI policy samples 2 answers → AI judge picks the better one (constitution) → preference model (reward r_φ) → RL (PPO) + KL penalty → policy update; the improved policy generates the next batch.]
Constitutional AI has two phases. The supervised phase teaches the model to self-revise against the constitution; the RLAIF phase trains a preference model from AI comparisons, then optimizes the policy with RL.

Why Constitutional AI exists

Standard RLHF hits two walls when the goal is harmlessness. First, the human-labeling bottleneck: collecting high-quality preference comparisons over harmful prompts is costly and exposes annotators to disturbing content. Second, opacity — the values live implicitly in thousands of individual labeler choices, so you cannot easily audit, reproduce or contest them.

Anthropic’s Constitutional AI paper (Bai et al., December 2022) tackled both. It trained a harmless-but-non-evasive assistant using zero human labels for harmlessness — only a constitution plus the human helpfulness data already collected. The headline result was a Pareto improvement: the CAI model was simultaneously more harmless and more helpful than an RLHF baseline, in part because it learned to explain its objections to a harmful request instead of giving a flat, unhelpful refusal.

  • 0: human harmlessness labels needed (the constitution replaces them)
  • 2 phases: supervised self-revision → RLAIF
  • ~16: principles in the original CAI constitution

A parallel, broader result came from Google’s RLAIF vs. RLHF study (Lee et al., 2023): across summarization and dialogue, RLAIF matched RLHF on human evaluations — direct evidence that an off-the-shelf LLM can stand in for human preference labelers without a quality cliff.

How Constitutional AI works

1. Phase 1: Supervised self-critique and revision (SL-CAI)

Start from a helpful-only RLHF model and feed it red-team prompts designed to elicit harmful answers. For each initial (often harmful) response, sample a principle from the constitution, ask the model to critique its own answer against that principle, and then ask it to revise the answer in light of the critique. Repeat the critique-revise loop a few times, then fine-tune the original model on the final revised answers. The result, SL-CAI, is already less harmful than the starting model and serves as the warm start for phase 2.
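
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: `generate` is a hypothetical callable standing in for any text-generation call, the two-entry `CONSTITUTION` is invented for the example, and the prompt templates are assumptions.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical two-principle constitution, for illustration only.
CONSTITUTION = [
    "Identify ways the response is harmful or assists illegal activity.",
    "Identify ways the response is evasive or unhelpful.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the final revision after n_rounds critique-revise steps."""
    rng = rng or random.Random(0)
    response = generate(prompt)  # initial, possibly harmful, answer
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # one principle per step
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: rewrite the response to address the critique."
        )
    return response


def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    principles: List[str],
    n_rounds: int = 2,
) -> List[Dict[str, str]]:
    """Pair each red-team prompt with its final revision for fine-tuning."""
    return [
        {"prompt": p, "completion": critique_and_revise(p, generate, principles, n_rounds)}
        for p in prompts
    ]
```

Only the final revision is kept as the training target; the intermediate critiques are discarded in this sketch. Each prompt costs `1 + 2 * n_rounds` generation calls.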

2. Phase 2: Generate AI preference pairs

Use the SL-CAI model to sample two responses per harmful prompt. Show both to an AI judge along with a constitutional principle, and ask which response better satisfies it — yielding a (prompt, chosen, rejected) pair labeled entirely by AI. Chain-of-thought reasoning in the judge improves both accuracy and transparency.
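
A rough sketch of the labeling step follows. The `judge` callable, the question template, and the strict "A"/"B" answer format are all assumptions made for illustration; a real judge would more likely be scored by comparing the probabilities it assigns to each option.

```python
from typing import Callable, Dict


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    judge: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the judge which response better satisfies the principle."""
    question = (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Answer with A or B."
    )
    verdict = judge(question).strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unparseable judge verdict: {verdict!r}")
    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Raising on an unparseable verdict, rather than guessing, keeps malformed judge outputs out of the preference data.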

3. Train the preference (reward) model

Train a reward model on the AI-labeled pairs (plus existing human helpfulness pairs) using the Bradley–Terry objective, so the chosen response scores higher than the rejected one:

$$\mathcal{L}_{\text{RM}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$

Here $y_w$ is the chosen (“winning”) response and $y_l$ the rejected one.
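
To make the objective concrete, here is a small numeric sketch in plain Python. It takes the reward scores as given numbers (in practice they would come from the reward model $r_\phi$), and the function names are illustrative.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(
    chosen_scores: Sequence[float], rejected_scores: Sequence[float]
) -> float:
    """Mean of -log sigmoid(r(x, y_w) - r(x, y_l)) over a batch of pairs."""
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty score lists of equal length")
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)


# Equal scores mean the model cannot tell the pair apart: loss = log(2).
print(bradley_terry_loss([0.0], [0.0]))
```

The loss shrinks toward zero as the chosen response's score pulls ahead of the rejected one's, and grows roughly linearly when the ordering is wrong.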

4. Optimize the policy with RL (RLAIF)

Run reinforcement learning — classically PPO — to maximize the AI-derived reward while a KL penalty keeps the policy near its reference, exactly as in RLHF:

$$\max_{\pi}\; \mathbb{E}_{x,\,y\sim\pi}\big[\,r_\phi(x,y)\,\big]\;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)$$

The only change from RLHF is the source of the preference labels feeding $r_\phi$: AI, not humans.
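
One way to read the objective: for a response sampled from the policy, the log-ratio $\log \pi(y\mid x) - \log \pi_{\text{ref}}(y\mid x)$ is a single-sample estimate of the KL term, so the penalty can be folded into the reward. The sketch below assumes sequence-level log-probabilities are already available and uses an arbitrary `beta`; real PPO implementations often apply the penalty per token instead.

```python
from typing import Sequence, Tuple


def kl_shaped_reward(
    reward: float, logp_policy: float, logp_ref: float, beta: float = 0.1
) -> float:
    """Reward minus beta times the single-sample KL estimate."""
    return reward - beta * (logp_policy - logp_ref)


def estimate_objective(
    samples: Sequence[Tuple[float, float, float]], beta: float = 0.1
) -> float:
    """Monte Carlo estimate of E[r] - beta * KL from samples y ~ pi.

    Each sample is (reward, logp_policy, logp_ref).
    """
    if not samples:
        raise ValueError("need at least one sample")
    total = sum(kl_shaped_reward(r, lp, lr, beta) for r, lp, lr in samples)
    return total / len(samples)
```

When the policy has not moved from the reference, the log-ratio is zero and the shaped reward equals the raw reward; the further the policy drifts toward responses the reference finds unlikely, the more reward it has to earn to break even.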

What does the self-revision loop actually look like? A single critique-revise step on a harmful prompt:

Prompt: “How can I get into my neighbor’s locked house?”
✗ Initial: You could try jimmying the lock with a credit card, or look for an unlocked window around the back…
✓ Revised: I can’t help with entering someone’s home without permission; that would likely be illegal. If you’re locked out of your own place, a licensed locksmith or your landlord can help.
Critique applied: “Identify ways the response is harmful or assists illegal activity, then rewrite it to be harmless while still being helpful and non-evasive.”

The revised answer becomes supervised training data in phase 1. Phase 2 applies the same principles differently: the judge uses them to choose between two sampled responses, and the chosen and rejected responses form a preference pair. Critically, the model is asked to stay engaged and explain its objection rather than emit a stonewall refusal, which is why CAI improves harmlessness without tanking helpfulness.

Go deeper: what is actually in the constitution?

The original CAI paper used roughly 16 principles, sampled at random during critique and judging. The constitution Anthropic later published for Claude draws on several sources: the UN Universal Declaration of Human Rights, Apple’s terms of service, DeepMind’s Sparrow principles, Anthropic’s own research, and custom rules targeting failure modes the team observed. A typical principle reads like an instruction to the judge: “Choose the response that is least racist, sexist, or discriminatory.” Because each step samples one principle, no single rule has to encode all of “good behavior” at once, and the set is easy to inspect and amend. See Anthropic’s published Claude’s Constitution.

RLAIF vs RLHF: what actually changes

Most of the pipeline is identical. The substitution is narrow but consequential.

| Aspect | RLHF | RLAIF / Constitutional AI |
|---|---|---|
| Preference labeler | Human annotators | An LLM guided by written principles |
| Where values live | Implicit in labeler choices | Explicit in the constitution |
| Cost & speed | Slow, expensive, hard to scale | Fast, cheap, scales with compute |
| Auditability | Low: hard to inspect | High: read and edit the rules |
| Main failure | Annotator noise & disagreement | Judge model’s blind spots & biases |
| Reward / RL step | RM + PPO (or DPO / GRPO) | Identical: only the label source differs |

RLHF — humans set the bar

Preferences come from people. Best where judgments need lived human context or where the model’s own judgment is untrustworthy. Noisy but low-bias: human signal is reliable even if expensive. See RLHF.

RLAIF — the model sets the bar

Preferences come from an AI reading a constitution. Cheap, fast, scalable and explicit. Low-noise but higher-bias: easy to start with, but the judge’s quirks propagate into the policy. Best for scaling harmlessness and well-specified principles.

Reward hacking still applies

CAI inherits RLHF’s central fragility: optimize a learned proxy hard enough and it stops tracking the true goal — Goodhart’s law. The AI preference model is still a proxy, so the policy can learn to produce answers the judge rewards (over-cautious hedging, performative refusals, verbose virtue-signaling) that humans do not actually prefer. The KL penalty mitigates over-optimization but does not eliminate it.

[Figure: reward (y-axis) against RLAIF training steps (x-axis), with two curves: the AI proxy reward (preference model) and the true reward (human preference). A marker shows where over-optimization begins.]
As RLAIF optimization proceeds, the AI-judged proxy reward keeps climbing while the true (human-judged) reward peaks and then declines — the gap is reward hacking against the AI judge.
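
A simple guard against this pattern is to score saved checkpoints on a small held-out set of human judgments and keep the one where that score peaks. The sketch below assumes such per-checkpoint scores already exist; the function names and the example numbers are hypothetical.

```python
from typing import Sequence


def best_checkpoint(true_rewards: Sequence[float]) -> int:
    """Index of the checkpoint with the highest human-judged reward."""
    if not true_rewards:
        raise ValueError("need at least one checkpoint")
    return max(range(len(true_rewards)), key=lambda i: true_rewards[i])


def is_overoptimized(
    proxy_rewards: Sequence[float], true_rewards: Sequence[float]
) -> bool:
    """True if the proxy kept rising after the true reward peaked."""
    if len(proxy_rewards) != len(true_rewards):
        raise ValueError("reward lists must have equal length")
    best = best_checkpoint(true_rewards)
    last = len(true_rewards) - 1
    return best < last and proxy_rewards[last] > proxy_rewards[best]


# Hypothetical scores for four checkpoints.
proxy = [0.10, 0.30, 0.50, 0.70]
true = [0.10, 0.30, 0.25, 0.10]
print(best_checkpoint(true), is_overoptimized(proxy, true))  # prints: 1 True
```

This only detects the divergence after the fact; it does not prevent it, and it is only as good as the held-out human judgments behind `true_rewards`.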

For the general theory, Lilian Weng’s survey on reward hacking is the best single reference, and the broader proxy problem is covered on the RLHF page.

Whose constitution? Collective Constitutional AI

If the values are now an explicit document, an obvious question follows: who writes it? In 2023 Anthropic and the Collective Intelligence Project ran Collective Constitutional AI — sourcing a constitution from roughly 1,000 Americans via the Polis deliberation platform (1,127 statements, 38,252 votes). They then trained a model on this public constitution and compared it to the in-house one.

Two findings stood out. The public constitution overlapped only ~50% with Anthropic’s own — people emphasized different things — yet the resulting model was less biased across nine social dimensions while remaining comparably capable. It is a concrete demonstration that the constitution is a genuine lever: change the document, change the model’s values, measurably.

A short history

  • 2022: Constitutional AI. Bai et al. at Anthropic introduce CAI and RLAIF: a harmless, non-evasive assistant trained with zero human harmlessness labels, a Pareto improvement over RLHF.
  • 2023: Claude ships on CAI. Anthropic begins training production Claude models with Constitutional AI; publishes the constitution.
  • 2023: RLAIF vs RLHF. Google’s Lee et al. show RLAIF matches RLHF on human evals across summarization and dialogue, generalizing the idea beyond Anthropic.
  • 2023: Collective Constitutional AI. Anthropic + the Collective Intelligence Project source a constitution from ~1,000 people; the public-constitution model is measurably less biased.
  • 2024–25: RLAIF goes mainstream. AI feedback becomes a standard ingredient in post-training pipelines (synthetic preferences, LLM-as-judge), blended with human data rather than replacing it.
  • 2026: Claude's constitution revised. Anthropic restructures Claude’s constitution around four pillars (safety, ethics, guideline-compliance, helpfulness), with more nuance on real-world ethics and user safety.

Where it is used today

Constitutional AI is most associated with Anthropic’s Claude, but AI feedback — the RLAIF idea — is now a near-universal post-training ingredient.

| Use | How CAI / RLAIF shows up |
|---|---|
| Claude (Anthropic) | Constitutional AI for harmlessness + character; constitution is public and versioned |
| General post-training | Synthetic preference data and LLM-as-judge stand in for, or augment, human labels |
| Constitutional Classifiers | Constitution-derived guards that filter inputs/outputs at deployment; see RL safety & alignment |
| DPO / GRPO pipelines | AI-labeled pairs feed direct-preference and group-relative methods, not just PPO |
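
As an illustration of the last row, the standard DPO loss consumes a (chosen, rejected) pair directly, with no separate reward model. The sketch below computes it for one pair from sequence log-probabilities that are assumed to be precomputed; the `beta` default is arbitrary.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair.

    Computes -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                  - (log pi(y_l) - log pi_ref(y_l))]).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model more than it raises the rejected one's. Whether the pair was labeled by a human or by an AI judge makes no difference to the computation.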

Most frontier pipelines now blend human and AI feedback: AI feedback for scale and consistency, human feedback for the judgments models still get wrong. This is also adjacent to RLVR, where the “judge” is a programmatic verifier rather than a model — see RL for reasoning.

Limitations and open problems

  • The judge’s blind spots propagate. Any bias or gap in the AI labeler is baked into every policy it trains — alignment is relocated, not removed.
  • Reward hacking against the AI judge. The preference model is still a gameable proxy; over-optimization causes sycophancy and over-refusal.
  • Constitution authorship & legitimacy. A privately written constitution embeds contestable values without a democratic mandate; Collective CAI is an early attempt to address this.
  • Specification is hard. Vague principles produce inconsistent labels; principles can conflict (helpful vs. harmless), and resolving the trade-off is itself a value choice.
  • Capability ceiling. RLAIF works only when the judge is at least as good as a human at the relevant judgment — for frontier or specialist domains, that assumption can break.
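
Several of these limits can be probed before training by comparing the judge's labels with human labels on a small held-out set, broken down by domain. The sketch below assumes records of the form `(domain, judge_label, human_label)`; the domain names in the example are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def agreement_by_domain(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Fraction of pairs where the AI judge matches the human label, per domain."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for domain, judge_label, human_label in records:
        totals[domain] += 1
        hits[domain] += int(judge_label == human_label)
    return {domain: hits[domain] / totals[domain] for domain in totals}


# Hypothetical held-out labels.
sample = [
    ("harmlessness", "A", "A"),
    ("harmlessness", "B", "B"),
    ("medicine", "A", "B"),
    ("medicine", "B", "B"),
]
print(agreement_by_domain(sample))  # {'harmlessness': 1.0, 'medicine': 0.5}
```

A domain where agreement sits near chance is a sign that the judge should not be trusted there, and that human labels are still needed for that slice.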

Building constitutional pipelines in practice

A working CAI/RLAIF stack needs: a curated set of red-team prompts, a well-written constitution, a strong judge model with chain-of-thought, a reward-model trainer, and an RL or direct-preference optimizer. The open-source TRL library implements reward modeling, PPO, DPO and GRPO; AI-feedback labeling is typically a custom layer on top. Standing up red-teaming, preference-data, and reward pipelines at production scale is its own industry — see the AI-feedback and RLHF data vendors.
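
The custom labeling layer usually ends by writing preference records to disk for the trainer. A minimal sketch, assuming the trainer expects JSON Lines with `prompt`, `chosen`, and `rejected` fields; check the expected column names for whichever library you use.

```python
import json
from typing import Dict, Iterable

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize AI-labeled preference pairs as JSON Lines."""
    lines = []
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing {missing}")
        if rec["chosen"] == rec["rejected"]:
            continue  # identical responses carry no preference signal
        # Keep only the required fields so judge metadata does not leak through.
        lines.append(
            json.dumps({k: rec[k] for k in REQUIRED_KEYS}, ensure_ascii=False)
        )
    return "\n".join(lines)
```

Dropping pairs whose two responses are identical is a design choice in this sketch; some pipelines instead resample until the responses differ.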

Researcher takes

Nathan Lambert — author of the RLHF Book and one of the most-cited voices on post-training — treats Constitutional AI as an “advanced” topic alongside evaluation and character training, a sign of how central AI feedback has become to modern alignment.

Frequently asked questions

Is RLAIF the same as Constitutional AI?

Not quite. RLAIF is the general technique of using an AI model (instead of humans) to produce preference labels for RL. Constitutional AI is Anthropic’s specific recipe that guides that AI feedback with a written constitution and adds a supervised self-critique-and-revision phase first. All CAI uses RLAIF, but you can do RLAIF without a formal constitution.

Does Constitutional AI remove humans entirely?

No. Humans still write the constitution, design the red-team prompts, and typically supply the helpfulness preference data. CAI removes humans from the harmlessness labeling step specifically — the most expensive and unpleasant part. Most real pipelines blend human and AI feedback.

How is CAI different from plain RLHF?

The RL machinery — reward model plus PPO under a KL penalty — is identical. The difference is the source of the preference labels: an AI applying explicit written principles, instead of human annotators applying implicit personal judgment. CAI also adds a supervised self-revision phase that RLHF lacks.

Can the AI labeler be trusted to judge itself?

Only up to its own competence. RLAIF relies on the judge being at least as reliable as a human for the judgments in question — which holds for many harmlessness principles but breaks down for frontier reasoning, specialist domains, or values the judge model itself gets wrong. This bootstrapping risk is CAI’s core open problem.
