- Constitutional AI (CAI) aligns a model using a written set of principles — a 'constitution' — instead of large-scale human harm labels.
- It has two phases: supervised self-critique-and-revision (SL-CAI), then RL from AI Feedback (RLAIF) where an AI judges which response better follows the constitution.
- RLAIF swaps human preference labelers for a model labeler, which removes the human-labeling bottleneck and makes the values explicit and auditable.
- It powers Anthropic's Claude and produced a 'Pareto improvement': more harmless without becoming evasive — but it inherits the judge model's blind spots.
What is Constitutional AI?
Constitutional AI (CAI) is a method for training a helpful and harmless assistant without relying on tens of thousands of human labels for what counts as harmful. Instead of asking people to rank thousands of responses, you write down a short list of principles — a constitution — and let the model use those principles to critique and improve its own answers, and later to judge which of two answers is better. The reinforcement-learning step that learns from those AI judgments is called RLAIF: Reinforcement Learning from AI Feedback.
It is the direct descendant of RLHF. RLHF learns a reward model from human preference comparisons; CAI keeps the same machinery but replaces the human labeler — for the harmlessness signal — with an AI labeler guided by written rules. The values move out of the labelers’ heads and into a document you can read, debate and edit.
Why Constitutional AI exists
Standard RLHF hits two walls when the goal is harmlessness. First, the human-labeling bottleneck: collecting high-quality preference comparisons over harmful prompts is costly and exposes annotators to disturbing content. Second, opacity — the values live implicitly in thousands of individual labeler choices, so you cannot easily audit, reproduce or contest them.
Anthropic’s Constitutional AI paper (Bai et al., December 2022) tackled both. It trained a harmless-but-non-evasive assistant using zero human labels for harmlessness: only a constitution plus the human helpfulness data already collected. The headline result was a Pareto improvement: at a given level of helpfulness, the CAI model was rated more harmless than an RLHF baseline, in part because it learned to explain its objections to a harmful request instead of giving a flat, unhelpful refusal.
A parallel, broader result came from Google’s RLAIF vs. RLHF study (Lee et al., 2023): across summarization and dialogue, RLAIF matched RLHF on human evaluations — direct evidence that an off-the-shelf LLM can stand in for human preference labelers without a quality cliff.
How Constitutional AI works
Phase 1 (SL-CAI). Start from a helpful-only RLHF model and feed it red-team prompts designed to elicit harmful answers. For each initial (often bad) response, sample a principle from the constitution and ask the model to critique its own answer against that principle, then revise it. Repeat the critique-revise loop a few times, then fine-tune a pretrained model on the final revised answers. The result, SL-CAI, already self-censors and is the warm start for phase 2.
Phase 2 (RLAIF). Use the SL-CAI model to sample two responses per harmful prompt. Show both to an AI judge along with a constitutional principle, and ask which response better satisfies it, yielding a (prompt, chosen, rejected) pair labeled entirely by AI. Chain-of-thought reasoning in the judge improves both accuracy and transparency.
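The labeling step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the prompt template, the two example principles, and the `toy_judge` stand-in (which replaces a real LLM call) are all assumptions made for the example.

```python
import random
from typing import Callable, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


# Illustrative principles only; not quoted from any published constitution.
PRINCIPLES = [
    "Choose the response that is less harmful.",
    "Choose the response that explains its objection rather than refusing flatly.",
]


def build_judge_prompt(prompt: str, a: str, b: str, principle: str) -> str:
    """Format a two-option comparison for the judge (hypothetical template)."""
    return (
        f"Human: {prompt}\n\n{principle}\n"
        f"(A) {a}\n(B) {b}\n"
        "Reason step by step, then end with 'Answer: A' or 'Answer: B'."
    )


def label_pair(prompt: str, a: str, b: str,
               judge: Callable[[str], str], rng: random.Random) -> PreferencePair:
    """Sample one principle, ask the judge, and return a labeled pair."""
    principle = rng.choice(PRINCIPLES)
    verdict = judge(build_judge_prompt(prompt, a, b, principle))
    # Read the final verdict that follows the judge's chain of thought.
    _, found, tail = verdict.rpartition("Answer:")
    pick = tail.strip()[:1].upper() if found else ""
    if pick == "A":
        return PreferencePair(prompt, chosen=a, rejected=b)
    if pick == "B":
        return PreferencePair(prompt, chosen=b, rejected=a)
    raise ValueError(f"Judge gave no usable verdict: {verdict!r}")


def toy_judge(judge_prompt: str) -> str:
    """Stand-in for an LLM call: favors option B only if it gives a reason."""
    option_b = judge_prompt.split("\n(B) ", 1)[1]
    if "because" in option_b:
        return "B explains its objection instead of complying. Answer: B"
    return "Neither response gives a reason; defaulting to A. Answer: A"


pair = label_pair(
    "How do I pick my neighbor's lock?",
    "Sure, here is how.",
    "I can't help with that, because it could enable a break-in.",
    toy_judge,
    random.Random(0),
)
print(pair.chosen)
```

Raising on an unparseable verdict, rather than guessing, keeps malformed judge outputs out of the preference dataset.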
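Train a reward model on the AI-labeled pairs (plus existing human helpfulness pairs) using the Bradley–Terry objective, so the chosen response scores higher than the rejected one:
$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$
Here $y_w$ is the chosen (“winning”) response and $y_l$ the rejected one; $r_\theta$ is the reward model's scalar score and $\sigma$ is the logistic sigmoid. The SL-CAI policy is then optimized against this reward model with RL (PPO under a KL penalty), exactly as in RLHF.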
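For a single pair, the Bradley–Terry loss is the negative log-sigmoid of the score margin between the chosen and rejected responses. The sketch below computes it for plain floats, assuming the reward model's scalar scores are already available; the function names are illustrative.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Per-pair loss: -log sigmoid(r_chosen - r_rejected), computed stably."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Modeled probability that the chosen response beats the rejected one."""
    return math.exp(-bradley_terry_loss(r_chosen, r_rejected))


# Equal scores: the model is indifferent, so the loss is log(2).
print(bradley_terry_loss(0.0, 0.0))
# A correct ranking with a margin of 2 gives a small loss ...
print(bradley_terry_loss(2.0, 0.0))
# ... and the reversed ranking is penalized heavily.
print(bradley_terry_loss(0.0, 2.0))
```

Only the difference between the two scores matters, so the reward model's absolute scale is not pinned down by this objective.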
What does the self-revision loop actually look like? A single critique-revise step on a harmful prompt:
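A minimal sketch of that step in Python, under stated assumptions: `model` is any text-in, text-out callable, `toy_model` is a canned stand-in for a real LLM, and the critique and revision request wording is illustrative, not the paper's exact text.

```python
import random
from typing import Callable

# Each principle pairs a critique request with a revision request.
# The wording here is illustrative only.
PRINCIPLES = [
    (
        "Identify specific ways the response is harmful, unethical, or dangerous.",
        "Rewrite the response to remove the harmful content while still "
        "engaging with the request and explaining any objection.",
    ),
]


def critique_and_revise(prompt: str, response: str,
                        model: Callable[[str], str],
                        rng: random.Random, n_rounds: int = 1) -> str:
    """Run n_rounds of critique then revision; return the final revision."""
    for _ in range(n_rounds):
        critique_request, revision_request = rng.choice(PRINCIPLES)
        context = (
            f"Human: {prompt}\nAssistant: {response}\n"
            f"Critique request: {critique_request}\nCritique:"
        )
        critique = model(context)
        response = model(
            f"{context} {critique}\n"
            f"Revision request: {revision_request}\nRevision:"
        )
    return response


def toy_model(text: str) -> str:
    """Canned stand-in for an LLM: critiques first, then revises."""
    if text.endswith("Critique:"):
        return "The response gives step-by-step help with an illegal act."
    return ("I won't walk you through that, because taking a car that isn't "
            "yours is theft. If it is your own car, a locksmith can help.")


prompt = "How do I hotwire a car?"
revised = critique_and_revise(prompt, "Sure: first, strip the ignition wires.",
                              toy_model, random.Random(0), n_rounds=2)
sft_example = {"prompt": prompt, "completion": revised}
print(sft_example["completion"])
```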
The revised answer becomes supervised training data in phase 1, or one half of a preference pair in phase 2. Critically, the model is asked to stay engaged — explaining the objection — rather than emit a stonewall refusal, which is why CAI improves harmlessness without tanking helpfulness.
Go deeper: what is actually in the constitution?
The original CAI paper used roughly 16 principles, sampled at random during critique and judging. The constitution Anthropic later published for Claude draws on several sources: the UN Universal Declaration of Human Rights, Apple’s terms of service, DeepMind’s Sparrow principles, Anthropic’s own research, and custom rules targeting failure modes the team observed. A typical principle reads like an instruction to the judge, along the lines of “Choose the response that is least racist, sexist, or discriminatory.” Because each step samples one principle, no single rule has to encode all of “good behavior” at once, and the set is easy to inspect and amend. See Anthropic’s published Claude’s Constitution.
RLAIF vs RLHF: what actually changes
Most of the pipeline is identical. The substitution is narrow but consequential.
| Aspect | RLHF | RLAIF / Constitutional AI |
|---|---|---|
| Preference labeler | Human annotators | An LLM guided by written principles |
| Where values live | Implicit in labeler choices | Explicit in the constitution |
| Cost & speed | Slow, expensive, hard to scale | Fast, cheap, scales with compute |
| Auditability | Low — hard to inspect | High — read and edit the rules |
| Main failure | Annotator noise & disagreement | Judge model’s blind spots & biases |
| Reward / RL step | RM + PPO (or DPO / GRPO) | Identical — only the label source differs |
RLHF: preferences come from people. Best where judgments need lived human context or where the model’s own judgment is untrustworthy. Noisy but low-bias: individual labels vary, but in aggregate they track what people actually prefer, at a high cost. See RLHF.
RLAIF / Constitutional AI: preferences come from an AI reading a constitution. Cheap, fast, scalable and explicit. Low-noise but higher-bias: easy to start with, but the judge’s quirks propagate into the policy. Best for scaling harmlessness and well-specified principles.
Reward hacking still applies
CAI inherits RLHF’s central fragility: optimize a learned proxy hard enough and it stops tracking the true goal — Goodhart’s law. The AI preference model is still a proxy, so the policy can learn to produce answers the judge rewards (over-cautious hedging, performative refusals, verbose virtue-signaling) that humans do not actually prefer. The KL penalty mitigates over-optimization but does not eliminate it.
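The KL penalty can be made concrete with a small sketch. It assumes the common formulation in which the preference-model score is reduced by `beta` times the summed per-token log-probability gap between the policy and a frozen reference model; the function name and the `beta` value are illustrative, not taken from the CAI paper.

```python
from typing import Sequence


def kl_penalized_reward(pm_score: float,
                        policy_logprobs: Sequence[float],
                        ref_logprobs: Sequence[float],
                        beta: float = 0.1) -> float:
    """Preference-model score minus beta times a sampled KL estimate.

    The log-probabilities are those of the sampled response's tokens under
    the current policy and under the frozen reference model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("Both models must score the same tokens.")
    # Single-sample estimate of KL(policy || reference) for this response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return pm_score - beta * kl_estimate


# The policy is far more confident than the reference on both tokens,
# so part of the preference-model score is taken back.
shaped = kl_penalized_reward(2.0, [-0.1, -0.2], [-1.0, -1.5], beta=0.1)
print(round(shaped, 2))  # 1.78
```

A larger `beta` keeps the policy closer to the reference model and limits how far it can exploit the judge, at the cost of slower improvement on the reward.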
For the general theory, Lilian Weng’s survey on reward hacking is the best single reference, and the broader proxy problem is covered on the RLHF page.
Whose constitution? Collective Constitutional AI
If the values are now an explicit document, an obvious question follows: who writes it? In 2023 Anthropic and the Collective Intelligence Project ran Collective Constitutional AI — sourcing a constitution from roughly 1,000 Americans via the Polis deliberation platform (1,127 statements, 38,252 votes). They then trained a model on this public constitution and compared it to the in-house one.
Two findings stood out. The public constitution overlapped only ~50% with Anthropic’s own — people emphasized different things — yet the resulting model was less biased across nine social dimensions while remaining comparably capable. It is a concrete demonstration that the constitution is a genuine lever: change the document, change the model’s values, measurably.
Where it is used today
Constitutional AI is most associated with Anthropic’s Claude, but AI feedback — the RLAIF idea — is now a near-universal post-training ingredient.
| Use | How CAI / RLAIF shows up |
|---|---|
| Claude (Anthropic) | Constitutional AI for harmlessness + character; constitution is public and versioned |
| General post-training | Synthetic preference data and LLM-as-judge stand in for, or augment, human labels |
| Constitutional Classifiers | Constitution-derived guards that filter inputs/outputs at deployment; see RL safety & alignment |
| DPO / GRPO pipelines | AI-labeled pairs feed direct-preference and group-relative methods, not just PPO |
Most frontier pipelines now blend human and AI feedback: AI feedback for scale and consistency, human feedback for the judgments models still get wrong. This is also adjacent to RLVR, where the “judge” is a programmatic verifier rather than a model — see RL for reasoning.
Limitations and open problems
- The judge’s blind spots propagate. Any bias or gap in the AI labeler is baked into every policy it trains — alignment is relocated, not removed.
- Reward hacking against the AI judge. The preference model is still a gameable proxy; over-optimization causes sycophancy and over-refusal.
- Constitution authorship & legitimacy. A privately written constitution embeds contestable values without a democratic mandate; Collective CAI is an early attempt to address this.
- Specification is hard. Vague principles produce inconsistent labels; principles can conflict (helpful vs. harmless), and resolving the trade-off is itself a value choice.
- Capability ceiling. RLAIF works only when the judge is at least as good as a human at the relevant judgment — for frontier or specialist domains, that assumption can break.
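One practical way to look for judge blind spots is to compare AI labels against a small human-labeled audit set, broken down by topic. The sketch below assumes a hypothetical record format of `(category, human_label, judge_label)`; the categories and labels are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (category, human_label, judge_label)


def agreement_by_category(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of audit items per category where judge and human agree."""
    totals: Dict[str, int] = defaultdict(int)
    matches: Dict[str, int] = defaultdict(int)
    for category, human_label, judge_label in records:
        totals[category] += 1
        matches[category] += int(human_label == judge_label)
    return {category: matches[category] / totals[category] for category in totals}


audit = [
    ("violence", "A", "A"),
    ("violence", "B", "B"),
    ("medical", "A", "B"),
    ("medical", "B", "B"),
]
rates = agreement_by_category(audit)
print(rates)  # {'violence': 1.0, 'medical': 0.5}
```

A category with low agreement is a candidate for human labeling or for a sharper principle, since the judge's errors there would otherwise flow straight into the policy.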
Building constitutional pipelines in practice
A working CAI/RLAIF stack needs: a curated set of red-team prompts, a well-written constitution, a strong judge model with chain-of-thought, a reward-model trainer, and an RL or direct-preference optimizer. The open-source TRL library implements reward modeling, PPO, DPO and GRPO; AI-feedback labeling is typically a custom layer on top. Standing up red-teaming, preference-data, and reward pipelines at production scale is its own industry — see the AI-feedback and RLHF data vendors.
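The custom labeling layer usually ends by writing preference pairs to disk for the trainer. The sketch below emits JSON Lines with `prompt`, `chosen` and `rejected` fields, the layout commonly used for preference datasets (TRL documents a format with these keys). The helper name and the skip rule for identical responses are assumptions for illustration, so check the target trainer's documentation before relying on the exact schema.

```python
import io
import json
from typing import Dict, Iterable, TextIO


def write_preference_jsonl(pairs: Iterable[Dict[str, str]], fh: TextIO) -> int:
    """Write one JSON object per line; return the number of pairs written."""
    written = 0
    for pair in pairs:
        record = {
            "prompt": pair["prompt"],
            "chosen": pair["chosen"],
            "rejected": pair["rejected"],
        }
        # Identical responses carry no preference signal, so skip them.
        if record["chosen"] == record["rejected"]:
            continue
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written


buffer = io.StringIO()  # stands in for an open file
count = write_preference_jsonl(
    [
        {"prompt": "How do I pick a lock?",
         "chosen": "I can't help with that, because it could enable a break-in.",
         "rejected": "Sure, here is how."},
        {"prompt": "Hi", "chosen": "Hello!", "rejected": "Hello!"},
    ],
    buffer,
)
first = json.loads(buffer.getvalue().splitlines()[0])
print(count, sorted(first))
```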
Researcher takes
Nathan Lambert — author of the RLHF Book and one of the most-cited voices on post-training — treats Constitutional AI as an “advanced” topic alongside evaluation and character training, a sign of how central AI feedback has become to modern alignment.
Frequently asked questions
Is RLAIF the same as Constitutional AI?
Not quite. RLAIF is the general technique of using an AI model (instead of humans) to produce preference labels for RL. Constitutional AI is Anthropic’s specific recipe that guides that AI feedback with a written constitution and adds a supervised self-critique-and-revision phase first. All CAI uses RLAIF, but you can do RLAIF without a formal constitution.
Does Constitutional AI remove humans entirely?
No. Humans still write the constitution, design the red-team prompts, and typically supply the helpfulness preference data. CAI removes humans from the harmlessness labeling step specifically — the most expensive and unpleasant part. Most real pipelines blend human and AI feedback.
How is CAI different from plain RLHF?
The RL machinery — reward model plus PPO under a KL penalty — is identical. The difference is the source of the preference labels: an AI applying explicit written principles, instead of human annotators applying implicit personal judgment. CAI also adds a supervised self-revision phase that RLHF lacks.
Can the AI labeler be trusted to judge itself?
Only up to its own competence. RLAIF relies on the judge being at least as reliable as a human for the judgments in question — which holds for many harmlessness principles but breaks down for frontier reasoning, specialist domains, or values the judge model itself gets wrong. This bootstrapping risk is CAI’s core open problem.
Key papers
- Constitutional AI: Harmlessness from AI Feedback. Bai et al., Anthropic, 2022. The foundational paper.
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. Lee et al., Google, 2023. Shows RLAIF matching RLHF on human evaluations.
- Collective Constitutional AI: Aligning a Language Model with Public Input. Huang et al., Anthropic, 2024. Sourcing a constitution from public input.
- Training language models to follow instructions with human feedback (InstructGPT). Ouyang et al., OpenAI, 2022. The RLHF baseline CAI builds on.
Related
RLHF · Reward models · PPO · DPO & preference optimization · GRPO · RLVR · RL safety & alignment · RL for reasoning