Imitation Learning & Inverse RL, Explained

Key takeaways

Imitation learning teaches a policy from expert demonstrations instead of a reward function — useful when 'good' is easy to show but hard to score.
Behavioral cloning is just supervised learning on (state, action) pairs; its fatal flaw is compounding error from covariate shift, which DAgger fixes by querying the expert on the learner's own states.
Inverse RL flips the problem: instead of copying actions, infer the reward function the expert was optimizing — then plan against it, so the policy generalizes beyond the demonstrated states.
Modern methods span both families: max-entropy IRL resolves reward ambiguity, GAIL casts imitation as adversarial distribution matching, and scaled-up behavioral cloning (ACT, Diffusion Policy) underpins today's robot manipulation policies.

What is imitation learning?

Imitation learning trains an agent to act by copying an expert rather than by maximizing a reward you wrote down. You collect demonstrations — a human driving, a teleoperated robot arm, a chess grandmaster’s games — and learn a policy that reproduces them. It is the natural answer to a recurring problem in reinforcement learning: for many tasks, designing a reward function is harder than just showing the behavior. How would you score “drive like a careful human”? You can’t easily — but you have thousands of hours of careful humans driving.

Inverse reinforcement learning (IRL) is the deeper cousin. Instead of mimicking actions directly, it asks: what reward function would make this expert’s behavior optimal? Recover that reward, and you can plan against it with ordinary RL — generalizing to states the expert never visited, and even surpassing the expert. The two families share a goal (learn from demonstrations) but differ in what they output: imitation learning returns a policy; inverse RL returns a reward.

Two routes from expert demonstrations to a policy. Imitation learning maps states straight to actions (top); inverse RL first infers the reward the expert optimized, then runs RL against it (bottom).

▶ CS 285 Lecture 20: Inverse Reinforcement Learning (Sergey Levine, Berkeley Deep RL course)

Behavioral cloning: imitation as supervised learning

The simplest form of imitation learning is behavioral cloning (BC): treat the demonstrations as a labeled dataset of (state, action) pairs and fit a policy with plain supervised learning. Minimize, over policy parameters $\theta$, the prediction error against the expert action $a^*$:

$$\min_{\theta}\; \mathbb{E}_{(s,\,a^*)\sim \mathcal{D}_{\text{expert}}}\big[\,\ell\big(\pi_\theta(s),\,a^*\big)\,\big]$$

That’s it: no environment interaction, no reward, no rollouts. For categorical actions the loss $\ell$ is cross-entropy; for continuous control it is a regression loss (or the likelihood of a generative head). BC is fast, stable, and shockingly effective when you have enough data covering the right states, which is why it remains the workhorse of robot learning.
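As a minimal sketch of the objective above, the snippet below clones a toy expert with a squared-error loss. The linear expert `K_true`, the Gaussian state distribution, and the helper name `fit_bc_policy` are all assumptions made for illustration; a real pipeline would swap in a neural network and logged demonstrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a linear controller a* = K_true @ s (toy assumption).
K_true = np.array([[-1.0, -0.5]])            # shape (action_dim, state_dim)

# Demonstration dataset D_expert: states the expert visited and its actions.
states = rng.normal(size=(500, 2))
actions = states @ K_true.T
actions += 0.01 * rng.normal(size=actions.shape)   # small demonstration noise

def fit_bc_policy(states, actions):
    """Behavioral cloning with squared-error loss: argmin_K mean ||K s - a*||^2."""
    K_T, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return K_T.T

K_hat = fit_bc_policy(states, actions)

# Training loss on the expert's own state distribution (the only one BC sees).
bc_loss = float(np.mean((states @ K_hat.T - actions) ** 2))
```

Nothing here touches an environment: the fit uses only the logged pairs, which is both BC's appeal and the root of the problem discussed next.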

0: environment interactions needed to train BC
O(εT²): worst-case error growth from covariate shift over horizon T
~10 min: demonstrations ACT used to learn hard bimanual tasks (Zhao et al., 2023)

The compounding-error problem

BC’s elegance hides a deadly flaw. It learns the expert’s action distribution only on states the expert visited. The moment the learner makes a small mistake, it drifts into states the expert never saw, where the policy was never trained, so it makes a bigger mistake and drifts further. In the worst case, the expected number of errors grows quadratically in the time horizon.

This mismatch between the training-state distribution and the states the learner actually induces is called covariate shift (or distribution shift): the policy is tested on a distribution it was never trained on. A self-driving model cloned from perfect human laps has no idea how to recover from the shoulder of the road, because perfect humans never drove there.

Covariate shift in behavioral cloning. The expert stays on a narrow data manifold; a single learner error pushes it off-distribution, where untrained behavior causes the next, larger error — trajectories diverge.

Go deeper: why the error is quadratic

If the cloned policy makes a mistake with probability $\epsilon$ at any state on-distribution, and a single mistake can put it off-distribution where it has no recovery guarantee for the remaining steps, the expected number of mistakes over a horizon $T$ scales like $\epsilon T^2$ rather than the $\epsilon T$ you’d get if errors didn’t change the state distribution. Ross and Bagnell’s analysis makes this precise: standard supervised learning bounds assume i.i.d. data, but in sequential control the learner’s own actions change the test distribution, breaking that assumption. DAgger (next section) restores a linear $\epsilon T$ bound by training on the learner’s induced distribution.
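The scaling argument can be checked with a few lines of arithmetic. The sketch below assumes the pessimistic model described above: the policy errs with probability `eps` per step while on-distribution, and after its first error it incurs one mistake on every remaining step. The function names are illustrative, not taken from any library.

```python
def expected_mistakes_worst_case(eps, T):
    """Expected mistakes when the first error is never recovered from."""
    total = 0.0
    for t in range(1, T + 1):
        # Probability that the first error happens exactly at step t.
        p_first_error_at_t = (1 - eps) ** (t - 1) * eps
        # That error plus every remaining step counts as a mistake.
        total += p_first_error_at_t * (T - t + 1)
    return total

def expected_mistakes_iid(eps, T):
    """Expected mistakes if errors did not change the state distribution."""
    return eps * T

eps, T = 1e-4, 100
worst = expected_mistakes_worst_case(eps, T)
iid = expected_mistakes_iid(eps, T)

# Doubling the horizon roughly quadruples the worst case but only doubles the i.i.d. case.
growth_worst = expected_mistakes_worst_case(eps, 2 * T) / worst
growth_iid = expected_mistakes_iid(eps, 2 * T) / iid
```

For small `eps * T` the worst-case sum is close to `eps * T * (T + 1) / 2`, which is where the quadratic dependence on the horizon comes from.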

DAgger: fixing covariate shift

DAgger (Dataset Aggregation; Ross, Gordon & Bagnell, 2011) is the canonical cure. The insight: the training distribution is wrong, so change it — collect labels on the states the learner actually visits, not just the ones the expert visited. It turns the offline problem into an interactive one and provably reduces imitation learning to no-regret online learning, restoring linear error growth.

Bootstrap with behavioral cloning

Train an initial policy $\pi_1$ on the expert’s demonstrations by ordinary BC. It’s flawed, but good enough to start rolling out.

Roll out the learner, collect its states

Run the current policy in the environment. Record the states it actually visits — crucially, including the off-distribution ones it drifts into.

Ask the expert to label those states

For each visited state, query the expert for the action it would have taken. This is the labeling cost — DAgger needs an interactive (queryable) expert, not just a fixed dataset.

Aggregate and retrain

Add the new (state, expert action) pairs to the dataset and retrain on the union of all data. Repeat from the rollout step. Over iterations the dataset comes to cover the learner’s own state distribution, so the train/test mismatch behind covariate shift shrinks toward zero.
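The four steps above can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the 1-D dynamics in `step`, the linear `expert_action`, and the least-squares policy class. The sketch also uses the simplest variant, in which the learner alone controls every rollout after the bootstrap, and omits the expert-mixing schedule from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    # Hypothetical queryable expert: push the 1-D state back toward 0.
    return -0.5 * s

def step(s, a):
    # Toy dynamics (an assumption): the action shifts the state, plus noise.
    return s + a + 0.1 * rng.normal()

def features(s):
    return np.array([s, 1.0])

def fit(states, actions):
    # Supervised step: least-squares fit of a linear policy a = w . features(s).
    X = np.stack([features(s) for s in states])
    w, *_ = np.linalg.lstsq(X, np.asarray(actions), rcond=None)
    return w

def rollout(act, s0, horizon):
    # Run a policy in the environment and record the states it visits.
    visited, s = [], s0
    for _ in range(horizon):
        visited.append(s)
        s = step(s, act(s))
    return visited

def dagger(n_iters=5, horizon=20):
    # Bootstrap: behavioral cloning on the expert's own rollout.
    states = rollout(expert_action, s0=1.0, horizon=horizon)
    actions = [expert_action(s) for s in states]
    w = fit(states, actions)
    for _ in range(n_iters):
        # Roll out the learner and collect the states it actually reaches.
        learner = lambda s, w=w: float(features(s) @ w)
        visited = rollout(learner, s0=rng.uniform(-3.0, 3.0), horizon=horizon)
        # Ask the expert to label those states.
        labels = [expert_action(s) for s in visited]
        # Aggregate and retrain on the union of all data.
        states += visited
        actions += labels
        w = fit(states, actions)
    return w, len(states)

w, n_samples = dagger()
```

Because this toy expert is itself linear, plain BC would already recover it; the point of the sketch is the data flow, in which every retraining round sees expert labels on states chosen by the learner rather than by the expert.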

Go deeper: DART, noise injection, and queryable-expert limits

DAgger’s Achilles heel is that it needs an expert available during training to label arbitrary states, which is expensive for humans and sometimes dangerous (you’re deliberately letting the learner drive into bad states). There are two practical responses. DART (Laskey et al., 2017) injects noise into the expert’s actions while demonstrations are collected, so the offline dataset already covers near-miss states and the recoveries from them; this gets much of DAgger’s robustness without online queries. Disagreement-regularized variants use an ensemble’s uncertainty as a proxy for “off-distribution,” steering the learner back toward covered states without an expert in the loop.

Inverse RL: recover the reward, not the actions

Behavioral cloning and DAgger copy what the expert did. Inverse reinforcement learning asks a more ambitious question — why did they do it? — by recovering the reward function the expert appears to be optimizing. Formally (Ng & Russell, 2000): given an MDP without its reward, plus expert trajectories, find a reward $r$ under which the expert’s policy is (near-)optimal.

Why bother, when BC already gives you a policy? Because a reward generalizes. A reward function is a compact explanation of intent that transfers to new dynamics, new start states, and longer horizons — places where a cloned policy simply has no data. Recover the reward, then run standard RL (value-based or policy-gradient) against it, and you can match or exceed the expert.

The fundamental ambiguity

IRL is famously ill-posed. Many reward functions explain the same behavior — most trivially, $r \equiv 0$ makes every policy optimal, including the expert’s. Even excluding degenerate cases, an entire family of rewards is consistent with any finite set of demonstrations. The history of IRL is largely the history of principled tie-breakers for this ambiguity:

Approach	Tie-breaking principle	Key reference
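Linear reward / LP heuristic	Reward is linear in known features; prefer rewards that most separate the expert’s policy from alternatives, with a penalty favoring simple rewards	Ng & Russell 2000
Apprenticeship / max-margin	Match expert feature expectations; prefer reward where expert beats alternatives by the largest margin	Abbeel & Ng 2004; Ratliff et al. 2006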
Maximum entropy	Among rewards that match features, pick the one whose trajectory distribution is least committed (highest entropy)	Ziebart et al. 2008
Adversarial (GAIL)	Skip the explicit reward; match the expert’s state-action occupancy via a discriminator	Ho & Ermon 2016
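The ambiguity described above is easy to reproduce numerically. The sketch below runs value iteration on a hypothetical 3-state chain (the transition table, discount, and rewards are all made up for illustration) and compares three rewards: an assumed "true" one, a positively rescaled copy, and the degenerate $r \equiv 0$.

```python
import numpy as np

# Hypothetical 3-state chain; actions: 0 = left, 1 = right, deterministic moves.
n_states, n_actions, gamma = 3, 2, 0.9
next_state = np.array([[0, 1],
                       [0, 2],
                       [1, 2]])

def q_values(reward, iters=500):
    """Value iteration for a state-based reward collected on arrival."""
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = q.max(axis=1)
        q = reward[next_state] + gamma * v[next_state]
    return q

r = np.array([0.0, 0.0, 1.0])        # assumed "true" reward: reach the right end
q_r = q_values(r)
q_scaled = q_values(5.0 * r)         # positive rescaling of the same reward
q_zero = q_values(np.zeros(3))       # degenerate reward r = 0

# Both non-trivial rewards induce the same greedy policy ...
same_greedy_policy = np.array_equal(q_r.argmax(axis=1), q_scaled.argmax(axis=1))
# ... and under r = 0 every action ties, so any policy is optimal.
all_actions_tie_under_zero = np.allclose(q_zero[:, 0], q_zero[:, 1])
```

Demonstrations from this expert therefore cannot distinguish `r` from `5 * r`, and they are also consistent with the all-zero reward, which is why each method in the table needs an extra selection principle.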

Maximum-entropy IRL

The most influential resolution of the ambiguity is maximum-entropy IRL (Ziebart et al., 2008). The idea: of all trajectory distributions that reproduce the expert’s observed feature expectations, choose the one that commits to nothing else, namely the one with maximum entropy. That distribution makes trajectories with higher reward exponentially more likely but otherwise stays maximally uncertain, and fitting it singles out one reward from the infinitely many consistent ones.

Under a reward linear in features, $r(\tau) = \theta^\top \mathbf{f}(\tau)$, MaxEnt IRL models the probability of a trajectory $\tau$ as proportional to its exponentiated reward:

$$P(\tau \mid \theta) = \frac{1}{Z(\theta)}\,\exp\!\big(\theta^\top \mathbf{f}(\tau)\big), \qquad Z(\theta) = \sum_{\tau} \exp\!\big(\theta^\top \mathbf{f}(\tau)\big)$$

Training maximizes the log-likelihood of the expert trajectories under this model. The gradient has a clean, intuitive form — the difference between the expert’s empirical feature counts and the feature counts expected under the current reward:

$$\nabla_\theta \mathcal{L} = \mathbf{f}_{\text{expert}} - \mathbb{E}_{P(\tau\mid\theta)}\big[\mathbf{f}(\tau)\big]$$

You push the reward weights so that planning under them produces the same feature statistics the expert exhibited. The partition function $Z(\theta)$ is the hard part — it requires solving the forward RL problem in the inner loop. Later work (guided cost learning, maximum-causal-entropy IRL for stochastic dynamics, and adversarial methods) made this tractable at scale and extended it beyond near-deterministic MDPs.
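When the trajectory set is small enough to enumerate, the model and its gradient fit in a few lines. The sketch below assumes a made-up problem with four candidate trajectories and two features, plus an assumed expert feature average `f_expert`; brute-force enumeration stands in for the forward-RL inner loop that real problems require.

```python
import numpy as np

# Hypothetical toy problem: 4 candidate trajectories, each summarized by 2 features.
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])

# Assumed expert feature expectations (empirical average over demonstrations).
f_expert = np.array([0.8, 0.3])

def trajectory_probs(theta):
    """P(tau | theta) = exp(theta . f(tau)) / Z(theta), by brute-force enumeration."""
    logits = F @ theta
    logits -= logits.max()           # numerical stability; cancels in the ratio
    p = np.exp(logits)
    return p / p.sum()

def log_likelihood_grad(theta):
    """Expert feature counts minus feature counts expected under the current reward."""
    return f_expert - trajectory_probs(theta) @ F

theta = np.zeros(2)
for _ in range(2000):
    theta += 0.5 * log_likelihood_grad(theta)   # plain gradient ascent

# At convergence the model reproduces the expert's feature statistics.
f_model = trajectory_probs(theta) @ F
```

The step size and iteration count are arbitrary choices that happen to converge on this toy; the log-likelihood is concave in `theta`, so any sufficiently small step would reach the same fixed point, where the gradient (and hence the feature mismatch) is zero.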

GAIL: imitation as distribution matching

Generative Adversarial Imitation Learning (Ho & Ermon, 2016) is the bridge between IRL and modern deep learning, and one of the most widely used deep adversarial imitation methods. Its key result: you can imitate an expert without ever recovering an explicit reward, by directly matching the expert’s distribution over state-action pairs (its occupancy measure). It borrows the GAN recipe:

A discriminator $D$ learns to tell expert state-action pairs apart from the learner’s.
The policy (generator) is trained with policy gradients to fool the discriminator — i.e., to visit state-action pairs that look expert-like.

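The discriminator’s output plays the role of a learned reward: state-action pairs it judges “expert-like” get high reward, and the policy is optimized (typically with TRPO or PPO) against that signal. In the sign convention of the objective below, $D$ is pushed toward 1 on the learner’s pairs and toward 0 on the expert’s, so the policy minimizes $\log D(s,a)$, which is the same as maximizing the reward $-\log D(s,a)$. The minimax objective is: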

$$\min_{\pi}\,\max_{D}\;\; \mathbb{E}_{\pi}\big[\log D(s,a)\big] + \mathbb{E}_{\pi_E}\big[\log(1 - D(s,a))\big] - \lambda\, H(\pi)$$

where $\pi_E$ is the expert, $H(\pi)$ is a causal-entropy regularizer, and $\lambda$ trades off how strongly to keep the policy stochastic. GAIL hit expert performance from a handful of trajectories on continuous-control benchmarks where behavioral cloning needed far more data — because, like DAgger, it trains on the learner’s own induced distribution, sidestepping covariate shift.
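The discriminator half of this objective can be sketched in isolation. The code below assumes a logistic discriminator over hand-made 2-D (state, action) vectors and equal sample counts from learner and expert; the policy-gradient half (TRPO or PPO) and the entropy term are omitted, and the names `train_discriminator` and `surrogate_reward` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (state, action) samples as 2-D vectors: expert vs. current learner.
expert_sa = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(400, 2))
learner_sa = rng.normal(loc=[-1.0, -1.0], scale=0.5, size=(400, 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def with_bias(sa):
    return np.hstack([sa, np.ones((len(sa), 1))])

def train_discriminator(learner_sa, expert_sa, steps=500, lr=0.1):
    """Gradient ascent on E_pi[log D] + E_piE[log(1 - D)] for a logistic D.

    With this sign convention D moves toward 1 on learner pairs
    and toward 0 on expert pairs.
    """
    X = with_bias(np.vstack([learner_sa, expert_sa]))
    y = np.concatenate([np.ones(len(learner_sa)), np.zeros(len(expert_sa))])
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w += lr * X.T @ (y - sigmoid(X @ w)) / len(X)
    return w

def surrogate_reward(w, sa):
    """Reward handed to the policy optimizer: -log D(s, a)."""
    return -np.log(sigmoid(with_bias(sa) @ w) + 1e-8)

w_d = train_discriminator(learner_sa, expert_sa)
r_expert = float(surrogate_reward(w_d, expert_sa).mean())
r_learner = float(surrogate_reward(w_d, learner_sa).mean())
```

In a full GAIL loop this discriminator update alternates with a policy update that maximizes the surrogate reward on freshly collected rollouts, so the reward signal keeps shifting as the learner's occupancy measure approaches the expert's.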

Behavioral cloning / DAgger — copy actions

Output a policy that reproduces expert actions. Fast and simple; BC suffers covariate shift, DAgger fixes it but needs a queryable expert. Best when demonstrations densely cover the deployment distribution.

Inverse RL / GAIL — recover intent

Output (explicitly or implicitly) the reward the expert optimized, then plan against it. Generalizes off-distribution and can exceed the expert; costlier, needs environment interaction. See reward shaping.

A short history

2000

Ng & Russell: Algorithms for IRL

Frames inverse RL formally and flags the core ambiguity: many rewards explain the same behavior.

2004

Abbeel & Ng: Apprenticeship learning

Max-margin feature-expectation matching — learn to act as well as the expert under its own (unknown) reward.

2008

Ziebart et al.: Maximum-entropy IRL

Resolves reward ambiguity with the max-entropy principle; becomes the dominant classical IRL formulation.

2011

Ross et al.: DAgger

Reduces imitation to no-regret online learning, restoring linear error growth where behavioral cloning’s is quadratic.

2016

Ho & Ermon: GAIL

Casts imitation as adversarial occupancy matching — deep, scalable, reward-free imitation.

2023–24

ACT, Diffusion Policy & the BC renaissance

Action-chunking transformers and diffusion policies make behavioral cloning the backbone of real-world robot manipulation.

Where it’s used

Domain	Method	Why
Robot manipulation	BC variants (ACT, Diffusion Policy)	Teleoperated demos are cheap; rewards for dexterous tasks are nearly impossible to write
Autonomous driving	BC + DAgger, GAIL	Human driving logs are abundant; covariate-shift recovery is critical for safety
Game & character animation	GAIL / adversarial imitation	Match human-like motion style without hand-scoring “natural”
LLM post-training	SFT = behavioral cloning	RLHF and RLVR build on an SFT (cloned-demonstration) base
Bootstrapping RL	Imitation pre-training	Warm-start policy gradients to skip slow early exploration

The robotics renaissance is mostly behavioral cloning, scaled. Zhao et al.’s Action Chunking with Transformers (ACT) on the low-cost ALOHA platform learned fine bimanual tasks (threading, slotting a battery) from ~10 minutes of demonstrations by predicting chunks of future actions — a clever way to blunt compounding error. Diffusion Policy models the demonstration action distribution as a denoising process, handling multimodal expert behavior that naive regression averages away. Building demonstration pipelines, teleop rigs, and the surrounding RL environments at production scale is its own industry — see the human-demonstration data and RL environment vendors.

Limitations and open problems

You can’t beat a bad teacher (with BC). Behavioral cloning is capped at expert quality and inherits the expert’s mistakes; only reward-recovering methods (IRL) can exceed the demonstrator.
Reward ambiguity never fully goes away. IRL’s solutions are principled tie-breakers, not proofs of the true reward — the recovered reward may transfer poorly.
Adversarial instability. GAIL inherits GAN training pathologies: mode collapse, brittle discriminator-generator balance, sensitivity to hyperparameters.
Expert availability. DAgger needs an online, queryable expert; pure offline data limits you to BC or offline RL.
Compounding error persists. Even DAgger and GAIL only mitigate covariate shift; long horizons and tight tolerances remain hard.

Researcher takes

Levine frames the practical BC-vs-offline-RL decision as a subtle, data-dependent question rather than a settled default toward imitation.


Frequently asked questions

What’s the difference between imitation learning and inverse RL?

Imitation learning outputs a policy that reproduces expert behavior (e.g. behavioral cloning maps states to actions directly). Inverse RL outputs a reward function that explains the behavior; you then run ordinary RL against it to get a policy. IRL generalizes better off-distribution and can surpass the expert, but is more expensive and needs environment interaction. GAIL sits between them — it imitates without writing down an explicit reward.

Why does behavioral cloning fail when it’s just supervised learning?

Because the i.i.d. assumption of supervised learning breaks in sequential control. The policy’s own actions change which states it sees, so a small error pushes it off the training distribution, where it makes larger errors. This is compounding error from covariate shift, growing like $\epsilon T^2$ over a horizon $T$. DAgger fixes this by training on the learner’s induced state distribution.

Is supervised fine-tuning of an LLM the same as imitation learning?

Essentially yes — SFT is behavioral cloning on (prompt, ideal response) demonstrations. It inherits BC’s strengths (simple, stable) and limits (capped at demonstration quality, no recovery from off-distribution states). That’s exactly why labs follow it with preference-based methods like RLHF and DPO, which can push beyond the demonstrations.

When should I use GAIL instead of DAgger?

Use DAgger when you have an expert you can query online and want a simple, provably good method. Use GAIL when you only have a fixed set of expert trajectories (no online expert) but can interact with the environment, and you need to generalize beyond the demonstrated states. GAIL is more sample-efficient in demonstrations but harder to train (adversarial instability) and requires environment rollouts.

Key papers

Algorithms for Inverse Reinforcement Learning — Ng & Russell, 2000 — defines the problem and its ambiguity.
Apprenticeship Learning via Inverse Reinforcement Learning — Abbeel & Ng, 2004 — feature-expectation matching.
Maximum Entropy Inverse Reinforcement Learning — Ziebart et al., 2008 — the canonical tie-breaker.
A Reduction of Imitation Learning to No-Regret Online Learning (DAgger) — Ross, Gordon & Bagnell, 2011.
Generative Adversarial Imitation Learning — Ho & Ermon, 2016 — adversarial occupancy matching.
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) — Zhao et al., 2023 — modern BC for robots.
