- Imitation learning teaches a policy from expert demonstrations instead of a reward function — useful when 'good' is easy to show but hard to score.
- Behavioral cloning is just supervised learning on (state, action) pairs; its fatal flaw is compounding error from covariate shift, which DAgger fixes by querying the expert on the learner's own states.
- Inverse RL flips the problem: instead of copying actions, infer the reward function the expert was optimizing — then plan against it, so the policy generalizes beyond the demonstrated states.
- Modern methods unify both: max-entropy IRL resolves reward ambiguity, GAIL casts imitation as adversarial distribution matching, and the same recipe underpins today's robot manipulation policies.
What is imitation learning?
Imitation learning trains an agent to act by copying an expert rather than by maximizing a reward you wrote down. You collect demonstrations — a human driving, a teleoperated robot arm, a chess grandmaster’s games — and learn a policy that reproduces them. It is the natural answer to a recurring problem in reinforcement learning: for many tasks, designing a reward function is harder than just showing the behavior. How would you score “drive like a careful human”? You can’t easily — but you have thousands of hours of careful humans driving.
Inverse reinforcement learning (IRL) is the deeper cousin. Instead of mimicking actions directly, it asks: what reward function would make this expert’s behavior optimal? Recover that reward, and you can plan against it with ordinary RL — generalizing to states the expert never visited, and even surpassing the expert. The two families share a goal (learn from demonstrations) but differ in what they output: imitation learning returns a policy; inverse RL returns a reward.
Behavioral cloning: imitation as supervised learning
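The simplest form of imitation learning is behavioral cloning (BC): treat the demonstrations as a labeled dataset $\mathcal{D}$ of (state, action) pairs and fit a policy $\pi_\theta$ with plain supervised learning. Minimize, over policy parameters $\theta$, the prediction error $\ell$ against the expert action $a^*$:

$$\min_\theta \; \mathbb{E}_{(s, a^*) \sim \mathcal{D}} \big[ \ell(\pi_\theta(s), a^*) \big]$$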
That’s it — no environment interaction, no reward, no rollouts. For categorical actions it’s cross-entropy; for continuous control it’s regression (or a generative head). BC is fast, stable, and shockingly effective when you have enough data covering the right states. It’s why it remains the workhorse of robot learning.
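As a minimal sketch of BC as regression, the snippet below fits a one-dimensional linear policy to noisy demonstrations by least squares. The expert controller, the noise level, and all function names are illustrative assumptions, not part of any particular library or benchmark.

```python
import random

def expert_action(state):
    # Hypothetical expert: a linear controller (assumed for illustration).
    return -2.0 * state + 1.0

def fit_linear_policy(states, actions):
    """Least-squares fit of a = w * s + b, i.e. BC with a squared-error loss."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b

rng = random.Random(0)
# Demonstrations: states the expert visited, with slightly noisy expert actions.
demo_states = [rng.uniform(-1.0, 1.0) for _ in range(200)]
demo_actions = [expert_action(s) + rng.gauss(0.0, 0.05) for s in demo_states]

w, b = fit_linear_policy(demo_states, demo_actions)

def policy(state):
    # The cloned policy: no environment interaction was needed to obtain it.
    return w * state + b

bc_loss = sum((policy(s) - a) ** 2
              for s, a in zip(demo_states, demo_actions)) / len(demo_states)
print(f"w={w:.2f}, b={b:.2f}, mean squared error={bc_loss:.4f}")
```

The fit is only informed by states in [-1, 1]; nothing in the objective constrains what the policy does outside that range, which is where the next section picks up.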
The compounding-error problem
BC’s elegance hides a deadly flaw. It learns the expert’s action distribution only on states the expert visited. The moment the learner makes a small mistake, it drifts into states the expert never saw — where the policy was never trained, so it makes a bigger mistake, drifting further. Errors compound quadratically in the time horizon.
This mismatch between the training-state distribution and the states the learner actually induces is called covariate shift (or distribution shift): the policy is tested on a distribution it was never trained on. A self-driving model cloned from perfect human laps has no idea how to recover from the shoulder of the road, because perfect humans never drove there.
Go deeper: why the error is quadratic
If the cloned policy makes a mistake with probability $\epsilon$ at any state on-distribution, and a single mistake can put it off-distribution where it has no recovery guarantee for the remaining steps, the expected number of mistakes over a horizon $T$ scales like $O(\epsilon T^2)$ rather than the $O(\epsilon T)$ you’d get if errors didn’t change the state distribution. Ross and Bagnell’s analysis makes this precise: standard supervised learning bounds assume i.i.d. data, but in sequential control the learner’s own actions change the test distribution, breaking that assumption. DAgger (next section) restores a linear bound by training on the learner’s induced distribution.
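A rough version of the counting argument, under simplifying assumptions introduced here: the learner errs with probability at most $\epsilon$ per step while still on-distribution, and in the worst case every step after its first mistake is also a mistake. If the first mistake happens at step $t$ of a horizon $T$, at most $T - t + 1$ mistakes follow, so

$$\mathbb{E}[\text{mistakes}] \;\le\; \sum_{t=1}^{T} \epsilon \,(T - t + 1) \;=\; \epsilon \,\frac{T(T+1)}{2} \;=\; O(\epsilon T^2).$$

If mistakes did not change the state distribution, each step would contribute at most $\epsilon$ independently, giving $\sum_{t=1}^{T} \epsilon = \epsilon T$. This is a worst-case sketch, not the full statement of the published bound.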
DAgger: fixing covariate shift
DAgger (Dataset Aggregation; Ross, Gordon & Bagnell, 2011) is the canonical cure. The insight: the training distribution is wrong, so change it — collect labels on the states the learner actually visits, not just the ones the expert visited. It turns the offline problem into an interactive one and provably reduces imitation learning to no-regret online learning, restoring linear error growth.
1. Train an initial policy on the expert’s demonstrations by ordinary BC. It’s flawed, but good enough to start rolling out.
2. Run the current policy in the environment. Record the states it actually visits, crucially including the off-distribution ones it drifts into.
3. For each visited state, query the expert for the action it would have taken. This is the labeling cost: DAgger needs an interactive (queryable) expert, not just a fixed dataset.
4. Add the new (state, expert action) pairs to the dataset and retrain on the union of all data. Repeat. Over iterations the dataset comes to cover the learner’s own state distribution, so the train/test mismatch behind covariate shift shrinks.
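The loop described above can be sketched on a toy problem. Everything here is an assumption made for illustration: a one-dimensional integer state, a hypothetical expert that steps toward the origin, random disturbances, and a tabular learner that stands still in states it has no label for. The sketch also simplifies DAgger by rolling out the learner alone after the first iteration, rather than a mixture of learner and expert.

```python
import random
from collections import Counter, defaultdict

def expert(state):
    # Hypothetical queryable expert: always take one step toward the origin.
    return (state < 0) - (state > 0)

def step(state, action, rng):
    # Toy dynamics: a random disturbance in [-2, 2] is added to every move.
    return max(-10, min(10, state + action + rng.choice([-2, -1, 0, 1, 2])))

def rollout(policy, rng, horizon=30):
    # Run a policy from state 0 and return the states it visits.
    states, state = [], 0
    for _ in range(horizon):
        states.append(state)
        state = step(state, policy(state), rng)
    return states

def fit(dataset):
    # Tabular "supervised learning": majority expert label per seen state.
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, 0)  # unseen state: no data, so stand still

rng = random.Random(0)

# Step 1: behavioral cloning on a single expert rollout.
dataset = [(s, expert(s)) for s in rollout(expert, rng)]
policy = fit(dataset)
bc_states = {s for s, _ in dataset}

for _ in range(10):
    visited = rollout(policy, rng)                # step 2: run the learner
    dataset += [(s, expert(s)) for s in visited]  # steps 3-4: label and aggregate
    policy = fit(dataset)                         # retrain on the union

dagger_states = {s for s, _ in dataset}
print(f"states covered by BC: {len(bc_states)}, after DAgger: {len(dagger_states)}")
```

The only difference from plain BC is where the states come from: the labels are always the expert's, but after the first round the states are the ones the learner itself reached.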
Go deeper: DART, noise injection, and queryable-expert limits
DAgger’s Achilles heel is that it needs an expert available during training to label arbitrary states — expensive for humans and sometimes dangerous (you’re deliberately driving the learner into bad states). Two practical responses: DART (Laskey et al., 2017) injects noise into the expert’s demonstrations so the offline dataset already covers near-miss states — getting much of DAgger’s robustness without online queries. And disagreement-regularized variants use an ensemble’s uncertainty as a proxy for “off-distribution,” steering the learner back toward covered states without an expert in the loop.
Inverse RL: recover the reward, not the actions
Behavioral cloning and DAgger copy what the expert did. Inverse reinforcement learning asks a more ambitious question — why did they do it? — by recovering the reward function the expert appears to be optimizing. Formally (Ng & Russell, 2000): given an MDP without its reward, plus expert trajectories, find a reward under which the expert’s policy is (near-)optimal.
Why bother, when BC already gives you a policy? Because a reward generalizes. A reward function is a compact explanation of intent that transfers to new dynamics, new start states, and longer horizons — places where a cloned policy simply has no data. Recover the reward, then run standard RL (value-based or policy-gradient) against it, and you can match or exceed the expert.
The fundamental ambiguity
IRL is famously ill-posed. Many reward functions explain the same behavior: most trivially, the all-zero reward $R \equiv 0$ makes every policy optimal, including the expert’s. Even excluding degenerate cases, an entire family of rewards is consistent with any finite set of demonstrations. The history of IRL is largely the history of principled tie-breakers for this ambiguity:
| Approach | Tie-breaking principle | Key reference |
|---|---|---|
| Linear / feature matching | Reward is linear in features; match expert feature expectations | Ng & Russell 2000 |
| Apprenticeship / max-margin | Prefer reward where expert beats alternatives by the largest margin | Abbeel & Ng 2004; Ratliff et al. 2006 |
| Maximum entropy | Among rewards that match features, pick the one whose trajectory distribution is least committed (highest entropy) | Ziebart et al. 2008 |
| Adversarial (GAIL) | Skip the explicit reward; match the expert’s state-action occupancy via a discriminator | Ho & Ermon 2016 |
Maximum-entropy IRL
The most influential resolution of the ambiguity is maximum-entropy IRL (Ziebart et al., 2008). The idea: of all reward functions that reproduce the expert’s observed feature expectations, choose the one that commits to nothing else — the maximum-entropy distribution over trajectories. This makes trajectories with higher reward exponentially more likely, but otherwise stays maximally uncertain, which neatly resolves which of the infinitely many consistent rewards to pick.
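Under a reward linear in features, $R_\theta(\tau) = \theta^\top f_\tau$ with $f_\tau$ the feature counts summed along trajectory $\tau$, MaxEnt IRL models the probability of a trajectory as proportional to its exponentiated reward:

$$P(\tau \mid \theta) = \frac{\exp(\theta^\top f_\tau)}{Z(\theta)}, \qquad Z(\theta) = \sum_{\tau'} \exp(\theta^\top f_{\tau'})$$

Training maximizes the log-likelihood $\mathcal{L}(\theta)$ of the expert trajectories under this model. The gradient has a clean, intuitive form: the difference between the expert’s empirical feature counts $\tilde{f}$ (averaged over demonstrations) and the feature counts expected under the current reward:

$$\nabla_\theta \mathcal{L}(\theta) = \tilde{f} - \sum_{\tau} P(\tau \mid \theta)\, f_\tau$$

You push the reward weights so that planning under them produces the same feature statistics the expert exhibited. The partition function $Z(\theta)$ is the hard part: it requires solving the forward RL problem in the inner loop. Later work (guided cost learning, maximum-causal-entropy IRL for stochastic dynamics, and adversarial methods) made this tractable at scale and extended it beyond near-deterministic MDPs.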
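To make the gradient concrete, here is a minimal sketch on a problem small enough that the normalizer can be computed by brute force. The three candidate trajectories, their two-dimensional feature counts (read them as progress made and hazards crossed), and the demonstration set are all invented for illustration. Each trajectory's probability is taken to be proportional to the exponential of its linear reward, and the weights are updated by "expert feature counts minus expected feature counts."

```python
import math

# Hypothetical feature counts for three candidate trajectories.
trajectory_features = [(4.0, 0.0), (2.0, 1.0), (1.0, 3.0)]
# Assumed demonstrations: index of the trajectory the expert followed each time.
expert_demos = [0, 0, 0, 1, 0, 0, 1, 2]

def trajectory_probs(theta):
    # P(tau) proportional to exp(theta . f_tau); enumeration replaces planning.
    scores = [sum(t * f for t, f in zip(theta, feats))
              for feats in trajectory_features]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def expected_features(probs):
    # Feature counts expected under the current reward weights.
    return [sum(p * feats[k] for p, feats in zip(probs, trajectory_features))
            for k in range(2)]

# Empirical feature counts averaged over the demonstrations.
expert_features = [
    sum(trajectory_features[i][k] for i in expert_demos) / len(expert_demos)
    for k in range(2)
]

theta = [0.0, 0.0]
for _ in range(2000):
    model_features = expected_features(trajectory_probs(theta))
    gradient = [e - m for e, m in zip(expert_features, model_features)]
    theta = [t + 0.1 * g for t, g in zip(theta, gradient)]  # gradient ascent

final_probs = trajectory_probs(theta)
final_features = expected_features(final_probs)
print("theta:", [round(t, 3) for t in theta])
print("trajectory probabilities:", [round(p, 3) for p in final_probs])
```

In a real MDP the enumeration inside `trajectory_probs` is not available; that step is where the forward planning or sampling cost enters.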
GAIL: imitation as distribution matching
Generative Adversarial Imitation Learning (Ho & Ermon, 2016) is the bridge between IRL and modern deep learning — and the most-used deep imitation method. Its key result: you can imitate an expert without ever recovering an explicit reward, by directly matching the expert’s distribution over state-action pairs (its occupancy measure). It borrows the GAN recipe:
- A discriminator learns to tell expert state-action pairs apart from the learner’s.
- The policy (generator) is trained with policy gradients to fool the discriminator — i.e., to visit state-action pairs that look expert-like.
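The discriminator’s output plays the role of a learned reward: state-action pairs it judges “expert-like” get high reward, and the policy is optimized (typically with TRPO or PPO) against that signal. Writing $D(s,a)$ for the discriminator’s estimated probability that a pair came from the expert, the minimax objective is:

$$\min_{\pi} \max_{D} \; \mathbb{E}_{\pi_E}\big[\log D(s,a)\big] + \mathbb{E}_{\pi}\big[\log\big(1 - D(s,a)\big)\big] - \lambda H(\pi)$$

where $\pi_E$ is the expert, $H(\pi)$ is a causal-entropy regularizer, and $\lambda$ trades off how strongly to keep the policy stochastic. (Ho & Ermon write the same objective with the discriminator’s labels swapped, so their $D$ scores learner pairs near 1.) GAIL reached expert performance from a handful of trajectories on continuous-control benchmarks where behavioral cloning needed far more data, because, like DAgger, it trains on the learner’s own induced distribution, sidestepping covariate shift.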
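The discriminator half of this recipe can be sketched in a few lines. The example below is a simplification under stated assumptions: each state-action pair is summarized by a single made-up scalar feature, the discriminator is plain logistic regression that outputs the probability a pair came from the expert, and the policy-gradient update (TRPO or PPO in practice) is omitted. It only shows how a trained discriminator turns into a reward signal.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(0)
# Hypothetical 1-D summaries of (state, action) pairs from each source.
expert_pairs = [rng.gauss(1.0, 0.5) for _ in range(200)]
learner_pairs = [rng.gauss(-1.0, 0.5) for _ in range(200)]

# Logistic discriminator D(x) = sigmoid(w * x + b), the estimated probability
# that x came from the expert.
w, b = 0.0, 0.0
n = len(expert_pairs) + len(learner_pairs)
for _ in range(500):
    grad_w, grad_b = 0.0, 0.0
    for x in expert_pairs:       # ascend log D(x) on expert data
        err = 1.0 - sigmoid(w * x + b)
        grad_w += err * x
        grad_b += err
    for x in learner_pairs:      # ascend log(1 - D(x)) on learner data
        err = -sigmoid(w * x + b)
        grad_w += err * x
        grad_b += err
    w += 0.1 * grad_w / n
    b += 0.1 * grad_b / n

def discriminator(x):
    return sigmoid(w * x + b)

def surrogate_reward(x):
    # Reward handed to the policy optimizer: large where D judges x expert-like.
    return -math.log(1.0 - discriminator(x) + 1e-8)

print(f"D(expert-like)={discriminator(1.0):.2f}, "
      f"D(learner-like)={discriminator(-1.0):.2f}")
```

In full GAIL the two updates alternate: the learner's pairs are re-sampled from fresh rollouts after every policy step, so the discriminator is always chasing a moving target.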
- Imitation learning (BC, DAgger). Outputs a policy that reproduces expert actions. Fast and simple; BC suffers covariate shift, DAgger fixes it but needs a queryable expert. Best when demonstrations densely cover the deployment distribution.
- Inverse RL. Outputs (explicitly or implicitly) the reward the expert optimized, then plans against it. Generalizes off-distribution and can exceed the expert; costlier, needs environment interaction. See reward shaping.
Where it’s used
| Domain | Method | Why |
|---|---|---|
| Robot manipulation | BC variants (ACT, Diffusion Policy) | Teleoperated demos are cheap; rewards for dexterous tasks are nearly impossible to write |
| Autonomous driving | BC + DAgger, GAIL | Human driving logs are abundant; covariate-shift recovery is critical for safety |
| Game & character animation | GAIL / adversarial imitation | Match human-like motion style without hand-scoring “natural” |
| LLM post-training | SFT = behavioral cloning | RLHF and RLVR build on an SFT (cloned-demonstration) base |
| Bootstrapping RL | Imitation pre-training | Warm-start policy gradients to skip slow early exploration |
The robotics renaissance is mostly behavioral cloning, scaled. Zhao et al.’s Action Chunking with Transformers (ACT) on the low-cost ALOHA platform learned fine bimanual tasks (threading, slotting a battery) from ~10 minutes of demonstrations by predicting chunks of future actions — a clever way to blunt compounding error. Diffusion Policy models the demonstration action distribution as a denoising process, handling multimodal expert behavior that naive regression averages away. Building demonstration pipelines, teleop rigs, and the surrounding RL environments at production scale is its own industry — see the human-demonstration data and RL environment vendors.
Limitations and open problems
- You can’t beat a bad teacher (with BC). Behavioral cloning is capped at expert quality and inherits the expert’s mistakes; only reward-recovering methods (IRL) can exceed the demonstrator.
- Reward ambiguity never fully goes away. IRL’s solutions are principled tie-breakers, not proofs of the true reward — the recovered reward may transfer poorly.
- Adversarial instability. GAIL inherits GAN training pathologies: mode collapse, brittle discriminator-generator balance, sensitivity to hyperparameters.
- Expert availability. DAgger needs an online, queryable expert; pure offline data limits you to BC or offline RL.
- Compounding error persists. Even DAgger and GAIL only mitigate covariate shift; long horizons and tight tolerances remain hard.
Researcher takes
Levine frames the practical BC-vs-offline-RL decision as a subtle, data-dependent question rather than a settled default toward imitation.
Frequently asked questions
What’s the difference between imitation learning and inverse RL?
Imitation learning outputs a policy that reproduces expert behavior (e.g. behavioral cloning maps states to actions directly). Inverse RL outputs a reward function that explains the behavior; you then run ordinary RL against it to get a policy. IRL generalizes better off-distribution and can surpass the expert, but is more expensive and needs environment interaction. GAIL sits between them — it imitates without writing down an explicit reward.
Why does behavioral cloning fail when it’s just supervised learning?
Because the i.i.d. assumption of supervised learning breaks in sequential control. The policy’s own actions change which states it sees, so a small error pushes it off the training distribution, where it makes larger errors. This is compounding error from covariate shift, growing like $O(\epsilon T^2)$ over a horizon $T$ for a per-step error rate $\epsilon$. DAgger fixes this by training on the learner’s induced state distribution.
Is supervised fine-tuning of an LLM the same as imitation learning?
Essentially yes — SFT is behavioral cloning on (prompt, ideal response) demonstrations. It inherits BC’s strengths (simple, stable) and limits (capped at demonstration quality, no recovery from off-distribution states). That’s exactly why labs follow it with preference-based methods like RLHF and DPO, which can push beyond the demonstrations.
When should I use GAIL instead of DAgger?
Use DAgger when you have an expert you can query online and want a simple, provably good method. Use GAIL when you only have a fixed set of expert trajectories (no online expert) but can interact with the environment, and you need to generalize beyond the demonstrated states. GAIL is more sample-efficient in demonstrations but harder to train (adversarial instability) and requires environment rollouts.
Key papers
- Algorithms for Inverse Reinforcement Learning — Ng & Russell, 2000 — defines the problem and its ambiguity.
- Apprenticeship Learning via Inverse Reinforcement Learning — Abbeel & Ng, 2004 — feature-expectation matching.
- Maximum Entropy Inverse Reinforcement Learning — Ziebart et al., 2008 — the canonical tie-breaker.
- A Reduction of Imitation Learning to No-Regret Online Learning (DAgger) — Ross, Gordon & Bagnell, 2011.
- Generative Adversarial Imitation Learning — Ho & Ermon, 2016 — adversarial occupancy matching.
- Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) — Zhao et al., 2023 — modern BC for robots.