- Curiosity gives an agent its own internal reward for finding novel or surprising states, so it explores even when the environment pays out almost nothing.
- Two dominant recipes: prediction error in a learned feature space (ICM) and prediction error against a fixed random network (RND).
- It cracked sparse-reward 'hard exploration' games like Montezuma's Revenge that ε-greedy and naive bonuses never solved.
- The classic failure is the noisy-TV problem — agents get hypnotized by random, unpredictable stimuli that are novel but worthless to learn.
What is curiosity-driven RL?
Most reinforcement learning assumes the environment hands out a useful reward. But many real tasks are sparse: a robot gets nothing until it assembles the part, an agent in a maze gets nothing until it finds the exit. Plain exploration by random actions (ε-greedy) almost never stumbles onto that one rewarding state in a vast space. The agent wanders, sees zero reward everywhere, and learns nothing.
Intrinsic motivation fixes this by giving the agent a reward it generates itself. Instead of waiting for the environment, the agent is paid a small intrinsic reward for doing things that are novel, surprising, or informative — visiting an unfamiliar room, taking an action whose outcome it couldn’t predict. This is the computational analogue of biological curiosity: a drive to seek out the new for its own sake. The total reward the agent optimizes becomes
$$r_t = r^e_t + \beta\, r^i_t$$
where $r^e_t$ is the (possibly zero) environment reward, $r^i_t$ is the curiosity bonus, and $\beta$ scales how curious the agent is. Crucially, the agent can learn purely from $r^i_t$ even when $r^e_t$ is flat zero for thousands of steps.
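As a minimal sketch, the weighted sum of environment reward and curiosity bonus described above fits in a one-line function. The name `total_reward` and the weight values used here are illustrative assumptions, not settings from any particular paper.

```python
def total_reward(r_ext: float, r_int: float, beta: float = 0.1) -> float:
    """Combine extrinsic reward and curiosity bonus: r = r_ext + beta * r_int."""
    return r_ext + beta * r_int


# A sparse-reward step: the environment pays nothing, but a surprising
# transition still yields a learning signal.
sparse_step = total_reward(r_ext=0.0, r_int=2.0, beta=0.5)

# A rare rewarding step in a by-now familiar state: the bonus has faded.
goal_step = total_reward(r_ext=1.0, r_int=0.0, beta=0.5)
```

Both steps give the learner a nonzero signal, which is the point: the bonus fills in where the environment is silent.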
Why naive exploration fails
In a game like Montezuma’s Revenge, the agent must climb ladders, jump gaps and grab a key before any points appear — hundreds of precise actions deep. The probability that random ε-greedy noise produces that exact sequence is effectively zero, so vanilla DQN scored ~0 for years. The reward landscape is a flat plain with one tiny peak you’ll never find by accident.
Curiosity changes the shape of the problem. Even with no points, every new screen is interesting, so the agent is pulled outward — through rooms, down corridors — building competence until it bumps into the sparse extrinsic reward.
The two big recipes for “surprise”
How do you turn “novelty” into a number? Almost every method is one of two ideas: count how often you’ve seen a state (rare = novel), or measure how wrong your prediction was (unpredictable = novel). Modern deep-RL curiosity is dominated by the prediction-error family.
Count-based: reward states you’ve visited rarely. In a tabular world, a common choice is the bonus $r^i(s) = 1/\sqrt{N(s)}$, where $N(s)$ is the visit count. The challenge: in pixel worlds you never see the exact same state twice, so you need pseudo-counts from a density model (Bellemare et al., 2016).
Prediction error: reward states your model fails to predict. Train a predictor of “what happens next”; where it errs, the world is unfamiliar, so pay a bonus. This is the basis of ICM and RND, and the focus below.
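The count-based idea can be sketched in a few lines for the tabular case. This assumes the common inverse-square-root form of the bonus and hashable states; the class name `CountBonus` and the room labels are hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Tabular count-based novelty: bonus = 1 / sqrt(N(s)) (assumed form)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def __call__(self, state) -> float:
        # Count the visit first so N(s) >= 1 and the bonus stays finite.
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])


bonus = CountBonus()
first_visit = bonus("room_A")                           # N = 1
fourth_visit = [bonus("room_A") for _ in range(3)][-1]  # N = 4
new_room = bonus("room_B")                              # N = 1
```

The bonus halves by the fourth visit to the same room and resets to its maximum for an unseen one. With pixel observations the dictionary lookup would need to be replaced by a pseudo-count, as the text notes.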
ICM: curiosity as prediction error in feature space
The Intrinsic Curiosity Module (ICM) of Pathak et al., 2017 is the canonical prediction-error method. Its key insight: don’t predict raw pixels — predict in a learned feature space that only encodes things the agent can control or affect. That deliberately throws away distractors (swaying trees, flickering backgrounds) the agent can’t influence.
ICM has three pieces working together:
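Feature encoder. A network $\phi$ maps a raw state $s_t$ to features $\phi(s_t)$. The trick is how it’s trained: not to reconstruct pixels, but to be useful for the inverse model below, so it keeps only action-relevant structure.
Inverse model. Given the features of two consecutive states, predict the action that linked them: $\hat{a}_t = g(\phi(s_t), \phi(s_{t+1}))$. Minimizing the action-prediction loss forces $\phi$ to encode consequences of the agent’s actions and ignore uncontrollable noise.
Forward model. Predict the next state’s features from the current features and action: $\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t)$. The intrinsic reward is the prediction error:
$$r^i_t = \frac{\eta}{2}\,\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \rVert_2^2$$
where $\eta > 0$ is a scaling factor.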
A big error means “I didn’t see that coming” — a novel, learnable transition — so the agent is rewarded for going there.
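The forward-model half of ICM can be illustrated with a toy sketch. It assumes the squared-error reward from Pathak et al. (2017), takes the feature vectors as given (the encoder and inverse model are omitted), and swaps the neural forward model for a hypothetical `LinearForwardModel` trained by plain SGD.

```python
import random


def icm_reward(phi_next_pred, phi_next, eta=1.0):
    """Intrinsic reward: (eta / 2) * squared L2 error in feature space."""
    return 0.5 * eta * sum((p - t) ** 2 for p, t in zip(phi_next_pred, phi_next))


class LinearForwardModel:
    """Toy forward model: predicts phi(s') from [phi(s), one_hot(a)]."""

    def __init__(self, feat_dim, n_actions, lr=0.1, seed=0):
        rng = random.Random(seed)
        self.n_actions = n_actions
        self.lr = lr
        in_dim = feat_dim + n_actions
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                  for _ in range(feat_dim)]

    def _inputs(self, phi, action):
        one_hot = [1.0 if i == action else 0.0 for i in range(self.n_actions)]
        return list(phi) + one_hot

    def predict(self, phi, action):
        x = self._inputs(phi, action)
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in self.w]

    def update(self, phi, action, phi_next):
        """One SGD step on half the squared prediction error."""
        x = self._inputs(phi, action)
        pred = self.predict(phi, action)
        for row, p, t in zip(self.w, pred, phi_next):
            for j, x_j in enumerate(x):
                row[j] -= self.lr * (p - t) * x_j


model = LinearForwardModel(feat_dim=2, n_actions=2)
phi_s, action, phi_next = [1.0, 0.0], 1, [0.0, 1.0]  # made-up transition

reward_before = icm_reward(model.predict(phi_s, action), phi_next)
for _ in range(200):
    model.update(phi_s, action, phi_next)  # the agent repeats the transition
reward_after = icm_reward(model.predict(phi_s, action), phi_next)
```

The first time the transition is seen the reward is large; after the forward model has fit it, the reward collapses toward zero, so the agent has to move on to find more.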
RND: surprise against a frozen random network
Burda et al., 2018 introduced Random Network Distillation (RND), which is simpler than ICM and was the first method to beat average human performance on Montezuma’s Revenge without demonstrations or access to the game’s underlying state. It needs no forward/inverse dynamics at all.
The setup is two networks over a single observation:
- A target network with fixed, random weights — never trained. It maps an observation to a feature vector. It is an arbitrary but deterministic function.
- A predictor network that is trained to match the target’s output on states the agent visits.
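The intrinsic reward is the distillation error:
$$r^i_t = \lVert \hat{f}(o_t; \theta) - f(o_t) \rVert^2$$
where $f$ is the fixed target network, $\hat{f}(\cdot\,; \theta)$ is the trained predictor, and $o_t$ is the observation being scored.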
For states seen often, the predictor has learned to mimic the target, so the error — and the bonus — shrinks toward zero. For novel states, the predictor hasn’t been trained there yet, so it’s wrong and the bonus is high. Novelty becomes “how much haven’t I fit this region yet.” Because the target depends only on the current observation (not on predicting a stochastic future), RND sidesteps a major source of unlearnable noise.
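A toy version of the two-network setup makes the mechanism concrete. As a loud simplification, both "networks" here are random linear maps rather than deep nets, observations are short hand-picked vectors, and `ToyRND` is a hypothetical name; only the frozen-target and trained-predictor structure is taken from the description above.

```python
import random


def matvec(w, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class ToyRND:
    """Linear stand-ins for the RND target and predictor networks."""

    def __init__(self, obs_dim, feat_dim, lr=0.4, seed=0):
        rng = random.Random(seed)
        self.target = random_matrix(feat_dim, obs_dim, rng)     # frozen, never trained
        self.predictor = random_matrix(feat_dim, obs_dim, rng)  # trained to match target
        self.lr = lr

    def bonus(self, obs):
        """Distillation error ||predictor(obs) - target(obs)||^2."""
        pred, tgt = matvec(self.predictor, obs), matvec(self.target, obs)
        return sum((p - t) ** 2 for p, t in zip(pred, tgt))

    def update(self, obs):
        """One SGD step on the predictor only; the target stays fixed."""
        pred, tgt = matvec(self.predictor, obs), matvec(self.target, obs)
        for row, p, t in zip(self.predictor, pred, tgt):
            for j, x_j in enumerate(obs):
                row[j] -= self.lr * (p - t) * x_j


rnd = ToyRND(obs_dim=4, feat_dim=3)
seen, novel = [1.0, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.5]
for _ in range(60):
    rnd.update(seen)  # the agent keeps visiting `seen`

bonus_seen, bonus_novel = rnd.bonus(seen), rnd.bonus(novel)
```

After training, `bonus_seen` is essentially zero while `bonus_novel` is untouched, because the predictor was never fit in that part of observation space. In this linear toy the two observations are orthogonal, so fitting one cannot help with the other; real networks generalize between nearby states, which is what makes the bonus a smooth novelty signal.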
The noisy-TV problem
The Achilles’ heel of every prediction-error method is stochasticity that is novel but useless to learn. The classic thought experiment: put a TV showing random static in the maze, and a remote in the agent’s hand. Each press produces an unpredictable new image — so prediction error stays permanently high, and the agent is rewarded forever for just watching the noise. It becomes a “couch potato” and stops exploring the actual world.
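The failure is easy to reproduce in miniature. The sketch below is an assumption-laden toy: observations are single numbers, the "model" is a running mean, and the names `wall` and `tv` are invented. It compares the prediction-error bonus from a static scene with the bonus from pure random static.

```python
import random


class RunningMeanPredictor:
    """Predicts the next observation as the mean of everything seen so far."""

    def __init__(self):
        self.mean, self.n = 0.0, 0

    def error_then_update(self, obs):
        err = (obs - self.mean) ** 2  # prediction-error "bonus" for this step
        self.n += 1
        self.mean += (obs - self.mean) / self.n
        return err


rng = random.Random(0)
wall, tv = RunningMeanPredictor(), RunningMeanPredictor()

wall_errors = [wall.error_then_update(0.7) for _ in range(500)]       # static scene
tv_errors = [tv.error_then_update(rng.random()) for _ in range(500)]  # random static

# Average bonus over the last 100 steps, long after learning has settled.
late_wall = sum(wall_errors[-100:]) / 100
late_tv = sum(tv_errors[-100:]) / 100
```

The wall stops paying almost immediately, while the TV keeps paying indefinitely even though the predictor has already learned everything there is to learn about it.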
Mitigations each attack the confusion between reducible (epistemic) and irreducible (aleatoric) uncertainty:
| Approach | Idea | Cost |
|---|---|---|
| ICM feature space | Predict only controllable features, so uncontrollable noise is filtered before it’s measured | Inverse model can still leak some noise |
| RND | Predict a deterministic function of the current state, not a stochastic future | A perfectly novel-but-noisy state can still score high |
| Learning progress | Reward the rate of error reduction, not the error itself; static noise never improves so it pays nothing | Harder to estimate; Schmidhuber’s original framing |
| Subtract aleatoric variance | Estimate aleatoric uncertainty explicitly and subtract it from the bonus (Mavor-Parker et al., 2021) | Extra modeling machinery |
Go deeper: epistemic vs aleatoric uncertainty
The whole noisy-TV issue is a confusion of two uncertainties. Epistemic uncertainty is uncertainty about the world that more data removes — an unexplored room you simply haven’t seen. Aleatoric uncertainty is intrinsic randomness no amount of data removes — the static on a TV, a die roll. Good curiosity should chase epistemic uncertainty (it points at things worth learning) and ignore aleatoric (a dead end). Prediction error mixes the two: both produce a bad prediction. Schmidhuber’s original proposal — reward learning progress, i.e. the derivative of prediction error — is elegant precisely because pure noise has zero learning progress: you never get better at predicting a die, so it pays nothing.
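Learning progress can be approximated by comparing prediction error over two adjacent time windows. This is one simple estimator among many, not Schmidhuber's exact formulation; the window size, the clipping at zero, and the example error sequences are all assumptions made for illustration.

```python
def learning_progress(errors, window=2):
    """Mean error over the older window minus mean error over the newer one."""
    if len(errors) < 2 * window:
        return 0.0  # not enough history to estimate progress
    older = errors[-2 * window:-window]
    newer = errors[-window:]
    progress = sum(older) / window - sum(newer) / window
    return max(progress, 0.0)  # only reward getting better


learnable_room = [1.0, 0.8, 0.6, 0.4]  # error falls as the model learns
noisy_tv = [1.0, 1.0, 1.0, 1.0]        # error never improves

room_bonus = learning_progress(learnable_room)
tv_bonus = learning_progress(noisy_tv)
```

The room earns a positive bonus and the TV earns none. With real noise the error fluctuates rather than staying flat, so the windowed difference hovers around zero instead of being exactly zero, which is part of why the table calls this signal harder to estimate.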
Beyond prediction error
Curiosity is one branch of a wider intrinsic-motivation family. Other signals an agent can generate for itself:
- Empowerment — reward states from which the agent has the most control over its future (maximizing mutual information between actions and outcomes). Be drawn to where you have options.
- Skill / diversity objectives — learn a set of distinguishable behaviors with no reward at all, e.g. DIAYN (“Diversity is All You Need”, Eysenbach et al., 2018), where the bonus rewards skills that lead to distinguishable states. Closely tied to hierarchical RL and unsupervised RL.
- Goal-conditioned novelty — set your own goals at the frontier of what you can currently reach (automatic curriculum learning).
- Episodic novelty — count novelty within an episode so the agent keeps moving (used in NGU and Agent57).
These often complement curiosity rather than replace it: a frontier exploration agent might combine episodic novelty, lifelong RND-style novelty, and an extrinsic reward.
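Episodic novelty differs from lifelong novelty only in when the memory is cleared. A minimal sketch, assuming exact counts over hashable states (NGU itself uses an episodic memory over learned embeddings, which this does not reproduce), with the hypothetical name `EpisodicCountBonus`:

```python
import math
from collections import Counter


class EpisodicCountBonus:
    """Count-based bonus whose counts are wiped at the start of each episode."""

    def __init__(self):
        self.counts = Counter()

    def reset(self):
        """Call at every episode boundary."""
        self.counts.clear()

    def __call__(self, state):
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])


episodic = EpisodicCountBonus()
first = episodic("door")   # first visit this episode
repeat = episodic("door")  # revisiting within the episode pays less
episodic.reset()           # new episode begins
fresh = episodic("door")   # pays the full bonus again
```

Because the bonus is restored every episode, it never vanishes for good. It keeps pushing the agent to cover ground within each episode, while a lifelong signal such as RND tracks what is new across all of training.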
Go deeper: Go-Explore and “first return, then explore”
Go-Explore (Ecoffet et al.; the refined version published in Nature, 2021) argued that prediction-error curiosity suffers from detachment (forgetting how to get back to promising frontiers) and derailment (exploration noise knocking you off course before you reach them). Its fix is structural rather than a new reward: explicitly remember promising visited states in an archive, deterministically return to one, then explore from there. It blew past prior records — scoring over 2,000,000 on Montezuma’s Revenge and reliably solving the full game — and is a useful counterpoint that better memory and returning can matter as much as a better novelty signal. See also exploration vs exploitation.
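The archive, return, explore loop can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the environment is a deterministic one-dimensional corridor, a "cell" is just the position, returning is done by replaying stored actions, and the cell-selection rule (least-chosen first, farthest on ties) is a stand-in rather than the heuristic from the paper.

```python
import random

START, SIZE = 0, 30


def step(pos, action):
    """Deterministic corridor: move left (-1) or right (+1), clipped at the walls."""
    return min(max(pos + action, 0), SIZE - 1)


def replay(actions):
    """'Return' phase: deterministically re-run stored actions from the start."""
    pos = START
    for a in actions:
        pos = step(pos, a)
    return pos


def go_explore(iterations=100, explore_steps=5, seed=0):
    rng = random.Random(seed)
    # Archive maps each discovered cell to the shortest known action sequence
    # that reaches it, plus how often it has been chosen as a starting point.
    archive = {START: {"actions": [], "chosen": 0}}
    for _ in range(iterations):
        # Select the least-chosen cell, preferring the farthest one on ties.
        cell = min(archive, key=lambda c: (archive[c]["chosen"], -c))
        archive[cell]["chosen"] += 1
        actions = list(archive[cell]["actions"])
        pos = replay(actions)            # first return...
        for _ in range(explore_steps):   # ...then explore with random actions
            a = rng.choice((-1, 1))
            pos = step(pos, a)
            actions.append(a)
            known = archive.get(pos)
            if known is None or len(actions) < len(known["actions"]):
                chosen = known["chosen"] if known else 0
                archive[pos] = {"actions": list(actions), "chosen": chosen}
    return archive


archive = go_explore()
```

No novelty reward appears anywhere in this loop. Progress comes from remembering how to get back to each frontier cell, which addresses detachment, and from returning there without exploration noise, which addresses derailment.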
Where curiosity is used
| Setting | Why curiosity helps |
|---|---|
| Hard-exploration games (Montezuma’s Revenge, Pitfall) | The original proving ground for sparse-reward exploration |
| Robotics with sparse task reward | Drives a robot to discover useful contacts and motions before any task reward appears — see RL in robotics |
| Procedurally generated / open-ended worlds | Novelty keeps an agent exploring an endless map (Minecraft-style) |
| Pretraining / unsupervised RL | Build broad skills with no reward, then fine-tune on the real task |
| Multi-agent sparse settings | Each agent explores a huge joint space — see multi-agent RL |
The deeper connection: curiosity is exploration folded into the reward, and a curiosity bonus is a form of reward shaping — so the usual shaping caution applies. Build it badly and you change the optimal policy; build it as a vanishing bonus and you only change how the agent gets there.
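One way to make the bonus vanish on purpose, rather than relying on the novelty signal to fade by itself, is to anneal its weight. A minimal sketch, assuming a linear schedule; the function name and the default starting weight are illustrative.

```python
def beta_schedule(step: int, total_steps: int, beta_start: float = 1.0) -> float:
    """Linearly anneal the curiosity weight from beta_start to zero."""
    fraction_left = max(0.0, 1.0 - step / total_steps)
    return beta_start * fraction_left


early = beta_schedule(step=0, total_steps=1000)
midway = beta_schedule(step=500, total_steps=1000)
late = beta_schedule(step=1000, total_steps=1000)
```

Early in training the bonus dominates and drives exploration. By the end its weight is zero, so the policy being optimized is the one for the extrinsic reward alone.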
Limitations and open problems
- Noisy-TV / stochasticity — the central failure mode; still no universally clean solution.
- Vanishing curiosity — once everything is explored, the bonus dies, and if the extrinsic reward is still zero the agent has nothing to optimize. Curiosity bootstraps exploration; it doesn’t replace a goal.
- Scale sensitivity — the bonus weight and reward normalization are notoriously finicky; intrinsic and extrinsic rewards live on different, drifting scales (RND uses running normalization and separate value heads to cope).
- Reproducibility — results are sensitive to implementation details; the RLeXplore benchmark exists precisely to make intrinsic-reward research comparable.
Curiosity in practice
If you are adding curiosity to an agent, the pragmatic default is RND on top of PPO: two small networks, an extra distillation loss, a normalized intrinsic reward summed into the advantage, and ideally two value heads (one for extrinsic, one for intrinsic) since the two rewards have very different horizons. Reach for ICM when you need the controllable-features filter because your observations are full of distractors. Use libraries — RLeXplore bundles many intrinsic-reward methods, and general RL frameworks host PPO baselines to attach them to.
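Two pieces of that recipe are small enough to sketch without a deep-learning framework: running normalization of the bonus, and combining advantages from the two value heads. This is a simplified illustration, not the reference RND implementation. The standard deviation is tracked over the raw bonus stream, the weights `w_ext` and `w_int` are assumed defaults to tune, and the advantage values passed in are made up.

```python
import math


class RunningStd:
    """Welford's online algorithm for the standard deviation of a stream."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are at least two samples.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0


def normalize_intrinsic(r_int, stats, eps=1e-8):
    """Scale the bonus by a running std so its magnitude stays stable."""
    return r_int / (stats.std + eps)


def combined_advantage(adv_ext, adv_int, w_ext=2.0, w_int=1.0):
    """Weighted sum of advantages from separate extrinsic and intrinsic value heads."""
    return w_ext * adv_ext + w_int * adv_int


stats = RunningStd()
for r in [1.0, 2.0, 3.0, 4.0]:  # raw intrinsic rewards seen so far
    stats.update(r)

scaled = normalize_intrinsic(2.0, stats)
advantage = combined_advantage(adv_ext=0.5, adv_int=1.0)
```

Keeping the two advantage streams separate until the final sum is what lets each value head use its own discount and horizon, which is the reason the text recommends two heads.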
Curiosity is best seen as a complement to the broader exploration vs exploitation toolkit, not a silver bullet — pair it with good environment design and, where possible, a little extrinsic signal to anchor the goal. Building and benchmarking these exploration stacks at scale is its own tooling problem; see the list of RL environment startups.
Researcher takes
Jeff Clune on hard-exploration games: a minimal intrinsic signal — rewarding the agent simply for visiting new states within its lifetime — was enough to match the state of the art on Montezuma’s Revenge.
Frequently asked questions
What’s the difference between ICM and RND?
Both reward prediction error, but on different targets. ICM predicts the next state’s features given the action, in a feature space trained to capture only controllable things — principled but heavier. RND predicts the output of a fixed random network on the current state — no dynamics model, far simpler, and more robust to stochastic transitions. RND is the usual first choice; ICM when you specifically need its noise-filtering features.
Is a curiosity bonus the same as reward shaping?
In effect, yes: it is a learned, state-dependent term added to the reward. What sets it apart from classic reward shaping is that it’s self-generated and non-stationary: the bonus shrinks as the agent learns a region. To avoid changing the optimal policy, you generally want it to vanish over training so it influences how the agent explores, not what it ultimately optimizes.
Can an agent learn with no extrinsic reward at all?
Often, surprisingly far. OpenAI’s large-scale study showed agents driven purely by curiosity learn to play many Atari games competently, because in human-designed games progress and novelty happen to be correlated. But pure curiosity has no notion of the task — when novelty and the goal diverge, you still need an extrinsic signal to point the way.
How does curiosity relate to exploration like ε-greedy?
ε-greedy explores by injecting random actions — undirected and blind. Curiosity explores by directing the agent toward states it judges novel or informative, which is dramatically more sample-efficient in sparse, high-dimensional worlds. It’s directed exploration baked into the reward rather than bolted onto the action selection.
Key papers
- Artificial Curiosity / Formal Theory of Creativity — Schmidhuber, 1991– — learning progress as intrinsic reward.
- Unifying Count-Based Exploration and Intrinsic Motivation — Bellemare et al., 2016 — pseudo-counts.
- Curiosity-Driven Exploration by Self-Supervised Prediction — Pathak et al., 2017 — the ICM.
- Large-Scale Study of Curiosity-Driven Learning — Burda et al., 2018 — pure curiosity across 54 environments.
- Exploration by Random Network Distillation — Burda et al., 2018 — RND, human-level Montezuma’s Revenge.
- First Return, Then Explore (Go-Explore) — Ecoffet et al., Nature 2021 — structural exploration.
Related
Exploration vs exploitation · Reward shaping · Deep Q-networks · PPO · Curriculum learning · Hierarchical RL · RL in robotics · What is reinforcement learning?