reinforcement-learning.com

Curiosity & Intrinsic Motivation in RL

How curiosity-driven RL agents explore via intrinsic rewards — prediction error (ICM), random network distillation (RND), count-based novelty, the noisy-TV problem and where it's used.

Updated 2026-06-08 15 min read
Key takeaways
  • Curiosity gives an agent its own internal reward for finding novel or surprising states, so it explores even when the environment pays out almost nothing.
  • Two dominant recipes: prediction error in a learned feature space (ICM) and the prediction error against a fixed random network (RND).
  • It cracked sparse-reward 'hard exploration' games like Montezuma's Revenge that ε-greedy and naive bonuses never solved.
  • The classic failure is the noisy-TV problem — agents get hypnotized by random, unpredictable stimuli that are novel but worthless to learn.

What is curiosity-driven RL?

Most reinforcement learning assumes the environment hands out a useful reward. But many real tasks are sparse: a robot gets nothing until it assembles the part, an agent in a maze gets nothing until it finds the exit. Plain exploration by random actions (ε-greedy) almost never stumbles onto that one rewarding state in a vast space. The agent wanders, sees zero reward everywhere, and learns nothing.

Intrinsic motivation fixes this by giving the agent a reward it generates itself. Instead of waiting for the environment, the agent is paid a small intrinsic reward for doing things that are novel, surprising, or informative — visiting an unfamiliar room, taking an action whose outcome it couldn’t predict. This is the computational analogue of biological curiosity: a drive to seek out the new for its own sake. The total reward the agent optimizes becomes

$$r_t = r_t^{\text{ext}} + \beta \, r_t^{\text{int}}$$

where $r_t^{\text{ext}}$ is the (possibly zero) environment reward, $r_t^{\text{int}}$ is the curiosity bonus, and $\beta$ scales how curious the agent is. Crucially, the agent can learn purely from $r_t^{\text{int}}$ even when $r_t^{\text{ext}}$ is flat zero for thousands of steps.

[Figure: the agent (policy) sends action a(t) to the environment, which returns state s(t+1) and extrinsic reward r-ext (often 0); a curiosity module observes s(t), a(t), s(t+1) and supplies intrinsic reward r-int.]
Total reward = extrinsic reward from the environment plus a scaled intrinsic curiosity bonus. When extrinsic reward is sparse (mostly zero), the intrinsic signal is what actually drives learning and exploration.
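
As a minimal sketch of that sum, the snippet below combines the two signals over a short sparse episode. The function name `total_reward`, the bonus values, and the choice of `beta=0.5` are illustrative assumptions, not values from any particular paper.

```python
def total_reward(r_ext: float, r_int: float, beta: float = 0.1) -> float:
    """Combine rewards as r_t = r_ext + beta * r_int."""
    return r_ext + beta * r_int

# A sparse episode: the environment pays nothing until the last step.
extrinsic = [0.0, 0.0, 0.0, 1.0]
# Hypothetical curiosity bonuses that fade as states become familiar.
intrinsic = [0.8, 0.5, 0.2, 0.1]

rewards = [total_reward(e, i, beta=0.5) for e, i in zip(extrinsic, intrinsic)]
print(rewards)
```

For the first three steps the only non-zero learning signal comes from the intrinsic term, which is the situation the caption describes.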

Why naive exploration fails

In a game like Montezuma’s Revenge, the agent must climb ladders, jump gaps and grab a key before any points appear — hundreds of precise actions deep. The probability that random ε-greedy noise produces that exact sequence is effectively zero, so vanilla DQN scored ~0 for years. The reward landscape is a flat plain with one tiny peak you’ll never find by accident.

Curiosity changes the shape of the problem. Even with no points, every new screen is interesting, so the agent is pulled outward — through rooms, down corridors — building competence until it bumps into the sparse extrinsic reward.
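
To put a rough number on "effectively zero", here is a back-of-the-envelope calculation. It assumes the full 18-action Atari action set and, pessimistically, that exactly one 100-step action sequence reaches the first reward; real games admit many successful sequences, so treat this as an order-of-magnitude illustration only.

```python
import math

def log10_prob_random_sequence(num_actions: int, sequence_length: int) -> float:
    """log10 of the chance that uniform random actions reproduce one exact sequence."""
    return -sequence_length * math.log10(num_actions)

# Assumed numbers for illustration: 18 discrete actions, one exact 100-step sequence.
log_p = log10_prob_random_sequence(num_actions=18, sequence_length=100)
print(f"probability ~ 10^{log_p:.1f}")  # prints: probability ~ 10^-125.5
```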

  • 0: the score classic DQN reached on Montezuma's Revenge for years
  • 22 / 24: level-1 rooms an RND agent explored; first to beat average human
  • 54: Atari/benchmark environments in OpenAI's pure-curiosity study

The two big recipes for “surprise”

How do you turn “novelty” into a number? Almost every method is one of two ideas: count how often you’ve seen a state (rare = novel), or measure how wrong your prediction was (unpredictable = novel). Modern deep-RL curiosity is dominated by the prediction-error family.

Count-based novelty

Reward states you’ve visited rarely. In a tabular world, the bonus is $\propto 1/\sqrt{N(s)}$, where $N(s)$ is the visit count. The challenge: in pixel worlds you never see the exact same state twice, so you need pseudo-counts from a density model (Bellemare et al., 2016).
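
A minimal tabular sketch of the count-based bonus follows. The class name `CountBonus` and the `scale` parameter are assumptions made for illustration; states must be hashable (here, grid coordinates), which is exactly what breaks down in pixel worlds.

```python
import math
from collections import Counter

class CountBonus:
    """Tabular novelty bonus: scale / sqrt(N(s)), where N(s) counts visits to s."""

    def __init__(self, scale: float = 1.0):
        self.counts = Counter()
        self.scale = scale

    def __call__(self, state) -> float:
        self.counts[state] += 1  # count the current visit before paying the bonus
        return self.scale / math.sqrt(self.counts[state])

bonus = CountBonus()
first = bonus((0, 0))      # first visit to this cell
second = bonus((0, 0))     # second visit pays less
elsewhere = bonus((3, 1))  # a never-seen cell pays the full bonus again
```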

Prediction-error novelty

Reward states your model fails to predict. Train a predictor of “what happens next”; where it errs, the world is unfamiliar, so pay a bonus. This is the basis of ICM and RND — and the focus below.

ICM: curiosity as prediction error in feature space

The Intrinsic Curiosity Module (ICM) of Pathak et al., 2017 is the canonical prediction-error method. Its key insight: don’t predict raw pixels; predict in a learned feature space that encodes only what the agent can control or what can affect the agent. That deliberately throws away distractors (swaying trees, flickering backgrounds) that neither respond to the agent’s actions nor matter to it.

ICM has three pieces working together:

1. Encoder φ

A network maps a raw state $s_t$ to features $\phi(s_t)$. The trick is how it’s trained: not to reconstruct pixels, but to be useful for the inverse model below, so it keeps only action-relevant structure.

2. Inverse model: what shapes the features

Given the features of two consecutive states, predict the action that linked them: $\hat{a}_t = g(\phi(s_t), \phi(s_{t+1}))$. Minimizing this loss forces $\phi$ to encode consequences of the agent’s actions and ignore uncontrollable noise.

3. Forward model: the curiosity signal

Predict the next state’s features from the current features and action: $\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t)$. The intrinsic reward is the prediction error:

$$r_t^{\text{int}} = \tfrac{\eta}{2}\,\big\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\rVert_2^2$$

A big error means “I didn’t see that coming” — a novel, learnable transition — so the agent is rewarded for going there.

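[Figure: ICM architecture. States s(t) and s(t+1) pass through the encoder φ to give φ(s(t)) and φ(s(t+1)); the inverse model predicts a(t) from both; the forward model predicts φ(s(t+1)) from φ(s(t)) and a(t); its prediction error against the actual φ(s(t+1)) is the intrinsic reward r-int.]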
ICM: the encoder turns states into features; the inverse model trains those features to capture only what the agent's actions affect; the forward model's prediction error on the next-state features becomes the intrinsic reward.
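
The wiring of the three pieces can be sketched in NumPy as below. This is a forward pass only, under loud assumptions: the encoder, inverse model, and forward model are single untrained linear layers with random placeholder weights, the sizes are toy values, and no gradient step is taken. In a real ICM these would be trained networks, with the inverse loss shaping the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM, NUM_ACTIONS = 8, 4, 3  # toy sizes chosen for illustration

# Placeholder weights (assumption): stand-ins for trained networks.
W_enc = rng.normal(scale=0.5, size=(OBS_DIM, FEAT_DIM))                 # encoder phi
W_inv = rng.normal(scale=0.5, size=(2 * FEAT_DIM, NUM_ACTIONS))         # inverse model g
W_fwd = rng.normal(scale=0.5, size=(FEAT_DIM + NUM_ACTIONS, FEAT_DIM))  # forward model f

def encode(obs):
    return np.tanh(obs @ W_enc)

def icm_signals(obs_t, action, obs_next, eta=1.0):
    """Return (intrinsic reward, inverse-model loss) for one transition."""
    phi_t, phi_next = encode(obs_t), encode(obs_next)

    # Inverse model: cross-entropy between predicted and actual action.
    logits = np.concatenate([phi_t, phi_next]) @ W_inv
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    inverse_loss = -log_probs[action]

    # Forward model: predict next features from current features and action.
    one_hot = np.eye(NUM_ACTIONS)[action]
    phi_hat = np.concatenate([phi_t, one_hot]) @ W_fwd
    r_int = 0.5 * eta * np.sum((phi_hat - phi_next) ** 2)
    return float(r_int), float(inverse_loss)

obs_t, obs_next = rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM)
r_int, inv_loss = icm_signals(obs_t, action=1, obs_next=obs_next)
```

Note that the reward is measured in feature space, never on the raw observation, which is the point of the design.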

RND: surprise against a frozen random network

Burda et al., 2018 introduced Random Network Distillation (RND), which is simpler than ICM and was the first method to beat average human performance on Montezuma’s Revenge without demonstrations. It needs no forward or inverse dynamics model at all.

The setup is two networks over a single observation:

  • A target network $f$ with fixed, random weights that is never trained. It maps an observation to a feature vector: an arbitrary but deterministic function.
  • A predictor network $\hat{f}$ that is trained to match the target’s output on states the agent visits.

The intrinsic reward is the distillation error:

$$r_t^{\text{int}} = \big\lVert \hat{f}(s_{t+1}) - f(s_{t+1}) \big\rVert_2^2$$

For states seen often, the predictor has learned to mimic the target, so the error, and with it the bonus, shrinks toward zero. For novel states, the predictor hasn’t been trained there yet, so it’s wrong and the bonus is high. Novelty becomes “how poorly have I fit this region so far?” Because the target depends only on the current observation (not on predicting a stochastic future), RND sidesteps a major source of unlearnable noise.
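
The shrinking-bonus behaviour can be reproduced in a toy NumPy sketch. Everything here is an assumption for illustration: the target is one random `tanh` layer, the predictor is a plain linear map trained by hand-written gradient steps, and observations are random unit vectors rather than game frames.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM = 6, 4  # toy sizes

W_target = rng.normal(size=(OBS_DIM, FEAT_DIM))  # frozen random target f, never updated
W_pred = np.zeros((OBS_DIM, FEAT_DIM))           # trainable predictor f_hat (linear for brevity)

def unit(v):
    return v / np.linalg.norm(v)

def target(obs):
    return np.tanh(obs @ W_target)

def rnd_bonus(obs):
    """Intrinsic reward: squared distillation error on this observation."""
    return float(np.sum((obs @ W_pred - target(obs)) ** 2))

def train_step(obs, lr=0.5):
    """One gradient step on the squared error, updating the predictor only."""
    global W_pred
    err = obs @ W_pred - target(obs)
    W_pred -= lr * 2.0 * np.outer(obs, err)

familiar = [unit(rng.normal(size=OBS_DIM)) for _ in range(3)]  # states visited often
novel = unit(rng.normal(size=OBS_DIM))                         # state never trained on

bonus_before = rnd_bonus(familiar[0])
for _ in range(500):
    for obs in familiar:
        train_step(obs)
bonus_after = rnd_bonus(familiar[0])
bonus_novel = rnd_bonus(novel)
print(bonus_before, bonus_after, bonus_novel)
```

After training, the bonus on the familiar observation should be close to zero while the untouched observation still scores high, since the predictor was never fit there.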

The noisy-TV problem

The Achilles’ heel of every prediction-error method is stochasticity that is novel but useless to learn. The classic thought experiment: put a TV showing random static in the maze, and a remote in the agent’s hand. Each press produces an unpredictable new image — so prediction error stays permanently high, and the agent is rewarded forever for just watching the noise. It becomes a “couch potato” and stops exploring the actual world.

[Figure: two panels. Epistemic (good): an unexplored room; error is high now but shrinks as you learn, so keep exploring. Aleatoric (trap): a TV showing static; error stays high forever with nothing to learn, so the agent gets stuck.]
The noisy-TV trap. A prediction-error agent confuses two kinds of uncertainty: epistemic (reducible — worth exploring) and aleatoric (irreducible randomness — a dead end). Static on a TV has high prediction error forever, so a naive agent gets stuck.

Mitigations each attack the confusion between reducible (epistemic) and irreducible (aleatoric) uncertainty:

| Approach | Idea | Cost |
| --- | --- | --- |
| ICM feature space | Predict only controllable features, so uncontrollable noise is filtered before it’s measured | Inverse model can still leak some noise |
| RND | Predict a deterministic function of the current state, not a stochastic future | A perfectly novel-but-noisy state can still score high |
| Learning progress | Reward the rate of error reduction, not the error itself; static noise never improves so it pays nothing | Harder to estimate; Schmidhuber’s original framing |
| Reward the variance away | Estimate aleatoric uncertainty explicitly and subtract it (Mavor-Parker et al., 2021) | Extra modeling machinery |

Go deeper: epistemic vs aleatoric uncertainty

The whole noisy-TV issue is a confusion of two uncertainties. Epistemic uncertainty is uncertainty about the world that more data removes, such as an unexplored room you simply haven’t seen. Aleatoric uncertainty is intrinsic randomness that no amount of data removes, such as the static on a TV or a die roll. Good curiosity should chase epistemic uncertainty (it points at things worth learning) and ignore aleatoric uncertainty (a dead end). Prediction error mixes the two: both produce a bad prediction. Schmidhuber’s original proposal, rewarding learning progress (the rate at which prediction error falls, rather than the error itself), is elegant precisely because pure noise has zero learning progress: you never get better at predicting a die, so it pays nothing.
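
A small simulation makes the contrast concrete. It is a sketch under stated assumptions: the "world model" is just a running-average predictor, the learnable source is a constant observation, the noisy TV is Gaussian noise, and learning progress is estimated as the drop in mean error between two consecutive windows (one of several possible estimators).

```python
import numpy as np

def prediction_errors(samples, lr=0.1):
    """Squared error of a running-average predictor, recorded before each update."""
    estimate, errors = 0.0, []
    for x in samples:
        errors.append((x - estimate) ** 2)
        estimate += lr * (x - estimate)
    return np.array(errors)

def learning_progress(errors, window):
    """Drop in mean error from the previous window to the latest one."""
    return errors[-2 * window:-window].mean() - errors[-window:].mean()

rng = np.random.default_rng(0)
room = np.ones(2000)         # learnable: the same observation every step
tv = rng.normal(size=2000)   # noisy TV: fresh randomness every step

room_err, tv_err = prediction_errors(room), prediction_errors(tv)

early_room_progress = learning_progress(room_err[:20], window=10)  # clearly positive
late_tv_error = tv_err[-500:].mean()                               # stays high
late_tv_progress = learning_progress(tv_err, window=500)           # hovers near zero
print(early_room_progress, late_tv_error, late_tv_progress)
```

A raw prediction-error bonus would keep paying for the TV indefinitely, whereas a learning-progress bonus pays for the room while it is being learned and pays roughly nothing for the static.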

Beyond prediction error

Curiosity is one branch of a wider intrinsic-motivation family. Other signals an agent can generate for itself:

  • Empowerment — reward states from which the agent has the most control over its future (maximizing mutual information between actions and outcomes). Be drawn to where you have options.
  • Skill / diversity objectives — learn a set of distinguishable behaviors with no reward at all, e.g. DIAYN (“Diversity is All You Need”, Eysenbach et al., 2018), where the bonus rewards skills that lead to distinguishable states. Closely tied to hierarchical RL and unsupervised RL.
  • Goal-conditioned novelty — set your own goals at the frontier of what you can currently reach (automatic curriculum learning).
  • Episodic novelty — count novelty within an episode so the agent keeps moving (used in NGU and Agent57).

These often complement curiosity rather than replace it: a frontier exploration agent might combine episodic novelty, lifelong RND-style novelty, and an extrinsic reward.
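
One way to picture such a combination is the count-based sketch below. It is only loosely inspired by the episodic-plus-lifelong idea: the class name, the use of exact visit counts for both terms, and the multiplicative mixing rule are all assumptions made for illustration, not the formulation used by NGU or Agent57.

```python
import math
from collections import Counter

class EpisodicPlusLifelong:
    """Toy bonus mixing a per-episode count with a never-reset lifelong count."""

    def __init__(self):
        self.lifelong = Counter()  # persists across episodes
        self.episodic = Counter()  # cleared at every episode start

    def reset_episode(self):
        self.episodic.clear()

    def bonus(self, state) -> float:
        self.lifelong[state] += 1
        self.episodic[state] += 1
        episodic_term = 1.0 / math.sqrt(self.episodic[state])
        lifelong_term = 1.0 / math.sqrt(self.lifelong[state])
        # Assumed mixing rule: episodic novelty, boosted by lifelong novelty.
        return episodic_term * (1.0 + lifelong_term)

novelty = EpisodicPlusLifelong()
first_visit = novelty.bonus("room_a")
repeat_same_episode = novelty.bonus("room_a")
novelty.reset_episode()
revisit_next_episode = novelty.bonus("room_a")
```

The episodic term recovers after each reset, so the agent is still nudged to move around within an episode, while the lifelong term keeps decaying for states it has seen many times overall.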

Go deeper: Go-Explore and “first return, then explore”

Go-Explore (Ecoffet et al.; the refined version published in Nature, 2021) argued that prediction-error curiosity suffers from detachment (forgetting how to get back to promising frontiers) and derailment (exploration noise knocking you off course before you reach them). Its fix is structural rather than a new reward: explicitly remember promising visited states in an archive, deterministically return to one, then explore from there. It blew past prior records — scoring over 2,000,000 on Montezuma’s Revenge and reliably solving the full game — and is a useful counterpoint that better memory and returning can matter as much as a better novelty signal. See also exploration vs exploitation.
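
The remember, return, explore loop can be sketched on a toy problem. Everything below is an illustrative assumption: `Corridor` is a hypothetical deterministic 1-D environment whose position doubles as both the archive "cell" and the restorable simulator state, and cells are selected with weight 1/sqrt(times chosen + 1). The real method uses richer cell representations, scores, and a later robustification phase that are omitted here.

```python
import random

class Corridor:
    """Hypothetical deterministic 1-D corridor; position is the full simulator state."""

    def __init__(self, length=30):
        self.length, self.pos = length, 0

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.pos = min(max(self.pos + action, 0), self.length)
        return self.pos

    def get_state(self):
        return self.pos

    def set_state(self, state):
        self.pos = state

def go_explore(env, iterations=200, explore_steps=5, seed=0):
    rng = random.Random(seed)
    archive = {env.get_state(): 0}  # cell -> number of times it was selected
    for _ in range(iterations):
        # Select a cell, favouring those chosen less often.
        cells = list(archive)
        weights = [1.0 / (archive[c] + 1) ** 0.5 for c in cells]
        cell = rng.choices(cells, weights=weights)[0]
        archive[cell] += 1
        env.set_state(cell)                      # first return (restore saved state)
        for _ in range(explore_steps):           # then explore with random actions
            new_cell = env.step(rng.choice((-1, 1)))
            archive.setdefault(new_cell, 0)      # remember any newly reached cell
    return archive

archive = go_explore(Corridor())
print(f"cells discovered: {len(archive)}, furthest: {max(archive)}")
```

Because each iteration restarts from an archived cell rather than from the origin, the frontier is never forgotten (no detachment) and reaching it does not depend on noisy actions (no derailment).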

A short history

1991: Artificial curiosity
Schmidhuber proposes rewarding an RL agent for the learning progress of its world model, the formal seed of intrinsic motivation. See his overview.

2016: Pseudo-counts
Bellemare et al. derive pseudo-counts from a density model, scaling count-based novelty to raw Atari pixels and making early progress on Montezuma’s Revenge.

2017: ICM
Pathak et al. introduce the Intrinsic Curiosity Module: prediction error in a learned, controllable feature space.

2018: Large-scale curiosity & RND
Burda et al. study pure curiosity across 54 environments, then ship RND, the first method to beat average human on Montezuma’s Revenge without demos.

2019–21: Go-Explore & Agent57
Structural exploration (remember-and-return) and combined novelty signals solve the previously unsolved Atari hard-exploration games.

Where curiosity is used

| Setting | Why curiosity helps |
| --- | --- |
| Hard-exploration games (Montezuma’s Revenge, Pitfall) | The original proving ground for sparse-reward exploration |
| Robotics with sparse task reward | Drives a robot to discover useful contacts and motions before any task reward appears (see RL in robotics) |
| Procedurally generated / open-ended worlds | Novelty keeps an agent exploring an endless map (Minecraft-style) |
| Pretraining / unsupervised RL | Build broad skills with no reward, then fine-tune on the real task |
| Multi-agent sparse settings | Each agent explores a huge joint space (see multi-agent RL) |

The deeper connection: curiosity is exploration folded into the reward, and a curiosity bonus is a form of reward shaping — so the usual shaping caution applies. Build it badly and you change the optimal policy; build it as a vanishing bonus and you only change how the agent gets there.

Limitations and open problems

  • Noisy-TV / stochasticity: the central failure mode; still no universally clean solution.
  • Vanishing curiosity: once everything is explored, the bonus dies, and if the extrinsic reward is still zero the agent has nothing to optimize. Curiosity bootstraps exploration; it doesn’t replace a goal.
  • Scale sensitivity: the bonus weight $\beta$ and reward normalization are notoriously finicky; intrinsic and extrinsic rewards live on different, drifting scales (RND uses running normalization and separate value heads to cope).
  • Reproducibility: results are sensitive to implementation details; the RLeXplore benchmark exists precisely to make intrinsic-reward research comparable.
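
To illustrate the running-normalization idea from the scale-sensitivity point, here is a minimal sketch using Welford's online algorithm. It is a simplification and an assumption on my part: it divides each raw bonus by the running standard deviation of the bonuses themselves, whereas RND normalizes by a running estimate of the standard deviation of intrinsic returns. The names `RunningStd` and `normalize_intrinsic` are hypothetical.

```python
import math

class RunningStd:
    """Welford's online estimate of mean and (population) standard deviation."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Fall back to 1.0 until there are at least two samples.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0

def normalize_intrinsic(raw_bonuses, eps=1e-8):
    """Divide each bonus by the running std observed so far (including itself)."""
    stats, normalized = RunningStd(), []
    for r in raw_bonuses:
        stats.update(r)
        normalized.append(r / (stats.std + eps))
    return normalized

small = normalize_intrinsic([1.0, 2.0, 3.0, 4.0])
large = normalize_intrinsic([1000.0, 2000.0, 3000.0, 4000.0])
```

Apart from the very first sample, the two normalized sequences agree closely even though the raw bonuses differ by a factor of 1000, which is what makes a single `beta` usable across training.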

Curiosity in practice

If you are adding curiosity to an agent, the pragmatic default is RND on top of PPO: two small networks, an extra distillation loss, a normalized intrinsic reward summed into the advantage, and ideally two value heads (one for extrinsic, one for intrinsic) since the two rewards have very different horizons. Reach for ICM when you need the controllable-features filter because your observations are full of distractors. Use libraries — RLeXplore bundles many intrinsic-reward methods, and general RL frameworks host PPO baselines to attach them to.
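
The two-value-head idea can be sketched as follows. This is a simplified illustration, not a PPO implementation: it uses plain Monte Carlo returns instead of GAE, takes the value estimates as given arrays, and the discounts and `beta` are illustrative defaults chosen to show that the two streams can use different horizons.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Backward pass computing G_t = r_t + gamma * G_{t+1}."""
    returns, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def combined_advantage(r_ext, r_int, v_ext, v_int,
                       gamma_ext=0.999, gamma_int=0.99, beta=0.5):
    """Separate advantages per reward stream, summed with weight beta on the intrinsic one."""
    adv_ext = discounted_returns(r_ext, gamma_ext) - v_ext
    adv_int = discounted_returns(r_int, gamma_int) - v_int
    return adv_ext + beta * adv_int

# Toy rollout: extrinsic reward only at the end, intrinsic reward only at the start.
r_ext = np.array([0.0, 0.0, 1.0])
r_int = np.array([1.0, 0.0, 0.0])
zeros = np.zeros(3)  # stand-in value estimates from the two heads
advantage = combined_advantage(r_ext, r_int, v_ext=zeros, v_int=zeros)
print(advantage)
```

Keeping the streams separate until this final sum lets each value head learn targets on its own scale and horizon.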

Curiosity is best seen as a complement to the broader exploration vs exploitation toolkit, not a silver bullet — pair it with good environment design and, where possible, a little extrinsic signal to anchor the goal. Building and benchmarking these exploration stacks at scale is its own tooling problem; see the list of RL environment startups.

Researcher takes

Jeff Clune on hard-exploration games: a minimal intrinsic signal — rewarding the agent simply for visiting new states within its lifetime — was enough to match the state of the art on Montezuma’s Revenge.

Frequently asked questions

What’s the difference between ICM and RND?

Both reward prediction error, but on different targets. ICM predicts the next state’s features given the action, in a feature space trained to capture only controllable things — principled but heavier. RND predicts the output of a fixed random network on the current state — no dynamics model, far simpler, and more robust to stochastic transitions. RND is the usual first choice; ICM when you specifically need its noise-filtering features.

Is a curiosity bonus the same as reward shaping?

Yes — it is a learned, state-dependent shaping term added to the reward. The key difference from classic reward shaping is that it’s self-generated and non-stationary: the bonus shrinks as the agent learns a region. To avoid changing the optimal policy, you generally want it to vanish over training so it influences how the agent explores, not what it ultimately optimizes.

Can an agent learn with no extrinsic reward at all?

Often, surprisingly far. OpenAI’s large-scale study showed agents driven purely by curiosity learn to play many Atari games competently, because in human-designed games progress and novelty happen to be correlated. But pure curiosity has no notion of the task — when novelty and the goal diverge, you still need an extrinsic signal to point the way.

How does curiosity relate to exploration like ε-greedy?

ε-greedy explores by injecting random actions — undirected and blind. Curiosity explores by directing the agent toward states it judges novel or informative, which is dramatically more sample-efficient in sparse, high-dimensional worlds. It’s directed exploration baked into the reward rather than bolted onto the action selection.
