reinforcement-learning.com
// FOUNDATIONS

Exploration vs Exploitation

The exploration-exploitation tradeoff in reinforcement learning: epsilon-greedy, UCB, Thompson sampling, regret, and deep RL exploration like curiosity and RND.

Updated 2026-06-07 15 min read
Key takeaways
  • An RL agent must choose between exploiting the best action it knows and exploring to discover something better — it can't do both at once.
  • Pure greedy behavior gets stuck on the first decent option; pure random behavior never cashes in what it learns. The art is balancing the two.
  • Classic strategies trade simplicity for efficiency: epsilon-greedy (dumb but robust), UCB (optimism under uncertainty), and Thompson sampling (Bayesian).
  • In deep RL with sparse rewards, random exploration fails — agents need directed exploration like curiosity, count bonuses, and Random Network Distillation.

What is the exploration-exploitation tradeoff?

Every learning agent faces the same fork in the road. It can exploit — pick the action that looks best given everything it has learned so far — or it can explore — try something uncertain in the hope of discovering a better option. The catch is that information and reward are coupled: the only way to know whether an untried action is good is to try it, and trying it means giving up the reward you’d have collected by playing it safe.

This is the exploration-exploitation dilemma, and it sits at the heart of reinforcement learning. It has no free answer. Explore too little and you converge prematurely on a mediocre policy, never learning that a much better action existed. Explore too much and you keep paying the cost of trying things you already know are bad. A good agent shifts the balance over time: explore aggressively when it knows little, exploit confidently once it does.

Agentwhich action?Exploitbest known rewardExploreuncertain, maybe betterThe tradeoffsafe, no new infocostly, gains infoshift from explore → exploit as confidence grows
The dilemma in one picture: exploitation pulls toward the current best-known action; exploration pulls toward uncertain actions that might be better. Good strategies blend the two and shift the balance over time.
▶ RL Course by David Silver — Lecture 9: Exploration and Exploitation (the canonical lecture)

The multi-armed bandit: the dilemma in its purest form

Strip away states and transitions and you get the cleanest version of the problem: the multi-armed bandit. Imagine a row of kk slot machines (“one-armed bandits”). Each arm aa pays out from an unknown reward distribution with mean μa\mu_a. On each of TT rounds you pull one arm and observe its reward. Your goal is to maximize total reward — but you don’t know which arm is best, and the only way to learn is to pull.

The natural way to score a strategy is regret: how much reward you lost compared to an oracle that pulls the best arm every time. If μ\*=maxaμa\mu^\* = \max_a \mu_a is the best arm’s mean, the expected regret after TT rounds is

RT=Tμ\*E[t=1Tμat]R_T = T\,\mu^\* - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{a_t}\Big]

A strategy that explores too little has linear regret (it keeps pulling a suboptimal arm forever). The remarkable result of bandit theory is that the best strategies achieve logarithmic regret — they identify the best arm fast enough that the cost of exploration grows only like logT\log T.

O(log T)
optimal regret growth for stochastic bandits
Ω(log T)
matching lower bound — UCB is optimal up to a constant
1933
Thompson sampling first proposed (clinical trials)

Four classic strategies

1
Greedy — pure exploitation (the trap)

Always pick the action with the highest estimated value Q^(a)\hat{Q}(a). Simple, and broken: an unlucky early sample can make the truly-best arm look bad, and a greedy agent will never pull it again to find out. Greedy has linear regret. It’s the cautionary baseline, not a real strategy.

2
Epsilon-greedy — exploit, but randomly explore

With probability 1ε1-\varepsilon act greedily; with probability ε\varepsilon pick a random action. Dead simple and surprisingly robust — it’s the default in Q-learning and DQN. Its weakness: exploration is undirected. It wastes pulls on actions already known to be bad just as readily as on promising-but-uncertain ones. Decaying ε\varepsilon over time (high early, low late) helps.

3
UCB — optimism in the face of uncertainty

Don’t explore randomly — explore the action you’re most uncertain about. UCB1 (Auer et al., 2002) adds an exploration bonus that shrinks as an arm is pulled more:

at=argmaxa[Q^t(a)+2lntNt(a)]a_t = \arg\max_a \left[\, \hat{Q}_t(a) + \sqrt{\frac{2\ln t}{N_t(a)}} \,\right]

The bonus is large for rarely-pulled arms (small Nt(a)N_t(a)) and decays as evidence accumulates. UCB achieves the optimal O(logT)O(\log T) regret and is deterministic — no randomness needed.

4
Thompson sampling — be Bayesian

Keep a posterior distribution over each arm’s mean. Each round, sample one value from each arm’s posterior and play the arm with the highest sample, then update. Arms you’re unsure about have wide posteriors, so they occasionally sample high and get explored; proven winners get played most of the time. Often the best empirical performer, and also O(logT)O(\log T).

Go deeper: why the UCB bonus has that exact shape

The 2lnt/Nt(a)\sqrt{2\ln t / N_t(a)} term is a confidence interval from Hoeffding’s inequality. If you’ve pulled arm aa exactly Nt(a)N_t(a) times, the true mean lies within roughly (lnt)/Nt(a)\sqrt{(\ln t)/N_t(a)} of your estimate with high probability. UCB acts on the optimistic end of that interval — it assumes each arm is as good as its uncertainty allows. If the arm really is that good, great, you found it. If not, you pull it, the interval shrinks, and the optimism evaporates. This “optimism in the face of uncertainty” principle is the single most reusable idea in exploration. Auer, Cesa-Bianchi & Fischer prove the resulting regret is bounded by roughly a:Δa>08lnTΔa\sum_{a:\,\Delta_a>0} \frac{8\ln T}{\Delta_a}, where Δa=μ\*μa\Delta_a = \mu^\* - \mu_a is the gap to the best arm.

Here’s how the four behave on the same problem:

StrategyExploration is…RegretTuningNotes
Greedynonelinear (bad)gets stuck; baseline only
Epsilon-greedyrandom, undirectedlinear (const ε) / log (decaying ε)ε schedulesimple, robust, ubiquitous in deep RL
UCBdirected by uncertaintyO(logT)O(\log T) optimalexploration constantdeterministic, strong guarantees
Thompson samplingdirected by posteriorO(logT)O(\log T) optimalchoice of prioroften best in practice; needs a model

From bandits to full RL

In a bandit, exploration costs you one round of reward. In a sequential MDP it’s far harder, because actions have consequences: a single exploratory move can take you somewhere new, and to benefit you may need a whole sequence of unlucky-looking actions before any reward appears. This is the deep exploration problem — and it’s why naive epsilon-greedy collapses in hard environments.

The poster child is Atari’s Montezuma’s Revenge. Rewards are extremely sparse — you must climb ladders, dodge enemies, and grab a key before anything good happens. A DQN agent doing epsilon-greedy exploration essentially never stumbles onto the key by random button-mashing, and scores zero. The reward signal alone gives the agent no reason to explore systematically.

Random (ε-greedy)Directed (curiosity / novelty)startrewardstalls far from rewardstartrewarddrives toward novel states
Why random exploration fails on sparse rewards: epsilon-greedy diffuses outward slowly and rarely reaches distant reward, while directed exploration (curiosity / novelty bonuses) pushes the agent toward unseen states.

Directed exploration in deep RL

When the reward is too sparse to guide exploration, the fix is to manufacture an exploration signal — an intrinsic reward that pays the agent for visiting novel states, added on top of the (sparse) extrinsic reward. The agent is then optimized normally, but it now has a reason to seek out the unfamiliar.

Count-based bonuses

Reward states inversely to how often they’ve been visited: a bonus like 1/N(s)1/\sqrt{N(s)}. A direct generalization of UCB to MDPs. For huge/continuous state spaces, use pseudo-counts from a density model since exact counts are impossible.

Curiosity / prediction error

Reward the agent for states it can’t predict. The Intrinsic Curiosity Module (Pathak et al., 2017) measures the error of a learned forward model in feature space — high error means “novel,” so the agent is drawn there. Learns to play Mario and VizDoom with no extrinsic reward.

Random Network Distillation

RND (OpenAI, 2018): train a network to predict the output of a fixed, random network on each state. Prediction error is high for unfamiliar states. Simple, robust, and the first method to beat average human performance on Montezuma’s Revenge without demos.

Go-Explore

First remember and return to promising states, then explore from them — fixing the “detachment” problem where agents forget frontiers they’d discovered. Cracked Montezuma’s Revenge and Pitfall with record scores.

Go deeper: optimism, posterior sampling, and entropy bonuses in deep RL

Three threads from bandit theory scale up. (1) Optimism / count bonuses generalize UCB — pseudo-count methods (CTS, PixelCNN density models) gave the first big Montezuma’s Revenge gains. (2) Posterior sampling generalizes Thompson sampling — Bootstrapped DQN trains an ensemble of value heads and acts greedily w.r.t. a randomly sampled head per episode, giving temporally-consistent “deep” exploration that ε-greedy can’t. (3) Entropy regularization is the workhorse in policy-gradient methods: adding +βH(π(s))+\beta\,\mathcal{H}(\pi(\cdot\mid s)) to the objective keeps the policy stochastic and discourages premature collapse. It’s the exploration mechanism in PPO, actor-critic, and SAC (whose maximum-entropy objective makes exploration a first-class goal). See policy gradients.

How the tradeoff shows up across modern RL

Exploration isn’t just a bandit footnote — it shapes design choices everywhere in the field:

SettingHow exploration is handled
Tabular / DQNε-greedy with a decay schedule; optimistic initialization
Policy gradients / PPOentropy bonus keeps the policy stochastic (PPO)
Max-entropy RL (SAC)exploration baked into the objective as an entropy term
Hard-exploration gamescuriosity, RND, count bonuses, Go-Explore
Model-based RLplan to reduce model uncertainty (information-directed)
Offline RLthe opposite problem — no exploration allowed; must stay near the data
RLHF / RL for reasoningsampling temperature and KL/entropy terms control how far the LLM strays from its prior

A short history of exploration

1933
Thompson sampling
Thompson proposes posterior sampling for choosing between medical treatments — the first principled exploration strategy, rediscovered decades later.
1985
Lai & Robbins lower bound
Lai and Robbins prove the Ω(log T) regret lower bound, establishing what “optimal” exploration even means.
2002
UCB1
Auer, Cesa-Bianchi & Fischer give a simple algorithm achieving the optimal log-regret bound: optimism in the face of uncertainty, made concrete.
2016
Count-based deep exploration
Pseudo-count bonuses (Bellemare et al.) bring UCB-style optimism to Atari, making real progress on Montezuma’s Revenge.
2017
Curiosity (ICM)
Pathak et al. show prediction-error curiosity drives exploration with no extrinsic reward at all.
2018
RND & Go-Explore
Random Network Distillation and (2019) Go-Explore finally crack Montezuma’s Revenge and Pitfall, beating human scores on the hardest exploration benchmarks.

Limitations and open problems

  • Sparse/deceptive rewards remain hard. Many real tasks give almost no signal until late, and some give misleading early signal that lures agents into local optima.
  • Intrinsic rewards can be gamed. The noisy-TV problem and reward-hacking-style behaviors mean novelty bonuses need careful design.
  • Exploration doesn’t transfer cleanly. A novelty signal tuned for one game often fails on the next; principled, general-purpose exploration is still open.
  • Safe exploration. In robotics, healthcare or finance, an exploratory action can be catastrophic — you can’t just try random moves. Constrained and risk-aware exploration is an active area.
  • Sample efficiency. Even good strategies need huge interaction budgets; closing the gap to human-like one-shot exploration is unsolved.

Researcher takes

Clune, who led Go-Explore and Deep Curiosity Search, makes a pointed argument about why standard RL exploration is inefficient: agents explore by adding noise to reward-maximizing behavior, whereas humans deliberately explore to acquire information — a distinction motivating his First-Explore work.

Frequently asked questions

What is the difference between exploration and exploitation?

Exploitation means choosing the action that currently looks best, to maximize immediate reward. Exploration means choosing a less-certain action to gather information that might lead to higher reward later. The tradeoff exists because you can only do one per decision, and the only way to learn an action’s value is to try it.

Why is epsilon-greedy so popular if UCB and Thompson sampling are better?

Because it’s trivially simple, has essentially no assumptions, and works “well enough” across an enormous range of problems — including large deep-RL settings where maintaining UCB counts or Bayesian posteriors is awkward. UCB and Thompson sampling have better guarantees, but ε-greedy with a decay schedule is the pragmatic default in DQN and friends.

What is regret in the bandit setting?

Regret is the total reward you lost by not always playing the best arm. Formally it’s Tμ\*E[tμat]T\mu^\* - \mathbb{E}[\sum_t \mu_{a_t}]. The best stochastic-bandit strategies achieve logarithmic regret, O(logT)O(\log T) — meaning the per-step cost of exploration vanishes as you play longer.

How do agents explore when rewards are extremely sparse?

They add an intrinsic reward for novelty on top of the sparse extrinsic reward. Approaches include count-based bonuses, curiosity (prediction error, as in ICM), Random Network Distillation, and Go-Explore. These turn “I haven’t seen this before” into a reward signal, which directs exploration toward unvisited parts of the environment.

Key papers

What is reinforcement learning? · Markov decision processes · Value functions · Q-learning · Deep Q-networks · Policy gradients · PPO · Offline RL