reinforcement-learning.com
// FOUNDATIONS

Multi-Armed Bandits

The multi-armed bandit problem explained: exploration vs exploitation, regret, and the core algorithms — epsilon-greedy, UCB, and Thompson sampling — with the math.

Updated 2026-06-08 15 min read
Key takeaways
  • A multi-armed bandit is the simplest reinforcement learning problem: one state, several actions, and the pure exploration–exploitation dilemma stripped of everything else.
  • Success is measured by regret — how much reward you lost versus always pulling the best arm. The goal is to make regret grow slowly.
  • Three canonical algorithms: epsilon-greedy (explore at random), UCB (optimism in the face of uncertainty), and Thompson sampling (probability-matching by sampling).
  • Lai & Robbins proved regret must grow at least logarithmically; UCB and Thompson sampling match that bound, and contextual bandits power real recommendation and ad systems.

What is a multi-armed bandit?

Imagine a row of slot machines — old gambling machines were nicknamed “one-armed bandits.” Each machine (an arm) pays out from its own unknown probability distribution. You have a fixed number of pulls. Which arm do you play, and how often? Pull the arm that has looked best so far and you might be sitting on a mediocre machine while a better one goes untried. Keep trying new arms and you waste pulls on duds. That tension — exploit what you know versus explore to learn more — is the entire problem.

The multi-armed bandit (MAB) is the simplest possible reinforcement learning problem. A full Markov decision process has states that change in response to your actions; a bandit has just one state. There is no long-term consequence, no delayed reward, no credit assignment across time. What is left is the pure kernel of RL: exploration vs exploitation. Master it here, and you understand the dilemma that haunts every harder RL setting.

Arm 1q*=0.3Arm 2 (best)q*=0.7Arm kq*=0.5Agentchoose arm A(t)Reward R(t)noisy sampleupdate estimate, repeat
A k-armed bandit: each arm has an unknown reward distribution with mean q*(a). The agent repeatedly chooses an arm and observes a noisy reward, trying to find and favor the best arm.

The formal setup

A k-armed bandit has kk actions. Each action aa has a fixed but unknown value — its expected reward:

q(a)=E[RtAt=a]q_*(a) = \mathbb{E}[R_t \mid A_t = a]

At each timestep tt you select an action AtA_t, receive a reward RtR_t drawn from that arm’s distribution, and update your estimate Qt(a)Q_t(a) of q(a)q_*(a). The simplest estimate is the sample average of rewards seen from arm aa so far, which can be maintained incrementally:

Qn+1=Qn+1n(RnQn)Q_{n+1} = Q_n + \frac{1}{n}\big(R_n - Q_n\big)

This “new estimate = old estimate + step-size × (target − old estimate)” form is the same update rule that reappears throughout RL in temporal-difference learning and value functions. A greedy action is simply argmaxaQt(a)\arg\max_a Q_t(a) — picking the arm that currently looks best.

Regret: how we keep score

You cannot grade a bandit algorithm by total reward alone — that depends on the arms. The standard yardstick is regret: how much worse you did than an oracle that always pulled the best arm. If μ=maxaq(a)\mu^* = \max_a q_*(a) is the best arm’s value, the expected cumulative regret after TT rounds is:

Regret(T)=TμE ⁣[t=1TRt]=aΔaE[NT(a)]\mathrm{Regret}(T) = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} R_t\right] = \sum_{a}\Delta_a\,\mathbb{E}[N_T(a)]

where Δa=μq(a)\Delta_a = \mu^* - q_*(a) is the suboptimality gap of arm aa and NT(a)N_T(a) is how many times you pulled it. The message is sharp: regret is the gap of each bad arm times how often you pull it. A good algorithm pulls clearly-bad arms only a handful of times and concentrates pulls on near-optimal ones.

O(log T)
Lai & Robbins lower bound on regret for any good algorithm
1985
Year that logarithmic bound was proved
2002
UCB1 — a simple algorithm that matches it (Auer et al.)

The celebrated Lai–Robbins lower bound (1985) says that for any consistent algorithm, regret must grow at least logarithmically in TT — you can never drive it below O(logT)O(\log T), with a constant set by the KL divergence between each suboptimal arm and the best one. You cannot escape paying to learn; the best you can do is pay logarithmically. The art is hitting that floor.

The three canonical algorithms

1
Epsilon-greedy — explore by coin flip

The simplest workable strategy. Most of the time act greedily; with small probability ε\varepsilon pick a random arm:

At={argmaxaQt(a)with prob. 1εrandom armwith prob. εA_t = \begin{cases} \arg\max_a Q_t(a) & \text{with prob. } 1-\varepsilon \\ \text{random arm} & \text{with prob. } \varepsilon \end{cases}

Robust and trivial to implement, but it explores blindly — it wastes pulls on arms already known to be bad, and a fixed ε\varepsilon never stops exploring, so its regret grows linearly. Decaying ε\varepsilon over time recovers logarithmic regret.

2
UCB — optimism in the face of uncertainty

Instead of exploring at random, explore the arm that could be best. UCB1 adds an exploration bonus that is large for rarely-tried arms and shrinks as you gather data:

At=argmaxa[Qt(a)+clntNt(a)]A_t = \arg\max_a \left[\, Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\, \right]

The first term is exploitation (the estimate); the second is exploration (the confidence width). An arm gets pulled if it is either genuinely good or under-explored — exactly the right reason to try something. UCB1 achieves O(logT)O(\log T) regret.

3
Thompson sampling — bet on the posterior

A Bayesian approach: keep a probability distribution over each arm’s value, sample one value from each, and pull the arm with the highest sample. For Bernoulli rewards, each arm has a Beta posterior updated by its wins and losses:

θaBeta(αa,βa),At=argmaxaθa\theta_a \sim \mathrm{Beta}(\alpha_a, \beta_a), \quad A_t = \arg\max_a \theta_a

Arms you are unsure about have wide posteriors, so they occasionally produce high samples and get explored. Often the best empirical performer, also O(logT)O(\log T).

estimated valueArm 1Arm 2Arm 3highest upper bound → pick arm 3
Optimism in the face of uncertainty (UCB). Each arm's bar is its estimate Q(a); the whisker is the confidence bonus. UCB picks the highest upper tip — here arm 3 wins on uncertainty despite a lower estimate, so it gets explored.
Go deeper: why the UCB bonus has that exact shape

The lnt/Nt(a)\sqrt{\ln t / N_t(a)} term is Hoeffding’s inequality wearing a disguise. With Nt(a)N_t(a) samples, the sample mean concentrates around the true mean within roughly ln(1/δ)/Nt(a)\sqrt{\ln(1/\delta)/N_t(a)} with probability 1δ1-\delta. Setting the failure probability δ\delta to shrink like t4t^{-4} makes the numerator lnt\ln t. So UCB is not a heuristic: the bonus is a high-probability upper confidence bound on the arm’s true value, and “act greedily with respect to the optimistic estimate” provably yields logarithmic regret. The constant cc tunes how aggressively you trust that bound. The full finite-time proof is in Auer, Cesa-Bianchi & Fischer (2002).

Choosing an algorithm

Reach for Thompson sampling

The usual default. Strong empirical performance, naturally handles delayed and batched feedback, and extends cleanly to contextual settings. Needs a posterior you can sample — easy for Bernoulli (Beta) and Gaussian rewards.

Reach for UCB

When you want deterministic, reproducible behavior and clean theoretical guarantees without specifying a prior. Tighter to reason about; can be slow to react to non-stationary changes unless you discount old data.

AlgorithmExploration mechanismRegretNotes
Greedynonelinear O(T)O(T)can lock onto a bad arm forever
Epsilon-greedyrandom, with prob. ε\varepsilonlinear (fixed ε\varepsilon); O(logT)O(\log T) if decayedsimplest baseline; explores blindly
Optimistic initinflated starting QQ valueslow early, then greedyone-line trick; only works while estimates are young
UCB1confidence bonusO(logT)O(\log T)optimism in the face of uncertainty; no prior needed
Thompson samplingposterior samplingO(logT)O(\log T)often best in practice; Bayesian

Contextual bandits: where bandits earn their keep

Plain bandits assume every decision is identical. Real systems get a context first — who the user is, the time of day, the device — and the best arm depends on it. A contextual bandit observes a feature vector xtx_t, chooses an action, and receives a reward; it learns a policy mapping context to action rather than a single best arm.

This is the workhorse behind news and content recommendation, ad selection, and personalized layouts. The canonical algorithm is LinUCB (Li et al., 2010), which assumes the expected reward is linear in the context, E[Rtxt,a]=xtθa\mathbb{E}[R_t \mid x_t, a] = x_t^\top \theta_a, and applies UCB on top of online ridge regression. Yahoo used it to personalize front-page news, reporting double-digit click-through gains over a context-free baseline.

Go deeper: the Gittins index and Bayesian optimality

For a Bayesian bandit with geometric discounting, there is a genuinely optimal policy — the Gittins index (Gittins, 1979). It assigns each arm a scalar “index” computed in isolation from the others, and the optimal move is always to pull the arm with the highest index. This decomposition is remarkable: it turns a coupled kk-arm planning problem into kk independent one-arm calculations. The catch is that computing Gittins indices is expensive and the result is fragile to the discounting and independence assumptions, so in practice UCB and Thompson sampling — which are near-optimal, cheap, and robust — dominate. The definitive modern reference is Lattimore & Szepesvári’s free textbook Bandit Algorithms.

A short history

1933
Thompson sampling
William R. Thompson proposes posterior sampling for choosing between medical treatments — decades ahead of its time, rediscovered in the 2010s.
1952
The problem is named
Herbert Robbins formalizes the sequential design of experiments — the modern multi-armed bandit problem.
1979
Gittins index
John Gittins gives the Bayes-optimal policy for discounted bandits via per-arm indices.
1985
The logarithmic lower bound
Lai and Robbins prove regret must grow at least as O(log T) for any consistent policy.
2002
UCB1
Auer, Cesa-Bianchi and Fischer give a simple algorithm with finite-time logarithmic regret.
2010–12
Contextual bandits go to production
LinUCB powers Yahoo news recommendation; Thompson sampling gets its modern regret analysis and an industry revival.

Where bandits are used

  • A/B testing on autopilot. Classic A/B tests split traffic 50/50 for the whole experiment. A bandit shifts traffic toward the winning variant as evidence accumulates, cutting the reward lost to the losing arm — the same regret idea applied to product experiments.
  • Recommendation and ranking. Choosing which article, video, or product to surface, often as a contextual bandit keyed on user features.
  • Online advertising. Selecting which ad to show to maximize click-through, with the budget and feedback delays that make real systems hard.
  • Clinical trials. The original motivation — adaptively assigning more patients to the treatment that is working.
  • Hyperparameter search. Methods like Hyperband frame configuration selection as a bandit over partially-trained models.

Many production RL and experimentation systems are built and operated by specialized vendors — see the RL environment and experimentation vendors.

Bandits vs full reinforcement learning

Multi-armed banditFull RL (MDP)
Statesone (or a context that does not change with actions)many, and actions change them
Rewardimmediatecan be delayed across many steps
Core challengeexploration vs exploitationexploration + temporal credit assignment
Examplewhich ad to show nowwhich move wins the game 40 moves later

A bandit is what is left when you delete the “future” from RL. That is exactly why it is the right place to start: the exploration–exploitation machinery you learn here — optimism, posterior sampling, confidence bounds — reappears, scaled up, in Q-learning, deep Q-networks, and modern exploration methods like curiosity and intrinsic motivation.

Frequently asked questions

Is a multi-armed bandit reinforcement learning?

Yes — it is the simplest case of RL: a single-state Markov decision process with immediate rewards. It contains RL’s exploration–exploitation dilemma but drops the harder parts (changing states, delayed reward, temporal credit assignment). That is why most RL courses start here.

UCB or Thompson sampling — which should I use?

Thompson sampling is the more common default: it tends to perform better empirically, handles delayed and batched feedback gracefully, and extends naturally to contextual settings. Reach for UCB when you want deterministic, reproducible behavior or cleaner theoretical guarantees without committing to a Bayesian prior. Both achieve logarithmic regret.

What is regret and why minimize it instead of maximizing reward?

Regret is the reward you lost compared to an oracle that always pulled the best arm. It is the better yardstick because total reward depends on the specific arms, whereas regret isolates how good your decisions were. Sublinear regret means your average per-round loss versus the oracle shrinks to zero over time.

How is a contextual bandit different from full RL?

A contextual bandit sees a context, picks an action, and gets an immediate reward — but the action does not change the next context. Full RL adds exactly that: your action moves the environment to a new state with downstream consequences, forcing you to assign credit across time. If your actions have lasting effects, you need an MDP, not a bandit.

Key papers and resources

Exploration vs exploitation · What is reinforcement learning? · Markov decision processes · Value functions · Q-learning · Curiosity and intrinsic motivation