Multi-Armed Bandits, Explained

Key takeaways

A multi-armed bandit is the simplest reinforcement learning problem: one state, several actions, and the pure exploration–exploitation dilemma stripped of everything else.
Success is measured by regret — how much reward you lost versus always pulling the best arm. The goal is to make regret grow slowly.
Three canonical algorithms: epsilon-greedy (explore at random), UCB (optimism in the face of uncertainty), and Thompson sampling (probability-matching by sampling).
Lai & Robbins proved regret must grow at least logarithmically; UCB and Thompson sampling match that bound, and contextual bandits power real recommendation and ad systems.

What is a multi-armed bandit?

Imagine a row of slot machines — old gambling machines were nicknamed “one-armed bandits.” Each machine (an arm) pays out from its own unknown probability distribution. You have a fixed number of pulls. Which arm do you play, and how often? Pull the arm that has looked best so far and you might be sitting on a mediocre machine while a better one goes untried. Keep trying new arms and you waste pulls on duds. That tension — exploit what you know versus explore to learn more — is the entire problem.

The multi-armed bandit (MAB) is the simplest possible reinforcement learning problem. A full Markov decision process has states that change in response to your actions; a bandit has just one state. There is no long-term consequence, no delayed reward, no credit assignment across time. What is left is the pure kernel of RL: exploration vs exploitation. Master it here, and you understand the dilemma that haunts every harder RL setting.

A k-armed bandit: each arm has an unknown reward distribution with mean q*(a). The agent repeatedly chooses an arm and observes a noisy reward, trying to find and favor the best arm.

The formal setup

A k-armed bandit has $k$ actions. Each action $a$ has a fixed but unknown value — its expected reward:

q_*(a) = \mathbb{E}[R_t \mid A_t = a]

At each timestep $t$ you select an action $A_t$ , receive a reward $R_t$ drawn from that arm’s distribution, and update your estimate $Q_t(a)$ of $q_*(a)$ . The simplest estimate is the sample average of rewards seen from arm $a$ so far, which can be maintained incrementally:

Q_{n+1} = Q_n + \frac{1}{n}\big(R_n - Q_n\big)

This “new estimate = old estimate + step-size × (target − old estimate)” form is the same update rule that reappears throughout RL in temporal-difference learning and value functions. A greedy action is simply $\arg\max_a Q_t(a)$ — picking the arm that currently looks best.

Regret: how we keep score

You cannot grade a bandit algorithm by total reward alone — that depends on the arms. The standard yardstick is regret: how much worse you did than an oracle that always pulled the best arm. If $\mu^* = \max_a q_*(a)$ is the best arm’s value, the expected cumulative regret after $T$ rounds is:

\mathrm{Regret}(T) = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} R_t\right] = \sum_{a}\Delta_a\,\mathbb{E}[N_T(a)]

where $\Delta_a = \mu^* - q_*(a)$ is the suboptimality gap of arm $a$ and $N_T(a)$ is how many times you pulled it. The message is sharp: regret is the gap of each bad arm times how often you pull it. A good algorithm pulls clearly-bad arms only a handful of times and concentrates pulls on near-optimal ones.

O(log T)

Lai & Robbins lower bound on regret for any good algorithm

1985

Year that logarithmic bound was proved

2002

UCB1 — a simple algorithm that matches it (Auer et al.)

The celebrated Lai–Robbins lower bound (1985) says that for any consistent algorithm, regret must grow at least logarithmically in $T$ — you can never drive it below $O(\log T)$ , with a constant set by the KL divergence between each suboptimal arm and the best one. You cannot escape paying to learn; the best you can do is pay logarithmically. The art is hitting that floor.

The three canonical algorithms

Epsilon-greedy — explore by coin flip

The simplest workable strategy. Most of the time act greedily; with small probability $\varepsilon$ pick a random arm:

A_t = \begin{cases} \arg\max_a Q_t(a) & \text{with prob. } 1-\varepsilon \\ \text{random arm} & \text{with prob. } \varepsilon \end{cases}

Robust and trivial to implement, but it explores blindly — it wastes pulls on arms already known to be bad, and a fixed $\varepsilon$ never stops exploring, so its regret grows linearly. Decaying $\varepsilon$ over time recovers logarithmic regret.

UCB — optimism in the face of uncertainty

Instead of exploring at random, explore the arm that could be best. UCB1 adds an exploration bonus that is large for rarely-tried arms and shrinks as you gather data:

A_t = \arg\max_a \left[\, Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\, \right]

The first term is exploitation (the estimate); the second is exploration (the confidence width). An arm gets pulled if it is either genuinely good or under-explored — exactly the right reason to try something. UCB1 achieves $O(\log T)$ regret.

Thompson sampling — bet on the posterior

A Bayesian approach: keep a probability distribution over each arm’s value, sample one value from each, and pull the arm with the highest sample. For Bernoulli rewards, each arm has a Beta posterior updated by its wins and losses:

\theta_a \sim \mathrm{Beta}(\alpha_a, \beta_a), \quad A_t = \arg\max_a \theta_a

Arms you are unsure about have wide posteriors, so they occasionally produce high samples and get explored. Often the best empirical performer, also $O(\log T)$ .

Optimism in the face of uncertainty (UCB). Each arm's bar is its estimate Q(a); the whisker is the confidence bonus. UCB picks the highest upper tip — here arm 3 wins on uncertainty despite a lower estimate, so it gets explored.

Go deeper: why the UCB bonus has that exact shape

The $\sqrt{\ln t / N_t(a)}$ term is Hoeffding’s inequality wearing a disguise. With $N_t(a)$ samples, the sample mean concentrates around the true mean within roughly $\sqrt{\ln(1/\delta)/N_t(a)}$ with probability $1-\delta$ . Setting the failure probability $\delta$ to shrink like $t^{-4}$ makes the numerator $\ln t$ . So UCB is not a heuristic: the bonus is a high-probability upper confidence bound on the arm’s true value, and “act greedily with respect to the optimistic estimate” provably yields logarithmic regret. The constant $c$ tunes how aggressively you trust that bound. The full finite-time proof is in Auer, Cesa-Bianchi & Fischer (2002).

Choosing an algorithm

Reach for Thompson sampling

The usual default. Strong empirical performance, naturally handles delayed and batched feedback, and extends cleanly to contextual settings. Needs a posterior you can sample — easy for Bernoulli (Beta) and Gaussian rewards.

Reach for UCB

When you want deterministic, reproducible behavior and clean theoretical guarantees without specifying a prior. Tighter to reason about; can be slow to react to non-stationary changes unless you discount old data.

Algorithm	Exploration mechanism	Regret	Notes
Greedy	none	linear $O(T)$	can lock onto a bad arm forever
Epsilon-greedy	random, with prob. $\varepsilon$	linear (fixed $\varepsilon$ ); $O(\log T)$ if decayed	simplest baseline; explores blindly
Optimistic init	inflated starting $Q$ values	low early, then greedy	one-line trick; only works while estimates are young
UCB1	confidence bonus	$O(\log T)$	optimism in the face of uncertainty; no prior needed
Thompson sampling	posterior sampling	$O(\log T)$	often best in practice; Bayesian

Contextual bandits: where bandits earn their keep

Plain bandits assume every decision is identical. Real systems get a context first — who the user is, the time of day, the device — and the best arm depends on it. A contextual bandit observes a feature vector $x_t$ , chooses an action, and receives a reward; it learns a policy mapping context to action rather than a single best arm.

This is the workhorse behind news and content recommendation, ad selection, and personalized layouts. The canonical algorithm is LinUCB (Li et al., 2010), which assumes the expected reward is linear in the context, $\mathbb{E}[R_t \mid x_t, a] = x_t^\top \theta_a$ , and applies UCB on top of online ridge regression. Yahoo used it to personalize front-page news, reporting double-digit click-through gains over a context-free baseline.

Go deeper: the Gittins index and Bayesian optimality

For a Bayesian bandit with geometric discounting, there is a genuinely optimal policy — the Gittins index (Gittins, 1979). It assigns each arm a scalar “index” computed in isolation from the others, and the optimal move is always to pull the arm with the highest index. This decomposition is remarkable: it turns a coupled $k$ -arm planning problem into $k$ independent one-arm calculations. The catch is that computing Gittins indices is expensive and the result is fragile to the discounting and independence assumptions, so in practice UCB and Thompson sampling — which are near-optimal, cheap, and robust — dominate. The definitive modern reference is Lattimore & Szepesvári’s free textbook Bandit Algorithms.

A short history

1933

Thompson sampling

William R. Thompson proposes posterior sampling for choosing between medical treatments — decades ahead of its time, rediscovered in the 2010s.

1952

The problem is named

Herbert Robbins formalizes the sequential design of experiments — the modern multi-armed bandit problem.

1979

Gittins index

John Gittins gives the Bayes-optimal policy for discounted bandits via per-arm indices.

1985

The logarithmic lower bound

Lai and Robbins prove regret must grow at least as O(log T) for any consistent policy.

2002

UCB1

Auer, Cesa-Bianchi and Fischer give a simple algorithm with finite-time logarithmic regret.

2010–12

Contextual bandits go to production

LinUCB powers Yahoo news recommendation; Thompson sampling gets its modern regret analysis and an industry revival.

Where bandits are used

A/B testing on autopilot. Classic A/B tests split traffic 50/50 for the whole experiment. A bandit shifts traffic toward the winning variant as evidence accumulates, cutting the reward lost to the losing arm — the same regret idea applied to product experiments.
Recommendation and ranking. Choosing which article, video, or product to surface, often as a contextual bandit keyed on user features.
Online advertising. Selecting which ad to show to maximize click-through, with the budget and feedback delays that make real systems hard.
Clinical trials. The original motivation — adaptively assigning more patients to the treatment that is working.
Hyperparameter search. Methods like Hyperband frame configuration selection as a bandit over partially-trained models.

Many production RL and experimentation systems are built and operated by specialized vendors — see the RL environment and experimentation vendors.

Bandits vs full reinforcement learning

	Multi-armed bandit	Full RL (MDP)
States	one (or a context that does not change with actions)	many, and actions change them
Reward	immediate	can be delayed across many steps
Core challenge	exploration vs exploitation	exploration + temporal credit assignment
Example	which ad to show now	which move wins the game 40 moves later

A bandit is what is left when you delete the “future” from RL. That is exactly why it is the right place to start: the exploration–exploitation machinery you learn here — optimism, posterior sampling, confidence bounds — reappears, scaled up, in Q-learning, deep Q-networks, and modern exploration methods like curiosity and intrinsic motivation.

Frequently asked questions

Is a multi-armed bandit reinforcement learning?

Yes — it is the simplest case of RL: a single-state Markov decision process with immediate rewards. It contains RL’s exploration–exploitation dilemma but drops the harder parts (changing states, delayed reward, temporal credit assignment). That is why most RL courses start here.

UCB or Thompson sampling — which should I use?

Thompson sampling is the more common default: it tends to perform better empirically, handles delayed and batched feedback gracefully, and extends naturally to contextual settings. Reach for UCB when you want deterministic, reproducible behavior or cleaner theoretical guarantees without committing to a Bayesian prior. Both achieve logarithmic regret.

What is regret and why minimize it instead of maximizing reward?

Regret is the reward you lost compared to an oracle that always pulled the best arm. It is the better yardstick because total reward depends on the specific arms, whereas regret isolates how good your decisions were. Sublinear regret means your average per-round loss versus the oracle shrinks to zero over time.

How is a contextual bandit different from full RL?

A contextual bandit sees a context, picks an action, and gets an immediate reward — but the action does not change the next context. Full RL adds exactly that: your action moves the environment to a new state with downstream consequences, forcing you to assign credit across time. If your actions have lasting effects, you need an MDP, not a bandit.

Key papers and resources

Some aspects of the sequential design of experiments — Robbins, 1952 — the problem, formalized.
Asymptotically efficient adaptive allocation rules — Lai & Robbins, 1985 — the logarithmic lower bound.
Finite-time Analysis of the Multiarmed Bandit Problem — Auer, Cesa-Bianchi & Fischer, 2002 — UCB1.
A Contextual-Bandit Approach to Personalized News Article Recommendation — Li et al., 2010 — LinUCB in production.
Bandit Algorithms — Lattimore & Szepesvári, 2020 — the free, definitive textbook.

Exploration vs exploitation · What is reinforcement learning? · Markov decision processes · Value functions · Q-learning · Curiosity and intrinsic motivation