- A multi-armed bandit is the simplest reinforcement learning problem: one state, several actions, and the pure exploration–exploitation dilemma stripped of everything else.
- Success is measured by regret — how much reward you lost versus always pulling the best arm. The goal is to make regret grow slowly.
- Three canonical algorithms: epsilon-greedy (explore at random), UCB (optimism in the face of uncertainty), and Thompson sampling (probability-matching by sampling).
- Lai & Robbins proved regret must grow at least logarithmically; UCB and Thompson sampling match that bound, and contextual bandits power real recommendation and ad systems.
What is a multi-armed bandit?
Imagine a row of slot machines — old gambling machines were nicknamed “one-armed bandits.” Each machine (an arm) pays out from its own unknown probability distribution. You have a fixed number of pulls. Which arm do you play, and how often? Pull the arm that has looked best so far and you might be sitting on a mediocre machine while a better one goes untried. Keep trying new arms and you waste pulls on duds. That tension — exploit what you know versus explore to learn more — is the entire problem.
The multi-armed bandit (MAB) is the simplest possible reinforcement learning problem. A full Markov decision process has states that change in response to your actions; a bandit has just one state. There is no long-term consequence, no delayed reward, no credit assignment across time. What is left is the pure kernel of RL: exploration vs exploitation. Master it here, and you understand the dilemma that haunts every harder RL setting.
The formal setup
A k-armed bandit has actions. Each action has a fixed but unknown value — its expected reward:
At each timestep you select an action , receive a reward drawn from that arm’s distribution, and update your estimate of . The simplest estimate is the sample average of rewards seen from arm so far, which can be maintained incrementally:
This “new estimate = old estimate + step-size × (target − old estimate)” form is the same update rule that reappears throughout RL in temporal-difference learning and value functions. A greedy action is simply — picking the arm that currently looks best.
Regret: how we keep score
You cannot grade a bandit algorithm by total reward alone — that depends on the arms. The standard yardstick is regret: how much worse you did than an oracle that always pulled the best arm. If is the best arm’s value, the expected cumulative regret after rounds is:
where is the suboptimality gap of arm and is how many times you pulled it. The message is sharp: regret is the gap of each bad arm times how often you pull it. A good algorithm pulls clearly-bad arms only a handful of times and concentrates pulls on near-optimal ones.
The celebrated Lai–Robbins lower bound (1985) says that for any consistent algorithm, regret must grow at least logarithmically in — you can never drive it below , with a constant set by the KL divergence between each suboptimal arm and the best one. You cannot escape paying to learn; the best you can do is pay logarithmically. The art is hitting that floor.
The three canonical algorithms
The simplest workable strategy. Most of the time act greedily; with small probability pick a random arm:
Robust and trivial to implement, but it explores blindly — it wastes pulls on arms already known to be bad, and a fixed never stops exploring, so its regret grows linearly. Decaying over time recovers logarithmic regret.
Instead of exploring at random, explore the arm that could be best. UCB1 adds an exploration bonus that is large for rarely-tried arms and shrinks as you gather data:
The first term is exploitation (the estimate); the second is exploration (the confidence width). An arm gets pulled if it is either genuinely good or under-explored — exactly the right reason to try something. UCB1 achieves regret.
A Bayesian approach: keep a probability distribution over each arm’s value, sample one value from each, and pull the arm with the highest sample. For Bernoulli rewards, each arm has a Beta posterior updated by its wins and losses:
Arms you are unsure about have wide posteriors, so they occasionally produce high samples and get explored. Often the best empirical performer, also .
Go deeper: why the UCB bonus has that exact shape
The term is Hoeffding’s inequality wearing a disguise. With samples, the sample mean concentrates around the true mean within roughly with probability . Setting the failure probability to shrink like makes the numerator . So UCB is not a heuristic: the bonus is a high-probability upper confidence bound on the arm’s true value, and “act greedily with respect to the optimistic estimate” provably yields logarithmic regret. The constant tunes how aggressively you trust that bound. The full finite-time proof is in Auer, Cesa-Bianchi & Fischer (2002).
Choosing an algorithm
The usual default. Strong empirical performance, naturally handles delayed and batched feedback, and extends cleanly to contextual settings. Needs a posterior you can sample — easy for Bernoulli (Beta) and Gaussian rewards.
When you want deterministic, reproducible behavior and clean theoretical guarantees without specifying a prior. Tighter to reason about; can be slow to react to non-stationary changes unless you discount old data.
| Algorithm | Exploration mechanism | Regret | Notes |
|---|---|---|---|
| Greedy | none | linear | can lock onto a bad arm forever |
| Epsilon-greedy | random, with prob. | linear (fixed ); if decayed | simplest baseline; explores blindly |
| Optimistic init | inflated starting values | low early, then greedy | one-line trick; only works while estimates are young |
| UCB1 | confidence bonus | optimism in the face of uncertainty; no prior needed | |
| Thompson sampling | posterior sampling | often best in practice; Bayesian |
Contextual bandits: where bandits earn their keep
Plain bandits assume every decision is identical. Real systems get a context first — who the user is, the time of day, the device — and the best arm depends on it. A contextual bandit observes a feature vector , chooses an action, and receives a reward; it learns a policy mapping context to action rather than a single best arm.
This is the workhorse behind news and content recommendation, ad selection, and personalized layouts. The canonical algorithm is LinUCB (Li et al., 2010), which assumes the expected reward is linear in the context, , and applies UCB on top of online ridge regression. Yahoo used it to personalize front-page news, reporting double-digit click-through gains over a context-free baseline.
Go deeper: the Gittins index and Bayesian optimality
For a Bayesian bandit with geometric discounting, there is a genuinely optimal policy — the Gittins index (Gittins, 1979). It assigns each arm a scalar “index” computed in isolation from the others, and the optimal move is always to pull the arm with the highest index. This decomposition is remarkable: it turns a coupled -arm planning problem into independent one-arm calculations. The catch is that computing Gittins indices is expensive and the result is fragile to the discounting and independence assumptions, so in practice UCB and Thompson sampling — which are near-optimal, cheap, and robust — dominate. The definitive modern reference is Lattimore & Szepesvári’s free textbook Bandit Algorithms.
A short history
Where bandits are used
- A/B testing on autopilot. Classic A/B tests split traffic 50/50 for the whole experiment. A bandit shifts traffic toward the winning variant as evidence accumulates, cutting the reward lost to the losing arm — the same regret idea applied to product experiments.
- Recommendation and ranking. Choosing which article, video, or product to surface, often as a contextual bandit keyed on user features.
- Online advertising. Selecting which ad to show to maximize click-through, with the budget and feedback delays that make real systems hard.
- Clinical trials. The original motivation — adaptively assigning more patients to the treatment that is working.
- Hyperparameter search. Methods like Hyperband frame configuration selection as a bandit over partially-trained models.
Many production RL and experimentation systems are built and operated by specialized vendors — see the RL environment and experimentation vendors.
Bandits vs full reinforcement learning
| Multi-armed bandit | Full RL (MDP) | |
|---|---|---|
| States | one (or a context that does not change with actions) | many, and actions change them |
| Reward | immediate | can be delayed across many steps |
| Core challenge | exploration vs exploitation | exploration + temporal credit assignment |
| Example | which ad to show now | which move wins the game 40 moves later |
A bandit is what is left when you delete the “future” from RL. That is exactly why it is the right place to start: the exploration–exploitation machinery you learn here — optimism, posterior sampling, confidence bounds — reappears, scaled up, in Q-learning, deep Q-networks, and modern exploration methods like curiosity and intrinsic motivation.
Frequently asked questions
Is a multi-armed bandit reinforcement learning?
Yes — it is the simplest case of RL: a single-state Markov decision process with immediate rewards. It contains RL’s exploration–exploitation dilemma but drops the harder parts (changing states, delayed reward, temporal credit assignment). That is why most RL courses start here.
UCB or Thompson sampling — which should I use?
Thompson sampling is the more common default: it tends to perform better empirically, handles delayed and batched feedback gracefully, and extends naturally to contextual settings. Reach for UCB when you want deterministic, reproducible behavior or cleaner theoretical guarantees without committing to a Bayesian prior. Both achieve logarithmic regret.
What is regret and why minimize it instead of maximizing reward?
Regret is the reward you lost compared to an oracle that always pulled the best arm. It is the better yardstick because total reward depends on the specific arms, whereas regret isolates how good your decisions were. Sublinear regret means your average per-round loss versus the oracle shrinks to zero over time.
How is a contextual bandit different from full RL?
A contextual bandit sees a context, picks an action, and gets an immediate reward — but the action does not change the next context. Full RL adds exactly that: your action moves the environment to a new state with downstream consequences, forcing you to assign credit across time. If your actions have lasting effects, you need an MDP, not a bandit.
Key papers and resources
- Some aspects of the sequential design of experiments — Robbins, 1952 — the problem, formalized.
- Asymptotically efficient adaptive allocation rules — Lai & Robbins, 1985 — the logarithmic lower bound.
- Finite-time Analysis of the Multiarmed Bandit Problem — Auer, Cesa-Bianchi & Fischer, 2002 — UCB1.
- A Contextual-Bandit Approach to Personalized News Article Recommendation — Li et al., 2010 — LinUCB in production.
- Bandit Algorithms — Lattimore & Szepesvári, 2020 — the free, definitive textbook.
Related
Exploration vs exploitation · What is reinforcement learning? · Markov decision processes · Value functions · Q-learning · Curiosity and intrinsic motivation