- Monte Carlo methods learn values straight from experience by averaging the actual returns observed after each state — no model of the environment required.
- They wait for an episode to finish, then update: this makes them unbiased but high-variance, and limits them to episodic tasks.
- Prediction estimates V or Q for a fixed policy (first-visit vs every-visit); control alternates evaluation and greedy improvement, using exploring starts or ε-soft policies to keep exploring.
- Off-policy MC uses importance sampling to learn about a target policy from data generated by a different behavior policy.
What are Monte Carlo methods?
Monte Carlo (MC) methods learn value functions and optimal policies purely from sampled experience — complete episodes of interaction — without ever building or assuming a model of the environment’s dynamics. The name borrows from the casino: just as you can estimate a roulette wheel’s odds by spinning it many times and averaging, an MC agent estimates the value of a state by playing out many episodes and averaging the returns it actually received.
This is the simplest answer to a hard question. Value functions are defined as expected returns, and dynamic programming computes them exactly — but only if you know the transition probabilities and rewards. Monte Carlo throws that requirement out. It asks: what return did I actually get when I followed my policy from this state? Average enough of those, and by the law of large numbers the average converges to the true expected value.
The return: what MC actually averages
Everything in MC rests on the return, the total discounted reward from a time step until the episode ends:
The value of a state under policy is the expected return, . Monte Carlo’s whole trick is to replace that expectation with an empirical average: collect many sample returns observed after visiting , and average them.
Because each return is a genuine, unbiased sample from the true distribution of returns, the sample mean converges almost surely to as the number of visits grows — the strong law of large numbers doing the heavy lifting.
First-visit vs every-visit
Within a single episode a state can be visited more than once. That raises a choice about which returns to average:
For each episode, use only the return following the first time the state is visited. Each such return is an independent sample, so the estimator is a clean i.i.d. average that converges to by the law of large numbers. This is the most studied variant.
Average the returns following every visit to the state, even repeats within one episode. The samples are no longer independent, but the estimator is still consistent and converges to — and it extends more naturally to function approximation.
Both converge to the same true value; they differ in bias and variance characteristics on the way there. In practice the difference is usually minor, and every-visit is often simpler to implement.
Set arbitrarily for all states and create an empty list of returns for each state.
Follow policy from start to termination, recording the full trajectory of states, actions, and rewards.
Walk the episode from the last step to the first, accumulating . Computing returns in reverse is a one-pass trick that avoids re-summing.
For each qualifying visit (first-visit: only the first occurrence), append to that state’s returns list and set to the running average. Repeat over many episodes.
Go deeper: incremental and constant-α updates
Storing every return is wasteful. The same average can be computed incrementally: after the -th return for a state,
where counts visits. This is the general “update toward the target by a step-size times the error” pattern that runs through all of RL. Replacing with a fixed step size gives a constant-α MC update that tracks a non-stationary problem (e.g. a policy that keeps changing) by gently forgetting old returns:
Notice the family resemblance to the TD update — the only difference is that the target is the full return rather than a bootstrapped estimate.
From prediction to control
Estimating for a fixed policy is prediction. To find a better policy — control — we follow the same generalized policy iteration loop as dynamic programming: alternate policy evaluation with greedy policy improvement.
There is one wrinkle. With no model, knowing alone does not tell you which action is best — you would need the transition dynamics to look one step ahead. So MC control estimates action-values instead. Then greedy improvement is trivial: pick .
Keeping exploration alive
Pure greedy control has a fatal flaw: if the policy is deterministic, you only ever observe one action per state, so for the other actions never gets estimated. This is the exploration vs exploitation problem in its starkest form. Two classic fixes:
- Exploring starts. Begin each episode at a random state-action pair, so every has a nonzero chance of being the start and getting sampled. This is the assumption behind Monte Carlo ES, which provably finds the optimal policy — but it is often unrealistic (you can’t always reset a real system into an arbitrary state).
- ε-soft policies. Drop exploring starts and instead make the policy itself explore: with probability take the greedy action, otherwise act randomly. This ε-greedy scheme guarantees every action keeps getting tried. It converges to the best ε-soft policy rather than the strictly optimal one — a small price for realism.
On-policy vs off-policy MC
The ε-greedy approach above is on-policy: you improve the very policy you use to gather data, so it must stay a little exploratory forever. The alternative is to separate the two roles, which leads to off-policy Monte Carlo (see on-policy vs off-policy).
In off-policy MC, a behavior policy generates the episodes (it can explore freely), while we learn the value of a different target policy (which can be the greedy, deterministic optimum). The catch: the returns we observe come from ‘s action choices, not ‘s. To correct for that mismatch we reweight each return by the importance sampling ratio — the relative probability that the target vs behavior policy would have produced that exact trajectory:
There are two ways to use these ratios:
| Estimator | Formula idea | Bias | Variance |
|---|---|---|---|
| Ordinary importance sampling | average of | unbiased | high — can be unbounded |
| Weighted importance sampling | -weighted average of | biased (vanishes as data grows) | much lower; preferred in practice |
Monte Carlo vs temporal difference
MC is one of the two foundational ways to do model-free value estimation; the other is temporal-difference (TD) learning. The contrast defines a whole axis of RL.
| Monte Carlo | Temporal Difference | |
|---|---|---|
| Target | full return | bootstrapped: |
| Update timing | end of episode | every step |
| Bootstrapping | no | yes |
| Bias | unbiased | biased (uses own estimates) |
| Variance | high (sum of many random rewards) | low |
| Works on continuing tasks | no | yes |
| Markov assumption | not required | exploited (helps if true) |
Neither dominates. MC’s unbiasedness and indifference to the Markov property make it robust when your state representation is imperfect; TD’s low variance and online updates usually make it faster and more practical. TD(λ) and n-step methods interpolate between the two: λ = 1 recovers Monte Carlo, λ = 0 recovers one-step TD, and intermediate values often beat both extremes.
Go deeper: why MC is high-variance but TD is biased
A Monte Carlo target is a sum of all the random rewards and random transitions from to termination. Every coin flip the environment makes along the way adds noise, so the variance of grows with episode length — but its expectation is exactly , so it is unbiased. A TD target, , depends on only one random reward and transition, so its variance is low — but it leans on the current (wrong) estimate , injecting bias that only disappears as the estimates converge. This bias-variance trade-off is the deepest single idea connecting the two families. The n-step return splits the difference by bootstrapping after real rewards instead of one.
A worked setting: Blackjack
The canonical MC example from Sutton & Barto is Blackjack. It is a perfect fit: the game is naturally episodic (each hand ends), and the true transition dynamics are awkward to write down (the dealer’s hidden behavior, the deck) — but trivial to sample by just dealing hands.
Run thousands of hands under a fixed policy (“stick on 20 or 21, otherwise hit”) and average the returns from each state — defined by your sum, the dealer’s showing card, and whether you hold a usable ace — and first-visit MC traces out the value surface with no model of the cards at all. Switch to Monte Carlo ES and the agent discovers a near-optimal hit/stick policy purely from played-out experience. It is the cleanest demonstration that experience alone can yield both values and good policies.
A short history
Monte Carlo rollouts and tree search remain central far beyond tabular RL: the same “sample many futures and average” logic drives the planning in AlphaZero and MuZero, and full-return estimation reappears in policy gradient methods like REINFORCE.
Researcher takes
Relaying Karpathy’s critique of full-return credit assignment: a single end-of-episode reward is broadcast back across the entire trajectory, upweighting every token (even wrong or irrelevant turns) that happened to precede the right answer — the central weakness of learning from Monte Carlo returns rather than per-step signal.
Frequently asked questions
Why can’t Monte Carlo handle continuing tasks?
Because the update target is the return , which sums rewards until the episode terminates. If the task never ends, there is no complete return to compute, so there is nothing to average. You either need to define episode boundaries or switch to a bootstrapping method like TD learning, which updates from a single step.
Is Monte Carlo the same as REINFORCE?
They share the same engine. REINFORCE is a policy gradient method that estimates the gradient using full Monte Carlo returns — it is “Monte Carlo policy gradient.” The MC methods on this page estimate value functions; REINFORCE uses MC returns to estimate a policy gradient directly. Same unbiased-but-high-variance return at the core.
When should I prefer MC over TD?
MC shines when your state representation may violate the Markov property (MC doesn’t rely on it), when you want unbiased estimates, or when the task is naturally episodic and short. TD is usually preferred for online learning, continuing tasks, and sample efficiency. In practice, n-step or TD(λ) blends are often the sweet spot.
What is the difference between first-visit and every-visit MC?
First-visit averages only the return after the first time a state is visited in each episode, giving independent samples. Every-visit averages the return after every visit, including repeats within an episode. Both converge to the true value; first-visit has cleaner theory, every-visit is simpler and extends better to function approximation.
Key references
- Reinforcement Learning: An Introduction (2nd ed.), Chapter 5 — Sutton & Barto — the canonical treatment of MC prediction, control, exploring starts, and importance sampling.
- Chapter 5 lecture slides — Sutton — the official course slides for Monte Carlo methods.
- RL Course Lecture 4: Model-Free Prediction — David Silver — MC and TD prediction explained from first principles.
- On the Convergence of the Monte Carlo Exploring Starts Algorithm — Chen, 2020 — a proof of convergence for MC ES, long an open theoretical question.
Related
Value functions · Temporal-difference learning · On-policy vs off-policy · Q-learning · Exploration vs exploitation · Policy gradients · Markov decision processes · What is reinforcement learning?