Monte Carlo Methods in Reinforcement Learning

Key takeaways

Monte Carlo methods learn values straight from experience by averaging the actual returns observed after each state — no model of the environment required.
They wait for an episode to finish, then update: this makes them unbiased but high-variance, and limits them to episodic tasks.
Prediction estimates V or Q for a fixed policy (first-visit vs every-visit); control alternates evaluation and greedy improvement, using exploring starts or ε-soft policies to keep exploring.
Off-policy MC uses importance sampling to learn about a target policy from data generated by a different behavior policy.

What are Monte Carlo methods?

Monte Carlo (MC) methods learn value functions and optimal policies purely from sampled experience — complete episodes of interaction — without ever building or assuming a model of the environment’s dynamics. The name borrows from the casino: just as you can estimate a roulette wheel’s odds by spinning it many times and averaging, an MC agent estimates the value of a state by playing out many episodes and averaging the returns it actually received.

This is the simplest answer to a hard question. Value functions are defined as expected returns, and dynamic programming computes them exactly — but only if you know the transition probabilities and rewards. Monte Carlo throws that requirement out. It asks: what return did I actually get when I followed my policy from this state? Average enough of those, and by the law of large numbers the average converges to the true expected value.

A Monte Carlo update: run a full episode, compute the return G from each state to the end, then nudge each visited state's value estimate toward the return it saw.

▶ RL Course by David Silver — Lecture 4: Model-Free Prediction (Monte Carlo & TD)

The return: what MC actually averages

Everything in MC rests on the return, the total discounted reward from a time step until the episode ends:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}

The value of a state under policy $\pi$ is the expected return, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ . Monte Carlo’s whole trick is to replace that expectation with an empirical average: collect many sample returns $G_t$ observed after visiting $s$ , and average them.

V(s) \;\leftarrow\; \text{average of returns } G_t \text{ following visits to } s

Because each return is a genuine, unbiased sample from the true distribution of returns, the sample mean converges almost surely to $v_\pi(s)$ as the number of visits grows — the strong law of large numbers doing the heavy lifting.

First-visit vs every-visit

Within a single episode a state can be visited more than once. That raises a choice about which returns to average:

First-visit MC

For each episode, use only the return following the first time the state is visited. Each such return is an independent sample, so the estimator is a clean i.i.d. average that converges to $v_\pi(s)$ by the law of large numbers. This is the most studied variant.

Every-visit MC

Average the returns following every visit to the state, even repeats within one episode. The samples are no longer independent, but the estimator is still consistent and converges to $v_\pi(s)$ — and it extends more naturally to function approximation.

Both converge to the same true value; they differ in bias and variance characteristics on the way there. In practice the difference is usually minor, and every-visit is often simpler to implement.

Initialize

Set $V(s)$ arbitrarily for all states and create an empty list of returns for each state.

Generate an episode

Follow policy $\pi$ from start to termination, recording the full trajectory of states, actions, and rewards.

Compute returns backward

Walk the episode from the last step to the first, accumulating $G \leftarrow \gamma G + R_{t+1}$ . Computing returns in reverse is a one-pass trick that avoids re-summing.

Update the estimates

For each qualifying visit (first-visit: only the first occurrence), append $G$ to that state’s returns list and set $V(s)$ to the running average. Repeat over many episodes.

Go deeper: incremental and constant-α updates

Storing every return is wasteful. The same average can be computed incrementally: after the $n$ -th return $G_n$ for a state,

V(s) \leftarrow V(s) + \frac{1}{N(s)}\big(G_n - V(s)\big)

where $N(s)$ counts visits. This is the general “update toward the target by a step-size times the error” pattern that runs through all of RL. Replacing $1/N(s)$ with a fixed step size $\alpha$ gives a constant-α MC update that tracks a non-stationary problem (e.g. a policy that keeps changing) by gently forgetting old returns:

V(s) \leftarrow V(s) + \alpha\big(G_t - V(s)\big)

Notice the family resemblance to the TD update — the only difference is that the target is the full return $G_t$ rather than a bootstrapped estimate.

From prediction to control

Estimating $v_\pi$ for a fixed policy is prediction. To find a better policy — control — we follow the same generalized policy iteration loop as dynamic programming: alternate policy evaluation with greedy policy improvement.

There is one wrinkle. With no model, knowing $V(s)$ alone does not tell you which action is best — you would need the transition dynamics to look one step ahead. So MC control estimates action-values $Q(s,a)$ instead. Then greedy improvement is trivial: pick $\arg\max_a Q(s,a)$ .

Monte Carlo control as generalized policy iteration: each round of episodes refines the action-value estimate, then a greedy step improves the policy, and the two chase each other toward optimal.

Keeping exploration alive

Pure greedy control has a fatal flaw: if the policy is deterministic, you only ever observe one action per state, so $Q(s,a)$ for the other actions never gets estimated. This is the exploration vs exploitation problem in its starkest form. Two classic fixes:

Exploring starts. Begin each episode at a random state-action pair, so every $(s,a)$ has a nonzero chance of being the start and getting sampled. This is the assumption behind Monte Carlo ES, which provably finds the optimal policy — but it is often unrealistic (you can’t always reset a real system into an arbitrary state).
ε-soft policies. Drop exploring starts and instead make the policy itself explore: with probability $1-\varepsilon$ take the greedy action, otherwise act randomly. This ε-greedy scheme guarantees every action keeps getting tried. It converges to the best ε-soft policy rather than the strictly optimal one — a small price for realism.

On-policy vs off-policy MC

The ε-greedy approach above is on-policy: you improve the very policy you use to gather data, so it must stay a little exploratory forever. The alternative is to separate the two roles, which leads to off-policy Monte Carlo (see on-policy vs off-policy).

In off-policy MC, a behavior policy $b$ generates the episodes (it can explore freely), while we learn the value of a different target policy $\pi$ (which can be the greedy, deterministic optimum). The catch: the returns we observe come from $b$ ‘s action choices, not $\pi$ ‘s. To correct for that mismatch we reweight each return by the importance sampling ratio — the relative probability that the target vs behavior policy would have produced that exact trajectory:

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}

There are two ways to use these ratios:

Estimator	Formula idea	Bias	Variance
Ordinary importance sampling	average of $\rho \cdot G$	unbiased	high — can be unbounded
Weighted importance sampling	$\rho$ -weighted average of $G$	biased (vanishes as data grows)	much lower; preferred in practice

Monte Carlo vs temporal difference

MC is one of the two foundational ways to do model-free value estimation; the other is temporal-difference (TD) learning. The contrast defines a whole axis of RL.

	Monte Carlo	Temporal Difference
Target	full return $G_t$	bootstrapped: $R_{t+1} + \gamma V(S_{t+1})$
Update timing	end of episode	every step
Bootstrapping	no	yes
Bias	unbiased	biased (uses own estimates)
Variance	high (sum of many random rewards)	low
Works on continuing tasks	no	yes
Markov assumption	not required	exploited (helps if true)

0 bias

MC return is an unbiased sample of the true value

High

variance — MC sums many random rewards per estimate

Episodic

only — MC needs episodes to terminate

Neither dominates. MC’s unbiasedness and indifference to the Markov property make it robust when your state representation is imperfect; TD’s low variance and online updates usually make it faster and more practical. TD(λ) and n-step methods interpolate between the two: λ = 1 recovers Monte Carlo, λ = 0 recovers one-step TD, and intermediate values often beat both extremes.

Go deeper: why MC is high-variance but TD is biased

A Monte Carlo target is a sum of all the random rewards and random transitions from $t$ to termination. Every coin flip the environment makes along the way adds noise, so the variance of $G_t$ grows with episode length — but its expectation is exactly $v_\pi(s)$ , so it is unbiased. A TD target, $R_{t+1} + \gamma V(S_{t+1})$ , depends on only one random reward and transition, so its variance is low — but it leans on the current (wrong) estimate $V(S_{t+1})$ , injecting bias that only disappears as the estimates converge. This bias-variance trade-off is the deepest single idea connecting the two families. The n-step return splits the difference by bootstrapping after $n$ real rewards instead of one.

A worked setting: Blackjack

The canonical MC example from Sutton & Barto is Blackjack. It is a perfect fit: the game is naturally episodic (each hand ends), and the true transition dynamics are awkward to write down (the dealer’s hidden behavior, the deck) — but trivial to sample by just dealing hands.

Run thousands of hands under a fixed policy (“stick on 20 or 21, otherwise hit”) and average the returns from each state — defined by your sum, the dealer’s showing card, and whether you hold a usable ace — and first-visit MC traces out the value surface with no model of the cards at all. Switch to Monte Carlo ES and the agent discovers a near-optimal hit/stick policy purely from played-out experience. It is the cleanest demonstration that experience alone can yield both values and good policies.

A short history

1940s

Monte Carlo simulation

Stanislaw Ulam and John von Neumann coin the “Monte Carlo” method for estimating quantities by random sampling, at Los Alamos.

1990s

MC enters RL

Monte Carlo estimation of value functions is formalized as a model-free alternative to dynamic programming for solving MDPs.

1998

Sutton & Barto, Chapter 5

The first edition of Reinforcement Learning: An Introduction codifies first-visit/every-visit prediction, MC control, exploring starts and off-policy importance sampling.

2017+

MCTS at the frontier

Monte Carlo tree search — sampling rollouts to value game positions — powers AlphaGo and AlphaZero, MC’s most famous modern descendant.

Monte Carlo rollouts and tree search remain central far beyond tabular RL: the same “sample many futures and average” logic drives the planning in AlphaZero and MuZero, and full-return estimation reappears in policy gradient methods like REINFORCE.

Researcher takes

Relaying Karpathy’s critique of full-return credit assignment: a single end-of-episode reward is broadcast back across the entire trajectory, upweighting every token (even wrong or irrelevant turns) that happened to precede the right answer — the central weakness of learning from Monte Carlo returns rather than per-step signal.

View Dwarkesh Patel's post on X →

Frequently asked questions

Why can’t Monte Carlo handle continuing tasks?

Because the update target is the return $G_t$ , which sums rewards until the episode terminates. If the task never ends, there is no complete return to compute, so there is nothing to average. You either need to define episode boundaries or switch to a bootstrapping method like TD learning, which updates from a single step.

Is Monte Carlo the same as REINFORCE?

They share the same engine. REINFORCE is a policy gradient method that estimates the gradient using full Monte Carlo returns — it is “Monte Carlo policy gradient.” The MC methods on this page estimate value functions; REINFORCE uses MC returns to estimate a policy gradient directly. Same unbiased-but-high-variance return at the core.

When should I prefer MC over TD?

MC shines when your state representation may violate the Markov property (MC doesn’t rely on it), when you want unbiased estimates, or when the task is naturally episodic and short. TD is usually preferred for online learning, continuing tasks, and sample efficiency. In practice, n-step or TD(λ) blends are often the sweet spot.

What is the difference between first-visit and every-visit MC?

First-visit averages only the return after the first time a state is visited in each episode, giving independent samples. Every-visit averages the return after every visit, including repeats within an episode. Both converge to the true value; first-visit has cleaner theory, every-visit is simpler and extends better to function approximation.

Key references

Reinforcement Learning: An Introduction (2nd ed.), Chapter 5 — Sutton & Barto — the canonical treatment of MC prediction, control, exploring starts, and importance sampling.
Chapter 5 lecture slides — Sutton — the official course slides for Monte Carlo methods.
RL Course Lecture 4: Model-Free Prediction — David Silver — MC and TD prediction explained from first principles.
On the Convergence of the Monte Carlo Exploring Starts Algorithm — Chen, 2020 — a proof of convergence for MC ES, long an open theoretical question.

Value functions · Temporal-difference learning · On-policy vs off-policy · Q-learning · Exploration vs exploitation · Policy gradients · Markov decision processes · What is reinforcement learning?