reinforcement-learning.com
// CORE ALGORITHMS

SARSA: On-Policy TD Control

SARSA explained: the on-policy temporal-difference control algorithm, its State-Action-Reward-State-Action update, the math, convergence, Expected SARSA, and SARSA vs Q-learning.

Updated 2026-06-08 13 min read
Key takeaways
  • SARSA is an on-policy temporal-difference control algorithm: it learns the value of the policy it is actually following, exploration and all.
  • Its name is its update — State, Action, Reward, next State, next Action — the five things that make up one learning step.
  • The one-character difference from Q-learning (using the next action actually taken instead of the max) makes SARSA learn safer, more conservative policies.
  • It converges to the optimal action-value function under GLIE exploration and Robbins-Monro step sizes; Expected SARSA reduces its variance.

What is SARSA?

SARSA is a foundational algorithm for model-free control: learning to act well in an unknown environment without ever knowing its dynamics. It belongs to the family of temporal-difference (TD) learning methods, which update value estimates after every single step rather than waiting for an episode to end like Monte Carlo methods.

The name is an acronym for the quintuple of experience that drives one update: the agent is in a State, takes an Action, receives a Reward, lands in the next State, and chooses the next Action. Those five elements — (St,At,Rt+1,St+1,At+1)(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) — are exactly what SARSA needs to improve its estimate of how good a state-action pair is.

What makes SARSA distinctive is that it is on-policy: it evaluates and improves the same policy that it uses to explore. It learns the value of the behaviour it actually exhibits — including the cost of its own random exploratory mistakes — which is what gives it its characteristic caution.

StateS(t)ActionA(t)RewardR(t+1)StateS(t+1)ActionA(t+1)update Q(S(t), A(t)) toward R(t+1) + γ·Q(S(t+1), A(t+1))A(t+1) is the action the policy actually picks — that is what makes SARSA on-policy
One SARSA transition: the agent acts, observes the reward and next state, then samples the next action under its current policy. All five pieces — S, A, R, S', A' — feed the update.

The SARSA update rule

SARSA estimates the action-value function Q(s,a)Q(s, a) — the expected return from taking action aa in state ss and then following the current policy. After each transition it nudges its estimate toward a one-step bootstrapped TD target:

Q(St,At)Q(St,At)+α[Rt+1+γQ(St+1,At+1)Q(St,At)]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big[\,R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\,\Big]

The bracketed quantity is the TD error δt=Rt+1+γQ(St+1,At+1)Q(St,At)\delta_t = R_{t+1} + \gamma\,Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t): the gap between what we just observed (reward plus discounted value of where we actually ended up and what we actually did next) and what we previously believed. The step size α\alpha controls how much we trust the new evidence; the discount factor γ\gamma weights future reward. If St+1S_{t+1} is terminal, Q(St+1,At+1)Q(S_{t+1}, A_{t+1}) is defined as zero.

How the algorithm runs

SARSA interleaves policy evaluation (improving QQ) and policy improvement (acting greedily-ish with respect to QQ) on every step — a form of generalized policy iteration. The standard behaviour policy is ε\varepsilon-greedy: take the current best action most of the time, but with probability ε\varepsilon pick at random to keep exploring.

1
Initialize

Set Q(s,a)Q(s, a) arbitrarily for all state-action pairs (commonly zero), with terminal states fixed at zero. Pick a step size α\alpha, discount γ\gamma, and exploration rate ε\varepsilon.

2
Choose the first action

At the start of an episode, observe state SS and choose action AA from QQ using an ε\varepsilon-greedy policy. Crucially, the action is selected before the loop — SARSA always has its “next action” in hand.

3
Act, then choose the next action

Take AA, observe reward RR and next state SS'. Now choose AA' from SS' using the same ε\varepsilon-greedy policy. This AA' is both what you will use to update and what you will execute next — that double duty is the on-policy property.

4
Apply the SARSA update

Update using the quintuple you now hold:

Q(S,A)Q(S,A)+α[R+γQ(S,A)Q(S,A)]Q(S, A) \leftarrow Q(S, A) + \alpha\big[\,R + \gamma\, Q(S', A') - Q(S, A)\,\big]
5
Shift forward and repeat

Set SSS \leftarrow S' and AAA \leftarrow A', then loop from step 3 until SS is terminal. Decay ε\varepsilon over time so the policy becomes greedy in the limit.

Go deeper: why A’ is chosen before the update, not after

In Q-learning you can compute the target the moment you see SS', because the target uses maxaQ(S,a)\max_{a'} Q(S', a') — no commitment to an actual next action is needed. SARSA cannot do that. Its target contains Q(S,A)Q(S', A') for the real AA', so the algorithm must commit to (sample) the next action before it can form the update. That ordering is not a coincidence of implementation; it is the mechanism by which the value of the exploration policy leaks into the learned values. Reuse the sampled AA' as the next executed action and you get an efficient one-sample-per-step loop.

On-policy vs off-policy: SARSA and Q-learning side by side

SARSA and Q-learning are near-twins. They share the same skeleton and differ only in the target’s bootstrap term:

SARSA (on-policy)Q-learning (off-policy)target =R + γ · Q(S’, A’)A’ = the action the policyactually takes nexttarget =R + γ · max Q(S’, a’)max over all actions a’ —ignores what it does next
The lone difference. SARSA bootstraps from the next action it actually takes (on-policy); Q-learning bootstraps from the best action available (off-policy), regardless of what it does next.

The consequence shows up vividly in the classic cliff-walking gridworld. An agent must reach a goal along a grid edged by a cliff; stepping off the cliff incurs a large penalty and resets the episode.

SARSA learns the safe path

Because SARSA values the policy with its ε\varepsilon-greedy randomness, it knows that walking right beside the cliff risks a random step into it. It learns a longer, safer route one row up — and consequently earns higher reward during training.

Q-learning learns the optimal path

Q-learning bootstraps from the greedy action, so it values the shortest route along the cliff edge as if exploration never happened. Its learned policy is optimal, but while still exploring it occasionally falls off, taking lower average reward online.

PropertySARSAQ-learning
Policy typeOn-policyOff-policy
Bootstrap targetQ(S,A)Q(S', A') — action takenmaxaQ(S,a)\max_{a'} Q(S', a') — best action
Learns value ofThe exploring policyThe optimal policy
Cliff-walking resultSafe, longer pathOptimal, risky path
Online reward while learningHigherLower
Final greedy policyOptimal (as ε0\varepsilon \to 0)Optimal

Convergence guarantees

SARSA converges to the optimal action-value function qq_* — and thus an optimal policy — in the tabular case under two conditions, established by Singh, Jaakkola, Littman, and Szepesvári (2000):

1
GLIE exploration

The policy must be Greedy in the Limit with Infinite Exploration: every state-action pair is visited infinitely often, and the policy converges to greedy. An ε\varepsilon-greedy policy with ε\varepsilon decayed as εt=1/t\varepsilon_t = 1/t satisfies both halves.

2
Robbins-Monro step sizes

The step sizes must satisfy the stochastic-approximation conditions:

t=1αt=andt=1αt2<\sum_{t=1}^{\infty} \alpha_t = \infty \qquad\text{and}\qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty

The first ensures the estimates can still move arbitrarily far; the second ensures the noise eventually dies out. A schedule like αt=1/t\alpha_t = 1/t works (in practice a small constant α\alpha is common, trading asymptotic guarantees for tracking ability).

▶ RL Course by David Silver — Lecture 5: Model-Free Control (SARSA, Q-learning, GLIE)

Expected SARSA

A small change removes much of SARSA’s sampling noise. Instead of bootstrapping from the single sampled next action AA', Expected SARSA takes the expectation over all next actions under the policy:

Q(St,At)Q(St,At)+α[Rt+1+γaπ(aSt+1)Q(St+1,a)Q(St,At)]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big[\,R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\,\Big]

By averaging over the policy’s action distribution analytically, Expected SARSA eliminates the variance introduced by randomly selecting AA'. That lower-variance target allows larger step sizes and faster, more stable learning; in deterministic environments the target variance is zero, permitting α=1\alpha = 1. It converges under the same conditions as SARSA and generally matches or beats both SARSA and Q-learning empirically — at the cost of summing over actions each step.

1994
Rummery & Niranjan introduce the algorithm (as 'modified Q-learning')
5
Elements in the update: S, A, R, S', A'
1 term
The only difference from Q-learning: Q(S',A') vs max Q(S',a')

Multi-step SARSA and SARSA(λ)

One-step SARSA bootstraps after a single transition. n-step SARSA instead bootstraps after nn steps, using the n-step return — a blend that sits between one-step TD (n=1n=1) and Monte Carlo (n=n=\infty):

Gt:t+n=Rt+1+γRt+2++γn1Rt+n+γnQ(St+n,At+n)G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})

Larger nn propagates reward information backward faster but raises variance. SARSA(λ) generalizes this with eligibility traces: a geometric average of all n-step returns weighted by λ[0,1]\lambda \in [0, 1]. The elegant backward view maintains a short-term memory e(s,a)e(s, a) that decays by γλ\gamma\lambda each step and marks recently visited pairs as “eligible” for the current TD error — so a single observation updates many past state-action pairs at once, without waiting for the future to unfold. SARSA(0) recovers one-step SARSA; SARSA(1) approximates Monte Carlo control.

SARSA with function approximation

For large or continuous state spaces, the tabular QQ becomes a parameterized function q^(s,a;w)\hat q(s, a; \mathbf{w}) — linear features, tile coding, or a neural network. Semi-gradient SARSA updates the weights toward the TD target:

ww+α[Rt+1+γq^(St+1,At+1;w)q^(St,At;w)]q^(St,At;w)\mathbf{w} \leftarrow \mathbf{w} + \alpha\big[\,R_{t+1} + \gamma\,\hat q(S_{t+1}, A_{t+1}; \mathbf{w}) - \hat q(S_t, A_t; \mathbf{w})\,\big]\nabla \hat q(S_t, A_t; \mathbf{w})

It is called semi-gradient because the target’s dependence on w\mathbf{w} is ignored when differentiating. With linear approximation, SARSA enjoys stronger stability than off-policy methods: on-policy distribution matching avoids the worst of the “deadly triad” (bootstrapping + function approximation + off-policy) that can make off-policy TD diverge. This is a notable practical edge of on-policy control. See deep Q-networks for how the off-policy side handles instability with replay and target networks.

A short history

1994
Modified Q-learning
Rummery and Niranjan propose the algorithm in a Cambridge tech report, calling it “modified Q-learning.”
1996
The name SARSA
Rich Sutton coins the name “SARSA” after the State-Action-Reward-State-Action quintuple, and the term sticks.
1998
Expected SARSA
George John and others describe expected-update variants; the idea is later formalized and analyzed in depth.
2000
Convergence proof
Singh, Jaakkola, Littman & Szepesvári prove SARSA converges to the optimal policy under GLIE and Robbins-Monro conditions.
2009
Expected SARSA analysis
van Seijen et al. give the theoretical and empirical analysis showing Expected SARSA’s variance and learning-rate advantages.

When to reach for SARSA

Use SARSA when…

Online performance during learning matters, mistakes are costly (robots, live systems), or you specifically want the value of the policy you are running. Its conservatism near “cliffs” is the point. Pairs naturally with on-policy methods like actor-critic and PPO.

Use Q-learning when…

You only care about the final greedy policy, can tolerate risky exploration mid-training, or want to learn from off-policy data (logged experience, replay buffers). Off-policy flexibility underpins DQN and most offline RL.

Frequently asked questions

Why is SARSA called on-policy?

Because it evaluates and improves the very policy it uses to act. Its update bootstraps from Q(S,A)Q(S', A') where AA' is the action the current (exploring) policy actually selects — so the learned values reflect the behaviour policy, including the cost of exploration. Q-learning, by contrast, bootstraps from the greedy action regardless of what it does next, making it off-policy.

What does the SARSA acronym stand for?

State, Action, Reward, State, Action — the five elements (St,At,Rt+1,St+1,At+1)(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) consumed by a single update. The agent is in a state, takes an action, gets a reward and a new state, and selects a new action; those are exactly the quantities in the update rule.

Is SARSA or Q-learning better?

Neither dominates — they optimize different objectives. Q-learning learns the optimal policy and is the right default when you only judge the final policy. SARSA learns the value of the exploring policy and earns more reward during training, which is preferable when mid-training mistakes are expensive. As ε0\varepsilon \to 0 both converge to the optimal greedy policy.

How does Expected SARSA relate to the two?

Expected SARSA replaces the sampled next action with the expectation over the policy’s action distribution, cutting variance and allowing larger step sizes. It is a generalization: with a greedy target policy it reduces exactly to Q-learning, and it is generally on-policy when the expectation uses the behaviour policy.

Key references

Q-learning · Temporal-difference learning · On-policy vs off-policy · Monte Carlo methods · Exploration vs exploitation · Value functions · What is reinforcement learning?