SARSA: On-Policy TD Control

Key takeaways

SARSA is an on-policy temporal-difference control algorithm: it learns the value of the policy it is actually following, exploration and all.
Its name is its update — State, Action, Reward, next State, next Action — the five things that make up one learning step.
The one-term difference from Q-learning (bootstrapping from the next action actually taken instead of the max over actions) makes SARSA learn safer, more conservative policies while it is still exploring.
It converges to the optimal action-value function under GLIE exploration and Robbins-Monro step sizes; Expected SARSA reduces its variance.

What is SARSA?

SARSA is a foundational algorithm for model-free control: learning to act well in an unknown environment without ever knowing its dynamics. It belongs to the family of temporal-difference (TD) learning methods, which update value estimates after every single step rather than waiting for an episode to end, as Monte Carlo methods do.

The name is an acronym for the quintuple of experience that drives one update: the agent is in a State, takes an Action, receives a Reward, lands in the next State, and chooses the next Action. Those five elements — $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ — are exactly what SARSA needs to improve its estimate of how good a state-action pair is.

What makes SARSA distinctive is that it is on-policy: it evaluates and improves the same policy that it uses to explore. It learns the value of the behaviour it actually exhibits — including the cost of its own random exploratory mistakes — which is what gives it its characteristic caution.

One SARSA transition: the agent acts, observes the reward and next state, then samples the next action under its current policy. All five pieces — S, A, R, S', A' — feed the update.

The SARSA update rule

SARSA estimates the action-value function $Q(s, a)$ — the expected return from taking action $a$ in state $s$ and then following the current policy. After each transition it nudges its estimate toward a one-step bootstrapped TD target:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big[\,R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\,\Big]$$

The bracketed quantity is the TD error $\delta_t = R_{t+1} + \gamma\,Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$: the gap between what we just observed (the reward plus the discounted value of the state we actually reached and the action we actually chose next) and what we previously believed. The step size $\alpha$ controls how much we trust the new evidence; the discount factor $\gamma$ weights future reward. If $S_{t+1}$ is terminal, $Q(S_{t+1}, A_{t+1})$ is defined as zero.
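To make the arithmetic concrete, here is a minimal sketch of a single tabular update. It assumes $Q$ is stored as a NumPy array indexed by integer state and action; the function name `sarsa_update` and the numbers are illustrative, not part of any library.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """Apply one tabular SARSA update in place and return the TD error."""
    # A terminal next state contributes zero bootstrap value.
    bootstrap = 0.0 if terminal else Q[s_next, a_next]
    td_error = r + gamma * bootstrap - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table: 3 states, 2 actions, with Q(1, 0) = 2 already learned.
Q = np.zeros((3, 2))
Q[1, 0] = 2.0

# Transition (S=0, A=1, R=1, S'=1, A'=0):
# TD error = 1 + 0.9 * 2 - 0 = 2.8, so Q(0, 1) moves from 0 to 0.5 * 2.8 = 1.4.
delta = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
```

With $\alpha = 0.5$ the estimate moves halfway toward the target; $\alpha = 1$ would overwrite it with the target entirely.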

How the algorithm runs

SARSA interleaves policy evaluation (improving $Q$) and policy improvement (acting greedily-ish with respect to $Q$) on every step, a form of generalized policy iteration. The standard behaviour policy is $\varepsilon$-greedy: take the current best action most of the time, but with probability $\varepsilon$ pick uniformly at random to keep exploring.
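An $\varepsilon$-greedy selector can be sketched in a few lines. This version assumes a NumPy table `Q` of shape (states, actions) and a NumPy random generator; the helper name `epsilon_greedy` is hypothetical, and ties in the greedy branch go to the lowest-indexed action, which is one of several reasonable choices.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))  # ties resolve to the lowest index

rng = np.random.default_rng(0)
Q = np.array([[0.0, 1.0, 0.5]])  # one state, three actions

greedy_action = epsilon_greedy(Q, 0, epsilon=0.0, rng=rng)  # never explores
samples = [epsilon_greedy(Q, 0, epsilon=1.0, rng=rng) for _ in range(200)]  # always explores
```

Note that the random branch can also land on the greedy action, so the greedy action is chosen with probability $1 - \varepsilon + \varepsilon / |\mathcal{A}|$ overall.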

Initialize

Set $Q(s, a)$ arbitrarily for all state-action pairs (commonly zero), with terminal states fixed at zero. Pick a step size $\alpha$, discount $\gamma$, and exploration rate $\varepsilon$.

Choose the first action

At the start of an episode, observe state $S$ and choose action $A$ from $Q$ using an $\varepsilon$-greedy policy. Crucially, the action is selected before the loop, so SARSA always has its “next action” in hand.

Act, then choose the next action

Take $A$, observe reward $R$ and next state $S'$. Now choose $A'$ from $S'$ using the same $\varepsilon$-greedy policy. This $A'$ is both what you will use to update and what you will execute next; that double duty is the on-policy property.

Apply the SARSA update

Update using the quintuple you now hold:

$$Q(S, A) \leftarrow Q(S, A) + \alpha\big[\,R + \gamma\, Q(S', A') - Q(S, A)\,\big]$$

Shift forward and repeat

Set $S \leftarrow S'$ and $A \leftarrow A'$, then loop back to the “Act, then choose the next action” step until $S$ is terminal. Decay $\varepsilon$ over time so the policy becomes greedy in the limit.
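The five steps above can be sketched end to end on a toy problem. Everything about the environment here is an assumption made for illustration: a hypothetical five-cell corridor with the goal at the right end, a reward of -1 per step, and a per-episode schedule $\varepsilon_k = 1/k$. The hyperparameters are arbitrary.

```python
import numpy as np

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at state 4
LEFT, RIGHT = 0, 1

def step(state, action):
    """Toy dynamics: move one cell, reward -1 per step, episode ends at GOAL."""
    next_state = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    return next_state, -1.0, next_state == GOAL

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))             # break ties at random

def run_sarsa(episodes=300, alpha=0.5, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))              # the terminal row is never updated
    for k in range(1, episodes + 1):
        epsilon = 1.0 / k                    # decay so the policy turns greedy
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)        # first action, before the loop
        done = False
        while not done:
            s_next, r, done = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)   # sample A'
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next            # A' becomes the action executed next
    return Q

Q = run_sarsa()
greedy_policy = Q[:GOAL].argmax(axis=1)      # greedy action in each non-terminal state
```

The line `s, a = s_next, a_next` is where the on-policy property shows up in code: the action used in the target is the same one the agent then executes.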

Go deeper: why $A'$ is chosen before the update, not after

In Q-learning you can compute the target the moment you see $S'$, because the target uses $\max_{a'} Q(S', a')$ and needs no commitment to an actual next action. SARSA cannot do that. Its target contains $Q(S', A')$ for the real $A'$, so the algorithm must commit to (sample) the next action before it can form the update. That ordering is not a coincidence of implementation; it is the mechanism by which the value of the exploration policy leaks into the learned values. Reuse the sampled $A'$ as the next executed action and you get an efficient one-sample-per-step loop.

On-policy vs off-policy: SARSA and Q-learning side by side

SARSA and Q-learning are near-twins. They share the same skeleton and differ only in the target’s bootstrap term: SARSA’s target is $R + \gamma\, Q(S', A')$, while Q-learning’s is $R + \gamma \max_{a'} Q(S', a')$.

The lone difference. SARSA bootstraps from the next action it actually takes (on-policy); Q-learning bootstraps from the best action available (off-policy), regardless of what it does next.
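A small numeric sketch shows how far the two targets can diverge on the very same transition. The action values and the sampled next action below are made up for illustration.

```python
import numpy as np

gamma = 0.9
reward = -1.0
Q_next = np.array([1.0, 4.0, 2.0])   # hypothetical Q(S', .) over three actions
a_next = 2                           # exploratory action actually sampled in S'

# SARSA bootstraps from the action that will really be taken next.
sarsa_target = reward + gamma * Q_next[a_next]       # -1 + 0.9 * 2 = 0.8

# Q-learning bootstraps from the best available action, whatever happens next.
q_learning_target = reward + gamma * Q_next.max()    # -1 + 0.9 * 4 = 2.6
```

When the sampled action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.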

The consequence shows up vividly in the classic cliff-walking gridworld. An agent must reach a goal along a grid edged by a cliff; stepping off the cliff incurs a large penalty and resets the episode.

SARSA learns the safe path

Because SARSA values the policy with its $\varepsilon$-greedy randomness, it knows that walking right beside the cliff risks a random step into it. It learns a longer, safer route one row up and consequently earns higher reward during training.

Q-learning learns the optimal path

Q-learning bootstraps from the greedy action, so it values the shortest route along the cliff edge as if exploration never happened. Its learned policy is optimal, but while still exploring it occasionally falls off, earning lower average reward online.

Property	SARSA	Q-learning
Policy type	On-policy	Off-policy
Bootstrap target	$Q(S', A')$ — action taken	$\max_{a'} Q(S', a')$ — best action
Learns value of	The exploring policy	The optimal policy
Cliff-walking result	Safe, longer path	Optimal, risky path
Online reward while learning	Higher	Lower
Final greedy policy	Optimal (as $\varepsilon \to 0$)	Optimal

Convergence guarantees

SARSA converges to the optimal action-value function $q_*$ — and thus an optimal policy — in the tabular case under two conditions, established by Singh, Jaakkola, Littman, and Szepesvári (2000):

GLIE exploration

The policy must be Greedy in the Limit with Infinite Exploration: every state-action pair is visited infinitely often, and the policy converges to greedy. An $\varepsilon$-greedy policy with $\varepsilon$ decayed as $\varepsilon_t = 1/t$ satisfies both halves.

Robbins-Monro step sizes

The step sizes must satisfy the stochastic-approximation conditions:

$$\sum_{t=1}^{\infty} \alpha_t = \infty \qquad\text{and}\qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$

The first ensures the estimates can still move arbitrarily far; the second ensures the noise eventually dies out. A schedule like $\alpha_t = 1/t$ works (in practice a small constant $\alpha$ is common, trading asymptotic guarantees for tracking ability).
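The two conditions can be checked numerically on partial sums. The sketch below compares $\alpha_t = 1/t$ with a constant $\alpha = 0.1$; the helper names are hypothetical, and finite sums only suggest the limiting behaviour rather than prove it.

```python
import math

def partial_sums(step_size, n):
    """Return (sum of alpha_t, sum of alpha_t squared) for t = 1..n."""
    total = sum(step_size(t) for t in range(1, n + 1))
    total_sq = sum(step_size(t) ** 2 for t in range(1, n + 1))
    return total, total_sq

def harmonic(t):
    return 1.0 / t

def constant(t):
    return 0.1

for n in (100, 10_000, 100_000):
    h_sum, h_sq = partial_sums(harmonic, n)
    c_sum, c_sq = partial_sums(constant, n)
    print(f"n={n:>7}  1/t: sum={h_sum:7.3f}  sum_sq={h_sq:.5f}   "
          f"0.1: sum={c_sum:9.1f}  sum_sq={c_sq:8.1f}")
```

For $1/t$ the first column keeps growing (roughly like $\ln n$) while the sum of squares stays below $\pi^2/6$. For the constant schedule both sums grow without bound, so the second condition fails, which is the asymptotic guarantee a constant step size gives up.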

Video: RL Course by David Silver, Lecture 5: Model-Free Control (SARSA, Q-learning, GLIE)

Expected SARSA

A small change removes much of SARSA’s sampling noise. Instead of bootstrapping from the single sampled next action $A'$, Expected SARSA takes the expectation over all next actions under the policy:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big[\,R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\,\Big]$$

By averaging over the policy’s action distribution analytically, Expected SARSA eliminates the variance introduced by randomly selecting $A'$. That lower-variance target allows larger step sizes and faster, more stable learning. When transitions and rewards are deterministic, the policy’s action choice is the only source of randomness in the target, so the Expected SARSA target has zero variance and $\alpha = 1$ becomes usable. It converges under the same conditions as SARSA and generally matches or beats both SARSA and Q-learning empirically, at the cost of summing over actions each step.
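As a sketch of the expectation in the target, assume the policy is $\varepsilon$-greedy with respect to the current $Q$. The helper names and numbers below are illustrative only.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (ties go to the first max)."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

def expected_sarsa_target(reward, q_next, epsilon, gamma, terminal=False):
    """Reward plus the discounted policy-weighted average of Q(S', .)."""
    if terminal:
        return reward
    probs = epsilon_greedy_probs(q_next, epsilon)
    return reward + gamma * float(probs @ q_next)

q_next = np.array([1.0, 4.0, 2.0])   # hypothetical Q(S', .)
# With epsilon = 0.3 the policy is [0.1, 0.8, 0.1], so the expectation is
# 0.1 * 1 + 0.8 * 4 + 0.1 * 2 = 3.5 and the target is -1 + 0.9 * 3.5 = 2.15.
target = expected_sarsa_target(-1.0, q_next, epsilon=0.3, gamma=0.9)
```

The target no longer depends on which $A'$ was sampled, only on the action distribution, which is where the variance reduction comes from.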

Multi-step SARSA and SARSA(λ)

One-step SARSA bootstraps after a single transition. n-step SARSA instead bootstraps after $n$ steps, using the n-step return, a blend that sits between one-step TD ($n=1$) and Monte Carlo ($n=\infty$):

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})$$
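The n-step return can be computed by folding the rewards from the back, which avoids explicit powers of $\gamma$. This is a minimal sketch with a hypothetical function name and made-up numbers.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """rewards = [R_{t+1}, ..., R_{t+n}]; bootstrap_value = Q(S_{t+n}, A_{t+n})."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g        # peel off one discount factor per step
    return g

# Three-step return with gamma = 0.5:
# 1 + 0.5 * 0 + 0.25 * 2 + 0.125 * 4 = 2.0
g3 = n_step_return([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)

# With a single reward this is the one-step SARSA target: 1 + 0.5 * 4 = 3.0
g1 = n_step_return([1.0], bootstrap_value=4.0, gamma=0.5)
```

If the episode ends within the $n$ steps, pass only the rewards observed and a bootstrap value of zero, which gives the plain Monte Carlo return.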

Larger $n$ propagates reward information backward faster but raises variance. SARSA(λ) generalizes this with eligibility traces: its target is a geometrically weighted average of all n-step returns, with weights set by $\lambda \in [0, 1]$. The elegant backward view maintains a short-term memory $e(s, a)$ that decays by $\gamma\lambda$ each step and marks recently visited pairs as “eligible” for the current TD error, so a single observation updates many past state-action pairs at once, without waiting for the future to unfold. SARSA(0) recovers one-step SARSA; SARSA(1) approximates Monte Carlo control.
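One step of the backward view might look like the sketch below. It assumes accumulating traces (replacing traces are a common alternative) and a trace table `E` that the caller resets to zero at the start of each episode; the function name is hypothetical.

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam, terminal=False):
    """One backward-view SARSA(lambda) step; Q and E are updated in place."""
    bootstrap = 0.0 if terminal else Q[s_next, a_next]
    td_error = r + gamma * bootstrap - Q[s, a]
    E[s, a] += 1.0                   # mark the current pair as eligible
    Q += alpha * td_error * E        # every eligible pair shares the TD error
    E *= gamma * lam                 # traces fade by gamma * lambda
    return td_error

Q = np.zeros((3, 2))
E = np.zeros_like(Q)

# Step 1: no reward and a zero bootstrap, so nothing changes except the trace.
sarsa_lambda_step(Q, E, s=0, a=1, r=0.0, s_next=1, a_next=0,
                  alpha=0.5, gamma=0.9, lam=0.8)

# Step 2: reward 1 at a terminal transition. The TD error of 1 updates the
# current pair by 0.5 and the earlier pair by 0.5 * 0.72 = 0.36.
sarsa_lambda_step(Q, E, s=1, a=0, r=1.0, s_next=2, a_next=0,
                  alpha=0.5, gamma=0.9, lam=0.8, terminal=True)
```

With `lam=0` the trace of every earlier pair is wiped after each step, so only the current pair is updated and the step reduces to one-step SARSA.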

SARSA with function approximation

For large or continuous state spaces, the tabular $Q$ becomes a parameterized function $\hat q(s, a; \mathbf{w})$ — linear features, tile coding, or a neural network. Semi-gradient SARSA updates the weights toward the TD target:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\big[\,R_{t+1} + \gamma\,\hat q(S_{t+1}, A_{t+1}; \mathbf{w}) - \hat q(S_t, A_t; \mathbf{w})\,\big]\nabla \hat q(S_t, A_t; \mathbf{w})$$

It is called semi-gradient because the target’s dependence on $\mathbf{w}$ is ignored when differentiating. With linear approximation, SARSA enjoys stronger stability than off-policy methods: on-policy distribution matching avoids the worst of the “deadly triad” (bootstrapping + function approximation + off-policy) that can make off-policy TD diverge. This is a notable practical edge of on-policy control. See deep Q-networks for how the off-policy side handles instability with replay and target networks.
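For the linear case, where $\hat q(s, a; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$ and the gradient is simply the feature vector, the update can be sketched as follows. The feature vectors here are made up, and the function name is hypothetical.

```python
import numpy as np

def semi_gradient_sarsa_update(w, x, r, x_next, alpha, gamma, terminal=False):
    """x and x_next are feature vectors for (S_t, A_t) and (S_{t+1}, A_{t+1})."""
    q = w @ x
    q_next = 0.0 if terminal else w @ x_next
    td_error = r + gamma * q_next - q
    # Semi-gradient: differentiate only q_hat(S_t, A_t; w), whose gradient is x.
    # The dependence of the target on w is deliberately ignored.
    w += alpha * td_error * x
    return td_error

w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])        # hypothetical features of (S_t, A_t)
x_next = np.array([0.0, 1.0, 0.0])   # hypothetical features of (S_{t+1}, A_{t+1})

# TD error = 2 + 0.9 * 0 - 0 = 2, so w becomes [0.2, 0.0, 0.2].
delta = semi_gradient_sarsa_update(w, x, r=2.0, x_next=x_next, alpha=0.1, gamma=0.9)
```

With one-hot features per state-action pair this reduces to the tabular update, since each weight then plays the role of a single table entry.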

A short history

1994

Modified Q-learning

Rummery and Niranjan propose the algorithm in a Cambridge tech report, calling it “modified Q-learning.”

1996

The name SARSA

Rich Sutton coins the name “SARSA” after the State-Action-Reward-State-Action quintuple, and the term sticks.

1998

Expected SARSA

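Expected SARSA appears as an exercise in the first edition of Sutton & Barto’s textbook; George John had described an equivalent expected-update rule in 1994. The idea is later formalized and analyzed in depth.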

2000

Convergence proof

Singh, Jaakkola, Littman & Szepesvári prove SARSA converges to the optimal policy under GLIE and Robbins-Monro conditions.

2009

Expected SARSA analysis

van Seijen et al. give the theoretical and empirical analysis showing Expected SARSA’s variance and learning-rate advantages.

When to reach for SARSA

Use SARSA when…

Online performance during learning matters, mistakes are costly (robots, live systems), or you specifically want the value of the policy you are running. Its conservatism near “cliffs” is the point. It pairs naturally with on-policy methods like actor-critic and PPO.

Use Q-learning when…

You only care about the final greedy policy, can tolerate risky exploration mid-training, or want to learn from off-policy data (logged experience, replay buffers). Off-policy flexibility underpins DQN and most offline RL.

Frequently asked questions

Why is SARSA called on-policy?

Because it evaluates and improves the very policy it uses to act. Its update bootstraps from $Q(S', A')$ where $A'$ is the action the current (exploring) policy actually selects — so the learned values reflect the behaviour policy, including the cost of exploration. Q-learning, by contrast, bootstraps from the greedy action regardless of what it does next, making it off-policy.

What does the SARSA acronym stand for?

State, Action, Reward, State, Action — the five elements $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ consumed by a single update. The agent is in a state, takes an action, gets a reward and a new state, and selects a new action; those are exactly the quantities in the update rule.

Is SARSA or Q-learning better?

Neither dominates — they optimize different objectives. Q-learning learns the optimal policy and is the right default when you only judge the final policy. SARSA learns the value of the exploring policy and earns more reward during training, which is preferable when mid-training mistakes are expensive. As $\varepsilon \to 0$ both converge to the optimal greedy policy.

How does Expected SARSA relate to the two?

Expected SARSA replaces the sampled next action with the expectation over the policy’s action distribution, cutting variance and allowing larger step sizes. It is a generalization: with a greedy target policy it reduces exactly to Q-learning, and it is on-policy when the expectation is taken under the behaviour policy.
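The reduction to Q-learning is easy to verify numerically: a greedy target policy puts all its probability on the maximizing action, so the expectation collapses to the max. The values below are random placeholders, and the check assumes no ties among them.

```python
import numpy as np

rng = np.random.default_rng(0)
q_next = rng.normal(size=4)              # arbitrary Q(S', .) values
reward, gamma = 1.0, 0.9

# Greedy target policy: probability 1 on the argmax, 0 elsewhere.
greedy_probs = np.zeros_like(q_next)
greedy_probs[np.argmax(q_next)] = 1.0

expected_sarsa_target = reward + gamma * float(greedy_probs @ q_next)
q_learning_target = reward + gamma * float(q_next.max())
```

Swapping `greedy_probs` for the behaviour policy’s own action probabilities gives the on-policy form of the same target.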

Key references

Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), Ch. 6: the canonical treatment of SARSA, Q-learning, and Expected SARSA.
Singh, Jaakkola, Littman & Szepesvári, Convergence Results for Single-Step On-Policy RL Algorithms, 2000: the convergence proof.
van Seijen et al., A Theoretical and Empirical Analysis of Expected SARSA, 2009: the variance and step-size analysis.
Rummery & Niranjan, On-Line Q-Learning Using Connectionist Systems, Cambridge tech report, 1994: the original algorithm.

