- SARSA is an on-policy temporal-difference control algorithm: it learns the value of the policy it is actually following, exploration and all.
- Its name is its update — State, Action, Reward, next State, next Action — the five things that make up one learning step.
- The one-character difference from Q-learning (using the next action actually taken instead of the max) makes SARSA learn safer, more conservative policies.
- It converges to the optimal action-value function under GLIE exploration and Robbins-Monro step sizes; Expected SARSA reduces its variance.
What is SARSA?
SARSA is a foundational algorithm for model-free control: learning to act well in an unknown environment without ever knowing its dynamics. It belongs to the family of temporal-difference (TD) learning methods, which update value estimates after every single step rather than waiting for an episode to end like Monte Carlo methods.
The name is an acronym for the quintuple of experience that drives one update: the agent is in a State, takes an Action, receives a Reward, lands in the next State, and chooses the next Action. Those five elements — — are exactly what SARSA needs to improve its estimate of how good a state-action pair is.
What makes SARSA distinctive is that it is on-policy: it evaluates and improves the same policy that it uses to explore. It learns the value of the behaviour it actually exhibits — including the cost of its own random exploratory mistakes — which is what gives it its characteristic caution.
The SARSA update rule
SARSA estimates the action-value function — the expected return from taking action in state and then following the current policy. After each transition it nudges its estimate toward a one-step bootstrapped TD target:
The bracketed quantity is the TD error : the gap between what we just observed (reward plus discounted value of where we actually ended up and what we actually did next) and what we previously believed. The step size controls how much we trust the new evidence; the discount factor weights future reward. If is terminal, is defined as zero.
How the algorithm runs
SARSA interleaves policy evaluation (improving ) and policy improvement (acting greedily-ish with respect to ) on every step — a form of generalized policy iteration. The standard behaviour policy is -greedy: take the current best action most of the time, but with probability pick at random to keep exploring.
Set arbitrarily for all state-action pairs (commonly zero), with terminal states fixed at zero. Pick a step size , discount , and exploration rate .
At the start of an episode, observe state and choose action from using an -greedy policy. Crucially, the action is selected before the loop — SARSA always has its “next action” in hand.
Take , observe reward and next state . Now choose from using the same -greedy policy. This is both what you will use to update and what you will execute next — that double duty is the on-policy property.
Update using the quintuple you now hold:
Set and , then loop from step 3 until is terminal. Decay over time so the policy becomes greedy in the limit.
Go deeper: why A’ is chosen before the update, not after
In Q-learning you can compute the target the moment you see , because the target uses — no commitment to an actual next action is needed. SARSA cannot do that. Its target contains for the real , so the algorithm must commit to (sample) the next action before it can form the update. That ordering is not a coincidence of implementation; it is the mechanism by which the value of the exploration policy leaks into the learned values. Reuse the sampled as the next executed action and you get an efficient one-sample-per-step loop.
On-policy vs off-policy: SARSA and Q-learning side by side
SARSA and Q-learning are near-twins. They share the same skeleton and differ only in the target’s bootstrap term:
The consequence shows up vividly in the classic cliff-walking gridworld. An agent must reach a goal along a grid edged by a cliff; stepping off the cliff incurs a large penalty and resets the episode.
Because SARSA values the policy with its -greedy randomness, it knows that walking right beside the cliff risks a random step into it. It learns a longer, safer route one row up — and consequently earns higher reward during training.
Q-learning bootstraps from the greedy action, so it values the shortest route along the cliff edge as if exploration never happened. Its learned policy is optimal, but while still exploring it occasionally falls off, taking lower average reward online.
| Property | SARSA | Q-learning |
|---|---|---|
| Policy type | On-policy | Off-policy |
| Bootstrap target | — action taken | — best action |
| Learns value of | The exploring policy | The optimal policy |
| Cliff-walking result | Safe, longer path | Optimal, risky path |
| Online reward while learning | Higher | Lower |
| Final greedy policy | Optimal (as ) | Optimal |
Convergence guarantees
SARSA converges to the optimal action-value function — and thus an optimal policy — in the tabular case under two conditions, established by Singh, Jaakkola, Littman, and Szepesvári (2000):
The policy must be Greedy in the Limit with Infinite Exploration: every state-action pair is visited infinitely often, and the policy converges to greedy. An -greedy policy with decayed as satisfies both halves.
The step sizes must satisfy the stochastic-approximation conditions:
The first ensures the estimates can still move arbitrarily far; the second ensures the noise eventually dies out. A schedule like works (in practice a small constant is common, trading asymptotic guarantees for tracking ability).
Expected SARSA
A small change removes much of SARSA’s sampling noise. Instead of bootstrapping from the single sampled next action , Expected SARSA takes the expectation over all next actions under the policy:
By averaging over the policy’s action distribution analytically, Expected SARSA eliminates the variance introduced by randomly selecting . That lower-variance target allows larger step sizes and faster, more stable learning; in deterministic environments the target variance is zero, permitting . It converges under the same conditions as SARSA and generally matches or beats both SARSA and Q-learning empirically — at the cost of summing over actions each step.
Multi-step SARSA and SARSA(λ)
One-step SARSA bootstraps after a single transition. n-step SARSA instead bootstraps after steps, using the n-step return — a blend that sits between one-step TD () and Monte Carlo ():
Larger propagates reward information backward faster but raises variance. SARSA(λ) generalizes this with eligibility traces: a geometric average of all n-step returns weighted by . The elegant backward view maintains a short-term memory that decays by each step and marks recently visited pairs as “eligible” for the current TD error — so a single observation updates many past state-action pairs at once, without waiting for the future to unfold. SARSA(0) recovers one-step SARSA; SARSA(1) approximates Monte Carlo control.
SARSA with function approximation
For large or continuous state spaces, the tabular becomes a parameterized function — linear features, tile coding, or a neural network. Semi-gradient SARSA updates the weights toward the TD target:
It is called semi-gradient because the target’s dependence on is ignored when differentiating. With linear approximation, SARSA enjoys stronger stability than off-policy methods: on-policy distribution matching avoids the worst of the “deadly triad” (bootstrapping + function approximation + off-policy) that can make off-policy TD diverge. This is a notable practical edge of on-policy control. See deep Q-networks for how the off-policy side handles instability with replay and target networks.
A short history
When to reach for SARSA
Online performance during learning matters, mistakes are costly (robots, live systems), or you specifically want the value of the policy you are running. Its conservatism near “cliffs” is the point. Pairs naturally with on-policy methods like actor-critic and PPO.
You only care about the final greedy policy, can tolerate risky exploration mid-training, or want to learn from off-policy data (logged experience, replay buffers). Off-policy flexibility underpins DQN and most offline RL.
Frequently asked questions
Why is SARSA called on-policy?
Because it evaluates and improves the very policy it uses to act. Its update bootstraps from where is the action the current (exploring) policy actually selects — so the learned values reflect the behaviour policy, including the cost of exploration. Q-learning, by contrast, bootstraps from the greedy action regardless of what it does next, making it off-policy.
What does the SARSA acronym stand for?
State, Action, Reward, State, Action — the five elements consumed by a single update. The agent is in a state, takes an action, gets a reward and a new state, and selects a new action; those are exactly the quantities in the update rule.
Is SARSA or Q-learning better?
Neither dominates — they optimize different objectives. Q-learning learns the optimal policy and is the right default when you only judge the final policy. SARSA learns the value of the exploring policy and earns more reward during training, which is preferable when mid-training mistakes are expensive. As both converge to the optimal greedy policy.
How does Expected SARSA relate to the two?
Expected SARSA replaces the sampled next action with the expectation over the policy’s action distribution, cutting variance and allowing larger step sizes. It is a generalization: with a greedy target policy it reduces exactly to Q-learning, and it is generally on-policy when the expectation uses the behaviour policy.
Key references
- Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), Ch. 6 — the canonical treatment of SARSA, Q-learning, and Expected SARSA.
- Convergence Results for Single-Step On-Policy RL Algorithms — Singh, Jaakkola, Littman & Szepesvári, 2000 — the convergence proof.
- A Theoretical and Empirical Analysis of Expected SARSA — van Seijen et al., 2009.
- Rummery & Niranjan, On-Line Q-Learning Using Connectionist Systems, Cambridge tech report, 1994 — the original algorithm.
Related
Q-learning · Temporal-difference learning · On-policy vs off-policy · Monte Carlo methods · Exploration vs exploitation · Value functions · What is reinforcement learning?