reinforcement-learning.com

Multi-Agent Reinforcement Learning (MARL)

What multi-agent RL is, the Markov-game formalism, non-stationarity and credit assignment, CTDE algorithms like MADDPG and QMIX, self-play and emergent behavior.

Updated 2026-06-07 16 min read
Key takeaways
  • Multi-agent RL (MARL) studies several agents that learn simultaneously in a shared environment — cooperating, competing, or both.
  • The clean MDP world breaks: when every agent is learning at once, the environment looks non-stationary from any single agent's view, and team rewards make credit assignment hard.
  • The dominant fix is CTDE — centralized training with decentralized execution — used by MADDPG (continuous), QMIX/VDN (value factorization) and COMA (counterfactual credit).
  • Self-play and autocurricula drove the headline results: AlphaStar, OpenAI Five, and the emergent tool-use of OpenAI's hide-and-seek.

What is multi-agent reinforcement learning?

Multi-agent reinforcement learning (MARL) is reinforcement learning where more than one agent learns at the same time inside a shared environment. Each agent picks actions, the world updates, and each receives its own reward — but crucially, every agent’s outcomes depend on what the others do. A self-driving car merges into traffic full of other learning drivers; a team of warehouse robots must avoid each other while clearing orders; two trading bots push prices against each other. None of these is a single-agent problem dressed up — the interaction is the problem.

That interaction can be cooperative (a shared team reward), competitive (one agent’s gain is another’s loss), or mixed (teams that compete, members that cooperate — like soccer). The mix changes everything about what “optimal” even means.

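[Figure: the single-agent loop (the agent sends an action; the environment returns state and reward) beside the multi-agent loop (Agent 1 through Agent N act on a shared environment that transitions on the joint action).]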
Single-agent RL closes one loop with the environment. In MARL, each agent's actions feed a shared environment whose transitions and rewards depend on the joint action — so from any one agent's view, the others are part of a moving environment.

The formal model: Markov games

Single-agent RL lives on a Markov Decision Process. MARL generalizes it to a Markov game (also called a stochastic game, introduced by Shapley in 1953). For $N$ agents it is the tuple:

$$\big(\mathcal{S},\; \{\mathcal{A}_i\}_{i=1}^N,\; P,\; \{R_i\}_{i=1}^N,\; \gamma\big)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}_i$ is agent $i$’s action set, and the joint action $\boldsymbol{a} = (a_1, \dots, a_N)$ drives both the transition and the rewards. The transition kernel depends on all agents:

$$P\big(s' \mid s,\, a_1, \dots, a_N\big)$$

and each agent gets its own reward $R_i(s, a_1, \dots, a_N)$. Three special cases name the field:

Cooperative

All agents share one reward, $R_1 = \cdots = R_N$. The goal is a joint policy that maximizes the common return. This is the world of QMIX, VDN and COMA.

Competitive (zero-sum)

Rewards sum to zero: one agent’s gain is another’s loss. Two-player zero-sum games have a well-defined minimax/Nash value — the setting of self-play and AlphaZero-style training.

Mixed (general-sum)

Anything in between — teams, social dilemmas, markets. The richest and hardest case; solution concepts get subtle and equilibria may not be unique.
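As a rough illustration of the tuple and its three special cases, here is a minimal tabular sketch for two agents. The class name `MarkovGame`, the array layouts `P[s, a1, a2, s']` and `R[i, s, a1, a2]`, and the method `reward_setting` are hypothetical choices made for this example, not part of any library.

```python
import numpy as np

class MarkovGame:
    """Tabular two-agent Markov game (S, {A_i}, P, {R_i}, gamma). Illustrative only."""

    def __init__(self, P, R, gamma=0.95, seed=0):
        self.P = np.asarray(P, dtype=float)  # P[s, a1, a2, s'] transition kernel
        self.R = np.asarray(R, dtype=float)  # R[i, s, a1, a2] reward of agent i
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def step(self, s, joint_action):
        """Sample s' and return every agent's reward for the joint action."""
        a1, a2 = joint_action
        n_states = self.P.shape[-1]
        s_next = int(self.rng.choice(n_states, p=self.P[s, a1, a2]))
        return s_next, self.R[:, s, a1, a2]

    def reward_setting(self):
        """Classify the game by its reward structure."""
        if np.allclose(self.R, self.R[0]):
            return "cooperative"   # R_1 = ... = R_N
        if np.allclose(self.R.sum(axis=0), 0.0):
            return "zero-sum"      # rewards sum to zero everywhere
        return "general-sum"

# Matching pennies written as a one-state Markov game.
pennies = np.array([[1.0, -1.0], [-1.0, 1.0]])
P = np.ones((1, 2, 2, 1))                   # the single state always repeats
R = np.stack([pennies, -pennies])[:, None]  # shape (2 agents, 1 state, 2, 2)
game = MarkovGame(P, R)
next_state, rewards = game.step(0, (0, 1))
print(game.reward_setting(), next_state, rewards)
```

Swapping `-pennies` for `pennies` in `R` gives both agents the same reward and turns the same game into a cooperative one.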

When agents can’t see the full state, which is the usual case, the model becomes a partially observable stochastic game; its cooperative special case is the Dec-POMDP (decentralized partially observable MDP). Each agent acts on its own local observation $o_i$, not on $s$. That partial observability is what makes decentralized execution both necessary and hard.

What does “optimal” mean with many learners?

In single-agent RL there’s one objective and one optimal policy. With several self-interested agents there is no single “best”, only equilibria. The central solution concept is the Nash equilibrium: a joint policy $(\pi_1^*, \dots, \pi_N^*)$ where no agent can improve its own expected return by unilaterally changing its policy while the others hold theirs fixed. In a Markov game this is a Markov-perfect equilibrium when it holds at every state. Equilibria can be multiple, hard to compute, and not necessarily good for the group, which is why MARL borrows heavily from game theory.
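To make the “no unilateral improvement” condition concrete, the sketch below checks it by brute force in a one-state, two-player matrix game, the simplest special case of a Markov game. The function name `pure_nash_equilibria` and the stag-hunt payoffs are assumptions chosen for illustration, and only pure (deterministic) strategies are checked.

```python
import numpy as np

def pure_nash_equilibria(payoff_1, payoff_2):
    """Return all pure joint actions where neither player gains by deviating alone."""
    payoff_1 = np.asarray(payoff_1)
    payoff_2 = np.asarray(payoff_2)
    equilibria = []
    for a1 in range(payoff_1.shape[0]):
        for a2 in range(payoff_1.shape[1]):
            # Player 1 cannot do better by changing its row while a2 stays fixed.
            best_for_1 = payoff_1[a1, a2] >= payoff_1[:, a2].max()
            # Player 2 cannot do better by changing its column while a1 stays fixed.
            best_for_2 = payoff_2[a1, a2] >= payoff_2[a1, :].max()
            if best_for_1 and best_for_2:
                equilibria.append((a1, a2))
    return equilibria

# Stag hunt (illustrative payoffs): action 0 = hunt stag, action 1 = hunt hare.
stag_hunt_1 = np.array([[4, 0],
                        [3, 3]])
stag_hunt_2 = stag_hunt_1.T  # symmetric game

print(pure_nash_equilibria(stag_hunt_1, stag_hunt_2))  # [(0, 0), (1, 1)]
```

Both joint actions are equilibria, but (1, 1) pays each player 3 instead of 4: a small instance of equilibria that are multiple and not necessarily good for the group.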

Why MARL is hard: four core challenges

1. Non-stationarity

The killer problem. Standard RL assumes a stationary environment, with fixed transition and reward dynamics. But when every agent learns simultaneously, each agent’s effective environment (the others’ behavior) keeps changing. A policy that was good yesterday may be bad once opponents adapt. This breaks the convergence guarantees of single-agent Q-learning: from one agent’s local view the transition and reward dynamics drift as the others update, so the process it faces is no longer a fixed MDP.
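A small numeric sketch of that drift, assuming a two-action coordination game and two hand-picked snapshots of the other agent’s policy (the payoffs, the probabilities and the helper `effective_reward` are all illustrative):

```python
import numpy as np

# Agent 1's payoff in a coordination game: R1[a1, a2] = 1 when the actions match.
R1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])

def effective_reward(R_i, other_policy):
    """Expected reward of each own action once the other agent is averaged out."""
    return R_i @ other_policy

# Two snapshots of the other agent's policy as it learns.
early = effective_reward(R1, np.array([0.9, 0.1]))  # it mostly plays action 0
late = effective_reward(R1, np.array([0.2, 0.8]))   # it has drifted to action 1

print(early, late)
```

Agent 1’s payoff table never changed, yet its best action flips from 0 to 1 because the reward it effectively experiences depends on a policy that is itself being trained.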

2. Credit assignment

In a cooperative team with one shared reward, who actually caused the win? If five robots get +1 for clearing an order, each must figure out how much its own actions contributed. This multi-agent credit assignment problem is the reason value-decomposition and counterfactual methods exist.

3. Scalability / combinatorial explosion

The joint action space grows as $\prod_i |\mathcal{A}_i|$, which is exponential in the number of agents. A naive “treat the team as one big agent” approach (a joint-action learner) is intractable beyond a handful of agents.
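The arithmetic is quick to check. Assuming every agent has the same five actions (an arbitrary number picked for this example):

```python
import math

def joint_action_count(action_set_sizes):
    """Size of the joint action space: the product of the per-agent action counts."""
    return math.prod(action_set_sizes)

for n_agents in (2, 5, 10):
    print(n_agents, joint_action_count([5] * n_agents))
# 2 25
# 5 3125
# 10 9765625
```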

4. Partial observability & coordination

Each agent typically sees only a slice of the world. Agents must coordinate — sometimes communicate — to act coherently, all while their teammates’ policies are themselves shifting underfoot.

The dominant paradigm: centralized training, decentralized execution (CTDE)

The single most influential idea in modern deep MARL is CTDE. The insight: training and execution have different constraints.

  • During training (e.g. in a simulator) you can cheat: give the learner access to the global state, every agent’s observations, and every agent’s actions. This extra information tames non-stationarity, because once the critic conditions on the joint action, the environment dynamics it sees no longer shift as the other agents’ policies change.
  • During execution each agent must act on its own local observation alone — no telepathy, no central controller.

CTDE squares that circle: train with a centralized critic, deploy a decentralized actor. Almost every flagship cooperative algorithm is a CTDE method.

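[Figure: CTDE. Training (centralized): Actor 1 through Actor N receive gradients from a centralized critic that sees the global state and the joint action. Execution (decentralized): each actor acts in the environment on its own observation, o(1) through o(N); the critic is gone.]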
CTDE: a centralized critic sees the global state and the joint action during training (tames non-stationarity); each decentralized actor sees only its own local observation and is the only thing deployed at execution time.
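The split can be sketched in terms of what each component is allowed to take as input. The code below is a shape-level illustration with random linear weights, not a trainable implementation; the dimensions and the names `act` and `centralized_q` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim, state_dim = 3, 4, 2, 6

# One small linear actor per agent: it only ever receives its own observation o_i.
actor_weights = [rng.normal(scale=0.1, size=(obs_dim, act_dim)) for _ in range(n_agents)]

# One centralized critic: it receives the global state plus every agent's action.
critic_weights = rng.normal(scale=0.1, size=state_dim + n_agents * act_dim)

def act(i, obs_i):
    """Decentralized actor pi_i(o_i): the only part deployed at execution time."""
    return np.tanh(obs_i @ actor_weights[i])

def centralized_q(state, joint_action):
    """Training-only critic Q(s, a_1, ..., a_N)."""
    critic_input = np.concatenate([state, np.concatenate(joint_action)])
    return float(critic_input @ critic_weights)

# Training time: global information is available, so the critic can score the joint action.
state = rng.normal(size=state_dim)
observations = [rng.normal(size=obs_dim) for _ in range(n_agents)]
joint_action = [act(i, o) for i, o in enumerate(observations)]
q_value = centralized_q(state, joint_action)

# Execution time: only act(i, o_i) is called; centralized_q is never needed.
print(q_value, [a.shape for a in joint_action])
```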

The algorithm landscape

MARL methods fall into a few families. The table is the map; the prose below it is the territory.

| Algorithm | Type | Reward setting | Key idea | Dropped at execution |
| --- | --- | --- | --- | --- |
| IQL (independent Q) | Value, decentralized | Any | Each agent runs its own DQN, ignores the rest | nothing extra (baseline) |
| MADDPG | Actor-critic, CTDE | Any (continuous actions) | Centralized critic per agent; decentralized deterministic actors | the centralized critics |
| VDN | Value factorization, CTDE | Cooperative | Joint Q = sum of per-agent Q’s | the summation |
| QMIX | Value factorization, CTDE | Cooperative | Joint Q = monotonic mix of per-agent Q’s | the mixing network |
| COMA | Actor-critic, CTDE | Cooperative | Counterfactual baseline for credit assignment | the centralized critic |
| MAPPO | Actor-critic, CTDE | Cooperative / mixed | PPO with a centralized value function | the centralized critic |

MADDPG — centralized critics for mixed settings

MADDPG (Lowe et al., 2017) is the canonical CTDE actor-critic. The paper opens by naming the two diseases: Q-learning suffers from non-stationarity, and policy gradients suffer from variance that grows with the number of agents. The cure: give each agent a centralized critic $Q_i(s, a_1, \dots, a_N)$ that sees everyone’s actions, while each actor $\pi_i(o_i)$ stays decentralized. Because the critic conditions on the joint action, the environment it models stays stationary even as policies change. MADDPG builds on DDPG and handles cooperative, competitive and mixed settings with continuous actions.
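Written in this article’s notation, the two updates look roughly as follows. Primes on $Q_i'$ and $\pi_j'$ denote target networks, $\phi_i$ and $\theta_i$ are assumed names for the critic and actor parameters, and $s$ stands in for whatever global information the critic is given (in the paper, at minimum the concatenated observations of all agents).

```latex
% Critic: regress onto a one-step target that uses every agent's target actor.
y_i = r_i + \gamma \, Q_i'\big(s',\, a_1', \dots, a_N'\big)\Big|_{a_j' = \pi_j'(o_j')}
\qquad
\mathcal{L}(\phi_i) = \mathbb{E}\Big[\big(Q_i(s, a_1, \dots, a_N) - y_i\big)^2\Big]

% Actor: deterministic policy gradient through agent i's own action only.
\nabla_{\theta_i} J(\pi_i) = \mathbb{E}\Big[\nabla_{\theta_i} \pi_i(o_i)\;
\nabla_{a_i} Q_i(s, a_1, \dots, a_N)\Big|_{a_i = \pi_i(o_i)}\Big]
```

Only the actor term depends on $o_i$ alone, which is why the critics can be discarded once training ends.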

VDN and QMIX — factorizing the team value

For purely cooperative teams, the trick is to decompose the team’s joint value into per-agent pieces, so each agent can act greedily on its own slice while the team value is still maximized. VDN (Sunehag et al., 2017) takes the simplest form:

$$Q_{\text{tot}}(s, \boldsymbol{a}) \;=\; \sum_{i=1}^{N} Q_i(o_i, a_i)$$

QMIX (Rashid et al., 2018) generalizes this: instead of a plain sum, a mixing network combines the per-agent utilities into $Q_{\text{tot}}$, subject to a monotonicity constraint:

$$\frac{\partial Q_{\text{tot}}}{\partial Q_i} \ge 0 \quad \forall i$$

Monotonicity is the magic. It guarantees that the action that maximizes each agent’s local $Q_i$ also maximizes the team $Q_{\text{tot}}$, so the expensive centralized $\arg\max$ over the joint action decomposes into cheap per-agent $\arg\max$es. That’s what makes decentralized execution correct, not just convenient. QMIX remains a standard cooperative baseline.
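The decomposition claim can be checked by brute force on a toy problem. The sketch below is not the QMIX architecture: the per-agent utilities are random numbers, and the mixer is a small fixed network whose weights are forced non-negative with `abs()`, whereas in QMIX those weights are produced by hypernetworks conditioned on the state. All names here are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions, hidden = 3, 4, 8

# Hypothetical per-agent utilities Q_i(o_i, a_i) for one fixed observation each.
agent_qs = rng.normal(size=(n_agents, n_actions))

def vdn_mix(chosen_qs):
    """VDN: Q_tot is the plain sum of the chosen per-agent utilities."""
    return float(np.sum(chosen_qs))

# QMIX-style mixer for a single state: non-negative weights keep dQ_tot/dQ_i >= 0.
w1 = np.abs(rng.normal(size=(n_agents, hidden)))
b1 = rng.normal(size=hidden)
w2 = np.abs(rng.normal(size=hidden))
b2 = float(rng.normal())

def elu(x):
    """Monotonically increasing activation."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def qmix_mix(chosen_qs):
    """Monotonic mixing of the chosen per-agent utilities into Q_tot."""
    return float(elu(chosen_qs @ w1 + b1) @ w2 + b2)

def joint_argmax(mixer):
    """Expensive centralized argmax over all n_actions ** n_agents joint actions."""
    agents = np.arange(n_agents)
    return max(itertools.product(range(n_actions), repeat=n_agents),
               key=lambda joint: mixer(agent_qs[agents, list(joint)]))

# Cheap decentralized choice: each agent maximizes its own utility.
decentralized = tuple(int(a) for a in agent_qs.argmax(axis=1))

print(decentralized, joint_argmax(vdn_mix), joint_argmax(qmix_mix))
```

All three printed tuples agree: under either mixer, the per-agent greedy actions are the joint greedy action.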

COMA — solving credit assignment with counterfactuals

COMA (Foerster et al., 2017) attacks credit assignment directly. It uses a centralized critic and a counterfactual baseline: to judge agent $i$’s action, it asks “how much better did the team do than it would have on average had agent $i$ drawn a different action from its own policy, holding everyone else’s actions fixed?” The advantage becomes

$$A_i(s, \boldsymbol{a}) = Q(s, \boldsymbol{a}) - \sum_{a_i'} \pi_i(a_i' \mid o_i)\, Q\big(s,\, (\boldsymbol{a}_{-i}, a_i')\big)$$

By marginalizing out only agent $i$’s action, COMA isolates that agent’s contribution, separating signal from the noise of teammates.
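A worked instance of that advantage for one fixed state, assuming the centralized critic’s values are available as a small table indexed by joint action (the table entries, the policies and the function name `coma_advantage` are made up for illustration):

```python
import numpy as np

def coma_advantage(q_joint, policy_i, joint_action, agent):
    """Counterfactual advantage of `agent` for one state.

    q_joint[a_1, ..., a_N] holds Q(s, a) for every joint action;
    policy_i[a] is pi_i(a | o_i) for the agent being credited.
    """
    # Slice out Q for every alternative action of `agent`, others held fixed.
    index = list(joint_action)
    index[agent] = slice(None)
    q_alternatives = q_joint[tuple(index)]
    baseline = float(policy_i @ q_alternatives)
    return float(q_joint[tuple(joint_action)] - baseline)

# Two agents, two actions each (illustrative critic values).
q_joint = np.array([[0.0, 1.0],
                    [2.0, 5.0]])
policy_0 = np.array([0.5, 0.5])
policy_1 = np.array([0.25, 0.75])

print(coma_advantage(q_joint, policy_0, (1, 1), agent=0))  # 5 - (0.5*1 + 0.5*5) = 2.0
print(coma_advantage(q_joint, policy_1, (1, 1), agent=1))  # 5 - (0.25*2 + 0.75*5) = 0.75
```

Both agents took part in the same joint action with value 5, yet agent 0 receives more credit because its choice mattered more given what agent 1 did. A useful sanity check is that each agent’s advantage averages to zero under its own policy.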

Go deeper: why monotonicity in QMIX is also a limitation

QMIX’s monotonicity constraint buys tractable decentralized $\arg\max$, but it can’t represent every cooperative game. Tasks with non-monotonic value structure, where the best individual action depends sharply on what a teammate simultaneously does (for example, games that heavily punish miscoordination), fall outside QMIX’s representable class. Successors like QTRAN, QPLEX and weighted QMIX relax the constraint to recover more of the joint-value function at the cost of extra machinery. This expressiveness-vs-tractability trade is a recurring theme in value-factorization research.
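The failure mode is easiest to see in the additive (VDN) special case of a monotonic mixer. The sketch below takes a two-agent payoff matrix that punishes miscoordination and computes the best least-squares additive fit in closed form, weighting all joint actions equally. That uniform weighting and the payoff values are assumptions of this example; a trained agent would instead fit on sampled experience.

```python
import numpy as np

# Cooperative one-step game: the best joint action (0, 0) pays 8, but either
# agent choosing 0 alone is heavily punished (illustrative payoffs).
payoff = np.array([[8.0, -12.0, -12.0],
                   [-12.0, 0.0, 0.0],
                   [-12.0, 0.0, 0.0]])

# Least-squares additive fit Q_tot(a1, a2) ~ q1[a1] + q2[a2]. For a full table
# with equal weights this is row mean + column mean - grand mean.
grand_mean = payoff.mean()
q1 = payoff.mean(axis=1) - grand_mean / 2
q2 = payoff.mean(axis=0) - grand_mean / 2
additive_fit = q1[:, None] + q2[None, :]

# Decentralized greedy actions under the fit versus the true optimum.
greedy_joint = (int(q1.argmax()), int(q2.argmax()))
true_best = tuple(int(i) for i in np.unravel_index(payoff.argmax(), payoff.shape))

print(greedy_joint, payoff[greedy_joint])  # the safe joint action, payoff 0.0
print(true_best, payoff[true_best])        # (0, 0), payoff 8.0
```

Because action 0 looks bad on average for each agent, the factored values steer both agents away from the only joint action that pays 8.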

Self-play, autocurricula and emergent behavior

The most spectacular MARL results don’t come from clever loss functions — they come from self-play: agents trained against copies (or past versions) of themselves. In a competitive game, self-play creates an autocurriculum — an automatically generated curriculum where every improvement by one side becomes a harder challenge for the other, with no human-designed difficulty ramp.
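The core loop, playing against a mixture of your own past versions and improving against it, can be sketched with fictitious play on rock-paper-scissors. This is a classical game-theoretic stand-in chosen because it fits in a few lines; it is not the deep RL training used by the systems discussed in this section, and the function name and step count are arbitrary.

```python
import numpy as np

# PAYOFF[i, j] is the reward for playing i against j
# (0 = rock, 1 = paper, 2 = scissors).
PAYOFF = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]])

def fictitious_self_play(steps, first_action=0):
    """Repeatedly best-respond to the empirical mixture of all past selves."""
    counts = np.zeros(3)
    counts[first_action] = 1.0
    for _ in range(steps):
        # Expected payoff of each action against the history of past versions.
        expected = PAYOFF @ counts
        # The new version plays the best response and joins the history.
        counts[int(np.argmax(expected))] += 1.0
    return counts / counts.sum()

freqs = fictitious_self_play(10_000)
print(freqs)
```

Each best response exploits the current history and, once added to it, becomes the thing the next version must beat. Over many rounds the empirical mixture drifts toward the uniform strategy, which is the equilibrium of this game.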

OpenAI’s Emergent Tool Use from Multi-Agent Autocurricula (Baker et al., 2019) is the vivid demonstration. Hiders and seekers play hide-and-seek with movable boxes and ramps. With no reward for touching objects — only for hiding or finding — six escalating strategy phases emerge: hiders build box forts, seekers learn to use ramps to jump in, hiders learn to lock the ramps away, and seekers eventually discover “box surfing” (riding a box over walls by exploiting the physics). Each phase exists only because the previous one created the pressure.

  • 6: distinct emergent strategy phases in hide-and-seek
  • 99.8%: share of ranked human players AlphaStar placed above in StarCraft II
  • 0: object-use rewards given; tool use was fully emergent

This is the same engine behind the headline systems. AlphaZero and MuZero reach superhuman play in Go, chess and shogi purely through self-play; AlphaStar hit grandmaster in StarCraft II using a league of diverse agents to avoid strategic blind spots; and OpenAI Five beat the Dota 2 world champions through massive-scale self-play.

A short history of MARL

  • 1953, Stochastic games: Shapley defines stochastic (Markov) games, the multi-agent analog of the MDP, decades before deep RL.
  • 1994, Markov games meet RL: Littman’s minimax-Q applies Q-learning to two-player zero-sum Markov games, founding modern MARL.
  • 2017, The deep CTDE wave: MADDPG and COMA bring centralized critics to deep MARL; VDN factorizes the team value.
  • 2018, QMIX & StarCraft benchmark: QMIX’s monotonic mixing and the SMAC benchmark make cooperative MARL reproducible and competitive.
  • 2019, Superhuman milestones: AlphaStar reaches StarCraft II grandmaster; OpenAI Five wins at Dota 2; hide-and-seek shows emergent tool use.
  • 2024, The textbook & LLM agents: The first comprehensive MARL textbook (Albrecht, Christianos, Schäfer) lands as multi-agent LLM systems revive interest.

Where MARL is used

| Domain | What the agents are | Why it’s multi-agent |
| --- | --- | --- |
| Games & e-sports | Units, players, teams | Pure competition/cooperation at scale (AlphaStar, OpenAI Five) |
| Autonomous driving | Vehicles in traffic | Every other car is a learning, reacting agent |
| Robotics & swarms | Drones, warehouse bots | Coordination and collision-avoidance under partial observability (see RL in robotics) |
| Networks & energy | Routers, grid controllers | Distributed resource allocation, no central controller |
| Trading & markets | Strategy agents | Strategic, adversarial, non-stationary by nature |
| LLM agent systems | Tool-using language agents | Teams of models negotiating, debating, or dividing labor (see agentic RL) |

Limitations and open problems

  • Non-stationarity has no clean solution. CTDE mitigates it during training, but truly decentralized, online co-adaptation remains theoretically thorny.
  • Equilibrium selection. General-sum games can have many Nash equilibria, some bad for everyone; which one learning converges to is poorly understood.
  • Scalability. Most strong methods still assume tens, not thousands, of agents. Mean-field and graph-based approaches push the limit but trade away fidelity.
  • Emergent collusion and social dilemmas. Self-interested learners can converge on outcomes that are individually rational but collectively harmful — a live concern for markets and RL safety.
  • Evaluation. Without a single objective, “is this policy good?” depends on the opponent distribution — making benchmarks fragile.

MARL in practice

CTDE methods (MADDPG, QMIX, MAPPO) are the workhorses for cooperative tasks; self-play and league training power competitive ones. Standard benchmarks include SMAC (StarCraft Multi-Agent Challenge), PettingZoo (the multi-agent analog of Gym), and Melting Pot (social-dilemma evaluation). Libraries such as EPyMARL, MARLlib and RLlib implement the core algorithms — see RL libraries and frameworks. Building realistic multi-agent simulators, opponent pools, and reward pipelines at scale is its own discipline — see the multi-agent RL environment companies.

Frequently asked questions

How is MARL different from just running several single-agent RL algorithms?

You can run independent learners (IQL is exactly that), and it’s a real baseline — but it ignores non-stationarity. Because every agent’s environment keeps shifting as the others learn, independent learning has no convergence guarantee and can oscillate. MARL methods explicitly model the interaction, usually via a centralized critic during training.

What does “non-stationarity” mean in MARL, exactly?

A single-agent environment is stationary: its transition and reward functions don’t change. In MARL, from any one agent’s perspective the “environment” includes the other agents — and they are learning, so their behavior changes over time. The agent is therefore optimizing against a moving target, which violates the stationarity assumption that single-agent convergence proofs rely on.

What is CTDE and why is it so popular?

Centralized Training with Decentralized Execution. During training you allow access to global information (other agents’ actions, the full state) to stabilize learning; at deployment each agent acts only on its local observation. It captures the best of both worlds — stable training plus scalable, realistic execution — and underpins MADDPG, QMIX, COMA and MAPPO.

Is AlphaGo multi-agent RL?

In a sense. AlphaGo and AlphaZero reach superhuman play through self-play in a two-player zero-sum Markov game, where the opponent is a copy of the agent. The training loop is multi-agent even though only one policy is being learned. Many people file it under both self-play and MARL.
