- Multi-agent RL (MARL) studies several agents that learn simultaneously in a shared environment — cooperating, competing, or both.
- The clean MDP world breaks: when every agent is learning at once, the environment looks non-stationary from any single agent's view, and team rewards make credit assignment hard.
- The dominant fix is CTDE — centralized training with decentralized execution — used by MADDPG (continuous), QMIX/VDN (value factorization) and COMA (counterfactual credit).
- Self-play and autocurricula drove the headline results: AlphaStar, OpenAI Five, and the emergent tool-use of OpenAI's hide-and-seek.
What is multi-agent reinforcement learning?
Multi-agent reinforcement learning (MARL) is reinforcement learning where more than one agent learns at the same time inside a shared environment. Each agent picks actions, the world updates, and each receives its own reward — but crucially, every agent’s outcomes depend on what the others do. A self-driving car merges into traffic full of other learning drivers; a team of warehouse robots must avoid each other while clearing orders; two trading bots push prices against each other. None of these is a single-agent problem dressed up — the interaction is the problem.
That interaction can be cooperative (a shared team reward), competitive (one agent’s gain is another’s loss), or mixed (teams that compete, members that cooperate — like soccer). The mix changes everything about what “optimal” even means.
The formal model: Markov games
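Single-agent RL lives on a Markov Decision Process. MARL generalizes it to a Markov game (also called a stochastic game, introduced by Shapley in 1953). For $N$ agents it is the tuple:
$$\langle N, \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, P, \{R_i\}_{i=1}^{N} \rangle$$
where $\mathcal{S}$ is the state space, $\mathcal{A}_i$ is agent $i$’s action set, and the joint action $\mathbf{a} = (a_1, \dots, a_N)$ drives both the transition and the rewards. The transition kernel depends on all agents:
$$P(s' \mid s, a_1, \dots, a_N)$$
and each agent gets its own reward $R_i(s, \mathbf{a})$. Three special cases name the field: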
Fully cooperative. All agents share one reward, $R_1 = R_2 = \dots = R_N$. The goal is a joint policy that maximizes the common return. This is the world of QMIX, VDN and COMA.
Fully competitive (zero-sum). Rewards sum to zero: one agent’s gain is another’s loss. Two-player zero-sum games have a well-defined minimax/Nash value, which is the setting of self-play and AlphaZero-style training.
Mixed (general-sum). Anything in between: teams, social dilemmas, markets. The richest and hardest case; solution concepts get subtle and equilibria may not be unique.
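The three reward structures can be told apart mechanically in a one-shot (single-state) game. The sketch below is illustrative only: the `classify_game` helper and the three example payoff tables are assumptions made for this example, not part of any library.

```python
from itertools import product

def classify_game(payoffs, tol=1e-9):
    """Classify a one-shot game by its reward structure.

    `payoffs` maps each joint action (a tuple) to a tuple of per-agent rewards.
    A game whose rewards are all zero satisfies both tests; it is reported
    as cooperative because that test runs first.
    """
    rewards = list(payoffs.values())
    if all(max(r) - min(r) <= tol for r in rewards):
        return "cooperative"   # every agent receives the same reward
    if all(abs(sum(r)) <= tol for r in rewards):
        return "zero-sum"      # rewards cancel out for every joint action
    return "general-sum"       # anything in between

joint_actions = list(product(range(2), repeat=2))

# Hypothetical team task: both agents score 1 only when they pick the same action.
team_task = {(a, b): (float(a == b), float(a == b)) for a, b in joint_actions}

# Matching pennies: agent 0 wins on a match, agent 1 wins on a mismatch.
matching_pennies = {
    (a, b): ((1.0, -1.0) if a == b else (-1.0, 1.0)) for a, b in joint_actions
}

# Prisoner's dilemma (0 = cooperate, 1 = defect).
prisoners_dilemma = {
    (0, 0): (-1.0, -1.0), (0, 1): (-3.0, 0.0),
    (1, 0): (0.0, -3.0), (1, 1): (-2.0, -2.0),
}

for name, game in [("team task", team_task),
                   ("matching pennies", matching_pennies),
                   ("prisoner's dilemma", prisoners_dilemma)]:
    print(name, "->", classify_game(game))
```

In a full Markov game the same test would have to hold at every state, not just for one payoff table.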
When agents can’t see the full state (the usual case), the cooperative model becomes a Dec-POMDP (decentralized partially observable MDP): each agent acts on its own local observation $o_i$, not on the full state $s$. The general-sum counterpart is the partially observable stochastic game (POSG). That partial observability is what makes decentralized execution both necessary and hard.
What does “optimal” mean with many learners?
In single-agent RL there’s one objective and a well-defined optimal value. With several self-interested agents there is no single “best”, only equilibria. The central solution concept is the Nash equilibrium: a joint policy where no agent can improve its own expected return by unilaterally changing its policy while the others hold theirs fixed. In a Markov game, a Nash equilibrium in Markov (state-dependent) policies that holds from every state is called a Markov-perfect equilibrium. Equilibria can be multiple, hard to compute, and not necessarily good for the group, which is why MARL borrows heavily from game theory.
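The Nash condition is easy to check by brute force in a small one-shot game. The sketch below is a minimal illustration restricted to pure (deterministic) strategies; the function name and both payoff tables are assumptions made for this example.

```python
from itertools import product

def pure_nash_equilibria(payoffs, n_actions):
    """Return every joint action from which no agent gains by deviating alone.

    `payoffs` maps a joint action tuple to a tuple of per-agent rewards.
    `n_actions` gives the number of actions available to each agent.
    """
    equilibria = []
    for joint in product(*(range(n) for n in n_actions)):
        stable = True
        for agent, n in enumerate(n_actions):
            for alternative in range(n):
                # Change only this agent's action; everyone else holds fixed.
                deviated = joint[:agent] + (alternative,) + joint[agent + 1:]
                if payoffs[deviated][agent] > payoffs[joint][agent]:
                    stable = False
        if stable:
            equilibria.append(joint)
    return equilibria

# Prisoner's dilemma (0 = cooperate, 1 = defect).
prisoners_dilemma = {
    (0, 0): (-1, -1), (0, 1): (-3, 0),
    (1, 0): (0, -3), (1, 1): (-2, -2),
}

# Hypothetical coordination game: matching is rewarded, (0, 0) more than (1, 1).
coordination = {
    (0, 0): (2, 2), (0, 1): (0, 0),
    (1, 0): (0, 0), (1, 1): (1, 1),
}

print(pure_nash_equilibria(prisoners_dilemma, (2, 2)))  # [(1, 1)]
print(pure_nash_equilibria(coordination, (2, 2)))       # [(0, 0), (1, 1)]
```

The two games show both caveats at once: mutual defection is the only equilibrium of the prisoner's dilemma even though mutual cooperation pays both agents more, and the coordination game has two equilibria, one of them worse for everyone.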
Why MARL is hard: four core challenges
Non-stationarity. The killer problem. Standard RL assumes a stationary environment with fixed transition and reward dynamics. But when every agent learns simultaneously, each agent’s effective environment (the others’ behavior) keeps changing. A policy that was good yesterday may be bad once opponents adapt. This breaks the convergence guarantees of single-agent Q-learning, which assume fixed dynamics: from one agent’s local view, the same state and action lead to different outcomes as the other policies change.
Credit assignment. In a cooperative team with one shared reward, who actually caused the win? If five robots get +1 for clearing an order, each must figure out how much its own actions contributed. This multi-agent credit assignment problem is the reason value-decomposition and counterfactual methods exist.
Scalability. The joint action space has $\prod_{i=1}^{N} |\mathcal{A}_i|$ elements, which is $|\mathcal{A}|^N$ when every agent has the same action set: exponential in the number of agents. A naive “treat the team as one big agent” approach (a joint-action learner) is intractable beyond a handful of agents.
Partial observability and coordination. Each agent typically sees only a slice of the world. Agents must coordinate, and sometimes communicate, to act coherently, all while their teammates’ policies are themselves shifting underfoot.
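Two of the challenges above can be made concrete with a few lines of arithmetic. The sketch below is illustrative: the matching-pennies payoff table, the two opponent policies, and the helper names are assumptions chosen for this example.

```python
from math import prod

# Agent 0's reward in matching pennies: +1 on a match, -1 on a mismatch.
REWARD_0 = [[1.0, -1.0],
            [-1.0, 1.0]]

def effective_reward(action, other_policy):
    """Expected reward agent 0 experiences for `action`,
    given the other agent's current action probabilities."""
    return sum(p * REWARD_0[action][b] for b, p in enumerate(other_policy))

def joint_action_count(action_set_sizes):
    """Size of the joint action space: the product of the per-agent sizes."""
    return prod(action_set_sizes)

# Non-stationarity: the same action is worth different amounts as the
# other agent's policy drifts during learning.
early = effective_reward(0, [0.9, 0.1])  # the other agent mostly plays 0
late = effective_reward(0, [0.1, 0.9])   # the other agent has adapted
print(round(early, 3), round(late, 3))   # 0.8 -0.8

# Scalability: 5 actions per agent, with 2, 5 and 10 agents.
for n_agents in (2, 5, 10):
    print(n_agents, joint_action_count([5] * n_agents))
```

Nothing in agent 0's own state or action changed between the two evaluations; only the other agent's policy did, which is exactly what a single-agent learner cannot see.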
The dominant paradigm: centralized training, decentralized execution (CTDE)
The single most influential idea in modern deep MARL is CTDE. The insight: training and execution have different constraints.
- During training (e.g. in a simulator) you can cheat — give the learner access to the global state, every agent’s observations, and every agent’s actions. This extra information tames non-stationarity, because a critic conditioned on the joint action sees a stationary target.
- During execution each agent must act on its own local observation alone — no telepathy, no central controller.
CTDE squares that circle: train with a centralized critic, deploy a decentralized actor. Almost every flagship cooperative algorithm is a CTDE method.
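The split between what the critic sees and what each actor sees can be sketched as two input-building functions. This is a minimal illustration of the information sets only, with no learning; the function names, feature layout, and toy values are all assumptions.

```python
def actor_input(observations, agent):
    """Execution-time view: only this agent's own local observation."""
    return list(observations[agent])

def centralized_critic_input(state, observations, actions):
    """Training-time view: global state, every observation, every action."""
    features = list(state)
    for obs in observations:
        features.extend(obs)
    features.extend(actions)
    return features

# Hypothetical two-agent snapshot from a simulator.
state = [0.0, 1.0, 2.0]                   # full state, training only
observations = [[0.0, 1.0], [1.0, 2.0]]   # one local slice per agent
actions = [0.3, -0.7]                     # one action per agent

actor_views = [actor_input(observations, i) for i in range(2)]
critic_view = centralized_critic_input(state, observations, actions)

print(len(actor_views[0]), len(critic_view))  # 2 9
```

At deployment only `actor_input` is needed, so the global state and the other agents' actions never have to be transmitted.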
The algorithm landscape
MARL methods fall into a few families. The table is the map; the prose below it is the territory.
| Algorithm | Type | Reward setting | Key idea | Drops at execution |
|---|---|---|---|---|
| IQL (independent Q) | Value, decentralized | Any | Each agent runs its own DQN, ignores the rest | nothing extra (baseline) |
| MADDPG | Actor-critic, CTDE | Any (cooperative, competitive, mixed) | Centralized critic per agent; decentralized deterministic actors for continuous actions | the centralized critics |
| VDN | Value factorization, CTDE | Cooperative | Joint Q = sum of per-agent Q’s | the summation |
| QMIX | Value factorization, CTDE | Cooperative | Joint Q = monotonic mix of per-agent Q’s | the mixing network |
| COMA | Actor-critic, CTDE | Cooperative | Counterfactual baseline for credit assignment | the centralized critic |
| MAPPO | Actor-critic, CTDE | Cooperative / mixed | PPO with a centralized value function | the centralized critic |
MADDPG — centralized critics for mixed settings
MADDPG (Lowe et al., 2017) is the canonical CTDE actor-critic. The paper opens by naming the two diseases: Q-learning suffers from non-stationarity, and policy gradients suffer from variance that explodes with the number of agents. The cure: give each agent a centralized critic that sees everyone’s actions, while each actor stays decentralized. Because the critic conditions on the joint action, its learning target is stationary even as policies change. MADDPG handles cooperative, competitive and mixed settings with continuous actions — building on DDPG.
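The role of the centralized critic shows up most clearly in its bootstrapped target, which evaluates the next state with the next actions of all agents, each produced by that agent's target actor from its own observation. The sketch below uses hand-written toy functions in place of neural networks; the names, the linear forms, and the numbers are assumptions for illustration, not the paper's code.

```python
def centralized_td_target(reward, gamma, next_state, next_observations,
                          target_actors, target_critic, done=False):
    """One-step target for agent i's critic: r_i + gamma * Q_i'(s', a_1', ..., a_N')."""
    if done:
        return reward
    # Each target actor sees only its own next observation.
    next_actions = [actor(obs) for actor, obs in zip(target_actors, next_observations)]
    # The critic sees the state and the full joint action.
    return reward + gamma * target_critic(next_state, next_actions)

# Hypothetical deterministic target actors (stand-ins for networks).
target_actors = [
    lambda obs: 0.5 * obs[0],
    lambda obs: -1.0 * obs[0],
]

# Hypothetical target critic for agent 0 (a stand-in for a network).
def target_critic(state, joint_action):
    return state[0] + sum(joint_action)

target = centralized_td_target(
    reward=1.0, gamma=0.9,
    next_state=[1.0],
    next_observations=[[2.0], [0.5]],
    target_actors=target_actors,
    target_critic=target_critic,
)
print(round(target, 3))  # 2.35
```

Because the joint action is an explicit input, a change in another agent's policy changes which inputs the critic is queried on rather than changing the function it is trying to learn.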
VDN and QMIX — factorizing the team value
For purely cooperative teams, the trick is to decompose the team’s joint value into per-agent pieces, so each agent can act greedily on its own slice while the team value is still maximized. VDN (Sunehag et al., 2017) takes the simplest form:
$$Q_{tot}(\boldsymbol{\tau}, \mathbf{a}) = \sum_{i=1}^{N} Q_i(\tau_i, a_i)$$
where $\tau_i$ is agent $i$’s local action-observation history and $a_i$ is its action.
QMIX (Rashid et al., 2018) generalizes this: instead of a plain sum, a mixing network combines the per-agent utilities $Q_i$ into $Q_{tot}$, subject to a monotonicity constraint:
$$\frac{\partial Q_{tot}}{\partial Q_i} \ge 0 \quad \text{for all } i$$
Monotonicity is the magic. It guarantees that the action that maximizes each agent’s local $Q_i$ also maximizes the team $Q_{tot}$, so the expensive centralized $\arg\max$ over the joint action decomposes into cheap per-agent $\arg\max$es. That’s what makes decentralized execution correct, not just convenient. QMIX remains a standard cooperative baseline.
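The decomposition claim can be checked by brute force on a toy problem. The sketch below compares per-agent greedy actions with an exhaustive search over joint actions, for a VDN-style sum and for a simplified monotonic mixer. The mixer is an assumption made for illustration: it uses fixed weights passed through `abs` and a `tanh`, whereas QMIX produces state-dependent non-negative weights with hypernetworks.

```python
import math
from itertools import product

def vdn_mix(agent_qs):
    """VDN: the team value is the plain sum of per-agent utilities."""
    return sum(agent_qs)

def monotonic_mix(agent_qs, weights=(1.2, -0.5), bias=0.1):
    """Toy monotonic mixer: abs() forces non-negative weights and tanh is
    increasing, so the output never decreases when any agent's Q rises."""
    weighted = sum(abs(w) * q for w, q in zip(weights, agent_qs))
    return math.tanh(weighted + bias)

def decentralized_greedy(q_tables):
    """Each agent takes the argmax of its own utilities."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)

def centralized_greedy(q_tables, mixer):
    """Exhaustive argmax of the mixed team value over all joint actions."""
    joint_actions = product(*(range(len(q)) for q in q_tables))
    return max(
        joint_actions,
        key=lambda joint: mixer([q[a] for q, a in zip(q_tables, joint)]),
    )

# Hypothetical per-agent utilities: 3 actions for agent 0, 2 for agent 1.
q_tables = [[0.2, 1.5, -0.3], [0.7, 0.1]]

print(decentralized_greedy(q_tables))               # (1, 0)
print(centralized_greedy(q_tables, vdn_mix))        # (1, 0)
print(centralized_greedy(q_tables, monotonic_mix))  # (1, 0)
```

The exhaustive search visits every joint action, while the decentralized version only scans each agent's own table. Agreement between the two is what the monotonicity constraint is there to guarantee.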
COMA — solving credit assignment with counterfactuals
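COMA (Foerster et al., 2017) attacks credit assignment directly. It uses a centralized critic and a counterfactual baseline: to judge agent $i$’s action, it asks “how much better did the team do than it would have on average if agent $i$ had drawn a different action from its own policy, holding everyone else’s actions fixed?” The advantage becomes
$$A_i(s, \mathbf{a}) = Q(s, \mathbf{a}) - \sum_{a_i'} \pi_i(a_i' \mid \tau_i)\, Q\big(s, (\mathbf{a}_{-i}, a_i')\big)$$
where $\mathbf{a}_{-i}$ denotes the actions of all agents other than $i$ and $\tau_i$ is agent $i$’s local history.
By marginalizing out only agent $i$’s action, COMA isolates that agent’s contribution, separating signal from the noise of teammates.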
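The counterfactual baseline is a short loop once a centralized critic is available. The sketch below is a minimal illustration with a hand-written critic in place of a learned one; the function names, the toy critic, and the uniform policy are assumptions.

```python
def counterfactual_advantage(q_fn, state, joint_action, agent, policy_probs):
    """Advantage of `agent`'s chosen action over the policy-weighted average
    of its alternatives, with every other agent's action held fixed."""
    baseline = 0.0
    for alternative, prob in enumerate(policy_probs):
        # Swap in the alternative for this agent only.
        counterfactual = joint_action[:agent] + (alternative,) + joint_action[agent + 1:]
        baseline += prob * q_fn(state, counterfactual)
    return q_fn(state, joint_action) - baseline

def team_q(state, joint_action):
    """Hypothetical critic: the team scores only if both agents pick action 1."""
    return 1.0 if joint_action == (1, 1) else 0.0

uniform = [0.5, 0.5]

# The teammate chose 1, so agent 0's choice of 1 was decisive.
decisive = counterfactual_advantage(team_q, None, (1, 1), 0, uniform)

# The teammate chose 0, so nothing agent 0 did could have changed the outcome.
irrelevant = counterfactual_advantage(team_q, None, (1, 0), 0, uniform)

print(decisive, irrelevant)  # 0.5 0.0
```

In the second case the shared reward is zero, yet the advantage does not penalize agent 0 for a failure that came from its teammate's action.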
Go deeper: why monotonicity in QMIX is also a limitation
QMIX’s monotonicity constraint buys tractable decentralized $\arg\max$, but it can’t represent every cooperative game. Tasks with non-monotonic value structure, where the best individual action depends sharply on what a teammate simultaneously does (coordination that requires miscoordination to be punished), fall outside QMIX’s representable class. Successors like QTRAN, QPLEX and weighted QMIX relax the constraint to recover more of the joint-value function at the cost of extra machinery. This expressiveness-vs-tractability trade is a recurring theme in value-factorization research.
Self-play, autocurricula and emergent behavior
The most spectacular MARL results don’t come from clever loss functions — they come from self-play: agents trained against copies (or past versions) of themselves. In a competitive game, self-play creates an autocurriculum — an automatically generated curriculum where every improvement by one side becomes a harder challenge for the other, with no human-designed difficulty ramp.
OpenAI’s Emergent Tool Use from Multi-Agent Autocurricula (Baker et al., 2019) is the vivid demonstration. Hiders and seekers play hide-and-seek with movable boxes and ramps. With no reward for touching objects — only for hiding or finding — six escalating strategy phases emerge: hiders build box forts, seekers learn to use ramps to jump in, hiders learn to lock the ramps away, and seekers eventually discover “box surfing” (riding a box over walls by exploiting the physics). Each phase exists only because the previous one created the pressure.
This is the same engine behind the headline systems. AlphaZero and MuZero reach superhuman play in Go, chess and shogi purely through self-play; AlphaStar hit grandmaster in StarCraft II using a league of diverse agents to avoid strategic blind spots; and OpenAI Five beat the Dota 2 world champions through massive-scale self-play.
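The self-play loop in this section can be illustrated at toy scale with fictitious self-play on rock-paper-scissors: a single agent repeatedly best-responds to the empirical mix of its own past actions. This is a classical stand-in chosen for brevity, not the training procedure of any of the systems named above, and the function names are assumptions.

```python
# Row player's payoff in rock-paper-scissors (0 = rock, 1 = paper, 2 = scissors).
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def best_response(opponent_counts):
    """Action with the highest total payoff against the opponent's action counts.
    Ties go to the lowest-numbered action."""
    values = [
        sum(PAYOFF[a][b] * count for b, count in enumerate(opponent_counts))
        for a in range(3)
    ]
    return max(range(3), key=lambda a: values[a])

def fictitious_self_play(steps, first_action=0):
    """Play `steps` rounds against the history of one's own past actions.

    Returns the empirical action frequencies and the full action history.
    """
    counts = [0, 0, 0]
    counts[first_action] = 1
    history = [first_action]
    for _ in range(steps - 1):
        action = best_response(counts)  # counter the accumulated past selves
        counts[action] += 1
        history.append(action)
    return [c / steps for c in counts], history

frequencies, history = fictitious_self_play(10_000)
print(history[:9])                        # [0, 1, 1, 1, 2, 2, 2, 2, 2]
print([round(f, 2) for f in frequencies])
```

Each switch in the history is a small autocurriculum step: paper appears because the past self played rock, scissors appears once paper dominates the record, and so on. The long-run action frequencies drift toward the uniform mix, which is the Nash equilibrium of this game.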
A short history of MARL
Where MARL is used
| Domain | What the agents are | Why it’s multi-agent |
|---|---|---|
| Games & e-sports | Units, players, teams | Pure competition/cooperation at scale (AlphaStar, OpenAI Five) |
| Autonomous driving | Vehicles in traffic | Every other car is a learning, reacting agent |
| Robotics & swarms | Drones, warehouse bots | Coordination and collision-avoidance under partial observability — see RL in robotics |
| Networks & energy | Routers, grid controllers | Distributed resource allocation, no central controller |
| Trading & markets | Strategy agents | Strategic, adversarial, non-stationary by nature |
| LLM agent systems | Tool-using language agents | Teams of models negotiating, debating, or dividing labor — see agentic RL |
Limitations and open problems
- Non-stationarity has no clean solution. CTDE mitigates it during training, but truly decentralized, online co-adaptation remains theoretically thorny.
- Equilibrium selection. General-sum games can have many Nash equilibria, some bad for everyone; which one learning converges to is poorly understood.
- Scalability. Most strong methods still assume tens, not thousands, of agents. Mean-field and graph-based approaches push the limit but trade away fidelity.
- Emergent collusion and social dilemmas. Self-interested learners can converge on outcomes that are individually rational but collectively harmful — a live concern for markets and RL safety.
- Evaluation. Without a single objective, “is this policy good?” depends on the opponent distribution — making benchmarks fragile.
MARL in practice
CTDE methods (MADDPG, QMIX, MAPPO) are the workhorses for cooperative tasks; self-play and league training power competitive ones. Standard benchmarks include SMAC (StarCraft Multi-Agent Challenge), PettingZoo (the multi-agent analog of Gym), and Melting Pot (social-dilemma evaluation). Libraries such as EPyMARL, MARLlib and RLlib implement the core algorithms — see RL libraries and frameworks. Building realistic multi-agent simulators, opponent pools, and reward pipelines at scale is its own discipline — see the multi-agent RL environment companies.
Frequently asked questions
How is MARL different from just running several single-agent RL algorithms?
You can run independent learners (IQL is exactly that), and it’s a real baseline — but it ignores non-stationarity. Because every agent’s environment keeps shifting as the others learn, independent learning has no convergence guarantee and can oscillate. MARL methods explicitly model the interaction, usually via a centralized critic during training.
What does “non-stationarity” mean in MARL, exactly?
A single-agent environment is stationary: its transition and reward functions don’t change. In MARL, from any one agent’s perspective the “environment” includes the other agents — and they are learning, so their behavior changes over time. The agent is therefore optimizing against a moving target, which violates the stationarity assumption that single-agent convergence proofs rely on.
What is CTDE and why is it so popular?
Centralized Training with Decentralized Execution. During training you allow access to global information (other agents’ actions, the full state) to stabilize learning; at deployment each agent acts only on its local observation. It captures the best of both worlds — stable training plus scalable, realistic execution — and underpins MADDPG, QMIX, COMA and MAPPO.
Is AlphaGo multi-agent RL?
In a sense. AlphaGo (after bootstrapping from human games) and AlphaZero (from scratch) reach superhuman play through self-play in a two-player zero-sum Markov game, where the opponent is a copy of the agent. The training loop is multi-agent even though only one policy is being learned. Many people file it under both self-play and MARL.
Key papers
- Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (MADDPG) — Lowe et al., 2017.
- Counterfactual Multi-Agent Policy Gradients (COMA) — Foerster et al., 2017.
- Value-Decomposition Networks (VDN) — Sunehag et al., 2017.
- QMIX: Monotonic Value Function Factorisation — Rashid et al., 2018.
- Grandmaster level in StarCraft II (AlphaStar) — Vinyals et al., 2019.
- Emergent Tool Use from Multi-Agent Autocurricula — Baker et al., 2019.
Related
What is reinforcement learning? · Markov decision processes · Policy gradients · Actor-critic · PPO · AlphaZero & MuZero · Continuous control · Agentic RL