- An RL environment is the world an agent acts in: it defines what the agent observes, what it can do, and how it's rewarded — the concrete form of a Markov decision process.
- A benchmark is a standardized set of environments plus an agreed evaluation protocol; it is the fixed protocol, not the suite alone, that makes results comparable.
- The standard interface is reset() and step() → (observation, reward, terminated, truncated, info); OpenAI Gym became Gymnasium, now maintained by the Farama Foundation.
- The landscape spans classic control → Atari → robotics → procedural → multi-agent → offline datasets → GPU/JAX-native sims → the 2025 wave of LLM agent gyms with verifiable rewards.
What is an RL environment?
An RL environment is the world a reinforcement-learning agent lives in. It is the other half of the loop: the agent chooses actions, the environment responds with a new situation and a number that says how well things are going. Everything the agent ever learns comes from interacting with it.
Concretely, an environment defines three things and one rule:
- Observations — what the agent can see of the world’s state at each step.
- Actions — what the agent is allowed to do.
- Reward — a scalar score telling the agent how good its last action was.
- Dynamics + termination — how the world changes in response to actions, and when an episode ends.
That is exactly the structure of a Markov decision process (MDP). An “environment” is just an MDP you can actually run code against.
The three things every environment defines (a CartPole example)
The classic “hello world” of RL is CartPole: balance a pole hinged on a cart by pushing the cart left or right. It makes the three pieces concrete.
| Piece | CartPole | What it generalizes to |
|---|---|---|
| Observation space | 4 numbers: cart position, cart velocity, pole angle, pole angular velocity | Anything the agent senses — pixels, joint angles, text, a board state |
| Action space | Discrete(2): push left or push right | Discrete (buttons) or continuous (torques, steering) |
| Reward | +1 for every timestep the pole stays up | The objective, distilled to a number per step |
| Termination | Pole falls past ±12° or cart leaves the track | A terminal state defined by the task |
| Truncation | Episode hits 500 steps | A cutoff outside the task (a time limit) |
The split between terminated and truncated is subtle but important: terminated means the MDP genuinely ended (you won, you lost, the robot fell); truncated means an external limit stopped you (a step budget). Algorithms must treat them differently when bootstrapping value estimates — a truncated episode isn’t really “over,” so its final value should still be estimated, not zeroed.
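To make the bootstrapping point concrete, here is a minimal sketch of a one-step TD target. It assumes a scalar value estimate `next_value` supplied by some critic; the function name `td_target` and the default `gamma` are illustrative and not part of Gymnasium.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step bootstrapped target for a single transition.

    Only `terminated` zeroes the bootstrap term. `truncated` is deliberately
    not an argument: a time-limit cutoff leaves the next state's value intact.
    """
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap


# Pole fell: a true terminal state, so no future value is added.
print(td_target(reward=1.0, next_value=50.0, terminated=True))   # 1.0
# Step limit hit: the episode stops, but the target still bootstraps.
print(td_target(reward=1.0, next_value=50.0, terminated=False))  # 50.5
```

A common bug is to pass `terminated or truncated` as the mask, which silently teaches the critic that the state at the time limit is worth nothing.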
Environment vs. benchmark vs. evaluation protocol
These three words get used loosely. Keeping them apart is half of reading RL papers correctly.
- Environment: a single task you can run, such as CartPole, one Atari game, or one MuJoCo robot. It exposes the reset/step interface and nothing more.
- Benchmark (suite): a curated collection of environments meant to be tackled together, such as Atari-57, the DeepMind Control Suite, or Procgen’s 16 games, so that a single number summarizes broad competence.
- Evaluation protocol: the agreed rules of measurement: which seeds, how many episodes, sticky actions on or off, raw vs. normalized scores, sample budget. Without a fixed protocol the same suite yields incomparable numbers.
The protocol is the part newcomers underrate. Two papers can both “use Atari” and still be incomparable because one used sticky actions and 200M frames while the other used deterministic ROMs and 50M frames. A benchmark without a protocol is just a pile of environments.
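One way to keep the protocol explicit is to record it next to every reported number. The sketch below is a hypothetical record type, not an API from any benchmark library; the field names, and the episode and seed counts, are assumptions chosen for illustration. The two instances mirror the sticky-actions/200M-frame versus deterministic/50M-frame example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of the knobs that must match before comparing scores."""
    suite: str
    sticky_action_prob: float   # 0.0 means a deterministic emulator
    frame_budget: int           # environment frames used for training
    eval_episodes: int
    num_seeds: int


def comparable(a: EvalProtocol, b: EvalProtocol) -> bool:
    """Treat two results as comparable only if every protocol field matches."""
    return a == b


paper_a = EvalProtocol("Atari-57", 0.25, 200_000_000, 100, 5)
paper_b = EvalProtocol("Atari-57", 0.0, 50_000_000, 100, 5)
print(comparable(paper_a, paper_b))  # False: same suite, different protocol
```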
The standard interface: reset() and step()
Before 2016 every lab wrapped its own simulator its own way, and code didn’t transfer. OpenAI Gym fixed this by proposing one tiny interface that almost everything now speaks:
- reset(): Start a fresh episode. Returns the first observation (often with a seeded bit of randomness so the agent learns a general policy, not one initial condition) and an info dict for diagnostics.
- step(action): Apply one action, advance the world one tick, and return the next observation, the reward, the two end-of-episode flags, and info. The agent calls this in a loop.
- terminated / truncated: When either flag is True, the episode is over and you start again. A training run is millions of these step/reset cycles.
A minimal interaction loop is just a few lines:
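```python
import gymnasium as gym

env = gym.make("CartPole-v1")

def policy(obs):
    # Placeholder agent: samples a random action; swap in your own policy.
    return env.action_space.sample()

obs, info = env.reset(seed=42)
for _ in range(1000):
    action = policy(obs)  # your agent
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```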
A tour of the major environment families
Below is the map almost no single page draws in full — from toy gridworlds all the way to LLM agent gyms. Plain-English first, references after.
Classic control and gridworlds — the “hello world” of RL
Tiny, fast, fully understood. CartPole, MountainCar, Acrobot and LunarLander are physics toys; FrozenLake and Taxi are discrete gridworlds you can solve with a lookup table. They exist to debug an algorithm in seconds, not to prove anything about scale. If your PPO implementation can’t solve CartPole, the bug is in your code, not your hyperparameters.
Atari and the Arcade Learning Environment — discrete actions from pixels
The Arcade Learning Environment (ALE) turned Atari 2600 games into the canonical test of general competence: one algorithm, ~57 games, raw pixels in, joystick out. It’s the benchmark behind DQN’s 2013–2015 breakthrough. The modern evaluation protocol comes from Machado et al. (2018), which added sticky actions (a 25% chance the previous action repeats) so agents can’t win by memorizing a fixed sequence. Scores are usually reported as human-normalized values aggregated across the suite.
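The sticky-action idea can be sketched as a thin wrapper around any reset()/step() environment. This is a simplified per-step illustration, not the ALE implementation; `StickyActions` and the toy `EchoEnv` are hypothetical names introduced here, and the wrapper only assumes the Gymnasium-style 5-tuple.

```python
import random


class StickyActions:
    """Repeat the previously executed action with probability `repeat_prob`."""

    def __init__(self, env, repeat_prob=0.25, seed=None):
        self.env = env
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)
        self.last_action = None

    def reset(self, **kwargs):
        self.last_action = None          # no previous action at episode start
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.last_action is not None and self.rng.random() < self.repeat_prob:
            action = self.last_action    # ignore the new choice this tick
        self.last_action = action
        return self.env.step(action)


class EchoEnv:
    """Toy stand-in: the observation is the action that was actually executed."""

    def reset(self, **kwargs):
        return 0, {}

    def step(self, action):
        return action, 0.0, False, False, {}


env = StickyActions(EchoEnv(), repeat_prob=0.25, seed=0)
env.reset()
# Request a different action every step, so any mismatch means a repeat fired.
executed = [env.step(t)[0] for t in range(10_000)]
repeat_fraction = sum(executed[t] != t for t in range(10_000)) / 10_000
print(repeat_fraction)  # roughly 0.25
```

Because the agent can no longer be sure its chosen action is the one executed, an open-loop action sequence replayed from memory stops working.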
Continuous control and robotics — MuJoCo, DMC, Isaac Lab, CARLA
When actions are continuous torques rather than buttons, you need a physics engine. MuJoCo locomotion tasks (Ant, HalfCheetah, Humanoid) are the standard continuous-control benchmark; the DeepMind Control Suite wraps MuJoCo with uniform structure and interpretable, bounded rewards. For real robotics, Gymnasium-Robotics adds manipulation, NVIDIA Isaac Lab (successor to Isaac Gym) runs thousands of robots on a single GPU, and CARLA simulates autonomous driving. Browse vendors and real robot-sim offerings in the directory of RL environment companies.
Procedural generation and generalization — Procgen, MiniGrid, NetHack, Crafter
A benchmark with fixed levels rewards memorization. Procedurally generated suites fix this by drawing a fresh level every episode, so the only way to score is to generalize. Procgen offers 16 such games; MiniGrid/BabyAI give language-conditioned gridworlds; the NetHack Learning Environment is brutally deep; Crafter packs Minecraft-style open-ended achievements into a fast 2D world.
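The core mechanism, a level that is a deterministic function of a seed, can be sketched in a few lines. This toy generator is an assumption for illustration and is unrelated to Procgen's actual level code; it does not check that the goal is reachable, and the seed ranges are arbitrary.

```python
import random


def generate_level(seed, size=8, wall_prob=0.2):
    """Build a size x size grid deterministically from `seed` ('#' wall, '.' floor)."""
    rng = random.Random(seed)
    grid = [["#" if rng.random() < wall_prob else "." for _ in range(size)]
            for _ in range(size)]
    grid[0][0] = "S"                 # start cell is always free
    grid[size - 1][size - 1] = "G"   # goal cell is always free
    return ["".join(row) for row in grid]


train_seeds = range(0, 200)        # levels the agent may train on
test_seeds = range(1000, 1100)     # held-out levels for measuring generalization

level = generate_level(7)
print("\n".join(level))
```

Reporting the score on `test_seeds` rather than `train_seeds` is what turns the suite into a generalization test.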
Multi-task and meta-RL — Meta-World, XLand
To test transfer, you need many related tasks. Meta-World bundles 50 robotic-manipulation tasks for multi-task and meta-RL; DeepMind’s XLand procedurally generates vast spaces of games to train agents that adapt to unseen tasks at test time.
Multi-agent environments (MARL)
When several agents share a world, you need a multi-agent API. PettingZoo is the multi-agent counterpart to Gymnasium; SMAC / SMACv2 (StarCraft micro-battles) is the cooperative MARL standard; DeepMind’s Melting Pot specifically probes cooperation and competition with unfamiliar co-players. See multi-agent RL.
Offline RL datasets — D4RL, RL Unplugged
Sometimes you can’t interact at all — you only have a log of past behavior (think medical or driving data). Offline RL learns a policy from a fixed dataset, no environment calls. D4RL is the standard suite (with distribution shift, sparse rewards, and “trajectory stitching” challenges), and DeepMind’s RL Unplugged is its large-scale cousin. The Farama Foundation is migrating D4RL into Minari, its maintained offline-dataset API.
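As a minimal sketch of learning from a log with no environment calls, the following runs repeated Q-learning backups over a tiny, made-up dataset. The three-state chain and the function name `fit_q` are hypothetical; practical offline-RL methods add safeguards against actions the log never tried, which this toy omits.

```python
from collections import defaultdict

# Logged transitions: (state, action, reward, next_state, terminated).
# A hypothetical 3-state chain; no environment is ever called.
dataset = [
    (0, "right", 0.0, 1, False),
    (1, "right", 1.0, 2, True),
    (1, "left", 0.0, 0, False),
    (0, "left", 0.0, 0, False),
]


def fit_q(dataset, gamma=0.9, sweeps=50):
    """Repeated Q-learning backups over a fixed log of deterministic transitions."""
    q = defaultdict(float)
    actions = sorted({a for _, a, _, _, _ in dataset})
    for _ in range(sweeps):
        for s, a, r, s_next, terminated in dataset:
            best_next = 0.0 if terminated else max(q[(s_next, b)] for b in actions)
            q[(s, a)] = r + gamma * best_next   # overwrite is fine for deterministic data
    return q


q = fit_q(dataset)
print(q[(0, "right")], q[(1, "right")])  # 0.9 1.0
```

The `max` over next actions is where distribution shift bites: for a state-action pair missing from the log, the estimate is whatever the default happens to be.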
LLM agent environments and RLVR — the 2025+ wave
The newest family trains language models as agents. Here the “environment” is a task with a verifier: a program that checks whether the model’s output is correct. This is RLVR — RL with verifiable rewards — and it powers the reasoning-model boom. Examples: SWE-bench (resolve real GitHub issues; the verifier is the repo’s test suite), WebArena / OSWorld (drive a browser or desktop). Tooling like Prime Intellect’s verifiers library and its Environments Hub standardize and crowdsource these gyms. See RL for reasoning and agentic RL.
Go deeper: the agent-gym formulation maps onto the classic MDP
LLM agent papers often write an environment as a tuple of tasks, a harness, a verifier, and state/config. It looks alien, but it’s the same MDP in new clothes:
- Tasks = the distribution of initial states (which prompt / repo / webpage you start on).
- Harness = the dynamics — how a tool call or browser click changes the world and produces the next observation.
- Verifier = the reward function (tests pass? answer correct?), only it’s programmatic instead of learned like an RLHF reward model.
- State / Config = the observation space and the protocol knobs (max turns, tool budget).
So a “verifiable agent gym” is just an MDP whose reward happens to be a unit-test runner. The classic observation/action/reward triple still holds.
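The mapping above can be sketched as a toy single-turn gym. Everything here is hypothetical (`VerifierEnv`, the task dictionary layout, the arithmetic task); the harness is trivial because the observation never changes, whereas a real agent gym would execute tool calls at that point.

```python
class VerifierEnv:
    """Toy 'agent gym': tasks plus a verifier behind reset()/step()."""

    def __init__(self, tasks, verifier, max_turns=1):
        self.tasks = tasks            # initial-state distribution
        self.verifier = verifier      # programmatic reward function
        self.max_turns = max_turns    # protocol knob
        self._next = 0

    def reset(self):
        self.task = self.tasks[self._next % len(self.tasks)]
        self._next += 1
        self.turn = 0
        return self.task["prompt"], {}

    def step(self, action):
        self.turn += 1
        correct = self.verifier(self.task, action)
        reward = 1.0 if correct else 0.0
        terminated = correct                                      # task solved
        truncated = not correct and self.turn >= self.max_turns   # turn budget spent
        return self.task["prompt"], reward, terminated, truncated, {}


tasks = [{"prompt": "2 + 3 = ?", "answer": "5"}]
env = VerifierEnv(tasks, verifier=lambda task, out: out.strip() == task["answer"])
obs, info = env.reset()
result = env.step(" 5 ")
print(result)  # ('2 + 3 = ?', 1.0, True, False, {})
```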
Going faster: GPU-accelerated and JAX-native environments
For years the bottleneck wasn’t the GPU training the policy; it was the CPU simulating the environment. If the simulator runs on CPU and the policy on GPU, much of the wall-clock time goes to stepping the simulator and copying data back and forth. Two ideas broke the wall:
Run hundreds of environment copies in parallel with optimized C++. EnvPool reaches ~1M Atari frames/sec (and ~3M on MuJoCo) on one machine. PufferLib standardizes and vectorizes messy environments for high throughput.
Put the simulator itself on the accelerator so it never leaves the device. Brax, Isaac Gym (100–1000× speedups), Gymnax, Jumanji, XLand-MiniGrid, Pgx and MuJoCo Playground run millions of steps/sec, end-to-end on GPU/TPU.
This isn’t a minor convenience — it changed which algorithms are practical. When you can generate billions of frames in hours, on-policy methods that were “too sample-hungry” become routine, and experiments that took a cluster now fit on one card.
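The idea shared by both approaches is to step many environment copies with one array operation instead of a Python loop. The sketch below uses NumPy on CPU purely to show the batching pattern; `BatchedWalk` is a hypothetical toy, it returns a reduced tuple rather than the full Gymnasium 5-tuple, and it does not reflect the EnvPool or Brax APIs.

```python
import numpy as np


class BatchedWalk:
    """N independent 1-D walks stepped together.

    Each copy holds an integer position; reaching `goal` terminates that copy,
    which is then reset in place so the batch never stalls.
    """

    def __init__(self, num_envs, goal=5):
        self.num_envs = num_envs
        self.goal = goal
        self.pos = np.zeros(num_envs, dtype=np.int64)

    def step(self, actions):
        # actions: array of 0 (left) or 1 (right), one per environment copy
        self.pos += 2 * actions - 1
        terminated = self.pos >= self.goal
        rewards = terminated.astype(np.float32)
        obs = self.pos.copy()
        self.pos[terminated] = 0      # auto-reset finished copies
        return obs, rewards, terminated


envs = BatchedWalk(num_envs=4096)
rng = np.random.default_rng(0)
for _ in range(100):
    obs, rewards, terminated = envs.step(rng.integers(0, 2, size=4096))
print(obs.shape, rewards.dtype)  # (4096,) float32
```

On an accelerator-native simulator the same pattern applies, except the arrays live on the device and the policy reads them without a host copy.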
The ecosystem map: how the pieces fit
The standard interface is what lets a training library consume any compatible environment without custom glue.
The Farama Foundation is the non-profit maintenance home for the core APIs — Gymnasium, PettingZoo, Minari and the ALE. On the other side, libraries like Stable-Baselines3 (batteries-included algorithms), CleanRL (single-file reference implementations) and Ray RLlib (distributed scale) all consume the same interface. Because of that contract, swapping CartPole for an Atari game or a custom robot is a one-line change.
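What "consuming the same interface" means in practice can be sketched with an evaluation helper that never looks inside the environment. `evaluate` and `CountdownEnv` are hypothetical names for this illustration; the helper relies only on reset() returning `(obs, info)` and step() returning the 5-tuple.

```python
def evaluate(env, policy, episodes=10):
    """Average undiscounted return of `policy` on any reset()/step() environment."""
    returns = []
    for _ in range(episodes):
        obs, info = env.reset()
        total, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)


class CountdownEnv:
    """Stand-in environment: pays +1 per step and truncates after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon

    def reset(self, seed=None, options=None):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, False, self.t >= self.horizon, {}


print(evaluate(CountdownEnv(horizon=5), policy=lambda obs: 0))  # 5.0
```

Passing a different environment object is the only change needed to evaluate the same policy function elsewhere, provided the observation and action spaces match what the policy expects.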
How to pick the right benchmark
Match the benchmark to the question you’re actually asking.
| Your research goal | Reach for | Why |
|---|---|---|
| Debug an implementation | CartPole, classic control | Solves in seconds; isolates code bugs |
| General competence from pixels | Atari-57 (sticky actions) | The canonical discrete-action standard |
| Continuous control / locomotion | MuJoCo, DeepMind Control Suite | Standardized, interpretable rewards |
| Generalization, not memorization | Procgen, MiniGrid, Crafter | Fresh procedural levels every episode |
| Learning from logs (no interaction) | D4RL / Minari, RL Unplugged | Static datasets, distribution shift |
| Several interacting agents | PettingZoo, SMACv2, Melting Pot | Multi-agent API and dynamics |
| Massive throughput on one GPU | Brax, Gymnax, Isaac Lab, EnvPool | Millions of steps/sec, end-to-end on device |
| Train an LLM agent | SWE-bench, WebArena, verifiers gyms | Verifiable, programmatic rewards (RLVR) |
Building your own custom Gymnasium environment
When no benchmark fits your problem, you wrap it in the interface yourself. The skeleton is small:
```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np


class GridWorld(gym.Env):
    """Agent starts at (0, 0) and must reach the opposite corner."""

    def __init__(self, size=5):
        self.size = size
        self.observation_space = spaces.Box(0, size - 1, shape=(2,), dtype=np.int64)
        self.action_space = spaces.Discrete(4)  # up/down/left/right

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        self.agent = np.array([0, 0], dtype=np.int64)
        self.goal = np.array([self.size - 1, self.size - 1], dtype=np.int64)
        return self.agent.copy(), {}

    def step(self, action):
        # Map the discrete action to a (row, col) offset.
        move = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}[int(action)]
        # Clip so the agent cannot leave the grid.
        self.agent = np.clip(self.agent + np.array(move), 0, self.size - 1)
        terminated = bool(np.array_equal(self.agent, self.goal))
        reward = 1.0 if terminated else -0.01  # small step cost
        # No time limit here, so truncated is always False.
        return self.agent.copy(), reward, terminated, False, {}
```
The code is the easy part. Reward design is where projects live or die. A few hard-won rules:
- Sparse rewards are honest but hard to learn from. A single +1 at the goal is unambiguous but gives the agent almost no gradient. Dense shaping (a small bonus for getting closer) speeds learning — but a careless shaping term invites reward hacking, where the agent maximizes your proxy without doing the task.
- Bound and normalize. Wildly scaled rewards destabilize value learning; keep them in a sane range.
- Make termination unambiguous. Decide clearly what is a real terminal state vs. a time-limit truncation.
- Seed everything so episodes are reproducible.
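One well-known way to add a "getting closer" bonus with less risk of changing what the agent should do is potential-based reward shaping, where the bonus is the discounted change in a potential function. This is a named technique brought in for illustration, not something the rules above prescribe; the Manhattan-distance potential and the function names are assumptions, applied here to a grid with a goal at (4, 4).

```python
def potential(agent, goal):
    """Negative Manhattan distance to the goal: higher means closer."""
    return -(abs(agent[0] - goal[0]) + abs(agent[1] - goal[1]))


def shaped_reward(reward, agent, next_agent, goal, gamma=0.99):
    """Add the potential-based term gamma * phi(s') - phi(s) to the task reward."""
    return reward + gamma * potential(next_agent, goal) - potential(agent, goal)


goal = (4, 4)
toward = shaped_reward(-0.01, (0, 0), (0, 1), goal)   # step toward the goal
away = shaped_reward(-0.01, (0, 1), (0, 0), goal)     # step away from the goal
print(toward > 0, away < 0)  # True True
```

Because the bonus depends only on the change in potential, wandering in a loop earns no shaping reward when gamma is 1 and a small net penalty when gamma is below 1 (the potential is negative here), which removes the most common exploit of a naive distance bonus.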
Pitfalls: reproducibility, seeds, and the sim-to-real gap
RL is notoriously hard to reproduce. Knowing the traps is part of doing it well.
- Stochasticity and seeds. RL variance across random seeds is large. A result from one seed is anecdote, not evidence — report the mean and spread over many seeds.
- Hyperparameter sensitivity. The same algorithm can look state-of-the-art or broken depending on learning rate and batch size. Tuning budget is part of the comparison.
- Normalized vs. raw scores. Atari numbers are typically human-normalized; raw scores across games span orders of magnitude and aren’t directly comparable.
- Sample efficiency vs. asymptotic performance. “Best final score” and “best score within N samples” rank algorithms differently. State which you mean.
- The sim-to-real gap. A policy that’s perfect in simulation can fail on a real robot because the sim’s physics, sensors and latency don’t match reality. Domain randomization narrows the gap but never closes it.
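Two of the points above, reporting spread over seeds and normalizing raw scores, can be sketched together. All numbers below are made up for illustration, including the random and human reference scores; the normalization is the usual rescaling in which random play maps to 0 and the human reference to 1.

```python
from statistics import mean, stdev


def human_normalized(score, random_score, human_score):
    """Rescale a raw game score so that random play is 0 and the human reference is 1."""
    return (score - random_score) / (human_score - random_score)


# Hypothetical final scores of one agent on one game, one entry per seed.
raw_scores = [310.0, 402.0, 288.0, 351.0, 399.0]
RANDOM_SCORE, HUMAN_SCORE = 50.0, 400.0   # made-up reference points

normalized = [human_normalized(s, RANDOM_SCORE, HUMAN_SCORE) for s in raw_scores]
print(f"{mean(normalized):.3f} +/- {stdev(normalized):.3f} over {len(normalized)} seeds")
```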
The MDP underneath
Strip away the engineering and every environment is a Markov decision process, the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$: the state space (what `observation` samples from).
- $\mathcal{A}$: the action space (what `step` accepts).
- $P$: the transition dynamics implemented inside `step`.
- $R$: the reward function, the number `step` returns.
- $\gamma$: the discount factor, how much future reward counts now.
The agent seeks a policy maximizing expected discounted return:
$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$$
The Markov property — the future depends only on the current state, not the full history — is what makes the problem tractable. When an environment’s observation doesn’t capture enough to be Markovian, you’re in a POMDP, and the agent compensates with memory. Everything from CartPole to a SWE-bench agent gym is a special case of this one equation. See what is reinforcement learning? for the full treatment.
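For a single recorded episode, the discounted return inside that expectation is straightforward to compute. The helper below is a small sketch with an illustrative name; it accumulates from the last reward backwards, which avoids computing powers of gamma explicitly.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one episode, accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With gamma set to 1 this reduces to the plain episode return that benchmarks usually report.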
Researcher takes
Misha Laskin reframes the classic ‘which environment do you train on?’ question for the LLM era, arguing the answer has collapsed to a single thing.
Thomas Scialom states the thesis behind Meta’s agent environment release: in the current phase of AI, the limiting factor has shifted from models to the environments and evals around them.
Frequently asked questions
What’s the difference between Gym and Gymnasium?
They’re the same project under new stewardship. OpenAI’s original Gym is no longer maintained; the Farama Foundation forked it into Gymnasium, the actively maintained standard. The most visible API change is that the old done flag was split into terminated (the task really ended) and truncated (a time limit hit). New code should use Gymnasium.
What does it mean to “solve” an environment?
It’s defined per environment by a threshold under a fixed protocol — e.g. CartPole-v1 is “solved” at an average return of 475 over 100 consecutive episodes. For suites like Atari, there’s no single solve bar; you report aggregate human-normalized scores instead. Always check the protocol behind any “solved” claim.
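A threshold of this kind can be checked with a trailing window over episode returns. The function name and the training curve below are hypothetical; the defaults follow the CartPole-v1 figures quoted above (475 over 100 consecutive episodes).

```python
from collections import deque


def first_solved_episode(returns, threshold=475.0, window=100):
    """Return the 1-based episode index at which the trailing-window mean first
    reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for i, ret in enumerate(returns, start=1):
        recent.append(ret)
        if len(recent) == window and sum(recent) / window >= threshold:
            return i
    return None


history = [200.0] * 50 + [500.0] * 120   # hypothetical training curve
print(first_solved_episode(history))      # 142
```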
Why are GPU/JAX-native environments such a big deal?
In a traditional setup the simulator runs on CPU and the policy on GPU, so data shuttles back and forth and the simulator becomes the bottleneck. JAX-native sims (Brax, Gymnax) keep everything on the accelerator, hitting millions of steps/sec. That doesn’t just save time — it makes sample-hungry on-policy algorithms practical and shrinks cluster-scale experiments onto a single GPU.
Are LLM agent gyms really “RL environments”?
Yes — they’re MDPs with a programmatic reward. The model takes actions (tool calls, code, browser clicks), the harness returns new observations, and a verifier scores the result (do the tests pass? is the answer correct?). The framing maps directly onto observation/action/reward. See RLVR and agentic RL.
Key papers and references
- Gymnasium: A Standard Interface for RL Environments — Towers et al., NeurIPS 2025 — the modern standard API.
- The Arcade Learning Environment — Bellemare et al., 2013 — Atari as a general-agent benchmark.
- Revisiting the ALE — Machado et al., 2018 — sticky actions and the standard protocol.
- DeepMind Control Suite — Tassa et al., 2018 — the continuous-control standard.
- Procgen — Cobbe et al., 2019 — generalization, not memorization.
- Meta-World — Yu et al., 2019 — multi-task & meta-RL.
- D4RL — Fu et al., 2020 — the offline-RL benchmark.
- PettingZoo — Terry et al., 2020 — the multi-agent API.
- Brax · Isaac Gym · EnvPool · Jumanji — the GPU/JAX acceleration wave.
- Open RL Benchmark — Huang et al., 2024 — reproducibility, tracked.
- The Landscape of Agentic RL for LLMs — 2025 survey of the agent-gym wave.
Related
What is reinforcement learning? · PPO · Agentic RL · RL for reasoning · RLVR · Reward models · RLHF