RL Environments & Benchmarks, Explained

Key takeaways

An RL environment is the world an agent acts in: it defines what the agent observes, what it can do, and how it's rewarded — the concrete form of a Markov decision process.
A benchmark is a standardized set of environments plus an agreed evaluation protocol; keeping environment, suite and protocol distinct is what lets you judge whether two results are comparable.
The standard interface is reset() and step() → (observation, reward, terminated, truncated, info); OpenAI Gym became Gymnasium, now maintained by the Farama Foundation.
The landscape spans classic control → Atari → robotics → procedural → multi-agent → offline datasets → GPU/JAX-native sims → the 2025 wave of LLM agent gyms with verifiable rewards.

What is an RL environment?

An RL environment is the world a reinforcement-learning agent lives in. It is the other half of the loop: the agent chooses actions, the environment responds with a new situation and a number that says how well things are going. Everything the agent ever learns comes from interacting with it.

Concretely, an environment defines three things and one rule:

Observations — what the agent can see of the world’s state at each step.
Actions — what the agent is allowed to do.
Reward — a scalar score telling the agent how good its last action was.
Dynamics + termination — how the world changes in response to actions, and when an episode ends.

That is exactly the structure of a Markov decision process (MDP). An “environment” is just an MDP you can actually run code against.

The agent–environment loop. The agent sends an action; the environment returns the next observation and a reward. Repeat until the episode terminates or is truncated.

The three things every environment defines (a CartPole example)

The classic “hello world” of RL is CartPole: balance a pole hinged on a cart by pushing the cart left or right. It makes the three pieces concrete.

Piece	CartPole	What it generalizes to
Observation space	4 numbers: cart position, cart velocity, pole angle, pole angular velocity	Anything the agent senses — pixels, joint angles, text, a board state
Action space	Discrete(2): push left or push right	Discrete (buttons) or continuous (torques, steering)
Reward	+1 for every timestep the pole stays up	The objective, distilled to a number per step
Termination	Pole falls past ±12° or cart leaves the track	A terminal state defined by the task
Truncation	Episode hits 500 steps	A cutoff outside the task (a time limit)

The split between terminated and truncated is subtle but important: terminated means the MDP genuinely ended (you won, you lost, the robot fell); truncated means an external limit stopped you (a step budget). Algorithms must treat them differently when bootstrapping value estimates — a truncated episode isn’t really “over,” so its final value should still be estimated, not zeroed.
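
The bootstrapping difference can be sketched in a few lines. This is a minimal illustration, assuming a one-step TD target; the function name `td_target` and the value estimate of 20.0 are made up for the example, not taken from any library.

```python
def td_target(reward, next_value, terminated, truncated, gamma=0.99):
    """One-step TD target that distinguishes termination from truncation.

    `truncated` is accepted only to make the point explicit: it does not
    change the target, because the episode was cut off, not finished.
    """
    if terminated:
        # True terminal state: no future reward exists, so do not bootstrap.
        return reward
    # Ongoing or truncated: the state still has value, so bootstrap from it.
    return reward + gamma * next_value


# Pole fell over: the target is just the last reward.
fell = td_target(1.0, next_value=20.0, terminated=True, truncated=False)
# Time limit hit while the pole was still up: keep the value estimate.
cut_off = td_target(1.0, next_value=20.0, terminated=False, truncated=True)
print(fell, cut_off)
```

Treating the second case like the first would teach the value function that surviving to the time limit is worth nothing, which is the opposite of what happened.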

Environment vs. benchmark vs. evaluation protocol

These three words get used loosely. Keeping them apart is half of reading RL papers correctly.

Environment

A single task you can run — CartPole, one Atari game, one MuJoCo robot. It exposes the reset/step interface and nothing more.

Benchmark suite

A curated collection of environments meant to be tackled together — Atari-57, the DeepMind Control Suite, Procgen’s 16 games — so a single number summarizes broad competence.

Evaluation protocol

The agreed rules of measurement: which seeds, how many episodes, sticky actions on or off, raw vs. normalized scores, sample budget. Without a fixed protocol the same suite yields incomparable numbers.

The protocol is the part newcomers underrate. Two papers can both “use Atari” and still not be comparable because one used sticky actions and 200M frames while the other used deterministic ROMs and 50M frames. A benchmark without a protocol is just a pile of environments.
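
One way to make that concrete is to write the protocol down as data and diff it. The sketch below is hypothetical: `EvalProtocol`, its fields, and the episode and seed counts are illustrative choices, and only the sticky-action and frame-budget values come from the example above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of the measurement rules behind a reported score."""
    suite: str
    sticky_action_prob: float   # 0.0 means a deterministic emulator
    frame_budget: int           # training frames allowed
    eval_episodes: int          # illustrative value below
    num_seeds: int              # illustrative value below
    normalized: bool            # human-normalized vs. raw scores


def protocol_diff(a, b):
    """Return the fields on which two protocols disagree."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}


paper_a = EvalProtocol("Atari-57", 0.25, 200_000_000, 100, 5, True)
paper_b = EvalProtocol("Atari-57", 0.0, 50_000_000, 100, 5, True)

mismatch = protocol_diff(paper_a, paper_b)
print(mismatch)
```

An empty diff is a reasonable precondition for putting two numbers in the same table.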

The standard interface: reset() and step()

Before 2016 every lab wrapped its own simulator its own way, and code didn’t transfer. OpenAI Gym fixed this by proposing one tiny interface that almost everything now speaks:

reset() → (observation, info)

Start a fresh episode. Returns the first observation and an info dict for diagnostics. The initial state is usually randomized (reproducibly, via the seed) so the agent learns a general policy rather than a response to one initial condition.

step(action) → (observation, reward, terminated, truncated, info)

Apply one action, advance the world one tick, and return the next observation, the reward, the two end-of-episode flags, and info. The agent calls this in a loop.

On terminated or truncated, call reset()

When either flag is True, the episode is over and you start again. A training run is millions of these step/reset cycles.

A minimal interaction loop is just a few lines:

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

for _ in range(1000):
    action = env.action_space.sample()    # random placeholder; swap in your agent's policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()           # start a new episode

env.close()

A tour of the major environment families

Below is the map almost no single page draws in full — from toy gridworlds all the way to LLM agent gyms. Plain-English first, references after.

Classic control and gridworlds — the “hello world” of RL

Tiny, fast, fully understood. CartPole, MountainCar, Acrobot and LunarLander are physics toys; FrozenLake and Taxi are discrete gridworlds you can solve with a lookup table. They exist to debug an algorithm in seconds, not to prove anything about scale. If your PPO implementation can’t solve CartPole, suspect a bug in your code before you blame the hyperparameters.

Atari and the Arcade Learning Environment — discrete actions from pixels

The Arcade Learning Environment (ALE) turned Atari 2600 games into the canonical test of general competence: one algorithm, ~57 games, raw pixels in, joystick out. It’s the benchmark behind DQN’s 2013–2015 breakthrough. The modern evaluation protocol comes from Machado et al. (2018), which added sticky actions (a 25% chance the previous action repeats) so agents can’t win by memorizing a fixed sequence. Scores are usually reported as human-normalized values aggregated across the suite.
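
Sticky actions are easy to picture as a wrapper. The sketch below is an illustration of the idea at the level of agent steps, not the ALE's own implementation; `StickyActions` and the toy `EchoEnv` are hypothetical names invented for the example.

```python
import random


class StickyActions:
    """Wraps any object with reset()/step(action) and repeats the previously
    executed action with probability `repeat_prob` (0.25 in the protocol above)."""

    def __init__(self, env, repeat_prob=0.25, seed=None):
        self.env = env
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)
        self.prev_action = None

    def reset(self, **kwargs):
        self.prev_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        # With probability repeat_prob, ignore the new action and replay the old one.
        if self.prev_action is not None and self.rng.random() < self.repeat_prob:
            action = self.prev_action
        self.prev_action = action
        return self.env.step(action)


class EchoEnv:
    """Toy stand-in: the observation is the action that was actually executed."""

    def reset(self, **kwargs):
        return None, {}

    def step(self, action):
        return action, 0.0, False, False, {}


env = StickyActions(EchoEnv(), repeat_prob=0.25, seed=0)
env.reset()
executed = [env.step(t)[0] for t in range(10_000)]
repeats = sum(1 for t, a in enumerate(executed) if a != t)
print(f"fraction of steps that replayed the previous action: {repeats / len(executed):.3f}")
```

Because the agent can no longer be sure its chosen action is the one executed, a memorized open-loop action sequence stops working.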

Continuous control and robotics — MuJoCo, DMC, Isaac Lab, CARLA

When actions are continuous torques rather than buttons, you need a physics engine. MuJoCo locomotion tasks (Ant, HalfCheetah, Humanoid) are the standard continuous-control benchmark; the DeepMind Control Suite wraps MuJoCo with uniform structure and interpretable, bounded rewards. For real robotics, Gymnasium-Robotics adds manipulation, NVIDIA Isaac Lab (successor to Isaac Gym) simulates thousands of robots in parallel on a single GPU, and CARLA simulates autonomous driving. Browse vendors and real robot-sim offerings in the directory of RL environment companies.

Procedural generation and generalization — Procgen, MiniGrid, NetHack, Crafter

A benchmark with fixed levels rewards memorization. Procedurally generated suites fix this by drawing a fresh level every episode, so the only way to score is to generalize. Procgen offers 16 such games; MiniGrid/BabyAI give language-conditioned gridworlds; the NetHack Learning Environment is brutally deep; Crafter packs Minecraft-style open-ended achievements into a fast 2D world.
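
The mechanism behind procedural suites is a level generator driven by a seed. The sketch below is a toy, assuming a hypothetical `make_level` generator and an arbitrary split into training and held-out seeds; it does not reproduce how any of the suites above build their levels.

```python
import random


def make_level(seed, size=8, wall_prob=0.2):
    """Hypothetical generator: build a grid of walls ('#') and floor ('.') from a seed."""
    rng = random.Random(seed)
    grid = [["#" if rng.random() < wall_prob else "." for _ in range(size)]
            for _ in range(size)]
    grid[0][0] = "."                  # start cell is always free
    grid[size - 1][size - 1] = "."    # goal cell is always free
    return tuple("".join(row) for row in grid)


train_seeds = range(0, 200)           # levels the agent may train on (arbitrary split)
test_seeds = range(10_000, 10_050)    # held-out levels, never seen in training

# The same seed always yields the same level, so runs are reproducible.
assert make_level(7) == make_level(7)

# Held-out seeds produce layouts that are absent from the training set.
train_levels = {make_level(s) for s in train_seeds}
unseen = [s for s in test_seeds if make_level(s) not in train_levels]
print(f"{len(unseen)} of {len(test_seeds)} test levels are unseen layouts")
```

Scoring on the held-out seeds measures generalization; scoring on the training seeds measures, at best, memorization.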

Multi-task and meta-RL — Meta-World, XLand

To test transfer, you need many related tasks. Meta-World bundles 50 robotic-manipulation tasks for multi-task and meta-RL; DeepMind’s XLand procedurally generates vast spaces of games to train agents that adapt to unseen tasks at test time.

Multi-agent environments (MARL)

When several agents share a world, you need a multi-agent API. PettingZoo is the multi-agent counterpart to Gymnasium; SMAC / SMACv2 (StarCraft micro-battles) is the cooperative MARL standard; DeepMind’s Melting Pot specifically probes cooperation and competition with unfamiliar co-players. See multi-agent RL.

Offline RL datasets — D4RL, RL Unplugged

Sometimes you can’t interact at all — you only have a log of past behavior (think medical or driving data). Offline RL learns a policy from a fixed dataset, no environment calls. D4RL is the standard suite (with distribution shift, sparse rewards, and “trajectory stitching” challenges), and DeepMind’s RL Unplugged is its large-scale cousin. The Farama Foundation is migrating D4RL into Minari, its maintained offline-dataset API.
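
To show what "no environment calls" means in practice, here is a minimal sketch of tabular Q-learning backups over a fixed log. The four-state corridor and its transitions are invented for the example, and the dataset deliberately covers every state-action pair, so the sketch sidesteps the distribution-shift problem that real offline RL methods must handle.

```python
# A logged dataset of (state, action, reward, next_state, terminated) tuples
# from a hypothetical corridor with states 0..3 and the goal at state 3.
# Actions: 0 = left, 1 = right. No environment object exists anywhere below.
dataset = [
    (0, 1, 0.0, 1, False),
    (1, 1, 0.0, 2, False),
    (2, 1, 1.0, 3, True),
    (2, 0, 0.0, 1, False),
    (1, 0, 0.0, 0, False),
    (0, 0, 0.0, 0, False),
]


def fit_q(dataset, n_states=4, n_actions=2, gamma=0.9, sweeps=50):
    """Repeated Q-learning backups over a fixed batch, with no new interaction."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s, a, r, s2, terminated in dataset:
            bootstrap = 0.0 if terminated else gamma * max(q[s2])
            q[s][a] = r + bootstrap   # deterministic toy problem, so step size 1
    return q


q = fit_q(dataset)
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(3)]
print(greedy)   # greedy action per non-terminal state
```

The learned policy moves right in every state, recovered purely from logged transitions.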

LLM agent environments and RLVR — the 2025+ wave

The newest family trains language models as agents. Here the “environment” is a task with a verifier: a program that checks whether the model’s output is correct. This is RLVR (RL with verifiable rewards), and it powers the reasoning-model boom. Examples: SWE-bench (resolve real GitHub issues; the verifier is the repo’s test suite) and WebArena / OSWorld (drive a browser or desktop). Tooling like Prime Intellect’s verifiers library and its Environments Hub standardizes and crowdsources these gyms. See RL for reasoning and agentic RL.

Go deeper: the agent-gym formulation maps onto the classic MDP

LLM agent papers often write an environment as $E = \{\text{Tasks}, \text{Harness}, \text{Verifier}, \text{State}, \text{Config}\}$. It looks alien, but it’s the same MDP in new clothes:

Tasks = the distribution of initial states (which prompt / repo / webpage you start on).
Harness = the dynamics — how a tool call or browser click changes the world and produces the next observation.
Verifier = the reward function (tests pass? answer correct?), only it’s programmatic instead of learned like an RLHF reward model.
State / Config = the observation space and the protocol knobs (max turns, tool budget).

So a “verifiable agent gym” is just an MDP whose reward happens to be a unit-test runner. The classic observation/action/reward triple still holds.
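
The mapping can be made concrete with a toy single-turn gym. Everything here is a hypothetical sketch: `CodeTaskEnv`, the task dictionary layout, and the use of a bare `exec` are illustrative simplifications, and a real harness would run submissions in a sandbox.

```python
class CodeTaskEnv:
    """Hypothetical agent gym: the action is a string of Python source code."""

    def __init__(self, tasks, max_turns=1):
        self.tasks = tasks              # Tasks: the distribution of initial states
        self.max_turns = max_turns      # Config: a protocol knob
        self.task = None
        self.turn = 0

    def reset(self, task_index=0):
        self.task = self.tasks[task_index]
        self.turn = 0
        return self.task["prompt"], {}  # first observation is the prompt

    def _verify(self, source):
        """Verifier: run the task's test cases against the submitted code."""
        namespace = {}
        try:
            exec(source, namespace)     # sketch only; real harnesses sandbox this
            fn = namespace[self.task["entry_point"]]
            return all(fn(*args) == expected for args, expected in self.task["tests"])
        except Exception:
            return False

    def step(self, action):
        self.turn += 1
        passed = self._verify(action)   # harness executes, verifier scores
        reward = 1.0 if passed else 0.0
        terminated = passed
        truncated = (not passed) and self.turn >= self.max_turns
        observation = "tests passed" if passed else "tests failed"
        return observation, reward, terminated, truncated, {}


tasks = [{
    "prompt": "Write add(a, b) returning the sum.",
    "entry_point": "add",
    "tests": [((1, 2), 3), ((-1, 1), 0)],
}]

env = CodeTaskEnv(tasks)
prompt, _ = env.reset()
good_reward = env.step("def add(a, b):\n    return a + b")[1]
env.reset()
bad_reward = env.step("def add(a, b):\n    return a - b")[1]
print(good_reward, bad_reward)
```

The return signature is the same five-tuple as CartPole's `step`; only the observation and action types have changed to text.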

Going faster: GPU-accelerated and JAX-native environments

For years the bottleneck was not the GPU training the policy but the CPU simulating the environment. If the simulator runs on CPU and the policy on GPU, much of each iteration goes to copying observations and actions between host and device. Two ideas broke the wall:

Vectorized CPU engines

Run hundreds of environment copies in parallel with optimized C++. EnvPool reaches ~1M Atari frames/sec (and ~3M on MuJoCo) on one machine. PufferLib standardizes and vectorizes messy environments for high throughput.

GPU/JAX-native simulators

Put the simulator itself on the accelerator so observations and actions never leave the device. Brax, Isaac Gym (100–1000× speedups), Gymnax, Jumanji, XLand-MiniGrid, Pgx and MuJoCo Playground run millions of steps/sec, end-to-end on GPU/TPU.

~1M Atari frames/sec: EnvPool on one machine
100–1000× speedup: Isaac Gym vs. CPU pipelines
Millions of steps/sec: JAX-native sims (Brax, Gymnax)

This isn’t a minor convenience — it changed which algorithms are practical. When you can generate billions of frames in hours, on-policy methods that were “too sample-hungry” become routine, and experiments that took a cluster now fit on one card.
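
The core trick in both approaches is to step many environment copies with one array operation instead of a Python loop per copy. The sketch below shows that pattern with NumPy on CPU; `BatchedRandomWalk` is a hypothetical toy, and accelerator-native simulators apply the same idea to device arrays rather than NumPy arrays.

```python
import numpy as np


class BatchedRandomWalk:
    """Hypothetical toy: N independent 1-D walkers stepped with one array operation.

    Each walker starts at 0, moves -1 or +1, and terminates on reaching +goal.
    """

    def __init__(self, num_envs, goal=5, max_steps=50):
        self.num_envs, self.goal, self.max_steps = num_envs, goal, max_steps

    def reset(self):
        self.pos = np.zeros(self.num_envs, dtype=np.int64)
        self.steps = np.zeros(self.num_envs, dtype=np.int64)
        return self.pos.copy()

    def step(self, actions):
        # actions: array of 0 (left) or 1 (right), one entry per environment copy.
        self.pos += 2 * actions - 1
        self.steps += 1
        terminated = self.pos >= self.goal
        truncated = (self.steps >= self.max_steps) & ~terminated
        reward = terminated.astype(np.float64)
        # Auto-reset finished copies so the batch never stalls on one episode.
        done = terminated | truncated
        self.pos[done] = 0
        self.steps[done] = 0
        return self.pos.copy(), reward, terminated, truncated, {}


env = BatchedRandomWalk(num_envs=4096)
obs = env.reset()
rng = np.random.default_rng(0)
total_reward = 0.0
for _ in range(100):
    actions = rng.integers(0, 2, size=env.num_envs)   # random policy placeholder
    obs, reward, terminated, truncated, info = env.step(actions)
    total_reward += reward.sum()
print(f"{env.num_envs * 100} transitions generated, {int(total_reward)} goals reached")
```

One hundred calls to `step` produce 409,600 transitions, and nothing in the loop depends on the number of copies.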

The ecosystem map: how the pieces fit

The standard interface is what lets a training library consume any compatible environment without custom glue.

The Gymnasium-compatible stack: any compatible environment plugs into any training library through one shared interface.

The Farama Foundation is the non-profit maintenance home for the core APIs: Gymnasium, PettingZoo, Minari and the ALE. On the other side, libraries like Stable-Baselines3 (batteries-included algorithms), CleanRL (single-file reference implementations) and Ray RLlib (distributed scale) all consume the same interface. Because of that contract, swapping CartPole for an Atari game or a custom robot is often a one-line change to the environment ID, provided the policy network can handle the new observation and action spaces.

How to pick the right benchmark

Match the benchmark to the question you’re actually asking.

Your research goal	Reach for	Why
Debug an implementation	CartPole, classic control	Solves in seconds; isolates code bugs
General competence from pixels	Atari-57 (sticky actions)	The canonical discrete-action standard
Continuous control / locomotion	MuJoCo, DeepMind Control Suite	Standardized, interpretable rewards
Generalization, not memorization	Procgen, MiniGrid, Crafter	Fresh procedural levels every episode
Learning from logs (no interaction)	D4RL / Minari, RL Unplugged	Static datasets, distribution shift
Several interacting agents	PettingZoo, SMACv2, Melting Pot	Multi-agent API and dynamics
Massive throughput on one GPU	Brax, Gymnax, Isaac Lab, EnvPool	Millions of steps/sec, end-to-end on device
Train an LLM agent	SWE-bench, WebArena, verifiers gyms	Verifiable, programmatic rewards (RLVR)

Building your own custom Gymnasium environment

When no benchmark fits your problem, you wrap it in the interface yourself. The skeleton is small:

import gymnasium as gym
from gymnasium import spaces
import numpy as np

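class GridWorld(gym.Env):
    """Agent starts at (0, 0) and must reach the opposite corner of the grid."""

    def __init__(self, size=5):
        self.size = size
        # Observation is the agent's (row, col) position; explicit dtype for portability.
        self.observation_space = spaces.Box(0, size - 1, shape=(2,), dtype=np.int64)
        self.action_space = spaces.Discrete(4)        # up/down/left/right

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)                      # seeds self.np_random
        self.agent = np.array([0, 0], dtype=np.int64)
        self.goal = np.array([self.size - 1, self.size - 1], dtype=np.int64)
        return self.agent.copy(), {}

    def step(self, action):
        move = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}[int(action)]
        # Clip so that moves into a wall leave the agent in place.
        self.agent = np.clip(self.agent + np.array(move), 0, self.size - 1)
        terminated = bool(np.array_equal(self.agent, self.goal))
        reward = 1.0 if terminated else -0.01        # small step cost
        # truncated is always False here; add a time limit with a TimeLimit wrapper.
        return self.agent.copy(), reward, terminated, False, {}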

The code is the easy part. Reward design is where projects live or die. A few hard-won rules:

Sparse rewards are honest but hard to learn from. A single +1 at the goal is unambiguous but gives the agent almost no gradient. Dense shaping (a small bonus for getting closer) speeds learning — but a careless shaping term invites reward hacking, where the agent maximizes your proxy without doing the task.
Bound and normalize. Wildly scaled rewards destabilize value learning; keep them in a sane range.
Make termination unambiguous. Decide clearly what is a real terminal state vs. a time-limit truncation.
Seed everything so episodes are reproducible.
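
One well-known way to add a "getting closer" bonus without changing which policy is optimal is potential-based shaping (Ng, Harada and Russell, 1999), which the text above does not name. The sketch below assumes a grid task with a goal cell; the `potential` function and the coordinates are illustrative choices.

```python
def potential(position, goal):
    """Hypothetical potential: negative Manhattan distance to the goal."""
    return -(abs(goal[0] - position[0]) + abs(goal[1] - position[1]))


def shaped_reward(env_reward, position, next_position, goal, terminated, gamma=0.99):
    """Add F = gamma * phi(s') - phi(s) to the environment's own reward."""
    # Terminal states get zero potential so the bonus cannot be farmed at the end.
    next_phi = 0.0 if terminated else potential(next_position, goal)
    return env_reward + gamma * next_phi - potential(position, goal)


goal = (4, 4)
toward = shaped_reward(-0.01, (0, 0), (0, 1), goal, terminated=False)
away = shaped_reward(-0.01, (0, 1), (0, 0), goal, terminated=False)
print(f"step toward goal: {toward:+.2f}, step away: {away:+.2f}")
```

A step toward the goal earns a positive bonus and the reverse step pays it back, so walking in circles gains nothing. An ad hoc "+0.1 whenever distance shrinks" term has no such guarantee and is a common source of reward hacking.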

Pitfalls: reproducibility, seeds, and the sim-to-real gap

RL is notoriously hard to reproduce. Knowing the traps is part of doing it well.

Stochasticity and seeds. RL variance across random seeds is large. A result from one seed is anecdote, not evidence — report the mean and spread over many seeds.
Hyperparameter sensitivity. The same algorithm can look state-of-the-art or broken depending on learning rate and batch size. Tuning budget is part of the comparison.
Normalized vs. raw scores. Atari numbers are typically human-normalized; raw scores across games span orders of magnitude and aren’t directly comparable.
Sample efficiency vs. asymptotic performance. “Best final score” and “best score within N samples” rank algorithms differently. State which you mean.
The sim-to-real gap. A policy that’s perfect in simulation can fail on a real robot because the sim’s physics, sensors and latency don’t match reality. Domain randomization narrows the gap but never closes it.
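
Two of these pitfalls, seed variance and score normalization, come down to a few lines of arithmetic. The sketch below uses made-up scores and baselines purely for illustration, and three seeds only to keep it short; the game names are placeholders.

```python
from statistics import mean, median, stdev


def human_normalized(agent, random_baseline, human_baseline):
    """Rescale a raw game score so 0.0 = random play and 1.0 = human reference."""
    return (agent - random_baseline) / (human_baseline - random_baseline)


# Made-up numbers for illustration only; not real benchmark results.
raw = {
    "game_a": {"seeds": [410.0, 380.0, 455.0], "random": 10.0, "human": 510.0},
    "game_b": {"seeds": [21000.0, 18500.0, 19750.0], "random": 250.0, "human": 30250.0},
}

per_game = {}
for game, d in raw.items():
    scores = [human_normalized(s, d["random"], d["human"]) for s in d["seeds"]]
    per_game[game] = (mean(scores), stdev(scores))   # report spread, not just the mean
    print(f"{game}: {per_game[game][0]:.3f} ± {per_game[game][1]:.3f} over {len(scores)} seeds")

# Aggregate across games only after normalizing; raw scores differ by orders of magnitude.
suite_median = median(m for m, _ in per_game.values())
print(f"suite median: {suite_median:.3f}")
```

Averaging the raw scores instead would let the second game dominate the aggregate simply because its numbers are larger.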

The MDP underneath

Strip away the engineering and every environment is a Markov decision process, the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

$\mathcal{S}$: the state space (which the observations expose, fully or partially).
$\mathcal{A}$: the action space (what step accepts).
$P(s' \mid s, a)$: the transition dynamics implemented inside step.
$R(s, a)$: the reward function, the number step returns.
$\gamma \in [0,1)$: the discount factor, how much future reward counts now.

The agent seeks a policy $\pi$ maximizing expected discounted return:

$$\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right]$$

The Markov property — the future depends only on the current state, not the full history — is what makes the problem tractable. When an environment’s observation doesn’t capture enough to be Markovian, you’re in a POMDP, and the agent compensates with memory. Everything from CartPole to a SWE-bench agent gym is a special case of this one equation. See what is reinforcement learning? for the full treatment.
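
For a single finished episode, the quantity inside the expectation is easy to compute. This is a minimal sketch; the function name and the three-step reward sequence are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, accumulated from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# CartPole-style rewards: +1 per step for three steps.
g = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
print(g)   # 1 + 0.9 + 0.81, about 2.71
```

The objective averages this value over the episodes a policy produces, and the agent searches for the policy that makes that average largest.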

A short history of RL environments

2013

Arcade Learning Environment

Bellemare et al. make Atari 2600 the benchmark for general-competency agents — the stage for DQN.

2016

OpenAI Gym

One tiny reset/step interface unifies the field; environments and algorithms become interchangeable.

2018

DMC Suite & ALE protocol

DeepMind Control standardizes continuous control; Machado et al. fix the Atari evaluation protocol with sticky actions.

2019–20

Generalization & offline

Procgen tests generalization; D4RL makes offline RL a first-class benchmark.

2021–23

The GPU/JAX turn

Brax, Isaac Gym, EnvPool, Gymnax and Jumanji push simulation onto accelerators — millions of steps/sec.

2022

Farama Foundation & Gymnasium

A non-profit takes over Gym as Gymnasium and maintains the core APIs (PettingZoo, Minari, ALE).

2024–25

LLM agent gyms & RLVR

Verifiable-reward environments (SWE-bench, WebArena), the verifiers library and Prime Intellect’s Environments Hub power the reasoning-model wave. Gymnasium’s reference paper lands at NeurIPS 2025.

Researcher takes

Misha Laskin reframes the classic ‘which environment do you train on?’ question for the LLM era, arguing the answer has collapsed to a single thing.

Thomas Scialom states the thesis behind Meta’s agent environment release: in the current phase of AI, the limiting factor has shifted from models to the environments and evals around them.

Frequently asked questions

What’s the difference between Gym and Gymnasium?

They’re the same project under new stewardship. OpenAI’s original Gym is no longer maintained; the Farama Foundation forked it into Gymnasium, the actively maintained standard. The most visible API change is that the old done flag was split into terminated (the task really ended) and truncated (a time limit hit). New code should use Gymnasium.

What does it mean to “solve” an environment?

It’s defined per environment by a threshold under a fixed protocol — e.g. CartPole-v1 is “solved” at an average return of 475 over 100 consecutive episodes. For suites like Atari, there’s no single solve bar; you report aggregate human-normalized scores instead. Always check the protocol behind any “solved” claim.
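
The CartPole criterion translates directly into a rolling-window check. The sketch below is one way to write it; the function name and the synthetic learning curve are assumptions made for the example.

```python
from collections import deque


def first_solved_episode(returns, threshold=475.0, window=100):
    """Return the 1-based episode index at which the mean of the last `window`
    returns first reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for i, ret in enumerate(returns, start=1):
        recent.append(ret)
        if len(recent) == window and sum(recent) / window >= threshold:
            return i
    return None


# Synthetic learning curve: 200 weak episodes, then 150 at the 500-step cap.
history = [50.0] * 200 + [500.0] * 150
solved_at = first_solved_episode(history)
print(solved_at)
```

The window matters: a single episode at the 500-step cap does not count, because the average is still dragged down by the weak episodes that came before it.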

Why are GPU/JAX-native environments such a big deal?

In a traditional setup the simulator runs on CPU and the policy on GPU, so data shuttles back and forth and the simulator becomes the bottleneck. JAX-native sims (Brax, Gymnax) keep everything on the accelerator, hitting millions of steps/sec. That doesn’t just save time — it makes sample-hungry on-policy algorithms practical and shrinks cluster-scale experiments onto a single GPU.

Are LLM agent gyms really “RL environments”?

Yes — they’re MDPs with a programmatic reward. The model takes actions (tool calls, code, browser clicks), the harness returns new observations, and a verifier scores the result (do the tests pass? is the answer correct?). The $E = \{\text{Tasks}, \text{Harness}, \text{Verifier}, \text{State}, \text{Config}\}$ framing maps directly onto observation/action/reward. See RLVR and agentic RL.

Key papers and references

Gymnasium: A Standard Interface for RL Environments — Towers et al., NeurIPS 2025 — the modern standard API.
The Arcade Learning Environment — Bellemare et al., 2013 — Atari as a general-agent benchmark.
Revisiting the ALE — Machado et al., 2018 — sticky actions and the standard protocol.
DeepMind Control Suite — Tassa et al., 2018 — the continuous-control standard.
Procgen — Cobbe et al., 2019 — generalization, not memorization.
Meta-World — Yu et al., 2019 — multi-task & meta-RL.
D4RL — Fu et al., 2020 — the offline-RL benchmark.
PettingZoo — Terry et al., 2020 — the multi-agent API.
Brax · Isaac Gym · EnvPool · Jumanji — the GPU/JAX acceleration wave.
Open RL Benchmark — Huang et al., 2024 — reproducibility, tracked.
The Landscape of Agentic RL for LLMs — 2025 survey of the agent-gym wave.
