reinforcement-learning.com

RL Libraries & Frameworks

A practical 2026 map of reinforcement learning libraries: Gymnasium, Stable-Baselines3, CleanRL, RLlib, TorchRL, PufferLib, the JAX stack (Brax, PureJaxRL), and the LLM stacks TRL, verl and OpenRLHF.

Updated 2026-06-07 · 15 min read
Key takeaways
  • The RL stack has two layers: an environment API (Gymnasium / PettingZoo) and an algorithm library that trains against it.
  • Pick by intent: Stable-Baselines3 for getting something working, CleanRL to read and modify every line, RLlib for scale, TorchRL for composable research, and JAX (Brax, PureJaxRL) for raw throughput, with PufferLib as a non-JAX route to fast rollouts.
  • Post-training LLMs is a separate world: TRL, verl and OpenRLHF wrap SFT, reward modeling, PPO, DPO and GRPO around huge models and inference engines like vLLM.
  • Reproducibility is the hard part — identical algorithm names across libraries do not give identical results, so the choice of implementation matters as much as the algorithm.

The two layers of any RL stack

Almost every reinforcement learning project is built from two interlocking pieces, and confusing them is the most common beginner mistake. The environment defines the problem — states, actions, rewards, episode boundaries — behind a standard interface. The algorithm library is the learner that repeatedly queries that environment and improves a policy. A library like Stable-Baselines3 trains against an environment like Gymnasium’s CartPole-v1; it does not contain the environment itself.

This separation is what makes the ecosystem composable. Swap the environment and the same PPO code learns a different task; swap the library and the same task gets a different learner. The glue that holds it together is a shared API contract, and that contract is overwhelmingly Gymnasium.

[Figure: Environment (simulators, games, robots) ↔ Env API (Gymnasium / PettingZoo) ↔ Algorithm library (SB3, CleanRL, RLlib…). The library calls reset / step and sends an action; the environment returns obs and reward. Experiment tooling around the loop: logging (Weights and Biases, TensorBoard) · hyperparameter search · checkpoints · vectorized rollout.]
The RL stack: a standard environment API sits between the world (simulators, games, real systems) and the algorithm library that learns a policy. Experiment tooling wraps the whole loop.

The environment API: Gymnasium and friends

The de facto standard is Gymnasium, the maintained fork of OpenAI’s original Gym, now stewarded by the Farama Foundation. Its contract is small enough to memorize: env.reset() returns a starting observation plus an info dict, and env.step(action) returns (observation, reward, terminated, truncated, info). That five-tuple is the lingua franca of single-agent RL.

The Gymnasium loop: the entire contract in five lines of pseudocode

    obs, info = env.reset()
    for step in range(N):
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated: obs, info = env.reset()
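
To make the loop above concrete without installing anything, here is a minimal sketch of an environment that follows the same reset/step shape. `CountdownEnv`, `always_decrement` and `run` are hypothetical names for illustration; the class only mimics the Gymnasium contract in plain Python and does not subclass anything from Gymnasium.

```python
class CountdownEnv:
    """Toy environment (hypothetical): the observation is an integer counter.

    Action 1 decrements the counter, any other action leaves it unchanged.
    The episode terminates when the counter reaches 0 and is truncated
    once `max_steps` steps have elapsed without reaching 0.
    """

    def __init__(self, start=3, max_steps=10):
        self.start = start
        self.max_steps = max_steps
        self.counter = start
        self.elapsed = 0

    def reset(self):
        self.counter = self.start
        self.elapsed = 0
        return self.counter, {}  # (observation, info)

    def step(self, action):
        self.elapsed += 1
        if action == 1:
            self.counter -= 1
        terminated = self.counter == 0  # the task genuinely ended
        truncated = (not terminated) and self.elapsed >= self.max_steps  # time limit
        reward = 1.0 if terminated else -0.1
        return self.counter, reward, terminated, truncated, {}


def always_decrement(obs):
    """Trivial policy: always pick action 1."""
    return 1


def run(env, policy, n_steps):
    """Run the reset/step loop for n_steps; return how many episodes finished."""
    episodes = 0
    obs, info = env.reset()
    for _ in range(n_steps):
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            episodes += 1
            obs, info = env.reset()
    return episodes
```

With `start=3`, the always-decrement policy finishes one episode every three steps, so `run(CountdownEnv(start=3), always_decrement, 9)` counts three episodes. A policy that never decrements ends each episode by truncation instead.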

Around this core, Farama maintains a whole ecosystem so you rarely have to invent an environment from scratch: PettingZoo (multi-agent), Minigrid and MiniWoB++ (gridworlds and web tasks), Gymnasium-Robotics and Metaworld (manipulation), the Arcade Learning Environment (Atari), Minari (offline-RL datasets) and MO-Gymnasium (multi-objective). The split between terminated (the task genuinely ended) and truncated (a time limit cut it off) matters more than it looks: bootstrapping value targets correctly depends on telling those apart. See RL environments for the deeper tour.
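
Why the terminated/truncated split matters for value targets fits in a few lines. The sketch below is a generic one-step TD target, not any particular library's code; `td_target` and its parameter names are hypothetical.

```python
def td_target(reward, next_value, terminated, truncated, gamma=0.99):
    """One-step TD target: reward + gamma * V(next_obs).

    The bootstrap term is dropped only when the task genuinely ended.
    `truncated` is accepted purely to make the point explicit: a time-limit
    cut does NOT zero the bootstrap, because the state still had future value.
    """
    if terminated:
        return reward
    return reward + gamma * next_value
```

With a reward of 1.0 and a next-state value of 10.0, a terminated step yields a target of 1.0, while a truncated step with `gamma=0.5` yields 6.0. For a truncated step, `next_value` has to be computed from the final observation of the cut-off episode, not from the observation returned by the following reset.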

Classic deep-RL libraries: how to choose

For standard deep RL — control, games, robotics in simulation — four PyTorch libraries cover most needs, and they embody genuinely different philosophies. The right pick depends on whether you most value speed to a result, readability, scale, or modularity.

Stable-Baselines3 — reliable defaults

The “just make it work” choice. Clean model.learn() API, ~95% test coverage, algorithms benchmarked against reference codebases. PPO, SAC, TD3, DQN and more behind a uniform interface. Best for applied work, teaching, and strong baselines. See Stable-Baselines3.

CleanRL — read every line

Single-file implementations: ppo_atari.py is ~340 lines with all the tricks visible, nothing hidden behind abstraction. Not meant to be imported — meant to be read, forked and modified. Best for understanding and research prototyping. See CleanRL.

RLlib — production scale

Built on Ray for distributed and multi-agent training across cores and clusters. Heavier API, but the standard when one GPU is not enough and you need fault-tolerant, large-scale rollout collection.

TorchRL — composable primitives

Meta’s PyTorch-native toolkit built on the TensorDict primitive. Swappable actors, critics, replay buffers and world models that stay close to plain PyTorch. Best when you are building a new method, not running an existing one.

A useful way to internalize the tradeoff: Stable-Baselines3 hides the algorithm so you can use it; CleanRL exposes the algorithm so you can change it. Tianshou sits near TorchRL as a fast, modular PyTorch option; PufferLib is a different beast — not a learning library but a layer that makes messy environments (NetHack, Neural MMO) “play nice” with CleanRL and SB3, scaling rollouts to millions of steps per second.

  • ~340 LOC: CleanRL's full PPO+Atari, all tricks visible
  • 95%: Stable-Baselines3 automated test coverage
  • 5-tuple: the Gymnasium step() contract everyone shares

The JAX wave: speed by running everything on the accelerator

The biggest shift since 2023 is the rise of JAX for RL. The insight is simple but radical: if you write the environment itself in JAX — not just the neural network — the entire training loop can be JIT-compiled and run on the GPU/TPU, eliminating the constant CPU-to-GPU data shuttle that bottlenecks traditional pipelines, and vectorizing thousands of environments in parallel.

The headline result is PureJaxRL, which reported over 4000x end-to-end speedups versus a conventional GPU-policy/CPU-environment setup, with JaxMARL reporting still larger gains in the multi-agent case. Around this sits a fast-growing toolkit: Brax (differentiable physics), gymnax (JAX reimplementations of classic control and Atari-like tasks), and Jumanji (combinatorial and game environments).

Choosing a library: a quick comparison

| Library | Layer | Language | Best for | Watch out for |
| --- | --- | --- | --- | --- |
| Gymnasium | Env API | Python | The standard interface for single-agent RL | Just the API: bring your own learner |
| Stable-Baselines3 | Algorithms | PyTorch | Fast results, strong baselines, teaching | Less flexible for novel research |
| CleanRL | Algorithms | PyTorch / JAX | Reading, debugging, prototyping | No agent.learn() end-user API |
| RLlib | Algorithms | PyTorch / TF | Distributed and multi-agent at scale | Heavier API, steeper setup |
| TorchRL | Algorithms | PyTorch | Composable, primitive-first research | More assembly required |
| PureJaxRL / Brax | Both | JAX | Maximum throughput, vectorized envs | Env must be pure JAX; functional style |
| TRL / verl / OpenRLHF | LLM post-training | PyTorch | RLHF, DPO, GRPO on language models | Different problem from classic RL |

Go deeper: vectorized environments and why they matter

Modern RL throughput comes from running many environment copies in parallel. Gymnasium’s SyncVectorEnv / AsyncVectorEnv and SB3’s VecEnv step a batch of environments at once so the policy network gets a full batch per forward pass, which can be the difference between a GPU sitting near 5% utilization and one near 90%. JAX takes this to its conclusion: jax.vmap over the env function runs thousands of copies on-device, with no per-step Python overhead once the loop is JIT-compiled. When a paper reports “millions of steps per second,” vectorization is almost always the reason. The catch is correct handling of per-environment resets and the terminated vs truncated distinction across the batch.
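
The per-environment reset problem is easier to see in code. The following is a plain-Python sketch of lockstep batching, not the implementation of SyncVectorEnv or VecEnv; `TinyEnv` and `SyncBatch` are hypothetical names, and the toy env never truncates.

```python
class TinyEnv:
    """Hypothetical toy env: terminates once its step count reaches `length`."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.length
        return self.t, 1.0, terminated, False, {}


class SyncBatch:
    """Steps a list of envs in lockstep and resets each finished env on its own."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset()[0] for env in self.envs]

    def step(self, actions):
        obs_batch, rewards, terminateds, truncateds = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                # Only this copy starts a new episode; the others keep going.
                obs, _ = env.reset()
            obs_batch.append(obs)
            rewards.append(reward)
            # The flags are kept per environment so the learner can decide,
            # entry by entry, whether to bootstrap from the next value.
            terminateds.append(terminated)
            truncateds.append(truncated)
        return obs_batch, rewards, terminateds, truncateds
```

Stepping `SyncBatch([TinyEnv(1), TinyEnv(3)])` once returns observations `[0, 1]` and terminated flags `[True, False]`: the first copy has already been reset while the second is mid-episode. Note that this sketch discards the final observation of the finished episode. Real vectorized wrappers need to preserve it for correct bootstrapping on truncation, and how they expose it varies by library and version.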

A different world: RL for LLMs

Post-training a large language model with RL looks superficially like classic RL — there is a policy, a reward, a KL leash — but the engineering is a separate discipline. The “environment” is a prompt distribution, the “episode” is a generated completion, and the bottleneck is generating samples from a giant model fast enough to feed the optimizer. That is why this corner of the ecosystem has its own libraries that fuse training backends (FSDP, Megatron-LM) with high-throughput inference engines (vLLM, SGLang).
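
The "KL leash" mentioned above can be sketched as a reward adjustment. This is one common formulation, written in plain Python under stated assumptions: `kl_shaped_reward` and its parameters are hypothetical names, the KL term is a simple sample-based estimate, and the libraries discussed here differ in where and how they apply the penalty (per token or per sequence, and with different estimators).

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward minus beta times a sample-based KL estimate.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the SAME
    sampled completion under the policy being trained and under the frozen
    reference model. Their summed difference estimates how far the policy
    has drifted from the reference on this sample.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower reward improvement; a completion the policy likes far more than the reference does is penalized even if the reward model scores it well.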

TRL — Hugging Face

The most accessible entry point. Tightly integrated with the Transformers ecosystem; implements SFT, reward modeling, PPO, DPO and GRPO with a familiar trainer API. Best for getting a post-training experiment running on a few GPUs. See TRL.

verl — ByteDance + community

The HybridFlow framework: production-grade, scales to large models across FSDP/Megatron with vLLM/SGLang inference; supports PPO, GRPO, GSPO, RLOO and more. Used to train reasoning models at OpenAI o1-level math performance. Power and scale over simplicity. See verl.

OpenRLHF

Built specifically for RLHF with strong reward-model and distributed-training support, integrated with Hugging Face. A mature middle ground between TRL’s ease and verl’s scale. See OpenRLHF.

The recipe, not the library

All three implement the same conceptual menu — RLHF, DPO, GRPO, reward models, RLVR. Choosing among them is a scaling and infrastructure decision, not an algorithm one.

If your goal is aligning or reasoning-tuning a language model, start with RLHF and RL for reasoning for the methods, then pick the framework that matches your model size and cluster.

From zero to a trained policy

1. Define or pick the environment

Use an existing Gymnasium environment (CartPole-v1, an Atari game, a robotics task) or wrap your own problem behind the reset() / step() contract. Decide carefully when an episode terminated versus when it was merely truncated.

2. Pick the library that matches your intent

Need a result fast? Stable-Baselines3. Need to read and modify the algorithm? CleanRL. Need scale? RLlib or a JAX stack. Post-training an LLM? TRL, verl or OpenRLHF.

3. Vectorize and choose an algorithm

Wrap many env copies in a vectorized env so the GPU stays busy, then select an algorithm suited to your action space — PPO/SAC for continuous control, DQN/PPO for discrete. See PPO and actor-critic.

4. Instrument everything

Log to Weights and Biases or TensorBoard from step one. RL is famously noisy — without learning curves over multiple seeds you cannot tell a real improvement from luck.

5. Run multiple seeds and compare to a baseline

Report mean and variance across seeds, against a strong baseline on the same environment version. A single lucky run is the most common way RL results fail to replicate.
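
The last step of the walkthrough above can be sketched with the standard library alone. `summarize` and `compare` are hypothetical helpers, the inputs are assumed to be one final return per seed, and the "within noise" flag is a deliberately crude heuristic rather than a statistical test.

```python
import statistics


def summarize(returns):
    """Mean and sample standard deviation of final returns across seeds."""
    return statistics.mean(returns), statistics.stdev(returns)


def compare(candidate_returns, baseline_returns):
    """Compare per-seed returns of a candidate against a baseline."""
    cand_mean, cand_std = summarize(candidate_returns)
    base_mean, base_std = summarize(baseline_returns)
    gap = cand_mean - base_mean
    return {
        "candidate": (cand_mean, cand_std),
        "baseline": (base_mean, base_std),
        "mean_gap": gap,
        # Crude flag: a gap smaller than the larger seed-to-seed spread
        # is not convincing evidence of an improvement on its own.
        "gap_within_noise": abs(gap) < max(cand_std, base_std),
    }
```

For example, candidate returns of `[10.0, 12.0, 14.0]` against a baseline of `[9.0, 11.0, 13.0]` give a mean gap of 1.0 with a spread of 2.0 on both sides, so the gap is flagged as within noise. Both lists should come from the same environment version and the same number of training steps.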

The reproducibility trap

A hard lesson for anyone moving between libraries: “PPO” is not one thing. The same algorithm name hides dozens of implementation details (observation normalization, advantage estimation, value clipping, learning-rate schedules, weight initialization), and these “code-level optimizations” can matter more than the headline algorithm. A 2025 study, “On the Mistaken Assumption of Interchangeable Deep RL Implementations”, found that nominally identical algorithms from different libraries produce substantially different performance. In one of its benchmarks, Stable-Baselines3, CleanRL and OpenAI Baselines reached superhuman performance far more often than RLlib or Tianshou on the same tasks.
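
To see how a single detail changes what the learner is told, consider advantage estimation with and without normalization. The sketch below is a plain-Python version of generalized advantage estimation and is not taken from any of the libraries named above; `gae` and `normalize` are hypothetical helper names.

```python
import statistics


def gae(rewards, values, last_value, terminateds, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout (illustrative sketch).

    values[t] is the value estimate for the observation at step t, and
    last_value is the estimate for the observation after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # A terminated step cuts both the bootstrap and the running sum.
        not_done = 0.0 if terminateds[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def normalize(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]
```

With rewards `[1.0, 1.0]`, zero value estimates, termination on the second step and `gamma = lam = 1.0`, the raw advantages are `[2.0, 1.0]`. After normalization they are approximately `[1.0, -1.0]`, so the second action flips from being reinforced to being discouraged. Two implementations that differ only in this switch are both called PPO, and even the choice between population and sample standard deviation is a detail of the same kind.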

A short history of the tooling

  • 2016 · OpenAI Gym: The reset/step API becomes the universal contract and the Atari/MuJoCo benchmarks become RL’s common yardstick.
  • 2018 · RLlib & Dopamine: Distributed RL (RLlib on Ray) and clean research baselines (Google’s Dopamine) make scaled and reproducible RL practical.
  • 2021 · Stable-Baselines3 & CleanRL: SB3 ports reliable baselines to PyTorch; CleanRL popularizes single-file, fully-visible implementations.
  • 2022 · Farama Foundation forks Gym → Gymnasium: Maintained stewardship rescues the abandoned Gym and consolidates the environment ecosystem (PettingZoo, Minari, ALE).
  • 2023 · The JAX turn & TorchRL: PureJaxRL reports 4000x speedups by running envs on-device; Meta releases TorchRL on the TensorDict primitive.
  • 2024–26 · LLM-RL stacks mature: TRL, OpenRLHF and verl turn RLHF/GRPO/RLVR into production infrastructure, fusing training backends with vLLM/SGLang inference.

Researcher takes

The CleanRL author’s thread distilling ‘The 37 Implementation Details of PPO’ makes the practical argument behind RL’s reproducibility pain: the algorithm in the paper is the easy part, and what actually decides whether you can reproduce a result is a long tail of unwritten implementation details (the vectorized environment architecture, network init, normalization). Miss one and your curves diverge, which is why single-file, fully-spelled-out libraries exist.

Chris Lu’s unveiling of PureJaxRL crystallized the JAX shift — the realization that putting the environment on the accelerator, not just the policy, changes the economics of RL research.

Frequently asked questions

Should a beginner start with Stable-Baselines3 or CleanRL?

Start with Stable-Baselines3 if your goal is to get an agent learning quickly and to use RL as a tool — its model.learn() API hides the plumbing. Switch to CleanRL the moment you want to understand why PPO works or to modify the algorithm: its single-file implementations show every trick. They are complementary, not competitors.

Is Gymnasium a learning library?

No. Gymnasium defines the environment API and ships reference tasks (CartPole, Atari, MuJoCo), but it contains no learning algorithms. You pair it with a library like SB3, CleanRL or RLlib that actually trains the policy. Many newcomers conflate the two.

Do I need JAX for serious RL?

No — most applied RL runs fine on PyTorch with SB3 or CleanRL. JAX pays off when throughput is the bottleneck and you can express the environment in pure JAX, unlocking thousands of parallel on-device envs and order-of-magnitude speedups. If your simulator is an external C++ engine or a real robot, JAX’s advantage largely disappears.

Why can’t I just reuse my classic-RL library to fine-tune an LLM?

Because the bottleneck and the scale are completely different. LLM post-training has to generate samples from a multi-billion-parameter model efficiently, so libraries like TRL, verl and OpenRLHF fuse training backends (FSDP, Megatron) with fast inference engines (vLLM, SGLang) — machinery a classic control library like SB3 simply does not have. The algorithms (PPO, GRPO, DPO) overlap; the infrastructure does not.

Key references

RL environments · PPO · Actor-critic · Continuous control · Multi-agent RL · RLHF · GRPO · What is reinforcement learning?