- The RL stack has two layers: an environment API (Gymnasium / PettingZoo) and an algorithm library that trains against it.
- Pick by intent: Stable-Baselines3 for getting something working, CleanRL to read and modify every line, RLlib for scale, TorchRL for modular research, JAX (Brax, gymnax, PureJaxRL) for raw throughput.
- Post-training LLMs is a separate world: TRL, verl and OpenRLHF wrap SFT, reward modeling, PPO, DPO and GRPO around huge models and inference engines like vLLM.
- Reproducibility is the hard part — identical algorithm names across libraries do not give identical results, so the choice of implementation matters as much as the algorithm.
The two layers of any RL stack
Almost every reinforcement learning project is built from two interlocking pieces, and confusing them is the most common beginner mistake. The environment defines the problem — states, actions, rewards, episode boundaries — behind a standard interface. The algorithm library is the learner that repeatedly queries that environment and improves a policy. A library like Stable-Baselines3 trains against an environment like Gymnasium’s CartPole-v1; it does not contain the environment itself.
This separation is what makes the ecosystem composable. Swap the environment and the same PPO code learns a different task; swap the library and the same task gets a different learner. The glue that holds it together is a shared API contract, and that contract is overwhelmingly Gymnasium.
The environment API: Gymnasium and friends
The de facto standard is Gymnasium, the maintained fork of OpenAI’s original Gym, now stewarded by the Farama Foundation. Its contract is small enough to memorize: env.reset() returns (observation, info), and env.step(action) returns (observation, reward, terminated, truncated, info). That five-tuple is the lingua franca of single-agent RL.
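A minimal sketch of that contract, using a hypothetical toy task written in plain Python rather than a real Gymnasium environment. `CoinWalkEnv` and `run_episode` are illustrative names, not part of any library; only the shapes of the `reset()` and `step()` return values follow Gymnasium, where `reset()` returns an `(observation, info)` pair.

```python
class CoinWalkEnv:
    """Hypothetical toy task that mimics the Gymnasium call signatures.

    The agent starts at position 0 and moves left (action 0) or right
    (action 1). Reaching the goal ends the task; a step limit cuts it off.
    """

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self, seed=None):
        # `seed` is accepted for API parity; this toy task is deterministic.
        self.position = 0
        self.steps = 0
        return self.position, {}  # (observation, info)

    def step(self, action):
        self.position += 1 if action == 1 else -1
        self.steps += 1
        terminated = self.position >= self.goal  # the task genuinely ended
        truncated = not terminated and self.steps >= self.max_steps  # time limit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy, seed=0):
    """Roll out one episode; return (total_reward, steps, terminated, truncated)."""
    obs, info = env.reset(seed=seed)
    total, steps = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        steps += 1
    return total, steps, terminated, truncated
```

A policy that always moves right ends the episode through `terminated`; one that always moves left runs into the step limit and ends through `truncated`. Any learner written against these two methods can, in principle, be pointed at a different environment without changes.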
Around this core, Farama maintains a whole ecosystem so you rarely have to invent an environment from scratch: PettingZoo (multi-agent), Minigrid and MiniWoB++ (gridworlds and web tasks), Gymnasium-Robotics and Metaworld (manipulation), the Arcade Learning Environment (Atari), Minari (offline-RL datasets) and MO-Gymnasium (multi-objective). The split between terminated (the task genuinely ended) and truncated (a time limit cut it off) matters more than it looks: bootstrapping value targets correctly depends on telling those apart. See RL environments for the deeper tour.
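Why the terminated/truncated split affects value targets can be sketched with a one-step TD target. This is a simplified illustration, not any library's implementation; `td_target` and its parameters are hypothetical names.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target for a value function.

    `next_value` is the critic's estimate V(s') for the state reached
    after the step. Only a genuine termination zeroes the future return;
    a time-limit truncation is deliberately not an argument here, because
    the task would have continued and V(s') still carries information.
    """
    if terminated:
        return reward
    return reward + gamma * next_value


# Same transition, two different endings:
ended_for_real = td_target(reward=1.0, next_value=10.0, terminated=True)
cut_by_time_limit = td_target(reward=1.0, next_value=10.0, terminated=False)
```

Treating a truncated step as terminated would teach the critic that the state just before the time limit is worth only the immediate reward, which biases value estimates downward on tasks with long horizons.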
Classic deep-RL libraries: how to choose
For standard deep RL — control, games, robotics in simulation — four PyTorch libraries cover most needs, and they embody genuinely different philosophies. The right pick depends on whether you most value speed to a result, readability, scale, or modularity.
Stable-Baselines3: The “just make it work” choice. Clean model.learn() API, ~95% test coverage, algorithms benchmarked against reference codebases. PPO, SAC, TD3, DQN and more behind a uniform interface. Best for applied work, teaching, and strong baselines. See Stable-Baselines3.
CleanRL: Single-file implementations. ppo_atari.py is ~340 lines with all the tricks visible, nothing hidden behind abstraction. Not meant to be imported; meant to be read, forked and modified. Best for understanding and research prototyping. See CleanRL.
RLlib: Built on Ray for distributed and multi-agent training across cores and clusters. Heavier API, but the standard when one GPU is not enough and you need fault-tolerant, large-scale rollout collection.
TorchRL: Meta’s PyTorch-native toolkit built on the TensorDict primitive. Swappable actors, critics, replay buffers and world models that stay close to plain PyTorch. Best when you are building a new method, not running an existing one.
A useful way to internalize the tradeoff: Stable-Baselines3 hides the algorithm so you can use it; CleanRL exposes the algorithm so you can change it. Tianshou sits near TorchRL as a fast, modular PyTorch option; PufferLib is a different beast — not a learning library but a layer that makes messy environments (NetHack, Neural MMO) “play nice” with CleanRL and SB3, scaling rollouts to millions of steps per second.
The JAX wave: speed by running everything on the accelerator
The biggest shift since 2023 is the rise of JAX for RL. The insight is simple but radical: if you write the environment itself in JAX — not just the neural network — the entire training loop can be JIT-compiled and run on the GPU/TPU, eliminating the constant CPU-to-GPU data shuttle that bottlenecks traditional pipelines, and vectorizing thousands of environments in parallel.
The headline result is PureJaxRL, which reported speedups of over 4000x versus a conventional GPU-policy/CPU-environment setup when many agents are trained in parallel, with JaxMARL reporting even larger gains in the multi-agent case. Around this sits a fast-growing toolkit: Brax (differentiable physics), gymnax (JAX reimplementations of classic control and Atari-like tasks), Jumanji (combinatorial and game environments), and JaxMARL (multi-agent environments and baselines).
Choosing a library: a quick comparison
| Library | Layer | Language / framework | Best for | Watch out for |
|---|---|---|---|---|
| Gymnasium | Env API | Python | The standard interface for single-agent RL | Just the API — bring your own learner |
| Stable-Baselines3 | Algorithms | PyTorch | Fast results, strong baselines, teaching | Less flexible for novel research |
| CleanRL | Algorithms | PyTorch / JAX | Reading, debugging, prototyping | No agent.learn() end-user API |
| RLlib | Algorithms | PyTorch / TF | Distributed and multi-agent at scale | Heavier API, steeper setup |
| TorchRL | Algorithms | PyTorch | Composable, primitive-first research | More assembly required |
| PureJaxRL / Brax | Both | JAX | Maximum throughput, vectorized envs | Env must be pure JAX; functional style |
| TRL / verl / OpenRLHF | LLM post-training | PyTorch | RLHF, DPO, GRPO on language models | Different problem from classic RL |
Go deeper: vectorized environments and why they matter
Modern RL throughput comes from running many environment copies in parallel. Gymnasium’s SyncVectorEnv / AsyncVectorEnv and SB3’s VecEnv step a batch of environments at once so the policy network gets a full batch per forward pass, which can be the difference between a GPU at 5% utilization and one at 90%. JAX takes this to its conclusion: jax.vmap over the env function runs thousands of copies on-device, with the Python interpreter out of the inner loop once the step is JIT-compiled. When a paper reports “millions of steps per second,” vectorization is almost always the reason. The catch is correct handling of per-environment resets and the terminated vs truncated distinction across the batch.
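The per-environment reset problem can be sketched with a tiny synchronous vector wrapper. This is a plain-Python illustration under stated assumptions, not Gymnasium's or SB3's implementation: `CounterEnv` and `TinySyncVectorEnv` are hypothetical names, and the wrapper resets a finished copy in the same step that finished it.

```python
class CounterEnv:
    """Hypothetical toy env: the observation is the step count."""

    def __init__(self, length, time_limit=5):
        self.length = length          # task ends after this many steps
        self.time_limit = time_limit  # hard cap on episode length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.length
        truncated = not terminated and self.t >= self.time_limit
        return self.t, 1.0, terminated, truncated, {}


class TinySyncVectorEnv:
    """Steps a list of envs in lockstep and resets finished copies individually."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset()[0] for env in self.envs]

    def step(self, actions):
        obs, rewards, terminateds, truncateds = [], [], [], []
        for env, action in zip(self.envs, actions):
            o, r, term, trunc, _ = env.step(action)
            if term or trunc:
                # Only this copy restarts; the rest of the batch keeps going.
                # The flags still describe the episode that just ended.
                o, _ = env.reset()
            obs.append(o)
            rewards.append(r)
            terminateds.append(term)
            truncateds.append(trunc)
        return obs, rewards, terminateds, truncateds
```

Note that after an automatic reset the returned observation belongs to the new episode, so the final observation of the old one is lost unless the wrapper exposes it separately. Real vector environments differ in when the reset happens and where that final observation is stored, so it is worth checking the documentation of the version in use.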
A different world: RL for LLMs
Post-training a large language model with RL looks superficially like classic RL — there is a policy, a reward, a KL leash — but the engineering is a separate discipline. The “environment” is a prompt distribution, the “episode” is a generated completion, and the bottleneck is generating samples from a giant model fast enough to feed the optimizer. That is why this corner of the ecosystem has its own libraries that fuse training backends (FSDP, Megatron-LM) with high-throughput inference engines (vLLM, SGLang).
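The “KL leash” can be sketched as a penalty subtracted from the task reward. This shows one common formulation, assumed here for illustration; the libraries below differ in where they apply the penalty and which KL estimator they use. `kl_shaped_reward` and `beta` are hypothetical names.

```python
def kl_shaped_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward for one sampled completion, minus a KL penalty.

    `policy_logprobs` and `ref_logprobs` hold the per-token log-probabilities
    of the same sampled completion under the policy being trained and under
    the frozen reference model. Their summed difference is a single-sample
    estimate of the KL divergence between the two models on this prompt.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * log_ratio


# A completion the policy now finds more likely than the reference did
# pays a penalty proportional to how far it has drifted.
unchanged = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1)
```

Both sets of log-probabilities require a forward pass of a large model over every sampled completion, which is part of why sample generation and scoring dominate the cost of this kind of training.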
TRL: The most accessible entry point. Tightly integrated with the Transformers ecosystem; implements SFT, reward modeling, PPO, DPO and GRPO with a familiar trainer API. Best for getting a post-training experiment running on a few GPUs. See TRL.
verl: The HybridFlow framework: production-grade, scales to large models across FSDP/Megatron with vLLM/SGLang inference; supports PPO, GRPO, GSPO, RLOO and more. Reported to have trained reasoning models that reach OpenAI o1-level math performance. Power and scale over simplicity. See verl.
OpenRLHF: Built specifically for RLHF with strong reward-model and distributed-training support, integrated with Hugging Face. A mature middle ground between TRL’s ease and verl’s scale. See OpenRLHF.
All three implement the same conceptual menu — RLHF, DPO, GRPO, reward models, RLVR. Choosing among them is a scaling and infrastructure decision, not an algorithm one.
If your goal is aligning or reasoning-tuning a language model, start with RLHF and RL for reasoning for the methods, then pick the framework that matches your model size and cluster.
From zero to a trained policy
1. Use an existing Gymnasium environment (CartPole-v1, an Atari game, a robotics task) or wrap your own problem behind the reset() / step() contract. Decide carefully when an episode terminated versus when it was merely truncated.
2. Need a result fast? Stable-Baselines3. Need to read and modify the algorithm? CleanRL. Need scale? RLlib or a JAX stack. Post-training an LLM? TRL, verl or OpenRLHF.
3. Wrap many env copies in a vectorized env so the GPU stays busy, then select an algorithm suited to your action space: PPO/SAC for continuous control, DQN/PPO for discrete. See PPO and actor-critic.
4. Log to Weights and Biases or TensorBoard from step one. RL is famously noisy, and without learning curves over multiple seeds you cannot tell a real improvement from luck.
5. Report mean and variance across seeds, against a strong baseline on the same environment version. A single lucky run is the most common way RL results fail to replicate.
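Reporting across seeds can be sketched as follows. `noisy_training_run` is a hypothetical stand-in that draws a final score from a seeded random generator; in practice it would be a full training run with that seed. The scores it produces are synthetic and only illustrate the bookkeeping.

```python
import random
import statistics


def noisy_training_run(seed, true_score=100.0, noise=25.0):
    """Stand-in for one full training run; returns a final evaluation score."""
    rng = random.Random(seed)  # each seed gives a reproducible outcome
    return rng.gauss(true_score, noise)


def summarize(scores):
    """Mean and sample standard deviation over independent seeds."""
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # needs at least two seeds
        "min": min(scores),
        "max": max(scores),
    }


# Run the baseline and the candidate over several seeds each,
# then compare the summaries rather than any single run.
baseline = summarize([noisy_training_run(seed) for seed in range(5)])
candidate = summarize(
    [noisy_training_run(seed, true_score=105.0) for seed in range(100, 105)]
)
```

With run-to-run noise this large relative to the gap between the two methods, the best single run of either one says little; comparing means alongside their spread is what makes the difference (or its absence) visible.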
The reproducibility trap
A hard lesson for anyone moving between libraries: “PPO” is not one thing. The same algorithm name hides dozens of implementation details (observation normalization, advantage estimation, value clipping, learning-rate schedules, weight initialization), and these “code-level optimizations” can matter more than the headline algorithm. A 2025 study, “On the Mistaken Assumption of Interchangeable Deep RL Implementations”, found that nominally identical algorithms from different libraries produce substantially different performance, and even reported that in one benchmark Stable-Baselines3, CleanRL and OpenAI Baselines reached superhuman performance far more often than RLlib or Tianshou on the same tasks.
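Two of those details, advantage estimation and advantage normalization, can be sketched in a few lines. This is a simplified illustration of generalized advantage estimation (GAE), not the code of any particular library, and it assumes the rollout segment contains no mid-segment truncations. The function and parameter names are hypothetical.

```python
def gae_advantages(rewards, values, last_value, terminateds, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout segment.

    values[t] is the critic's V(s_t); last_value is V of the state reached
    after the final step; terminateds[t] is True if step t ended the task.
    """
    advantages = [0.0] * len(rewards)
    next_value, next_advantage = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if terminateds[t] else 1.0
        # One-step TD error; no bootstrapping past a real termination.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors, cut at terminations.
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


def normalize(xs, eps=1e-8):
    """Shift to zero mean and scale to unit variance (a common PPO detail)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / (var ** 0.5 + eps) for x in xs]
```

Each choice here is a place where implementations can diverge: the value of `lam`, whether normalization is per batch or per minibatch, and how truncated episodes are bootstrapped. None of these appear in the algorithm's name.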
A short history of the tooling
The reset/step API becomes the universal contract and the Atari/MuJoCo benchmarks become RL’s common yardstick.
Researcher takes
From the CleanRL author’s thread distilling ‘The 37 Implementation Details of PPO’ — the practical argument behind RL’s reproducibility pain: the algorithm in the paper is the easy part, and what actually decides whether you can reproduce a result is a long tail of unwritten implementation details (the vectorized environment architecture, network init, normalization). Miss one and your curves diverge, which is why single-file, fully-spelled-out libraries exist.
Chris Lu’s unveiling of PureJaxRL crystallized the JAX shift — the realization that putting the environment on the accelerator, not just the policy, changes the economics of RL research.
Frequently asked questions
Should a beginner start with Stable-Baselines3 or CleanRL?
Start with Stable-Baselines3 if your goal is to get an agent learning quickly and to use RL as a tool — its model.learn() API hides the plumbing. Switch to CleanRL the moment you want to understand why PPO works or to modify the algorithm: its single-file implementations show every trick. They are complementary, not competitors.
Is Gymnasium a learning library?
No. Gymnasium defines the environment API and ships reference tasks (CartPole, Atari, MuJoCo), but it contains no learning algorithms. You pair it with a library like SB3, CleanRL or RLlib that actually trains the policy. Many newcomers conflate the two.
Do I need JAX for serious RL?
No — most applied RL runs fine on PyTorch with SB3 or CleanRL. JAX pays off when throughput is the bottleneck and you can express the environment in pure JAX, unlocking thousands of parallel on-device envs and order-of-magnitude speedups. If your simulator is an external C++ engine or a real robot, JAX’s advantage largely disappears.
Why can’t I just reuse my classic-RL library to fine-tune an LLM?
Because the bottleneck and the scale are completely different. LLM post-training has to generate samples from a multi-billion-parameter model efficiently, so libraries like TRL, verl and OpenRLHF fuse training backends (FSDP, Megatron) with fast inference engines (vLLM, SGLang) — machinery a classic control library like SB3 simply does not have. The algorithms (PPO, GRPO, DPO) overlap; the infrastructure does not.
Key references
- Gymnasium: A Standardized Interface for RL Environments — Towers et al., 2024 — the maintained Gym standard.
- Stable-Baselines3: Reliable RL Implementations — Raffin et al., 2021 — benchmarked PyTorch baselines.
- CleanRL: High-quality Single-file Implementations — Huang et al., 2022 — the readable-RL philosophy.
- TorchRL: A data-driven decision-making library for PyTorch — Bou et al., 2023.
- PufferLib: Making RL Libraries and Environments Play Nice — Suarez, 2024.
- HybridFlow (verl): A Flexible and Efficient RLHF Framework — Sheng et al., 2024 — the LLM post-training stack.
- On the Mistaken Assumption of Interchangeable Deep RL Implementations — 2025 — why implementation choice matters.
Related
RL environments · PPO · Actor-critic · Continuous control · Multi-agent RL · RLHF · GRPO · What is reinforcement learning?