How machines learn by trial and error
The complete, visual guide to reinforcement learning — from “what even is a reward?” to how ChatGPT and o1 are trained. Tell us where you’re starting and we’ll build the path.
It’s just learning from consequences
An agent looks at the state of its world, takes an action, and gets a reward — a number saying how well that went. Repeat millions of times and it learns a policy: a way of acting that maximizes reward over the long run.
That’s the whole loop. No labeled “right answers,” just a goal and feedback. Everything here — from Q-learning to ChatGPT’s alignment — is a different answer to one question: how do you turn that reward into smart behavior?
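To make the loop concrete, here is a minimal Python sketch. The `Corridor` environment, its reward values and the two policies are hypothetical, invented for illustration. Nothing here learns yet; it only shows state, action and reward flowing around the loop.

```python
import random


class Corridor:
    """Hypothetical toy world: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, anything else moves left; walls clamp the move.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # Assumed reward scheme: +1 on reaching the goal, a small cost per step.
        reward = 1.0 if done else -0.1
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent acts
        state, reward, done = env.step(action)  # the world responds
        total += reward                         # feedback accumulates
        if done:
            break
    return total


rng = random.Random(0)
random_policy = lambda state: rng.choice([0, 1])  # no strategy at all
always_right = lambda state: 1                    # the best policy here

print("random policy:", run_episode(Corridor(), random_policy))
print("always right: ", run_episode(Corridor(), always_right))
```

A policy is just a function from state to action. Learning, in this picture, means replacing `random_policy` with something closer to `always_right` using only the rewards observed.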
A guided route through RL
Pick your level above and we’ll lay out the modules to read, in order.
- 1 What is RL?: An agent acts, collects rewards, and discovers a strategy that pays off over time. The entire field in a single feedback loop. 18 min
- 2 MDPs: States, actions, transitions, rewards and a discount factor make up the five-piece formalism that almost every RL algorithm is secretly solving. 15 min
- 3 Value functions: Estimate the long-run reward of a state or move and you can act greedily toward it. The Bellman equation, made useful. 15 min
- 4 Explore vs exploit: Exploit what already works, or explore for something better? Tip too far either way and the agent never really learns. 15 min
- 5 Q-learning: Bootstrap each action’s value off your own best guess of the next state. Simple, off-policy, and the seed of deep RL. 14 min
- 6 Policy gradients: Skip the value tables and push the probability of good actions up directly. REINFORCE and the family that grew from it. 15 min
- 7 PPO: A clipped objective keeps every update small and stable. Unglamorous, robust, and behind most aligned LLMs. 18 min
- 8 RLHF: Humans pick the better of two answers; a reward model learns their taste; the LLM is tuned to score well. The step that turned GPT-3 into ChatGPT. 16 min
- 9 RL for reasoning: Reward only the final right answer on hard problems, and long, self-correcting chains of thought emerge on their own. 18 min
- 1 Q-learning: Bootstrap each action’s value off your own best guess of the next state. Simple, off-policy, and the seed of deep RL. 14 min
- 2 DQN: DeepMind’s DQN learned to play Atari straight from pixels, the result that lit the fuse on the modern deep-RL era. 15 min
- 3 Policy gradients: Skip the value tables and push the probability of good actions up directly. REINFORCE and the family that grew from it. 15 min
- 4 Actor–critic: One network chooses actions, another judges them, fusing policy gradients with value estimates for steadier learning. 15 min
- 5 PPO: A clipped objective keeps every update small and stable. Unglamorous, robust, and behind most aligned LLMs. 18 min
- 6 Continuous control: Robots and physics don’t have “four buttons.” DDPG, TD3 and SAC handle continuous action spaces, the backbone of learned control. 15 min
- 7 RLHF: Humans pick the better of two answers; a reward model learns their taste; the LLM is tuned to score well. The step that turned GPT-3 into ChatGPT. 16 min
- 8 RLVR: Drop the human rater and grade answers with a verifier (did the code run? is the proof right?). A cheap signal that is harder to game, and the engine behind 2025’s reasoning models. 15 min
- 9 GRPO: DeepSeek’s trick: score a whole group of answers against each other instead of training a separate value network. Cheaper RL that scales to reasoning. 15 min
- 1 RLHF: Humans pick the better of two answers; a reward model learns their taste; the LLM is tuned to score well. The step that turned GPT-3 into ChatGPT. 16 min
- 2 RLVR: Drop the human rater and grade answers with a verifier (did the code run? is the proof right?). A cheap signal that is harder to game, and the engine behind 2025’s reasoning models. 15 min
- 3 GRPO: DeepSeek’s trick: score a whole group of answers against each other instead of training a separate value network. Cheaper RL that scales to reasoning. 15 min
- 4 RL for reasoning: Reward only the final right answer on hard problems, and long, self-correcting chains of thought emerge on their own. 18 min
- 5 Agentic RL: When a model plans, calls tools and works over dozens of steps, a single end-reward has to shape the whole trajectory. RL for real agents. 18 min
- 6 Offline RL: Train from data already collected, with no live environment. Vital when exploring for real is slow, costly or dangerous. 15 min
- 7 Multi-agent RL: Cooperation, competition, and a moving target: every agent’s learning reshapes everyone else’s world. Self-play and emergent strategy. 16 min
- 8 RL safety: Specification gaming, reward hacking and scalable oversight: making sure a powerful optimizer pursues what we actually intended. 17 min
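Two of the ideas in the paths above fit in a few lines each. The sketch below, in plain Python, shows a PPO-style clipped term for a single action and a GRPO-style group-relative advantage. The function names, the clip value of 0.2 and the use of the population standard deviation are assumptions made for illustration, not a reference implementation of either algorithm.

```python
from statistics import mean, pstdev


def clipped_objective(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one action.

    ratio is new_policy_prob / old_policy_prob for the action taken.
    Taking the minimum removes any extra gain from moving the ratio
    outside [1 - clip, 1 + clip], which keeps each update small.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each answer against its own group.

    rewards holds one scalar per sampled answer to the same prompt.
    The group mean plays the role a learned value network would play.
    Assumption: population standard deviation, with eps to avoid
    dividing by zero when every answer gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, graded 1 (correct) or 0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# A good action whose probability already rose 50%: the gain is capped.
print(clipped_objective(ratio=1.5, advantage=1.0))
```

In a real trainer these quantities are computed over batches of token log-probabilities and averaged into a loss; the scalar version here only shows the arithmetic.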
No path, no pressure. Dive into any module below, or wander the concept map.
Five modules, 40 guides
Each module is a self-contained area of RL. Pick one and explore the topics inside.
The Frontier
How modern AI is trained
Foundations
The core ideas, from scratch
Classic Algorithms
From Q-tables to policy gradients
Planning & Advanced
Models, search, and the hard problems
Tools & Applications
Putting RL to work
What the field is actually saying
Not press releases — real takes from the researchers shaping RL.
What just landed at NeurIPS 2025
The field moves monthly. A few recent results worth your time.
See how it all connects
Every topic and how it relates to the others — from the foundations out to the frontier.
- Actor–critic: One network chooses actions, another judges them — fusing policy gradients with value estimates for steadier learning.
- Agentic RL: When a model plans, calls tools and works over dozens of steps, a single end-reward has to shape the whole trajectory. RL for real agents.
- AlphaZero / MuZero: Marry tree search with a learned policy/value net — and, in MuZero, a learned model — to master Go, chess and Atari from scratch.
- Constitutional AI: Replace human preference labels with an AI critiquing answers against a written “constitution” — how Claude is aligned at scale.
- Continuous control: Robots and physics don’t have “four buttons.” DDPG, TD3 and SAC handle continuous action spaces — the backbone of learned control.
- Curiosity: When external reward is sparse, let curiosity drive exploration — prediction error, Random Network Distillation, and Go-Explore.
- Curriculum learning: Order training tasks from simple to hard (or auto-generate them) so the agent can learn what it could never tackle cold.
- DQN: DeepMind’s DQN learned to play Atari straight from pixels — the result that lit the fuse on the modern deep-RL era.
- Distributional RL: Don’t just predict the average return — model its full distribution. C51, QR-DQN, and why it tends to learn better.
- DPO: A bit of algebra turns preference pairs into one simple loss — no reward model, no PPO. “Your language model is secretly a reward model.”
- Explore vs exploit: Exploit what already works, or explore for something better? Tip too far either way and the agent never really learns.
- GRPO: DeepSeek’s trick: score a whole group of answers against each other instead of training a separate value network. Cheaper RL that scales to reasoning.
- Hierarchical RL: Break a long task into reusable sub-policies (options) so the agent plans at multiple timescales instead of one step at a time.
- Imitation & IRL: Copy an expert (behavioral cloning), or infer the reward behind their behavior (inverse RL) — how agents learn when rewards are hard to write.
- MDPs: States, actions, transitions, rewards and a discount factor make up the five-piece formalism that almost every RL algorithm is secretly solving.
- Model-based RL: Learn a model of the world and plan inside it. Often far more sample-efficient than pure trial and error.
- Monte Carlo: Wait for the episode to end, then average the actual returns. Simple and unbiased — the counterpoint to TD’s bootstrapping.
- Multi-agent RL: Cooperation, competition, and a moving target: every agent’s learning reshapes everyone else’s world. Self-play and emergent strategy.
- Multi-armed bandits: One state, many levers, unknown payoffs. Bandits isolate the explore–exploit dilemma — the gateway to all of reinforcement learning.
- Offline RL: Train from data already collected, with no live environment — vital when exploring for real is slow, costly or dangerous.
- On- vs off-policy: On-policy learns from the actions it currently takes; off-policy can learn from old or others’ data. The split shapes every algorithm.
- POMDPs: The agent gets partial observations, not the true state. Belief states, memory, and why the real world is almost always a POMDP.
- Policy gradients: Skip the value tables — just push the probability of good actions up. REINFORCE and the family that grew from it.
- PPO: A clipped objective keeps every update small and stable. Unglamorous, robust, and behind most aligned LLMs.
- Q-learning: Bootstrap each action’s value off your own best guess of the next state. Simple, off-policy, and the seed of deep RL.
- Reward models: Learn to score responses the way people would, and quietly become the thing your policy tries to game. Reward hacking lives here.
- Reward shaping: Add helpful intermediate rewards to speed learning — without, if done right, changing the optimal policy. And how it quietly backfires.
- RL environments: The simulators, games and tasks an agent learns in. Increasingly the bottleneck for frontier RL — and a whole industry of its own.
- RL for reasoning: Reward only the final right answer on hard problems, and long, self-correcting chains of thought emerge on their own.
- RL in robotics: Sim-to-real, dexterity and locomotion: how RL learns control policies for robots — and why the physical world makes it so hard.
- Libraries & tools: Gymnasium, Stable-Baselines3, CleanRL, RLlib, TRL and friends — what to reach for, and when, to actually run RL.
- RL safety: Specification gaming, reward hacking and scalable oversight — making sure a powerful optimizer pursues what we actually intended.
- RLHF: Humans pick the better of two answers; a reward model learns their taste; the LLM is tuned to score well. The step that turned GPT-3 into ChatGPT.
- RLVR: Drop the human rater and grade answers with a verifier (did the code run? is the proof right?). A cheap signal that is harder to game, and the engine behind 2025’s reasoning models.
- SARSA: Learn the value of the policy you’re actually following — SARSA updates from the action you really took, not the greedy one.
- TD learning: Update an estimate using your next estimate — bootstrapping. The idea at the heart of Q-learning, SARSA and most modern RL.
- Test-time compute: Spend more compute when answering — best-of-N, verifier-guided search, long reasoning — and why it can rival scaling training.
- Value functions: Estimate the long-run reward of a state or move and you can act greedily toward it. The Bellman equation, made useful.
- What is RL?: An agent acts, collects rewards, and discovers a strategy that pays off over time. The entire field in a single feedback loop.
- World models: Learn a compact model of the world and train inside the agent’s own imagination — Ha & Schmidhuber, the Dreamer line, and latent planning.
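Several entries above (TD learning, Q-learning, explore vs exploit) meet in one short update rule. What follows is a minimal tabular sketch in Python, assuming a hypothetical five-cell corridor with a reward of 1 at the last cell. The hyperparameters are arbitrary illustrative values, and real problems usually need a function approximator rather than a table.

```python
import random

N_STATES, GOAL = 5, 4   # cells 0..4, reward on reaching cell 4
ACTIONS = (0, 1)        # 0 = left, 1 = right


def step(state, action):
    """Deterministic corridor dynamics (hypothetical toy environment)."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 200:
            # Explore with probability epsilon, otherwise exploit,
            # breaking ties at random so early episodes still move.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # TD target: bootstrap off the best current guess for the next
            # state, whichever action is taken next (this is what makes
            # Q-learning off-policy).
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            steps += 1
    return q


q = train()
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print("greedy action per non-goal state:", greedy)
```

Swapping `max(q[nxt])` for the value of the action the agent actually takes next would turn this into SARSA, the on-policy counterpart listed above.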