reinforcement-learning.com
// THE ULTIMATE GUIDE TO REINFORCEMENT LEARNING · 2026

How machines learn by trial and error

The complete, visual guide to reinforcement learning — from “what even is a reward?” to how ChatGPT and o1 are trained. Tell us where you’re starting and we’ll build the path.

Where are you starting from?
This is RL, live. The dot learns to reach the ★ from nothing but rewards. Watch the blue “value” spread as it figures the maze out.
// REINFORCEMENT LEARNING IN ONE MINUTE

It’s just learning from consequences

An agent looks at the state of its world, takes an action, and gets a reward — a number saying how well that went. Repeat millions of times and it learns a policy: a way of acting that maximizes reward over the long run.

That’s the whole loop. No labeled “right answers,” just a goal and feedback. Everything here — from Q-learning to ChatGPT’s alignment — is a different answer to one question: how do you turn that reward into smart behavior?

State: what it sees
Action: what it does
Reward: how good that was
Policy: its strategy
Diagram: the Agent sends an action to the Environment; the Environment sends back a state + reward.
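The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the maze demo: it assumes a made-up six-cell corridor with the goal in the last cell, and it uses tabular Q-learning as one possible way of turning reward into a policy. The names `step`, `train`, `N_STATES` and the hyperparameter values are all assumptions chosen for the example.

```python
import random

N_STATES = 6            # hypothetical corridor: cells 0..5, goal in the last cell
GOAL = N_STATES - 1
ACTIONS = (-1, +1)      # step left or step right


def step(state, action):
    """Environment: apply the action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), GOAL)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0   # the only feedback: 1 on reaching the goal
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: q[state][action_index] estimates long-run reward."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)                      # explore
            else:
                best = max(q[state])                      # exploit, random tie-break
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q = train()
# The learned policy: the higher-valued action in every non-goal cell.
policy = [ACTIONS[max((0, 1), key=lambda a: q[s][a])] for s in range(GOAL)]
print(policy)                                  # +1 means "step right"
print([round(max(row), 2) for row in q])       # value fades with distance from the goal
```

In this sketch the value estimates grow backward from the goal one cell at a time, which is the same kind of "spread" the maze demo visualizes, here under the corridor assumptions stated above.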
// YOUR PATH

A guided route through RL

Pick your level above and we’ll lay out the modules to read, in order.

// EXPLORE BY MODULE

Five modules, 40 guides

Each module is a self-contained area of RL. Pick one and explore the topics inside.

The Frontier

How modern AI is trained

Teaching a model what “good” means
Humans pick the better of two answers; a reward model learns their taste; the LLM is tuned to score well. The step that turned GPT-3 into ChatGPT.
RLHF · 16 min

How o1 and R1 learned to think
Reward only the final right answer on hard problems, and long, self-correcting chains of thought emerge on their own.
RL for reasoning · 18 min

Rewards you can check
Drop the human rater and grade answers with a verifier (did the code run? is the proof right?). A free, hard-to-game signal, and the engine behind 2025’s reasoning models.
RLVR · 15 min

RL without a critic
DeepSeek’s trick: score a whole group of answers against each other instead of training a separate value network. Cheaper RL that scales to reasoning.
GRPO · 15 min

Alignment without the RL loop
A bit of algebra turns preference pairs into one simple loss: no reward model, no PPO. “Your language model is secretly a reward model.”
DPO · 17 min

A model of human taste
Learns to score responses the way people would, and quietly becomes the thing your policy tries to game. Reward hacking lives here.
Reward models · 15 min

Training agents that take action
When a model plans, calls tools and works over dozens of steps, a single end-reward has to shape the whole trajectory. RL for real agents.
Agentic RL · 18 min

Alignment from AI feedback
Replace human preference labels with an AI critiquing answers against a written “constitution”: how Claude is aligned at scale.
Constitutional AI · 15 min

Think longer at inference
Spend more compute when answering (best-of-N, verifier-guided search, long reasoning), and why it can rival scaling training.
Test-time compute · 15 min
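To make the “Rewards you can check” and “RL without a critic” ideas above concrete, here is a small sketch of how a verifier score and a group-relative advantage might be computed. It is an illustration under stated assumptions, not DeepSeek’s implementation: `verifier_reward` is a hypothetical exact-match checker, the sampled answers are made up, and the population standard deviation is used for normalization (real implementations differ on this detail).

```python
from statistics import mean, pstdev


def verifier_reward(answer, expected):
    """Hypothetical checkable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Score each answer against its own group: (reward - group mean) / group std.

    No separate value network is involved; the group average is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)          # assumption: population std; choices vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up samples for a single prompt whose correct answer is "42".
sampled_answers = ["42", "41", "42 ", "7"]
rewards = [verifier_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

print(rewards)                              # [1.0, 0.0, 1.0, 0.0]
print([round(a, 3) for a in advantages])    # [1.0, -1.0, 1.0, -1.0]
```

Answers that beat their group’s average get a positive advantage and are reinforced; the rest are pushed down. When every answer in a group scores the same, all advantages are zero and that prompt contributes no learning signal in this sketch.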
// STRAIGHT FROM THE PEOPLE BUILDING IT

What the field is actually saying

Not press releases — real takes from the researchers shaping RL.

// FRESH FROM THE LAB

What just landed at NeurIPS 2025

The field moves monthly. A few recent results worth your time.

// THE WHOLE PICTURE

See how it all connects

Every topic and how it relates to the others — from the foundations out to the frontier. Drag the nodes, hover to read what each one is, click to dive in.

Foundations · Core algorithms · Planning & model-based · Advanced topics · RL for LLMs & Agents · Environments & benchmarks · Tools & practice · Applications