// THE ULTIMATE GUIDE TO REINFORCEMENT LEARNING · 2026

How machines learn
by trial and error

The complete, visual guide to reinforcement learning — from “what even is a reward?” to how ChatGPT and o1 are trained. Tell us where you’re starting and we’ll build the path.

Where are you starting from?

This is RL, live. The dot learns to reach the ★ from nothing but rewards. Watch the blue “value” spread as it figures the maze out.

// REINFORCEMENT LEARNING IN ONE MINUTE

It’s just learning from consequences

An agent looks at the state of its world, takes an action, and gets a reward — a number saying how well that went. Repeat millions of times and it learns a policy: a way of acting that maximizes reward over the long run.

That’s the whole loop. No labeled “right answers,” just a goal and feedback. Everything here — from Q-learning to ChatGPT’s alignment — is a different answer to one question: how do you turn that reward into smart behavior?

State: what it sees
Action: what it does
Reward: how good that was
Policy: its strategy
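Here is that loop as a minimal sketch, not a reference implementation. It assumes a hypothetical toy world (a five-cell corridor with the goal in the last cell) and uses tabular Q-learning with epsilon-greedy exploration. The names `step`, `train` and `greedy_policy`, and every hyperparameter, are illustrative choices rather than anything this guide prescribes.

```python
import random

# Hypothetical toy world: a 1-D corridor of 5 cells; the goal is the last cell.
N_STATES = 5
ACTIONS = (-1, +1)  # step left, step right
GOAL = N_STATES - 1


def step(state, action):
    """Environment: apply the action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0  # reward arrives only at the goal
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: q[state][action_index] estimates long-run reward."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Policy: mostly greedy, sometimes random (epsilon-greedy).
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                best = max(q[state])
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


def greedy_policy(q):
    """Best action for each non-goal cell according to the learned values."""
    return [ACTIONS[row.index(max(row))] for row in q[:GOAL]]


q_table = train()
print(greedy_policy(q_table))
```

After training, the greedy policy should step right from every cell, and the learned values shrink by a factor of `gamma` per step away from the goal, which is the same "value spreading" idea in a much smaller world.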

// YOUR PATH

A guided route through RL

Pick your level above and we’ll lay out the modules to read, in order.

// EXPLORE BY MODULE

Five modules, 40 guides

Each module is a self-contained area of RL: The Frontier, Foundations, Classic Algorithms, Planning & Advanced, and Tools & Applications. Pick one and explore the topics inside.

The Frontier

How modern AI is trained

Teaching a model what “good” means
Humans pick the better of two answers; a reward model learns their taste; the LLM is tuned to score well. The step that turned GPT-3 into ChatGPT.
RLHF · 16 min

How o1 and R1 learned to think
Reward only the final right answer on hard problems, and long, self-correcting chains of thought emerge on their own.
RL for reasoning · 18 min

Rewards you can check
Drop the human rater and grade answers with a verifier (did the code run? is the proof right?). A free, hard-to-game signal, and the engine behind 2025’s reasoning models.
RLVR · 15 min

RL without a critic
DeepSeek’s trick: score a whole group of answers against each other instead of training a separate value network. Cheaper RL that scales to reasoning.
GRPO · 15 min

Alignment without the RL loop
A bit of algebra turns preference pairs into one simple loss: no reward model, no PPO. “Your language model is secretly a reward model.”
DPO · 17 min

A model of human taste
Learns to score responses the way people would, and quietly becomes the thing your policy tries to game. Reward hacking lives here.
Reward models · 15 min

Training agents that take action
When a model plans, calls tools and works over dozens of steps, a single end-reward has to shape the whole trajectory. RL for real agents.
Agentic RL · 18 min

Alignment from AI feedback
Replace human preference labels with an AI critiquing answers against a written “constitution”: how Claude is aligned at scale.
Constitutional AI · 15 min

Think longer at inference
Spend more compute when answering (best-of-N, verifier-guided search, long reasoning) and why it can rival scaling training.
Test-time compute · 15 min
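Three of these ideas fit in a few lines each, so here is a hedged sketch of the core arithmetic: the pairwise (Bradley-Terry) loss commonly used to fit a reward model, the DPO loss on a single preference pair, and GRPO-style group-relative advantages. The function names, the scalar inputs and the default `beta` and `eps` values are assumptions made for illustration. Real training code works on batched tensors with a deep learning library, and implementations differ in details such as how the group standard deviation is computed.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: low when the chosen answer outscores the rejected one."""
    return -log_sigmoid(score_chosen - score_rejected)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under a frozen reference model.
    """
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))


def grpo_advantages(rewards, eps=1e-8):
    """Score each answer against its own group: (reward - mean) / (std + eps).

    Assumption: population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled answers, graded 1 (correct) or 0 (wrong).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Note how the group mean plays the role a learned value network would otherwise play: correct answers get a positive advantage and wrong ones a negative advantage, with no critic involved.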

// STRAIGHT FROM THE PEOPLE BUILDING IT

What the field is actually saying

Not press releases — real takes from the researchers shaping RL.

View Yann LeCun's post on X →

View Andrej Karpathy's post on X →

View Noam Brown's post on X →

// FRESH FROM THE LAB

What just landed at NeurIPS 2025

The field moves monthly. A few recent results worth your time.

NeurIPS 2025 · Best Paper
Artificial Hivemind: the Homogeneity of Language Models
Aligning to aggregate human preference quietly collapses output diversity, and the reward models we rely on are miscalibrated to real human variation.

NeurIPS 2025 · Runner-Up
Does RL Really Incentivize Reasoning Beyond the Base Model?
A pass@k study arguing RLVR mostly sharpens reasoning paths the base model already had, rather than creating genuinely new ones.

Scale AI · 2025
Rubrics as Rewards: RL Beyond Verifiable Domains
How do you run RL when there’s no checkable answer? Use structured rubric checklists as interpretable, hard-to-hack reward signals.
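If pass@k is new to you: it is the chance that at least one of k sampled answers is correct. A common way to estimate it from n samples, c of which are correct, is 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator; the function name and the sample counts in the usage lines are hypothetical and are not taken from any of the papers above.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled answers, c of which were correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct answer.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 100 samples on one problem, 2 of them correct.
print(pass_at_k(100, 2, 1))   # chance a single sample is correct
print(pass_at_k(100, 2, 50))  # chance at least one of 50 samples is correct
```

A model that rarely gets a problem right can still score high at large k, which is why comparing curves across k is used to ask whether RL added new capabilities or made existing ones more likely to be sampled.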

// THE WHOLE PICTURE

See how it all connects

Every topic and how it relates to the others — from the foundations out to the frontier. Drag the nodes, hover to read what each one is, click to dive in.

Foundations · Core algorithms · Planning & model-based · Advanced topics · RL for LLMs & Agents · Environments & benchmarks · Tools & practice · Applications

Hover a topic to see what it is — click to open the full guide.
