What Is Reinforcement Learning? A Visual Guide

Key takeaways

Reinforcement learning (RL) is machine learning where an agent learns good decisions by trial and error — it acts, gets a reward, and adjusts to maximize long-term reward.
Unlike supervised learning, there are no labeled right answers: the agent must discover what works, balancing exploration (trying new things) against exploitation (using what works).
The core machinery — policy, reward, return, value functions — is formalized by the Markov Decision Process and solved by value-based, policy-gradient, or actor-critic algorithms.
RL powered AlphaGo and Atari-from-pixels, and since 2022 it has become central to LLMs via RLHF and RL with verifiable rewards (DeepSeek-R1).

What is reinforcement learning?

Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make good decisions by interacting with an environment through trial and error. At each step the agent observes a state, takes an action, and receives a reward — a number that says how good that action was. Over many attempts it learns a policy: a strategy that maps situations to actions so as to maximize cumulative reward over time, not just the next payoff.

The crucial difference from other ML: there are no labeled correct answers. Nobody tells the agent the right action — it must discover which actions pay off by trying them and observing the consequences, often delayed. That single property drives almost everything distinctive about RL, from the exploration problem to why it’s so hard to train.

The reinforcement learning loop: the agent observes a state, chooses an action via its policy, and the environment returns a reward and the next state. Repeat — and learn.

▶ Reinforcement Learning: Essential Concepts — StatQuest (plain-English intuition)

A simple analogy: learning by trial and error

Think about training a dog. You can’t hand the dog a labeled dataset of “correct behaviors.” Instead it tries something, and you give a treat (positive reward) or nothing (no reward). Over time the dog learns which actions in which situations earn treats. It even learns sequences: sit, then stay, then come — because the treat at the end reinforces the whole chain.

The same shape describes a child learning to ride a bike (wobble → fall → adjust → stay up) or a game player racking up points. Three ingredients recur:

Trial and error — you learn by doing, not by being told.
Delayed reward — the payoff often comes well after the action that caused it (this is the hard part: credit assignment).
A goal expressed as reward — “stay upright,” “win the game,” “earn the treat.”

RL is the mathematical formalization of exactly this learning process.

The RL loop in plain terms

Every RL system is built from the same handful of pieces, interacting in a loop.

The agent and the environment

The agent is the learner and decision-maker — the dog, the game-playing program, the robot controller. Everything outside the agent that it interacts with is the environment — the world, the game, the simulator. The agent acts on the environment; the environment responds with a new situation and a reward. The boundary is conceptual: for a robot, the “environment” includes its own arm dynamics, because the controller can’t change those directly.

States and observations

A state $s$ is a complete description of the situation at a moment in time. In practice the agent often sees only an observation, which may be a partial view of that state: a chess engine sees the whole board, but a robot sees only what its camera captures. When the observation fully captures the state, the problem is fully observable; otherwise it’s partially observable, which is harder.

Actions

An action $a$ is a choice the agent can make. The set of legal actions is the action space, which may be discrete (move left / right / jump) or continuous (apply 3.7 N·m of torque). Continuous, high-dimensional action spaces are much harder and shape which algorithms you can use.

Rewards and the goal of maximizing return

After each action the environment emits a scalar reward $r$ — the only learning signal RL gets. The agent’s goal is not to maximize the immediate reward but the return: the cumulative reward over the long run. This distinction is the heart of RL: a move that scores zero now (developing a chess piece) may be essential to a big reward later (checkmate).
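The loop of state, action, reward, and next state can be sketched in a few lines of Python. This is a minimal illustration, not a real library API: `CorridorEnv`, its `reset` and `step` methods, and the reward values are hypothetical names and numbers chosen for this example.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4 in a row, with the goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is fully observable here

    def step(self, action):
        """action is -1 (left) or +1 (right); returns (next_state, reward, done)."""
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # reward arrives only at the goal
        return self.position, reward, done


def random_policy(state, rng):
    """A policy that ignores the state and picks an action uniformly at random."""
    return rng.choice([-1, +1])


def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode and return the list of rewards the agent received."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)             # agent acts
        state, reward, done = env.step(action)  # environment responds
        rewards.append(reward)
        if done:
            break
    return rewards


rng = random.Random(0)
rewards = run_episode(CorridorEnv(), random_policy, rng)
print(f"episode length: {len(rewards)}, total reward: {sum(rewards)}")
```

Nothing here learns yet. The random policy reaches the goal only by luck, which is the starting point that every RL algorithm improves on.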

How RL differs from supervised and unsupervised learning

All three are machine learning, but they learn from fundamentally different signals.

Aspect	Supervised	Unsupervised	Reinforcement
Signal	Labeled examples (input → correct output)	Unlabeled data	A reward number, often delayed
Goal	Predict the label	Find structure / patterns	Choose actions to maximize return
Feedback	The right answer for each example	None	Evaluative (“how good”), not instructive (“what was right”)
Data	Fixed dataset	Fixed dataset	Generated by the agent’s own actions
Example	Image classification	Clustering customers	Game playing, robot control

Two differences do most of the work. First, RL feedback is evaluative, not instructive: a reward tells you how good an action was, never what the best action would have been. Second, the data is non-stationary and self-generated — as the policy changes, the distribution of states it visits changes too. That feedback loop is why RL is powerful and why it’s notoriously unstable.

Key concepts you’ll keep seeing

Policy (the agent’s strategy)

A policy $\pi$ is the agent’s behavior: a mapping from states to actions. It can be deterministic ($a = \pi(s)$) or stochastic ($\pi(a \mid s)$ gives a probability for each action). The whole point of RL is to find an optimal policy $\pi^*$: one that maximizes expected return from every state. In deep RL the policy is a neural network whose weights are what training adjusts.

Return, the discount factor, and why future rewards matter

The return $G_t$ is the discounted sum of future rewards from time $t$ :

$$G_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}$$

The discount factor $\gamma \in [0,1)$ controls how much future rewards count. With $\gamma = 0$ the agent is myopic (cares only about the next reward); near $\gamma = 1$ it’s far-sighted. Discounting also keeps the sum finite for never-ending tasks and reflects that distant rewards are less certain. A common default is $\gamma = 0.99$ .
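To make the effect of the discount factor concrete, here is a small sketch that computes the return for a finite list of rewards. The function name `discounted_return` and the example reward sequence are assumptions for illustration only.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum_k gamma**k * r_k by accumulating from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: three steps with no reward, then a reward of 1 at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.0))   # myopic agent: the late reward counts for nothing
print(discounted_return(rewards, gamma=0.99))  # far-sighted agent: roughly 0.99**3
```

Working backwards uses the identity $G_t = r_t + \gamma\, G_{t+1}$, which follows directly from the sum and avoids computing powers of $\gamma$ explicitly.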

Value functions (V and Q)

A value function estimates expected return — the long-run payoff of being in a situation. Two flavors:

State-value $V^\pi(s)$ — expected return starting from state $s$ and following policy $\pi$ .
Action-value $Q^\pi(s,a)$ — expected return after taking action $a$ in state $s$ , then following $\pi$ .

$Q$ is especially useful: if you know $Q$ , the best action is simply $\arg\max_a Q(s,a)$ . Many RL algorithms are, at heart, ways to estimate these values from experience.

Exploration vs. exploitation

To learn, the agent must sometimes try actions it’s unsure about (exploration) rather than always picking the current best (exploitation). Too much exploitation and it gets stuck in a mediocre habit; too much exploration and it never cashes in. The simplest balance is ε-greedy: with probability $\varepsilon$ pick a random action, otherwise pick the greedy (best-known) one — and decay $\varepsilon$ over time.
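A minimal sketch of ε-greedy selection over a list of action-value estimates follows. The linear decay schedule and its default values (`start`, `end`, `decay_steps`) are assumptions; other schedules, such as exponential decay, are equally common.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from start to end over decay_steps (hypothetical schedule)."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]  # hypothetical Q(s, a) estimates for a single state
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # 1, the index of the largest estimate
print(decayed_epsilon(0), decayed_epsilon(500), decayed_epsilon(5000))
```

With `epsilon=0.0` the choice is purely greedy, which is exactly the $\arg\max_a Q(s,a)$ rule; raising ε trades some of that payoff for information about the other actions.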

The math under the hood: Markov Decision Processes

The MDP tuple

The standard formal framework for RL is the Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ :

States 𝒮 and Actions 𝒜

The set of all situations the agent can be in, and all actions it can take.

Transition function P

$P(s' \mid s, a)$ — the probability of landing in state $s'$ after taking action $a$ in state $s$ . This encodes the environment’s dynamics.

Reward function R

$R(s, a)$ — the (expected) reward for taking action $a$ in state $s$ .

Discount factor γ

How much future reward is worth relative to immediate reward, as above.

The defining assumption is the Markov property: the future depends only on the current state, not the full history — the present state captures everything relevant. This is what makes the problem tractable.
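For a small problem, the five pieces of the tuple can be written down directly as plain data. The two-state MDP below (states `cool` and `hot`, actions `slow` and `fast`, and all probabilities and rewards) is entirely hypothetical and exists only to show the shape of each component.

```python
import random

STATES = ["cool", "hot"]   # the set S
ACTIONS = ["slow", "fast"]  # the set A
GAMMA = 0.9                 # discount factor

# P[(s, a)] maps each next state s' to P(s' | s, a).
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# R[(s, a)] is the expected reward for taking action a in state s.
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}


def sample_next_state(state, action, rng):
    """Draw s' from P(. | s, a). Only the current state and action are needed,
    which is the Markov property in code form."""
    next_states = list(P[(state, action)])
    weights = [P[(state, action)][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_next_state("cool", "fast", rng))
```

Note that `sample_next_state` takes no history argument: if the next state depended on earlier states as well, this signature would not be enough and the problem would no longer be an MDP over these states.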

The Bellman equation in one line

The value of a state is the immediate reward plus the discounted value of where you land next. That self-referential identity is the Bellman equation, the engine of almost every RL algorithm:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\big[\, R(s,a) + \gamma\, V^\pi(s')\,\big]$$
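Because the equation defines $V^\pi$ in terms of itself, one way to solve it is to apply the right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below does this on a hypothetical two-state MDP; the states, actions, numbers, and the always-`slow` policy are all assumptions made for the example.

```python
GAMMA = 0.9
STATES = ["cool", "hot"]

# Hypothetical dynamics P(s' | s, a) and expected rewards R(s, a).
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}

# pi(a | s): a policy that always chooses "slow".
POLICY = {
    "cool": {"slow": 1.0, "fast": 0.0},
    "hot": {"slow": 1.0, "fast": 0.0},
}


def evaluate_policy(policy, tolerance=1e-10):
    """Apply the Bellman expectation backup until V changes by less than tolerance."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {}
        for s in STATES:
            total = 0.0
            for a, action_prob in policy[s].items():
                expected_next = sum(p * values[s2] for s2, p in P[(s, a)].items())
                total += action_prob * (R[(s, a)] + GAMMA * expected_next)
            new_values[s] = total
        change = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if change < tolerance:
            return values


values = evaluate_policy(POLICY)
print({s: round(v, 3) for s, v in values.items()})
```

For this policy the `cool` state can be checked by hand: it always stays `cool` and earns 1, so $V = 1 + 0.9\,V$, giving $V = 10$. The iteration should land on about the same number.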

Go deeper: the Bellman optimality equation

For the optimal value function $V^*$, instead of averaging over the policy you take the best action:

$$V^*(s) = \max_{a}\; \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]$$
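When the transition and reward functions are known, this equation can be solved by repeatedly applying its right-hand side, a procedure usually called value iteration. The sketch below runs it on a hypothetical two-state MDP; every state, action, and number in it is an assumption chosen for illustration.

```python
GAMMA = 0.9
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]

# Hypothetical dynamics P(s' | s, a) and expected rewards R(s, a).
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}


def backup(state, action, values):
    """R(s, a) + gamma * sum over s' of P(s' | s, a) * V(s')."""
    expected_next = sum(p * values[s2] for s2, p in P[(state, action)].items())
    return R[(state, action)] + GAMMA * expected_next


def value_iteration(tolerance=1e-10):
    """Iterate the Bellman optimality backup, then read off a greedy policy."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {s: max(backup(s, a, values) for a in ACTIONS) for s in STATES}
        change = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if change < tolerance:
            break
    policy = {s: max(ACTIONS, key=lambda a: backup(s, a, values)) for s in STATES}
    return values, policy


values, policy = value_iteration()
print({s: round(v, 3) for s, v in values.items()}, policy)
```

In this toy problem the greedy policy goes `fast` while `cool` and `slow` while `hot`: it accepts the risk of overheating for the higher reward, but backs off rather than pay the large penalty.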

The action-value form is $Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\max_{a'} Q^*(s', a')$. Q-learning uses a sampled version of the right-hand side, $r + \gamma \max_{a'} Q(s', a')$, as its update target. Temporal-difference (TD) learning turns this into a practical rule: nudge your current estimate toward the observed reward plus the discounted estimate of the next state, bootstrapping from your own predictions rather than waiting for the full return.
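Here is a minimal tabular Q-learning sketch that applies this TD rule without ever looking at the transition probabilities. The corridor environment, the hyperparameters (`ALPHA`, `EPSILON`, the episode count), and the random tie-breaking are all assumptions made for the example.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = [-1, +1]  # index 0 moves left, index 1 moves right
GAMMA = 0.9
ALPHA = 0.5         # learning rate (hypothetical choice)
EPSILON = 0.2       # exploration rate (hypothetical choice)


def step(state, action):
    """Hypothetical dynamics: move, clip to the corridor, reward 1 at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q_row, rng):
    """Index of the best action, breaking ties at random."""
    best = max(q_row)
    return rng.choice([i for i, v in enumerate(q_row) if v == best])


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy_action(q[state], rng)
            next_state, reward, done = step(state, ACTIONS[a])
            # TD target: observed reward plus discounted best estimate at s'.
            target = reward if done else reward + GAMMA * max(q[next_state])
            # Nudge the current estimate a fraction ALPHA of the way to the target.
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


q = train()
print(["left" if row[0] > row[1] else "right" for row in q[:-1]])
```

After training, the greedy action should be `right` in every non-terminal cell, and the learned value of moving right should shrink by a factor of `GAMMA` for each extra step of distance from the goal.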

Families of RL algorithms

There are three big approaches, distinguished by what the agent learns.

A map of model-free RL algorithm families. Value-based methods learn what states/actions are worth; policy-based methods learn the behavior directly; actor-critic combines both.

Value-based methods (Q-learning, DQN)

These learn a value function (usually $Q$) and derive the policy by acting greedily with respect to it. Q-learning is the classic; DQN (Deep Q-Network) scaled it with a neural network to play Atari from pixels. Q-learning and DQN are off-policy: they can learn from old data via a replay buffer, which makes them comparatively sample-efficient. They struggle with continuous action spaces, where computing $\arg\max_a Q$ becomes an optimization problem in its own right.

Policy-gradient and actor-critic methods (REINFORCE, PPO, SAC)

Policy-based methods skip the value function and optimize the policy directly by gradient ascent on expected return — naturally handling continuous actions and stochastic policies. Plain REINFORCE is high-variance; actor-critic methods fix this by also learning a value function (the critic) to reduce variance while the actor updates the policy. PPO is the dominant modern choice — stable, simple, and the workhorse behind much of RLHF. SAC (Soft Actor-Critic) is a strong off-policy option for continuous control.
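The core policy-gradient idea fits in a short sketch: raise the probability of actions that did better than expected, lower it for those that did worse. The example below is the simplest version, a REINFORCE-style update on a two-armed bandit with a running-mean baseline standing in for a critic. It is not PPO or SAC, and the arm probabilities, learning rate, and step count are hypothetical.

```python
import math
import random


def softmax(preferences):
    """Turn action preferences into a probability distribution."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=2000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    win_probability = [0.2, 0.8]  # hypothetical arms; arm 1 pays off more often
    theta = [0.0, 0.0]            # policy parameters (one preference per action)
    baseline = 0.0                # running mean reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs, k=1)[0]
        reward = 1.0 if rng.random() < win_probability[action] else 0.0
        advantage = reward - baseline
        for a in range(len(theta)):
            # Gradient of log pi(action) with respect to theta[a] for a softmax policy.
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += learning_rate * advantage * grad_log_prob
        baseline += (reward - baseline) / t
    return softmax(theta)


probs = train_bandit()
print([round(p, 3) for p in probs])
```

Subtracting the baseline does not change the expected gradient, but it keeps a reward of 1 from always pushing the chosen action up, which is the variance problem that a learned critic addresses in full actor-critic methods.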

Model-free vs. model-based RL

A separate axis cuts across all of the above:

Model-free RL

The agent learns a policy or value function directly from experience, without modeling how the environment works. Simpler and more general — most famous results (DQN, PPO) are model-free — but sample-hungry, often needing millions of interactions.

Model-based RL

The agent first learns (or is given) a model of the environment’s dynamics, then plans against it. This can be far more sample-efficient (AlphaZero is given the game rules; MuZero and Dreamer learn a model), but learning an accurate model is hard, and model errors compound.

Deep reinforcement learning: adding neural networks

Classic RL stored values in a table — one entry per state. That collapses the moment states are images, sensor streams, or text: the table is astronomically large. Deep reinforcement learning replaces the table with a neural network as a function approximator that generalizes across similar states, so the agent can handle raw pixels or high-dimensional sensors it has never seen exactly before.

This is what unlocked the headline results — but it also introduced instability, because you’re now chasing a moving target with a function approximator on correlated, self-generated data. The 2015 DQN paper’s fixes — an experience replay buffer (to decorrelate data) and a target network (to stabilize the learning target) — became standard tricks of the trade.
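Both tricks are simple data-handling ideas, and a sketch makes that clear. The code below is not the DQN implementation: `ReplayBuffer` and `sync_target` are hypothetical names, and a plain dictionary stands in for network weights so the example runs without a deep learning library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest are evicted first."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng):
        """Sample uniformly without replacement, which breaks up the temporal
        correlation between consecutive transitions."""
        return rng.sample(list(self.transitions), batch_size)

    def __len__(self):
        return len(self.transitions)


def sync_target(online_weights, target_weights):
    """Hard update: copy the online parameters into the target parameters.
    Between syncs the target stays frozen, so the learning target moves slowly."""
    target_weights.clear()
    target_weights.update(online_weights)


rng = random.Random(0)
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, 0, 0.0, t + 1, False)  # hypothetical transitions
batch = buffer.sample(2, rng)
print(len(buffer), [transition[0] for transition in batch])

online = {"w": 1.0}
target = {"w": 0.0}
sync_target(online, target)
print(target)
```

In a real agent the sync would run every fixed number of training steps, with the online network updated on sampled batches in between.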

By the numbers: 49 Atari games DQN learned from raw pixels (2015); 4–1, AlphaGo's 2016 series win over Lee Sedol; 2024, Turing Award to Sutton & Barto for RL.

Milestones that put RL on the map

1980s–90s

Foundations: TD learning & Q-learning

Sutton and Barto develop temporal-difference learning; Watkins introduces Q-learning. The conceptual and algorithmic core of modern RL — work that earned the 2024 Turing Award (announced March 2025).

2013–2015

DQN plays Atari from pixels

DeepMind’s Deep Q-Network learns 49 Atari games from raw pixels alone, reaching performance comparable to a professional human tester on many of them and launching the deep RL era.

2016

AlphaGo beats Lee Sedol

RL plus tree search masters Go, long considered a grand challenge for AI; AlphaZero (2017) then learns chess, shogi and Go from self-play alone.

2019–2022

Robotics, StarCraft & real-world control

RL tackles dexterous manipulation, StarCraft II (AlphaStar), and systems like data-center cooling and chip placement.

2022

RLHF turns GPT-3.5 into ChatGPT

RL from human feedback becomes the alignment step behind modern assistants — RL’s biggest commercial impact yet.

2025

DeepSeek-R1 & the reasoning turn

Pure RL with simple verifiable rewards incentivizes chain-of-thought reasoning in LLMs (published in Nature, Sept 2025), making RL central to frontier reasoning models.

RL for large language models

The hottest entry point to RL today isn’t games — it’s large language models. Two RL-flavored techniques reshaped how LLMs are trained after pretraining:

RLHF (RL from Human Feedback) — learn a reward model from human preferences, then optimize the LLM (often with PPO) to produce answers humans prefer. This is what turned base models into helpful, safe assistants. Simpler descendants like DPO skip the explicit RL loop.
RLVR (RL with Verifiable Rewards) — the reward comes from a programmatic checker: did the unit tests pass? is the math answer correct? DeepSeek-R1 showed that pure RL with rule-based accuracy + format rewards can make models develop self-reflection and verification on their own — a landmark for RL for reasoning. Algorithms like GRPO make this loop cheap.

Frontier models increasingly use both: RLVR to sharpen reasoning on checkable tasks, RLHF to keep results helpful and safe on open-ended ones. See agentic RL for how this extends to tool-using agents.
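A verifiable reward is ordinary code, which is much of its appeal. The sketch below shows a rule-based reward that combines a format check with an exact-match accuracy check, followed by a group-relative normalization in the spirit of GRPO. The `<think>`/`<answer>` tag layout, the 0.5 and 1.0 weights, and the use of the population standard deviation are assumptions for illustration, not values taken from any paper.

```python
import re
import statistics


def verifiable_reward(completion, expected_answer):
    """Rule-based reward: a format check plus an exact-match accuracy check.

    The tag layout and the reward weights are hypothetical.
    """
    match = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*",
        completion,
        re.DOTALL,
    )
    if match is None:
        return 0.0  # malformed output earns nothing
    format_reward = 0.5
    accuracy_reward = 1.0 if match.group(1).strip() == expected_answer else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards):
    """Normalize rewards within a group of completions sampled for one prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # no learning signal if all samples tie
    return [(r - mean) / std for r in rewards]


# Hypothetical completions sampled for the prompt "What is 2 + 2?"
completions = [
    "<think>2 + 2 is 4</think><answer>4</answer>",
    "<think>maybe 5</think><answer>5</answer>",
    "The answer is 4",
]
rewards = [verifiable_reward(c, "4") for c in completions]
print(rewards)  # [1.5, 0.5, 0.0]
print(group_relative_advantages(rewards))
```

Comparing each completion with the others sampled for the same prompt removes the need for a separately learned value function, which is a large part of what makes this loop cheap.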

Where RL is used in the real world

Domain	What RL does
Games	Superhuman Go, chess, Atari, StarCraft, Dota 2 — the proving ground
Robotics	Locomotion, dexterous manipulation, often trained in simulation then transferred
LLM alignment	RLHF / RLVR post-training for most major assistants
Recommendation	Long-horizon engagement optimization (sequential, not one-shot)
Energy & operations	Data-center cooling, grid balancing, inventory and logistics
Finance	Trade execution and portfolio strategies under sequential decisions
Science	Plasma control in fusion reactors, drug and materials discovery

Why RL is hard: common challenges

RL is powerful but notoriously finicky. The honest list of failure modes:

Sample inefficiency — model-free RL can need millions or billions of environment steps, which is why so much RL trains in simulation.
Reward specification & shaping — designing a reward that actually expresses what you want is deceptively hard; a badly shaped reward teaches the wrong behavior.
Reward hacking — the agent finds loopholes that score highly but defeat your intent (Goodhart’s law). See the reward hacking discussion.
Sparse & delayed rewards — when reward only arrives at the very end (win/lose), assigning credit to the right earlier actions is brutally hard.
Instability — combining function approximation, bootstrapping, and off-policy data (the “deadly triad”) can make training diverge.
The sim-to-real gap — policies trained in simulation often break on real hardware where the dynamics differ.

How to get started with RL

A practical ladder from concepts to running code:

Build intuition

Watch a clear explainer (the StatQuest video above) and read OpenAI’s Spinning Up “Key Concepts in RL” — the best free practitioner intro.

Learn the theory

Work through Sutton & Barto, Reinforcement Learning: An Introduction — the canonical textbook, free online, by the 2024 Turing Award winners.

Write code

Take the hands-on Hugging Face Deep RL Course, which pairs theory with implementations you train yourself.

Train an agent

Use Gymnasium (the maintained successor to OpenAI Gym) for standard environments, and a library like Stable-Baselines3 or CleanRL for reliable algorithm implementations.

When you’re ready to go beyond toy tasks, explore RL environments and the broader list of RL environment companies.

Researcher takes

Yann LeCun, author of the famous cake analogy that called RL the ‘cherry on top,’ clarifies that calling RL the cherry was never a dismissal of it.

Eugene Yan revisits LeCun’s cake analogy eight years on and argues the ordering it implied has largely held up.

Frequently asked questions

Is reinforcement learning supervised or unsupervised?

Neither — it’s a third paradigm. Supervised learning needs labeled correct answers; unsupervised learning finds structure in unlabeled data. RL learns from a reward signal generated by the agent’s own actions, where feedback is evaluative (“how good”) rather than instructive (“what was right”).

What is the difference between reward and return?

A reward is the single number the environment gives after one action. The return is the (discounted) sum of all future rewards. The agent optimizes the return, not the immediate reward — which is why it will sacrifice short-term payoff for a bigger long-term one.

What is a policy in reinforcement learning?

A policy is the agent’s strategy: a function mapping states to actions (or to a probability distribution over actions). Training an RL agent means searching for the optimal policy — the behavior that maximizes expected return. In deep RL the policy is a neural network.

Do ChatGPT and other LLMs use reinforcement learning?

Yes. After pretraining, most assistants are refined with RLHF (RL from human feedback), and reasoning models increasingly use RLVR (RL with verifiable rewards), as DeepSeek-R1 demonstrated. RL is now a core part of the LLM training stack, not just a games technique.

Key papers and sources

Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto — the canonical free textbook.
Spinning Up: Key Concepts in RL — OpenAI — the best practitioner intro.
Human-level control through deep RL (DQN) — Mnih et al., 2015 — Atari from pixels; the start of modern deep RL.
Proximal Policy Optimization (PPO) — Schulman et al., 2017 — the default modern RL algorithm.
DeepSeek-R1 — DeepSeek-AI, Nature 2025 — RL with verifiable rewards for reasoning.
A Survey of RL from Human Feedback — 2023 — bridge from classic RL to LLM alignment.
