- Reinforcement learning (RL) is machine learning where an agent learns good decisions by trial and error — it acts, gets a reward, and adjusts to maximize long-term reward.
- Unlike supervised learning, there are no labeled right answers: the agent must discover what works, balancing exploration (trying new things) against exploitation (using what works).
- The core machinery — policy, reward, return, value functions — is formalized by the Markov Decision Process and solved by value-based, policy-gradient, or actor-critic algorithms.
- RL powered AlphaGo and Atari-from-pixels, and in 2023–2026 it became central to LLMs via RLHF and RL with verifiable rewards (DeepSeek-R1).
What is reinforcement learning?
Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make good decisions by interacting with an environment through trial and error. At each step the agent observes a state, takes an action, and receives a reward — a number that says how good that action was. Over many attempts it learns a policy: a strategy that maps situations to actions so as to maximize cumulative reward over time, not just the next payoff.
The crucial difference from other ML: there are no labeled correct answers. Nobody tells the agent the right action — it must discover which actions pay off by trying them and observing the consequences, often delayed. That single property drives almost everything distinctive about RL, from the exploration problem to why it’s so hard to train.
A simple analogy: learning by trial and error
Think about training a dog. You can’t hand the dog a labeled dataset of “correct behaviors.” Instead it tries something, and you give a treat (positive reward) or nothing (no reward). Over time the dog learns which actions in which situations earn treats. It even learns sequences: sit, then stay, then come — because the treat at the end reinforces the whole chain.
The same shape describes a child learning to ride a bike (wobble → fall → adjust → stay up) or a game player racking up points. Three ingredients recur:
- Trial and error — you learn by doing, not by being told.
- Delayed reward — the payoff often comes well after the action that caused it (this is the hard part: credit assignment).
- A goal expressed as reward — “stay upright,” “win the game,” “earn the treat.”
RL is the mathematical formalization of exactly this learning process.
The RL loop in plain terms
Every RL system is built from the same handful of pieces, interacting in a loop.
The agent and the environment
The agent is the learner and decision-maker — the dog, the game-playing program, the robot controller. Everything outside the agent that it interacts with is the environment — the world, the game, the simulator. The agent acts on the environment; the environment responds with a new situation and a reward. The boundary is conceptual: for a robot, the “environment” includes its own arm dynamics, because the controller can’t change those directly.
States and observations
A state is a complete description of the situation at a moment in time. In practice the agent often sees only an observation, which may be a partial view of the state: a chess engine sees the whole board, while a robot sees only what its camera captures. When the observation fully captures the state, the problem is fully observable; otherwise it’s partially observable, which is harder.
Actions
An action is a choice the agent can make. The set of legal actions is the action space, which may be discrete (move left / right / jump) or continuous (apply 3.7 N·m of torque). Continuous, high-dimensional action spaces are much harder and shape which algorithms you can use.
Rewards and the goal of maximizing return
After each action the environment emits a scalar reward — the only learning signal RL gets. The agent’s goal is not to maximize the immediate reward but the return: the cumulative reward over the long run. This distinction is the heart of RL: a move that scores zero now (developing a chess piece) may be essential to a big reward later (checkmate).
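
The loop described in this section can be sketched in a few lines of Python. `CorridorEnv` is a hypothetical toy environment invented for this illustration (it does not come from any library), and the random policy is only a stand-in for a real learning agent. The `reset`/`step` split loosely follows the convention used by Gymnasium-style environments.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = left, 1 = right); return (state, reward, done)."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # reward arrives only at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode of the agent-environment loop and return the rewards seen."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        rewards.append(reward)
        if done:
            break
    return rewards


random.seed(0)
rewards = run_episode(CorridorEnv(), lambda state: random.choice([0, 1]))
print(f"steps: {len(rewards)}, total reward: {sum(rewards)}")
```

Note that the only feedback the agent receives is the list of rewards; nothing tells it that "right" was the correct move.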
How RL differs from supervised and unsupervised learning
All three are machine learning, but they learn from fundamentally different signals.
| | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Signal | Labeled examples (input → correct output) | Unlabeled data | A reward number, often delayed |
| Goal | Predict the label | Find structure / patterns | Choose actions to maximize return |
| Feedback | The right answer for each example | None | Evaluative (“how good”), not instructive (“what was right”) |
| Data | Fixed dataset | Fixed dataset | Generated by the agent’s own actions |
| Example | Image classification | Clustering customers | Game playing, robot control |
Two differences do most of the work. First, RL feedback is evaluative, not instructive: a reward tells you how good an action was, never what the best action would have been. Second, the data is non-stationary and self-generated — as the policy changes, the distribution of states it visits changes too. That feedback loop is why RL is powerful and why it’s notoriously unstable.
Key concepts you’ll keep seeing
Policy (the agent’s strategy)
A policy is the agent’s behavior: a mapping from states to actions. It can be deterministic ($a = \pi(s)$) or stochastic ($\pi(a \mid s)$ gives a probability for each action). The whole point of RL is to find an optimal policy $\pi^*$: one that maximizes expected return from every state. In deep RL the policy is a neural network whose weights are what training adjusts.
Return, the discount factor, and why future rewards matter
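The return $G_t$ is the discounted sum of future rewards from time $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The discount factor $\gamma \in [0, 1]$ controls how much future rewards count. With $\gamma = 0$ the agent is myopic (cares only about the next reward); near $\gamma = 1$ it’s far-sighted. Discounting also keeps the sum finite for never-ending tasks and reflects that distant rewards are less certain. A common default is $\gamma = 0.99$.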
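
As a small worked example of the discounted sum, the sketch below computes the return of one finished episode. The function name `discounted_return` and the reward list are illustrative assumptions; rewards are assumed to be listed in the order they were received.

```python
def discounted_return(rewards, gamma):
    """Compute the discounted sum r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [0.0, 0.0, 0.0, 1.0]  # reward only at the end of the episode
print(discounted_return(rewards, gamma=0.0))  # myopic: only the first reward counts
print(discounted_return(rewards, gamma=0.9))  # far-sighted: the final reward still matters
```

With a discount of 0 the delayed reward is invisible to the agent; with 0.9 it is worth 0.9 cubed, or about 0.729, at the start of the episode.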
Value functions (V and Q)
A value function estimates expected return — the long-run payoff of being in a situation. Two flavors:
- State-value $V^\pi(s)$: expected return starting from state $s$ and following policy $\pi$.
- Action-value $Q^\pi(s, a)$: expected return after taking action $a$ in state $s$, then following $\pi$.
$Q$ is especially useful: if you know $Q^*$, the best action is simply $\arg\max_a Q^*(s, a)$. Many RL algorithms are, at heart, ways to estimate these values from experience.
Exploration vs. exploitation
To learn, the agent must sometimes try actions it’s unsure about (exploration) rather than always picking the current best (exploitation). Too much exploitation and it gets stuck in a mediocre habit; too much exploration and it never cashes in. The simplest balance is ε-greedy: with probability $\varepsilon$ pick a random action, otherwise pick the greedy (best-known) one, and decay $\varepsilon$ over time.
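
A minimal sketch of ε-greedy selection with a decaying exploration rate follows. The linear schedule and its endpoints (1.0 down to 0.05 over 1000 steps) are assumptions chosen for illustration, as are the function names and the example value estimates.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from start to end over decay_steps steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


q_values = [0.1, 0.5, 0.2]  # hypothetical value estimates for three actions
rng = random.Random(0)
for step in (0, 500, 1000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), epsilon_greedy(q_values, eps, rng))
```

Early in training the agent acts almost entirely at random; by the end it picks the best-known action about 95% of the time under these assumed settings.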
The math under the hood: Markov Decision Processes
The MDP tuple
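The standard formal framework for RL is the Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- States $\mathcal{S}$ and actions $\mathcal{A}$: the set of all situations the agent can be in, and all actions it can take.
- Transition function $P(s' \mid s, a)$: the probability of landing in state $s'$ after taking action $a$ in state $s$. This encodes the environment’s dynamics.
- Reward function $R(s, a)$: the (expected) reward for taking action $a$ in state $s$.
- Discount factor $\gamma$: how much future reward is worth relative to immediate reward, as above.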
The defining assumption is the Markov property: the future depends only on the current state, not the full history — the present state captures everything relevant. This is what makes the problem tractable.
The Bellman equation in one line
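The value of a state is the immediate reward plus the discounted value of where you land next. That self-referential identity is the Bellman equation, the engine of almost every RL algorithm:
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \Big]$$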
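
To make the self-referential identity concrete, the sketch below repeatedly applies it to a tiny two-state MDP until the values stop changing (iterative policy evaluation). The MDP, its state and action names, and the fixed policy are all invented for illustration.

```python
def policy_evaluation(states, actions, P, R, policy, gamma, tol=1e-10):
    """Sweep the Bellman equation over all states until the values converge."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_value = 0.0
            for a in actions:
                # Expected value of the state we land in after taking a in s.
                expected_next = sum(prob * V[s_next] for s_next, prob in P[s][a].items())
                new_value += policy[s][a] * (R[s][a] + gamma * expected_next)
            delta = max(delta, abs(new_value - V[s]))
            V[s] = new_value
        if delta < tol:
            return V


# Hypothetical MDP: staying in B pays 1 per step, everything else pays 0.
states = ["A", "B"]
actions = ["stay", "move"]
P = {
    "A": {"stay": {"A": 1.0}, "move": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "move": {"A": 1.0}},
}
R = {
    "A": {"stay": 0.0, "move": 0.0},
    "B": {"stay": 1.0, "move": 0.0},
}
# Policy: move from A to B, then stay in B forever.
policy = {
    "A": {"stay": 0.0, "move": 1.0},
    "B": {"stay": 1.0, "move": 0.0},
}
V = policy_evaluation(states, actions, P, R, policy, gamma=0.9)
print({s: round(v, 3) for s, v in V.items()})
```

The result can be checked by hand: staying in B forever is worth 1 / (1 - 0.9) = 10, and A is one discounted step away from that, so it is worth 0.9 × 10 = 9.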
Go deeper: the Bellman optimality equation
For the optimal value function $V^*$, instead of averaging over the policy you take the best action:
$$V^*(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]$$
The action-value form gives the famous Q-learning update target, $r + \gamma \max_{a'} Q(s', a')$. Temporal-difference (TD) learning turns this into a practical rule: nudge your current estimate toward the observed reward plus the discounted estimate of the next state, bootstrapping from your own predictions rather than waiting for the full return.
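
The TD rule can be sketched as tabular Q-learning on a toy problem. The corridor dynamics in `step`, the learning rate of 0.1, the discount of 0.9, and the episode count are assumptions made for this example; the agent behaves completely at random while still learning values for the greedy policy.

```python
import random


def q_learning_update(Q, state, action, reward, next_state, done, alpha, gamma):
    """Nudge Q[state][action] toward the TD target: reward + gamma * max Q[next_state]."""
    target = reward if done else reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])


def step(state, action, n_states=5):
    """Hypothetical corridor: action 1 moves right, 0 moves left; the goal is the last cell."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), n_states - 1)
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(5)]  # Q[state][action], initialised to zero
for episode in range(500):
    state, done = 0, False
    while not done:
        action = rng.randrange(2)  # random behaviour policy
        next_state, reward, done = step(state, action)
        q_learning_update(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.9)
        state = next_state

# Greedy action in each non-terminal state (1 means "move right").
greedy_policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy_policy)
```

Each update uses only one observed transition and the current estimate of the next state, which is the bootstrapping the text describes.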
Families of RL algorithms
There are three big approaches, distinguished by what the agent learns.
Value-based methods (Q-learning, DQN)
These learn a value function (usually $Q$) and derive the policy by acting greedily with respect to it. Q-learning is the classic; DQN (Deep Q-Network) scaled it with a neural network to play Atari from pixels. Value-based methods are sample-efficient and off-policy (they can learn from old data via a replay buffer) but struggle with continuous action spaces, where $\max_a Q(s, a)$ becomes intractable.
Policy-gradient and actor-critic methods (REINFORCE, PPO, SAC)
Policy-based methods skip the value function and optimize the policy directly by gradient ascent on expected return — naturally handling continuous actions and stochastic policies. Plain REINFORCE is high-variance; actor-critic methods fix this by also learning a value function (the critic) to reduce variance while the actor updates the policy. PPO is the dominant modern choice — stable, simple, and the workhorse behind much of RLHF. SAC (Soft Actor-Critic) is a strong off-policy option for continuous control.
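
As a rough illustration of optimizing a policy directly, the sketch below runs a REINFORCE-style update on a two-armed bandit, the simplest one-step setting. The Gaussian reward arms, the learning rate, and the running-average baseline (a crude stand-in for a learned critic) are all assumptions made for this example, not a description of how PPO or SAC are implemented.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, seed=0):
    """Gradient ascent on expected reward for a softmax policy over bandit arms."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_means)  # policy parameters, one preference per arm
    baseline = 0.0                  # running average reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], 1.0)
        advantage = reward - baseline
        # d/d pref_i of log pi(action) is 1[i == action] - probs[i] for a softmax policy.
        for i in range(len(prefs)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            prefs[i] += lr * advantage * grad_log_pi
        baseline += (reward - baseline) / t
    return softmax(prefs)


probs = reinforce_bandit(arm_means=[0.0, 1.0])
print([round(p, 3) for p in probs])
```

The policy never estimates which arm is better; it simply raises the probability of actions that beat the baseline and lowers the probability of those that fall short.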
Model-free vs. model-based RL
A separate axis cuts across all of the above:
- Model-free: the agent learns a policy or value function directly from experience, without modeling how the environment works. Simpler and more general (most famous results, such as DQN and PPO, are model-free) but sample-hungry, often needing millions of interactions.
- Model-based: the agent first learns (or is given) a model of the environment’s dynamics, then plans against it. Far more sample-efficient and powerful (AlphaZero, MuZero, Dreamer), but learning an accurate model is hard, and model errors compound.
Deep reinforcement learning: adding neural networks
Classic RL stored values in a table — one entry per state. That collapses the moment states are images, sensor streams, or text: the table is astronomically large. Deep reinforcement learning replaces the table with a neural network as a function approximator that generalizes across similar states, so the agent can handle raw pixels or high-dimensional sensors it has never seen exactly before.
This is what unlocked the headline results — but it also introduced instability, because you’re now chasing a moving target with a function approximator on correlated, self-generated data. The 2015 DQN paper’s fixes — an experience replay buffer (to decorrelate data) and a target network (to stabilize the learning target) — became standard tricks of the trade.
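
Both tricks are simple to sketch. The `ReplayBuffer` class below is an illustrative minimal version, not the DQN paper's code, and the dictionary "parameters" with a sync interval of 100 steps are placeholders standing in for real network weights and a real training loop.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random minibatch, which mixes transitions from different times."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=1000, seed=0)
for t in range(50):
    buffer.add(state=t, action=t % 2, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=8)
print([transition[0] for transition in batch])  # sampled states, in random order

# Target network: a frozen copy of the weights, refreshed only occasionally.
online_params = {"w": 0.0}            # stand-in for the online network's weights
target_params = dict(online_params)   # copy used to compute learning targets
SYNC_EVERY = 100                      # assumed sync interval, in update steps
for step in range(1, 251):
    online_params["w"] += 0.01        # placeholder for a gradient update
    if step % SYNC_EVERY == 0:
        target_params = dict(online_params)
print(online_params, target_params)   # the target lags behind the online weights
```

Between syncs the learning target is computed from weights that do not move, so the network is no longer chasing its own latest update.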
Milestones that put RL on the map
RL for large language models
The hottest entry point to RL today isn’t games — it’s large language models. Two RL-flavored techniques reshaped how LLMs are trained after pretraining:
- RLHF (RL from Human Feedback) — learn a reward model from human preferences, then optimize the LLM (often with PPO) to produce answers humans prefer. This is what turned base models into helpful, safe assistants. Simpler descendants like DPO skip the explicit RL loop.
- RLVR (RL with Verifiable Rewards) — the reward comes from a programmatic checker: did the unit tests pass? is the math answer correct? DeepSeek-R1 showed that pure RL with rule-based accuracy + format rewards can make models develop self-reflection and verification on their own — a landmark for RL for reasoning. Algorithms like GRPO make this loop cheap.
Frontier models increasingly use both: RLVR to sharpen reasoning on checkable tasks, RLHF to keep results helpful and safe on open-ended ones. See agentic RL for how this extends to tool-using agents.
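
A verifiable reward is just a function that checks an output, and it can be sketched without any model at all. The `<answer>` tag convention, the 0.1 format bonus, and the sample completions below are assumptions invented for this example. The second function shows group-normalized advantages in the style of GRPO, where each sampled answer is scored relative to the others for the same prompt; the use of the population standard deviation is likewise an assumption.

```python
import re
import statistics


def verifiable_reward(completion, expected_answer):
    """Rule-based reward: accuracy (final answer correct) plus a small format bonus."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    format_reward = 0.1 if match else 0.0
    is_correct = match is not None and match.group(1).strip() == expected_answer
    accuracy_reward = 1.0 if is_correct else 0.0
    return accuracy_reward + format_reward


def group_relative_advantages(rewards):
    """Normalise rewards within a group of samples drawn for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


completions = [
    "Let me add: 2 + 2 = 4. <answer>4</answer>",
    "I think it is five. <answer>5</answer>",
    "The answer is 4",
]
rewards = [verifiable_reward(c, expected_answer="4") for c in completions]
print(rewards)
print(group_relative_advantages(rewards))
```

Because the checker is a program rather than a learned reward model, it is cheap to run at scale, though it only applies to tasks where correctness can be checked mechanically.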
Where RL is used in the real world
| Domain | What RL does |
|---|---|
| Games | Superhuman Go, chess, Atari, StarCraft, Dota 2 — the proving ground |
| Robotics | Locomotion, dexterous manipulation, often trained in simulation then transferred |
| LLM alignment | RLHF / RLVR post-training for most major assistants |
| Recommendation | Long-horizon engagement optimization (sequential, not one-shot) |
| Energy & operations | Data-center cooling, grid balancing, inventory and logistics |
| Finance | Trade execution and portfolio strategies under sequential decisions |
| Science | Plasma control in fusion reactors, drug and materials discovery |
Why RL is hard: common challenges
RL is powerful but notoriously finicky. The honest list of failure modes:
- Sample inefficiency — model-free RL can need millions or billions of environment steps, which is why so much RL trains in simulation.
- Reward specification & shaping — designing a reward that actually expresses what you want is deceptively hard; a badly shaped reward teaches the wrong behavior.
- Reward hacking — the agent finds loopholes that score highly but defeat your intent (Goodhart’s law). See the reward hacking discussion.
- Sparse & delayed rewards — when reward only arrives at the very end (win/lose), assigning credit to the right earlier actions is brutally hard.
- Instability — combining function approximation, bootstrapping, and off-policy data (the “deadly triad”) can make training diverge.
- The sim-to-real gap — policies trained in simulation often break on real hardware where the dynamics differ.
How to get started with RL
A practical ladder from concepts to running code:
1. Watch a clear explainer (for example, the StatQuest video on reinforcement learning) and read OpenAI’s Spinning Up “Key Concepts in RL”, the best free practitioner intro.
2. Work through Sutton & Barto, Reinforcement Learning: An Introduction, the canonical textbook, free online, by the 2024 Turing Award winners.
3. Take the hands-on Hugging Face Deep RL Course, which pairs theory with implementations you train yourself.
4. Use Gymnasium (the maintained successor to OpenAI Gym) for standard environments, and a library like Stable-Baselines3 or CleanRL for reliable algorithm implementations.
5. When you’re ready to go beyond toy tasks, explore RL environments and the broader list of RL environment companies.
Researcher takes
Yann LeCun, author of the famous cake analogy that called RL the ‘cherry on top,’ clarifies that calling RL the cherry was never a dismissal of it.
Eugene Yan revisits LeCun’s cake analogy eight years on and argues the ordering it implied has largely held up.
Frequently asked questions
Is reinforcement learning supervised or unsupervised?
Neither — it’s a third paradigm. Supervised learning needs labeled correct answers; unsupervised learning finds structure in unlabeled data. RL learns from a reward signal generated by the agent’s own actions, where feedback is evaluative (“how good”) rather than instructive (“what was right”).
What is the difference between reward and return?
A reward is the single number the environment gives after one action. The return is the (discounted) sum of all future rewards. The agent optimizes the return, not the immediate reward — which is why it will sacrifice short-term payoff for a bigger long-term one.
What is a policy in reinforcement learning?
A policy is the agent’s strategy: a function mapping states to actions (or to a probability distribution over actions). Training an RL agent means searching for the optimal policy — the behavior that maximizes expected return. In deep RL the policy is a neural network.
Do ChatGPT and other LLMs use reinforcement learning?
Yes. After pretraining, most assistants are refined with RLHF (RL from human feedback), and reasoning models increasingly use RLVR (RL with verifiable rewards), as DeepSeek-R1 demonstrated. RL is now a core part of the LLM training stack, not just a games technique.
Key papers and sources
- Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto — the canonical free textbook.
- Spinning Up: Key Concepts in RL — OpenAI — the best practitioner intro.
- Human-level control through deep RL (DQN) — Mnih et al., 2015 — Atari from pixels; the start of modern deep RL.
- Proximal Policy Optimization (PPO) — Schulman et al., 2017 — the default modern RL algorithm.
- DeepSeek-R1 — DeepSeek-AI, Nature 2025 — RL with verifiable rewards for reasoning.
- A Survey of RL from Human Feedback — 2023 — bridge from classic RL to LLM alignment.
Related
RLHF · PPO · DPO & preference optimization · GRPO · RLVR · RL for reasoning · Reward models · Agentic RL · RL environments