- RL lets robots learn motor skills — walking, grasping, dexterous manipulation — that are too complex to hand-code, by optimizing a reward through trial and error.
- The dominant recipe is train in simulation, deploy on hardware: GPU physics simulators run thousands of robots in parallel, then policies cross the reality gap via domain randomization.
- Real robots are slow, fragile and expensive, so most RL happens in sim; the central challenge is the sim-to-real gap, not the learning algorithm itself.
- Landmark results (ANYmal walking, OpenAI’s Dactyl solving a Rubik’s cube, Google’s QT-Opt grasping) proved RL works on hardware; work in 2024–2026 pushes toward agile humanoids and foundation-model policies.
What is reinforcement learning in robotics?
Reinforcement learning in robotics is the use of trial-and-error optimization to teach a physical robot a control policy — a function mapping sensor readings (joint angles, camera images, IMU data) to motor commands. Instead of an engineer hand-tuning a controller, the robot (or a simulated copy of it) explores actions, receives a scalar reward for good behavior, and gradually shifts its policy toward higher reward. It is the application of the core RL loop — agent, environment, reward — to the messiest possible environment: the real physical world.
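To make the loop concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a real robot stack: `PointMassEnv` is a hypothetical 1-D stand-in for a robot, the policy is a two-gain linear controller, and plain random search stands in for a real RL algorithm such as PPO. It only shows the shape of the idea: observe, act, receive a scalar reward, and keep policy changes that earn more of it.

```python
import numpy as np


class PointMassEnv:
    """Hypothetical 1-D stand-in for a robot: a unit mass pushed along a line."""

    def __init__(self, dt=0.05, seed=0):
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.state = np.zeros(2)  # [position, velocity]

    def reset(self):
        self.state = self.rng.uniform(-1.0, 1.0, size=2)
        return self.state.copy()

    def step(self, force):
        pos, vel = self.state
        vel = vel + self.dt * force
        pos = pos + self.dt * vel
        self.state = np.array([pos, vel])
        # Scalar reward: stay near the origin and use little effort.
        reward = -(pos ** 2) - 0.01 * force ** 2
        return self.state.copy(), float(reward)


def policy(obs, gains):
    """Control policy: sensor readings in, one bounded motor command out."""
    return float(np.clip(-(gains @ obs), -2.0, 2.0))


def rollout(gains, horizon=200, env_seed=1):
    """Run one episode from a fixed start state and return the summed reward."""
    env = PointMassEnv(seed=env_seed)
    obs = env.reset()
    total = 0.0
    for _ in range(horizon):
        action = policy(obs, gains)
        obs, reward = env.step(action)
        total += reward
    return total


def random_search(iters=50, seed=0):
    """Trial and error: keep a perturbed policy only if it earns more reward."""
    rng = np.random.default_rng(seed)
    gains = np.zeros(2)
    best = rollout(gains)
    for _ in range(iters):
        candidate = gains + 0.5 * rng.standard_normal(2)
        score = rollout(candidate)
        if score > best:
            gains, best = candidate, score
    return gains, best


baseline = rollout(np.zeros(2))
gains, best = random_search()
print(f"reward with zero gains: {baseline:.2f}, after search: {best:.2f}")
```

Because the search only ever accepts improvements, the final reward can never be worse than the untrained baseline; real algorithms replace the random perturbation with gradient estimates.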
Robotics is one of RL’s oldest and hardest proving grounds. The math is the same as everywhere else in RL, but the constraints are brutal: actions take real time, mistakes break expensive hardware, sensors are noisy, and you cannot rewind the world. These constraints reshape how RL is done — and explain why simulation sits at the center of nearly every modern robot-learning pipeline.
Why robotics makes RL hard
The same properties that make robots useful make them hostile to naive RL. A simulated Atari agent can play millions of games overnight; a real robot arm can attempt perhaps a few hundred grasps per hour, and can snap a gripper if the policy commands a bad move.
Four structural challenges dominate the field:
- Sample inefficiency meets scarce data. Deep RL famously needs millions to billions of environment steps. On hardware that is impractical, dangerous and slow.
- Safety and fragility. Exploration means trying bad actions. On a real robot a bad action can damage the hardware or its surroundings, or injure a person.
- Partial, noisy observation. Cameras lag, joint encoders drift, contact is hard to sense. The true state is rarely fully observable — a POMDP, not a clean MDP.
- Reward design. “Walk forward without falling” hides dozens of sub-goals (don’t waste energy, keep feet from slipping, stay upright). Sparse rewards barely train; dense rewards invite reward hacking.
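The reward-design point above can be sketched as a dense locomotion reward built from the sub-goals it lists. The function name, argument names and every weight below are hypothetical choices for illustration; real pipelines tune such terms per robot, and poorly balanced weights are exactly where reward hacking creeps in.

```python
import numpy as np


def locomotion_reward(forward_vel, torques, foot_slip_speeds, tilt_rad, fell,
                      target_vel=1.0):
    """Dense reward for 'walk forward without falling' (illustrative weights)."""
    # Peaks at 1.0 when the robot moves at exactly the target velocity.
    tracking = np.exp(-4.0 * (forward_vel - target_vel) ** 2)
    # Penalize wasted energy via squared joint torques.
    energy = 0.0005 * float(np.sum(np.square(torques)))
    # Penalize feet sliding while in contact with the ground.
    slip = 0.1 * float(np.sum(np.abs(foot_slip_speeds)))
    # Penalize leaning away from upright.
    upright = 0.5 * tilt_rad ** 2
    # Large one-off penalty for falling.
    fall = 10.0 if fell else 0.0
    return float(tracking - energy - slip - upright - fall)


steady = locomotion_reward(1.0, np.full(12, 5.0), np.zeros(4), 0.05, False)
crashed = locomotion_reward(0.2, np.full(12, 30.0), np.ones(4), 0.9, True)
print(f"steady gait: {steady:.3f}, fall: {crashed:.3f}")
```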
The single biggest consequence: almost all robot RL trains in simulation and transfers to hardware. The hard problem becomes crossing the gap between the simulator and reality.
The dominant recipe: simulation-first
Modern robot learning runs on GPU-accelerated physics simulators such as NVIDIA’s Isaac Gym and Isaac Lab, MuJoCo (including its JAX-based MJX backend) and Brax. These simulate thousands of robot copies in parallel directly on the GPU, avoiding the CPU-to-GPU transfer bottleneck of older pipelines. This collapses experience that would take weeks to collect on real hardware into minutes of wall-clock training. The recipe has three steps:
- Simulate. Model the robot’s kinematics, mass, actuators and the task in a physics engine. Fidelity matters but is never perfect: real friction, latency and motor dynamics are hard to capture exactly. This residual mismatch is the reality gap.
- Randomize. Vary physics parameters (masses, friction, motor strength, sensor noise, latency, even visual textures) every episode. The policy must succeed across the whole distribution, so the real world looks like just one more random sample. This is domain randomization, the workhorse of sim-to-real.
- Deploy. Freeze the policy and run it on the real robot. A robust policy transfers zero-shot; otherwise a short bout of real-world fine-tuning or online adaptation closes the remaining gap.
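The randomization step described above can be sketched as a per-episode parameter sampler. The `PhysicsParams` fields and the numeric ranges in `RANGES` are hypothetical; a real setup would write the sampled values into whichever simulator it uses, through that simulator's own API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class PhysicsParams:
    mass_scale: float        # multiplier on nominal link masses
    friction: float          # ground friction coefficient
    motor_strength: float    # multiplier on commanded torque
    sensor_noise_std: float  # std of additive observation noise
    latency_steps: int       # control delay, in simulation steps


# Hypothetical (low, high) randomization ranges, chosen only for illustration.
RANGES = {
    "mass_scale": (0.8, 1.2),
    "friction": (0.4, 1.2),
    "motor_strength": (0.85, 1.15),
    "sensor_noise_std": (0.0, 0.02),
    "latency_steps": (0, 3),
}


def sample_params(rng):
    """Draw one set of physics parameters, uniformly within RANGES."""
    lo, hi = RANGES["latency_steps"]
    return PhysicsParams(
        mass_scale=float(rng.uniform(*RANGES["mass_scale"])),
        friction=float(rng.uniform(*RANGES["friction"])),
        motor_strength=float(rng.uniform(*RANGES["motor_strength"])),
        sensor_noise_std=float(rng.uniform(*RANGES["sensor_noise_std"])),
        latency_steps=int(rng.integers(lo, hi + 1)),  # upper bound is exclusive
    )


rng = np.random.default_rng(0)
for episode in range(3):
    params = sample_params(rng)  # resample at every episode reset
    print(episode, params)
```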
Crossing the reality gap
No simulator matches reality exactly. A policy that overfits to simulator quirks — exploiting un-physical friction or perfect sensing — collapses on hardware. Four families of techniques bridge the gap, often combined:
- Domain randomization. Train across a wide distribution of simulated dynamics and appearances so the policy is robust to the unknown true parameters. The most reliable and widely used technique; OpenAI’s Dactyl pushed it furthest. See domain randomization in robotics.
- System identification. Carefully measure real parameters (mass, friction, motor curves) and tune the simulator to match. Reduces the gap directly but never fully, and must be redone per robot and per wear state.
- Domain adaptation. Use real data to adapt a sim-trained policy, either by fine-tuning or by learning a latent representation shared across sim and real. Bridges what randomization alone can’t.
- Teacher–student (privileged) learning. Train a “teacher” in sim with access to privileged state (true terrain, contact forces), then distill it into a “student” that uses only real onboard sensors. The recipe behind robust legged locomotion.
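As one small illustration of the system-identification idea above, the sketch below fits two parameters of a single joint from logged motion data by least squares. The joint model (torque = inertia × acceleration + damping × velocity), the function name and the synthetic "measurements" are all assumptions made for the example; real motors also show effects such as static friction and backlash that this model ignores.

```python
import numpy as np


def identify_joint(accel, vel, torque):
    """Least-squares fit of torque = inertia * accel + damping * vel."""
    A = np.column_stack([accel, vel])
    (inertia, damping), *_ = np.linalg.lstsq(A, torque, rcond=None)
    return float(inertia), float(damping)


# Synthetic stand-in for logs from a real joint (true values are made up).
rng = np.random.default_rng(0)
true_inertia, true_damping = 0.30, 0.05
accel = rng.normal(0.0, 2.0, size=500)
vel = rng.normal(0.0, 1.0, size=500)
torque = true_inertia * accel + true_damping * vel + rng.normal(0.0, 0.01, size=500)

inertia, damping = identify_joint(accel, vel, torque)
print(f"estimated inertia={inertia:.3f}, damping={damping:.3f}")
```

The fitted values would then be written back into the simulator model, and the fit repeated whenever the hardware changes or wears.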
OpenAI’s Dactyl showed how far randomization scales. Hand-tuning the randomization ranges was itself a bottleneck, so the team invented Automatic Domain Randomization (ADR): start with a single non-randomized environment, and automatically widen the randomization ranges every time the policy clears a performance threshold. The hand learned to manipulate and ultimately solve a Rubik’s cube one-handed — generalizing to disturbances it never saw in training, like being prodded with a stuffed giraffe.
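The widening rule can be sketched for a single physics parameter as follows. This is a simplified illustration, not the published procedure: the `ADRRange` class, the thresholds, the fixed step size and the idea of passing in one success rate measured at the range boundary are all assumptions made to keep the example short.

```python
import numpy as np


class ADRRange:
    """Randomization range for one parameter that starts at a single point."""

    def __init__(self, nominal, step, hard_min, hard_max):
        self.nominal = nominal
        self.step = step
        self.hard_min, self.hard_max = hard_min, hard_max
        self.lo = self.hi = nominal  # no randomization at the start

    def sample(self, rng):
        return float(rng.uniform(self.lo, self.hi))

    def update(self, boundary_success, widen_at=0.8, shrink_at=0.4):
        """Widen when the policy copes at the boundary, narrow when it fails."""
        if boundary_success >= widen_at:
            self.lo = max(self.hard_min, self.lo - self.step)
            self.hi = min(self.hard_max, self.hi + self.step)
        elif boundary_success < shrink_at:
            # Never narrow past the nominal value.
            self.lo = min(self.nominal, self.lo + self.step)
            self.hi = max(self.nominal, self.hi - self.step)


friction = ADRRange(nominal=1.0, step=0.1, hard_min=0.5, hard_max=1.5)
for success in [0.9, 0.9, 0.5, 0.2]:  # made-up evaluation results
    friction.update(success)
    print(f"success={success:.1f} -> range [{friction.lo:.2f}, {friction.hi:.2f}]")
```

The effect is a curriculum: the environment distribution only gets harder as fast as the policy can keep up.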
Go deeper: the teacher–student trick that makes legged robots robust
In simulation you know everything — exact ground height, friction at each foot, contact forces. On the real robot you know almost none of it. Teacher–student learning exploits this asymmetry. First, train a teacher policy with RL that reads privileged simulator state directly, so it learns near-optimal behavior fast. Then train a student policy via supervised imitation to reproduce the teacher’s actions using only the noisy onboard sensors a real robot actually has (joint encoders, IMU), typically with a recurrent or history-conditioned network that implicitly estimates the hidden terrain from a window of past observations. The student inherits the teacher’s skill but runs on real hardware. This two-stage approach underpins ETH Zürich’s ANYmal work and most modern quadruped and humanoid pipelines.
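A toy version of the distillation stage, under loud assumptions: the hidden terrain is a single number `h`, the teacher is a fixed rule that reads `h` directly (standing in for an RL-trained policy), each onboard reading is `h` plus noise, and the student is a linear least-squares fit rather than a recurrent network. The sketch only illustrates why a window of past observations helps the student recover what the teacher sees directly.

```python
import numpy as np


def make_data(n, window, rng, noise_std=0.5):
    """Hidden terrain value h plus a window of noisy onboard readings of it."""
    h = rng.normal(size=n)  # privileged state, available only in sim
    readings = h[:, None] + noise_std * rng.normal(size=(n, window))
    return h, readings


def teacher_policy(h):
    """Stand-in for an RL-trained teacher that reads privileged state."""
    return 2.0 * h


def add_bias(readings):
    return np.hstack([readings, np.ones((len(readings), 1))])


def fit_student(readings, teacher_actions):
    """Supervised imitation: regress teacher actions on sensor history."""
    weights, *_ = np.linalg.lstsq(add_bias(readings), teacher_actions, rcond=None)
    return weights


def student_policy(weights, readings):
    return add_bias(readings) @ weights


def distill_error(window, seed=0):
    """Held-out mean squared gap between student and teacher actions."""
    rng = np.random.default_rng(seed)
    h_train, r_train = make_data(2000, window, rng)
    weights = fit_student(r_train, teacher_policy(h_train))
    h_test, r_test = make_data(2000, window, rng)
    gap = student_policy(weights, r_test) - teacher_policy(h_test)
    return float(np.mean(gap ** 2))


for window in (1, 4, 16):
    print(f"history window {window:2d}: student-teacher MSE {distill_error(window):.3f}")
```

With a longer history the noise averages out, so the student's actions track the teacher's more closely even though it never observes `h`.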
Algorithms that actually get used
The robotics community has converged on a small set of workhorses, chosen for stability and practicality rather than novelty:
| Algorithm | Type | Where it shines | Why |
|---|---|---|---|
| PPO | On-policy | Locomotion, sim-trained whole-body control | Stable, parallelizes to thousands of envs, forgiving to tune |
| SAC | Off-policy | Real-world & sample-limited manipulation | Maximum-entropy exploration, very sample-efficient |
| TD3 / DDPG | Off-policy | Continuous control benchmarks | Deterministic continuous actions |
| QT-Opt | Off-policy Q-learning | Vision-based grasping at scale | Learns from huge logged datasets, closed-loop |
| Model-based RL | Learns a dynamics model | Data-scarce real-robot learning | Plans in a learned model, far fewer real samples |
Robot actions are continuous (joint torques, end-effector velocities), so continuous-control methods and policy gradients dominate over discrete-action Q-learning. QT-Opt is a notable Q-learning exception: rather than discretizing, it handles continuous actions by maximizing the Q-function with a sampling-based optimizer, the cross-entropy method. For the data-scarcity problem specifically, model-based RL is attractive: learn a dynamics model from limited real data, then plan or train inside it.
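To show how a Q-function can pick a continuous action without an actor network, here is a sketch of the cross-entropy method used as an approximate argmax over actions. The population size, elite count, iteration count and the toy `toy_q` function are illustrative assumptions, not the settings of any published system.

```python
import numpy as np


def cem_argmax(q_fn, state, action_dim, iters=5, pop=64, elite=6, seed=0):
    """Approximate argmax over actions of q_fn(state, action) by sampling."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(action_dim)
    std = np.ones(action_dim)
    for _ in range(iters):
        # Sample candidate actions and keep them inside the action bounds.
        noise = rng.standard_normal((pop, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        scores = np.array([q_fn(state, a) for a in actions])
        # Refit the sampling distribution to the best-scoring candidates.
        elites = actions[np.argsort(scores)[-elite:]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean


def toy_q(state, action):
    """Hypothetical Q-function whose best action equals the state vector."""
    return -float(np.sum((action - state) ** 2))


state = np.array([0.3, -0.5])
best_action = cem_argmax(toy_q, state, action_dim=2)
print("approximate best action:", np.round(best_action, 3))
```

In a Q-learning system the same routine would serve twice: to choose the action to execute, and to compute the maximum in the bootstrapped target.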
Go deeper: why PPO won locomotion and SAC won manipulation
Locomotion training lives in simulation where you can run thousands of parallel environments, so sample efficiency barely matters — wall-clock throughput does. PPO’s on-policy updates parallelize trivially and are exceptionally stable, so it became the default for sim-trained walking and whole-body control. Manipulation that touches real hardware faces the opposite pressure: every sample is precious. SAC (and off-policy methods generally) reuse a replay buffer of past experience and add an entropy bonus that keeps exploration alive, squeezing far more learning from each real interaction. Same RL theory, opposite engineering constraint — and the on-policy vs off-policy trade-off explains the split.
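The off-policy side of this split rests on a replay buffer, sketched below as a fixed-size ring buffer. The class and field names are hypothetical, and the transitions in the usage loop are dummy values; an algorithm such as SAC would draw many minibatches from `sample` for every real step it adds, which is where the sample efficiency comes from.

```python
import numpy as np


class ReplayBuffer:
    """Fixed-size ring buffer of transitions, sampled uniformly for reuse."""

    def __init__(self, capacity, obs_dim, act_dim, seed=0):
        self.obs = np.zeros((capacity, obs_dim))
        self.act = np.zeros((capacity, act_dim))
        self.rew = np.zeros(capacity)
        self.next_obs = np.zeros((capacity, obs_dim))
        self.done = np.zeros(capacity, dtype=bool)
        self.capacity, self.size, self.ptr = capacity, 0, 0
        self.rng = np.random.default_rng(seed)

    def add(self, obs, act, rew, next_obs, done):
        i = self.ptr
        self.obs[i], self.act[i], self.rew[i] = obs, act, rew
        self.next_obs[i], self.done[i] = next_obs, done
        # Once full, the oldest transition is overwritten.
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        """Uniform sampling with replacement from the stored transitions."""
        idx = self.rng.integers(0, self.size, size=batch_size)
        return {
            "obs": self.obs[idx],
            "act": self.act[idx],
            "rew": self.rew[idx],
            "next_obs": self.next_obs[idx],
            "done": self.done[idx],
        }


buffer = ReplayBuffer(capacity=3, obs_dim=2, act_dim=1)
for t in range(5):  # dummy transitions; reward records the step index
    buffer.add(np.full(2, t), [0.1 * t], float(t), np.full(2, t + 1), False)
batch = buffer.sample(batch_size=8)
print("stored rewards:", buffer.rew, "batch obs shape:", batch["obs"].shape)
```

An on-policy method like PPO has no such buffer: it discards each batch after one round of updates, which is affordable only when simulation makes fresh samples nearly free.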
RL vs imitation learning in robotics
Not all robot learning is RL. Imitation learning — training a policy to copy human or expert demonstrations — has surged with foundation-model policies, and the two paradigms are best understood as complementary.
| | Reinforcement learning | Imitation learning |
|---|---|---|
| Signal | Reward from trial and error | Expert demonstrations |
| Can exceed the expert? | Yes — discovers novel strategies | No — bounded by demo quality |
| Data cost | Cheap in sim, dangerous on hardware | Expensive human teleoperation |
| Reward design | Required (and hard) | Not needed |
| Exemplars | ANYmal, Dactyl, QT-Opt | RT-1, RT-2, ALOHA, diffusion policies |
Google’s RT-1 and RT-2 robot transformers, trained by imitation on large robot datasets (RT-2 is the vision-language-action model of the pair), show how powerful demonstration learning has become for generalist manipulation. But imitation can’t surpass its demonstrators and needs costly teleoperated data. The frontier increasingly combines the two: pre-train a broad policy by imitation, then sharpen specific skills with RL, or use RL in simulation where demonstrations are scarce. See imitation and inverse RL.
Where robot RL is in 2026
Three threads define the current frontier:
- Humanoids go agile. Whole-body RL, much of it teacher–student training plus aggressive sim-to-real transfer, now drives dynamic walking, recovery and manipulation on platforms such as Unitree’s G1. Work such as ASAP explicitly aligns simulated and real physics to unlock agile whole-body skills.
- Foundation-model policies meet RL. Generalist VLA policies (RT-2-class) are pre-trained by imitation, then increasingly fine-tuned with RL for reliability — combining broad semantic knowledge with reward-driven skill.
- Better simulators and learned worlds. Isaac Lab, MJX and differentiable/learned world models keep shrinking both the reality gap and the wall-clock cost of training.
Researcher takes
Levine argues the long-standing objection that RL is too sample-inefficient for physical robots is being overturned: combining offline pretraining with fast online fine-tuning makes real-world RL routine rather than heroic.
A historical reflection on the pace of real-world RL for locomotion, with Levine framing the two-orders-of-magnitude speedup in on-robot training time as the key shift making learned legged control practical.
Pieter Abbeel, one of the founders of modern deep RL for robotics, on how deep learning reshaped robot control:
Frequently asked questions
Why do robots train in simulation instead of the real world?
Deep RL needs millions to billions of trial-and-error steps. A real robot is slow (a few hundred attempts an hour), fragile (bad exploratory actions break hardware), and unsafe to let flail freely. GPU simulators run thousands of robots in parallel and compress weeks of real experience into minutes — so the policy is trained in sim and then transferred to hardware.
What is the sim-to-real gap?
The mismatch between a physics simulator and the real world — friction, motor dynamics, sensor noise and latency are never modeled perfectly. A policy that overfits simulator quirks fails on hardware. Domain randomization, system identification, domain adaptation and teacher–student learning are the main techniques for crossing this gap. See the sim-to-real survey.
Is reinforcement learning or imitation learning better for robots?
Neither — they solve different problems. RL discovers behavior from reward and can exceed any human demonstrator, but needs reward design and lots of (usually simulated) experience. Imitation learning is data-efficient and avoids reward design but is bounded by demonstration quality. Modern systems increasingly combine them: imitate to bootstrap broad skill, then RL to sharpen and surpass. See imitation and inverse RL.
Which RL algorithm should I use for a robot?
For sim-trained locomotion and whole-body control, PPO is the default — stable and trivially parallel. For real-world or sample-limited manipulation, off-policy SAC or TD3 squeeze more from scarce data. When real data is extremely limited, consider model-based RL. All operate in continuous action spaces.
Key papers
- Domain Randomization for Transferring Deep Neural Networks — Tobin et al., 2017 — the sim-to-real seed.
- QT-Opt: Scalable Deep RL for Vision-Based Robotic Manipulation — Kalashnikov et al., 2018 — grasping at scale.
- Learning Agile and Dynamic Motor Skills for Legged Robots — Hwangbo et al., 2019 — ANYmal in Science Robotics.
- Solving Rubik’s Cube with a Robot Hand — OpenAI, 2019 — Dactyl and Automatic Domain Randomization.
- Learning to Walk in Minutes Using Massively Parallel Deep RL — Rudin et al., 2021 — GPU-parallel locomotion.
- RT-2: Vision-Language-Action Models — Brohan et al., 2023 — generalist robot policies.
- Sim-to-Real Transfer in Deep RL for Robotics: a Survey — Zhao et al., 2020.
Related
What is reinforcement learning? · Continuous control · PPO · Actor–critic · Model-based RL · Imitation and inverse RL · Reward shaping · RL environments