Reinforcement Learning in Robotics

Key takeaways

RL lets robots learn motor skills — walking, grasping, dexterous manipulation — that are too complex to hand-code, by optimizing a reward through trial and error.
The dominant recipe is train in simulation, deploy on hardware: GPU physics simulators run thousands of robots in parallel, then policies cross the reality gap via domain randomization.
Real robots are slow, fragile and expensive, so most RL happens in sim; the central challenge is the sim-to-real gap, not the learning algorithm itself.
Landmark results — ANYmal walking, OpenAI's Dactyl solving a Rubik's cube, Google's QT-Opt grasping — proved RL works on hardware; 2024–2026 pushes agile humanoids and foundation-model policies.

What is reinforcement learning in robotics?

Reinforcement learning in robotics is the use of trial-and-error optimization to teach a physical robot a control policy — a function mapping sensor readings (joint angles, camera images, IMU data) to motor commands. Instead of an engineer hand-tuning a controller, the robot (or a simulated copy of it) explores actions, receives a scalar reward for good behavior, and gradually shifts its policy toward higher reward. It is the application of the core RL loop — agent, environment, reward — to the messiest possible environment: the real physical world.

Robotics is one of RL’s oldest and hardest proving grounds. The math is the same as everywhere else in RL, but the constraints are brutal: actions take real time, mistakes break expensive hardware, sensors are noisy, and you cannot rewind the world. These constraints reshape how RL is done — and explain why simulation sits at the center of nearly every modern robot-learning pipeline.

The robot control loop as an MDP: the policy reads state s(t) from sensors, sends an action a(t) to the actuators, and the physical world returns the next state s(t+1) and a reward r(t+1).
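As a rough illustration of the loop in the caption, the sketch below runs a policy against a toy one-joint "world". Everything here is an assumption made for the example: `ToyJointEnv`, `proportional_policy` and `rollout` are hypothetical names, the dynamics are a stand-in for real physics, and the hand-written policy stands in for a learned one.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyJointEnv:
    """Hypothetical 1-D joint that should be driven to a target angle."""
    target: float = 1.0
    angle: float = 0.0
    dt: float = 0.05

    def reset(self) -> float:
        self.angle = 0.0
        return self.angle                          # s(0)

    def step(self, action: float):
        # The "physical world": integrate a clipped velocity command.
        self.angle += max(-1.0, min(1.0, action)) * self.dt
        reward = -abs(self.target - self.angle)    # r(t+1): closer is better
        return self.angle, reward                  # s(t+1), r(t+1)


def proportional_policy(state: float, target: float = 1.0, gain: float = 2.0) -> float:
    """Stand-in for a learned policy: maps state s(t) to action a(t)."""
    return gain * (target - state)


def rollout(env: ToyJointEnv, policy, steps: int = 100, gamma: float = 0.99) -> float:
    """Run one episode and return the discounted sum of rewards."""
    state = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(steps):
        action = policy(state)                     # a(t) = policy(s(t))
        state, reward = env.step(action)           # world returns s(t+1), r(t+1)
        ret += discount * reward
        discount *= gamma
    return ret


random.seed(0)
random_return = rollout(ToyJointEnv(), lambda s: random.uniform(-1.0, 1.0))
tuned_return = rollout(ToyJointEnv(), proportional_policy)
print(f"random policy: {random_return:.2f}, proportional policy: {tuned_return:.2f}")
```

An RL algorithm's job is to move from the first kind of policy toward the second using only the reward signal.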

Why robotics makes RL hard

The same properties that make robots useful make them hostile to naive RL. A simulated Atari agent can play millions of games overnight; a real robot arm can attempt maybe a few hundred grasps per hour, and can snap a gripper if the policy commands a bad move.

580k: real grasp attempts QT-Opt needed (7 robots, 4 months)
< 4 min: to train flat-ground walking in sim (4096 robots, 1 GPU)
10⁴–10⁵: robots simulated in parallel on a single modern GPU

Four structural challenges dominate the field:

Sample inefficiency meets scarce data. Deep RL famously needs millions to billions of environment steps. On hardware that is impractical, dangerous and slow.
Safety and fragility. Exploration means trying bad actions. On a real robot a bad action can destroy hardware, the environment, or a person.
Partial, noisy observation. Cameras lag, joint encoders drift, contact is hard to sense. The true state is rarely fully observable — a POMDP, not a clean MDP.
Reward design. “Walk forward without falling” hides dozens of sub-goals (don’t waste energy, keep feet from slipping, stay upright). Sparse rewards barely train; dense rewards invite reward hacking.
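To make the reward-design point concrete, here is a minimal sketch of a dense walking reward built from the sub-goals listed above. The `StepInfo` fields, the weights and the velocity cap are all hypothetical choices for illustration, not values from any particular system.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical per-step quantities a simulator could expose."""
    forward_velocity: float   # m/s along the heading
    joint_power: float        # sum of |torque * joint velocity|, in W
    foot_slip: float          # m/s of stance-foot sliding
    tilt: float               # radians away from upright
    fell: bool


# Illustrative weights; in practice these are tuned by hand or by search.
WEIGHTS = {"velocity": 1.0, "energy": 0.002, "slip": 0.5, "tilt": 0.3, "fall": 10.0}


def walking_reward(info: StepInfo, target_velocity: float = 1.0) -> dict:
    """Return each reward term separately so odd behavior is easier to diagnose."""
    terms = {
        # Cap the velocity term so sprinting past the target earns nothing extra.
        "velocity": WEIGHTS["velocity"] * min(info.forward_velocity, target_velocity),
        "energy": -WEIGHTS["energy"] * info.joint_power,
        "slip": -WEIGHTS["slip"] * info.foot_slip,
        "tilt": -WEIGHTS["tilt"] * abs(info.tilt),
        "fall": -WEIGHTS["fall"] if info.fell else 0.0,
    }
    terms["total"] = sum(terms.values())
    return terms


steady = walking_reward(StepInfo(1.0, 100.0, 0.0, 0.05, False))
lunge = walking_reward(StepInfo(3.0, 900.0, 0.4, 0.6, True))
print(f"steady gait: {steady['total']:.3f}, lunge and fall: {lunge['total']:.3f}")
```

Logging the terms individually is one common way to notice reward hacking: a policy whose total climbs while a single term dominates is often exploiting that term.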

The single biggest consequence: almost all robot RL trains in simulation and transfers to hardware. The hard problem becomes crossing the gap between the simulator and reality.

The dominant recipe: simulation-first

Modern robot learning runs on GPU-accelerated physics simulators such as NVIDIA’s Isaac Gym / Isaac Lab, MuJoCo’s JAX-based MJX backend, and Brax, which simulate thousands of robot copies in parallel directly on the GPU instead of shuttling data through the CPU at every step. This collapses what once took weeks on real hardware into minutes of wall-clock training.

Simulation-first pipeline: train a policy across thousands of randomized simulated robots, then deploy the frozen policy on real hardware. Domain randomization makes the real robot look like just another sample to the policy.

Build a simulated twin

Model the robot’s kinematics, mass, actuators and the task in a physics engine. Fidelity matters but is never perfect — real friction, latency and motor dynamics are hard to capture exactly. This residual mismatch is the reality gap.

Train massively in parallel

Run thousands of robot instances on the GPU and optimize the policy with an algorithm like PPO or SAC. On-policy PPO dominates locomotion; off-policy SAC and Q-learning variants such as QT-Opt suit sample-limited or vision-based manipulation.

Randomize the domain

Randomize physics parameters — masses, friction, motor strength, sensor noise, latency, even visual textures — every episode. The policy must succeed across the whole distribution, so the real world looks like just one more random sample. This is domain randomization, the workhorse of sim-to-real.
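A minimal sketch of per-episode randomization, assuming uniform sampling over hand-picked ranges. The parameter names, the ranges and the `PhysicsParams` container are hypothetical; a real pipeline would pass the sampled values to its simulator's reset call.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    mass_scale: float            # multiplier on nominal link masses
    friction: float              # ground friction coefficient
    motor_strength_scale: float  # multiplier on available torque
    sensor_noise_std: float      # std of additive observation noise
    latency_steps: int           # control delay in simulation steps


# Hypothetical (low, high) ranges for the continuous parameters.
RANGES = {
    "mass_scale": (0.8, 1.2),
    "friction": (0.4, 1.25),
    "motor_strength_scale": (0.85, 1.15),
    "sensor_noise_std": (0.0, 0.05),
}
LATENCY_STEPS = (0, 3)  # inclusive integer range


def sample_params(rng: random.Random) -> PhysicsParams:
    """Draw one set of physics parameters for the next episode."""
    continuous = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    return PhysicsParams(latency_steps=rng.randint(*LATENCY_STEPS), **continuous)


def noisy_observation(true_state, params: PhysicsParams, rng: random.Random):
    """Corrupt the true state with the episode's sampled sensor noise."""
    return [x + rng.gauss(0.0, params.sensor_noise_std) for x in true_state]


rng = random.Random(0)
for episode in range(3):
    params = sample_params(rng)  # new physics every episode
    # A real pipeline would reset the simulator with `params` and roll out the policy here.
    print(episode, params)
```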

Deploy and (optionally) fine-tune

Freeze the policy and run it on the real robot. A robust policy transfers zero-shot; otherwise a short bout of real-world fine-tuning or online adaptation closes the remaining gap.

▶ Learning to Walk in Minutes Using Massively Parallel Deep RL — ANYmal trained in simulation (ETH Zürich)

Crossing the reality gap

No simulator matches reality exactly. A policy that overfits to simulator quirks — exploiting un-physical friction or perfect sensing — collapses on hardware. Four families of techniques bridge the gap, often combined:

Domain randomization

Train across a wide distribution of simulated dynamics and appearances so the policy is robust to the unknown true parameters. The most reliable and widely used technique; OpenAI’s Dactyl pushed it furthest. See domain randomization in robotics.

System identification

Carefully measure real parameters (mass, friction, motor curves) and tune the simulator to match. Reduces the gap directly but never fully — and must be redone per robot and per wear state.
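As a toy instance of tuning the simulator to match measurements, the sketch below fits a single friction coefficient by grid search so that a simulated sliding block reproduces a logged velocity trace. The block model, the grid and the synthetic "measured" data are assumptions for illustration; real system identification covers many coupled parameters.

```python
import random

G = 9.81   # gravitational acceleration, m/s^2
DT = 0.01  # simulation step, s


def simulate_slide(mu: float, v0: float, steps: int) -> list:
    """Simulated velocity of a block sliding to rest under Coulomb friction."""
    v, trace = v0, []
    for _ in range(steps):
        v = max(0.0, v - mu * G * DT)
        trace.append(v)
    return trace


def trajectory_error(mu: float, measured: list, v0: float) -> float:
    """Sum of squared differences between simulated and measured velocities."""
    simulated = simulate_slide(mu, v0, len(measured))
    return sum((s - m) ** 2 for s, m in zip(simulated, measured))


def identify_friction(measured: list, v0: float, lo=0.05, hi=1.5, n=291) -> float:
    """Pick the friction value whose simulated trace best matches the data."""
    candidates = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    return min(candidates, key=lambda mu: trajectory_error(mu, measured, v0))


# Stand-in for logged hardware data: true friction 0.6 plus sensor noise.
rng = random.Random(0)
measured = [v + rng.gauss(0.0, 0.01) for v in simulate_slide(0.6, v0=2.0, steps=30)]
mu_hat = identify_friction(measured, v0=2.0)
print(f"estimated friction: {mu_hat:.3f}")
```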

Domain adaptation

Use real data to adapt a sim-trained policy — fine-tuning, or learning a latent representation shared across sim and real. Bridges what randomization alone can’t.

Teacher–student (privileged learning)

Train a “teacher” in sim with access to privileged state (true terrain, contact forces), then distill it into a “student” that uses only real onboard sensors. The recipe behind robust legged locomotion.

OpenAI’s Dactyl showed how far randomization scales. Hand-tuning the randomization ranges was itself a bottleneck, so the team invented Automatic Domain Randomization (ADR): start with a single non-randomized environment, and automatically widen the randomization ranges every time the policy clears a performance threshold. The hand learned to manipulate and ultimately solve a Rubik’s cube one-handed — generalizing to disturbances it never saw in training, like being prodded with a stuffed giraffe.
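The widening rule described above can be sketched in a few lines. This is a simplified illustration, not OpenAI's implementation: it widens every range together whenever a success threshold is cleared, and the class, step sizes, threshold and the `toy_success_rate` stand-in for policy evaluation are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RandomizationRange:
    nominal: float
    half_width: float = 0.0      # 0.0 means "not randomized yet"
    step: float = 0.05
    max_half_width: float = 0.5

    def bounds(self):
        return (self.nominal - self.half_width, self.nominal + self.half_width)

    def widen(self):
        self.half_width = min(self.max_half_width, self.half_width + self.step)


def adr_update(ranges: dict, success_rate: float, threshold: float = 0.8) -> bool:
    """Widen every range by one step if the policy cleared the threshold."""
    if success_rate < threshold:
        return False
    for r in ranges.values():
        r.widen()
    return True


def toy_success_rate(ranges: dict, skill: float) -> float:
    """Hypothetical evaluation: wider ranges are harder, training adds skill."""
    difficulty = sum(r.half_width for r in ranges.values())
    return max(0.0, min(1.0, 0.75 + skill - difficulty))


ranges = {
    "friction": RandomizationRange(nominal=1.0),
    "mass_scale": RandomizationRange(nominal=1.0),
}
skill = 0.0
for iteration in range(30):
    skill += 0.03                       # pretend a round of training happened
    adr_update(ranges, toy_success_rate(ranges, skill))
print({name: r.bounds() for name, r in ranges.items()})
```

The effect is a curriculum: the environment distribution only gets harder once the policy has shown it can handle the current one.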

Go deeper: the teacher–student trick that makes legged robots robust

In simulation you know everything — exact ground height, friction at each foot, contact forces. On the real robot you know almost none of it. Teacher–student learning exploits this asymmetry. First, train a teacher policy with RL that reads privileged simulator state directly, so it learns near-optimal behavior fast. Then train a student policy via supervised imitation to reproduce the teacher’s actions using only the noisy onboard sensors a real robot actually has (joint encoders, IMU), typically with a recurrent or history-conditioned network that implicitly estimates the hidden terrain from a window of past observations. The student inherits the teacher’s skill but runs on real hardware. This two-stage approach underpins ETH Zürich’s ANYmal work and most modern quadruped and humanoid pipelines.
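The distillation stage can be sketched with a deliberately tiny example. Here the "teacher" is a fixed function of a privileged terrain slope (standing in for an RL-trained policy), and the "student" is a linear model over a short window of noisy readings rather than a recurrent network. All names, the noise level and the window length are assumptions for illustration.

```python
import random

WINDOW = 8        # hypothetical history length fed to the student
NOISE_STD = 0.05  # hypothetical onboard sensor noise


def teacher_action(slope: float) -> float:
    """Stand-in for an RL-trained teacher that reads the true (privileged) slope."""
    return 2.0 * slope


def sample_episode(rng: random.Random):
    """Return the privileged slope and a window of noisy onboard readings."""
    slope = rng.uniform(-0.3, 0.3)
    history = [slope + rng.gauss(0.0, NOISE_STD) for _ in range(WINDOW)]
    return slope, history


def student_action(weights, bias, history) -> float:
    """Student sees only the observation history, never the true slope."""
    return bias + sum(w * h for w, h in zip(weights, history))


def distill(steps: int = 5000, lr: float = 0.1, seed: int = 0):
    """Supervised imitation: regress the student's output onto the teacher's action."""
    rng = random.Random(seed)
    weights, bias = [0.0] * WINDOW, 0.0
    for _ in range(steps):
        slope, history = sample_episode(rng)
        error = student_action(weights, bias, history) - teacher_action(slope)
        # One SGD step on the squared imitation error.
        weights = [w - lr * error * h for w, h in zip(weights, history)]
        bias -= lr * error
    return weights, bias


def imitation_mse(weights, bias, episodes: int = 2000, seed: int = 1) -> float:
    """Mean squared gap between student and teacher on fresh episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        slope, history = sample_episode(rng)
        total += (student_action(weights, bias, history) - teacher_action(slope)) ** 2
    return total / episodes


weights, bias = distill()
print(f"student imitation MSE: {imitation_mse(weights, bias):.4f}")
```

The student ends up averaging its noisy window to estimate the hidden slope, which is a miniature version of the implicit terrain estimation described above.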

Algorithms that actually get used

The robotics community converged on a small set of workhorses, chosen for stability and fit to the available data budget rather than novelty:

Algorithm	Type	Where it shines	Why
PPO	On-policy	Locomotion, sim-trained whole-body control	Stable, parallelizes to thousands of envs, forgiving to tune
SAC	Off-policy	Real-world & sample-limited manipulation	Maximum-entropy exploration, very sample-efficient
TD3 / DDPG	Off-policy	Continuous control benchmarks	Deterministic continuous actions
QT-Opt	Off-policy Q-learning	Vision-based grasping at scale	Learns from huge logged datasets, closed-loop
Model-based RL	Learns a dynamics model	Data-scarce real-robot learning	Plans in a learned model, far fewer real samples

Robot actions are continuous (joint torques, end-effector velocities), so continuous-control methods and policy gradients dominate over discrete-action Q-learning. QT-Opt is a notable Q-learning exception: rather than discretizing the action space, it selects actions by maximizing the learned Q-function with a sampling-based optimizer (the cross-entropy method). For the data-scarcity problem specifically, model-based RL is attractive: learn a dynamics model from limited real data, then plan or train inside it.
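The QT-Opt paper describes choosing continuous actions by running the cross-entropy method (CEM) over a learned Q-function. The sketch below shows that idea on a one-dimensional toy problem; `toy_q_value` is a hypothetical stand-in for a trained Q-network, and the population size, elite count and iteration count are illustrative.

```python
import random
import statistics


def toy_q_value(state: float, action: float) -> float:
    """Hypothetical stand-in for a learned Q-network; peaks at action = 0.5 * state."""
    return -(action - 0.5 * state) ** 2


def cem_argmax(q_fn, state, low=-1.0, high=1.0, iterations=5,
               population=64, elites=6, seed=0) -> float:
    """Approximate argmax over a continuous action using the cross-entropy method."""
    rng = random.Random(seed)
    mean, std = (low + high) / 2.0, (high - low) / 2.0
    for _ in range(iterations):
        # Sample candidate actions and clip them to the valid range.
        candidates = [min(high, max(low, rng.gauss(mean, std)))
                      for _ in range(population)]
        # Keep the highest-Q candidates and refit the sampling distribution to them.
        best = sorted(candidates, key=lambda a: q_fn(state, a), reverse=True)[:elites]
        mean = statistics.fmean(best)
        std = statistics.pstdev(best) + 1e-6
    return mean


state = 0.8
print(f"CEM action: {cem_argmax(toy_q_value, state):.3f} (toy maximizer is 0.4)")
```

Because the optimizer only needs Q-values at sampled points, the same loop works for multi-dimensional actions by sampling vectors instead of scalars.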

Go deeper: why PPO won locomotion and SAC won manipulation

Locomotion training lives in simulation where you can run thousands of parallel environments, so sample efficiency barely matters — wall-clock throughput does. PPO’s on-policy updates parallelize trivially and are exceptionally stable, so it became the default for sim-trained walking and whole-body control. Manipulation that touches real hardware faces the opposite pressure: every sample is precious. SAC (and off-policy methods generally) reuse a replay buffer of past experience and add an entropy bonus that keeps exploration alive, squeezing far more learning from each real interaction. Same RL theory, opposite engineering constraint — and the on-policy vs off-policy trade-off explains the split.
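The replay-buffer reuse that gives off-policy methods their sample efficiency can be sketched as follows. The `ReplayBuffer` class, the placeholder transitions and the update-to-data ratio of 4 are hypothetical; an actual SAC learner would run a gradient step on each sampled batch.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of past transitions for off-policy learning."""

    def __init__(self, capacity: int, seed: int = 0):
        self.storage = deque(maxlen=capacity)  # oldest transitions fall off the front
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        """Uniformly sample a batch of stored transitions without replacement."""
        k = min(batch_size, len(self.storage))
        return self.rng.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=10_000)
gradient_updates = 0
for step in range(200):                          # 200 expensive real interactions
    buffer.add(step, 0.0, 0.0, step + 1, False)  # placeholder transition
    for _ in range(4):                           # hypothetical update-to-data ratio
        batch = buffer.sample(32)
        gradient_updates += 1                    # a learner would train on `batch` here
print(f"{len(buffer)} real steps fed {gradient_updates} updates")
```

An on-policy method such as PPO would instead discard each batch of experience after the update that used it, which is affordable only when fresh samples are cheap.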

RL vs imitation learning in robotics

Not all robot learning is RL. Imitation learning — training a policy to copy human or expert demonstrations — has surged with foundation-model policies, and the two paradigms are best understood as complementary.

	Reinforcement learning	Imitation learning
Signal	Reward from trial and error	Expert demonstrations
Can exceed the expert?	Yes — discovers novel strategies	No — bounded by demo quality
Data cost	Cheap in sim, dangerous on hardware	Expensive human teleoperation
Reward design	Required (and hard)	Not needed
Exemplars	ANYmal, Dactyl, QT-Opt	RT-1, RT-2, ALOHA, diffusion policies

Google DeepMind’s RT-1 and RT-2 vision-language-action models — trained by imitation on large robot datasets — show how powerful demonstration learning has become for generalist manipulation. But imitation can’t surpass its demonstrators and needs costly teleoperated data. The frontier increasingly combines them: pre-train a broad policy by imitation, then sharpen specific skills with RL, or use RL in simulation where demonstrations are scarce. See imitation and inverse RL.

Landmark results

2017

Domain randomization for transfer

Tobin et al. show policies trained on randomized simulated images transfer to real cameras — the seed of modern sim-to-real.

2018

QT-Opt grasps at scale

Google trains vision-based grasping on 580k real attempts across 7 robots, hitting 96% success on unseen objects with closed-loop regrasping.

2019

ANYmal learns agile skills

Hwangbo et al. (Science Robotics) train ANYmal in sim with a learned actuator model and transfer agile walking and fall-recovery to hardware.

2019

Dactyl solves a Rubik's cube

OpenAI’s robot hand, trained purely in sim with Automatic Domain Randomization, solves a Rubik’s cube one-handed.

2021

Learning to walk in minutes

Rudin et al. train ANYmal in Isaac Gym with 4096 parallel robots — flat-ground walking in under 4 minutes on one GPU.

2022–23

RT-1 / RT-2 generalist policies

DeepMind’s vision-language-action transformers bring web-scale knowledge to robot control via large-scale imitation.

2024–26

Agile humanoids

Teacher–student RL and sim-to-real drive whole-body humanoid skills (Unitree G1, agile walking and recovery) and RL-tuned foundation policies.

Where robot RL is in 2026

Three threads define the current frontier:

Humanoids go agile. Whole-body RL, much of it teacher–student plus aggressive sim-to-real, now drives dynamic walking, recovery and manipulation on platforms such as Unitree’s G1. Work such as ASAP explicitly aligns simulated and real physics to unlock agile whole-body skills.
Foundation-model policies meet RL. Generalist VLA policies (RT-2-class) are pre-trained by imitation, then increasingly fine-tuned with RL for reliability — combining broad semantic knowledge with reward-driven skill.
Better simulators and learned worlds. Isaac Lab, MJX and differentiable/learned world models keep shrinking both the reality gap and the wall-clock cost of training.

Researcher takes

Sergey Levine argues that the long-standing objection that RL is too sample-inefficient for physical robots is being overturned: combining offline pretraining with fast online fine-tuning makes real-world RL routine rather than heroic.


In a historical reflection on the pace of real-world RL for locomotion, Sergey Levine frames the two-orders-of-magnitude speedup in on-robot training time as the key shift making learned legged control practical.


Pieter Abbeel, one of the founders of modern deep RL for robotics, on how deep learning reshaped robot control:

▶ Pieter Abbeel — Deep Learning for Robotics (CMU RI Seminar)

Frequently asked questions

Why do robots train in simulation instead of the real world?

Deep RL needs millions to billions of trial-and-error steps. A real robot is slow (a few hundred attempts an hour), fragile (bad exploratory actions break hardware), and unsafe to let flail freely. GPU simulators run thousands of robots in parallel and compress weeks of real experience into minutes — so the policy is trained in sim and then transferred to hardware.

What is the sim-to-real gap?

The mismatch between a physics simulator and the real world — friction, motor dynamics, sensor noise and latency are never modeled perfectly. A policy that overfits simulator quirks fails on hardware. Domain randomization, system identification, domain adaptation and teacher–student learning are the main techniques for crossing this gap. See the sim-to-real survey.

Is reinforcement learning or imitation learning better for robots?

Neither: they solve different problems. RL discovers behavior from reward and can exceed any human demonstrator, but needs reward design and lots of (usually simulated) experience. Imitation learning avoids reward design and learns efficiently from each demonstration, but it is bounded by demonstration quality and the demonstrations themselves are costly to collect. Modern systems increasingly combine them: imitate to bootstrap broad skill, then apply RL to sharpen and surpass it. See imitation and inverse RL.

Which RL algorithm should I use for a robot?

For sim-trained locomotion and whole-body control, PPO is the default — stable and trivially parallel. For real-world or sample-limited manipulation, off-policy SAC or TD3 squeeze more from scarce data. When real data is extremely limited, consider model-based RL. All operate in continuous action spaces.

Key papers

Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (Tobin et al., 2017): the sim-to-real seed.
QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation (Kalashnikov et al., 2018): grasping at scale.
Learning Agile and Dynamic Motor Skills for Legged Robots (Hwangbo et al., 2019): ANYmal in Science Robotics.
Solving Rubik’s Cube with a Robot Hand (OpenAI, 2019): Dactyl and Automatic Domain Randomization.
Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning (Rudin et al., 2021): GPU-parallel locomotion.
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Brohan et al., 2023): generalist robot policies.
Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey (Zhao et al., 2020).
