reinforcement-learning.com

Reinforcement Learning in Robotics

How RL teaches real robots to walk, grasp and manipulate — the sim-to-real pipeline, domain randomization, teacher-student training, key results, and 2026 trends.

Updated 2026-06-07
Key takeaways
  • RL lets robots learn motor skills — walking, grasping, dexterous manipulation — that are too complex to hand-code, by optimizing a reward through trial and error.
  • The dominant recipe is train in simulation, deploy on hardware: GPU physics simulators run thousands of robots in parallel, then policies cross the reality gap via domain randomization.
  • Real robots are slow, fragile and expensive, so most RL happens in sim; the central challenge is the sim-to-real gap, not the learning algorithm itself.
  • Landmark results (ANYmal walking, OpenAI’s Dactyl solving a Rubik’s cube, Google’s QT-Opt grasping) proved RL works on hardware; work from 2024 to 2026 pushes toward agile humanoids and foundation-model policies.

What is reinforcement learning in robotics?

Reinforcement learning in robotics is the use of trial-and-error optimization to teach a physical robot a control policy — a function mapping sensor readings (joint angles, camera images, IMU data) to motor commands. Instead of an engineer hand-tuning a controller, the robot (or a simulated copy of it) explores actions, receives a scalar reward for good behavior, and gradually shifts its policy toward higher reward. It is the application of the core RL loop — agent, environment, reward — to the messiest possible environment: the real physical world.

Robotics is one of RL’s oldest and hardest proving grounds. The math is the same as everywhere else in RL, but the constraints are brutal: actions take real time, mistakes break expensive hardware, sensors are noisy, and you cannot rewind the world. These constraints reshape how RL is done — and explain why simulation sits at the center of nearly every modern robot-learning pipeline.

[Figure: the policy (neural net π_θ) sends action a(t) (motor torque) to the robot + world (physics / dynamics), which returns state s(t+1) and reward r(t+1); the sensors → actuators loop runs at ~50–1000 Hz.]
The robot control loop as an MDP: the policy reads state s(t) from sensors, sends an action a(t) to the actuators, and the physical world returns the next state s(t+1) and a reward r(t+1).
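
As a minimal sketch of this loop, the Python below wires a stand-in policy to a toy one-joint environment. `ToyJointEnv`, `proportional_policy` and every constant are hypothetical illustrations, not part of any robot stack; a real system would replace the hand-written controller with a learned network π_θ and the toy dynamics with a simulator or hardware.

```python
import random


class ToyJointEnv:
    """Hypothetical 1-D 'robot': one joint that should reach a target angle."""

    def __init__(self, target=1.0, dt=0.02, seed=0):
        self.target = target
        self.dt = dt                      # 0.02 s per step, i.e. a 50 Hz control loop
        self.rng = random.Random(seed)
        self.angle = 0.0

    def reset(self):
        self.angle = 0.0
        return self._observe()

    def _observe(self):
        # Sensors are noisy: the policy never sees the exact angle.
        return self.angle + self.rng.gauss(0.0, 0.01)

    def step(self, action):
        # action = commanded joint velocity, clipped like an actuator limit
        velocity = max(-2.0, min(2.0, action))
        self.angle += velocity * self.dt
        reward = -abs(self.target - self.angle)   # closer to target = higher reward
        return self._observe(), reward


def proportional_policy(observation, target=1.0, gain=5.0):
    """Stand-in for pi_theta: maps a sensor reading to a motor command."""
    return gain * (target - observation)


def run_episode(env, policy, steps=200):
    obs = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        action = policy(obs)              # a(t)
        obs, reward = env.step(action)    # s(t+1), r(t+1)
        total_reward += reward
    return total_reward, env.angle


episode_return, final_angle = run_episode(ToyJointEnv(), proportional_policy)
```

RL replaces the fixed `proportional_policy` with a parameterized one and adjusts its parameters to raise `episode_return`.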

Why robotics makes RL hard

The same properties that make robots useful make them hostile to naive RL. A simulated Atari agent can play millions of games overnight; a real robot arm can attempt perhaps a few hundred grasps per hour, and a single bad command from the policy can snap a gripper.

  • 580k: real grasp attempts QT-Opt needed (7 robots, 4 months)
  • < 4 min: to train flat-ground walking in sim (4096 robots, 1 GPU)
  • 10⁴–10⁵: robots simulated in parallel on a single modern GPU

Four structural challenges dominate the field:

  • Sample inefficiency meets scarce data. Deep RL famously needs millions to billions of environment steps. On hardware that is impractical, dangerous and slow.
  • Safety and fragility. Exploration means trying bad actions. On a real robot a bad action can destroy hardware, the environment, or a person.
  • Partial, noisy observation. Cameras lag, joint encoders drift, contact is hard to sense. The true state is rarely fully observable — a POMDP, not a clean MDP.
  • Reward design. “Walk forward without falling” hides dozens of sub-goals (don’t waste energy, keep feet from slipping, stay upright). Sparse rewards barely train; dense rewards invite reward hacking.
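
The reward-design challenge in the last bullet can be made concrete with a small sketch. The function below is a hypothetical dense reward for one control step of a walking robot; the term names and all weights are illustrative assumptions, not values from any published controller.

```python
def walking_reward(forward_velocity, torques, foot_slip_speeds, torso_tilt_rad,
                   fell_over, target_velocity=1.0):
    """Dense reward for one control step; all weights are illustrative guesses."""
    if fell_over:
        return -10.0                                       # sparse terminal penalty

    tracking = -abs(target_velocity - forward_velocity)    # follow the commanded speed
    energy = -0.001 * sum(t * t for t in torques)          # don't waste energy
    slip = -0.1 * sum(foot_slip_speeds)                    # keep stance feet planted
    upright = -0.5 * abs(torso_tilt_rad)                   # stay upright
    return 1.0 + tracking + energy + slip + upright        # +1.0 is an alive bonus


zeros12, zeros4 = [0.0] * 12, [0.0] * 4
ideal_step = walking_reward(1.0, zeros12, zeros4, 0.0, fell_over=False)
standing_still = walking_reward(0.0, zeros12, zeros4, 0.0, fell_over=False)
fall = walking_reward(1.0, zeros12, zeros4, 0.0, fell_over=True)
```

With these weights, standing still scores 0 per step while a fall scores -10, so a policy can do tolerably well by never moving. That is a mild form of the reward hacking the bullet warns about, and rebalancing the terms is part of the design work.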

The single biggest consequence: almost all robot RL trains in simulation and transfers to hardware. The hard problem becomes crossing the gap between the simulator and reality.

The dominant recipe: simulation-first

Modern robot learning runs on GPU-accelerated physics simulators such as NVIDIA’s Isaac Gym and its successor Isaac Lab, MuJoCo (through its JAX-based MJX backend) and Brax. They simulate thousands of robot copies in parallel directly on the GPU, avoiding the CPU-to-GPU data transfer that bottlenecks conventional simulators. This collapses what once took weeks on real hardware into minutes of wall-clock training.

[Figure: simulation (GPU), thousands of randomized robots → RL training (PPO / SAC) → frozen policy π_θ, deployed as-is → real robot, zero-shot or fine-tune.]
Simulation-first pipeline: train a policy across thousands of randomized simulated robots, then deploy the frozen policy on real hardware. Domain randomization makes the real robot look like just another sample to the policy.
1. Build a simulated twin

Model the robot’s kinematics, mass, actuators and the task in a physics engine. Fidelity matters but is never perfect — real friction, latency and motor dynamics are hard to capture exactly. This residual mismatch is the reality gap.

2. Train massively in parallel

Run thousands of robot instances on the GPU and optimize the policy with an algorithm like PPO or SAC. On-policy PPO dominates locomotion; off-policy methods such as SAC and QT-Opt-style Q-learning suit sample-limited or vision-based manipulation.

3. Randomize the domain

Randomize physics parameters — masses, friction, motor strength, sensor noise, latency, even visual textures — every episode. The policy must succeed across the whole distribution, so the real world looks like just one more random sample. This is domain randomization, the workhorse of sim-to-real.

4. Deploy and (optionally) fine-tune

Freeze the policy and run it on the real robot. A robust policy transfers zero-shot; otherwise a short bout of real-world fine-tuning or online adaptation closes the remaining gap.
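
The randomization step in this recipe can be sketched in a few lines. The parameter names, ranges and the `PhysicsParams` container below are hypothetical; real ranges are tuned per robot, and a simulator would consume these values through its own configuration API.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    mass_kg: float
    friction: float
    motor_strength: float     # multiplier on commanded torque
    sensor_noise_std: float
    latency_steps: int        # control steps of actuation delay


# Hypothetical randomization ranges (low, high).
RANGES = {
    "mass_kg": (8.0, 12.0),
    "friction": (0.4, 1.2),
    "motor_strength": (0.8, 1.2),
    "sensor_noise_std": (0.0, 0.05),
}


def sample_physics(rng):
    """Draw one randomized robot; called at the start of every episode."""
    values = {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    return PhysicsParams(latency_steps=rng.randint(0, 3), **values)


rng = random.Random(0)
# One independent parameter set per parallel simulated robot.
batch = [sample_physics(rng) for _ in range(4096)]
```

Because each of the 4096 simulated robots gets its own draw, a policy that scores well on the whole batch cannot rely on any single value of mass, friction or latency.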

[Video: Learning to Walk in Minutes Using Massively Parallel Deep RL, showing ANYmal trained in simulation (ETH Zürich)]

Crossing the reality gap

No simulator matches reality exactly. A policy that overfits to simulator quirks — exploiting un-physical friction or perfect sensing — collapses on hardware. Four families of techniques bridge the gap, often combined:

Domain randomization

Train across a wide distribution of simulated dynamics and appearances so the policy is robust to the unknown true parameters. The most reliable and widely used technique; OpenAI’s Dactyl pushed it furthest. See domain randomization in robotics.

System identification

Carefully measure real parameters (mass, friction, motor curves) and tune the simulator to match. Reduces the gap directly but never fully — and must be redone per robot and per wear state.
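
As a minimal illustration of system identification, the sketch below fits one simulator parameter, a motor gain, to logged data by least squares. The "real robot" log here is synthetic, and `fit_gain`, `TRUE_GAIN` and the noise level are assumptions made for the example; real identification typically fits many coupled parameters at once.

```python
import random


def fit_gain(commands, measurements):
    """Least-squares fit of y = k * x through the origin: k = sum(x*y) / sum(x*x)."""
    numerator = sum(x * y for x, y in zip(commands, measurements))
    denominator = sum(x * x for x in commands)
    return numerator / denominator


# Synthetic 'real robot' log: the true motor delivers 85% of the commanded torque.
rng = random.Random(0)
TRUE_GAIN = 0.85
commanded = [rng.uniform(-5.0, 5.0) for _ in range(500)]
measured = [TRUE_GAIN * u + rng.gauss(0.0, 0.1) for u in commanded]

estimated_gain = fit_gain(commanded, measured)   # value to plug into the simulator
```

The estimate only describes the robot that produced the log, which is why the fit has to be repeated per robot and as hardware wears.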

Domain adaptation

Use real data to adapt a sim-trained policy — fine-tuning, or learning a latent representation shared across sim and real. Bridges what randomization alone can’t.

Teacher–student (privileged learning)

Train a “teacher” in sim with access to privileged state (true terrain, contact forces), then distill it into a “student” that uses only real onboard sensors. The recipe behind robust legged locomotion.

OpenAI’s Dactyl showed how far randomization scales. Hand-tuning the randomization ranges was itself a bottleneck, so the team invented Automatic Domain Randomization (ADR): start with a single non-randomized environment, and automatically widen the randomization ranges every time the policy clears a performance threshold. The hand learned to manipulate and ultimately solve a Rubik’s cube one-handed — generalizing to disturbances it never saw in training, like being prodded with a stuffed giraffe.
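
The core idea of ADR as described above (start with no randomization, widen the ranges whenever the policy clears a threshold) can be sketched as follows. This is a deliberately simplified illustration, not OpenAI’s implementation: the function name, the threshold, the step size, the parameter names and the hard-coded success rates are all assumptions.

```python
def adr_update(ranges, nominal, performance, threshold=0.8, step=0.05):
    """Widen every randomization range by `step` (relative to nominal) on success."""
    if performance < threshold:
        return ranges                       # not good enough yet: keep training
    widened = {}
    for name, (low, high) in ranges.items():
        delta = abs(nominal[name]) * step
        widened[name] = (low - delta, high + delta)
    return widened


# Hypothetical nominal values; start from a single non-randomized environment.
nominal = {"friction": 1.0, "cube_size_m": 0.057}
ranges = {name: (value, value) for name, value in nominal.items()}

# Stand-in for measured success rates after successive training rounds.
for success_rate in [0.9, 0.5, 0.85, 0.95]:
    ranges = adr_update(ranges, nominal, success_rate)
```

Three of the four rounds clear the threshold, so the friction range grows from the single value 1.0 to roughly (0.85, 1.15) without anyone choosing that range by hand.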

Go deeper: the teacher–student trick that makes legged robots robust

In simulation you know everything — exact ground height, friction at each foot, contact forces. On the real robot you know almost none of it. Teacher–student learning exploits this asymmetry. First, train a teacher policy with RL that reads privileged simulator state directly, so it learns near-optimal behavior fast. Then train a student policy via supervised imitation to reproduce the teacher’s actions using only the noisy onboard sensors a real robot actually has (joint encoders, IMU), typically with a recurrent or history-conditioned network that implicitly estimates the hidden terrain from a window of past observations. The student inherits the teacher’s skill but runs on real hardware. This two-stage approach underpins ETH Zürich’s ANYmal work and most modern quadruped and humanoid pipelines.
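
The distillation stage can be sketched with a toy problem. Below, a hypothetical teacher acts on a privileged terrain slope, while a linear student sees only a window of noisy readings and is trained by supervised regression onto the teacher’s actions. Everything here (the slope variable, the noise level, the linear student, the learning rate) is an assumption chosen to keep the example small; real pipelines use neural networks and train the teacher with RL first.

```python
import random

HISTORY = 8          # number of past sensor readings the student sees
rng = random.Random(0)


def teacher_policy(terrain_slope):
    """Teacher reads privileged simulator state directly (assumed already trained)."""
    return 2.0 * terrain_slope


def noisy_history(terrain_slope):
    """What a real robot has: a window of noisy onboard readings."""
    return [terrain_slope + rng.gauss(0.0, 0.3) for _ in range(HISTORY)]


def student_policy(weights, history):
    """Linear student: acts from the observation window only."""
    return sum(w * o for w, o in zip(weights, history))


# Supervised distillation: stochastic gradient descent on squared action error.
weights = [0.0] * HISTORY
learning_rate = 0.01
for _ in range(20000):
    slope = rng.uniform(-1.0, 1.0)            # a new simulated terrain each sample
    history = noisy_history(slope)
    error = student_policy(weights, history) - teacher_policy(slope)
    weights = [w - learning_rate * error * o for w, o in zip(weights, history)]
```

The student never receives `slope` as an input, yet it learns to average its noisy window into an implicit slope estimate, which is the role the history-conditioned network plays on a real robot.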

Algorithms that actually get used

The robotics community converged on a small set of workhorses, chosen for stability and for fitting the data budget at hand (raw throughput in simulation, sample efficiency on hardware) rather than for novelty:

Algorithm | Type | Where it shines | Why
PPO | On-policy | Locomotion, sim-trained whole-body control | Stable, parallelizes to thousands of envs, forgiving to tune
SAC | Off-policy | Real-world & sample-limited manipulation | Maximum-entropy exploration, very sample-efficient
TD3 / DDPG | Off-policy | Continuous control benchmarks | Deterministic continuous actions
QT-Opt | Off-policy Q-learning | Vision-based grasping at scale | Learns from huge logged datasets, closed-loop
Model-based RL | Learns a dynamics model | Data-scarce real-robot learning | Plans in a learned model, far fewer real samples

Robot actions are continuous (joint torques, end-effector velocities), so continuous-control methods and policy gradients dominate over discrete-action Q-learning. QT-Opt is a notable Q-learning exception: rather than discretizing, it handles continuous actions by maximizing the learned Q-function with a sampling-based optimizer, the cross-entropy method (CEM), at each step. For the data-scarcity problem specifically, model-based RL is attractive: learn a dynamics model from limited real data, then plan or train inside it.
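
QT-Opt-style action selection in a continuous action space can be sketched with the cross-entropy method (CEM), the sampling-based optimizer QT-Opt uses to maximize its Q-function. The `q_value` function below is a hypothetical stand-in for a learned Q-network, and the iteration, sample and elite counts are illustrative choices.

```python
import random
import statistics


def q_value(state, action):
    """Hypothetical stand-in for a learned Q-network: peaks at action = 0.5 * state."""
    return -(action - 0.5 * state) ** 2


def cem_argmax(state, rng, iterations=3, samples=64, elites=6):
    """Approximate argmax over a 1-D continuous action of Q(state, action)."""
    mean, std = 0.0, 1.0
    for _ in range(iterations):
        # Sample candidate actions from the current Gaussian.
        candidates = [rng.gauss(mean, std) for _ in range(samples)]
        # Keep the candidates with the highest Q-values.
        candidates.sort(key=lambda a: q_value(state, a), reverse=True)
        best = candidates[:elites]
        # Refit the Gaussian to the elites and repeat.
        mean = statistics.mean(best)
        std = statistics.stdev(best) + 1e-6    # avoid collapsing to zero width
    return mean


best_action = cem_argmax(state=1.2, rng=random.Random(0))
```

For `state=1.2` the stand-in Q-function peaks at an action of 0.6, and the search converges close to that value without ever enumerating a discrete action set.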

Go deeper: why PPO won locomotion and SAC won manipulation

Locomotion training lives in simulation where you can run thousands of parallel environments, so sample efficiency barely matters — wall-clock throughput does. PPO’s on-policy updates parallelize trivially and are exceptionally stable, so it became the default for sim-trained walking and whole-body control. Manipulation that touches real hardware faces the opposite pressure: every sample is precious. SAC (and off-policy methods generally) reuse a replay buffer of past experience and add an entropy bonus that keeps exploration alive, squeezing far more learning from each real interaction. Same RL theory, opposite engineering constraint — and the on-policy vs off-policy trade-off explains the split.
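
The replay buffer that gives off-policy methods their sample efficiency is a simple data structure. The sketch below is a generic minimal version, not SAC’s full training loop; the class and the integer placeholder transitions are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)   # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=100)
for step in range(150):                         # 150 'real' interactions
    buffer.add(state=step, action=0.0, reward=0.0, next_state=step + 1, done=False)

# Many gradient updates can reuse the same stored experience.
batches = [buffer.sample(32) for _ in range(50)]
```

Here 50 updates draw 1600 samples from only 100 stored transitions, so each real interaction is reused many times. An on-policy method like PPO instead discards its data after each policy update, which is affordable only when simulation makes fresh data nearly free.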

RL vs imitation learning in robotics

Not all robot learning is RL. Imitation learning — training a policy to copy human or expert demonstrations — has surged with foundation-model policies, and the two paradigms are best understood as complementary.

Aspect | Reinforcement learning | Imitation learning
Signal | Reward from trial and error | Expert demonstrations
Can exceed the expert? | Yes: discovers novel strategies | No: bounded by demo quality
Data cost | Cheap in sim, dangerous on hardware | Expensive human teleoperation
Reward design | Required (and hard) | Not needed
Exemplars | ANYmal, Dactyl, QT-Opt | RT-1, RT-2, ALOHA, diffusion policies

Google’s RT-1 robotics transformer and Google DeepMind’s RT-2 vision-language-action model, both trained largely by imitation on large robot datasets, show how powerful demonstration learning has become for generalist manipulation. But imitation can’t surpass its demonstrators and needs costly teleoperated data. The frontier increasingly combines them: pre-train a broad policy by imitation, then sharpen specific skills with RL, or use RL in simulation where demonstrations are scarce. See imitation and inverse RL.

Landmark results

  • 2017: Domain randomization for transfer. Tobin et al. show policies trained on randomized simulated images transfer to real cameras, the seed of modern sim-to-real.
  • 2018: QT-Opt grasps at scale. Google trains vision-based grasping on 580k real attempts across 7 robots, hitting 96% success on unseen objects with closed-loop regrasping.
  • 2019: ANYmal learns agile skills. Hwangbo et al. (Science Robotics) train ANYmal in sim with a learned actuator model and transfer agile walking and fall-recovery to hardware.
  • 2019: Dactyl solves a Rubik’s cube. OpenAI’s robot hand, trained purely in sim with Automatic Domain Randomization, solves a Rubik’s cube one-handed.
  • 2021: Learning to walk in minutes. Rudin et al. train ANYmal in Isaac Gym with 4096 parallel robots, reaching flat-ground walking in under 4 minutes on one GPU.
  • 2022–23: RT-1 / RT-2 generalist policies. DeepMind’s vision-language-action transformers bring web-scale knowledge to robot control via large-scale imitation.
  • 2024–26: Agile humanoids. Teacher–student RL and sim-to-real drive whole-body humanoid skills (Unitree G1, agile walking and recovery) and RL-tuned foundation policies.

Where robot RL is in 2026

Three threads define the current frontier:

  • Humanoids go agile. Whole-body RL — much of it teacher–student plus aggressive sim-to-real — now drives dynamic walking, recovery and manipulation on platforms like Unitree’s G1 and others. Work such as ASAP explicitly aligns simulated and real physics to unlock agile whole-body skills.
  • Foundation-model policies meet RL. Generalist VLA policies (RT-2-class) are pre-trained by imitation, then increasingly fine-tuned with RL for reliability — combining broad semantic knowledge with reward-driven skill.
  • Better simulators and learned worlds. Isaac Lab, MJX and differentiable/learned world models keep shrinking both the reality gap and the wall-clock cost of training.

Researcher takes

Sergey Levine argues that the long-standing objection that RL is too sample-inefficient for physical robots is being overturned: combining offline pretraining with fast online fine-tuning makes real-world RL routine rather than heroic.

In a historical reflection on the pace of real-world RL for locomotion, Levine frames the two-orders-of-magnitude speedup in on-robot training time as the key shift making learned legged control practical.

Pieter Abbeel, one of the founders of modern deep RL for robotics, on how deep learning reshaped robot control:

[Video: Pieter Abbeel, Deep Learning for Robotics (CMU RI Seminar)]

Frequently asked questions

Why do robots train in simulation instead of the real world?

Deep RL needs millions to billions of trial-and-error steps. A real robot is slow (a few hundred attempts an hour), fragile (bad exploratory actions break hardware), and unsafe to let flail freely. GPU simulators run thousands of robots in parallel and compress weeks of real experience into minutes — so the policy is trained in sim and then transferred to hardware.

What is the sim-to-real gap?

The mismatch between a physics simulator and the real world — friction, motor dynamics, sensor noise and latency are never modeled perfectly. A policy that overfits simulator quirks fails on hardware. Domain randomization, system identification, domain adaptation and teacher–student learning are the main techniques for crossing this gap. See the sim-to-real survey.

Is reinforcement learning or imitation learning better for robots?

Neither — they solve different problems. RL discovers behavior from reward and can exceed any human demonstrator, but needs reward design and lots of (usually simulated) experience. Imitation learning is data-efficient and avoids reward design but is bounded by demonstration quality. Modern systems increasingly combine them: imitate to bootstrap broad skill, then RL to sharpen and surpass. See imitation and inverse RL.

Which RL algorithm should I use for a robot?

For sim-trained locomotion and whole-body control, PPO is the default — stable and trivially parallel. For real-world or sample-limited manipulation, off-policy SAC or TD3 squeeze more from scarce data. When real data is extremely limited, consider model-based RL. All operate in continuous action spaces.
