- Hierarchical RL (HRL) decomposes a long task into a high-level policy that picks subgoals or skills and a low-level policy that executes them — control at two (or more) timescales.
- The core abstraction is the option: a temporally extended action with its own policy, an initiation set, and a termination condition. An MDP equipped with options behaves as a semi-Markov decision process.
- Temporal abstraction buys two things flat RL struggles with: long-horizon credit assignment and structured exploration. That is why HRL made early headway on sparse-reward tasks like Montezuma's Revenge.
- Landmark methods: options (1999), MAXQ (2000), h-DQN and FeUdal Networks (2016–17), option-critic (2017), HIRO (2018). In 2026 the same ideas reappear as planner/executor stacks in LLM agents.
What is hierarchical reinforcement learning?
Hierarchical reinforcement learning (HRL) is reinforcement learning where the agent decides at more than one timescale. Instead of choosing a single primitive action every step — “move left,” “fire thruster,” “emit token” — a high-level policy chooses a subgoal or a skill that runs for many steps, and a low-level policy figures out the primitive actions to carry it out. Think of a chef who decides “make the sauce” (high level) and only then worries about each whisk stroke (low level).
This is the standard answer to RL’s curse of long horizons. A flat agent solving a task that takes 1,000 steps has to propagate reward back across 1,000 decisions and explore an astronomically large space of action sequences. An agent that thinks in terms of ten reusable skills, each ~100 steps long, faces a far shorter planning problem at the top and ten well-shaped sub-problems at the bottom. HRL is the machinery for learning both levels — and the interface between them.
Why HRL exists: the two payoffs
A flat agent in a sparse-reward task is nearly blind. In Atari’s Montezuma’s Revenge — climb a ladder, grab a key, cross to a door — random exploration almost never stumbles onto the first reward, so vanilla DQN scores roughly zero. The 2016 h-DQN agent, which learned a meta-controller that proposes subgoals (“reach the key”) and a controller that pursues them with intrinsic reward, was among the first deep agents to make real progress on that game. The lift comes from two distinct benefits of temporal abstraction:
- Long-horizon credit assignment. With options, a single high-level decision spans many steps, so reward propagates across far fewer decision points. A semi-MDP Bellman update jumps over the whole option instead of crawling back one primitive step at a time: the difference between credit flowing over 10 hops versus 1,000.
- Structured exploration. Exploring with skills means each “random” choice is a coherent, multi-step behavior (“go to the door”) rather than a single jitter. The agent covers the state space in meaningful strides, which is exactly what sparse-reward and long-horizon tasks demand.
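The "10 hops versus 1,000" claim can be checked on a toy problem. The sketch below is an illustration only: the chain environment, the function name `sweeps_until_start_sees_reward`, and the choice of synchronous backups are all assumptions made here, not part of any published method. It counts how many backup sweeps are needed before the start state of a 1,000-step chain sees the reward at the far end, first with primitive steps and then with 100-step options.

```python
def sweeps_until_start_sees_reward(n_states: int, jump: int, gamma: float = 0.99):
    """Count synchronous backup sweeps until V[0] becomes nonzero on a chain.

    Hypothetical toy setup: states 0..n_states lie on a line, state n_states is
    terminal, and the only reward (+1) is paid on the step that reaches it.
    Each decision moves `jump` states to the right, so jump=1 models primitive
    actions and jump=100 models a 100-step option.
    """
    values = [0.0] * (n_states + 1)  # values[n_states] is terminal and stays 0
    sweeps = 0
    while values[0] == 0.0:
        new_values = values[:]
        for s in range(0, n_states, jump):  # back up decision points only
            nxt = min(s + jump, n_states)
            steps = nxt - s
            # The reward arrives on the last primitive step of the decision.
            reward = gamma ** (steps - 1) if nxt == n_states else 0.0
            # SMDP-style backup: discount by gamma**steps across the whole jump.
            new_values[s] = reward + (gamma ** steps) * values[nxt]
        values = new_values
        sweeps += 1
    return sweeps, values[0]


flat_sweeps, flat_value = sweeps_until_start_sees_reward(1000, jump=1)
option_sweeps, option_value = sweeps_until_start_sees_reward(1000, jump=100)
print(flat_sweeps, option_sweeps)  # 1000 10
```

Both runs end with the same start-state value, because the option backup discounts by the full number of elapsed steps. Only the number of backups needed to get there changes.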
The options framework
The dominant formalism for HRL is the options framework of Sutton, Precup and Singh (1999), introduced in their paper Between MDPs and Semi-MDPs. An option generalizes the notion of an action into a closed-loop, temporally extended behavior defined by three parts:
- Initiation set $\mathcal{I} \subseteq \mathcal{S}$. The set of states where the option is available to be started. “Open the door” can only begin when you are standing near a door.
- Intra-option policy $\pi$. While the option runs, it chooses primitive actions according to its own policy $\pi(a \mid s)$. This is the skill itself: the actual behavior, like the sequence of moves that walks you to the door.
- Termination condition $\beta$. At each state $s$ the option ends with probability $\beta(s)$. When it terminates, control returns to the high-level policy over options, which picks the next option.
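The three parts map naturally onto a small data structure. What follows is a minimal sketch, not a reference implementation: the `Option` class, the `run_option` helper, the corridor environment, and the `step(state, action) -> (next_state, reward)` interface are all hypothetical names chosen for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

State = int
Action = int


@dataclass(frozen=True)
class Option:
    name: str
    can_start: Callable[[State], bool]          # initiation set, as a predicate
    policy: Callable[[State], Action]           # intra-option policy
    termination_prob: Callable[[State], float]  # probability of ending in a state


def run_option(
    option: Option,
    state: State,
    step: Callable[[State, Action], Tuple[State, float]],
    gamma: float = 0.99,
    rng: Optional[random.Random] = None,
    max_steps: int = 1000,
) -> Tuple[float, State, int]:
    """Run an option to termination; return (discounted reward, final state, steps)."""
    if not option.can_start(state):
        raise ValueError(f"option {option.name!r} cannot start in state {state}")
    rng = rng or random.Random(0)
    total, discount, k = 0.0, 1.0, 0
    while k < max_steps:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        k += 1
        if rng.random() < option.termination_prob(state):
            break
    return total, state, k


# Hypothetical corridor: states 0..10, the door is at state 10.
DOOR = 10


def corridor_step(state: State, action: Action) -> Tuple[State, float]:
    next_state = max(0, min(DOOR, state + action))
    reward = 1.0 if next_state == DOOR and state != DOOR else 0.0
    return next_state, reward


go_to_door = Option(
    name="go_to_door",
    can_start=lambda s: s < DOOR,                       # cannot start at the door
    policy=lambda s: +1,                                # always walk right
    termination_prob=lambda s: 1.0 if s == DOOR else 0.0,
)

total, final_state, steps = run_option(go_to_door, 3, corridor_step)
print(final_state, steps)  # 10 7
```

Returning the discounted reward and the step count together is a deliberate choice: those are exactly the quantities a policy over options needs in order to treat the whole run as a single transition.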
A primitive action is just a special one-step option (it terminates after one step and is available wherever the action is). Crucially, once you define options, the top-level problem of choosing among options is no longer a Markov decision process. It is a semi-Markov decision process (SMDP), because options take variable amounts of time. The SMDP Bellman equation for the value of a state $s$ under a policy over options $\mu$ is:
$$V^{\mu}(s) = \mathbb{E}\left[\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} V^{\mu}(s') \;\middle|\; s_t = s,\ o \sim \mu(\cdot \mid s) \,\right]$$
where the option $o$ runs for $k$ steps before terminating in state $s'$. The single jump over the option is the formal source of HRL’s better credit assignment: the discount $\gamma^{k}$ leaps over the whole behavior instead of decaying one step at a time. Because options and primitive actions share the same interface, you can mix them freely in planning (dynamic programming) and learning (Q-learning, actor-critic). See Markov decision processes for the flat baseline this extends.
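In learning form, the same jump gives the SMDP Q-learning update over options. The snippet below is a minimal tabular sketch. The function name `smdp_q_update`, the dictionary-based table, and the example numbers are assumptions made for illustration. It takes the discounted reward accumulated while the option ran and the number of steps the option lasted.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def smdp_q_update(
    q: QTable,
    state: Hashable,
    option: Hashable,
    discounted_reward: float,
    next_state: Hashable,
    k: int,
    options: Sequence[Hashable],
    alpha: float = 0.1,
    gamma: float = 0.99,
) -> float:
    """One SMDP Q-learning step for an option that ran k primitive steps.

    discounted_reward is r_1 + gamma * r_2 + ... + gamma**(k-1) * r_k,
    collected while the option was executing.
    """
    best_next = max(q[(next_state, o)] for o in options)
    # The bootstrap term is discounted by gamma**k, not gamma: one jump per option.
    target = discounted_reward + (gamma ** k) * best_next
    q[(state, option)] += alpha * (target - q[(state, option)])
    return q[(state, option)]


q: QTable = defaultdict(float)
q[("door", "open_door")] = 2.0
updated = smdp_q_update(
    q, state="hall", option="go_to_door", discounted_reward=0.5,
    next_state="door", k=7, options=["go_to_door", "open_door"],
    alpha=0.5, gamma=0.9,
)
```

With `k=1` this reduces to ordinary one-step Q-learning, which matches the observation that a primitive action is a one-step option.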
Three foundational architectures
HRL has three classic backbones, each answering “where does the structure come from?” differently.
| Approach | High-level signal | What is learned | State abstraction | Reference |
|---|---|---|---|---|
| Options / SMDP | which option to run | policy over options (options often hand-given) | optional | Sutton, Precup, Singh 1999 |
| MAXQ | which subtask to invoke | recursive value decomposition over a task graph | five safe-abstraction conditions | Dietterich 2000 |
| Feudal / goal-conditioned | a goal/direction for the worker | both levels, end-to-end | the goal is the abstraction | Dayan and Hinton 1993; FeUdal 2017; HIRO 2018 |
MAXQ (Dietterich 2000) takes a programmer-specified task hierarchy (“get passenger” calls “navigate” calls “move”) and decomposes the value function into an additive sum of each subtask’s contribution. Its MAXQ-Q algorithm converges to a recursively optimal policy (each subtask optimal given its children). The paper also gives five conditions under which state abstraction is safe, and that abstraction is what makes MAXQ dramatically faster than flat Q-learning on the canonical Taxi domain.
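For reference, the additive decomposition can be written out as follows. This is a sketch in notation close to Dietterich (2000). The symbols $i$ (a parent subtask), $a$ (a child subtask or primitive action), $\pi_i$ (the parent's policy), and $N$ (the number of steps the child runs) are introduced here for illustration.

$$
\begin{aligned}
Q^{\pi}(i, s, a) &= V^{\pi}(a, s) + C^{\pi}(i, s, a), \\
V^{\pi}(i, s) &=
\begin{cases}
Q^{\pi}(i, s, \pi_i(s)) & \text{if } i \text{ is a composite subtask}, \\
\sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is a primitive action},
\end{cases} \\
C^{\pi}(i, s, a) &= \sum_{s', N} P_i^{\pi}(s', N \mid s, a)\, \gamma^{N}\, Q^{\pi}(i, s', \pi_i(s')).
\end{aligned}
$$

Here $V^{\pi}(a, s)$ is the expected return earned while child $a$ runs, and the completion term $C^{\pi}(i, s, a)$ is the expected discounted return for finishing parent $i$ after $a$ returns. Unrolling the recursion from the root down to a primitive action yields the additive sum of per-subtask contributions.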
Feudal RL (Dayan and Hinton 1993) introduced the manager/worker metaphor: managers set goals for sub-managers and reward them for achieving those goals, regardless of the true environment reward — “managers reward sub-managers for doing their bidding.” This is the seed of every modern manager/worker deep HRL system.
Go deeper: optimality flavors in HRL
HRL trades global optimality for tractability, and there are two named compromises. A hierarchically optimal policy is the best policy consistent with the hierarchy: the best you can do without breaking the structure. A recursively optimal policy (MAXQ’s guarantee) is weaker: each subtask is optimal given the policies of its children, ignoring the larger calling context, so a subtask might learn a policy that is locally great but globally suboptimal because it cannot see what the parent needs next. Recursive optimality is easier to achieve and enables more state abstraction; hierarchical optimality is stronger but more expensive. Which you want depends on how much you trust the hand-designed structure.
Learning the hierarchy end-to-end
The classic methods often assume the options or task graph are given. The deep-RL era asked: can we learn the skills themselves, jointly with the policy that uses them? Three influential answers:
Option-critic (Bacon, Harb and Precup, 2017) made options fully differentiable. Its intra-option policy gradient and termination gradient theorems let you learn the option policies and their termination conditions end-to-end by gradient ascent on expected return, with no subgoals or extra rewards required. See The Option-Critic Architecture under Key papers.
FeUdal Networks (Vezhnevets et al., 2017) split the agent into a Manager and a Worker. The Manager sets an abstract goal direction in a learned latent space at low temporal resolution; the Worker is rewarded (via a directional cosine signal) for moving the latent state that way. Decoupling the two timescales gives long-horizon credit assignment and emergent sub-policies. See FeUdal Networks for Hierarchical RL under Key papers.
The third pillar is HIRO (Nachum et al. 2018), which made goal-conditioned HRL sample-efficient enough for robotics-style tasks. The higher level proposes a goal state (a target the lower level should reach); the lower level is rewarded for closing the distance to that goal. Both levels train with off-policy algorithms (TD3-style), reusing data aggressively. HIRO learned complex simulated-robot behaviors, such as pushing objects and navigating mazes, from only a few million samples, substantially outperforming prior HRL methods.
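The two worker rewards mentioned above are easy to write down. The sketch below is illustrative rather than a copy of either codebase. The function names are made up here, the cosine reward is a single-interval simplification (FeUdal Networks average it over several past steps), and the HIRO functions assume the goal is expressed as an offset relative to the current state, as in the HIRO paper.

```python
import math
from typing import List, Sequence


def hiro_intrinsic_reward(
    state: Sequence[float], goal: Sequence[float], next_state: Sequence[float]
) -> float:
    """Negative Euclidean distance between the target (state + goal) and next_state."""
    return -math.sqrt(
        sum((s + g - n) ** 2 for s, g, n in zip(state, goal, next_state))
    )


def hiro_goal_transition(
    state: Sequence[float], goal: Sequence[float], next_state: Sequence[float]
) -> List[float]:
    """Re-express a relative goal after the state moves, keeping the absolute target fixed."""
    return [s + g - n for s, g, n in zip(state, goal, next_state)]


def cosine_intrinsic_reward(
    latent_before: Sequence[float],
    latent_after: Sequence[float],
    goal_direction: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Cosine similarity between the latent displacement and the manager's direction."""
    delta = [a - b for a, b in zip(latent_after, latent_before)]
    dot = sum(d * g for d, g in zip(delta, goal_direction))
    norm = math.sqrt(sum(d * d for d in delta)) * math.sqrt(
        sum(g * g for g in goal_direction)
    )
    return dot / (norm + eps)  # eps guards against a zero-length displacement


# Worker is 5 units from the target and has not moved yet.
print(hiro_intrinsic_reward([0.0, 0.0], [3.0, 4.0], [0.0, 0.0]))  # -5.0
```

The distance reward cares about where the worker ends up, while the cosine reward only cares about the direction of travel. That difference is the main design contrast between the two worker objectives.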
Go deeper: the option-critic gradients
Option-critic parametrizes the intra-option policies $\pi_{\omega,\theta}$ and termination functions $\beta_{\omega,\vartheta}$, then optimizes the expected discounted return directly. The intra-option policy gradient updates $\theta$ much like a standard policy gradient, but using the option-value function $Q_U(s, \omega, a)$ as the critic. The termination gradient updates $\vartheta$ in proportion to the advantage of continuing the current option versus switching, $A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$: if the current option is worse than average, the gradient pushes $\beta$ up so it terminates and hands control back. A known pathology is option collapse: without a regularizer the options degenerate into either always-terminating (back to flat RL) or one option that does everything, so practical variants add a deliberation cost or entropy term to keep options distinct and temporally extended.
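To make the direction of the termination update concrete, here is a single-sample sketch. It assumes a sigmoid termination function over a linear feature vector, which is one common choice rather than something the text specifies. The function names, learning rate, and example numbers are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def termination_prob(params: Sequence[float], features: Sequence[float]) -> float:
    """Termination probability as a sigmoid of a linear score (an assumed form)."""
    return sigmoid(sum(p * f for p, f in zip(params, features)))


def termination_update(
    params: Sequence[float],
    features: Sequence[float],
    q_option: float,
    v_state: float,
    lr: float = 0.1,
) -> List[float]:
    """One stochastic termination-gradient step at the state the option arrived in.

    The advantage q_option - v_state compares staying with the current option
    against the value of the state under the policy over options.
    """
    beta = termination_prob(params, features)
    advantage = q_option - v_state
    dbeta_dscore = beta * (1.0 - beta)  # derivative of the sigmoid
    # Step against the advantage: a bad option (negative advantage) raises beta.
    return [p - lr * dbeta_dscore * f * advantage for p, f in zip(params, features)]


features = [1.0, 0.0]
before = termination_prob([0.0, 0.0], features)
worse_than_average = termination_update([0.0, 0.0], features, q_option=1.0, v_state=2.0)
after = termination_prob(worse_than_average, features)
print(before < after)  # True
```

Flipping the sign of the advantage (an option that is better than average) lowers the termination probability instead, so the option keeps running.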
How a two-level update actually flows
Putting the pieces together, one episode of goal-conditioned HRL (HIRO-style) looks like this:
- Every $c$ steps the high-level policy observes the state $s_t$ and emits a goal $g_t$ (a target state, latent direction, or discrete option id) for the worker to pursue.
- The low-level policy acts every environment step conditioned on $g_t$, earning an intrinsic reward for progress toward $g_t$ (e.g. negative distance to the goal state). The true task reward accrues in the background.
- The worker trains on its intrinsic reward at every step. The manager trains on the summed environment reward over its $c$-step decision, treating the whole interval as one SMDP transition.
- Because the worker keeps changing, stored high-level transitions are relabeled (off-policy correction): each recorded goal is replaced by one that makes the current worker likely to reproduce the recorded low-level actions, so the manager’s targets stay consistent with the current worker. This is the fix for inter-level non-stationarity.
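The data flow in the steps above can be sketched without any learning at all. In the toy below, both policies are hand-coded placeholders and the 1-D environment, the function name `run_two_level_episode`, and the tuple layouts are assumptions for illustration. The point is only to show which transitions each level would store: per-step intrinsic rewards for the worker, and one summed-reward transition per interval for the manager. Goal relabeling is left out.

```python
from typing import List, Tuple

LowTransition = Tuple[float, float, float, float, float]  # s, g, a, r_intrinsic, s'
HighTransition = Tuple[float, float, float, float]        # s, g, summed env reward, s'


def run_two_level_episode(
    start: float = 0.0, target: float = 10.0, horizon: int = 12, c: int = 4
) -> Tuple[List[LowTransition], List[HighTransition]]:
    """Toy 1-D episode with a manager acting every c steps and a worker every step."""
    state, goal = start, 0.0
    seg_start, seg_goal, seg_reward = state, goal, 0.0
    low_level: List[LowTransition] = []
    high_level: List[HighTransition] = []
    for t in range(horizon):
        if t % c == 0:
            # Placeholder manager: relative goal toward the target, at most c units away.
            goal = max(-float(c), min(float(c), target - state))
            seg_start, seg_goal, seg_reward = state, goal, 0.0
        # Placeholder worker: move at most one unit toward the goal.
        action = max(-1.0, min(1.0, goal))
        next_state = state + action
        done = abs(next_state - target) < 1e-9
        env_reward = 1.0 if done else 0.0                 # sparse task reward
        intrinsic = -abs(state + goal - next_state)       # negative distance to goal
        low_level.append((state, goal, action, intrinsic, next_state))
        seg_reward += env_reward                          # accrues in the background
        goal = state + goal - next_state                  # keep the absolute target fixed
        state = next_state
        if (t + 1) % c == 0 or done or t == horizon - 1:
            # One manager transition covers the whole interval.
            high_level.append((seg_start, seg_goal, seg_reward, state))
        if done:
            break
    return low_level, high_level


low, high = run_two_level_episode()
print(len(low), len(high))  # 10 3
```

In a real system the two lists would feed separate replay buffers, and the placeholder policies would be replaced by learned networks.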
Where HRL is used
| Domain | How hierarchy helps |
|---|---|
| Robotics & locomotion | A manager picks navigation waypoints or gaits; a worker handles joint torques. Reusable low-level skills transfer across tasks. See RL in robotics. |
| Sparse-reward games | Subgoals (reach key, open door) give intrinsic signal where the environment is silent — the Montezuma’s Revenge story. |
| Long-horizon LLM agents | A high-level planner decomposes a task into steps; low-level policies (or tool calls) execute each. The hot frontier of agentic RL in 2026. |
| Multi-agent coordination | Managers assign roles/subgoals to teammates; workers act locally. See multi-agent RL. |
HRL is also conceptually intertwined with adjacent ideas: curriculum learning (ordering subgoals from easy to hard), curiosity and intrinsic motivation (the reward that drives skill discovery), model-based RL (planning over abstract actions), and imitation and inverse RL (extracting skills from demonstrations).
Limitations and open problems
- Where do the options come from? Automatic, useful, reusable skill discovery — without hand-designed subgoals — is still HRL’s central open problem. Discovered options often collapse or fail to transfer.
- Inter-level non-stationarity. Training both levels at once is unstable; off-policy corrections and frozen lower levels help but don’t fully solve it.
- Recursive vs hierarchical optimality. Hand-designed structure can lock the agent out of the truly optimal policy.
- It often doesn’t beat strong flat baselines. On many benchmarks a well-tuned PPO or SAC agent matches HRL; the gains are clearest in genuinely long-horizon, sparse-reward, or transfer-heavy settings.
HRL and LLM agents in 2026
The most visible HRL today rarely calls itself HRL. Modern long-horizon LLM agents are hierarchical almost by construction: a planning step decomposes a goal into subtasks, and lower-level steps (tool calls, sub-agent invocations, multi-step rollouts) execute each one. The agentic RL survey explicitly frames HRL as the answer to long-horizon credit assignment in agents — exactly the problem the options framework was built for, now over tokens and tool calls instead of joysticks. Two threads stand out: using LLMs to propose options or subgoals in natural language (human-readable skills that transfer), and unsupervised skill discovery to populate a reusable low-level library. The vocabulary changed; the manager/worker decomposition did not. For the broader picture see RL for reasoning and the tooling companies building RL environments.
Frequently asked questions
How is HRL different from a regular RL agent?
A flat agent picks one primitive action per step and reasons over the full task horizon. An HRL agent reasons at multiple timescales: a high-level policy selects subgoals or skills that each run for many steps, and a low-level policy executes them. The payoff is shorter effective horizons — better credit assignment and exploration in long, sparse-reward tasks.
What exactly is an “option”?
An option is a temporally extended action with three parts: an initiation set (states where it can start), an intra-option policy (how it acts while running), and a termination condition (the probability it ends in each state). Primitive actions are just one-step options, so options slot into the same RL machinery — the top-level problem becomes a semi-Markov decision process.
Do I have to hand-design the hierarchy?
Not necessarily. Classic methods like MAXQ assume a programmer-supplied task graph, but option-critic, FeUdal Networks and HIRO learn the skills (or goals) jointly with the policy that uses them. Fully automatic discovery of reusable, transferable skills is still an active research problem.
Is HRL actually used in frontier systems?
Yes, often under other names. Long-horizon LLM agents that plan-then-execute are hierarchical by design, and the 2025–26 agentic-RL literature treats HRL as the standard tool for long-horizon credit assignment. In robotics, manager/worker skill hierarchies are common. On short-horizon benchmarks, though, a well-tuned flat agent frequently competes.
Key papers
- Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction — Sutton, Precup, Singh, 1999 — the options framework.
- Hierarchical RL with the MAXQ Value Function Decomposition — Dietterich, 2000 — value decomposition and safe state abstraction.
- Hierarchical Deep RL: Temporal Abstraction and Intrinsic Motivation — Kulkarni et al., 2016 — h-DQN on Montezuma’s Revenge.
- The Option-Critic Architecture — Bacon, Harb, Precup, 2017 — learning options end-to-end.
- FeUdal Networks for Hierarchical RL — Vezhnevets et al., 2017 — deep Manager/Worker.
- Data-Efficient Hierarchical RL (HIRO) — Nachum et al., 2018 — off-policy goal-conditioned HRL.
- Discovering Temporal Structure: An Overview of HRL — Klissarov et al., 2025 — the modern survey.
Related
What is reinforcement learning? · Markov decision processes · Value functions · Policy gradients · Curiosity & intrinsic motivation · Curriculum learning · Agentic RL · RL in robotics