reinforcement-learning.com

Curriculum Learning in RL

How curriculum learning orders RL tasks from easy to hard — manual curricula, teacher-student, learning-progress signals, self-play autocurricula, UED, and 2025 LLM reasoning.

Updated 2026-06-08
Key takeaways
  • Curriculum learning trains an RL agent on an ordered sequence of tasks — easy first, then harder — instead of throwing it at the hardest task cold.
  • It is the standard cure for sparse rewards and hard-exploration problems, where a from-scratch agent never stumbles onto any reward signal.
  • Automatic curriculum learning (ACL) removes the hand-design: a teacher picks tasks by learning progress, regret, or novelty so the agent always trains at the edge of its ability.
  • Self-play and unsupervised environment design produce open-ended autocurricula — and in 2025 easy-to-hard ordering became a standard trick for RL on LLM reasoning.

What is curriculum learning in RL?

Curriculum learning trains an agent on a sequence of tasks of increasing difficulty, transferring what it learns at each stage forward, rather than dropping it into the hardest task from the start. The idea is borrowed straight from how humans learn: you teach arithmetic before calculus because the easy material builds the representations and skills the hard material needs.

In reinforcement learning this is more than a nicety. Many tasks have sparse rewards — the agent only sees a signal when it completes a long sequence of correct actions — so a from-scratch policy doing random exploration may never hit reward, and learning never starts. A curriculum seeds the agent with easy variants where reward is reachable, then gradually raises the difficulty so the policy is always learning something it can almost already do.

[Figure: a chain of tasks, Task 1 (easy), Task 2 (harder), Task 3 (harder still), Target task (full difficulty), with the policy π transferred between consecutive tasks; horizontal axis: difficulty.]
A curriculum sequences tasks from easy to hard. Each task transfers its learned policy forward as the initialization for the next — so the agent always trains just past the edge of its current ability, never on a task it cannot get any signal from.

Why it matters: the exploration problem

A policy that has never received reward is, for all practical purposes, blind. The canonical failure case is a sparse-reward task like Montezuma’s Revenge: the agent must perform a long, precise sequence of moves before any points appear, and uniform random exploration has a vanishing chance of completing it. With no reward ever observed, reward-driven learning has nothing to drive it.

Curriculum learning attacks this by reshaping what the agent trains on, rather than the reward itself (see reward shaping) or the exploration bonus (see curiosity and intrinsic motivation). It is one of three standard answers to exploration vs exploitation in hard-exploration domains, alongside those two complementary levers.

2009: Bengio et al. formalize curriculum learning for ML
2020: JMLR survey presents a full CL-for-RL framework
6 strategies emerged from OpenAI's hide-and-seek autocurriculum

The building blocks of a curriculum

The 2020 JMLR survey by Narvekar et al. breaks any curriculum method into three design decisions. Getting all three right is what separates a curriculum that accelerates learning from one that wastes compute or even hurts.

1. Task generation: what tasks exist?

Define the space of intermediate tasks. These can be hand-authored levels, procedurally generated variations, parameterized environments (e.g. terrain roughness as a continuous knob), or sub-goals carved out of the target task. The space must contain tasks easy enough to be solvable now and a smooth path toward the target.

2. Sequencing: what order?

Decide which task to train on next, given the agent’s current ability. This is the heart of the method: a fixed hand-designed schedule, or an adaptive policy that picks the task offering the most learning right now. Bad sequencing (too hard, too soon) gives no signal; too easy wastes samples.

3. Transfer: how does knowledge carry over?

Move what was learned on task k into task k+1. The simplest and most common mechanism is to keep the same policy network and continue training — the weights are the transferred knowledge. Value functions, learned representations, or skills can also be carried forward.
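
The three decisions above can be sketched as one small loop. This is a minimal illustration, not a method from the survey: `run_curriculum`, `toy_round`, the `promote_at` threshold, and the "skill" parameter are all hypothetical names invented for the example, and the training step is a stub standing in for a real RL update.

```python
from typing import Callable, Dict, List


def run_curriculum(
    tasks: List[str],
    train_one_round: Callable[[Dict[str, float], str], float],
    params: Dict[str, float],
    promote_at: float = 0.8,
    max_rounds_per_task: int = 50,
) -> List[str]:
    """Train on tasks in order, moving on once the success rate clears promote_at.

    The same `params` object is handed to every task, so whatever was learned
    on task k is the starting point for task k+1 (transfer by weight reuse).
    Returns a log of which task was trained in each round.
    """
    log: List[str] = []
    for task in tasks:                                   # task generation: a fixed, ordered pool
        for _ in range(max_rounds_per_task):
            success = train_one_round(params, task)      # updates params in place
            log.append(task)
            if success >= promote_at:                    # sequencing: promote once mastered
                break
    return log


def toy_round(params: Dict[str, float], task: str) -> float:
    """Hypothetical stand-in for an RL update: skill rises by 0.25 per round,
    and harder tasks need more skill before they count as solved."""
    params["skill"] += 0.25
    required = {"easy": 0.5, "medium": 1.0, "hard": 1.5}[task]
    return min(1.0, params["skill"] / required)


params = {"skill": 0.0}
log = run_curriculum(["easy", "medium", "hard"], toy_round, params)
print(log)
```

Because skill learned on the easy task carries over, the hard task needs only one round here instead of starting from zero.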

Go deeper: the curriculum as a meta-MDP

Narvekar et al. formalize sequencing as a curriculum MDP sitting above the task MDPs. Its “state” is the agent’s current knowledge (e.g. its policy parameters), an “action” is choosing the next task to train on, and the “reward” is how much the chosen task improves performance on the target. Solving this meta-MDP optimally would yield the best possible curriculum, but it is far harder than the original problem, so practical methods use cheap heuristics (learning progress, regret, difficulty bins) as proxies for the meta-reward. See Markov decision processes for the underlying formalism.
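
One way to write that description in symbols is sketched below. The notation is an assumption made for illustration and is not taken from the survey: $\theta_k$ stands for the policy parameters after $k$ curriculum steps, $\mathcal{T}$ for the task pool, $\mathrm{Train}$ for whatever RL algorithm is run on the chosen task, and $J_{\text{target}}$ for expected return on the target task.

```latex
\[
s_k = \theta_k, \qquad
a_k = m_k \in \mathcal{T}, \qquad
\theta_{k+1} = \mathrm{Train}(\theta_k, m_k), \qquad
r_k = J_{\text{target}}(\theta_{k+1}) - J_{\text{target}}(\theta_k)
\]
\[
\sum_{k=0}^{K-1} r_k = J_{\text{target}}(\theta_K) - J_{\text{target}}(\theta_0)
\]
```

Under this choice of reward the undiscounted meta-return telescopes to the total gain on the target task, which is why per-step improvement is a natural thing for a heuristic teacher to approximate.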

Manual vs automatic curricula

The first axis to understand is who designs the sequence. Hand-designed curricula are simple and controllable but brittle and labor-intensive; automatic curriculum learning (ACL) hands sequencing to an algorithm that adapts to the learner in real time.

Manual / hand-designed

A human authors the task progression: start the robot near the goal, then move the start state farther back; grow the board size; raise opponent strength on a fixed schedule. Pros: simple, interpretable, no extra machinery. Cons: requires domain expertise, doesn’t adapt to the agent’s actual learning curve, breaks when the agent learns faster or slower than expected.
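
A fixed hand-designed schedule can be as small as one function. The sketch below assumes the "start near the goal, then move farther back" example and uses invented distances; `start_distance` and its parameters are hypothetical, not part of any library.

```python
def start_distance(step: int, total_steps: int, min_dist: int = 1, max_dist: int = 20) -> int:
    """Fixed linear schedule: begin next to the goal, end at the full distance.

    The schedule depends only on the training step, never on how the agent is
    actually doing, which is exactly what makes a manual curriculum brittle.
    """
    frac = min(1.0, max(0.0, step / total_steps))        # clamp progress to [0, 1]
    return min_dist + int(frac * (max_dist - min_dist))  # round down to a whole distance


for step in (0, 250, 500, 1000):
    print(step, start_distance(step, total_steps=1000))
```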

Automatic (ACL)

A teacher algorithm selects tasks online using a signal like learning progress, regret, or novelty — always training the agent at the frontier of its competence. Pros: adapts to the learner, no hand-tuned schedule, scales to huge task spaces. Cons: the teacher’s signal can be noisy or gamed, and it adds its own hyperparameters. See the short ACL survey.

Teacher-student and learning-progress curricula

The dominant ACL idea is teacher-student curriculum learning (TSCL): a teacher agent’s job is to pick, at each step, the sub-task on which the student is currently improving fastest. The intuition is that the steepest part of the learning curve is where training samples buy the most — too-easy tasks are mastered (no slope) and too-hard tasks give no progress (no slope), so the productive frontier is in between.

The teacher’s reward for proposing task $i$ is the student’s learning progress, the change in performance on that task:

$$r_t^{(i)} = x_t^{(i)} - x_{t-1}^{(i)}$$

where $x_t^{(i)}$ is the student’s score on task $i$ at time $t$. The teacher then treats task selection as a non-stationary multi-armed bandit problem, pulling the arm (task) with the highest expected progress. A refinement, ALP-GMM (Portelas et al., 2019), works in continuous task spaces and uses absolute learning progress, $|r - r_{\text{old}}|$, so the teacher also revisits tasks where the student is regressing. It fits a Gaussian mixture over the task space to find high-progress regions.

[Figure: learning progress (vertical axis) against task difficulty for the current agent (horizontal axis), with regions labeled "too easy", "train here", and "too hard".]
Learning progress as a sequencing signal. The teacher prefers the middle band: too-easy tasks are already solved (flat, no slope) and too-hard tasks produce no improvement (flat, no slope). Maximum slope sits at the agent's competence frontier — the zone of proximal development.
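
A learning-progress teacher over a discrete task set can be sketched as a small bandit. This is a simplified illustration in the spirit of TSCL, not the published algorithm: the class name, the exponential smoothing factor `alpha`, and the epsilon-greedy exploration are assumptions chosen for brevity. It ranks tasks by the absolute value of smoothed progress, so regressing tasks are revisited too.

```python
import random
from typing import List, Optional


class LearningProgressTeacher:
    """Epsilon-greedy bandit whose arm values are smoothed learning progress."""

    def __init__(self, n_tasks: int, alpha: float = 0.3, epsilon: float = 0.1, seed: int = 0):
        self.alpha = alpha                                  # smoothing for the non-stationary signal
        self.epsilon = epsilon                              # probability of picking a random task
        self.progress: List[float] = [0.0] * n_tasks        # smoothed score change per task
        self.last_score: List[Optional[float]] = [None] * n_tasks
        self.rng = random.Random(seed)

    def choose(self) -> int:
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.progress))
        abs_lp = [abs(p) for p in self.progress]
        return max(range(len(abs_lp)), key=abs_lp.__getitem__)

    def update(self, task: int, score: float) -> None:
        prev = self.last_score[task]
        self.last_score[task] = score
        if prev is None:
            return                                          # two scores are needed to measure progress
        r = score - prev                                    # r_t = x_t - x_{t-1}
        self.progress[task] += self.alpha * (r - self.progress[task])


teacher = LearningProgressTeacher(n_tasks=3, epsilon=0.0)
# Task 0 is already mastered (flat), task 1 is improving, task 2 is too hard (flat).
for task, score in [(0, 0.9), (0, 0.9), (1, 0.2), (1, 0.5), (2, 0.0), (2, 0.0)]:
    teacher.update(task, score)
first_pick = teacher.choose()

# If the mastered task is later forgotten, its absolute progress becomes the largest.
teacher.update(0, 0.3)
second_pick = teacher.choose()
print(first_pick, second_pick)
```

The flat tasks at both ends of the difficulty range get zero progress and are ignored, which is the middle-band preference the figure describes.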

Generating the tasks: goals, self-play, and environment design

Sequencing assumes a pool of tasks to sequence. Where does that pool come from? Three families answer this differently.

| Family | How tasks are created | Representative method |
| --- | --- | --- |
| Goal generation | A generator proposes intermediate goals at the right difficulty | Goal GAN: “goals of intermediate difficulty” (Florensa et al., 2018) |
| Self-play | An opponent or a goal-setting partner is the curriculum, scaling with the agent | Asymmetric self-play; AlphaZero-style self-play |
| Environment design | An adversary or curator shapes the environment itself toward maximal learning | PAIRED, Prioritized Level Replay, POET |

Goal generation trains a generator (originally a GAN) to output goals the agent succeeds at roughly half the time — neither trivial nor impossible — and refreshes them as the agent improves.
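
The labeling step behind "neither trivial nor impossible" can be sketched without the GAN itself. The function below only selects goals whose measured success rate falls inside a band; the band limits `r_min` and `r_max` and the goal names are illustrative assumptions, and the generator that would be trained on the selected goals is omitted.

```python
from typing import Dict, List


def goals_of_intermediate_difficulty(
    success_rates: Dict[str, float], r_min: float = 0.1, r_max: float = 0.9
) -> List[str]:
    """Keep goals the agent sometimes reaches, dropping trivial and impossible ones."""
    return [goal for goal, rate in success_rates.items() if r_min <= rate <= r_max]


# Hypothetical success rates measured from recent rollouts.
measured = {"next_room": 0.98, "far_corner": 0.45, "locked_vault": 0.0, "hallway": 0.6}
selected = goals_of_intermediate_difficulty(measured)
print(selected)
```

Re-measuring the success rates as the agent improves is what refreshes the selected set over time.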

Self-play is the most famous source of automatic curricula: when an agent trains against copies of itself, the opponent is always at exactly the right level, and difficulty rises automatically as both sides improve. This is the engine behind AlphaZero and MuZero and a core mechanism of multi-agent RL. OpenAI’s hide-and-seek experiment is the vivid demonstration: with no explicit curriculum, multi-agent competition produced an autocurriculum of six escalating strategies and counter-strategies — running, fort-building, ramp use, ramp defense, box-surfing, surf defense — each emerging only because the previous one made it worthwhile.

Video: Multi-Agent Hide and Seek, OpenAI (the autocurriculum of emergent strategies, ~3 min)

Unsupervised environment design (UED)

The most principled modern framing is unsupervised environment design: a teacher generates the environment configuration (a maze layout, a terrain) to maximize the student’s learning. The standard signal is regret — the gap between the best achievable return on a level and the student’s actual return. High-regret levels are exactly those the student could solve but currently doesn’t: the productive frontier again.

  • PAIRED stages a three-way game: a generator builds levels, and two learning agents attempt them, an antagonist allied with the generator and a protagonist (the student). The generator is rewarded by the gap between the antagonist’s and the protagonist’s returns, which drives it toward solvable-but-unsolved levels.
  • Prioritized Level Replay (PLR) is the simpler, more scalable cousin: randomly sample levels, estimate each one’s learning potential (via GAE-based regret proxies), and keep replaying the highest-regret levels from a rolling buffer. In reported benchmarks, PLR has generalized better out-of-distribution than PAIRED.
  • POET co-evolves a population of environments and the agents that solve them, endlessly generating new, increasingly complex terrains — an open-ended curriculum with no fixed target task at all.
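
The replay side of PLR can be sketched as rank-based sampling over stored level scores. This is a simplified illustration: it covers only the score-rank prioritization, leaves out the staleness term and the buffer management of the full method, and the function names, `temperature` value, and level scores are assumptions made for the example.

```python
import random
from typing import Dict


def rank_priorities(scores: Dict[str, float], temperature: float = 0.3) -> Dict[str, float]:
    """Turn per-level regret estimates into replay probabilities by rank.

    The highest-scoring level gets rank 1; weight is (1 / rank) ** (1 / temperature),
    so a lower temperature concentrates replay on the top-ranked levels.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    weights = {
        level: (1.0 / rank) ** (1.0 / temperature)
        for rank, level in enumerate(ordered, start=1)
    }
    total = sum(weights.values())
    return {level: w / total for level, w in weights.items()}


def sample_level(scores: Dict[str, float], rng: random.Random, temperature: float = 0.3) -> str:
    """Draw one level to replay according to its rank-based probability."""
    probs = rank_priorities(scores, temperature)
    levels = list(probs)
    return rng.choices(levels, weights=[probs[lvl] for lvl in levels], k=1)[0]


# Hypothetical regret proxies for three stored levels.
level_scores = {"maze_a": 0.9, "maze_b": 0.1, "maze_c": 0.5}
probs = rank_priorities(level_scores, temperature=1.0)
picked = sample_level(level_scores, random.Random(0))
print(probs, picked)
```

Ranking instead of using raw scores keeps the sampling distribution stable when the scale of the regret proxy drifts during training.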

Go deeper: why regret beats raw difficulty as a signal

A naive teacher that maximizes student failure will happily generate impossible levels — an unsolvable maze has maximum failure and zero learning value. Regret fixes this: it is high only when a level is solvable (some policy gets high return) and the current student does poorly. Maximizing regret therefore steers the generator toward the frontier of solvable-but-unsolved tasks, and it comes with a game-theoretic guarantee: at the equilibrium of the teacher-student game, the student has minimized worst-case regret, i.e. it is robust across the whole level distribution. In practice the true regret is unknown, so methods use proxies — the antagonist-protagonist gap (PAIRED) or value-prediction error (PLR).
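
The contrast between the two signals shows up in a three-level toy example. All numbers and level names below are invented for illustration, returns are assumed to lie in [0, 1], and the best achievable return is assumed known, which it is not in practice.

```python
# Each level maps to (best achievable return, student's current return).
levels = {
    "mastered": (1.0, 0.95),     # solvable and already solved
    "frontier": (1.0, 0.20),     # solvable but not yet solved
    "impossible": (0.0, 0.00),   # no policy can get any return
}


def failure(best: float, student: float) -> float:
    """Naive difficulty signal: how far the student is from a perfect score."""
    return 1.0 - student


def regret(best: float, student: float) -> float:
    """Gap between what is achievable on the level and what the student gets."""
    return best - student


pick_by_failure = max(levels, key=lambda name: failure(*levels[name]))
pick_by_regret = max(levels, key=lambda name: regret(*levels[name]))
print(pick_by_failure, pick_by_regret)
```

The failure-maximizing teacher selects the impossible level, while the regret-maximizing teacher selects the frontier level, because the impossible level has zero regret by construction.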

Curriculum learning for LLM reasoning (2024-2026)

Curriculum learning is having a second life inside RL post-training for LLMs. When you train a reasoning model with RLVR (reinforcement learning with verifiable rewards) or GRPO (group relative policy optimization), the difficulty of each problem matters enormously. If every problem is too hard, all sampled answers are wrong, every group-relative advantage is zero, and the gradient vanishes: the model learns nothing. If everything is too easy, all answers are right and again there is no signal. Productive RL needs problems the model gets right some of the time.
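
The vanishing signal is easy to see by computing group-relative advantages directly. The sketch below assumes binary correctness rewards and normalizes each reward by the group mean and population standard deviation; implementations differ in details such as the standard-deviation estimator and the `eps` constant, so treat those as assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sampled answer relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


too_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])   # every answer wrong
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])   # every answer right
useful = group_relative_advantages([1.0, 0.0, 0.0, 0.0])     # solved one time in four
print(too_hard, too_easy, useful)
```

Only the mixed group yields nonzero advantages: the correct answer is pushed up and the incorrect ones are pushed down.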

This makes difficulty ordering a first-class lever. Recent work shows that scheduling problems easy-to-hard, or simply filtering to problems of intermediate pass-rate, can meaningfully improve RL for reasoning, especially for small models that flounder under vanilla RL. The 2025 E2H Reasoner result found that easy tasks provide early traction but should be phased out to avoid overfitting, and self-evolving curricula let the model’s own success rate define the schedule online, which is TSCL’s learning-progress idea rediscovered for LLMs.
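
Difficulty filtering and easy-to-hard ordering can both be driven by the model's own pass rate. The sketch below is one simple way to do it, not the procedure of any particular paper: `easy_to_hard`, the problem IDs, and the sampled outcomes are hypothetical, and pass rate is estimated from a handful of sampled answers per problem.

```python
from typing import Dict, List


def easy_to_hard(samples: Dict[str, List[bool]]) -> List[str]:
    """Drop problems with pass rate 0 or 1, then order the rest easiest first."""
    rates = {pid: sum(outcomes) / len(outcomes) for pid, outcomes in samples.items()}
    informative = [pid for pid, rate in rates.items() if 0.0 < rate < 1.0]
    return sorted(informative, key=lambda pid: rates[pid], reverse=True)


# Hypothetical correctness of four sampled answers per problem.
sampled = {
    "p1": [True, True, True, True],       # always solved: no signal
    "p2": [True, False, False, False],    # hard but reachable
    "p3": [True, True, True, False],      # easy but not saturated
    "p4": [False, False, False, False],   # never solved: no signal
}
schedule = easy_to_hard(sampled)
print(schedule)
```

Re-estimating the pass rates periodically turns this static filter into an online schedule that follows the model as it improves.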

A short history

1993: Learning and development (Elman). Jeffrey Elman shows neural nets learn grammar better when “starting small”, the cognitive-science seed of curriculum learning.
2009: Curriculum Learning (Bengio et al.). The ICML paper that names and formalizes curriculum learning, framing it as a continuation method for non-convex optimization.
2017–18: Automatic curricula for deep RL. Teacher-Student CL (Matiisen et al.), Goal GAN (Florensa et al.), and reverse curriculum generation bring adaptive, automatic sequencing to deep RL.
2019: Autocurricula via self-play. OpenAI’s hide-and-seek shows multi-agent competition generating an open-ended curriculum of emergent strategies; ALP-GMM tackles continuous task spaces.
2020: Surveys & UED. The JMLR framework survey consolidates the field; PAIRED and Prioritized Level Replay launch regret-based unsupervised environment design.
2022: Evolving curricula (POET → ACCEL). Regret-based environment design scales to open-ended, ever-harder environment populations.
2025–26: Curriculum RL for reasoning. Easy-to-hard scheduling, difficulty filtering, and self-evolving curricula become standard tricks in RLVR/GRPO post-training of LLMs.

Where curriculum learning is used

| Domain | What the curriculum does |
| --- | --- |
| Robotics & control | Grow terrain roughness, perturbation strength, or task length; reverse curricula start near the goal. See RL in robotics, continuous control. |
| Games & self-play | Self-play supplies an automatic opponent curriculum; procedural levels raise difficulty. See AlphaZero & MuZero. |
| Procedurally generated environments | UED (PLR, PAIRED, POET) generates levels at the student’s frontier for robust generalization. |
| LLM reasoning | Easy-to-hard ordering and difficulty filtering keep the RLVR/GRPO gradient signal alive. See RL for reasoning. |
| Multi-agent | Competitive co-adaptation creates autocurricula. See multi-agent RL. |

Pitfalls and open problems

  • A bad curriculum can hurt. Ordering that is too aggressive, or that drifts away from the target distribution, can slow learning or bias the final policy — sometimes worse than no curriculum at all.
  • Forgetting. As the agent moves to harder tasks it can forget how to solve earlier ones; PLR-style replay and mixed sampling exist precisely to counter this.
  • Noisy progress signals. Learning progress and regret are estimated from a few noisy rollouts, so the teacher can chase noise or get stuck. Robust estimation is an active area.
  • The meta-problem. Optimal sequencing is its own hard RL problem; every practical method is a heuristic approximation, and which heuristic wins is domain-dependent.
  • Designing the task space. ACL only sequences within the tasks you give it — if the space lacks a smooth path to the target, no teacher can build a good curriculum.

Researcher takes

Minqi Jiang (UCL DARK / Meta FAIR) reframes Prioritized Level Replay as a form of unsupervised environment design — quietly building a minimax-regret curriculum that drives zero-shot generalization.

Frequently asked questions

How is curriculum learning different from reward shaping?

They are complementary levers on the same problem (sparse signal). Reward shaping changes the reward function to give denser intermediate feedback on a fixed task. Curriculum learning leaves the reward alone and changes which tasks the agent trains on, in what order. You can — and often do — use both together.

What’s the difference between manual and automatic curriculum learning?

A manual curriculum is a human-authored, usually fixed schedule of tasks. Automatic curriculum learning (ACL) lets an algorithm — a teacher — choose tasks online based on the agent’s measured learning, so the sequence adapts to how fast the agent actually learns. ACL scales to large task spaces and removes hand-tuning, at the cost of extra machinery and a signal that can be noisy.

Is self-play a form of curriculum learning?

Yes — it is the most influential automatic curriculum. Training against copies of yourself means the opponent is always near your own level, so difficulty scales automatically as you improve. This autocurriculum is what made AlphaZero and OpenAI’s hide-and-seek work without any hand-designed schedule.

Does curriculum learning help when training LLMs with RL?

Often, yes. In RLVR/GRPO post-training, problems that are uniformly too hard or too easy produce no gradient. Ordering or filtering problems toward intermediate difficulty keeps the learning signal alive — especially for smaller models. But it is not guaranteed: some 2026 studies find fixed easy-to-hard schedules underperform uniform sampling on certain reasoning tasks, so it remains an empirical choice. See RL for reasoning.
