Reward Shaping in Reinforcement Learning

Key takeaways

Reward shaping adds extra reward signal to guide an agent toward a goal, fixing the slow learning caused by sparse rewards.
Done naively it changes what the agent optimizes for — the classic failure is reward hacking, where the agent exploits the bonus instead of solving the task.
Potential-based reward shaping (PBRS) is the safe form: shaping built from a state potential Φ provably leaves the set of optimal policies unchanged.
Modern RL still leans on shaping — as intrinsic/curiosity bonuses, curriculum design, and dense rewards for robotics — but verifiable and learned rewards now do a lot of the same work.

What is reward shaping?

Reward shaping is the practice of adding extra, designer-supplied reward signal to a reinforcement learning problem so the agent learns faster. The environment’s true objective often gives reward only at the very end — win the game, reach the goal, solve the puzzle. That signal is sparse: an agent wandering randomly may take millions of steps before it stumbles onto any reward at all, and until it does there is nothing to learn from. Shaping injects a steady stream of smaller hints — “you’re getting closer,” “you picked up the key” — that point the agent in the right direction long before it ever sees the real payoff.

The catch is that you are now optimizing a different reward function than the one you actually care about. Get the shaping wrong and the agent will happily maximize your hints while completely ignoring the goal. The central question of this topic is therefore: how do you add helpful guidance without changing the problem you meant to solve?

Sparse vs. shaped reward. With sparse reward the agent gets a signal only at the goal G, so early exploration is blind. A shaping term adds a gradient of reward over the state space that points toward G.

Why sparse rewards are hard

Reinforcement learning improves a policy by propagating reward backward through the states that led to it (see temporal-difference learning and value functions). If reward is zero until the final step, there is nothing to propagate during the long initial phase of training. The agent’s exploration is effectively a blind random walk, and the probability of randomly executing the exact sequence that reaches the goal can be astronomically small in any non-trivial environment.
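To get a feel for how small that probability can be, here is a back-of-the-envelope sketch. It assumes a hypothetical environment in which exactly one sequence of actions reaches the goal and the agent picks uniformly among a fixed number of actions at every step; the function name and the numbers are illustrative only.

```python
def prob_random_success(num_actions: int, sequence_length: int) -> float:
    """Chance that uniform random choices reproduce one specific action sequence."""
    return (1.0 / num_actions) ** sequence_length


# Assumed setting: 4 actions per step, one winning sequence of length n.
for n in (10, 50, 100):
    p = prob_random_success(num_actions=4, sequence_length=n)
    print(f"{n:>3} steps: p = {p:.3e}")
```

Real environments usually admit many successful sequences, so this is a pessimistic toy model, but the exponential decay with sequence length is the point.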

Montezuma’s Revenge — the Atari game that defeated early DQN agents — is the canonical example: you must collect a key, cross several rooms, and avoid hazards before any points appear. A purely sparse-reward agent essentially never sees a positive signal. Reward shaping (or its close cousins, intrinsic motivation and curriculum learning) is what makes such problems tractable.

1999: Ng, Harada & Russell prove the potential-based shaping theorem.

Zero: the reward an unshaped agent sees until it first reaches a sparse goal.

Φ(s): a single state function is all PBRS needs to stay safe.

The naive approach — and how it backfires

The intuitive move is to just hand-write a denser reward: give a robot points for moving toward the target, a racing agent points for speed, a cleaning agent points for each bit of dirt removed. Sometimes this works. Often it produces reward hacking — the agent maximizes your proxy in ways that defeat the real goal.

The deeper issue is that adding any reward term generally changes the optimal policy. If you reward a navigation agent for “facing the goal,” its best strategy might be to stand still and face the goal forever. This is an instance of Goodhart’s law: when a measure becomes a target, it ceases to be a good measure. We need a way to add guidance that is mathematically guaranteed not to change which policies are optimal.
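The "stand still and face the goal" failure can be checked with a few lines of arithmetic. The sketch below compares two hypothetical policies under the true reward and under a naive per-step facing bonus; every constant (discount, horizon, reward sizes, route length) is an assumption chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


GAMMA = 0.99
HORIZON = 1000        # episode length cap (assumed)
GOAL_REWARD = 10.0    # true task reward on arrival (assumed)
FACING_BONUS = 1.0    # naive per-step bonus while facing the goal (assumed)
STEPS_TO_GOAL = 20    # length of the direct route (assumed)

# Policy A walks to the goal, facing it the whole way; the episode then ends.
reach_true = [0.0] * (STEPS_TO_GOAL - 1) + [GOAL_REWARD]
reach_shaped = [r + FACING_BONUS for r in reach_true]

# Policy B stands still facing the goal until the horizon runs out.
loiter_true = [0.0] * HORIZON
loiter_shaped = [r + FACING_BONUS for r in loiter_true]

print("true objective  reach:", discounted_return(reach_true, GAMMA),
      "loiter:", discounted_return(loiter_true, GAMMA))
print("naively shaped  reach:", discounted_return(reach_shaped, GAMMA),
      "loiter:", discounted_return(loiter_shaped, GAMMA))
```

Under these assumed numbers the true objective prefers reaching the goal, while the naively shaped objective prefers loitering (roughly 100 versus roughly 26). The bonus has changed which policy is optimal.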

Good shaping

Encodes progress toward the goal as a difference of a potential. Speeds learning, leaves the optimal policy untouched, and washes out over a full trajectory.

Bad shaping

Rewards a proxy behavior (facing the goal, hitting respawning targets). Creates new, unintended optima the agent will gleefully exploit — reward hacking.

Potential-based reward shaping (the safe form)

In 1999 Andrew Ng, Daishi Harada, and Stuart Russell answered exactly this question. Their result, potential-based reward shaping (PBRS), says: define a potential function $\Phi(s)$ over states — think of it as a heuristic estimate of “how good is it to be here” — and build your shaping reward as the difference in potential between consecutive states:

$$F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)$$

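The agent trains on the shaped reward $r'(s,a,s') = r(s,a,s') + F(s,a,s')$. The theorem shows that this form is both sufficient and necessary for policy invariance. Sufficient: for any bounded $\Phi$ and any environment dynamics, every policy optimal in the shaped MDP is optimal in the original, and vice versa. Necessary: if $F$ is not a potential difference, there exist transition dynamics and rewards under which the optimal policies of the shaped MDP are no longer optimal in the original.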

Why a potential difference is safe: along any trajectory the discounted shaping terms telescope. Each $-\Phi(s_{t+1})$ cancels the preceding $+\gamma\,\Phi(s_{t+1})$ once discounting is applied, leaving only boundary terms, so the ranking of complete policies is unchanged.
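Written out for a trajectory $s_0, s_1, \dots, s_T$ (the indexing is introduced here for illustration), the discounted shaping terms collapse as follows:

$$
\sum_{t=0}^{T-1} \gamma^{t} F(s_t, a_t, s_{t+1})
= \sum_{t=0}^{T-1} \left( \gamma^{t+1}\,\Phi(s_{t+1}) - \gamma^{t}\,\Phi(s_t) \right)
= \gamma^{T}\,\Phi(s_T) - \Phi(s_0)
$$

The total depends only on the first and last states. With bounded $\Phi$ and $\gamma < 1$, the term $\gamma^{T}\Phi(s_T)$ vanishes as $T \to \infty$, so every policy started from $s_0$ receives the same shaping offset $-\Phi(s_0)$.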

There is a beautiful equivalence behind this. Adding a potential-based shaping term is the same as initializing the value function with $\Phi$. Concretely, the optimal Q-function of the shaped MDP is just the original one shifted by the potential:

$$Q^{*}_{M'}(s, a) = Q^{*}_{M}(s, a) - \Phi(s)$$

Because $\Phi(s)$ does not depend on the action, it shifts every action’s value at a state by the same amount — so the argmax (the greedy policy) is identical. The shaping changes how fast the agent learns the values, not which policy those values prefer.
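The shift identity can be checked numerically on a toy problem. The sketch below runs tabular Q-value iteration on a hypothetical five-state chain, once with the sparse reward and once with the shaped reward, then compares the results. The chain, the reward of 1 at the goal, the discount, and the distance potential are all assumptions made for the example; the potential is zero at the terminal goal state.

```python
GAMMA = 0.9
N_STATES = 5            # states 0..4; state 4 is the terminal goal (assumed toy MDP)
ACTIONS = (-1, +1)      # move left / move right


def step(s, a):
    """Deterministic chain dynamics, clipped at the left wall."""
    return min(max(s + a, 0), N_STATES - 1)


def base_reward(s, a, s_next):
    """Sparse reward: 1 on entering the goal, 0 otherwise."""
    return 1.0 if s_next == N_STATES - 1 else 0.0


def phi(s):
    """Heuristic potential: negative distance to the goal (0 at the goal)."""
    return -float(N_STATES - 1 - s)


def shaped_reward(s, a, s_next):
    """Base reward plus the potential-based term gamma*Phi(s') - Phi(s)."""
    return base_reward(s, a, s_next) + GAMMA * phi(s_next) - phi(s)


def q_iteration(reward_fn, sweeps=200):
    """Tabular Q-value iteration; the goal is terminal, so its value is 0."""
    q = {(s, a): 0.0 for s in range(N_STATES - 1) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            for a in ACTIONS:
                s_next = step(s, a)
                if s_next == N_STATES - 1:
                    future = 0.0
                else:
                    future = max(q[(s_next, b)] for b in ACTIONS)
                q[(s, a)] = reward_fn(s, a, s_next) + GAMMA * future
    return q


q_plain = q_iteration(base_reward)
q_shaped = q_iteration(shaped_reward)

# Largest deviation from Q_shaped(s, a) = Q_plain(s, a) - Phi(s).
max_gap = max(abs(q_shaped[k] - (q_plain[k] - phi(k[0]))) for k in q_plain)
greedy_plain = [max(ACTIONS, key=lambda a: q_plain[(s, a)]) for s in range(N_STATES - 1)]
greedy_shaped = [max(ACTIONS, key=lambda a: q_shaped[(s, a)]) for s in range(N_STATES - 1)]
print("max gap:", max_gap, "same greedy policy:", greedy_plain == greedy_shaped)
```

On this toy chain the two Q-tables differ by the potential at each state, up to numerical tolerance, and the greedy action is "move right" everywhere in both.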

How to design a potential-based reward

Pick a potential that estimates “goodness of a state”

Choose $\Phi(s)$ as a cheap heuristic for how close $s$ is to success. For navigation, $\Phi(s) = -\text{distance}(s, \text{goal})$. For a game, it could be score-so-far, pieces captured, or progress along a known subgoal sequence. Higher potential should mean closer to the objective.

Form the shaping term as a discounted potential difference

On each transition compute $F = \gamma\,\Phi(s') - \Phi(s)$ and add it to the environment reward. Moving to a higher-potential state yields a positive bonus; backsliding yields a penalty. The discount $\gamma$ must match the one used by your learning algorithm.

Train on the shaped reward exactly as before

Feed $r + F$ into Q-learning, PPO, actor-critic, or any RL method. No algorithmic changes are needed — shaping lives entirely in the reward channel.

Verify the policy, not the reward curve

Evaluate the trained policy on the original (unshaped) objective. A rising shaped-reward curve is reassuring but not proof; the unshaped task metric is the ground truth.
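The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the grid coordinates, the Manhattan-distance potential, the helper names, and the choice to treat terminal states as having zero potential are all assumptions made for the example.

```python
from typing import Callable, Tuple

State = Tuple[int, int]  # (row, col) on a hypothetical grid


def make_distance_potential(goal: State) -> Callable[[State], float]:
    """Step 1: Phi(s) = -Manhattan distance to the goal (higher is closer)."""
    def phi(state: State) -> float:
        return -float(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))
    return phi


def shaped_reward(env_reward: float, state: State, next_state: State,
                  phi: Callable[[State], float], gamma: float,
                  terminal: bool = False) -> float:
    """Steps 2-3: return r + gamma * Phi(s') - Phi(s).

    Assumption: a terminal next state is given potential 0, a common
    convention for episodic tasks.
    """
    next_potential = 0.0 if terminal else phi(next_state)
    return env_reward + gamma * next_potential - phi(state)


GAMMA = 0.95  # must equal the discount used by the learning algorithm
phi = make_distance_potential(goal=(3, 3))

toward = shaped_reward(0.0, (0, 0), (0, 1), phi, GAMMA)  # one step closer
away = shaped_reward(0.0, (0, 1), (0, 0), phi, GAMMA)    # one step farther
print("toward goal:", toward, "away from goal:", away)
```

The learner would consume the returned value in place of the raw environment reward. For step 4, evaluation should still sum `env_reward` alone, so the reported metric is the unshaped objective.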

Go deeper: dynamic and action-dependent potentials

The original theorem requires a static state potential $\Phi(s)$. Later work extended this safely. Devlin and Kudenko (2012) proved that a time-varying potential $\Phi(s, t)$ still preserves optimal policies, enabling potentials that improve as learning proceeds (e.g. a learned value estimate). Wiewiora, Cottrell, and Elkan (2003) introduced action-dependent advice potentials of the form $F = \gamma\,\Phi(s', a') - \Phi(s, a)$, which let you encode “in this state, prefer this action”, but the potential $\Phi(s,a)$ must then be added back into the Q-values during action selection to keep the guarantee. More recent results (2024–2025) extend potential-based guarantees to non-Markovian and intrinsic-motivation bonuses, showing curiosity-style rewards can be framed as potentials to retain optimality.
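As a rough sketch of the action-dependent case, the two pieces are the shaping term itself and the biased action selection. The function names, the dictionary-based Q-table, and the toy advice potential below are hypothetical and chosen only to make the idea concrete.

```python
def advice_shaping_term(phi_sa, gamma, s, a, s_next, a_next):
    """F = gamma * Phi(s', a') - Phi(s, a), using the next action actually taken."""
    return gamma * phi_sa(s_next, a_next) - phi_sa(s, a)


def biased_greedy_action(q, phi_sa, s, actions):
    """Select argmax_a [Q(s, a) + Phi(s, a)]: the potential is added back here."""
    return max(actions, key=lambda a: q.get((s, a), 0.0) + phi_sa(s, a))


def prefer_right(s, a):
    """Toy advice potential (assumed): mildly prefer 'right' in every state."""
    return 1.0 if a == "right" else 0.0


ACTIONS = ("left", "right")
q_table = {}  # untrained Q-values default to 0.0

print(biased_greedy_action(q_table, prefer_right, s=0, actions=ACTIONS))
print(advice_shaping_term(prefer_right, 0.9, s=0, a="left", s_next=1, a_next="right"))
```

With an empty Q-table the advice alone decides the action; as Q-values are learned they can override it, which is why the advice steers exploration without fixing the final policy.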

Shaping vs. its modern cousins

Reward shaping is one of several ways to attack sparse reward; they are often combined.

Approach	What it adds	Preserves optimal policy?	Best for
Potential-based shaping	$\gamma\Phi(s') - \Phi(s)$ from a heuristic potential	Yes, provably	Domains with a good “distance-to-goal” heuristic
Hand-crafted dense reward	Designer-chosen bonuses for proxy behaviors	No — risks reward hacking	Quick prototypes; risky in production
Intrinsic motivation / curiosity	Bonus for novel or surprising states	Approximately, if framed as a potential	Hard-exploration games (Montezuma’s Revenge)
Curriculum learning	Easier task variants first, then harder	N/A — changes the task sequence, not the reward	Tasks with a natural difficulty gradient
Learned reward models	A model trained on human/AI preference	Replaces the reward entirely	Fuzzy goals: see reward models, RLHF
Verifiable reward (RLVR)	Programmatic check (tests pass, answer correct)	It is the true reward	Math, code, checkable reasoning — RLVR

A short history

1990s

Ad hoc dense rewards

Early RL practitioners hand-tuned rewards to speed learning — effective but unprincipled, and prone to changing the task.

1999

Potential-based shaping theorem

Ng, Harada & Russell prove that potential-based shaping is exactly the class of reward transforms that preserve optimal policies.

2003

Advice potentials

Wiewiora, Cottrell & Elkan extend the theory to action-dependent “advice”; the same year, Wiewiora shows that shaping is equivalent to Q-value initialization.

2012

Dynamic potentials

Devlin & Kudenko prove time-varying potentials remain optimality-preserving, allowing potentials that learn during training.

2016–18

Intrinsic motivation & reward-hacking lore

Curiosity-driven exploration cracks hard-exploration games; OpenAI’s CoastRunners boat becomes the textbook reward-hacking failure.

2022–26

Learned & verifiable rewards

RLHF, DPO and RLVR shift the frontier from hand-shaping toward learned or programmatic reward — while shaping persists in robotics and control.

Where reward shaping is used today

Robotics and locomotion — dense rewards on forward velocity, foot clearance, energy, and posture are standard for training walking and manipulation policies; PBRS keeps such bonuses from corrupting the true objective.
Game playing — progress signals (board control, captured material, distance traveled) accelerate learning in long-horizon games.
Hard-exploration RL — intrinsic/curiosity bonuses, increasingly framed as potentials to stay optimality-preserving, drive exploration where the environment reward is essentially absent.
Hierarchical RL — subgoal-completion rewards shape lower-level policies, an implicit form of shaping by task decomposition.

Building reward pipelines, simulation environments, and the human/verifier feedback that powers modern reward signals is now its own industry — see the companies building RL environments.

Limitations and open problems

Designing a good potential needs domain knowledge. A bad $\Phi$ is safe (still optimality-preserving) but useless — or even slows learning if it points the wrong way.
The optimality guarantee is asymptotic. PBRS preserves the final optimal policy, but a poorly chosen potential can still hurt sample efficiency along the way.
Non-potential shaping is everywhere in practice. Real engineering teams routinely add non-potential bonuses for speed — and routinely get bitten by reward hacking as a result. See RL safety and alignment.
The frontier has partly moved on. For fuzzy objectives, learning the reward (reward models) or verifying it (RLVR) often beats hand-shaping — but neither fully replaces shaping in control and robotics.

Frequently asked questions

Does reward shaping change what the agent learns?

It can. Any added reward term generally shifts the optimal policy, which is the source of reward hacking. The exception is potential-based shaping, $\gamma\Phi(s') - \Phi(s)$, which Ng, Harada & Russell proved leaves the set of optimal policies exactly unchanged for any bounded potential.

What is the difference between reward shaping and a reward model?

Reward shaping adds hand-designed signal on top of a known environment reward to speed learning. A reward model learns the reward itself from data (usually human preferences) when no good reward function exists. Shaping assumes you know the goal; reward modeling is for when “good” is hard to specify.

Why use a potential difference instead of just rewarding good states?

Rewarding good states directly (e.g. +1 every step you face the goal) creates a new optimum: the agent loiters in those states to farm the bonus. A potential difference telescopes over a trajectory, so the total shaping reward depends only on start and end states — there is no way to gain by lingering, which is precisely why the optimal policy is preserved.
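A quick numerical check makes the contrast concrete. The sketch below uses a hypothetical 1-D chain with the goal at state 5 and compares a direct trajectory with one that bounces around near the goal before finishing; the potential, the naive bonus, and both trajectories are assumptions for illustration.

```python
GAMMA = 0.9
GOAL = 5


def phi(s):
    """Potential: negative distance to the goal on a 1-D chain."""
    return -float(abs(GOAL - s))


def total_pbrs(states, gamma=GAMMA):
    """Discounted sum of gamma*Phi(s') - Phi(s) along a state sequence."""
    return sum(gamma ** t * (gamma * phi(states[t + 1]) - phi(states[t]))
               for t in range(len(states) - 1))


def total_naive(states, gamma=GAMMA, good_state=4):
    """Discounted sum of a +1 bonus for every arrival in one 'good' state."""
    return sum(gamma ** t * (1.0 if states[t + 1] == good_state else 0.0)
               for t in range(len(states) - 1))


def boundary_terms(states, gamma=GAMMA):
    """Closed form of the telescoped sum: gamma^T * Phi(s_T) - Phi(s_0)."""
    horizon = len(states) - 1
    return gamma ** horizon * phi(states[-1]) - phi(states[0])


direct = [0, 1, 2, 3, 4, 5]
lingering = [0, 1, 2, 3, 4, 3, 4, 3, 4, 5]  # revisits state 4 before finishing

for name, traj in (("direct", direct), ("lingering", lingering)):
    print(name, "PBRS:", total_pbrs(traj), "closed form:", boundary_terms(traj),
          "naive bonus:", total_naive(traj))
```

In this toy case both trajectories collect the same potential-based total, because they share start and end states and the potential is zero at the goal. The naive bonus, by contrast, pays the lingering trajectory more.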

How do I choose the potential function Φ?

Pick something that approximates the optimal value $V^{*}(s)$, that is, a heuristic for how good each state is. Negative distance-to-goal, current score, or progress along known subgoals all work. The better $\Phi$ tracks $V^{*}$, the faster training converges. A crude guess carries no risk to correctness, though a misleading one can slow learning.
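To see why $V^{*}$ is the ideal target, substitute $\Phi = V^{*}_{M}$ into the shift identity $Q^{*}_{M'}(s,a) = Q^{*}_{M}(s,a) - \Phi(s)$ (the subscript on $V^{*}$ is added here for clarity):

$$
Q^{*}_{M'}(s, a) = Q^{*}_{M}(s, a) - V^{*}_{M}(s) \le 0,
\qquad
\max_{a} Q^{*}_{M'}(s, a) = 0
$$

In this idealized case every optimal action has shaped value exactly zero and every suboptimal action has a negative one, so no reward has to propagate back from distant states. A heuristic $\Phi$ only approximates this, which is why closer approximations tend to learn faster.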

Key papers and references

Policy Invariance Under Reward Transformations — Ng, Harada & Russell, ICML 1999 — the founding potential-based shaping theorem.
Potential-Based Shaping and Q-Value Initialization are Equivalent — Wiewiora, 2003 — the value-initialization equivalence and action-dependent advice.
Potential-Based Reward Shaping for Intrinsic Motivation — 2024 — framing curiosity bonuses as optimality-preserving potentials.
Improving the Effectiveness of Potential-Based Reward Shaping — 2025 — practical guidance on getting real speedups from PBRS.
Action-Dependent Optimality-Preserving Reward Shaping — 2025 — recent extensions of the guarantee.

