Hindsight Experience Replay (HER) in Sparse Reward Environments

Whether a Reinforcement Learning problem has a sparse or a dense reward regime largely determines how hard it will be for an agent to learn. It is the difference between following a GPS that tells you to “turn left” at every intersection (Dense), versus being dropped in a forest with no map until you accidentally stumble upon a treasure chest (Sparse).

In a sparse reward environment, standard Temporal Difference Learning breaks down. If an agent wanders around for 1,000 steps getting a reward of 0, every Bellman target is also 0, so a Value Function initialized at zero stays a completely flat plain of zeros. There is no learning signal, and the agent has no idea which direction is better.
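The flat value function can be seen in a small experiment. The sketch below is a hypothetical toy setup, not something from a specific benchmark: a 500-state chain with the only reward at the far end, a purely random policy, and tabular Q-learning with assumed values for `alpha` and `gamma`.

```python
import numpy as np

# Hypothetical toy chain: start at state 0, the only reward sits at the far end.
n_states, n_actions = 500, 2
goal = n_states - 1
gamma, alpha = 0.99, 0.5          # assumed discount factor and learning rate

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

s = 0
for _ in range(1000):
    a = int(rng.integers(n_actions))                  # random policy: 0 = left, 1 = right
    s_next = min(max(s + (1 if a == 1 else -1), 0), goal)
    r = 1.0 if s_next == goal else 0.0                # sparse reward: zero everywhere else
    # Standard Q-learning (TD) update
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# Number of table entries that changed from their initial value of zero
print(np.count_nonzero(Q))
```

Because a random walk of 1,000 steps is overwhelmingly unlikely to travel 499 states, every TD target is zero and the table never changes.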

The Trick: Hindsight Experience Replay (HER)

The fundamental trick of HER is equivalent to shooting a basketball, completely missing the hoop, hitting the backboard, and telling yourself: “Actually, my goal was to hit the backboard. I executed that perfectly.”

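By mathematically lying to the agent about what its goal was, HER guarantees that the agent receives at least some non-zero reward signal even from a completely failed episode. The rewards themselves remain sparse; what changes is that every episode now contains a success with respect to some goal.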

1. The Universal Value Function Approximator (UVFA)

Before HER can work, we must change how the agent views the world. In standard RL, the action-value function asks, “What is the value of taking action $a$ in state $s$?” ($Q(s, a)$).

In HER, we use a UVFA. We force the neural network to accept the Goal ($g$) as an explicit input. The question becomes, “What is the value of taking action $a$ in state $s$, given that I am trying to reach goal $g$?”

\[Q(s, a, g)\]

Because the goal is now a variable input, we can swap it out whenever we want.
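As a minimal sketch of what "goal as an input" can look like, the snippet below concatenates state, action, and goal into one vector and feeds it to a tiny NumPy network. The layer sizes, the `init_uvfa` and `q_value` names, and the 2-D state, action, and goal vectors are all illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

def init_uvfa(state_dim, action_dim, goal_dim, hidden=32, seed=0):
    """Random weights for a one-hidden-layer goal-conditioned Q-network (untrained)."""
    rng = np.random.default_rng(seed)
    in_dim = state_dim + action_dim + goal_dim
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, 1)),
        "b2": np.zeros(1),
    }

def q_value(params, s, a, g):
    """Q(s, a, g): the goal is just another block of the input vector."""
    x = np.concatenate([s, a, g], axis=-1)
    h = np.tanh(x @ params["W1"] + params["b1"])
    return (h @ params["W2"] + params["b2"])[..., 0]

params = init_uvfa(state_dim=2, action_dim=2, goal_dim=2)
s = np.array([0.0, 0.0])
a = np.array([1.0, 0.0])

# Same state and action, two different goals: only the goal input is swapped.
q_true = q_value(params, s, a, np.array([1.0, 1.0]))
q_fake = q_value(params, s, a, np.array([0.3, -0.2]))
```

Nothing about the network is specific to one goal, which is what makes relabeling possible: the same parameters answer the question for any $g$ passed in.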

2. The Hindsight Relabeling

Imagine a robotic arm trying to slide a puck to a specific target square ($g_{true}$). The agent misses the target completely, and the puck stops at a useless location ($s_{final}$).

In standard RL, this episode teaches the agent nothing because every reward is $0$. HER looks at that failed episode, creates a synthetic copy of every step, and replaces the true goal with a fake goal: $g_{fake} = s_{final}$.

Now, HER recalculates the reward for these synthetic steps as if $s_{final}$ had been the goal all along. For the last step of the episode, the agent transitioned into $s_{final}$. Since the goal is now also $s_{final}$, the reward function triggers, granting a $+1$ reward!
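The relabeling step can be sketched in a few lines. This is a simplified illustration under stated assumptions: the state is just the puck's 2-D position (so a goal and a state live in the same space), the reward uses the article's $0$ / $+1$ convention with an assumed tolerance `tol`, and the function names, tuple layout, and numbers are hypothetical.

```python
import numpy as np

def sparse_reward(achieved, goal, tol=0.05):
    """+1 if the puck is within `tol` of the goal, else 0 (tolerance is an assumption)."""
    return 1.0 if np.linalg.norm(np.asarray(achieved) - np.asarray(goal)) <= tol else 0.0

def relabel_final(episode):
    """Copy every step of an episode, swapping the goal for the final state reached.

    `episode` is a list of (state, action, next_state, goal) tuples.
    Returns (state, action, reward, next_state, goal) tuples with recomputed rewards.
    """
    g_fake = episode[-1][2]                      # s_final becomes the new goal
    return [
        (s, a, sparse_reward(s_next, g_fake), s_next, g_fake)
        for s, a, s_next, _ in episode
    ]

g_true = (1.0, 1.0)
episode = [                                       # a failed attempt: never gets near g_true
    ((0.0, 0.0), "push_1", (0.2, 0.0), g_true),
    ((0.2, 0.0), "push_2", (0.4, 0.1), g_true),
    ((0.4, 0.1), "push_3", (0.5, 0.3), g_true),
]

original_rewards = [sparse_reward(s_next, g) for _, _, s_next, g in episode]
synthetic = relabel_final(episode)
synthetic_rewards = [r for _, _, r, _, _ in synthetic]

print(original_rewards)    # [0.0, 0.0, 0.0]
print(synthetic_rewards)   # [0.0, 0.0, 1.0]
```

In practice both the original and the synthetic transitions would typically be stored in the replay buffer, so the agent keeps seeing the true goal as well.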

3. Propagating Credit Backward

When this synthetic experience is fed into the Bellman equation, the $+1$ propagates backward through the synthetic trajectory.

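Even if only the final step receives the $+1$, earlier actions get credited for setting up the success via the Bellman update and discount factor ($\gamma$). Take a synthetic trajectory $A \to B \to C \to D \to E$ with relabeled goal $E$ and $\gamma = 0.99$, where each update bootstraps from the value just computed for the next state:

Updating the Win (Final Step): $Q(D, a_4, E) \leftarrow 1 + \gamma (0) = 1.0$
Updating the Setup: $Q(C, a_3, E) \leftarrow 0 + \gamma (1.0) = 0.99$
Updating the Approach: $Q(B, a_2, E) \leftarrow 0 + \gamma (0.99) \approx 0.98$
Updating the First Move: $Q(A, a_1, E) \leftarrow 0 + \gamma (0.98) \approx 0.97$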

The algorithm successfully connects the very first movement to the final outcome, solving the temporal credit assignment problem for that fake goal.
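The backward propagation can be checked numerically. The sketch below assumes a chain $A \to B \to C \to D \to E$ with relabeled goal $E$, $\gamma = 0.99$, a learning rate of 1, and one action per state; the state and action names are illustrative. It sweeps the transitions in reverse order so each update can bootstrap from the one before it. A real replay buffer samples transitions randomly, so the same values would take several passes to settle.

```python
gamma = 0.99
goal = "E"

# Synthetic (relabeled) transitions: (state, action, reward, next_state)
trajectory = [
    ("A", "a1", 0.0, "B"),
    ("B", "a2", 0.0, "C"),
    ("C", "a3", 0.0, "D"),
    ("D", "a4", 1.0, "E"),   # the relabeled goal is reached here
]

Q = {}  # keyed by (state, action, goal); missing entries are treated as 0
for s, a, r, s_next in reversed(trajectory):
    # max over the known actions in the next state; 0 if none (e.g. terminal state E)
    next_values = [q for (state, _, g), q in Q.items() if state == s_next and g == goal]
    bootstrap = max(next_values, default=0.0)
    Q[(s, a, goal)] = r + gamma * bootstrap

for key, value in Q.items():
    print(key, round(value, 4))
```

Each step back multiplies the value by $\gamma$ once more, so the first action ends up with roughly $\gamma^3 \approx 0.97$.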

Generalization

If we teach the agent how to reach random spots it accidentally hit, how does it ever learn to hit the real target?

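Because the neural network is a continuous function approximator taking $(s, a, g)$ as inputs, it generalizes across goals. By learning how to move the puck to the many different coordinates it happened to reach, the network implicitly captures how its actions move the puck, without building an explicit model of the environment’s dynamics. When you finally ask it to get to the true goal, it interpolates from all the “fake” goals it has learned to reach. This works best when the true goal lies within, or close to, the range of goals the agent has already achieved; as those achieved goals spread outward over training, the real target comes within reach.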