The Connection Between Dynamic Programming and Reinforcement Learning

There is a beautiful connection between traditional Dynamic Programming (DP)—like you see on LeetCode—and Temporal Difference (TD) Learning in Reinforcement Learning (RL). They are essentially the exact same mathematical engine operating under different levels of ignorance about the world.

Here is exactly how they map to each other.

1. The Core Engine: Optimal Substructure vs. The Bellman Equation

In a LeetCode DP problem like “Minimum Path Sum,” the core idea is that the best path from point A to point C through point B must contain the optimal path from A to B. You break a large problem into smaller, overlapping subproblems.

In RL, this exact concept is formalized mathematically by the Bellman Equation. It states that the optimal value of being in a current state $V(s)$ is equal to the immediate reward you get, plus the discounted optimal value of wherever you end up next $V(s’)$:

\[V(s) = \max_a \left( R(s, a) + \gamma V(s') \right)\]

2. The Cache: Tabulation vs. The Value Function

In competitive programming, you use a dp[] array or a memoization hash map to store the answers to subproblems so you don’t recompute them.

In RL, this array is called a Value Function ($V$-table) or a Q-Table ($Q(s,a)$). When an RL agent maintains a Q-table, it is literally just building a DP tabulation array incrementally over time.

3. The Big Divide: Knowing the Rules

Here is where the two fields diverge, and why TD Learning has to exist.

LeetCode DP (Model-Based): You are given the exact rules of the environment. If you take a step right, you know with 100% certainty you will move to grid[r][c+1], and you know exactly what it costs. Because you have a perfect “model,” you can calculate the entire dp[] array offline, in one shot, using a nested for loop.
TD Learning (Model-Free): Imagine trying to solve “Minimum Path Sum,” but the grid is invisible. You don’t know the dimensions, you don’t know the costs, and sometimes when you try to step right, the wind blows you down instead (stochastic transitions).

You can’t write a deterministic for loop to solve an invisible, unpredictable grid. You have to put an agent inside it and let it walk around.

4. Bootstrapping: The TD Update

Because the RL agent doesn’t know the rules upfront, it has to guess the dp[] values and update them as it explores.

Because it approximates the core DP processes using function approximation and sampling, TD Learning (and much of modern RL) is often mathematically classified as Approximate Dynamic Programming (or Neuro-Dynamic Programming). Temporal Difference Learning is essentially Dynamic Programming done through sampling.

Instead of computing the exact value using a perfect formula, a TD agent takes one step, looks at the immediate reward it just got, looks at its current guess for the state it landed in, and updates its previous state’s value backwards.

Both methods use bootstrapping — updating an estimate based on another estimate.

DP Bootstrapping (Exact): dp[i] = cost[i] + dp[i+1]
TD Bootstrapping (Sampled): $V(s) \leftarrow V(s) + \alpha [R + \gamma V(s’) - V(s)]$

The term in the brackets is the TD Error — the difference between what the agent thought dp[i] was going to be, and what it actually experienced after taking a step.

From Stochastic RL to Deterministic DP

You might think using the formal Bellman equation would make classic DP problems easier to solve. Ironically, applying strict RL math to competitive programming is overkill.

The full Bellman Optimality Equation handles the chaos of the real world. It looks like this:

\[V^*(s) = \max_a \sum_{s', r} p(s', r | s, a) \left[ r + \gamma V^*(s') \right]\]

To use this on a standard CP problem, you’d have to account for machinery that simply doesn’t exist in LeetCode:

Transition Probabilities ($p$): In CP, if you take a step right, you go right 100% of the time. You don’t need probability distributions.
The Discount Factor ($\gamma$): CP problems are finite (you reach the end of the array or the target sum). You don’t need to discount future rewards to prevent infinite loops.
Expected Values ($\sum$): Because there is no randomness in CP, there is no need to calculate expected values across multiple possible next states.

When you strip all that stochastic RL machinery away, the full Bellman equation collapses into a simple deterministic recurrence relation:

\[V(s) = \max_a \left( \text{cost} + V(s') \right)\]

This is exactly the core logic you are already writing in deterministic competitive programming!

Using the RL Mindset

While the full math is overkill, what you should steal from RL is the mental model.

Many people struggle with classic DP because they try to think about it in terms of manipulating a 2D array (dp[i][j]). If you instead think like an RL agent acting in an environment, the recursive logic writes itself.

Define these three core concepts:

1. How to Find the State ($s$)

Finding the right state is often the hardest part of DP. In RL theory, a valid state must satisfy the Markov Property. This means the state must be a sufficient statistic of the entire history. In plain English: if you were dropped into this state out of nowhere with zero memory of the past, do you have 100% of the information needed to make the optimal next move?

To find the minimal state variables, ask yourself:

Position: Where am I currently evaluating? (e.g., index i in an array, node u in a graph).
Resources/Constraints: What limits my future actions? (e.g., k capacity left in a knapsack, e eggs remaining).
Historical Context: Do I need to remember a specific past choice that restricts future choices? (e.g., is_holding_stock boolean flag, or a bitmask of visited nodes).

These minimal variables become the exact parameters of your recursive DP function.

2. The Actions ($a$)

What are the valid, mutually exclusive moves I can make from this specific state? (e.g., “Take the item” or “Skip the item”, “Drop egg from floor $x$”).

3. The Reward/Cost ($r$)

What is the immediate consequence of taking this action? (e.g., “I gain the item’s value”, “I spend 1 attempt”).

Example: The Knapsack Problem

Let’s start from our deterministic Bellman equation: $V(s) = \max_a \left( \text{Reward}(s, a) + V(s') \right)$

Instead of trying to visualize a grid and figure out the array transition dp[i][w] = max(dp[i-1][w], dp[i-1][w-wt[i]] + val[i]), you define the state-action transitions:

State ($s$): (index, remaining_weight)
Action 1 (Skip): $s’ = $ (index + 1, remaining_weight). Reward = 0.
Action 2 (Take): $s’ = $ (index + 1, remaining_weight - wt[index]). Reward = val[index]. (Only valid if remaining_weight >= wt[index])

Now, plug these specific definitions directly into the Bellman equation. We evaluate the maximum over our two possible actions:

\[V(\text{index}, \text{weight}) = \max \begin{cases} 0 + V(\text{index} + 1, \text{weight}) & \text{(Skip)} \\ \text{val}[\text{index}] + V(\text{index} + 1, \text{weight} - \text{wt}[\text{index}]) & \text{(Take)} \end{cases}\]

If you change the notation from the recursive function $V(\dots)$ to the array $dp[\dots]$, you get the exact classical DP tabulation formula!

Example: The Egg Drop Problem

Consider the classic “Egg Drop” problem: You need to find the minimum number of attempts required in the worst-case scenario to find the highest floor from which an egg can be dropped without breaking, given a building of $n$ floors and $k$ eggs.

Instead of changing our base $\max$ equation to handle “costs”, we can stick with the standard Bellman equation by simply using negative rewards!

\[V(s) = \max_a \left( \text{Reward}(s, a) + V(s') \right)\]

Let’s define the components:

State ($s$): The number of eggs you have left and the number of floors you still need to test. $s = (k, n)$.
Action ($a$): Choose to drop an egg from any floor $x$ between 1 and $n$.
Reward: Every action (drop) costs 1 attempt, so we give a penalty: $\text{Reward}(s, a) = -1$.
The Next State ($s’$): The environment is adversarial. It wants to give us the worst-case scenario, which means it will choose the outcome that minimizes our overall reward (gives us the most negative score). The two possible physics outcomes are:
- The egg breaks: The critical floor must be below $x$. $s’_{break} = (k-1, x-1)$.
- The egg survives: The critical floor is above $x$. $s’_{survive} = (k, n-x)$.

So, for any action $x$, the adversarial environment minimizes the value of the next state: $V(s') = \min \left( V(k-1, x-1), V(k, n-x) \right)$

Now, plug the negative Reward and $V(s’)$ back into our standard Bellman equation (where our agent maximizes the value against the environment):

\[V(k, n) = \max_{1 \le x \le n} \left( -1 + \min \left( V(k-1, x-1), V(k, n-x) \right) \right)\]

If we factor out the negative sign to convert our “negative reward” $V(k,n)$ back into a positive “number of attempts” (let’s call it $A(k,n)$ where $A(k,n) = -V(k,n)$), the $\max \min$ structurally flips into a $\min \max$!

$-A(k,n) = \max_{1 \le x \le n} \left( -1 - \max \left( A(k-1, x-1), A(k, n-x) \right) \right)$ $-A(k, n) = - \min_{1 \le x \le n} \left( 1 + \max \left( A(k-1, x-1), A(k, n-x) \right) \right)$ $A(k, n) = 1 + \min_{1 \le x \le n} \max \left( A(k-1, x-1), A(k, n-x) \right)$

By mathematically framing the problem as States, Actions, and negative Rewards, a notoriously tricky problem turns into a simple substitution exercise, without ever needing to change the core Bellman equation!

Why DP Uses $V(s)$ and RL Needs $Q(s, a)$

You might notice that in our DP examples, we only ever used the State Value Function, $V(s)$. In contrast, many popular RL algorithms (like Q-Learning) revolve around the Action-Value Function, $Q(s, a)$. Why the difference?

The choice between $V(s)$ and $Q(s, a)$ comes down entirely to whether or not you know the rules of the game (the transition model).

In Model-Based DP (LeetCode): You know exactly the physics of the environment. If you are in state $s$ and consider taking action $a$, you can simply calculate the resulting next state $s’$. Because you can mathematically “look ahead” into the future, you only need to store the value of being in a state: $V(s)$. You evaluate actions on the fly by peeking at $V(s’)$.
In Model-Free RL: The agent is blind. If the agent is in state $s$ and considers taking action $a$, it has no idea what state $s’$ will result from that action. If the agent only stored $V(s)$, it wouldn’t know which action to pick because it can’t simulate the future!

To solve this blindness, Model-Free RL stores $Q(s, a)$ — the value of taking action $a$ in state $s$. By caching the value of the action itself, the agent can simply pick $\max_a Q(s, a)$ without needing to know why it’s good or what state it leads to.

In short: if you have the formula for the environment, $V(s)$ is sufficient and saves memory. If you are exploring an unknown environment, you must use $Q(s, a)$.

When Does the Bellman Equation Hold?

The Bellman equation holds true if, and only if, your problem environment can be perfectly modeled as a Markov Decision Process (MDP). Your problem must obey two unbreakable laws:

The Markov Property (Memorylessness): The future depends only on the present state, not on the history of how you got there. If you were dropped into a state with zero memory of the past, you should have 100% of the information needed to make the optimal next move. (e.g., Chess holds this property; Poker fails it due to hidden information and past betting sequences).
The Principle of Optimality (Optimal Substructure): An optimal policy must contain optimal sub-policies. If the best path from city A to city C goes through city B, then the segment from B to C must be the absolute best way to get from B to C.