How to Calculate $TD(\lambda)$ in Reinforcement Learning

Estimating state values in reinforcement learning is a hard task, since the sampling process is very noisy. $TD(\lambda)$ learning tries to use as much information as possible when estimating the state value. In this note, I briefly derive $TD(\lambda)$ and give pseudo-code for computing it.


Definitions in Reinforcement Learning

We mainly regard the reinforcement learning process as a Markov Decision Process (MDP): an agent interacts with the environment by making a decision at every timestep, moves to the next state, and receives a reward. This produces a state-action chain like

$$\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \mathbf{a}_1, r_1, \mathbf{s}_2, \cdots$$

where $\mathbf{s}_t$ is the agent's state at time $t$, $\mathbf{a}_t$ is the decision made by the agent at time $t$ in state $\mathbf{s}_t$, and $r_t$ is the reward the agent receives between states $\mathbf{s}_t$ and $\mathbf{s}_{t+1}$ after taking action $\mathbf{a}_t$.
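In code, such a sample chain is usually stored as parallel lists. A minimal sketch (the names `states`, `actions`, and `rewards` are purely illustrative):

```python
# One sampled chain s0, a0, r0, s1, a1, r1, ... kept as parallel lists.
states  = ["s0", "s1", "s2", "s3"]   # states[t] is the state at time t
actions = ["a0", "a1", "a2"]         # actions[t] is chosen in states[t]
rewards = [1.0, 0.0, 2.0]            # rewards[t] arrives between states[t] and states[t+1]
```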

To evaluate the quality of a policy $\pi(\mathbf{a}_t|\mathbf{s}_t)$, the accumulated reward for state $\mathbf{s}_T$ with decay ratio $\gamma$ is defined as follows:

$$R(\mathbf{s}_T) = \mathbb{E}_{\pi}\left[ \sum_{t=T}^{\infty} \gamma^{\,t-T}\, r_t \right]$$
Also, the $Q$ value for state $\mathbf{s}_T$ after taking action $\mathbf{a}_T$ is defined as

$$Q(\mathbf{s}_T, \mathbf{a}_T) = \mathbb{E}_{\pi}\left[ \sum_{t=T}^{\infty} \gamma^{\,t-T}\, r_t \,\middle|\, \mathbf{s}_T, \mathbf{a}_T \right]$$
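As a concrete, single-sample illustration of the definition, the discounted sum of rewards along one sampled trajectory can be computed like this. Note this is a Monte-Carlo estimate of $R(\mathbf{s}_T)$, not the expectation itself:

```python
def discounted_return(rewards, gamma):
    """Discounted reward sum r_T + gamma*r_{T+1} + ... from one sampled trajectory.

    A single-sample (Monte-Carlo) estimate of the accumulated reward R(s_T);
    the definition above takes an expectation over many such samples.
    """
    R = 0.0
    for r in reversed(rewards):   # fold backwards: R_t = r_t + gamma * R_{t+1}
        R = r + gamma * R
    return R

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```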

Bellman Equation and Update Rules

Assume we have an accurate estimate of the accumulated reward $R$ under policy $\pi(\mathbf{a}_t|\mathbf{s}_t)$, denoted by $V_\pi^*$; then we have an equilibrium (the Bellman equation) like

$$V_\pi^*(\mathbf{s}_t) = \mathbb{E}_{\pi}\left[ r_t + \gamma\, V_\pi^*(\mathbf{s}_{t+1}) \right]$$

or, equivalently,

$$\mathbb{E}_{\pi}\left[ r_t + \gamma\, V_\pi^*(\mathbf{s}_{t+1}) - V_\pi^*(\mathbf{s}_t) \right] = 0$$
However, an accurate estimate is exactly what we do not have, so we need to improve our current estimate using samples. Assume we have a series of functions $V_\pi^{(n)}$ that approach $V_\pi^*$; then we can update the estimate with a step size $\alpha$ by

$$V_\pi^{(n+1)}(\mathbf{s}_t) = V_\pi^{(n)}(\mathbf{s}_t) + \alpha \left( \mathbb{E}_{\pi}\left[ r_t + \gamma\, V_\pi^{(n)}(\mathbf{s}_{t+1}) \right] - V_\pi^{(n)}(\mathbf{s}_t) \right)$$
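For a tabular estimate, one application of this update (with the expectation replaced by a single sampled transition) might look like the sketch below; the dictionary `V`, the step size `alpha`, and the transition `(s, r, s_next)` are illustrative assumptions, not part of the derivation above:

```python
def td_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Move the tabular estimate V[s] toward the bootstrapped target r + gamma*V[s_next].

    V is a dict mapping states to value estimates; alpha is the step size.
    For a terminal transition, the target is just the reward r.
    """
    target = r if terminal else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

# Example: one update on a two-state table.
V = {"s0": 0.0, "s1": 1.0}
td_update(V, "s0", r=0.5, s_next="s1", alpha=0.1, gamma=0.9)
print(V["s0"])  # 0.0 + 0.1 * (0.5 + 0.9*1.0 - 0.0) = 0.14
```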

Derivation of $TD(\lambda)$

$TD(n)$ learning

From a chain of samples, we can update the value of state $\mathbf{s}_t$ by

$$V(\mathbf{s}_t) \leftarrow V(\mathbf{s}_t) + \alpha \left( r_t + \gamma\, V(\mathbf{s}_{t+1}) - V(\mathbf{s}_t) \right)$$

But what about the other states further along the chain? The Bellman equation can also be written as any of the following:

$$\begin{aligned} V_\pi^*(\mathbf{s}_t) &= \mathbb{E}_{\pi}\left[ r_t + \gamma\, r_{t+1} + \gamma^2\, V_\pi^*(\mathbf{s}_{t+2}) \right] \\ &= \mathbb{E}_{\pi}\left[ r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \gamma^3\, V_\pi^*(\mathbf{s}_{t+3}) \right] \\ &= \cdots \end{aligned}$$
Define $G_t^{(n)}$ as the estimated state value that looks $n$ steps further:

$$G_t^{(n)} = r_t + \gamma\, r_{t+1} + \cdots + \gamma^{n-1}\, r_{t+n-1} + \gamma^{n}\, V(\mathbf{s}_{t+n})$$
Using $G^{(n)}$ as the update target takes $n$ steps into account; we call this $TD(n)$ learning (Temporal-Difference learning).
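A minimal sketch of computing $G_t^{(n)}$ from recorded lists of rewards and value estimates (the argument names are illustrative):

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n discounted rewards starting at time t, then bootstrap with V(s_{t+n}).

    rewards[i] is r_i, values[i] is the current estimate V(s_i);
    assumes t + n does not run past the recorded chain.
    """
    G = values[t + n]                    # bootstrap term V(s_{t+n})
    for k in reversed(range(t, t + n)):  # fold backwards: G_k = r_k + gamma * G_{k+1}
        G = rewards[k] + gamma * G
    return G
```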

$TD(\lambda)$ learning

What if we combine all of these $G^{(n)}$? By taking a weighted average with weights $(1-\lambda)\lambda^{n-1}$ (which sum to one), we get $G^{(\lambda)}$, an estimate of $R$ that takes every $n$-step return into consideration:

$$G_t^{(\lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}$$
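One common convention for a finite chain is to truncate this sum and give the leftover weight $\lambda^{N-1}$ to the longest available return (this is also what the pseudo-code at the end of this note computes). A direct, if inefficient, sketch reusing `n_step_return` from above:

```python
def lambda_return(rewards, values, t, lam, gamma):
    """G_t^(lambda) as an explicit weighted average of n-step returns.

    values has length len(rewards) + 1; values[-1] should be 0 for a terminal state.
    The tail weight lam**(N-1) is assigned to the longest available return.
    """
    N = len(rewards) - t                  # number of n-step returns available from time t
    G_lam = 0.0
    for n in range(1, N):
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lam += lam ** (N - 1) * n_step_return(rewards, values, t, N, gamma)
    return G_lam
```

This direct sum recomputes every $n$-step return from scratch; the derivation below leads to a much cheaper backward recursion.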
So how do we calculate $G_t^{(\lambda)}$ once we know the sample chain $\{\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \cdots\}$? Here is the derivation. (Lemma: $G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)}$; I leave the proof to the reader.)

$$\begin{aligned} G_t^{(\lambda)} &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)} \\ &= (1-\lambda)\, G_t^{(1)} + (1-\lambda) \sum_{n=2}^{\infty} \lambda^{n-1} \left( r_t + \gamma\, G_{t+1}^{(n-1)} \right) \\ &= (1-\lambda) \left( r_t + \gamma\, V_{t+1} \right) + \lambda\, r_t + \gamma \lambda\, (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_{t+1}^{(n)} \\ &= r_t + \gamma (1-\lambda)\, V_{t+1} + \gamma \lambda\, G_{t+1}^{(\lambda)} \end{aligned}$$

where $V_t$ is shorthand for $V(\mathbf{s}_t)$.
Define the advantage function $\delta_t^{(n)} = G_t^{(n)} - V_t$ (and analogously $\delta_t^{(\lambda)} = G_t^{(\lambda)} - V_t$); then we get the recurrence relation

$$\delta_t^{(\lambda)} = r_t + \gamma\, V_{t+1} - V_t + \gamma \lambda \left( G_{t+1}^{(\lambda)} - V_{t+1} \right) = \delta_t^{(1)} + \gamma \lambda\, \delta_{t+1}^{(\lambda)}$$

so each $G_t^{(\lambda)} = V_t + \delta_t^{(\lambda)}$ can be computed in a single backward pass over the chain.

Pseudo-code for Computing $TD(\lambda)$

Given a sample chain $\{\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \cdots, \mathbf{s}_n\}$, the recurrence above translates directly into the following code:

```python
END = None  # sentinel marking a terminal state at the end of the chain

def TD_lambda(s, r, value_func, gamma, lam):
    """
    Inputs:
        s           list of n+1 states (s[n] is the last, possibly terminal, state)
        r           list of n rewards; r[i] is received between s[i] and s[i+1]
        value_func  function predicting the accumulated reward of a state
        gamma       reward decay factor
        lam         TD(lambda) parameter
    Outputs:
        TD_lam      list of TD(lambda) targets, one for each state s[0..n-1]
    """
    n = len(r)
    V = [value_func(s[i]) for i in range(n)]            # predicted value for each state
    V.append(0.0 if s[n] is END else value_func(s[n]))  # predicted value for the last state

    delta_1 = [0.0] * n          # one-step TD errors delta^(1)
    delta_lam = [0.0] * (n + 1)  # delta^(lambda), with delta_lam[n] = 0
    TD_lam = [0.0] * n

    for i in reversed(range(n)):  # compute in reverse order
        # one-step TD error for state s[i]
        delta_1[i] = r[i] + gamma * V[i + 1] - V[i]
        # recurrence: delta^(lambda)_i = delta^(1)_i + gamma * lam * delta^(lambda)_{i+1}
        delta_lam[i] = delta_1[i] + gamma * lam * delta_lam[i + 1]
        # TD(lambda) target: G^(lambda)_i = V[i] + delta^(lambda)_i
        TD_lam[i] = V[i] + delta_lam[i]

    return TD_lam
```
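A quick check on a toy chain, using a hypothetical value function that always predicts 0 (so every target reduces to a $\lambda$-weighted discounted reward sum); the direct `lambda_return` sketch from earlier gives the same numbers:

```python
states = ["s0", "s1", "s2", END]   # END marks a terminal state
rewards = [1.0, 0.0, 2.0]
value_func = lambda s: 0.0         # hypothetical value function

print(TD_lambda(states, rewards, value_func, gamma=0.9, lam=0.8))
# -> approximately [2.0368, 1.44, 2.0]

# Same targets from the explicit weighted average of n-step returns:
V = [0.0, 0.0, 0.0, 0.0]
print([lambda_return(rewards, V, t, lam=0.8, gamma=0.9) for t in range(3)])
```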

References

  1. Wikipedia: Temporal difference learning
  2. Alister Reis's blog: Reinforcement Learning: Eligibility Traces and TD(lambda)
  3. David Silver's lecture slides
  4. Richard S. Sutton's $TD(\lambda)$ paper: Learning to Predict by the Methods of Temporal Differences