Estimating state values in reinforcement learning is hard, since the sampling process is very noisy. $TD(\lambda)$ learning tries to use as much of the sampled information as possible to estimate state values. In this note, I briefly derive $TD(\lambda)$ and give pseudocode for computing it.
- Definitions in Reinforcement Learning
- Bellman Equation and Update Rules
- Derivation of $TD(\lambda)$
- Pseudo Code for Computing $TD(\lambda)$
- Reference
Definitions in Reinforcement Learning
We model the reinforcement learning process as a Markov Decision Process (MDP): at every timestep the agent makes a decision, transitions to the next state, and receives a reward. This produces a state-action chain like

$$\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \mathbf{a}_1, r_1, \mathbf{s}_2, \mathbf{a}_2, r_2, \cdots$$
where $\mathbf{s}_t$ is the agent's state at time $t$, $\mathbf{a}_t$ is the action the agent takes at time $t$ in state $\mathbf{s}_t$, and $r_t$ is the reward received after action $\mathbf{a}_t$ is taken, i.e., between states $\mathbf{s}_t$ and $\mathbf{s}_{t+1}$.
To evaluate the quality of a policy $\pi(\mathbf{a}_t|\mathbf{s}_t)$, the accumulated reward for state $\mathbf{s}_T$ with decay ratio $\gamma$ is defined as

$$R_T = \sum_{t=T}^{\infty} \gamma^{\,t-T} r_t, \qquad V_\pi(\mathbf{s}_T) = \mathbb{E}_\pi\left[ R_T \mid \mathbf{s}_T \right].$$
Similarly, the $Q$ value for state $\mathbf{s}_T$ after taking action $\mathbf{a}_T$ is defined as

$$Q_\pi(\mathbf{s}_T, \mathbf{a}_T) = \mathbb{E}_\pi\left[ R_T \mid \mathbf{s}_T, \mathbf{a}_T \right].$$
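To make the decay ratio concrete, here is a tiny self-contained Python sketch that computes the accumulated reward of a finite reward list (the helper name `discounted_return` and the example numbers are my own illustration, not from the original note):

```python
def discounted_return(rewards, gamma):
    """Accumulated reward R_T = sum_t gamma^(t-T) * r_t for a finite reward list."""
    ret = 0.0
    # Iterate backwards so each earlier step picks up one extra factor of gamma.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```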
Bellman Equation and Update Rules
Assume we have an accurate estimate of the accumulated reward $R$ under policy $\pi(\mathbf{a}_t|\mathbf{s}_t)$, denoted by $V_\pi^*$. Since $R_t = r_t + \gamma R_{t+1}$, taking expectations gives an equilibrium (the Bellman equation):

$$V_\pi^*(\mathbf{s}_t) = \mathbb{E}_\pi\left[ r_t + \gamma V_\pi^*(\mathbf{s}_{t+1}) \right],$$

or, written for the $Q$ value,

$$Q_\pi^*(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_\pi\left[ r_t + \gamma V_\pi^*(\mathbf{s}_{t+1}) \mid \mathbf{s}_t, \mathbf{a}_t \right].$$
However, this accurate estimate is exactly what we do not know, so we must improve the accuracy of our current estimate from samples. Assume we have a sequence of functions $V_\pi^{(n)}$ that approaches $V_\pi^*$; then we can update the estimate with learning rate $\alpha$ by

$$V_\pi^{(n+1)}(\mathbf{s}_t) = V_\pi^{(n)}(\mathbf{s}_t) + \alpha \left[ r_t + \gamma V_\pi^{(n)}(\mathbf{s}_{t+1}) - V_\pi^{(n)}(\mathbf{s}_t) \right].$$
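As a quick illustration of this update for a tabular value function, here is a minimal sketch; the dictionary-based `V`, `alpha`, and the state names are assumptions of mine, not part of the original note:

```python
from collections import defaultdict

V = defaultdict(float)   # tabular value estimates, default 0.0
alpha, gamma = 0.1, 0.99

def td0_update(s, r, s_next):
    """One-step TD update: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])

td0_update(s="s0", r=1.0, s_next="s1")
```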
Derivation of $TD(\lambda)$
$TD(n)$ learning
From a chain of samples, we can update the state value by

$$V(\mathbf{s}_t) \leftarrow V(\mathbf{s}_t) + \alpha \left[ r_t + \gamma V(\mathbf{s}_{t+1}) - V(\mathbf{s}_t) \right].$$

But what about the other states in the chain? The Bellman equation can also be written in any of the following forms:

$$V_\pi^*(\mathbf{s}_t) = \mathbb{E}_\pi\left[ r_t + \gamma r_{t+1} + \gamma^2 V_\pi^*(\mathbf{s}_{t+2}) \right]$$
$$V_\pi^*(\mathbf{s}_t) = \mathbb{E}_\pi\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V_\pi^*(\mathbf{s}_{t+3}) \right]$$
$$\cdots$$
Define $G_t^{(n)}$ as the estimated state value that looks $n$ steps further:

$$G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(\mathbf{s}_{t+n}).$$
Using $G^{(n)}$ as the update target takes $n$ steps into account; we call this $TD(n)$ learning (Temporal Difference learning).
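A minimal sketch of computing the $n$-step target $G_t^{(n)}$ from a recorded chain, assuming `rewards[t]` is $r_t$, `states[t]` is $\mathbf{s}_t$, and `V` is a callable returning the current value estimate of a state (the names are illustrative, not from the original note):

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """G_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n})."""
    n = min(n, len(rewards) - t)          # truncate if the chain ends before n steps
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * V(states[t + n])
```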
$TD(\lambda)$ learning
What if we combine all of these $G^{(n)}$? By taking a weighted average with weights $(1-\lambda)\lambda^{n-1}$, we get

$$G_t^{(\lambda)} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},$$

an estimate of $R$ that blends every $n$-step target into a single value.
So, how do we calculate $G_t^{(\lambda)}$ when we know the sample chain $\{\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \cdots\}$? Here is the derivation. (Lemma: $G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)}$; I leave the proof to the reader.) Writing $V_t$ for $V(\mathbf{s}_t)$ and applying the lemma to every term with $n \ge 2$,

$$
\begin{aligned}
G_t^{(\lambda)} &= (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
  = (1-\lambda)\left[ G_t^{(1)} + \sum_{n=2}^{\infty} \lambda^{n-1}\left( r_t + \gamma G_{t+1}^{(n-1)} \right) \right] \\
&= r_t + \gamma(1-\lambda) V_{t+1} + \gamma\lambda\,(1-\lambda)\sum_{m=1}^{\infty} \lambda^{m-1} G_{t+1}^{(m)} \\
&= r_t + \gamma(1-\lambda) V_{t+1} + \gamma\lambda\, G_{t+1}^{(\lambda)},
\end{aligned}
$$

where the second line uses $G_t^{(1)} = r_t + \gamma V_{t+1}$ and $(1-\lambda)\sum_{n\ge 1}\lambda^{n-1} = 1$.
Define the advantage $\delta_t^{(n)} = G_t^{(n)} - V_t$ (so the one-step TD error is $\delta_t^{(1)} = r_t + \gamma V_{t+1} - V_t$) and likewise $\delta_t^{(\lambda)} = G_t^{(\lambda)} - V_t$. Subtracting $V_t$ from both sides above gives the recurrent relation

$$\delta_t^{(\lambda)} = \delta_t^{(1)} + \gamma \lambda \, \delta_{t+1}^{(\lambda)},$$

which can be evaluated backwards along the chain.
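As a quick numerical sanity check (a toy example of mine, not from the original note), on a two-step chain the truncated weighted-average definition and the recurrence give the same $G_0^{(\lambda)}$:

```python
# Hypothetical 2-step chain: s0 -r0-> s1 -r1-> s2, with current value estimates V.
gamma, lam = 0.9, 0.8
r = [1.0, 2.0]
V = [0.0, 0.5, 1.5]                              # V[t] is the current estimate for s_t

G1 = r[0] + gamma * V[1]                         # one-step target G_0^(1)
G2 = r[0] + gamma * r[1] + gamma ** 2 * V[2]     # two-step target G_0^(2)
direct = (1 - lam) * G1 + lam * G2               # weighted average, truncated at n = 2

delta0 = r[0] + gamma * V[1] - V[0]              # delta_0^(1)
delta1 = r[1] + gamma * V[2] - V[1]              # chain ends here, so delta_1^(lambda) = delta_1^(1)
recurrent = V[0] + delta0 + gamma * lam * delta1 # G_0^(lambda) = V_0 + delta_0^(lambda)

print(direct, recurrent)                         # both print ~3.502
```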
Pseudo Code for Computing $TD(\lambda)$
When we have a sample chain $\{\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \mathbf{a}_1, r_1, \cdots, \mathbf{s}_T\}$, the recurrence above lets us compute every $\delta_t^{(\lambda)}$, and hence every target $G_t^{(\lambda)}$, in a single backward pass over the chain.
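Only the signature `def TD_lambda(s, r, value_func, gamma, lam)` survives from the original listing, so the body below is my own sketch. It assumes `s` is the list of visited states, `r` the list of rewards with `len(r) == len(s) - 1`, and `value_func` a callable returning the current estimate $V(\mathbf{s})$; it returns the $\lambda$-return target for every state in the chain.

```python
def TD_lambda(s, r, value_func, gamma, lam):
    """Compute the lambda-return target G_t^(lambda) for every state in a sample chain.

    s          : list of states  [s_0, ..., s_T]
    r          : list of rewards [r_0, ..., r_{T-1}]
    value_func : callable, value_func(state) -> current estimate V(state)
    gamma      : decay ratio
    lam        : lambda, the weighting factor of TD(lambda)
    """
    T = len(r)
    V = [value_func(state) for state in s]        # cache V(s_t) for t = 0..T

    # Backward pass over delta_t^(lambda) = delta_t^(1) + gamma * lam * delta_{t+1}^(lambda)
    targets = [0.0] * T
    delta_lam = 0.0                               # delta^(lambda) is zero past the end of the chain
    for t in reversed(range(T)):
        delta_1 = r[t] + gamma * V[t + 1] - V[t]  # one-step TD error
        delta_lam = delta_1 + gamma * lam * delta_lam
        targets[t] = V[t] + delta_lam             # G_t^(lambda) = V_t + delta_t^(lambda)
    return targets
```

The returned targets can then be plugged into the update rule from earlier, e.g. `V[s_t] += alpha * (targets[t] - V[s_t])` for a tabular value function, or used as regression targets when $V$ is a function approximator.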
Reference
- Wikipedia Page: Temporal difference learning
- Alister Reis’s blog: Reinforcement Learning: Eligibility Traces and TD(lambda)
- David Silver’s slides
- Richard S. Sutton’s paper on $TD(\lambda)$: Learning to Predict by the Methods of Temporal Differences