Equivalence of Forward and Backward view in TD($\lambda$) method
To Do:
- MC Estimate
- TD Estimate
- n-step TD estimate
- TD($\lambda$) method
- SARSA algorithm
- Equivalence of FV and BV
The temporal difference (TD) error from eligibility trace update rule is,
\[\Delta V_t^{TD}(s) = \alpha \delta_t e_t(s)\]where, \(\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\) and \(e_t(s) = \gamma \lambda e_{t-1}(s) + I_{ss_t}\) with \(I_{ss_t} = 1\) if \(s=s_t\) and \(0\) otherwise.
Expanding \(e_t(s)\) we obtain
\(e_t(s) = I_{ss_t} + \gamma \lambda e_{t-1}(s)\) \(e_t(s) = I_{ss_t} + \gamma \lambda \left(\gamma \lambda e_{t-2}(s) + I_{ss_{t-1}}\right)\) \(e_t(s) = I_{ss_t} + \gamma \lambda I_{ss_{t-1}} + \gamma^2 \lambda^2 I_{ss_{t-2}} + \gamma^3 \lambda^3 I_{ss_{t-3}} + \dots\) \(e_t(s) = \sum_{k=0}^t (\gamma \lambda)^{t-k} I_{ss_k}\)
The sum of TD errors over the trajectory in backward view for a state $s$ is,
\(\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha \delta_t e_t(s)\) \(\sum*{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum*{t=0}^{T-1} \alpha \delta*t \left( \sum*{k=0}^{t} (\gamma \lambda)^{t-k} I\_{ss_k}\right)\)
Interchanging \(t\) with \(k\) and \(k\) with \(t\) on the RHS, we obtain
\[\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{k=0}^{T-1} \alpha \delta_t \left( \sum_{t=0}^{k} (\gamma \lambda)^{k-t} I_{ss_k}\right)\]Expanding the RHS results in,
\(\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \alpha \delta_0 (\gamma \lambda)^0 I_{ss_0} + \delta_1 ( \gamma \lambda I_{ss_0} + (\gamma \lambda)^0 I_{ss_1}) +\) \(+ \delta*2 ( \gamma^2 \lambda^2 I*{ss*0} + \gamma \lambda I*{ss*1} + (\gamma \lambda)^0 I*{ss_2})\)
\[+ \delta*3 ( \gamma^3 \lambda^3 I*{ss*0} + \gamma^2 \lambda^2 I*{ss*1} + \gamma \lambda I*{ss*2} + (\gamma \lambda)^0 I*{ss_3}) + \dots\] \[= \alpha \left( I*{ss_0} \sum*{k=0}^{T-1} (\gamma \lambda)^k \delta*k + I*{ss*1} \sum*{k=1}^{T-1} (\gamma \lambda)^{k-1} \delta_k + \dots \right)\] \[= \alpha \sum*{t=0}^{T-1} I*{ss*t} \left( \sum*{k=t}^{T-1} (\gamma \lambda)^{k-t} \delta_k \right)\]Now, from the foward view the update rule is,
\[\Delta V_t^{\lambda}(s_t) = \alpha (L_t^{\lambda} - V_t(s_t))\]where, \(L_t^{\lambda}\) is the weighted sum of n-step returns
Dividing the above equation by $\alpha$, we get
\(\frac{1}{\alpha} \Delta V_t^{\lambda}(s_t) = L_t^{\lambda} - V_t(s_t)\) \(= - V*t(s_t) + (1-\lambda) (R*{t+1} + \gamma V*t(s*{t+1})) + (1-\lambda)\)
Enjoy Reading This Article?
Here are some more articles you might like to read next: