Artificial Intelligence - Why is the definition of Retrace inconsistent? - 吾爱随笔录

In Section 4.3 of the paper Learning by Playing - Solving Sparse Reward Tasks from Scratch, the authors define Retrace as

$Q^{ret}=\sum_{j=i}^{\infty}\left(\gamma^{j-i}\prod_{k=i}^{j}c_k\right)\left[r(s_j,a_j)+\delta_Q(s_i,s_j)\right],$

$\delta_Q(s_i,s_j)=\mathbb E_{\pi_{\theta'}(a|s)}\left[Q^\pi(s_i,\cdot;\phi')\right]-Q^\pi(s_j,a_j;\phi'),$

$c_k=\min\left(1,\frac{\pi_{\theta'}(a_k|s_k)}{b(a_k|s_k)}\right),$

where I have omitted $\mathcal T$ for simplicity. I am confused by this definition of $Q^{ret}$, which seems inconsistent with Retrace as defined in Safe and Efficient Off-Policy Reinforcement Learning:

$\mathcal RQ(x,a):=Q(x,a)+\mathbb E_\mu\left[\sum_{t\ge0}\gamma^t\left(\prod_{s=1}^{t}c_s\right)\left(r_t+\gamma\mathbb E_\pi Q(x_{t+1},\cdot)-Q(x_t,a_t)\right)\right]$
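To make the second definition concrete, here is a minimal sketch of a single-trajectory sample estimate of $\mathcal RQ(x_0,a_0)$, truncated to a finite number of steps. It is only an illustration under stated assumptions: the function name `retrace_target` and its argument names are hypothetical, the expectation over $\mu$ is replaced by one sampled trajectory, and the traces are taken to be $c_s=\min(1,\pi/\mu)$, mirroring the $c_k$ of the first definition.

```python
import numpy as np

def retrace_target(q_taken, v_next, rewards, pi_probs, mu_probs, gamma=0.99):
    """Sample estimate of RQ(x_0, a_0) from one finite trajectory.

    q_taken[t]  : Q(x_t, a_t)
    v_next[t]   : E_pi Q(x_{t+1}, .)
    rewards[t]  : r_t
    pi_probs[t] : pi(a_t | x_t)  (target policy)
    mu_probs[t] : mu(a_t | x_t)  (behaviour policy)
    """
    q_taken = np.asarray(q_taken, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    pi_probs = np.asarray(pi_probs, dtype=float)
    mu_probs = np.asarray(mu_probs, dtype=float)

    # Assumption: truncated importance weights c_s = min(1, pi / mu).
    c = np.minimum(1.0, pi_probs / mu_probs)

    # Per-step TD term: r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t).
    td = rewards + gamma * v_next - q_taken

    # trace[t] = prod_{s=1}^{t} c_s. The product is empty (= 1) at t = 0,
    # so c_0 does not enter the sum.
    trace = np.concatenate(([1.0], np.cumprod(c[1:])))

    discounts = gamma ** np.arange(len(rewards))
    return q_taken[0] + np.sum(discounts * trace * td)
```

Note that in this form the product of traces starts at $s=1$ and the correction term pairs $\mathbb E_\pi Q(x_{t+1},\cdot)$ with $Q(x_t,a_t)$, whereas the first definition starts its product at $k=i$ and writes $\delta_Q$ in terms of $s_i$ and $s_j$.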

What should I make of $Q^{ret}$ in the first paper?