Why do the authors track γt in the Prioritized Experience Replay paper?

artificial-intelligence dqn deep-rl experience-replay
2021-10-22 16:55:02

In the original prioritized experience replay paper, the authors store γt in every state-transition tuple (see line 6 of the algorithm below):

[Algorithm 1 from the paper: Double DQN with proportional prioritization]

Why do the authors track this at every time step? Also, many blog posts and implementations leave this out (including, I believe, the OpenAI implementation on GitHub).

Can someone explain explicitly how γt is used in this algorithm?

Note: I understand the typical use of γ as a discount factor, but typically γ remains fixed, which is why I'm curious about the need to track it.

1 Answer

In some cases we may wish to have a discount factor γt that depends on the time t (or on the state st and/or action at, which leads to an indirect dependence on t). We do not usually do this, but it does happen sometimes.
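As a concrete (hypothetical) illustration, not something from the paper: even the standard handling of episode termination can be viewed as a transition-dependent discount, since terminal transitions should not bootstrap at all.

```python
def per_step_discount(done: bool, gamma: float = 0.99) -> float:
    # Hypothetical helper, not from the paper: the usual fixed discount while
    # the episode continues, but 0 on termination, so the bootstrap target
    # R_t + gamma_t * max_a Q(S_t, a) reduces to R_t for terminal transitions.
    return 0.0 if done else gamma
```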

I guess that, from a theoretical point of view, it was very easy for the authors to make their algorithm more flexible/general and also support this (somewhat rare) case of a time-varying discount factor. If it had been very complicated for them to support this option, they might have chosen not to; but if it's trivial to do so, well, why not?

Practical implementations will often indeed ignore that possibility if they're not using it, and can avoid storing γt values in the replay buffer altogether if the discount is known to be a constant γt = γ for all t. As far as I can see, the experiments discussed in this paper also used only a fixed, constant γ.
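To make that concrete, here is a minimal sketch of what storing γt with each transition and using it in the bootstrap target can look like. This is not the paper's implementation: sampling is uniform rather than prioritized, PyTorch is assumed, and `ReplayBuffer`, `double_dqn_targets`, `q_net`, and `target_net` are illustrative names.

```python
import random
from collections import deque

import numpy as np
import torch


class ReplayBuffer:
    """Minimal uniform replay buffer whose transitions carry their own gamma_t.

    A sketch only: the paper's buffer is prioritized; sampling here is uniform
    to keep the example short.
    """

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, gamma, next_state):
        # Line 6 of the paper's algorithm stores the transition together
        # with its discount: (S_{t-1}, A_{t-1}, R_t, gamma_t, S_t).
        self.buffer.append((state, action, reward, gamma, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, gammas, next_states = zip(*batch)
        return (torch.as_tensor(np.stack(states), dtype=torch.float32),
                torch.as_tensor(actions),
                torch.as_tensor(rewards, dtype=torch.float32),
                torch.as_tensor(gammas, dtype=torch.float32),
                torch.as_tensor(np.stack(next_states), dtype=torch.float32))


def double_dqn_targets(q_net, target_net, rewards, gammas, next_states):
    """Double DQN bootstrap target using the *stored* per-transition discount:
    R_t + gamma_t * Q_target(S_t, argmax_a Q_online(S_t, a)).
    """
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
    return rewards + gammas * next_q
```

If the discount is constant, the `gammas` column is redundant (it would just be a tensor filled with γ), which is presumably why many public implementations omit it and hard-code a single γ, typically folding termination in via γ * (1 - done).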