Why do the authors track γt in the Prioritized Experience Replay paper?

artificial-intelligence dqn deep-rl experience-replay
2021-10-22 16:55:02

In the original prioritized experience replay paper, the authors store γt in every state-transition tuple (see line 6 of the algorithm below):

[Algorithm 1 from the paper: Double DQN with proportional prioritization]

Why do the authors track this at every time step? Also, many blog posts and implementations leave this out (including, I believe, the OpenAI implementation on GitHub).

Can someone explain explicitly how γt is used in this algorithm?

Note: I understand the typical use of γ as a discount factor, but typically γ remains fixed, which is why I'm curious about the need to track it.

1 Answer

In some cases we may wish to have a discount factor γt that depends on the time t (or on the state st and/or action at, which leads to an indirect dependence on t). We do not usually do this, but it does happen sometimes.
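As a concrete (hypothetical) illustration, not something from the paper: even the standard handling of episode termination can be viewed as a transition-dependent discount, since terminal transitions should not bootstrap at all.

```python
def per_step_discount(done: bool, gamma: float = 0.99) -> float:
    # Hypothetical helper, not from the paper: the usual fixed discount while
    # the episode continues, but 0 on termination, so the bootstrap target
    # R_t + gamma_t * max_a Q(S_t, a) reduces to R_t for terminal transitions.
    return 0.0 if done else gamma
```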

I guess that, from a theoretical point of view, it was very easy for the authors to make their algorithm more flexible/general and also support this (somewhat rare) case of a time-varying discount factor. If it had been very complicated for them to support this option, they might have chosen not to; but if it's trivial to do so, well, why not?

Practical implementations will often indeed ignore that possibility if they're not using it, and can avoid storing γt values in the replay buffer altogether if the discount is known to be a constant γt = γ for all t. As far as I can see, the experiments discussed in this paper also used only a fixed, constant γ.
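To make that concrete, here is a minimal sketch of what storing γt with each transition and using it in the bootstrap target can look like. This is not the paper's implementation: sampling is uniform rather than prioritized, PyTorch is assumed, and `ReplayBuffer`, `double_dqn_targets`, `q_net`, and `target_net` are illustrative names.

```python
import random
from collections import deque

import numpy as np
import torch


class ReplayBuffer:
    """Minimal uniform replay buffer whose transitions carry their own gamma_t.

    A sketch only: the paper's buffer is prioritized; sampling here is uniform
    to keep the example short.
    """

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, gamma, next_state):
        # Line 6 of the paper's algorithm stores the transition together
        # with its discount: (S_{t-1}, A_{t-1}, R_t, gamma_t, S_t).
        self.buffer.append((state, action, reward, gamma, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, gammas, next_states = zip(*batch)
        return (torch.as_tensor(np.stack(states), dtype=torch.float32),
                torch.as_tensor(actions),
                torch.as_tensor(rewards, dtype=torch.float32),
                torch.as_tensor(gammas, dtype=torch.float32),
                torch.as_tensor(np.stack(next_states), dtype=torch.float32))


def double_dqn_targets(q_net, target_net, rewards, gammas, next_states):
    """Double DQN bootstrap target using the *stored* per-transition discount:
    R_t + gamma_t * Q_target(S_t, argmax_a Q_online(S_t, a)).
    """
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
    return rewards + gammas * next_q
```

If the discount is constant, the `gammas` column is redundant (it would just be a tensor filled with γ), which is presumably why many public implementations omit it and hard-code a single γ, typically folding termination in via γ * (1 - done).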