In DQN examples and tutorials, I often notice that during the experience replay (training) phase, people tend to use stochastic gradient descent / online learning (e.g. link 1, link 2):
# Sample a minibatch from the replay memory
minibatch = random.sample(self.memory, batch_size)
# Extract the information from each sampled transition
for state, action, reward, next_state, done in minibatch:
    # if done, the target is just the observed reward
    target = reward
    if not done:
        # otherwise, add the predicted future discounted reward
        target = reward + self.gamma * \
                 np.amax(self.model.predict(next_state)[0])
    # make the agent approximately map the current state
    # to the future discounted reward; we'll call that target_f
    target_f = self.model.predict(state)
    target_f[0][action] = target
    # fit on this single transition: one gradient step per sample
    self.model.fit(state, target_f, epochs=1, verbose=0)
Why can't they use a minibatch instead? I'm new to RL, but in deep learning people tend to use minibatches because they yield more stable gradients. Does the same principle apply to RL problems, or is the randomness/noise introduced by single-sample updates actually beneficial to the learning process? Am I missing something, or are all of these sources wrong?
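To make the question concrete, here is a rough sketch (not taken from any of the sources) of what I mean by using a minibatch: one vectorized predict over the whole batch and a single fit call, instead of one predict/fit per transition. It assumes the same self.memory, self.model and self.gamma attributes as the snippet above, and that each stored state has shape (1, state_size); the function name and the array handling are my own.

    import random
    import numpy as np

    def replay_minibatch(self, batch_size):
        # Sample a minibatch of transitions, exactly as in the snippet above
        minibatch = random.sample(self.memory, batch_size)

        # Stack the transitions into batch arrays (assumes each stored state
        # already has shape (1, state_size), as in the per-sample version)
        states      = np.vstack([m[0] for m in minibatch])
        actions     = np.array([m[1] for m in minibatch])
        rewards     = np.array([m[2] for m in minibatch], dtype=np.float32)
        next_states = np.vstack([m[3] for m in minibatch])
        dones       = np.array([m[4] for m in minibatch], dtype=bool)

        # One forward pass for the whole batch instead of batch_size separate calls
        q_next = self.model.predict(next_states)
        targets = rewards.copy()
        targets[~dones] += self.gamma * np.amax(q_next[~dones], axis=1)

        # Current Q-values, with only the taken action's entry replaced by its target
        target_f = self.model.predict(states)
        target_f[np.arange(batch_size), actions] = targets

        # A single fit call: one gradient step averaged over the whole minibatch
        self.model.fit(states, target_f, batch_size=batch_size, epochs=1, verbose=0)

As far as I understand, a single fit on the stacked batch with batch_size=batch_size and epochs=1 produces one gradient step averaged over all sampled transitions, which is what I would normally expect from minibatch training.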
Note:
Not all sources rely on stochastic gradient descent: keras-rl, for example, appears to train on minibatches (https://github.com/keras-rl/keras-rl/blob/master/rl/agents/dqn.py).