A2C Continuous working implementation for Pendulum-v0: negation of the loss and the entropy calculation

data-mining neural-network distribution gaussian openai-gym actor-critic
2022-02-26 20:45:36

This A2C continuous implementation for Pendulum-v0 works very well.

The code has a snippet that stops execution when the average of the last 10 or 20 episode scores is above -20 (a sketch of such a check appears after the scores below), but the results look like this:

episode: 706   score: [-13.13392661]
episode: 707   score: [-12.91221984]
episode: 708   score: [-50.38036647]
episode: 709   score: [-74.58410041]
episode: 710   score: [-138.1596521]
episode: 711   score: [-87.3867222]
episode: 712   score: [-63.28444052]
episode: 713   score: [-0.37368592]
episode: 714   score: [-13.28473712]
episode: 715   score: [-117.78089523]
episode: 716   score: [-25.65207563]
episode: 717   score: [-0.36829411]
episode: 718   score: [-50.81750735]
episode: 719   score: [-0.33565775]
episode: 720   score: [-0.47168285]
episode: 721   score: [-0.35240929]
episode: 722   score: [-0.40577252]
episode: 723   score: [-0.37114168]
episode: 724   score: [-25.73963544]
episode: 725   score: [-37.70957794]
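
For reference, a stopping check of the kind described above could look like the sketch below. This is a hypothetical illustration, not the exact code from the implementation; the window size, threshold, and the should_stop name are assumptions.

import numpy as np

# Hypothetical sketch of the stopping condition described above (not the
# exact code from the implementation): stop training once the mean of the
# most recent episode scores exceeds the threshold.
def should_stop(scores, window=10, threshold=-20.0):
    return len(scores) >= window and np.mean(scores[-window:]) > threshold

# e.g. inside the training loop: if should_stop(score_history): break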

Even with the reward/10 line, it still does quite well. However, I don't understand the lines that negate the loss, nor why the entropy equation looks different from the one I saw in Packt Publishing's Deep Reinforcement Learning Hands-On: DL Hands-On math explanation

Code:

def actor_optimizer(self):
    # placeholders for the actions and advantages fed in at training time
    action = K.placeholder(shape=(None, 1))
    advantages = K.placeholder(shape=(None, 1))

    # mu = K.placeholder(shape=(None, self.action_size))
    # sigma_sq = K.placeholder(shape=(None, self.action_size))

    mu, sigma_sq = self.actor.output

    # custom loss built from the Gaussian pdf formula; K.exp is element-wise exponential
    pdf = 1. / K.sqrt(2. * np.pi * sigma_sq) * K.exp(-K.square(action - mu) / (2. * sigma_sq))
    # log pdf -- why?
    log_pdf = K.log(pdf + K.epsilon())
    # entropy looks different from log(sqrt(2 * pi * e * sigma_sq))
    # K.sum: sum of the values in a tensor, along the specified axis
    entropy = K.sum(0.5 * (K.log(2. * np.pi * sigma_sq) + 1.))

    exp_v = log_pdf * advantages
    # entropy is scaled down (0.01 coefficient) before being added to exp_v
    exp_v = K.sum(exp_v + 0.01 * entropy)
    # loss is a negation
    actor_loss = -exp_v

    # use the custom loss to get gradients and build the Adam updates
    optimizer = Adam(lr=self.actor_lr)
    updates = optimizer.get_updates(self.actor.trainable_weights, [], actor_loss)
    # adjust the parameters with a custom train function
    train = K.function([self.actor.input, action, advantages], [], updates=updates)
    # return the custom train function
    return train
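
As a side note, the hand-written Gaussian pdf and log-pdf above can be sanity-checked against scipy.stats.norm with plain numpy numbers. This is only an illustrative check, not part of the original code; the values of mu, sigma_sq and action are arbitrary.

import numpy as np
from scipy.stats import norm

# Arbitrary example values (assumptions, just for the check)
mu, sigma_sq, action = 0.3, 0.5, -0.7

# Same pdf formula as in actor_optimizer, with numpy instead of Keras backend ops
pdf = 1. / np.sqrt(2. * np.pi * sigma_sq) * np.exp(-(action - mu) ** 2 / (2. * sigma_sq))
log_pdf = np.log(pdf + 1e-7)  # K.epsilon() is on the order of 1e-7

print(pdf, norm.pdf(action, loc=mu, scale=np.sqrt(sigma_sq)))        # both ~0.2076
print(log_pdf, norm.logpdf(action, loc=mu, scale=np.sqrt(sigma_sq))) # both ~-1.57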

Again, the entropy equation as coded is entropy = K.sum(0.5 * (K.log(2. * np.pi * sigma_sq) + 1.)), which looks different from the one given in the textbook photo above.

Also, why is the loss negated: actor_loss = -exp_v?

Is it negated because this is gradient ascent on the policy-gradient objective function, rather than gradient descent?

1 Answer

Again, the entropy equation as coded is entropy = K.sum(0.5 * (K.log(2. * np.pi * sigma_sq) + 1.)), which looks different from the one given in the textbook photo above.

They are the same after some simple algebra. The entropy of a univariate Gaussian distribution with pdf p(x|μ,σ) is

H(p) = ln√(2πeσ²) = 0.5·ln(2πeσ²) = 0.5·(ln(2πσ²) + 1)
In the policy gradient we assume these σ's are isotropic, so the total entropy is the sum of the individual H's above.
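
A quick numerical check with an arbitrary positive sigma_sq (a sketch, not part of the original derivation) confirms the two forms agree:

import numpy as np

sigma_sq = 0.25  # arbitrary positive variance
textbook = 0.5 * np.log(2. * np.pi * np.e * sigma_sq)  # 0.5 * ln(2*pi*e*sigma^2)
coded    = 0.5 * (np.log(2. * np.pi * sigma_sq) + 1.)  # form used in the code
assert np.isclose(textbook, coded)  # identical up to floating-point error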

Also, why is the loss negated: actor_loss = -exp_v? Is it negated because this is gradient ascent on the policy-gradient objective function, rather than gradient descent?

Yes. In the policy gradient you want to maximize the log-probability (log_pdf) of the actions that obtained higher episode returns (advantages), and at the same time we prefer a high-entropy policy. Deep learning frameworks usually only provide optimizers that minimize a loss, so you negate exp_v to use it as the loss.
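
To illustrate the last point, here is a minimal sketch (not the author's code) showing that gradient ascent on an objective J is the same as gradient descent on the loss -J, using a toy objective J(theta) = -(theta - 3)^2 whose maximizer is theta = 3:

# Toy example: maximize J(theta) = -(theta - 3)**2 by descending on loss = -J.
theta, lr = 0.0, 0.1
for _ in range(200):
    grad_J = -2.0 * (theta - 3.0)  # dJ/dtheta
    grad_loss = -grad_J            # loss = -J, so its gradient is the negation
    theta -= lr * grad_loss        # gradient *descent* on the loss ...
print(theta)                       # ... converges to the *maximizer*, ~3.0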