data mining - How to apply the gradient of softmax in backprop

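I recently did a homework assignment (HW) where I had to train a model for MNIST 10-digit classification. The HW came with some scaffolding code, and I was supposed to work within that code.

My homework works and passes the tests, but now I'm trying to do it all from scratch (my own nn framework, no HW scaffolding code). I'm stuck on applying the gradient of softmax in the backprop step, and I even suspect that what the HW scaffolding code does might not be correct.

The HW has me use what they call a "softmax loss" as the last node in the nn. That is, for some reason they decided to combine the softmax activation with the cross-entropy loss in a single node, instead of treating softmax as an activation function and cross entropy as a separate loss function.

The HW loss function then looks like this (minimally edited by me):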

import numpy as np


class SoftmaxLoss:
    """
    A batched softmax loss, used for classification problems.
    input[0] (the prediction) = np.array of dims batch_size x 10
    input[1] (the truth) = np.array of dims batch_size x 10
    """
    @staticmethod
    def softmax(input):
        exp = np.exp(input - np.max(input, axis=1, keepdims=True))
        return exp / np.sum(exp, axis=1, keepdims=True)

    @staticmethod
    def forward(inputs):
        softmax = SoftmaxLoss.softmax(inputs[0])
        labels = inputs[1]
        return np.mean(-np.sum(labels * np.log(softmax), axis=1))

    @staticmethod
    def backward(inputs, gradient):
        softmax = SoftmaxLoss.softmax(inputs[0])
        return [
            gradient * (softmax - inputs[1]) / inputs[0].shape[0],
            gradient * (-np.log(softmax)) / inputs[0].shape[0]
        ]

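As you can see, on the forward pass it does softmax(x) and then the cross-entropy loss.

But on backprop, it seems to only do the derivative of cross entropy and not of softmax. Softmax is left as is.

Shouldn't it also take the derivative of softmax with respect to the input to softmax?

Assuming that it should take the derivative of softmax, I'm not sure how this HW actually passes the tests...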
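One way to test this doubt empirically is a finite-difference gradient check: nudge each logit, recompute the forward loss, and compare the slope with what `backward` returns for the prediction. The following is only a minimal sketch. The functions are standalone copies of the HW expressions, `numerical_grad` is a hypothetical helper that is not part of the HW code, and the random data and step size are assumptions.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with the usual max-shift for numerical stability.
    exp = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp / np.sum(exp, axis=1, keepdims=True)

def softmax_loss(logits, labels):
    # Same forward pass as SoftmaxLoss.forward: softmax, then mean cross entropy.
    return np.mean(-np.sum(labels * np.log(softmax(logits)), axis=1))

def softmax_loss_grad(logits, labels):
    # Same expression as the first entry returned by SoftmaxLoss.backward,
    # with the upstream gradient fixed to 1.
    return (softmax(logits) - labels) / logits.shape[0]

def numerical_grad(f, x, eps=1e-6):
    # Hypothetical helper: central finite differences, one entry at a time.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        plus = f(x)
        x[idx] = orig - eps
        minus = f(x)
        x[idx] = orig  # restore the entry before moving on
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
labels = np.eye(10)[rng.integers(0, 10, size=4)]  # one-hot rows (assumed data)

analytic = softmax_loss_grad(logits, labels)
numeric = numerical_grad(lambda z: softmax_loss(z, labels), logits)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

If the printed difference is tiny, then `(softmax - labels) / batch_size` is already the derivative of the whole softmax-plus-cross-entropy node with respect to the logits, not the derivative with respect to the softmax output.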

Now, in my own from-scratch implementation, I made softmax and cross entropy separate nodes, like so (p and t stand for predicted and truth):

class SoftMax(NetNode):
    def __init__(self, x):
        ex = np.exp(x.data - np.max(x.data, axis=1, keepdims=True))
        super().__init__(ex / np.sum(ex, axis=1, keepdims=True), x)

    def _back(self, x):
        g = self.data * (np.eye(self.data.shape[0]) - self.data)
        x.g += self.g * g
        super()._back()

class LCE(NetNode):
    def __init__(self, p, t):
        super().__init__(
            np.mean(-np.sum(t.data * np.log(p.data), axis=1)),
            p, t
        )

    def _back(self, p, t):
        p.g += self.g * (p.data - t.data) / t.data.shape[0]
        t.g += self.g * -np.log(p.data) / t.data.shape[0]
        super()._back()

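As you can see, my cross-entropy loss (LCE) has the same derivative as the one in the HW, because that is the derivative of the loss itself, without getting into softmax yet.

But then I would still have to do the derivative of softmax to chain it with the derivative of the loss. This is where I get stuck.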
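Whether `(p - t)` really is the derivative of the cross entropy with respect to `p` alone can also be checked numerically. The sketch below compares two candidate expressions against finite differences, treating every entry of `p` as an independent variable. The function name `cross_entropy`, the random inputs and the step size are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(p, t):
    # Same forward value as LCE: mean over the batch of -sum(t * log(p)).
    return np.mean(-np.sum(t * np.log(p), axis=1))

rng = np.random.default_rng(1)
p = rng.uniform(0.1, 1.0, size=(3, 10))
p /= p.sum(axis=1, keepdims=True)            # rows look like probabilities
t = np.eye(10)[rng.integers(0, 10, size=3)]  # one-hot targets (assumed data)

# Central finite differences over every entry of p.
eps = 1e-6
numeric = np.zeros_like(p)
for idx in np.ndindex(*p.shape):
    p_plus, p_minus = p.copy(), p.copy()
    p_plus[idx] += eps
    p_minus[idx] -= eps
    numeric[idx] = (cross_entropy(p_plus, t) - cross_entropy(p_minus, t)) / (2 * eps)

candidate_a = (p - t) / p.shape[0]   # expression used in LCE._back
candidate_b = (-t / p) / p.shape[0]  # elementwise derivative of -t * log(p)

err_a = np.max(np.abs(numeric - candidate_a))
err_b = np.max(np.abs(numeric - candidate_b))
print(err_a, err_b)
```

In this check only `candidate_b` agrees with the finite differences. That suggests `(p - t)` is not the gradient of the cross entropy by itself, but the result of chaining it with softmax.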

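For softmax defined as:

$S_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$

the derivative is usually defined as:

$\frac{\partial S_i}{\partial x_j} = S_i(\delta_{ij} - S_j)$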

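But I need a derivative that results in a tensor of the same size as the input to softmax, in this case batch_size x 10. So I'm not sure how the above should be applied to only 10 components, since it implies that I would differentiate all outputs with respect to all inputs (all combinations), that is, $\partial S_i / \partial x_j$ in matrix form.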
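One common way to end up with a batch_size x 10 result is to never store the full matrix of combinations. Instead, each row's 10 x 10 Jacobian is multiplied by the incoming gradient for that row, which collapses it back to 10 numbers. The sketch below assumes the standard softmax Jacobian `J[i, j] = s_i * (delta_ij - s_j)`. The function names `softmax_backward` and `softmax_backward_explicit` are hypothetical, and the upstream gradient is assumed to be the cross entropy differentiated with respect to the softmax output alone.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with the usual max-shift for numerical stability.
    exp = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp / np.sum(exp, axis=1, keepdims=True)

def softmax_backward(s, upstream):
    # Per-row vector-Jacobian product with J[i, j] = s_i * (delta_ij - s_j):
    # dx_j = sum_i upstream_i * J[i, j] = s_j * (upstream_j - sum_i upstream_i * s_i)
    dot = np.sum(upstream * s, axis=1, keepdims=True)
    return s * (upstream - dot)

def softmax_backward_explicit(s, upstream):
    # Reference version that builds each row's 10 x 10 Jacobian explicitly.
    out = np.empty_like(s)
    for n in range(s.shape[0]):
        jac = np.diag(s[n]) - np.outer(s[n], s[n])
        out[n] = upstream[n] @ jac
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 10))
t = np.eye(10)[rng.integers(0, 10, size=5)]  # one-hot targets (assumed data)
s = softmax(x)

# Assumed upstream gradient: cross entropy differentiated w.r.t. s alone.
upstream = (-t / s) / x.shape[0]

dx = softmax_backward(s, upstream)
dx_explicit = softmax_backward_explicit(s, upstream)
combined = (s - t) / x.shape[0]  # expression from SoftmaxLoss.backward
print(dx.shape)
```

Under these assumptions, the chained result `dx` has the same shape as the logits and matches the combined `(softmax - labels) / batch_size` expression. This is one plausible reason the fused softmax loss node needs no separate softmax derivative.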