Where does the normal Glorot initialization come from?

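The famous Glorot initialization was first described in the paper "Understanding the difficulty of training deep feedforward neural networks". In that paper, the authors derive the following uniform initialization, cf. Eq. (16) in their paper: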

$$W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]. \tag{16}$$

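If we look at the PyTorch documentation for weight initialization, there are two Glorot (Xavier) initializations, namely torch.nn.init.xavier_uniform_(tensor, gain=1.0) and torch.nn.init.xavier_normal_(tensor, gain=1.0). According to the documentation, the latter draws from a normal distribution $\mathcal N(0, \sigma^2)$, where (for the default gain of 1) the standard deviation is given by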

$$\sigma = \sqrt{\frac{2}{n_j + n_{j+1}}}.$$
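As a quick numerical check of the two formulas, here is a minimal sketch that samples from both distributions using only the Python standard library and compares their empirical standard deviations. It does not call PyTorch; the helper names `glorot_uniform_bound` and `glorot_normal_std` and the layer sizes `fan_in = 300`, `fan_out = 100` are hypothetical choices made for illustration. It relies on the standard fact that a uniform distribution on $[-a, a]$ has standard deviation $a/\sqrt{3}$.

```python
import math
import random
import statistics


def glorot_uniform_bound(fan_in, fan_out):
    """Bound a of U[-a, a] from Eq. (16): sqrt(6) / sqrt(n_j + n_{j+1})."""
    return math.sqrt(6.0) / math.sqrt(fan_in + fan_out)


def glorot_normal_std(fan_in, fan_out):
    """Standard deviation of the normal variant (gain = 1): sqrt(2 / (n_j + n_{j+1}))."""
    return math.sqrt(2.0 / (fan_in + fan_out))


# Hypothetical layer sizes, chosen only for illustration.
fan_in, fan_out = 300, 100

a = glorot_uniform_bound(fan_in, fan_out)
sigma = glorot_normal_std(fan_in, fan_out)

# Standard deviation of U[-a, a] is a / sqrt(3).
analytic_uniform_std = a / math.sqrt(3.0)

# Draw samples from both distributions with a fixed seed.
rng = random.Random(0)
n_samples = 100_000
uniform_samples = [rng.uniform(-a, a) for _ in range(n_samples)]
normal_samples = [rng.gauss(0.0, sigma) for _ in range(n_samples)]

uniform_std = statistics.pstdev(uniform_samples)
normal_std = statistics.pstdev(normal_samples)

print(f"uniform bound a        = {a:.6f}")
print(f"analytic std of U[-a,a] = {analytic_uniform_std:.6f}")
print(f"sigma of normal variant = {sigma:.6f}")
print(f"empirical uniform std   = {uniform_std:.6f}")
print(f"empirical normal std    = {normal_std:.6f}")
```

Under these assumptions the two empirical standard deviations should come out close to each other, which suggests the $6$ and the $2$ differ only by the factor $3$ from the variance of a uniform distribution. The sketch says nothing about where the normal variant originates or whether one variant trains better.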

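Questions:

1.) Why do we have a $2$ instead of a $6$ in the standard deviation of the normal Glorot initialization?

2.) Where does the normal Glorot initialization come from? In particular, is there a follow-up to the paper above that demonstrates the superiority of the normal Glorot initialization over the uniform one?

Thanks!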