Confusion with Vowpal Wabbit's multiple-pass behavior when performing ridge regression

While trying to do online multi-pass learning, I have run into several peculiarities (or misunderstandings on my part) of Vowpal Wabbit.

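Specifically, I need to solve a ridge linear regression problem with about N = 4e6 points and roughly K = 2.38e5 features in total. Each point is sparse, usually with somewhere between 10 and 100 features. Since N is too large to fit in the limited memory of my PC, I decided to use Vowpal Wabbit's out-of-core SGD. The features are represented in text, and I train with: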

vw -d train.dat -c -f train.model --passes 10 --loss_function squared \
    --l2 0.00001 -l 0.05

Num weight bits = 18
learning rate = 0.05
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using l2 regularization = 1e-05
using cache_file = vw/Train_0.cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.028051   0.028051            3         3.0   0.0010   0.0070       24
0.089131   0.150212            6         6.0   0.0036   0.0256       20
0.131556   0.182465           11        11.0   0.0076   0.0297       47
0.102226   0.072896           22        22.0   0.0052   0.0368       16
0.095826   0.089426           44        44.0   0.0539   0.1463       64
0.094730   0.093608           87        87.0   0.0058   0.0605        8
0.084809   0.074887          174       174.0   0.0177   0.0914       13
0.073454   0.062099          348       348.0   0.2019   0.1518       18
0.065183   0.056912          696       696.0   0.0044   0.1866       20
0.061144   0.057106         1392      1392.0   0.8041   0.2107       12
0.060112   0.059079         2784      2784.0   0.0454   0.1215       15
0.054327   0.048543         5568      5568.0   0.0875   0.0016       58
0.052169   0.050009        11135     11135.0   0.0194   0.0785       41
0.048767   0.045365        22269     22269.0   0.6126   0.1835       42
0.046333   0.043900        44537     44537.0   0.0149   0.0000       55
0.045165   0.043997        89073     89073.0   0.4736   0.1643       15
0.043814   0.042463       178146    178146.0   0.0561   0.0189       51
0.042698   0.041581       356291    356291.0   0.0161   0.0412       54
0.042067   0.041436       712582    712582.0   0.0012   0.1215       25
0.041751   0.041435      1425163   1425163.0   0.0540   0.0677       33
0.042095   0.042439      2850326   2850326.0   0.1088   0.1141       41
0.000000   0.000000      5700651   5700651.0   0.0028  -0.0000       43 h
0.000000   0.000000     11401301  11401301.0   0.0006   0.1667       32 h
0.000000   0.000000     22802601  22802601.0   0.0534   0.1617       12 h

finished run
number of examples per pass = 3702833
passes used = 10
weighted example sum = 3.70283e+07
weighted label sum = 5.96004e+06
average loss = 0 h
best constant = 0.160959
total feature number = 1244815410

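The h at the right end of the last three lines indicates that the loss was computed on a holdout subset of the data used for validation. Why is my average loss 0 when the holdout predictions are presumably wrong, as the discrepancies between current label and current predict in the last three lines suggest? This odd behavior hides whether my algorithm has converged.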
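As a rough check of that discrepancy, the per-example squared error for the three rows marked h can be recomputed from the printed label and prediction columns. This is only a sketch: the values are copied from the log above, where they are rounded to four decimals, and the variable names `rows` and `sq_errors` are mine.

```shell
# Label and prediction columns copied from the three log rows marked "h".
rows='0.0028 -0.0000
0.0006 0.1667
0.0534 0.1617'

# Squared error per row: (label - prediction)^2, printed to six decimals.
sq_errors=$(printf '%s\n' "$rows" | awk '{ d = $1 - $2; printf "%.6f\n", d * d }')
echo "$sq_errors"
```

Two of the three rows come out on the order of 1e-2, so a true average squared loss of exactly 0 over the holdout set does not look consistent with these individual predictions.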

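I can work around this obscure behavior by ignoring vw's --passes option and performing the "passes" manually, with an IO-intensive shuffle of the data after every "pass":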

for i in {0..10}
do
    # Custom script that shuffles train.dat -> train.dat
    ./shuffledata train.dat train.dat
    vw -d train.dat -c -k --l2 0.00001 -l 0.05 -i train.model -f train.model
done
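The `./shuffledata` script itself is not shown. A minimal stand-in, assuming GNU coreutils `shuf` is available, could look like the sketch below. The function name `shuffle_in_place` is hypothetical, and `shuf` loads the whole file into memory, so this may not suit a machine where memory is already the constraint.

```shell
# Hypothetical stand-in for ./shuffledata (not the original script).
# Shuffles the lines of a file via a temporary file, and replaces the
# original only if the shuffle succeeded.
shuffle_in_place() {
    tmp=$(mktemp) || return 1
    if shuf "$1" > "$tmp"; then
        mv "$tmp" "$1"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as `shuffle_in_place train.dat` in place of the `./shuffledata train.dat train.dat` line.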

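However, both versions above (with the --passes option, and without it but with the shuffle script) perform almost exactly the same as the following single-pass snippet: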

vw -d train.dat -c --l2 0.00001 -l 0.05 -f train.model

In other words, it is as if no extra passes were needed at all for "convergence". These experiences lead me to my questions.

Why do multiple passes (shuffled or unshuffled) in an online SGD setting give the same results as a single pass?
Why is my holdout error 0?