How do I interpret Weka Logistic regression output?

2022-04-15 12:10:38

Please help me interpret the logistic regression results produced by weka.classifiers.functions.Logistic from the WEKA library.

I use the numeric weather data from the WEKA examples:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

To build the logistic regression model, I use the following command:

java -cp WEKA_INS/weka.jar weka.classifiers.functions.Logistic -t WEKA_INS/data/weather.numeric.arff -T WEKA_INS/data/weather.numeric.arff -d ./weather.numeric.model.arff

Here is what the three arguments mean:

-t <name of training file> : Sets training file.
-T <name of test file> : Sets test file. 
-d <name of output file> : Sets model output file.

Running the above command produces the following output:

Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
              Class
Variable                    yes
===============================
outlook=sunny           -6.4257
outlook=overcast        13.5922
outlook=rainy           -5.6562
temperature             -0.0776
humidity                -0.1556
windy                    3.7317
Intercept                22.234

Odds Ratios...
              Class
Variable                    yes
===============================
outlook=sunny            0.0016
outlook=overcast    799848.4264
outlook=rainy            0.0035
temperature              0.9254
humidity                 0.8559
windy                   41.7508


Time taken to build model: 0.05 seconds
Time taken to test model on training data: 0 seconds

=== Error on training data ===
Correctly Classified Instances          11               78.5714 %
Incorrectly Classified Instances         3               21.4286 %
Kappa statistic                          0.5532
Mean absolute error                      0.2066
Root mean squared error                  0.3273
Relative absolute error                 44.4963 %
Root relative squared error             68.2597 %
Total Number of Instances               14     

=== Confusion Matrix ===
 a b   <-- classified as
 7 2 | a = yes
 1 4 | b = no

Questions:

First part of the report:

// Coefficients...

              Class
Variable                    yes
===============================
outlook=sunny           -6.4257
outlook=overcast        13.5922
outlook=rainy           -5.6562
temperature             -0.0776
humidity                -0.1556
windy                    3.7317
Intercept                22.234

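Do I understand correctly that the Coefficients are in fact weights that are applied to each attribute before the results are added together to produce the value of the class attribute play being equal to yes?

Second part of the report: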

// Odds Ratios...

              Class
Variable                    yes
===============================
outlook=sunny            0.0016
outlook=overcast    799848.4264
outlook=rainy            0.0035
temperature              0.9254
humidity                 0.8559
windy                   41.7508

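What does "Odds Ratios" mean?
Do they also all relate to the class attribute play being equal to yes?
Why is the value for outlook=overcast so much larger than the value for outlook=sunny?

Confusion matrix: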

=== Confusion Matrix ===
 a b   <-- classified as
 7 2 | a = yes
 1 4 | b = no

What does "Confusion Matrix" mean?

Thank you very much for your help!

1 Answer

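Let me first explain what odds mean in general.

The odds are the ratio between the probability of success and the probability of failure, i.e. $\displaystyle \frac{p_{i}}{1-p_{i}}$. Say $p_{i}$ for a given event is 0.6; then the odds of that event are $0.6/0.4=1.5$.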
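The definition above can be checked with a few lines of Python. This is only a minimal sketch; the helper names `odds` and `prob_from_odds` are made up for illustration and are not part of WEKA.

```python
def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Inverse mapping: recover the probability from the odds."""
    return o / (1.0 + o)

p = 0.6
o = odds(p)
print("odds:", o)                       # approximately 1.5
print("probability:", prob_from_odds(o))  # approximately 0.6
```

Odds above 1 mean the event is more likely than not; odds below 1 mean the opposite.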

1.- As you said, logistic regression outputs probabilities based on the following equation:

$\text{logit}(p_{i}) = \log{\frac{p_{i}}{1-p_{i}}} = \beta_{0} + \beta_{1}x_{1} + ... + \beta_{k}x_{k}$

The coefficients are the individual $\beta_{j}$ ($j = 0, ..., k$), with $\beta_{0}$ being the Intercept.
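To make the role of the coefficients concrete, here is a rough Python sketch that plugs them into the equation above and converts the result into a probability of play=yes. The encoding of the nominal attributes is an assumption, not something the output states: each outlook value is treated as a 0/1 indicator, and windy is coded 0 for TRUE and 1 for FALSE. That coding was picked because it reproduces the 11 correctly classified instances reported in the output. The function name `prob_yes` is hypothetical.

```python
import math

# Coefficients for class "yes", copied from the Weka output.
COEF = {
    "outlook=sunny": -6.4257,
    "outlook=overcast": 13.5922,
    "outlook=rainy": -5.6562,
    "temperature": -0.0776,
    "humidity": -0.1556,
    "windy": 3.7317,
}
INTERCEPT = 22.234

def prob_yes(outlook, temperature, humidity, windy):
    """P(play = yes) for one instance, under the assumed encoding."""
    z = INTERCEPT
    z += COEF["outlook=" + outlook]         # indicator for the observed outlook
    z += COEF["temperature"] * temperature
    z += COEF["humidity"] * humidity
    # ASSUMPTION: windy is coded 0 for TRUE and 1 for FALSE.
    z += COEF["windy"] * (0.0 if windy else 1.0)
    return 1.0 / (1.0 + math.exp(-z))       # inverse of the logit

# Rows of weather.numeric.arff: (outlook, temperature, humidity, windy, play)
DATA = [
    ("sunny", 85, 85, False, "no"), ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"), ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"), ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"), ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"), ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"), ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"), ("rainy", 71, 91, True, "no"),
]

correct = 0
for outlook, temp, hum, windy, play in DATA:
    p = prob_yes(outlook, temp, hum, windy)
    predicted = "yes" if p >= 0.5 else "no"
    correct += predicted == play
    print(f"{outlook:9s} {temp} {hum} {str(windy):5s} P(yes)={p:.3f} -> {predicted} (actual {play})")

print(correct, "of", len(DATA), "classified correctly")
```

So the weighted sum is not the class value itself but the log-odds of yes; the class is predicted as yes when the resulting probability is at least 0.5.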

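2.- The odds ratios are simply the exponentials of the weights you found before. For example, the first coefficient you have is outlook=sunny: -6.4257. If you compute $\exp(-6.4257)$ you get 0.0016, which is the corresponding value in the Odds Ratios table.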
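The same check can be run for every row of the table. The sketch below exponentiates the printed coefficients and compares them with the printed odds ratios; small differences are expected because the coefficients in the output are rounded to four decimals.

```python
import math

# Values copied from the "Coefficients..." and "Odds Ratios..." tables.
coefficients = {
    "outlook=sunny": -6.4257,
    "outlook=overcast": 13.5922,
    "outlook=rainy": -5.6562,
    "temperature": -0.0776,
    "humidity": -0.1556,
    "windy": 3.7317,
}
reported = {
    "outlook=sunny": 0.0016,
    "outlook=overcast": 799848.4264,
    "outlook=rainy": 0.0035,
    "temperature": 0.9254,
    "humidity": 0.8559,
    "windy": 41.7508,
}

# Odds ratio = exp(coefficient)
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, beta in coefficients.items():
    print(f"{name:17s} exp({beta:8.4f}) = {odds_ratios[name]:.4f}   (Weka: {reported[name]})")
```

Read this way, the value 0.9254 for temperature says that each additional degree multiplies the odds of yes by roughly 0.93, with the other attributes held fixed.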

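In this case, the relationship between the coefficient of outlook=sunny and its odds ratio is that the coefficient is the logarithm of the odds for outlook=sunny over the odds for outlook=¬sunny: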

$\displaystyle \log{\frac{\text{Odds}(\text{outlook}=\text{sunny})}{\text{Odds}(\text{outlook}=\neg\text{sunny})}}$

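For example, the odds for outlook=sunny are the probability of playing on a sunny day over the probability of not playing on a sunny day. In the same way you can compute the odds for outlook=¬sunny. The logarithm of the ratio of these two odds is the value of the coefficient attached to the variable outlook=sunny in the logistic regression. In this particular example, however, you have several variables as predictors, so the values of the other variables have to be held fixed. Now you can see why outlook=overcast has such a value: the odds for outlook=overcast are strongly in favor of the outcome yes (all four overcast instances in the training data have play=yes), which produces a large positive value.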
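As a rough illustration of this reasoning, the sketch below computes the unadjusted odds ratio for each outlook value directly from the 14 training rows, ignoring temperature, humidity and windy. These numbers are not expected to match the Weka output, since the fitted model holds the other attributes fixed and applies a small ridge penalty; the point is only to show the direction of the effect. The helper `odds_of_yes` is a made-up name.

```python
import math

# (outlook, play) pairs taken from the rows of weather.numeric.arff
ROWS = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

def odds_of_yes(rows):
    """Empirical odds of play=yes: count(yes) / count(no)."""
    yes = sum(1 for _, play in rows if play == "yes")
    no = len(rows) - yes
    return math.inf if no == 0 else yes / no

odds_ratio = {}
for value in ("sunny", "overcast", "rainy"):
    inside = [r for r in ROWS if r[0] == value]    # outlook = value
    outside = [r for r in ROWS if r[0] != value]   # outlook = not value
    # The "outside" odds are never zero for this data set.
    odds_ratio[value] = odds_of_yes(inside) / odds_of_yes(outside)
    print(f"outlook={value:9s} odds ratio = {odds_ratio[value]:.4f}, "
          f"log = {math.log(odds_ratio[value]):.4f}")
```

For overcast there is no instance with play=no, so the empirical odds are infinite. That is consistent with the very large coefficient in the output: the optimizer pushes the weight upward as far as the ridge penalty allows.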

A simpler example can be found here.

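3.- The confusion matrix is quite simple. The first row, for example, tells you how many instances in the training data that are labeled yes were classified as yes (i.e. 7), and how many instances labeled yes were classified as no (2). The second row gives the same breakdown for the instances labeled no.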
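The summary figures printed above the matrix can be recovered from these four counts. The following sketch recomputes the accuracy and the kappa statistic, assuming that Weka's "Kappa statistic" is the usual Cohen's kappa (observed agreement corrected for chance agreement).

```python
# Confusion matrix from the Weka output.
# Rows are the actual classes, columns are the predicted classes.
matrix = [
    [7, 2],  # actual yes: 7 classified as yes, 2 classified as no
    [1, 4],  # actual no:  1 classified as yes, 4 classified as no
]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total   # share on the diagonal

# Chance agreement from the row (actual) and column (predicted) totals.
row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

kappa = (accuracy - expected) / (1.0 - expected)

print(f"Correctly classified: {accuracy * 100:.4f} %")
print(f"Kappa statistic:      {kappa:.4f}")
```

The off-diagonal cells (2 and 1) are the three incorrectly classified instances listed in the output.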
