机器算法验证 - 使用 GLM 进行现实检查 - 吾爱随笔录

如果你愿意，我需要一个现实检查。我有一个数据集，其中我知道有多少两个物种的个体蝴蝶同时出现在一个草地上（但并非总是如此）。我还有其他变量，例如湿/干草甸，密集耕种/未耕种，湿草甸或干草甸覆盖的草甸周围区域的百分比......

这是数据集的头部。共 50 行，每个品种 25 行。请注意，除计数和物种外，所有列都相同，表明它们来自同一采样位置。

> head(dej)
  count     type1 type2 perc.for.100m perc.dry.100m perc.wet.100m species
1     1 intensive   dry        13.836        22.724         0.000   reali
2     3 extensive   wet         6.877         1.613        52.213   reali
3     4 intensive   wet        22.770         0.537        44.901   reali
4     6 intensive   dry        17.346        42.322         6.359   reali
5     1 extensive   wet        34.854         9.091        11.950   reali
6     2 extensive   dry        50.387        19.245         0.000   reali
...
26     0 intensive   dry        13.836        22.724         0.000 sinapis
27     0 extensive   wet         6.877         1.613        52.213 sinapis
28     0 intensive   wet        22.770         0.537        44.901 sinapis
29     0 intensive   dry        17.346        42.322         6.359 sinapis
30     1 extensive   wet        34.854         9.091        11.950 sinapis
31     1 extensive   dry        50.387        19.245         0.000 sinapis
...

我很想知道这些变量中的任何一个是否会影响物种及其各自的数量。

这就是“完整”模型的结果。

glm(formula = count ~ type1 + type2 + perc.for.100m + perc.dry.100m + 
    perc.wet.100m + species, family = poisson, data = dej)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8458  -1.1414  -0.4546   0.8297   2.2145  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)   
(Intercept)    0.028129   0.523509   0.054  0.95715   
type1intensive 0.196699   0.191960   1.025  0.30551   
type2wet       0.071841   0.334286   0.215  0.82984   
perc.for.100m  0.003741   0.008277   0.452  0.65130   
perc.dry.100m  0.010952   0.010750   1.019  0.30829   
perc.wet.100m  0.007467   0.011596   0.644  0.51960   
speciessinapis 0.597837   0.187689   3.185  0.00145 **

这听起来像是正确的方法吗？

一些附加信息

作为旁注，根据我对数据的探索，我希望计数将（至少）取决于 type2 变量，唉，这不是我得到的。

在此处输入图像描述

使用“反向逻辑”，我尝试了是否可以使用我的数据预测物种，这大概证实了上述结果。

Call:
glm(formula = species ~ type1 + type2 + perc.for.100m + perc.dry.100m + 
    perc.wet.100m + count, family = binomial, data = dej)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6322  -1.0136  -0.1568   1.0592   1.6407  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)    -0.351192   1.658052  -0.212   0.8323  
type1intensive -0.170583   0.651611  -0.262   0.7935  
type2wet       -0.107377   1.078726  -0.100   0.9207  
perc.for.100m  -0.002806   0.026807  -0.105   0.9166  
perc.dry.100m  -0.010227   0.036982  -0.277   0.7821  
perc.wet.100m  -0.006486   0.038071  -0.170   0.8647  
count           0.345036   0.153811   2.243   0.0249 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 69.315  on 49  degrees of freedom
Residual deviance: 63.431  on 43  degrees of freedom
AIC: 77.431

编辑 1

Aniko 注意到 type2 和物种之间可能存在相互作用。的确！

Call:
glm(formula = count ~ type1 + type2 * species + perc.for.100m + 
    perc.dry.100m + perc.wet.100m, family = poisson, data = dej)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0859  -1.1350  -0.1947   0.7109   2.7470  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -0.357165   0.559987  -0.638  0.52360    
type1intensive           0.196699   0.191960   1.025  0.30551    
type2wet                 0.704769   0.429087   1.642  0.10049    
speciessinapis           1.145132   0.306847   3.732  0.00019 ***
perc.for.100m            0.003741   0.008277   0.452  0.65130    
perc.dry.100m            0.010952   0.010750   1.019  0.30829    
perc.wet.100m            0.007467   0.011596   0.644  0.51960    
type2wet:speciessinapis -0.962811   0.394038  -2.443  0.01455 *

编辑 2

在删除了不重要的项之后（假设我为了数据挖掘找到了全局最大值），故事又朝着正确的方向发展。

Call:
glm(formula = count ~ type2 * species, family = poisson, data = dej)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7080  -1.1617  -0.1582   0.6979   3.1599  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)               0.1542     0.2673   0.577  0.56408    
type2wet                  0.6821     0.3237   2.107  0.03508 *  
speciessinapis            1.1451     0.3068   3.732  0.00019 ***
type2wet:speciessinapis  -0.9628     0.3940  -2.443  0.01455 *