Logistic regression with R

I am running a logistic regression and created the following test data (two predictors and a binary criterion):

   UV1 UV2 AV
1    1   1  1
2    1   1  1
3    1   1  1
4    1   1  1
5    1   1  1
6    1   1  1
7    1   1  1
8    0   0  1
9    0   0  1
10   0   0  1
11   1   1  0
12   1   1  0
13   1   0  0
14   1   0  0
15   1   0  0
16   1   0  0
17   1   0  0
18   0   0  0
19   0   0  0
20   0   0  0

AV = dependent variable (criterion)

UV1, UV2 = the two independent variables (predictors)
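For anyone who wants to reproduce the output, the table above can be rebuilt as a data frame along these lines. This is only a sketch: the name lrdata is taken from the glm() call in the post, and the values are copied row by row from the table.

```r
# Rebuild the 20-row test data from the table above.
# Rows 1-10 have AV = 1, rows 11-20 have AV = 0.
lrdata <- data.frame(
  UV1 = c(rep(1, 7), rep(0, 3), rep(1, 7), rep(0, 3)),
  UV2 = c(rep(1, 7), rep(0, 3), rep(1, 2), rep(0, 8)),
  AV  = c(rep(1, 10), rep(0, 10))
)

# Quick look at the structure and the first rows
str(lrdata)
head(lrdata)
```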

To measure the effect of the UVs on AV, a logistic regression is needed because AV is a binary variable. So I used the following code

> lrmodel <- glm(AV ~ UV1 + UV2, data = lrdata, family = "binomial")

including family = "binomial". Is this correct?

For my test data, I want to look at the whole model, especially the estimates and their significance:

> summary(lrmodel)


Call:
glm(formula = AV ~ UV1 + UV2, family = "binomial", data = lrdata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7344  -0.2944   0.3544   0.7090   1.1774  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.065e-15  8.165e-01   0.000    1.000
UV1         -1.857e+01  2.917e+03  -0.006    0.995
UV2          1.982e+01  2.917e+03   0.007    0.995

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.726  on 19  degrees of freedom
Residual deviance: 17.852  on 17  degrees of freedom
AIC: 23.852

Number of Fisher Scoring iterations: 17

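Why is UV2 not significant? After all, in the AV = 1 group there are 7 cases with UV2 = 1, while in the AV = 0 group there are only 2 cases with UV2 = 1. I expected UV2 to be a significant discriminator.
Although the UVs are not significant, the estimates seem very large to me (e.g. 1.982e+01 for UV2). How is this possible?
Why isn't the intercept 0.5? We have 10 cases with AV = 1 and 10 cases with AV = 0.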

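In addition: I created UV1 as a predictor that I expected not to be significant: in the AV = 1 group there are 7 cases with UV1 = 1, and in the AV = 0 group there are also 7 cases with UV1 = 1.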
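The group counts behind these expectations can be read directly off the data with a cross-tabulation. The sketch below rebuilds the data frame from the table (the name lrdata comes from the post; tab_uv1, tab_uv2 and tab_all are illustrative names) and also shows which combinations of UV1 and UV2 occur in each AV group.

```r
# Rebuild the test data from the table in the post
lrdata <- data.frame(
  UV1 = c(rep(1, 7), rep(0, 3), rep(1, 7), rep(0, 3)),
  UV2 = c(rep(1, 7), rep(0, 3), rep(1, 2), rep(0, 8)),
  AV  = c(rep(1, 10), rep(0, 10))
)

# Counts of each predictor value within each outcome group
tab_uv1 <- with(lrdata, table(UV1, AV))
tab_uv2 <- with(lrdata, table(UV2, AV))
print(tab_uv1)
print(tab_uv2)

# All three variables at once, flattened for easier reading
tab_all <- with(lrdata, table(UV1, UV2, AV))
print(ftable(tab_all))
```

The flattened table makes it easy to spot predictor combinations that occur rarely or only in one AV group.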

The whole "picture" I get from the logistic regression confuses me...
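One point that may help when comparing the two outputs: with family = "binomial", glm() uses the logit link by default, so the coefficients are on the log-odds scale rather than the probability scale. A minimal sketch of the conversion using base R's plogis() and qlogis() (the variable names are illustrative):

```r
# plogis() is the inverse logit: it maps log-odds to a probability
p_from_zero <- plogis(0)
print(p_from_zero)   # 0.5

# qlogis() is the logit: it maps a probability to log-odds
logit_of_half <- qlogis(0.5)
print(logit_of_half) # 0

# The logistic intercept reported above, converted to a probability
p_intercept <- plogis(-4.065e-15)
print(p_intercept)
```

Read this way, an intercept of about 0 on the log-odds scale corresponds to a probability of about 0.5 when both predictors are 0.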

What puzzles me even more: when I run a "non-logistic" regression (by omitting family = "binomial")

> lrmodel <- glm(AV ~ UV1 + UV2, data = lrdata)

I get the expected results

Call:
glm(formula = AV ~ UV1 + UV2, data = lrdata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7778  -0.1250   0.1111   0.2222   0.5000  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.5000     0.1731   2.889  0.01020 * 
UV1          -0.5000     0.2567  -1.948  0.06816 . 
UV2           0.7778     0.2365   3.289  0.00433 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1797386)

    Null deviance: 5.0000  on 19  degrees of freedom
Residual deviance: 3.0556  on 17  degrees of freedom
AIC: 27.182

Number of Fisher Scoring iterations: 2

UV1 is not significant! :-)
UV2 has a positive effect on AV = 1! :-)
The intercept is 0.5! :-)
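As the "gaussian family" line in the output indicates, glm() without a family argument falls back to its default, family = gaussian, which is an ordinary linear model fitted to the 0/1 outcome. The sketch below, which rebuilds the data from the table (fit_glm and fit_lm are illustrative names), checks that it gives the same coefficients as lm().

```r
# Rebuild the test data from the table in the post
lrdata <- data.frame(
  UV1 = c(rep(1, 7), rep(0, 3), rep(1, 7), rep(0, 3)),
  UV2 = c(rep(1, 7), rep(0, 3), rep(1, 2), rep(0, 8)),
  AV  = c(rep(1, 10), rep(0, 10))
)

# glm() with no family argument uses the gaussian family (identity link)
fit_glm <- glm(AV ~ UV1 + UV2, data = lrdata)
# The same model fitted by ordinary least squares
fit_lm <- lm(AV ~ UV1 + UV2, data = lrdata)

print(family(fit_glm)$family)                  # "gaussian"
print(all.equal(coef(fit_glm), coef(fit_lm)))  # TRUE
print(coef(fit_lm))
```

The coefficients of this model are differences in the mean of AV, so they are directly on the 0-to-1 scale, unlike those of the binomial fit.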

My overall question: why does the logistic regression (with family = "binomial") not produce the expected results, while the "non-logistic" regression (without family = "binomial") does?

Update: the observations above are due to the correlation between UV1 and UV2 (corr = 0.56). After manipulating the UV2 data:

AV: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

UV1: 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0

UV2: 0, 0, 0, 1, 1, 1, 1, 1, 1, 1 , 1, 1, 0, 0, 0, 0, 0, 0, 0, 0

(I swapped the positions of three 0s and three 1s in UV2 to bring the correlation between UV1 and UV2 below 0.1), so the data become:

   UV1 UV2 AV
1    1   0  1
2    1   0  1
3    1   0  1
4    1   1  1
5    1   1  1
6    1   1  1
7    1   1  1
8    0   1  1
9    0   1  1
10   0   1  1
11   1   1  0
12   1   1  0
13   1   0  0
14   1   0  0
15   1   0  0
16   1   0  0
17   1   0  0
18   0   0  0
19   0   0  0
20   0   0  0
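The correlation between the two predictors, before and after the change, can be checked with cor(). This is a sketch with the vectors typed in from the two tables; the names UV2_old and UV2_new are illustrative and do not appear in the post.

```r
# UV1 is the same in both versions of the data
UV1 <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0)

# UV2 from the first table and from the modified table
UV2_old <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
UV2_new <- c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)

# Pearson correlation between the two binary predictors
cor_old <- cor(UV1, UV2_old)
cor_new <- cor(UV1, UV2_new)
print(cor_old)
print(cor_new)
```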

With the correlation avoided, my results come closer to what I expected:

Call:
glm(formula = AV ~ UV1 + UV2, family = "binomial", data = lrdata)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.76465  -0.81583  -0.03095   0.74994   1.58873  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -1.1248     1.0862  -1.036   0.3004  
UV1           0.1955     1.1393   0.172   0.8637  
UV2           2.2495     1.0566   2.129   0.0333 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.726  on 19  degrees of freedom
Residual deviance: 22.396  on 17  degrees of freedom
AIC: 28.396

Number of Fisher Scoring iterations: 4

But why does the correlation affect the results of the logistic regression and not those of the "non-logistic" regression?