数据挖掘 - 没有多项式特征的线性回归模型可以过拟合吗？ - 吾爱随笔录

没有多项式特征的线性回归模型可以过拟合吗？

数据挖掘线性回归过拟合

2021-10-02 18:32:39

我在互联网上的一些文章中读到线性回归可能会过拟合。但是，当我们不使用多项式特征时，这可能吗？当我们有一个特征时，我们只是在数据点上绘制一条线，当我们有两个特征时，我们只是在平面上绘制。

2个回答

确实可以！

加入一堆预测能力极低或没有预测能力的预测变量，您将获得使这些预测有效的参数估计。但是，当您从样本中尝试时，您的预测会很糟糕。

set.seed(2020)

# Define sample size
#
N <- 1000

# Define number of parameters
#
p <- 750

# Simulate data
#
X <- matrix(rnorm(N*p), N, p)

# Define the parameter vector to be 1, 0, 0, ..., 0, 0
#
B <- rep(0, p)#c(1, rep(0, p-1))

# Simulate the error term
# 
epsilon <- rnorm(N, 0, 10)

# Define the response variable as XB + epsilon 
#
y <- X %*% B + epsilon

# Fit to 80% of the data
#
L <- lm(y[1:800]~., data=data.frame(X[1:800,]))

# Predict on the remaining 20%
#
preds <- predict.lm(L, data.frame(X[801:1000, ]))

# Show the tiny in-sample MSE and the gigantic out-of-sample MSE
#
sum((predict(L) - y[1:800])^2)/800
sum((preds - y[801:1000,])^2)/200

我得到了一个样本内 MSE $7.410227$ 和样本外的 MSE $1912.764$ .

可以模拟数百次以表明这不仅仅是侥幸。

set.seed(2020)

# Define sample size
#
N <- 1000

# Define number of parameters
#
p <- 750

# Define number of simulations to do
#
R <- 250

# Simulate data
#
X <- matrix(rnorm(N*p), N, p)

# Define the parameter vector to be 1, 0, 0, ..., 0, 0
#
B <- c(1, rep(0, p-1))

in_sample <- out_of_sample <- rep(NA, R)

for (i in 1:R){
  
  if (i %% 50 == 0){print(paste(i/R*100, "% done"))}

  # Simulate the error term
  # 
  epsilon <- rnorm(N, 0, 10)
  
  # Define the response variable as XB + epsilon
  #
  y <- X %*% B + epsilon
  
  # Fit to 80% of the data
  #
  L <- lm(y[1:800]~., data=data.frame(X[1:800,]))
  
  # Predict on the remaining 20%
  #
  preds <- predict.lm(L, data.frame(X[801:1000, ]))

  # Calculate the tiny in-sample MSE and the gigantic out-of-sample MSE
  #
  in_sample[i] <- sum((predict(L) - y[1:800])^2)/800
  out_of_sample[i] <- sum((preds - y[801:1000,])^2)/200
}

# Summarize results
#
boxplot(in_sample, out_of_sample, names=c("in-sample", "out-of-sample"), main="MSE")
summary(in_sample)
summary(out_of_sample)
summary(out_of_sample/in_sample)

该模型每次都严重过度拟合。

In-sample MSE summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.039   5.184   6.069   6.081   7.029   9.800

Out-of-sample MSE summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  947.8  1291.6  1511.6  1567.0  1790.0  3161.6

Paired Ratio Summary (always (!) much larget than 1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  109.8   207.9   260.2   270.3   319.6   566.9

普通最小二乘法 (OLS) 非常稳健，在高斯马尔可夫假设下，它是最佳线性无偏估计器 (BLU)。因此，不存在被理解为问题的过度拟合，例如神经网络。如果你要这么说，那就是“合身”。

当您应用 OLS 的变体时，包括添加多项式或应用加法模型，当然会有好的和坏的模型。

使用 OLS，您需要确保满足基本假设，因为如果您违反重要假设，OLS 可能会出错。然而，OLS 的许多应用，例如计量经济学中的因果模型，并不知道过度拟合本身就是一个问题。模型通常通过添加/删除变量并检查 AIC、BIC 或调整后的 R 方来“调整”。

另请注意，OLS 通常不是预测建模的最佳方法。虽然 OLS 相当健壮，但神经网络或 boosting 之类的东西通常能够产生比 OLS 更好的预测（误差更小）。

编辑：当然你需要确保你估计一个有意义的模型。这就是为什么在选择模型（包括哪些变量）时应该查看 BIC、AIC、调整后的 R 平方的原因。“太大”的模型以及“太小”的模型（省略变量偏差）可能是一个问题。但是，在我看来，这不是过拟合的问题，而是模型选择的问题。

其它你可能感兴趣的问题

上一篇为什么我在预测时应该关注季节性数据？下一篇我应该为 PCA 使用 keras 还是 sklearn？