Cross Validated: Manually calculated $R^2$ doesn't match up with randomForest() $R^2$ for testing new data

Cross Validated r correlation predictive-models random-forest r-squared

2022-01-24 13:13:21

I know this is a fairly specific R question, but I may be thinking about proportion of variance explained, $R^2$, incorrectly. Here goes.

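I'm trying to use the R package randomForest. I have some training data and some testing data. When I fit a random forest model, the randomForest function lets you pass in new testing data to evaluate. It then reports the percentage of variance explained in this new data. When I look at this, I get one number.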

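When I use the predict() function to predict the outcome values of the testing data from the model fitted on the training data, and then square the correlation coefficient between these predictions and the actual outcome values of the testing data, I get a different number. These values don't match up.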

Here's some R code to demonstrate the problem.

# use the built in iris data
data(iris)

#load the randomForest library
library(randomForest)

# split the data into training and testing sets
index <- 1:nrow(iris)
trainindex <- sample(index, trunc(length(index)/2))
trainset <- iris[trainindex, ]
testset <- iris[-trainindex, ]

# fit a model to the training set (column 1, Sepal.Length, will be the outcome)
set.seed(42)
model <- randomForest(x=trainset[ ,-1],y=trainset[ ,1])

# predict values for the testing set (the first column is the outcome, leave it out)
predicted <- predict(model, testset[ ,-1])

# what's the squared correlation coefficient between predicted and actual values?
cor(predicted, testset[, 1])^2

# now, refit the model using built-in x.test and y.test
set.seed(42)
randomForest(x=trainset[ ,-1], y=trainset[ ,1], xtest=testset[ ,-1], ytest=testset[ ,1])

1 Answer

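The reason the $R^2$ values don't match is that randomForest reports variation explained rather than variance explained. I think this is a common misunderstanding about $R^2$ that is perpetuated in textbooks. I even mentioned this on another thread the other day. If you want an example, see the (otherwise quite good) textbook Seber and Lee, Linear Regression Analysis, 2nd ed.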

A general definition of $R^2$ is

$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} .$

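That is, we compute the mean squared error, divide it by the variance of the original observations, and subtract the result from one. (Note that if your predictions are really bad, this value can go negative.)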
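As a small illustration of that last remark, here is a sketch in base R with made-up numbers. The helper name r2_general and the toy vector are hypothetical and not part of the original thread. It shows the general definition returning 0 for a constant mean prediction and a negative value for predictions that run the wrong way, while the squared correlation for those same predictions stays at 1.

```r
# Hypothetical helper: R^2 as one minus residual sum of squares over total sum of squares.
r2_general <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

# Toy outcome values (invented for illustration).
y <- c(1, 2, 3, 4, 5)

# Predicting the mean everywhere is the baseline and scores 0.
r2_mean <- r2_general(y, rep(mean(y), length(y)))

# Predictions in reverse order do worse than the mean, so the value is negative.
r2_bad <- r2_general(y, rev(y))

# The squared correlation cannot fall below 0, even for these predictions.
cor_sq_bad <- cor(y, rev(y))^2

print(c(r2_mean = r2_mean, r2_bad = r2_bad, cor_sq_bad = cor_sq_bad))
```

The reversed predictions are perfectly (negatively) correlated with the outcome, which is why the squared correlation rates them as perfect while the general definition does not.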

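Now, what happens with linear regression (with an intercept term!) is that the average of the $\hat{y}_i$ matches $\bar{y}$. Furthermore, the residual vector $y - \hat{y}$ is orthogonal to the vector of fitted values $\hat{y}$. Put these two facts together and the definition reduces to the one that is more commonly encountered, i.e.,

$R^2_{\mathrm{LR}} = \mathrm{Corr}(y,\hat{y})^2 .$

(I've used the subscript $\mathrm{LR}$ in $R^2_{\mathrm{LR}}$ to indicate linear regression.)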
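For readers who want the intermediate step, here is a sketch of why those two facts give the correlation form. The shorthand $e_i = y_i - \hat{y}_i$ is introduced here for convenience and is not in the original answer. The sketch assumes the fitted values are not all equal, so the correlation is defined. Matching means give $\sum_i e_i = 0$, and orthogonality gives $\sum_i e_i \hat{y}_i = 0$, so the cross term $\sum_i e_i (\hat{y}_i - \bar{y})$ vanishes. Then

$\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{y}) = \sum_i e_i (\hat{y}_i - \bar{y}) + \sum_i (\hat{y}_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 ,$

$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2 ,$

and therefore

$\mathrm{Corr}(y,\hat{y})^2 = \frac{\left(\sum_i (\hat{y}_i - \bar{y})^2\right)^2}{\sum_i (y_i - \bar{y})^2 \, \sum_i (\hat{y}_i - \bar{y})^2} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} .$

Neither condition is guaranteed for random forest predictions on held-out data, which is why the two quantities can differ there.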

The randomForest call uses the first definition, so if you do

   > y <- testset[,1]
   > 1 - sum((y-predicted)^2)/sum((y-mean(y))^2)

you will see that the answers match.
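To see the two definitions side by side without the randomForest package, here is a sketch using only base R and the built-in iris data. The choice of predictors and the added bias of 1 are assumptions made for illustration. For an in-sample linear regression with an intercept the two numbers agree; once the predictions are shifted by a constant, the squared correlation is unchanged while the general definition drops.

```r
data(iris)

# Linear regression with an intercept, fitted and evaluated on the same data.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
y <- iris$Sepal.Length
y_hat <- fitted(fit)

# General definition versus squared correlation; for this fit they coincide.
r2_general <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
r2_cor <- cor(y, y_hat)^2

# Add a constant bias to the predictions (a made-up perturbation).
y_shifted <- y_hat + 1
r2_general_shifted <- 1 - sum((y - y_shifted)^2) / sum((y - mean(y))^2)
r2_cor_shifted <- cor(y, y_shifted)^2

# Correlation ignores the shift; the general definition is penalised by it.
print(c(r2_general = r2_general, r2_cor = r2_cor,
        r2_general_shifted = r2_general_shifted, r2_cor_shifted = r2_cor_shifted))
```

Predictions on a held-out test set can be biased or mis-scaled in the same way, so a squared correlation computed there can differ from the figure that the first definition gives.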
