我有点困惑:通过插入符号训练的模型的结果与原始包中的模型有何不同?我阅读了在使用带有插入符号包的 RandomForest 的 FinalModel 进行预测之前是否需要预处理?但我在这里不使用任何预处理。
我通过使用 caret 包和调整不同的 mtry 值来训练不同的随机森林。
> cvCtrl = trainControl(method = "repeatedcv",number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
> newGrid = expand.grid(mtry = c(2,4,8,15))
> classifierRandomForest = train(case_success ~ ., data = train_data, trControl = cvCtrl, method = "rf", metric="ROC", tuneGrid = newGrid)
> curClassifier = classifierRandomForest
我发现 mtry=15 是 training_data 上的最佳参数:
> curClassifier
...
Resampling results across tuning parameters:
mtry ROC Sens Spec ROC SD Sens SD Spec SD
4 0.950 0.768 0.957 0.00413 0.0170 0.00285
5 0.951 0.778 0.957 0.00364 0.0148 0.00306
8 0.953 0.792 0.956 0.00395 0.0152 0.00389
10 0.954 0.797 0.955 0.00384 0.0146 0.00369
15 0.956 0.803 0.951 0.00369 0.0155 0.00472
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.
我用 ROC 曲线和混淆矩阵评估了模型:
##ROC-Curve
predRoc = predict(curClassifier, test_data, type = "prob")
myroc = pROC::roc(test_data$case_success, as.vector(predRoc[,2]))
plot(myroc, print.thres = "best")
##adjust optimal cut-off threshold for class probabilities
threshold = coords(myroc,x="best",best.method = "closest.topleft")[[1]] #get optimal cutoff threshold
predCut = factor( ifelse(predRoc[, "Yes"] > threshold, "Yes", "No") )
##Confusion Matrix (Accuracy, Spec, Sens etc.)
curConfusionMatrix = confusionMatrix(predCut, test_data$case_success, positive = "Yes")
由此产生的混淆矩阵和准确性:
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 2757 693
Yes 375 6684
Accuracy : 0.8984
....
现在我使用基本的 randomForest 包训练了具有相同参数和相同 training_data 的 Random Rorest:
randomForestManual <- randomForest(case_success ~ ., data=train_data, mtry = 15, ntree=500,keep.forest=TRUE)
curClassifier = randomForestManual
我再次为与上面相同的 test_data 创建了预测,并使用与上面相同的代码评估了混淆矩阵。但现在我得到了不同的措施:
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 2702 897
Yes 430 6480
Accuracy : 0.8737
....
是什么原因?我错过了什么?