Post-processing with random forests

data-mining classification random-forest
2022-02-26 10:46:00

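I know that you can get better overall results by filtering out the instances whose labels the random forest's trees are uncertain about and modelling those instances with another classifier. My question is: how do I "combine" the classifications of the two (or more) classifiers on an unlabeled dataset?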

1 Answer

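Here is a solution using R's caret package. First, a random forest is trained on the data. Then every observation whose highest class probability (from the tree votes) is below 99% is passed to model 2, a linear discriminant analysis. Only the probabilities from held-out resampling observations are used, because otherwise the random forest would fit the training data perfectly. That is what caret is needed for.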
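The routing rule described above can be sketched in base R, independent of caret. This is only an illustration: `route_by_confidence` is a hypothetical helper, not a caret function, and the toy probability matrix stands in for what `predict(M1, newdata, type = "prob")` and `predict(M2, newdata)` would return.

```r
# Hypothetical helper (not part of caret): use model 1's prediction when it is
# confident, and fall back to model 2's prediction otherwise.
route_by_confidence <- function(probs1, pred2, threshold = 0.99) {
  probs1 <- as.matrix(probs1)
  # Model 1's confidence = highest class probability in each row
  conf1 <- apply(probs1, 1, max)
  # Model 1's predicted class = name of the column holding that maximum
  pred1 <- colnames(probs1)[max.col(probs1, ties.method = "first")]
  use_m2 <- conf1 < threshold
  data.frame(
    source = ifelse(use_m2, "M2", "M1"),
    pred = ifelse(use_m2, as.character(pred2), pred1),
    confidence = conf1,
    stringsAsFactors = FALSE
  )
}

# Toy inputs (made up for illustration, not real model output)
probs1 <- rbind(c(1.00, 0.000, 0.000),
                c(0.02, 0.600, 0.380),
                c(0.00, 0.005, 0.995))
colnames(probs1) <- c("setosa", "versicolor", "virginica")
pred2 <- factor(c("setosa", "virginica", "virginica"),
                levels = colnames(probs1))

combined <- route_by_confidence(probs1, pred2)
print(combined)
```

Only the second row falls below the 0.99 threshold, so it is the only one that takes model 2's label. The threshold is a tuning choice; 0.99 is simply the value used in this answer.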

For the uncertain cases, accuracy is slightly higher, but this may be overfitting, since I tried several different models and the dataset is small.

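I wonder whether this can really improve performance on out-of-sample test data in your application. Are there any papers that recommend this approach? It seems similar to boosting. I tried it on some of my own data but could not improve on the out-of-sample performance obtained from model 1 (the random forest).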

data(iris)
library(caret)
myTrainControl <- trainControl(method = "cv", number = 10,
                               savePredictions = TRUE,
                               classProbs = TRUE)
set.seed(4213) # To get the same resamples every time
M1 <- train(y = iris$Species,
            x = iris[, !(names(iris) == "Species")],
            method = "rf",
            trControl = myTrainControl)

# Keep only the held-out predictions for mtry = 2 (compare with M1$bestTune):
M1pred <- M1$pred[M1$pred$mtry == 2, ]
confusionMatrix(M1pred$pred, M1pred$obs)
# Accuracy 96%

# Inspect the predicted class probabilities:
probs <- as.matrix(M1pred[, c("setosa", "versicolor", "virginica")])
# Confidence = highest class probability in each row
maxProbs <- apply(probs, 1, max)
summary(maxProbs)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.5120  0.9805  0.9980  0.9645  1.0000  1.0000

# Let's define anything below 99% as uncertain:
uncertain <- which(maxProbs < 0.99)
length(uncertain) # 45
# Find rows in iris data that belong to 'uncertain' cases:
uncertainIndex <- M1pred$rowIndex[uncertain]
# M1 performance in uncertain cases:
confusionMatrix(M1pred$pred[uncertain],
                reference = M1pred$obs[uncertain])
#              Reference
# Prediction   setosa versicolor virginica
# setosa          5          0         0
# versicolor      0         17         3
# virginica       0          3        17
#
# Overall Statistics
# Accuracy : 0.8667

# Train new model on uncertain data only:
irisUncertain <- iris[uncertainIndex, ]
set.seed(4213) # To get the same resamples every time
M2 <- train(y = irisUncertain$Species,
            x = irisUncertain[, !(names(irisUncertain) == "Species")],
            method = "lda",
            trControl = myTrainControl)

M2pred <- M2$pred

confusionMatrix(M2pred$pred, reference = M2pred$obs)
#              Reference
# Prediction   setosa versicolor virginica
# setosa          4          0         0
# versicolor      1         16         0
# virginica       0          4        20
# 
# Overall Statistics
# Accuracy : 0.8889 
# (Small discrepancy: why does caret report an accuracy of 88.5% for M2?)

# For new data predictions can be made as follows:
# (just as an example from the original data again)
# Some 'uncertain' cases are 'certain' now that the full M1 model is used
newdat <- irisUncertain[30:34, -5]
M1maxProbs <- apply(predict(M1, newdat, type = "prob"), 1, max)
ifelse(M1maxProbs < 0.99,
       paste("M2:", predict(M2, newdat)),
       paste("M1:", predict(M1, newdat)))
# 85               94              107              126 
# "M2: versicolor" "M1: versicolor"  "M2: virginica"  "M1: virginica" 
# 71 
# "M2: virginica"
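On the discrepancy noted in the comments above (0.8889 from the confusion matrix versus the 88.5% that caret reports): one plausible explanation, offered here as an assumption rather than something verified on this run, is that caret's resampled accuracy is the mean of the per-fold accuracies, while `confusionMatrix` on `M2$pred` pools all held-out predictions into one table. With 45 rows split into 10 folds the folds cannot all be the same size, so the two numbers need not agree. The toy data below is made up purely to show the effect.

```r
# Toy held-out results: 1 = correct, 0 = wrong, in three folds of unequal size
correct <- c(1, 1, 1, 1,   1, 0,   1, 1, 0, 1)
fold    <- c(rep("Fold1", 4), rep("Fold2", 2), rep("Fold3", 4))

# Pooled accuracy: every held-out row counts equally
pooled_accuracy <- mean(correct)

# Fold-averaged accuracy: every fold counts equally, whatever its size
per_fold <- tapply(correct, fold, mean)
fold_mean_accuracy <- mean(per_fold)

print(c(pooled = pooled_accuracy, fold_mean = fold_mean_accuracy))
```

On the fitted model itself, the same comparison could be made between `mean(M2$resample$Accuracy)` and the pooled accuracy computed from `M2$pred`.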