I built a model with randomForest for the following dataset: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice
My question is why the model's results differ so much between my training set and my test set.
library(randomForest)
library(caret)
df <- read.csv('cmc.csv')
Converting the values to factors:
df$Wife.s.education <- as.factor(df$Wife.s.education)
df$Husband.s.education <- as.factor(df$Husband.s.education)
df$Wife.s.religion <- as.factor(df$Wife.s.religion)
df$Wife.s.now.working. <- as.factor(df$Wife.s.now.working.)
df$Husband.s.occupation <- as.factor(df$Husband.s.occupation)
df$Standard.of.living.index <- as.factor(df$Standard.of.living.index)
df$Media.exposure <- as.factor(df$Media.exposure)
# Add string labels for readability
df[df$Contraceptive.method.used == 1,]$Contraceptive.method.used <- "No-use"
df[df$Contraceptive.method.used == 2,]$Contraceptive.method.used <- "Long-term"
df[df$Contraceptive.method.used == 3,]$Contraceptive.method.used <- "Short-term"
df$Contraceptive.method.used <- as.factor(df$Contraceptive.method.used)
Splitting the data:
set.seed(47)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.7,0.3))
# Note: inTrain is computed here but never used; the split below relies on ind.
inTrain <- createDataPartition(y=df$Contraceptive.method.used, p=.7, list = FALSE)
training = df[ind==1,]
testing = df[ind==2,]
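The split above is driven by `ind`, not by the `createDataPartition` index. For comparison, here is a minimal sketch of how a stratified index from `createDataPartition` could be used to split a data frame. It assumes the built-in `iris` data as a stand-in for the cmc data, and the names `in_train`, `training_strat` and `testing_strat` are hypothetical.

```r
library(caret)

set.seed(47)

# Stand-in data (assumption): iris replaces the cmc data frame.
# createDataPartition samples within each class, so class proportions
# are roughly preserved in both parts.
in_train <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)

training_strat <- iris[in_train, ]   # rows selected by the index
testing_strat  <- iris[-in_train, ]  # all remaining rows

# Compare class proportions in the two parts.
print(prop.table(table(training_strat$Species)))
print(prop.table(table(testing_strat$Species)))
```

Whether stratification changes anything for this dataset is not shown here. The sketch only illustrates how the index would be applied.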
Creating the model:
model <- randomForest(Contraceptive.method.used ~., data = training, proximity=TRUE)
#Prediction & Confusion Matrix - training data
p1 <- predict(model, training)
confusionMatrix(p1, training$Contraceptive.method.used)
Confusion matrix results (training):
            Reference
Prediction   Long-term No-use Short-term
  Long-term        208      9         18
  No-use             4    408          5
  Short-term        16     14        338
Overall Statistics
Accuracy : 0.9353
95% CI : (0.9184, 0.9496)
No Information Rate : 0.4225
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.9002
Mcnemar's Test P-Value : 0.09773
#Prediction & Confusion Matrix - testing data
p2 <- predict(model, testing)
confusionMatrix(p2, testing$Contraceptive.method.used)
Confusion matrix results (testing):
            Reference
Prediction   Long-term No-use Short-term
  Long-term         42     11         23
  No-use            27    122         47
  Short-term        36     65         80
Overall Statistics
Accuracy : 0.5386
95% CI : (0.4915, 0.5853)
No Information Rate : 0.4371
P-Value [Acc > NIR] : 8.972e-06
Kappa : 0.2788
Mcnemar's Test P-Value : 0.005869
As we can see, the two sets of results differ dramatically. If I print my model, I get the following:
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 46.96%
Confusion matrix:
           Long-term No-use Short-term class.error
Long-term         76     63         89   0.6666667
No-use            45    282        104   0.3457077
Short-term        72    106        183   0.4930748
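As a cross-check on the three outputs above, the accuracies can be recomputed from the confusion matrices by hand. The sketch below uses only base R, with the counts copied from the outputs. The helper `accuracy` and the variable names are introduced here for illustration.

```r
# Counts copied from the outputs above; classes are in the order
# Long-term, No-use, Short-term along both dimensions.
train_cm <- matrix(c(208,   9,  18,
                       4, 408,   5,
                      16,  14, 338), nrow = 3, byrow = TRUE)
test_cm  <- matrix(c( 42,  11,  23,
                      27, 122,  47,
                      36,  65,  80), nrow = 3, byrow = TRUE)
oob_cm   <- matrix(c( 76,  63,  89,
                      45, 282, 104,
                      72, 106, 183), nrow = 3, byrow = TRUE)

# Accuracy is the share of all counts that lie on the diagonal.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

train_acc <- accuracy(train_cm)
test_acc  <- accuracy(test_cm)
oob_acc   <- accuracy(oob_cm)

print(round(c(train = train_acc, test = test_acc, oob = oob_acc), 4))
```

The out-of-bag accuracy comes to about 0.5304 (that is, 1 minus the reported 46.96% OOB error). That is close to the 0.5386 measured on the test set and far from the 0.9353 measured on the training set.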
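The OOB estimate leads me to believe that the test results are the correct ones, but I am not sure why the results are so different when the model is applied to the training set. Is this due to overfitting? If so, how should I deal with it?
Any guidance would be appreciated.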
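One way to look into the gap, sketched under assumptions: for a `randomForest` object, calling `predict()` without `newdata` returns the out-of-bag predictions, whereas `predict(model, training)` lets every tree vote on rows it may have seen while being grown. The example uses the built-in `iris` data as a stand-in for the cmc data, and the names `resub` and `oob` are hypothetical.

```r
library(randomForest)

set.seed(47)

# Stand-in data (assumption): iris replaces the cmc training set.
model_demo <- randomForest(Species ~ ., data = iris)

# Resubstitution: all trees vote, including trees grown on the row itself.
resub <- predict(model_demo, iris)

# Out-of-bag: with newdata omitted, predict() returns the OOB predictions,
# where each row is scored only by trees that did not sample it.
oob <- predict(model_demo)

resub_acc <- mean(resub == iris$Species)
oob_acc   <- mean(oob == iris$Species)

print(c(resubstitution = resub_acc, oob = oob_acc))
```

Applied to the model above, comparing `predict(model)` with `training$Contraceptive.method.used` should reproduce the OOB confusion matrix from the printed model rather than the 93.5% figure.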