I'm trying to predict whether a customer will cancel a booking, using parameters such as trip type, car type, booking source, start time, lead time (the gap between booking and trip start), and a few others. In the code below, the first classification tree I fit, default.ct, gives me 75% accuracy. The deeper tree I grew, deeper.ct, gives me 70% accuracy. Pruning the tree leaves the accuracy roughly unchanged. Boosting with the adabag package takes too long, since I have nearly 500,000 observations on 19 variables. xgboost gives me the best mlogloss, around 0.43.

What can I do to improve the accuracy of the model?
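For context, the xgboost setup was roughly along these lines (a minimal sketch, not the exact run that produced the 0.43; the model.matrix encoding and parameter values are illustrative, and a multi-class tag would use objective "multi:softprob" with eval_metric "mlogloss" instead of the binary pair shown):

library(xgboost)

# one-hot encode predictors: xgboost needs a numeric matrix
X.train <- model.matrix(tag ~ . - 1, data = train.df)
X.valid <- model.matrix(tag ~ . - 1, data = valid.df)
dtrain <- xgb.DMatrix(X.train, label = as.numeric(train.df$tag) - 1)
dvalid <- xgb.DMatrix(X.valid, label = as.numeric(valid.df$tag) - 1)

# illustrative, untuned parameters
params <- list(objective = "binary:logistic", eval_metric = "logloss",
               max_depth = 6, eta = 0.1)
xgb.fit <- xgb.train(params, dtrain, nrounds = 200,
                     watchlist = list(train = dtrain, valid = dvalid),
                     early_stopping_rounds = 20)

The classification-tree code is below: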
library(rpart)        # rpart() classification trees
library(rpart.plot)   # prp() tree plots
library(caret)        # confusionMatrix()

# Generate classification tree
default.ct <- rpart(tag ~ ., data = train.df, method = "class",
                    control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.001))
# list the variables actually used in the tree
setdiff(unique(as.character(default.ct$frame$var)), "<leaf>")
printcp(default.ct)
# plot the tree
prp(default.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10)
# generate confusion matrix for training data
default.ct.point.pred.train <- predict(default.ct, train.df, type = "class")
confusionMatrix(default.ct.point.pred.train, train.df$tag)
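# a sketch of the same check on the holdout set: accuracy on train.df is
# optimistic for a tree grown with minbucket = 1, so valid.df (used further
# down in this script) gives a fairer estimate
default.ct.point.pred.valid <- predict(default.ct, valid.df, type = "class")
confusionMatrix(default.ct.point.pred.valid, valid.df$tag)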
# grow a full-depth tree (cp = 0, minsplit = 1)
deeper.ct <- rpart(tag ~ ., data = train.df, method = "class", cp = 0,
                   minsplit = 1)
# count number of leaves
length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"])
## Use cross-validation to prune the tree
cv.ct <- rpart(tag ~ ., data = train.df, method = "class", cp = 0,
               minsplit = 5, xval = 5)
# use printcp() to print the table.
printcp(cv.ct)
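# a sketch of a standard way to choose cp from this table: take the row with
# the lowest cross-validated error ("CP" and "xerror" are cptable column names)
best.cp <- cv.ct$cptable[which.min(cv.ct$cptable[, "xerror"]), "CP"]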
# Store validation-set accuracy for each candidate cp
acc <- numeric(nrow(cv.ct$cptable))   # avoid masking base::c()
for (i in 1:nrow(cv.ct$cptable)) {
  pruned.ct <- prune(cv.ct, cp = cv.ct$cptable[i, "CP"])
  pruned.ct.point.pred.valid <- predict(pruned.ct, valid.df, type = "class")
  acc[i] <- confusionMatrix(pruned.ct.point.pred.valid, valid.df$tag)$overall["Accuracy"]
}
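# print the validation accuracies against their cp values
data.frame(cp = cv.ct$cptable[, "CP"], accuracy = acc)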
# prune the tree with the second-largest cp, then count its leaves
pruned.ct <- prune(cv.ct, cp = cv.ct$cptable[2, "CP"])
length(pruned.ct$frame$var[pruned.ct$frame$var == "<leaf>"])
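# a sketch scoring the pruned tree on the holdout set, reusing the same
# evaluation as for default.ct above
pruned.ct.point.pred.valid <- predict(pruned.ct, valid.df, type = "class")
confusionMatrix(pruned.ct.point.pred.valid, valid.df$tag)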