New factor levels not present in the training data

Cross Validated · Tags: r, machine-learning, random-forest, multi-class

2022-03-31 19:01:47

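I am getting the error "New factor levels not present in the training data". But I checked nlevels and class for every column in both the development and the test data, and they are identical. Is there a plausible explanation?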

2 Answers

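RF handles factors by one-hot encoding them: it creates a new dummy column for each level of the factor variable. When the scoring data frame contains new or different factor levels, bad things happen.

If train and test live together in the same data structure at the moment the factor is defined, there is no problem. When test defines its factors separately, you run into trouble.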

library("randomForest")

# Fit an RF on a few numerics and a factor. Give test set a new level.
N <- 100
df <- data.frame(num1 = rnorm(N), 
                 num2 = rnorm(N), 
                 fac = sample(letters[1:4], N, TRUE),
                 y = rnorm(N),
                 stringsAsFactors = FALSE)
df[100, "fac"] <- "a suffusion of yellow"
df$fac <- as.factor(df$fac)

train <- df[1:50, ]
test <- df[51:100, ]

rf <- randomForest(y ~ ., data=train)

# This is fine, even though the "yellow" level doesn't exist in train, RF
# is aware that it is a valid factor level
predict(rf, test)

# This is not fine. The factor level is introduced and RF can't know of it
test$fac <- as.character(test$fac)
test[50, "fac"] <- "toyota corolla"
test$fac <- as.factor(test$fac)
predict(rf, test)

You can work around this by relevelling the factor in the scoring data so that it matches the training data.

# Can get around by relevelling the new factor. "toyota corolla" becomes NA
test$fac <- factor(test$fac, levels = levels(train$fac))
predict(rf, test)
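
On the original question: matching nlevels and class does not guarantee matching level sets, since two factors can have the same number of levels but different labels. The sketch below uses base R only and shows one way to list the offending levels before calling predict(). The helper name find_new_levels() and the toy data frames are made up for illustration; they are not part of randomForest.

```r
# Hypothetical helper (not part of randomForest): for every factor column of
# `train` that also exists in `test`, return the test levels unknown to train.
find_new_levels <- function(train, test) {
  shared <- intersect(names(train), names(test))
  is_fac <- vapply(shared, function(nm) is.factor(train[[nm]]), logical(1))
  fac_cols <- shared[is_fac]

  out <- lapply(fac_cols, function(nm) {
    x <- test[[nm]]
    # Use declared levels for factors, observed non-missing values otherwise.
    test_lv <- if (is.factor(x)) {
      levels(x)
    } else {
      x <- as.character(x)
      unique(x[!is.na(x)])
    }
    setdiff(test_lv, levels(train[[nm]]))
  })
  names(out) <- fac_cols
  out[lengths(out) > 0]
}

# Toy data: same class, same number of levels, different level sets.
train <- data.frame(fac = factor(c("a", "b", "c")), num = 1:3)
test  <- data.frame(fac = factor(c("a", "b", "d")), num = 4:6)

same_count <- nlevels(train$fac) == nlevels(test$fac)  # the check passes
same_class <- identical(class(train$fac), class(test$fac))
new_levels <- find_new_levels(train, test)             # yet "d" is new
print(new_levels)
```

If new_levels is non-empty, relevelling as shown above will turn those values into NA, so it is worth deciding deliberately how such rows should be scored.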

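I also ran into this problem when using expand.grid() to inspect randomForest() predictions across various factor levels.

The problem is caused by the expand.grid() default of stringsAsFactors = TRUE, which coerces strings to factors using only the levels present in the data supplied. When you predict on just a subset of the factor levels, this causes trouble.

I solved it by setting stringsAsFactors = FALSE and then letting randomForest() do the one-hot encoding, as suggested in the previous answer.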
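
A minimal base-R sketch of the expand.grid() behaviour described here. It goes one step further than this answer by re-declaring the full level set with factor(..., levels = ...), which is the relevelling trick from the first answer. The vector train_levels and the column names are assumptions made for illustration; in practice the levels would come from levels(train$fac).

```r
# Levels as the model saw them during training (illustrative values).
train_levels <- c("a", "b", "c", "d")

# Default stringsAsFactors = TRUE: the factor only knows the values passed in.
grid_default <- expand.grid(num1 = c(-1, 0, 1), fac = c("a", "b"))
levels(grid_default$fac)   # "a" "b"

# Keep strings as strings, then declare the training level set explicitly.
grid <- expand.grid(num1 = c(-1, 0, 1), fac = c("a", "b"),
                    stringsAsFactors = FALSE)
grid$fac <- factor(grid$fac, levels = train_levels)
levels(grid$fac)           # "a" "b" "c" "d"
```

The grid still contains only rows for "a" and "b"; what changes is that the factor now declares the same levels as the training data.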
