Using a random forest as a variable-importance pre-processing step before another analysis

machine learning · support vector machines · feature selection · random forest
2022-04-19 06:27:16

Question
Demonstrate the speed and accuracy of a properly applied "random forest" as a variable-importance selection tool, especially on very large data, against alternative methods such as the support vector machine (SVM). Ideally it would act as a pre-processor, followed by a second method that is less "non-parametric".

I have given a previous answer (link) to this that was weak, but I think a clean and complete answer would be relevant to learners in the field.

Motivation
Several papers have been published [1, 2, 3, 4] in which the method is described as "generally more accurate and faster (on data sets that are large in both dimensions) than the original Friedman's MART and Breiman's RF", and as able to handle tens of thousands of mixed-type (categorical and numeric) predictor variables (columns) with missing values and tens of millions of samples (rows), even when the "physics" is highly nonlinear, multivariate, and noisy.

In those papers the basic tool is the "random forest" used as a pre-processor. The general approach is to use a random forest to determine the important variables (columns), and then to operate on only those columns with a more sophisticated method. In one case the columns of the data are augmented to strengthen the determination of variable importance. There are many candidate second-stage methods, including least-squares fitting and ANOVA.
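For concreteness, the pipeline I have in mind looks roughly like the following sketch. The data frame df, the response y, the "keep everything above mean importance" rule, and the use of e1071::svm as the second stage are all illustrative assumptions on my part, not taken from the referenced papers.

#two-stage sketch: random forest for variable importance, then a second method
library(randomForest)
library(e1071)   #one possible second-stage learner (svm)

#stage 1: random forest on all candidate columns, with importance turned on
fit.pre <- randomForest(x = df[, setdiff(names(df), "y")], y = df$y,
                        ntree = 500, importance = TRUE)

#keep columns whose mean decrease in accuracy is above average (illustrative rule)
imp  <- importance(fit.pre, type = 1)
keep <- rownames(imp)[imp[, 1] > mean(imp[, 1])]

#stage 2: fit the more expensive / less "non-parametric" method on the kept columns only
fit.post <- svm(x = df[, keep, drop = FALSE], y = df$y)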

I have personally found this approach very useful on both high- and low-dimensional data. Although it has been published, I am not aware of its usefulness having been demonstrated outside a small part of the field, so I suspect many practitioners do not know to use it.

Although it is a heuristic argument, I think the fact that most of the winning algorithms on Kaggle belong to the parallel-ensemble (RF) and serial-ensemble (GBM) families is a good indication of the power of these families of methods.

Details:
There should be two data sets. The first should be a decent "toy", like the one shown on the demo tab here; the second should be a decent "big boy" data set, such as a pre-processed version of the HIVA data.

The methods should use "default" or "textbook" settings. Tuning beyond these is likely to give less generalizable results.

The measures of interest should include CPU time and error.

Moderator note: I think a question like this, and its answer, matter and would provide value to CV readers. If there is no good answer, I intend to provide one myself. (Consider this one: it is very basic, but keeps getting upvoted because people keep finding value in it. This question is my follow-up to this meta question.)

1 Answer

Note: this answer is incomplete.

Toy problem:
This is a trivially small problem, deliberately low-dimensional, so that human intuition and learning are as accessible as possible. Personally, I find this (link, link) demo accessible for my own intuition and learning, and so do the folks at the Max Planck Institute for Biological Cybernetics.

The "non-augmented" data takes the form:

$$\begin{bmatrix}
\text{Class} & X & Y \\
A & x_1 & y_1 \\
A & x_2 & y_2 \\
\vdots & \vdots & \vdots \\
B & x_n & y_n
\end{bmatrix}$$

The "physics" of the "good" class is a spiral starting at the origin while the bad class is uniformly random. The human eye can see that quickly. When evaluating "variable importance" we are trying to reduce the number of columns, but the non-augmented has no columns to reduce, thus we augment with random. There is the problem of overlap, it would be better to reclassify some of the uniform random within a range of "ideal" to class "A".

So here is the code that makes the "non-augmented" data:

#housekeeping
rm(list=ls())

#library
library(randomForest)

#for reproducibility
set.seed(08012015)

#basic
n <- 1:2000
r <- 0.05*n +1 
th <- n*(4*pi)/max(n)

#polar to cartesian
x1=r*cos(th) 
y1=r*sin(th)

#add noise
x2 <- x1+0.1*r*runif(min = -1,max = 1,n=length(n))
y2 <- y1+0.1*r*runif(min = -1,max = 1,n=length(n))

#append salt and pepper
x3 <- runif(min = min(x2),max = max(x2),n=length(n))
y3 <- runif(min = min(y2),max = max(y2),n=length(n))

X <- c(x2,x3)
Y <- c(y2,y3)
myClass <- as.factor(c(as.vector(matrix(1,nrow=length(y2))),
        as.vector(matrix(2,nrow=length(y3))) ))

#plot class "A" derivation
plot(x1,y1,pch=18,type="l",col="Red", lwd=2)
points(x2,y2,pch=18)
points(x3,y3,pch=1,col="Blue")
legend(x = 65,y=65,
       legend = c("true","sampled A","sampled B"),
       col = c("Red","Black","Blue"),
       lty = c(1,-1,-1),
       pch=c(-1,18,1))

Here is a plot of the non-augmented data: [plot showing the spiral of class "A" and the uniform random points of class "B"]

Here is the code to augment the "toy" data for variable-importance detection, and to assemble it into a single data frame.

#create uninformative "noise" columns by resampling the informative columns (breaks the x-y structure)
x5 <- sample(x = X, size = length(X),replace = T)
y5 <- sample(x = Y, size = length(Y),replace = T)


#assemble data into frame 
data <- data.frame(myClass,
                   c(X),c(Y),c(x5),c(y5) )
names(data) <- c("myclass","x","y","n1","n2")

First, a random forest (not yet with the t-tests of the Tuv reference) is used on all input columns to determine relative variable importance and to get a sense of a sufficient number of trees. The assumption is that more trees are needed for a decent fit when low-importance columns are included than when only uniformly higher-importance columns are used.

#train random forest - I like h2o, but this is textbook Breiman
fit.rf_imp <- randomForest(data[2:5],data$myclass,
                       ntree = 2000, replace=TRUE, nodesize = 1,
                       localImp=T )

varImpPlot(fit.rf_imp)
plot(fit.rf_imp) 
grid()
importance(fit.rf_imp)

The results for importance (in plot form) are: [variable importance plot]

The mean decrease in accuracy and the mean decrease in Gini carry a consistent message: "n1 and n2 are low-importance columns".
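The same message can be read off the importance matrix programmatically rather than by eye. A small sketch using the fitted object above; the "above the mean" cutoff is an arbitrary illustrative threshold, not part of the method.

#rank columns by mean decrease in accuracy
imp <- importance(fit.rf_imp)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]

#illustrative selection rule: keep columns with above-average importance
keep <- rownames(imp)[imp[, "MeanDecreaseAccuracy"] > mean(imp[, "MeanDecreaseAccuracy"])]
keep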

The results for the convergence plot are: [plot of OOB error vs. number of trees]

Although somewhat qualitative, it appears that an acceptable level of convergence has been reached by about 500 trees. It is also worth noting that the converged error rate is about 22%, which suggests that the "classification error" within the region of "A" is about 1 in 5.
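Both readings (convergence by roughly 500 trees, and a converged error rate near 22%) can be checked against the fitted object directly rather than off the plot. A small sketch using the same fit.rf_imp:

#OOB error rate (first column) at 500 trees vs. at the final tree
fit.rf_imp$err.rate[c(500, nrow(fit.rf_imp$err.rate)), ]

#OOB confusion matrix with per-class error rates
fit.rf_imp$confusion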

The code for an updated forest, one not including low-importance columns, is:

fit.rf <- randomForest(data[2:3],data$myclass,
                           ntree = 500, replace=TRUE, nodesize = 1,
                           localImp=T )

A plot of actual vs. predicted has excellent accuracy. Code to derive the plot follows:

#predict on all rows using only the retained (important) columns
data2 <- predict(fit.rf, newdata = data[, c(2,3)],
                 type = "response")

#separate class "1" from training data
idx1a <- which(data[,1]==1)

#separate class "1" from the predicted data
idx1b <- which(data2==1)

#separate class "2" from training data
idx2a <- which(data[,1]==2)

#separate class "2" from the predicted data
idx2b <- which(data2==2)

#show the difference in classes before and after RF based filter
#class "B" aka 2, uniform background
plot(data[idx2a,2],data[idx2a,3])
points(data[idx2b,2],data[idx2b,3],col="Blue")

#class "A" aka 1, red spiral
points(data[idx1a,2],data[idx1a,3])                        
points(data[idx1b,2],data[idx1b,3],col="Red",pch=18)

The actual plot follows: [plot of actual vs. predicted classes for both classes]
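To put a number on "excellent accuracy", a confusion table of actual vs. predicted classes can be built from the same objects. A minimal sketch, assuming the data and data2 computed above:

#confusion table for the reduced-column forest on the training rows
conf <- table(actual = data$myclass, predicted = data2)
conf

#overall (in-sample) accuracy
sum(diag(conf)) / sum(conf)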

For a very simple toy problem, a basic randomForest has been used to determine importance of variables, and to attempt to classify "in" versus "out".

I have an older laptop. It is a Dell Latitude E-7440 with an i7-4600 and 16 GB of RAM running Windows 7. You might have something fancier, or something even older than mine. You could have a different OS, R version, or hardware. Your results are likely to differ from mine in absolute scale, but the relative scale should still be informative.

Here is the code I used to benchmark the "variable importance" random forest:

library(microbenchmark)

res1 <- microbenchmark(randomForest(data[2:5],data$myclass,
                                    ntree = 2000, replace=TRUE, nodesize = 1,
                                    localImp=T ),
                       times=100L)
print(res1)

and here is the code I used to benchmark the fit of a random forest to the important variables only:

res2 <- microbenchmark(randomForest(data[2:3],data$myclass,
                                    ntree = 500, replace=TRUE, nodesize = 1,
                                    localImp=T ),
                       times=100L)

print(res2)

The time-result for the variable importance was:

       min       lq     mean  median       uq      max neval
1 9.323244 9.648383 9.967486 9.84808 10.05356 12.12949   100

Over 100 iterations the mean time-to-compute was 9.96 seconds. This is the "time to beat" for "incomparably faster" applied to the toy problem.

The time-result for the reduced model was:

      min       lq     mean   median      uq      max neval
 1.515134 1.598504 1.638809 1.634209 1.67372 2.038021   100

When computed over 100 iterations, the mean time-to-compute was 1.64 seconds. Running on the important columns only, and with a "reasonable" number of trees, reduced the run-time by about 84%.
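The ~84% figure can also be computed directly from the two benchmark objects rather than by hand. A small sketch, assuming res1 and res2 from above; the raw timings are in nanoseconds, so the ratio is unit-free.

#relative speed-up of the reduced fit over the variable-importance fit
speedup <- mean(res1$time) / mean(res2$time)
speedup            #roughly 6x on this machine
1 - 1/speedup      #roughly 0.84, i.e. the ~84% run-time reduction quoted above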

INCOMPLETE.

References:

Awaiting:

  • random Forest on non-toy, with timing. HIVA is not the right data, even though I asked for it. I need an intermediate set.

  • random Forest + t-test solution on toy, with timing

  • random Forest + t-test solution on non-toy, with timing
  • svm solution on toy, with timing
  • svm solution on non-toy, with timing