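I would like to understand how the Boruta package works. Could you recommend some references on the theoretical aspects of so-called random forests?
Below are two illustrative examples that show why I am intrigued by the Boruta algorithm.
First example: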
> library(Boruta)
> set.seed(666)
> # simulates data
> # y does not depend on x4
> # y depends on x1, x2, x3 only through x3
> x1 <- rnorm(50); x2 <- rnorm(50) ; x3 <- (x1+x2)^2; x4 <- rnorm(50)
> y <- x3+rnorm(50,0.1)
>
> # lm() only indicates x3 is "important"
> summary(lm(y~x1+x2+x3+x4))
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)
Residuals:
    Min      1Q  Median      3Q     Max
-3.0000 -0.7137  0.0352  0.7082  1.7918

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1875     0.1874   1.000    0.323
x1           -0.1541     0.1469  -1.049    0.300
x2            0.1153     0.1949   0.591    0.557
x3            0.9097     0.0501  18.160   <2e-16 ***
x4            0.1263     0.1518   0.832    0.410
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.123 on 45 degrees of freedom
Multiple R-squared: 0.9013, Adjusted R-squared: 0.8925
F-statistic: 102.7 on 4 and 45 DF, p-value: < 2.2e-16
> # Boruta indicates x1, x2, x3 are important
> Boruta(y~x1+x2+x3+x4, maxRuns=500)
Boruta performed 174 randomForest runs in 10.409 secs.
3 attributes confirmed important: x1 x2 x3
1 attributes confirmed unimportant: x4
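To make the contrast in this first example more concrete, here is a minimal base-R sketch (my own addition, not part of Boruta) that reuses the simulation above and compares two linear fits: one on x1 and x2 alone, one on x3 alone. The names r2_linear and r2_x3 are just illustrative. The idea is only to check that x1 and x2 matter through the nonlinear term x3 = (x1+x2)^2, which a purely linear fit on x1 and x2 cannot capture.

```r
# Same simulation as in the first example
set.seed(666)
x1 <- rnorm(50); x2 <- rnorm(50); x3 <- (x1 + x2)^2; x4 <- rnorm(50)
y <- x3 + rnorm(50, 0.1)

# Linear fit using only x1 and x2: the dependence is quadratic,
# so a linear model should explain comparatively little
r2_linear <- summary(lm(y ~ x1 + x2))$r.squared

# Linear fit using only x3: this is the term y was generated from
r2_x3 <- summary(lm(y ~ x3))$r.squared

print(c(linear_in_x1_x2 = r2_linear, x3_only = r2_x3))
```

If the two R-squared values differ as expected, that would be consistent with lm() flagging only x3, while a method that can pick up nonlinear dependence has a reason to keep x1 and x2 as well.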
Second example:
> set.seed(421)
> # simulates data
> # y does not depend on u1
> # y does not depend on u2
> # but y depends on u1+u2
> nsims <- 100
> u1 <- runif(nsims)
> u2 <- runif(nsims)
> x <- (u1+u2)-floor(u1+u2)
> y <- rnorm(nsims, x,.05)
>
> # lm() does not detect any dependence
> summary(fit <- lm(y~u1))
Call:
lm(formula = y ~ u1)
Residuals:
    Min      1Q  Median      3Q     Max
-0.5411 -0.2234  0.0005  0.2197  0.4901

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.55721    0.05902   9.442 1.97e-15 ***
u1          -0.07858    0.09714  -0.809    0.421
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2668 on 98 degrees of freedom
Multiple R-squared: 0.006633, Adjusted R-squared: -0.003503
F-statistic: 0.6544 on 1 and 98 DF, p-value: 0.4205
> summary(fit <- lm(y~u2))
Call:
lm(formula = y ~ u2)
Residuals:
     Min       1Q   Median       3Q      Max
-0.53996 -0.21855  0.01298  0.22406  0.49940

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.47022    0.05124   9.178 7.38e-15 ***
u2           0.09435    0.09298   1.015    0.313
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2663 on 98 degrees of freedom
Multiple R-squared: 0.0104, Adjusted R-squared: 0.000299
F-statistic: 1.03 on 1 and 98 DF, p-value: 0.3127
> summary(fit <- lm(y~u1+u2))
Call:
lm(formula = y ~ u1 + u2)
Residuals:
     Min       1Q   Median       3Q      Max
-0.53400 -0.22071  0.00699  0.21612  0.54375

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.51609    0.06840   7.545 2.45e-11 ***
u1          -0.09978    0.09859  -1.012    0.314
u2           0.11176    0.09455   1.182    0.240
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2663 on 97 degrees of freedom
Multiple R-squared: 0.02074, Adjusted R-squared: 0.0005478
F-statistic: 1.027 on 2 and 97 DF, p-value: 0.3619
>
> # Boruta() does
> Boruta(y~u1)
Boruta performed 44 randomForest runs in 6.328125 secs.
No attributes has been deemed important
1 attributes confirmed unimportant: u1
> Boruta(y~u2)
Boruta performed 20 randomForest runs in 2.8125 secs.
No attributes has been deemed important
1 attributes confirmed unimportant: u2
> Boruta(y~u1+u2)
Boruta performed 48 randomForest runs in 6.796875 secs.
2 attributes confirmed important: u1 u2
No attributes has been deemed unimportant
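For reference, the structure of this second example can also be checked with a few lines of base R. This is only a rough sketch of my own (the names marginal and joint are illustrative, and nothing here uses Boruta): it rebuilds the same simulated data and compares the correlation of y with each of u1 and u2 separately against its correlation with the wrapped sum of the two.

```r
# Same simulation as in the second example
set.seed(421)
nsims <- 100
u1 <- runif(nsims)
u2 <- runif(nsims)
x <- (u1 + u2) - floor(u1 + u2)   # fractional part of u1 + u2
y <- rnorm(nsims, x, .05)

# Correlation of y with each predictor taken alone
marginal <- c(u1 = cor(y, u1), u2 = cor(y, u2))

# Correlation of y with the fractional part of u1 + u2,
# i.e. the quantity y was actually generated from
joint <- cor(y, x)

print(round(marginal, 3))
print(round(joint, 3))
```

Weak marginal correlations together with a strong joint one would match the pattern in the output above: neither u1 nor u2 looks relevant alone, yet the pair carries essentially all the signal.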