在 R 中解释 ctree {partykit} 输出

机器算法验证 r 数据可视化 大车
2022-03-11 07:46:45

数据精度:

  • 报价是一个虚拟变量
  • 分钟数一天内的所有分钟数
  • temp 是温度

这是我的代码:

ctree <- ctree(quotation ~ minute + temp, data = visitquot)
print(ctree)

Fitted party:
[1] root
|   [2] minute <= 600
|   |   [3] minute <= 227
|   |   |   [4] temp <= -0.4259
|   |   |   |   [5] temp <= -2.3174: 0.015 (n = 6254, err = 89.7)
|   |   |   |   [6] temp > -2.3174
|   |   |   |   |   [7] minute <= 68: 0.028 (n = 4562, err = 126.3)
|   |   |   |   |   [8] minute > 68: 0.046 (n = 7100, err = 312.8)
|   |   |   [9] temp > -0.4259
|   |   |   |   [10] temp <= 6.0726: 0.015 (n = 56413, err = 860.5)
|   |   |   |   [11] temp > 6.0726: 0.019 (n = 39779, err = 758.9)
|   |   [12] minute > 227
|   |   |   [13] minute <= 501
|   |   |   |   [14] minute <= 291: 0.013 (n = 30671, err = 388.0)
|   |   |   |   [15] minute > 291: 0.009 (n = 559646, err = 5009.3)
|   |   |   [16] minute > 501
|   |   |   |   [17] temp <= 5.2105
|   |   |   |   |   [18] temp <= -1.8393: 0.009 (n = 66326, err = 617.1)
|   |   |   |   |   [19] temp > -1.8393: 0.012 (n = 355986, err = 4289.0)
|   |   |   |   [20] temp > 5.2105
|   |   |   |   |   [21] temp <= 13.6927: 0.014 (n = 287909, err = 3900.7)
|   |   |   |   |   [22] temp > 13.6927
|   |   |   |   |   |   [23] temp <= 14: 0.035 (n = 2769, err = 92.7)
|   |   |   |   |   |   [24] temp > 14: 0.007 (n = 2161, err = 15.9)
|   [25] minute > 600
|   |   [26] temp <= 1.6418
|   |   |   [27] temp <= -2.3366: 0.012 (n = 110810, err = 1268.1)
|   |   |   [28] temp > -2.3366: 0.014 (n = 584457, err = 7973.2)
|   |   [29] temp > 1.6418: 0.016 (n = 3753208, err = 57864.3)

然后我绘制了树:

plot(ctree, type = "simple")

这是输出的一部分:

在此处输入图像描述

我的问题是:

  1. 在第一个输出中print(ctree),让我们看最后一行[29] temp > 1.6418: 0.016 (n = 3753208, err = 57864.3)价值是什么0.016意思?那是p值吗?是什么err = 57864.3意思?它不能是归因错误的计数,因为它是一个浮点数。
  2. 我在灰色方块中找不到任何类似的输出。如果有人知道如何解释它。p 值怎么可能是负数?

这里相同的代码但不同的输出

3个回答

因此0.016,您看到的只是 SSE 的quotation平均值err

我是老派,不知道如何使用新partykit包准确显示这一点(也许@Achim 可以说明),但我可以向您展示如何使用旧party包完成此操作。

所以让我们创建一些可重现的例子

library(partykit)
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq)
print(airct)
# Model formula:
#   Ozone ~ Solar.R + Wind + Temp + Month + Day
# 
# Fitted party:
#   [1] root
# |   [2] Temp <= 82
# |   |   [3] Wind <= 6.9: 55.600 (n = 10, err = 21946.4)
# |   |   [4] Wind > 6.9
# |   |   |   [5] Temp <= 77: 18.479 (n = 48, err = 3956.0)
# |   |   |   [6] Temp > 77: 31.143 (n = 21, err = 4620.6)
# |   [7] Temp > 82
# |   |   [8] Wind <= 10.3: 81.633 (n = 30, err = 15119.0)
# |   |   [9] Wind > 10.3: 48.714 (n = 7, err = 1183.4)
# 
# Number of inner nodes:    4
# Number of terminal nodes: 5

现在让我们使用和计算相同的值dtachpartykit封装和拟合相同的树party

detach("package:partykit", unload=TRUE)
library(party)
airct <- party::ctree(Ozone ~ ., data = airq)

t(sapply(unique(where(airct)), function(x) {
  n <- nodes(airct, x)[[1]]
  Ozone <- airq[as.logical(n$weights), "Ozone"]
  cbind.data.frame("Node" = as.integer(x), 
                   "n" = length(Ozone), 
                   "Avg."= mean(Ozone), 
                   "Variance"= var(Ozone), 
                   "SSE" = sum((Ozone - mean(Ozone))^2))  
}))

#      Node n  Avg.     Variance SSE     
# [1,] 5    48 18.47917 84.16977 3955.979
# [2,] 3    10 55.6     2438.489 21946.4 
# [3,] 6    21 31.14286 231.0286 4620.571
# [4,] 9    7  48.71429 197.2381 1183.429
# [5,] 8    30 81.63333 521.3437 15118.97

我认为从输出中很容易看出哪个是


上面的代码所做的基本上是Ozone从每个内部节点的拟合树中提取信息信息并计算相关统计信息


根据P-values,这似乎是某种类型的打印错误,这里有一个示例如何计算P-value内部节点中的 s

terNodes <- unique(where(airct))
setdiff(1:max(terNodes), terNodes)

sapply(setdiff(1:max(terNodes), terNodes), function(x) {
          n <- nodes(airct, x)[[1]]
          pvalue <- 1 - nodes(airct, x)[[1]]$criterion$maxcriterion
          plab <- ifelse(pvalue < 10^(-3),
                         paste("p <", 10^(-3)),
                         paste("p =", round(pvalue, digits = 3)))
          c("Node" = x, "P-value" = plab)
})

#         [,1]        [,2]        [,3]        [,4]       
# Node    "1"         "2"         "4"         "7"        
# P-value "p < 0.001" "p = 0.002" "p = 0.003" "p = 0.003"

作为旁注,如果quotation是一个虚拟变量,您可能应该适合分类树(即将其转换为因子)而不是像您正在做的回归树

正如@DavidArenburg 所建议的那样,我在另一个答案中收集我的评论,以便于参考。

(1) 如果你的响应是二元的,你需要把它变成一个因子。否则,在生长树时使用的推理不是它应该是的,而且预测、可视化和错误测量也不是用于分类的那些。有关更多示例,请参阅 @DavidArenburg 和 @AntoniosK 的回复。一般来说:解释变量也需要有适当的类(数值、因子、有序因子)才能在生长树时正确处理。

(2)plot(..., type = "simple")当前无法按预期工作 - 换句话说,这是一个错误。我们将partykit在适当的时候修复包裹。目前,您可以使用plot(as.simpleparty(...)). 作为一个可重现的例子:

library("partykit")
data("PimaIndiansDiabetes", package = "mlbench")
ct <- ctree(diabetes ~ ., data = PimaIndiansDiabetes)
plot(as.simpleparty(ct))

简单派对

(3)$criterion图中当前错误显示的表格type = "simple"包含测试统计量和在每个节点中进行的条件推理的相应log 1-p 值。使用对数而不是 p 值是因为它在用于比较、计算最小值等时在数值上更加稳定。请注意,当 p 值显着时,p 值可能会变得非常小。例如,考虑对应于上面树中第一个节点的测试结果:

nodeapply(ct, ids = 1, function(n) info_node(n)$criterion)
## $`1`
##                pregnant       glucose   pressure   triceps      insulin
## statistic  3.631413e+00  5.117841e+00  1.1778530  1.455334  2.570457503
## p.value   -6.380290e-09 -2.710725e-37 -0.5937987 -0.313498 -0.002398554
##                    mass      pedigree           age
## statistic  4.185236e+00  3.143294e+00  3.774507e+00
## p.value   -4.181295e-15 -1.180135e-05 -3.262459e-10

(4) 为了提取实际的 p 值(无对数),有一个提取器函数sctest()(用于结构变化测试),该函数被strucchange包使用,也用于mob()树中的参数不稳定性测试。在上面的例子中:

library("strucchange")
sctest(ct, node = 1)
##               pregnant  glucose  pressure   triceps     insulin         mass
## statistic 3.776615e+01 166.9745 3.2473947 4.2859164 13.07180346 6.570902e+01
## p.value   6.380290e-09   0.0000 0.4477744 0.2691142  0.00239568 4.218847e-15
##               pedigree          age
## statistic 2.318009e+01 4.357601e+01
## p.value   1.180128e-05 3.262459e-10

请注意,一个 p 值(葡萄糖)变为零,而它的 log(1 - p) 非常接近于零但不完全为零。要仅从所选分区变量(如果有)中提取最小 p 值,您可以再次使用该nodeapply()函数从每个节点的 中获取它$info

nodeapply(ct, ids = nodeids(ct), function(n) info_node(n)$p.value)
## $`1`
## glucose 
##       0 
## 
## $`2`
##          age 
## 6.048661e-07 
## 
## $`3`
##        mass 
## 0.001169778 
## 
## ...

(5) 如果要汇总每个终端节点中的响应,最简单的解决方案 (恕我直言) 是tapply()按节点组对响应变量进行简单处理,提供您想要的任何汇总函数。为了更简单的概述,您也可以rbind()这样等。例如:

tab <- tapply(PimaIndiansDiabetes$diabetes, predict(ct, type = "node"),
  function(y) c("n" = length(y), 100 * prop.table(table(y))))
do.call("rbind", tab)
##      n      neg        pos
## 5  144 99.30556  0.6944444
## 6    7 85.71429 14.2857143
## 7  120 82.50000 17.5000000
## 8  214 66.82243 33.1775701
## 11  53 79.24528 20.7547170
## 12 108 39.81481 60.1851852
## 13 122 19.67213 80.3278689

作为@DavidArenburg 出色答案的补充,我将向您展示回归树和分类树之间的输出差异

library(partykit)

# data
airquality = data.frame(airquality)

# create a numeric binary variable as dependent variable
airquality$OzoneClass = 0
airquality$OzoneClass[airquality$Ozone>=34] =1

# regression tree with scale dependent variable
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(OzoneClass ~ Temp, data = airq)
print(airct)

# Model formula:
#   OzoneClass ~ Temp
# 
# Fitted party:
#   [1] root
# |   [2] Temp <= 82
# |   |   [3] Temp <= 77: 0.096 (n = 52, err = 4.5)
# |   |   [4] Temp > 77: 0.519 (n = 27, err = 6.7)
# |   [5] Temp > 82: 0.973 (n = 37, err = 1.0)
# 
# Number of inner nodes:    2
# Number of terminal nodes: 3




# create a categorical binary variable as dependent variable
airquality$OzoneClass = 0
airquality$OzoneClass[airquality$Ozone>=34] =1
airquality$OzoneClass = as.factor(airquality$OzoneClass)

# classification tree
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(OzoneClass ~ Temp, data = airq)
print(airct)

# Model formula:
#   OzoneClass ~ Temp
# 
# Fitted party:
#   [1] root
# |   [2] Temp <= 82
# |   |   [3] Temp <= 77: 0 (n = 52, err = 9.6%)
# |   |   [4] Temp > 77: 1 (n = 27, err = 48.1%)
# |   [5] Temp > 82: 1 (n = 37, err = 2.7%)
# 
# Number of inner nodes:    2
# Number of terminal nodes: 3

请注意(a)树的值如何从 0 和 1(回归树)的平均值变为最可能的 0/1 类(分类树),以及(b)err来自数字(错误)的值变成百分比(错误分类)。