
Predicting a continuous variable with the "bnlearn" package in R

Cross Validated: r, machine-learning, prediction, bayesian-network

2022-03-30 04:54:11

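I am using the bnlearn package in R to learn the structure of my Bayesian network and its parameters. What I want to do is "predict" the value of a node, given the values of the other nodes as evidence (obviously excluding the node whose value is being predicted).

My variables are continuous.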

library(bnlearn)                       # Load the package in R
data(gaussian.test)
training.set = gaussian.test[1:4000, ] # This is training set to learn the parameters
test.set = gaussian.test[4001:4010, ]  # This is test set to give as evidence
res = hc(training.set)                 # learn BN structure on training set data 
fitted = bn.fit(res, training.set)     # learning of parameters
pred = predict(fitted$C, test.set)     # predicts the value of node C given test set
table(pred, test.set[, "C"])           # compares the predicted value as original

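This code runs fine and produces a table in which the predicted values of node C appear to be exactly the same as the original values of node C in the test set.

I don't understand why. Can someone explain?

I know that the test set data frame I supply already contains the values of node C. But if I pass only the other columns, I get an error. So I tried an alternative: overwriting the C column with 0.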

test.set$C = 0                     # To not give the original value of node C as evidence
pred = predict(fitted$C, test.set) # predicts the value of node C given test set
table(pred, test.set[, "C"])       # compares the predicted value as original

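Is this approach wrong? (Using "NA" is not allowed.)

3 Answers

Why are you using table to compare the outputs? Putting the actual and predicted values side by side with cbind shows that the predictions differ from the actual values, and you can compute standard accuracy metrics to quantify how far apart they are.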

library(bnlearn)                       # Load the package in R
library(forecast)                      # Provides accuracy()

data(gaussian.test)
training.set = gaussian.test[1:4000, ] # This is training set to learn the parameters
test.set = gaussian.test[4001:4010, ]  # This is test set to give as evidence
res = hc(training.set)                 # learn BN structure on training set data 
fitted = bn.fit(res, training.set)     # learning of parameters
pred = predict(fitted, "C", test.set)  # predicts the value of node C given test set
cbind(predicted = pred, actual = test.set[, "C"])  # compare the actual and predicted
accuracy(f = pred, x = test.set[, "C"])

Comparing actual and predicted:

> cbind(predicted = pred, actual = test.set[, "C"])           
       predicted    actual
 [1,]  3.5749952  3.952410
 [2,]  0.7434548  1.443177
 [3,]  5.1731669  5.924198
 [4,] 10.0840800 10.296560
 [5,] 12.3966908 12.268170
 [6,]  9.1834888  9.725431
 [7,]  6.8067145  5.625797
 [8,]  9.9246630  9.597326
 [9,]  5.9426798  6.503896
[10,] 16.0056136 16.037176

Measuring the accuracy of the predictions:

> accuracy(f = pred, x = test.set[, "C"])
                ME      RMSE       MAE      MPE     MAPE
Test set 0.1538594 0.5804431 0.4812143 6.172352 11.26223
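
For readers who would rather not depend on the forecast package, the same metrics can be sketched in base R. The snippet below is only an illustration: it hardcodes the ten predicted and actual values printed above, and it assumes the error is defined as actual minus predicted, which is the convention that reproduces the ME shown in the output. The names `predicted`, `actual`, `err` and `metrics` are made up for this example.

```r
# Values copied from the cbind() output above (illustrative only).
predicted <- c(3.5749952, 0.7434548, 5.1731669, 10.0840800, 12.3966908,
               9.1834888, 6.8067145, 9.9246630, 5.9426798, 16.0056136)
actual    <- c(3.952410, 1.443177, 5.924198, 10.296560, 12.268170,
               9.725431, 5.625797, 9.597326, 6.503896, 16.037176)

# Assumption: error = actual - predicted.
err <- actual - predicted

metrics <- c(
  ME   = mean(err),                      # mean error (bias)
  RMSE = sqrt(mean(err^2)),              # root mean squared error
  MAE  = mean(abs(err)),                 # mean absolute error
  MPE  = mean(100 * err / actual),       # mean percentage error
  MAPE = mean(100 * abs(err) / actual)   # mean absolute percentage error
)
print(round(metrics, 4))
```

A nonzero RMSE and MAE is a quick way to confirm that the predictions are not identical to the observed values, which is something table() cannot show for continuous data.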

For both of the prediction sets you proposed (with the original values and with zeros), I get the same output in R:

[1]  3.5749952  0.7434548  5.1731669 10.0840800 12.3966908  9.1834888  6.8067145
[8]  9.9246630  5.9426798 16.0056136

This shows that the value of C is irrelevant to the prediction. Moreover, the original test.set$C gives you:

[1]  3.952410  1.443177  5.924198 10.296560 12.268170  9.725431  5.625797  9.597326
[9]  6.503896 16.037176

This is substantially different from the predicted output, which leads me to believe that your code is in fact correct.
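
One way to check this yourself is sketched below. It assumes that predict() on a bn.fit object uses its default method, which predicts a node from its parents only, and that for a Gaussian node this amounts to the fitted linear regression on the parents. Under those assumptions the prediction can be rebuilt by hand from coef(), and the content of column C cannot influence it. The names `zeroed`, `beta`, `pa` and `manual` are made up for this example.

```r
library(bnlearn)

data(gaussian.test)
training.set <- gaussian.test[1:4000, ]
test.set     <- gaussian.test[4001:4010, ]
fitted       <- bn.fit(hc(training.set), training.set)

# Predict C once with the real C column and once with C overwritten by 0.
pred.orig <- predict(fitted, "C", test.set)
zeroed    <- test.set
zeroed$C  <- 0
pred.zero <- predict(fitted, "C", zeroed)
print(all.equal(pred.orig, pred.zero))

# Rebuild the prediction from the local distribution of C:
# intercept plus the weighted sum of the parent values.
beta   <- coef(fitted$C)
pa     <- setdiff(names(beta), "(Intercept)")
manual <- beta[["(Intercept)"]] +
  as.matrix(test.set[, pa, drop = FALSE]) %*% beta[pa]
print(all.equal(as.numeric(manual), as.numeric(pred.orig)))
```

If both comparisons report equality, the C column is acting purely as a placeholder that satisfies the expected data frame layout.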

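An equivalent situation arises in the discrete case, where the target variable cannot simply be set to zero. In that case, do the following:

test.set$TARGET <- as.factor(0)
levels(test.set$TARGET) <- c(level1,level2,level3...)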
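
A runnable variant of the same idea is sketched below, using the discrete learning.test data set that ships with bnlearn. Treating node B as the target is an arbitrary choice for illustration, and the placeholder is built with factor(..., levels = ...) rather than by reassigning levels, so this is a variation on the snippet above and not the answerer's exact code. The names `masked` and `pred` are made up for this example.

```r
library(bnlearn)

data(learning.test)
training.set <- learning.test[1:4000, ]
test.set     <- learning.test[4001:4010, ]
fitted       <- bn.fit(hc(training.set), training.set)

# Replace the target column with a constant placeholder that still
# carries the full set of levels seen in training.
masked   <- test.set
masked$B <- factor(rep(levels(training.set$B)[1], nrow(masked)),
                   levels = levels(training.set$B))

pred <- predict(fitted, "B", masked)

# Compare predictions with the held-out values of B.
print(table(predicted = pred, actual = test.set$B))
```

Here table() is appropriate, because both vectors are factors with a small number of levels.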
