Fitting and cross-validating categorical sample data formed from observations

Tags: regression, logistic, categorical-data, cross-validation, cart

2022-03-25 17:06:27

I am working with the following sample data set:

t1  t2  ntrial  nsuccess
 1       4    1000       4
 1       8    1000       8
 2       4    1000       4
 2       8    1000       8

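The predictors (features) t1 and t2 are categorical: t1 has categories 1 and 2, and t2 has categories 4 and 8. The categories are unordered. For each combination of t1 and t2, what I actually observe is a series of successes and failures. For example, in the first row, (t1, t2) = (1, 4), I recorded 4 successes out of 1000 observations.

So the data set really has 4000 rows (the so-called unrolled binary data), which can be compressed for the purposes of logistic regression. Unrolling the data is clearly memory-inefficient, because the unrolled data looks like this: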

t1  t2  success
 1  4    1
 1  4    1
 1  4    1
 1  4    1
 1  4    0
  ... 996 zeros
 1  4    0
 1  8    1
 1  8    1
 ... and so on

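In this toy example I have only 1000 observations per row, so unrolling the data is possible. My real data set, however, has 10^9 observations and many features/predictors, so unrolling it is not feasible.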
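As a small check of the point that the compressed form is enough for logistic regression, the sketch below fits a binomial GLM to the four aggregated rows and to the 4000 unrolled rows of the toy data and compares the coefficients. The names `agg`, `unrolled`, `fit_agg` and `fit_unrolled` are illustrative and do not appear in the original post; the unrolled frame is built only because the toy data is small enough to allow it.

```r
# Aggregated (compressed) form of the toy data set from the question.
agg <- data.frame(
  t1 = factor(c(1, 1, 2, 2)),
  t2 = factor(c(4, 8, 4, 8)),
  ntrial = c(1000, 1000, 1000, 1000),
  nsuccess = c(4, 8, 4, 8)
)
agg$nfail <- agg$ntrial - agg$nsuccess

# Grouped binomial fit: the two-column response already carries the counts,
# so no separate weights argument is passed here.
fit_agg <- glm(cbind(nsuccess, nfail) ~ t1 + t2, family = binomial, data = agg)

# Unrolled form, built here only to verify the equivalence on the toy data.
idx <- rep(seq_len(nrow(agg)), times = agg$ntrial)
unrolled <- agg[idx, c("t1", "t2")]
unrolled$success <- unlist(Map(function(s, f) c(rep(1, s), rep(0, f)),
                               agg$nsuccess, agg$nfail))
fit_unrolled <- glm(success ~ t1 + t2, family = binomial, data = unrolled)

# Compare the coefficient estimates up to a numerical tolerance.
print(all.equal(coef(fit_agg), coef(fit_unrolled), tolerance = 1e-4))
```

The two fits maximise the same likelihood up to a constant, which is why the grouped form can stand in for the unrolled one when estimating coefficients.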

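I would like to know:

1. Is there an algorithm (in R, Java, or similar) that can operate directly on this data set, without my having to unroll the rows? The algorithm must be reasonably fast and easy to train and to predict with.
2. Is there a cross-validation routine that can also work on this data set without unrolling the rows?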

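Here is some starter R code that I have tried so far. I fit a weighted logistic regression (the more observations we have, the lower the variance of the predicted quantity, so we have a weighted regression on our hands) and a regression tree.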

library(boot)
set.seed(1)
input_file = 'data\\test\\test.txt'
# number of cross-validation folds
K = 10
input <- 
  read.csv(input_file, header=TRUE, sep='\t', quote="", 
           colClasses=c(rep('factor', 2), 'numeric', 'numeric'))
# change the contents of the frame
input$nfail = pmax(0, input$ntrial - input$nsuccess)
# compute the probability of success
input$prob = input$nsuccess/input$ntrial

# fit the main model
glm.model =  
  glm(cbind(input$nsuccess, input$nfail) ~ 
        input$t1 + input$t2, family = binomial, weights = input$ntrial)

cost <- 
  function(y,yhat) sum(input$ntrial*((y-yhat)^2))/sum(input$ntrial)
# this produces an error, cv.glm doesn't realize rows have to be unrolled
cv.glm(input, glm.model, cost = cost, K=K)$delta[1]

library(tree)
control = tree.control(sum(input$ntrial))
tree.model <- 
  tree(input$prob ~ input$t1 + input$t2, 
       weights = input$ntrial, control = control)
plot(tree.model)
text(tree.model, pretty=0)
# this is too big to fit in memory, R fails
cv.tree(tree.model, K = K)

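For example, if I wanted to perform 10-fold cross-validation, I would have to build each fold by sampling at random from the 4000 rows, drawing a 1 or a 0 (success or failure) each time, and then, for each fold, aggregate the numbers of successes and failures for each particular combination of t1 and t2.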
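The fold construction described above can be sketched without materialising the 4000 rows. The sketch assumes that each unrolled observation is assigned to one of K folds independently and uniformly at random; under that assumption the per-fold counts of each cell follow a multinomial distribution, so they can be drawn directly. Fold sizes then vary slightly instead of being exactly equal. The helper `split_counts` and the variable names are hypothetical, and the squared-error cost is only one possible choice.

```r
set.seed(1)
K <- 10

agg <- data.frame(
  t1 = factor(c(1, 1, 2, 2)),
  t2 = factor(c(4, 8, 4, 8)),
  nsuccess = c(4, 8, 4, 8),
  nfail = c(996, 992, 996, 992)
)

# Hypothetical helper: split each count in n across K folds at random.
# Returns a length(n) x K matrix whose row i sums to n[i].
split_counts <- function(n, K) {
  t(sapply(n, function(m) rmultinom(1, size = m, prob = rep(1 / K, K))[, 1]))
}
succ_fold <- split_counts(agg$nsuccess, K)
fail_fold <- split_counts(agg$nfail, K)

cv_err <- numeric(K)
for (k in seq_len(K)) {
  # Training counts are the totals minus the counts held out in fold k.
  train <- agg
  train$nsuccess <- agg$nsuccess - succ_fold[, k]
  train$nfail <- agg$nfail - fail_fold[, k]
  fit <- glm(cbind(nsuccess, nfail) ~ t1 + t2, family = binomial, data = train)
  p <- predict(fit, newdata = agg, type = "response")
  # Squared error over the held-out 0/1 observations of each cell:
  # each success contributes (1 - p)^2 and each failure contributes p^2.
  sse <- sum(succ_fold[, k] * (1 - p)^2 + fail_fold[, k] * p^2)
  cv_err[k] <- sse / sum(succ_fold[, k] + fail_fold[, k])
}

# Unweighted mean of the per-fold errors.
print(mean(cv_err))
```

Memory use depends on the number of distinct (t1, t2) cells and on K, not on the number of underlying observations.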

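1 Answer

One possible algorithm is to use a hash map keyed by the combinations of t1 and t2.

An approach I use often (and which is relatively simple) is to use the string t1_t2 as the key of a map to the number of successes.

So, for your example, the map would look like this:

Hash map: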

Key        Value (successes)

1_4        4
1_8        8
2_4        4
2_8        8

... etc.

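The map will contain each observed (t1, t2) combination only once, and the value will be the total number of successes for that combination.

Every other combination that is not present in the map can be assumed to have a value (number of successes) of zero.

This can be very efficient. In fact, the map can be built in a single pass over the data (in any language that supports hash maps).

The map can then be used for any further computation on the data.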
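As a rough sketch of the single-pass construction in R, an environment can play the role of the hash map. The six observations in `obs` are made up purely for illustration, and `get_successes` is a hypothetical helper, not something from the original answer.

```r
# A few (t1, t2, success) observations standing in for a stream of raw rows.
# These values are invented for the example.
obs <- data.frame(
  t1 = c(1, 1, 2, 1, 2, 2),
  t2 = c(4, 8, 4, 4, 8, 8),
  success = c(1, 0, 1, 1, 0, 1)
)

# Hypothetical helper: absent keys are treated as zero successes.
get_successes <- function(map, key) {
  if (exists(key, envir = map, inherits = FALSE)) map[[key]] else 0
}

# An R environment behaves like a hash map keyed by strings.
map <- new.env(hash = TRUE)

# Single pass over the data: only successes are accumulated.
for (i in seq_len(nrow(obs))) {
  if (obs$success[i] == 1) {
    key <- paste(obs$t1[i], obs$t2[i], sep = "_")
    map[[key]] <- get_successes(map, key) + 1
  }
}

print(get_successes(map, "1_4"))
print(get_successes(map, "1_8"))
```

In this made-up sample the combination 1_8 has no successes, so it never enters the map and is read back as zero, which matches the convention described above.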

Update:

To compute a pdf from the data using the hash map, one can do something like the following (for illustration purposes only):

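# Illustration only. `map` is assumed to hold success counts under "t1_t2" keys.
t1_levels <- c("1", "2")  # use the categories for t1 here
t2_levels <- c("4", "8")  # use the categories for t2 here
total <- sum(unlist(as.list(map)))  # total successes over all stored keys
pdf <- list()
for (t1 in t1_levels) {
  for (t2 in t2_levels) {
    key <- paste(t1, t2, sep = "_")
    # combinations missing from the map count as zero successes
    successes <- if (is.null(map[[key]])) 0 else map[[key]]
    pdf[[key]] <- successes / total
  }
}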

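Update 2:

If you also want to store the number of trials in the map, you can do so as follows (the value is now an array rather than a single number):

Hash map: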

Key        Value ([successes, ntrials])

1_4        [4, 1000]
1_8        [8, 999]
2_4        [4, 100]
2_8        [8, 1000]

... etc.
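One way such a map could feed back into a model fit is sketched below: the keys are split back into the two predictors and the stored pairs become the success and trial counts of a grouped binomial GLM. The map is written as a named R list with the values from the table above; the names `parts`, `counts` and `agg` are assumptions made for this example.

```r
# Map from "t1_t2" keys to c(successes, ntrials), using the values listed above.
map <- list(
  "1_4" = c(4, 1000),
  "1_8" = c(8, 999),
  "2_4" = c(4, 100),
  "2_8" = c(8, 1000)
)

# Split the keys back into the two categorical predictors.
parts <- do.call(rbind, strsplit(names(map), "_", fixed = TRUE))
# Stack the stored pairs into a matrix: column 1 successes, column 2 trials.
counts <- do.call(rbind, map)

agg <- data.frame(
  t1 = factor(parts[, 1]),
  t2 = factor(parts[, 2]),
  nsuccess = counts[, 1],
  ntrial = counts[, 2]
)
agg$nfail <- agg$ntrial - agg$nsuccess
agg$rate <- agg$nsuccess / agg$ntrial

# Grouped binomial fit on the aggregated counts; no unrolling is needed.
fit <- glm(cbind(nsuccess, nfail) ~ t1 + t2, family = binomial, data = agg)
print(agg)
print(coef(fit))
```

The resulting frame has one row per stored key, so its size is governed by the number of distinct combinations and not by the number of trials.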
