When should we discretize/bin continuous independent variables/features, and when should we not?


Tags: machine-learning, continuous-data, feature-engineering, binning

2022-01-23 05:07:44

When should we discretize/bin independent variables/features, and when should we not?

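My attempts to answer the question:

In general, we should not bin, because binning loses information.
Binning actually increases the degrees of freedom of the model, so it can cause overfitting. If we have a "high bias" model, binning may not be bad, but if we have a "high variance" model, we should avoid binning.
It depends on the model we are using. If it is a linear model and the data has a lot of "outliers", binning is probably better. If we have a tree model, then outliers and binning will make a big difference.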
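The degrees-of-freedom point above can be checked with a small sketch. The simulated data, the sample size, and the choice of 10 equal-width bins are illustrative assumptions, not part of the question.

```r
# Assumed toy data: a linear signal plus Gaussian noise.
set.seed(1)
n <- 200
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.5)

# Continuous predictor: intercept + slope.
fit_linear <- lm(y ~ x)

# Binned predictor: intercept + one indicator per additional bin.
fit_binned <- lm(y ~ cut(x, breaks = 10))

n_params <- c(linear = length(coef(fit_linear)),
              binned = length(coef(fit_binned)))
resid_df <- c(linear = df.residual(fit_linear),
              binned = df.residual(fit_binned))
print(n_params)
print(resid_df)
```

With 10 bins the binned model estimates 10 coefficients where the linear model estimates 2, so it has more room to chase noise.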

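Am I right? What else?

I think this question must have been asked many times, but I can only find it in these posts:

Should we bin continuous variables?

What is the benefit of breaking up a continuous predictor variable?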

2 Answers

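Aggregation is substantively meaningful (whether or not the researcher is aware of that).

One should bin data, including independent variables, based on the data itself when one wants:

To hemorrhage statistical power.
To bias measures of association.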

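A literature starting, I believe, with Gehlke and Biehl (1934; definitely worth a read, and suggestive of some simple enough computer simulations that one can run for oneself), and continuing especially in the "modifiable areal unit problem" literature (Openshaw, 1983; Dudley, 1991; Lee and Kemp, 2000), makes both of these points clear.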
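A simulation of the kind alluded to above might look like the following sketch. It is not taken from the cited papers: the effect sizes, sample sizes, median split, and the grouping rule (sorting on x) are all assumptions chosen for illustration.

```r
set.seed(1)

# Point 1: dichotomizing a continuous predictor loses power.
# Estimate the rejection rate at the 5% level for both codings of x.
power_sim <- function(n_obs = 50, slope = 0.4, n_rep = 500) {
  hits <- replicate(n_rep, {
    x <- rnorm(n_obs)
    y <- slope * x + rnorm(n_obs)
    x_bin <- as.numeric(x > median(x))  # median split
    p_cont <- summary(lm(y ~ x))$coefficients[2, 4]
    p_bin <- summary(lm(y ~ x_bin))$coefficients[2, 4]
    c(continuous = p_cont < 0.05, binned = p_bin < 0.05)
  })
  rowMeans(hits)
}
power_est <- power_sim()
print(power_est)

# Point 2: aggregation changes a measure of association.
# Correlate group means after forming k equal-count groups sorted on x.
n <- 10000
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)
r_individual <- cor(x, y)

cor_of_group_means <- function(x, y, k) {
  groups <- cut(rank(x, ties.method = "first"), breaks = k, labels = FALSE)
  cor(tapply(x, groups, mean), tapply(y, groups, mean))
}
ks <- c(1000, 100, 10)
r_grouped <- sapply(ks, function(k) cor_of_group_means(x, y, k))
print(data.frame(units = c(n, ks),
                 correlation = c(r_individual, r_grouped)))
```

In this setup the correlation between group means grows as the units become coarser, even though the individual-level relationship never changes.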

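Unless one has an a priori theory of the scale of aggregation (how many units to aggregate to) and of the categorization function of aggregation (which individual observations end up in which aggregate units), one should not aggregate. For example, in epidemiology we care about the health of individuals and about the health of populations. The latter are not simply random collections of the former, but are defined by, for example, geopolitical boundaries, social circumstances such as race-ethnic categorization, carceral status, and history categories. (See, for example, Krieger, 2012.)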

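References
Dudley, G. (1991). Scale, aggregation, and the modifiable areal unit problem. The Operational Geographer, 9(3):28–33.

Gehlke, C. E. and Biehl, K. (1934). Certain effects of grouping upon the size of the correlation coefficient in census tract material. Journal of the American Statistical Association, 29(185):169–170.

Krieger, N. (2012). Who and what is a "population"? Historical debates, current controversies, and implications for understanding "population health" and rectifying health inequities. The Milbank Quarterly, 90(4):634–681.

Lee, H. T. K. and Kemp, Z. (2000). Hierarchical reasoning and on-line analytical processing in spatial and temporal data. In Proceedings of the 9th International Symposium on Spatial Data Handling, Beijing, P.R. China. International Geographic Union.

Openshaw, S. (1983). The modifiable areal unit problem. Concepts and Techniques in Modern Geography. Geo Books, Norwich, UK.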

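It looks like you are also looking for an answer from a predictive standpoint, so I put together a short demonstration of two approaches in R:

Binning a variable into equal-sized factors.
Natural cubic splines.

Below, I give the code for a function that automatically compares the two methods for any given true signal function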

test_cuts_vs_splines <- function(signal, N, noise,
                                 range=c(0, 1), 
                                 max_parameters=50,
                                 seed=154)

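This function creates noisy training and testing datasets from a given signal, and then fits a series of linear regressions of two types to the training data

The cuts model includes binned predictors, formed by segmenting the range of the data into equal-sized half-open intervals and then creating binary predictors that indicate which interval each training point belongs to.
The splines model includes a natural cubic spline basis expansion, with knots equally spaced throughout the range of the predictor.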
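As a rough illustration of what these two feature sets look like, here is a minimal sketch that builds both design matrices for a small sample. The sample size, the choice of 4 bins, and the explicit `Boundary.knots` are assumptions made for the example.

```r
library(splines)

set.seed(154)
x <- runif(20)

n_bins <- 4
breaks <- seq(0, 1, length.out = n_bins + 1)

# Binned predictor: half-open intervals (0,0.25], (0.25,0.5], ...
x_binned <- cut(x, breaks)
X_cuts <- model.matrix(~ x_binned)   # intercept + (n_bins - 1) indicators

# Natural cubic spline basis with knots at the interior bin edges.
knots <- breaks[2:n_bins]
X_splines <- model.matrix(~ ns(x, knots = knots, Boundary.knots = c(0, 1)))

print(dim(X_cuts))
print(dim(X_splines))
```

With interior knots placed at the bin edges, the spline design matrix has one more column than the binned one, so the two models are close in size but not identical.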

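The arguments are

signal: A one-variable function representing the truth to be estimated.
N: The number of samples to include in both the training and testing data.
noise: The amount of random Gaussian noise to add to the training and testing signal.
range: The range of the training and testing x data, which is generated uniformly within this range.
max_parameters: The maximum number of parameters to estimate in a model. This is both the maximum number of segments in the cuts model and the maximum number of knots in the splines model.

Note that the number of parameters estimated in the splines model is roughly the same as the number of knots (within a parameter or two), so the two models are compared on a nearly equal footing.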

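The object returned from the function has a few components

signal_plot: A plot of the signal function.
data_plot: A scatter plot of the training and testing data.
errors_comparison_plot: A plot showing how the average squared error of both models evolves over a range of numbers of estimated parameters.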

I'll demonstrate with two signal functions. The first is a sine wave with an increasing linear trend superimposed

true_signal_sin <- function(x) {
  x + 1.5*sin(3*2*pi*x)
}

obj <- test_cuts_vs_splines(true_signal_sin, 250, 1)

Here is how the error rates evolve

The second example is a nutty function that I keep around for just this kind of thing; plot it and see

true_signal_weird <- function(x) {
  x*x*x*(x-1) + 2*(1/(1+exp(-.5*(x-.5)))) - 3.5*(x > .2)*(x < .5)*(x - .2)*(x - .5)
}

obj <- test_cuts_vs_splines(true_signal_weird, 250, .05)
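To look at that function on its own, a quick base-graphics sketch is enough. The grid of 200 points on [0, 1] and the axis labels are arbitrary choices for the example.

```r
# Same definition as above, repeated so the snippet runs on its own.
true_signal_weird <- function(x) {
  x*x*x*(x-1) + 2*(1/(1+exp(-.5*(x-.5)))) - 3.5*(x > .2)*(x < .5)*(x - .2)*(x - .5)
}

# Evaluate on an evenly spaced grid over the default range and draw a line.
x_grid <- seq(0, 1, length.out = 200)
y_grid <- true_signal_weird(x_grid)
plot(x_grid, y_grid, type = "l",
     xlab = "x", ylab = "signal(x)", main = "true_signal_weird")
```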

And for fun, here is a boring linear function

obj <- test_cuts_vs_splines(function(x) {x}, 250, .2)

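You can see that:

Splines give better overall test performance when the model complexity is properly tuned for both.
Splines give optimal test performance with far fewer estimated parameters.
Overall, the performance of splines is much more stable as the number of estimated parameters varies.

So splines are always to be preferred from a predictive standpoint.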
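In practice, "properly tuned" means choosing the spline complexity from the data. The comparison here uses a held-out test set with equally spaced knots; the sketch below swaps in 5-fold cross-validation over the `df` argument of `ns()`, which places knots at quantiles instead. The fold count, the candidate range 3:20, and the simulated data (which reuses the sine signal from the first example) are assumptions.

```r
library(splines)

set.seed(154)
n <- 250
x <- runif(n)
y <- x + 1.5 * sin(3 * 2 * pi * x) + rnorm(n, 0, 1)
dat <- data.frame(x = x, y = y)

# Randomly assign each observation to one of 5 folds.
n_folds <- 5
folds <- sample(rep(1:n_folds, length.out = n))
dfs <- 3:20

# Mean held-out squared error for each candidate df.
cv_mse <- sapply(dfs, function(d) {
  mean(sapply(1:n_folds, function(k) {
    train <- dat[folds != k, ]
    test <- dat[folds == k, ]
    fit <- lm(y ~ ns(x, df = d, Boundary.knots = c(0, 1)), data = train)
    mean((test$y - predict(fit, test))^2)
  }))
})

best_df <- dfs[which.min(cv_mse)]
print(data.frame(df = dfs, cv_mse = round(cv_mse, 3)))
print(best_df)
```

The same loop works for the binned model by replacing the `ns()` term with a `cut()` on fixed break points.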

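Code

Here is the code I used to produce these comparisons. I have wrapped it all in a function so that you can try it out with your own signal functions. You will need to load the ggplot2 and splines R libraries.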

library(ggplot2)
library(splines)

test_cuts_vs_splines <- function(signal, N, noise,
                                 range=c(0, 1), 
                                 max_parameters=50,
                                 seed=154) {

  if(max_parameters < 8) {
    stop("Please pass max_parameters >= 8, otherwise the plots look kinda bad.")
  }

  out_obj <- list()

  set.seed(seed)

  x_train <- runif(N, range[1], range[2])
  x_test <- runif(N, range[1], range[2])

  y_train <- signal(x_train) + rnorm(N, 0, noise)
  y_test <- signal(x_test) + rnorm(N, 0, noise)

  # A plot of the true signals
  df <- data.frame(
    x = seq(range[1], range[2], length.out = 100)
  )
  df$y <- signal(df$x)
  out_obj$signal_plot <- ggplot(data = df) +
    geom_line(aes(x = x, y = y)) +
    labs(title = "True Signal")

  # A plot of the training and testing data
  df <- data.frame(
    x = c(x_train, x_test),
    y = c(y_train, y_test),
    id = c(rep("train", N), rep("test", N))
  )
  out_obj$data_plot <- ggplot(data = df) + 
    geom_point(aes(x=x, y=y)) + 
    facet_wrap(~ id) +
    labs(title = "Training and Testing Data")

  #----- lm with various groupings -------------   
  models_with_groupings <- list()
  # One error per model size in 3:max_parameters
  train_errors_cuts <- numeric(max_parameters - 2)
  test_errors_cuts <- numeric(max_parameters - 2)

  for (n_groups in 3:max_parameters) {
    cut_points <- seq(range[1], range[2], length.out = n_groups + 1)
    x_train_factor <- cut(x_train, cut_points)
    factor_train_data <- data.frame(x = x_train_factor, y = y_train)
    models_with_groupings[[n_groups]] <- lm(y ~ x, data = factor_train_data)

    # Training error rate
    train_preds <- predict(models_with_groupings[[n_groups]], factor_train_data)
    soses <- (1/N) * sum( (y_train - train_preds)**2)
    train_errors_cuts[n_groups - 2] <- soses

    # Testing error rate
    x_test_factor <- cut(x_test, cut_points)
    factor_test_data <- data.frame(x = x_test_factor, y = y_test)
    test_preds <- predict(models_with_groupings[[n_groups]], factor_test_data)
    soses <- (1/N) * sum( (y_test - test_preds)**2)
    test_errors_cuts[n_groups - 2] <- soses
  }

  # We are overfitting
  error_df_cuts <- data.frame(
    x = rep(3:max_parameters, 2),
    e = c(train_errors_cuts, test_errors_cuts),
    id = c(rep("train", length(train_errors_cuts)),
           rep("test", length(test_errors_cuts))),
    type = "cuts"
  )
  out_obj$errors_cuts_plot <- ggplot(data = error_df_cuts) +
    geom_line(aes(x = x, y = e)) +
    facet_wrap(~ id) +
    labs(title = "Error Rates with Grouping Transformations",
         x = ("Number of Estimated Parameters"),
         y = ("Average Squared Error"))

  #----- lm with natural splines -------------  
  models_with_splines <- list()
  # One error per model size in 3:max_parameters
  train_errors_splines <- numeric(max_parameters - 2)
  test_errors_splines <- numeric(max_parameters - 2)

  for (deg_freedom in 3:max_parameters) {
    knots <- seq(range[1], range[2], length.out = deg_freedom + 1)[2:deg_freedom]

    train_data <- data.frame(x = x_train, y = y_train)
    models_with_splines[[deg_freedom]] <- lm(y ~ ns(x, knots=knots), data = train_data)

    # Training error rate
    train_preds <- predict(models_with_splines[[deg_freedom]], train_data)
    soses <- (1/N) * sum( (y_train - train_preds)**2)
    train_errors_splines[deg_freedom - 2] <- soses

    # Testing error rate
    test_data <- data.frame(x = x_test, y = y_test)  
    test_preds <- predict(models_with_splines[[deg_freedom]], test_data)
    soses <- (1/N) * sum( (y_test - test_preds)**2)
    test_errors_splines[deg_freedom - 2] <- soses
  }

  error_df_splines <- data.frame(
    x = rep(3:max_parameters, 2),
    e = c(train_errors_splines, test_errors_splines),
    id = c(rep("train", length(train_errors_splines)),
           rep("test", length(test_errors_splines))),
    type = "splines"
  )
  out_obj$errors_splines_plot <- ggplot(data = error_df_splines) +
    geom_line(aes(x = x, y = e)) +
    facet_wrap(~ id) +
    labs(title = "Error Rates with Natural Cubic Spline Transformations",
         x = ("Number of Estimated Parameters"),
         y = ("Average Squared Error"))


  error_df <- rbind(error_df_cuts, error_df_splines)
  out_obj$error_df <- error_df

  # The training error for the first cut model is always an outlier, and
  # messes up the y range of the plots.
  y_lower_bound <- min(c(train_errors_cuts, train_errors_splines))
  y_upper_bound = train_errors_cuts[2]
  out_obj$errors_comparison_plot <- ggplot(data = error_df) +
    geom_line(aes(x = x, y = e)) +
    facet_wrap(~ id*type) +
    scale_y_continuous(limits = c(y_lower_bound, y_upper_bound)) +
    labs(
      title = ("Binning vs. Natural Splines"),
      x = ("Number of Estimated Parameters"),
      y = ("Average Squared Error"))

  out_obj
}
