机器算法验证 - R中的Hosmer-Lemeshow测试 - 吾爱随笔录

R中的Hosmer-Lemeshow测试

机器算法验证 r 拟合优度 hosmer-lemeshow-test

2022-03-19 15:21:56

Hosmer-Lemeshow 检验是逻辑回归模型拟合优度的统计检验。根据?hoslem.test，它只处理二元逻辑回归。但是，我想知道这个测试是否可以用于因变量超过 2 个级别的有序 logit 模型。

有序 logit 模型也称为比例优势模型或累积 logit 模型。我使用“Ordinal”包。非常感谢。

1个回答

7 月 28 日更新：以下内容现已generalhoslem与 Lipsitz 和 Pulkstenis-Robinson 测试一起推送到包中的 CRAN。

Fagerland 和 Hosmer在比例优势回归模型 2013 Stat Med 的拟合优度检验和拟合优度检验中讨论了 Hosmer-Lemeshow 检验和其他两种方法（Lipsitz 检验和 Pulkstenis-Robinson 检验）的推广在序数逻辑回归模型（2016）统计计算与模拟杂志。

我还没有正确检查，但据我所知，它们还没有在 R 中实现。如果事实证明是这样，我计划将它们添加到generalhoslem包中。

编辑：可能还值得指出 Fagerland 和 Hosmer 的建议

因为测试可能会检测到不同类型的不匹配，所以对拟合优度的全面评估需要使用所有三种方法。

另一个编辑：我没有立即意识到这一点，但 Hosmer-Lemeshow 测试的多项版本和序数版本之间的唯一区别在于自由度。严格来说，序数响应模型的测试需要按加权序数“响应分数”对观察结果进行排序，但它们在上面 2012 年的文章中表明，这相当于以与多项式情况相同的方式对观察结果进行分箱。

我没有正确测试它，但logitgof我的 github 页面上的功能现在应该可以解决问题：https ://github.com/matthewjay15/generalhoslem-v1.1.0/blob/1.2.0.9000/logitgof.R请注意，如果你检查此函数生成的观察到的和预期的表，那么列的顺序可能不正确（它们将按字母顺序排列）。但是，这不应该对测试统计数据产生任何影响，因为它们将相互对应。

ordinal使用andMASS包的示例：

library(reshape) # needed by logitgof
logitgof <- function (obs, exp, g = 10, ord = FALSE) {
  DNAME <- paste(deparse(substitute(obs)), deparse(substitute(exp)), sep = ", ")
  yhat <- exp
  if (is.null(ncol(yhat))) {
    mult <- FALSE
  } else {
    if (ncol(yhat) == 1) {
      mult <- FALSE
    } else mult <- TRUE
  }
  n <- ncol(yhat)
  if (mult) {
    if (!ord) {
      METHOD <- "Hosmer and Lemeshow test (multinomial model)"
    } else {
      METHOD <- "Hosmer and Lemeshow test (ordinal model)"
    }
    qq <- unique(quantile(1 - yhat[, 1], probs = seq(0, 1, 1/g)))
    cutyhats <- cut(1 - yhat[, 1], breaks = qq, include.lowest = TRUE)
    dfobs <- data.frame(obs, cutyhats)
    dfobsmelt <- melt(dfobs, id.vars = 2)
    observed <- cast(dfobsmelt, cutyhats ~ value, length)
    observed <- observed[order(c(1, names(observed[, 2:ncol(observed)])))]
    dfexp <- data.frame(yhat, cutyhats)
    dfexpmelt <- melt(dfexp, id.vars = ncol(dfexp))
    expected <- cast(dfexpmelt, cutyhats ~ variable, sum)
    expected <- expected[order(c(1, names(expected[, 2:ncol(expected)])))]
    stddiffs <- abs(observed[, 2:ncol(observed)] - expected[, 2:ncol(expected)]) / sqrt(expected[, 2:ncol(expected)])
    if (ncol(expected) != ncol(observed)) stop("Observed and expected tables have different number of columns. Check you entered the correct data.")
    chisq <- sum((observed[, 2:ncol(observed)] - expected[, 2:ncol(expected)])^2 / expected[, 2:ncol(expected)])
    if (!ord) {
      PARAMETER <- (nrow(expected) - 2) * (ncol(yhat) - 1) 
    } else {
      PARAMETER <- (nrow(expected) - 2) * (ncol(yhat) - 1) + ncol(yhat) - 2
    }
  } else {
    METHOD <- "Hosmer and Lemeshow test (binary model)"
    if (is.factor(obs)) {
      y <- as.numeric(obs) - 1
    } else {
      y <- obs
    }
    qq <- unique(quantile(yhat, probs = seq(0, 1, 1/g)))
    cutyhat <- cut(yhat, breaks = qq, include.lowest = TRUE)
    observed <- xtabs(cbind(y0 = 1 - y, y1 = y) ~ cutyhat)
    expected <- xtabs(cbind(yhat0 = 1 - yhat, yhat1 = yhat) ~ cutyhat)
    stddiffs <- abs(observed - expected) / sqrt(expected)
    chisq <- sum((observed - expected)^2/expected)
    PARAMETER <- nrow(expected) - 2
  }
  if (g != nrow(expected))
    warning(paste("Not possible to compute", g, "rows. There might be too few observations."))
  if (any(expected[, 2:ncol(expected)] < 1))
    warning("At least one cell in the expected frequencies table is < 1. Chi-square approximation may be incorrect.")
  PVAL <- 1 - pchisq(chisq, PARAMETER)
  names(chisq) <- "X-squared"
  names(PARAMETER) <- "df"
  structure(list(statistic = chisq, parameter = PARAMETER, 
                 p.value = PVAL, method = METHOD, data.name = DNAME, observed = observed, 
                 expected = expected, stddiffs = stddiffs), class = "htest")
}

library(foreign) # just to download the example dataset

# with the ordinal package
library(ordinal)
ml <- read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
mod1 <- clm(ses ~ female + write + read, data = ml)

# extract predicted probs for each level of outcome
predprob <- data.frame(id = ml$id, female = ml$female, read = ml$read, write = ml$write)
fv <- predict(mod1, newdata = predprob, type = "prob")$fit
logitgof(ml$ses, fv, ord = TRUE) # set ord to TRUE to run ordinal instead of multinomial test

# with MASS
library(MASS)
mod2 <- polr(ses ~ female + write + read, data = ml)
logitgof(ml$ses, fitted(mod2), ord = TRUE)

其它你可能感兴趣的问题

上一篇为什么有人会使用 z 分数而不是 t 分数？下一篇如何找到甜蜜点