
Significance of the overlap between multiple lists

Cross Validated: r, distributions, hypergeometric-distribution

2022-04-02 13:03:46

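I am trying to assess the significance of the overlap between several gene lists. I applied different methods to select disease-associated genes, and I have several 4-way Venn diagrams that illustrate the results.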

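My main goal is to determine whether the intersection of these 4 methods is significant, so that I can compare the Venn diagrams with one another.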

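To test the significance of the overlap between two lists I would use a hypergeometric test, but I cannot find any solution for the case of multiple overlapping lists.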
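For reference, the two-list hypergeometric test mentioned above can be sketched in R roughly as follows. The list contents, the variable names, and the universe size of 14 are hypothetical values chosen for illustration, not data from the question.

```r
# Hypothetical toy data: two gene lists drawn from a universe of 14 genes.
universe_size <- 14
list1 <- paste0("gene", 1:9)
list2 <- paste0("gene", c(1, 3, 4, 6, 7, 9))

# Observed overlap between the two lists.
overlap <- length(intersect(list1, list2))

# Upper-tail probability P[X >= overlap].
# phyper() with lower.tail = FALSE returns P[X > q], so q is overlap - 1.
p_value <- phyper(overlap - 1,
                  m = length(list1),                  # "white balls": genes in list1
                  n = universe_size - length(list1),  # "black balls": all other genes
                  k = length(list2),                  # number of draws: size of list2
                  lower.tail = FALSE)
p_value
```

The choice of universe matters: the same overlap becomes more significant as the background grows, so the background should be the set of genes that could have been selected by either method.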

Does anyone know how I could achieve this?

1 answer

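I was dealing with a similar problem and could not find a simple function for it, so I wrote one myself. It is not very concise, but it works. I hope it helps you too.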

hyper_matrix <- function(gene.list, background){
  # generate every combination of two gene lists
  combination <- expand.grid(names(gene.list),names(gene.list))
  combination$values <- rep(NA, times=nrow(combination))

  # convert long table into wide
  combination <- reshape(combination, idvar="Var1", timevar="Var2", direction="wide")
  rownames(combination) <- combination$Var1
  combination <- combination[,-1]
  colnames(combination) <- gsub("values.", "", colnames(combination))

  # calculate the length of overlap of each pair
  for(i in colnames(combination)){
    for(j in rownames(combination)){
      combination[j,i]<-length(intersect(gene.list[[j]],gene.list[[i]]))
    }
  }

  # calculate the significance of the overlap of each pair
  for(m in 1:length(gene.list)){
    for(n in 1:length(gene.list)){
      if(n>m){
        # phyper() with lower.tail=FALSE returns P[X > x], so 1 is subtracted from the overlap length to get P[X >= x]
        combination[n,m] <- phyper(combination[m,n]-1, length(gene.list[[m]]), background-length(gene.list[[m]]), length(gene.list[[n]]), lower.tail=FALSE)
      }
    }
  }
  # round to 2 decimal places
  return(round(combination,2))
}
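The "subtract 1" step in the function is easy to get wrong, so here is a small check of it, written as a sketch with hypothetical values for x, m, n and k. It compares the phyper() upper tail against an explicit sum of point probabilities from dhyper().

```r
# Hypothetical values: overlap x, list sizes m and k, and n = background - m.
x <- 5
m <- 9
n <- 5
k <- 6

# Upper tail P[X >= x], obtained by shifting the quantile down by 1.
upper_tail <- phyper(x - 1, m, n, k, lower.tail = FALSE)

# The same probability as an explicit sum of P[X = i] for i = x, ..., min(m, k).
explicit_sum <- sum(dhyper(x:min(m, k), m, n, k))

# Without the shift, phyper() gives P[X > x], which excludes the observed overlap.
strict_tail <- phyper(x, m, n, k, lower.tail = FALSE)

c(upper_tail = upper_tail, explicit_sum = explicit_sum, strict_tail = strict_tail)
```

The first two values agree, while the third is smaller because it leaves out the probability of observing exactly x shared genes.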

With this, suppose you have 4 gene lists:

gene.list <- list(listA=paste0("gene",c(1,2,3,4,5,6,7,8,9)),
                 listB=paste0("gene",c(1,3,4,6,7,9)),
                 listC=paste0("gene",c(5,6,7,8,9,11)),
                 listD=paste0("gene",c(11,12,13,14,15)))

and the background is a universe of 14 genes (the total number of balls in the urn). The result will be:

hyper_matrix(gene.list, 14)

      listA listB listC listD
listA  9.00  6.00  5.00     0
listB  0.03  6.00  3.00     0
listC  0.24  0.53  6.00     1
listD  1.00  1.00  0.97     5

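The upper-right triangle holds the size of the overlap for each pair, the lower-left triangle holds the significance of that overlap from the hypergeometric test, and the diagonal holds the size of each list. In this toy example, with 14 genes in the universe and 0.05 as the p-value cutoff, the overlap between listA and listB is significant (p = 0.03). No other pair has a significant overlap.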
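As a cross-check on the listA/listB result, the same p-value can be obtained from a one-sided Fisher's exact test on the corresponding 2x2 table, since that test is based on the same hypergeometric distribution. This is only a sketch: the table is built by hand from the counts in the example (6 shared genes, list sizes 9 and 6, background 14), and the variable names are made up for illustration.

```r
# Counts taken from the toy example.
background <- 14
size_a <- 9   # length of listA
size_b <- 6   # length of listB
both <- 6     # genes in listA and listB

# 2x2 contingency table: rows = in listB (yes/no), columns = in listA (yes/no).
tab <- matrix(c(both,          size_a - both,                        # column "in listA"
                size_b - both, background - size_a - size_b + both), # column "not in listA"
              nrow = 2,
              dimnames = list(inB = c("yes", "no"), inA = c("yes", "no")))

# One-sided test for enrichment of the overlap.
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Hypergeometric upper tail, computed the same way as in hyper_matrix().
p_hyper <- phyper(both - 1, size_a, background - size_a, size_b, lower.tail = FALSE)

c(fisher = p_fisher, hypergeometric = p_hyper)
```

Both values round to 0.03, matching the listB/listA entry in the matrix.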
