机器算法验证 - 使用矩阵乘法计算二进制数据的 Jaccard 或其他关联系数 - 吾爱随笔录

使用矩阵乘法计算二进制数据的 Jaccard 或其他关联系数

机器算法验证 r 矩阵二进制数据关联度量相似之处

2022-03-28 04:14:10

我想知道是否有任何可能的方法来使用矩阵乘法计算 Jaccard 系数。

我用了这段代码

    jaccard_sim <- function(x) {
    # initialize similarity matrix
    m <- matrix(NA, nrow=ncol(x),ncol=ncol(x),dimnames=list(colnames(x),colnames(x)))
    jaccard <- as.data.frame(m)

    for(i in 1:ncol(x)) {
     for(j in i:ncol(x)) {
        jaccard[i,j]= length(which(x[,i] & x[,j])) / length(which(x[,i] | x[,j]))
        jaccard[j,i]=jaccard[i,j]        
       }
     }

这在 R 中实现是完全可以的。我已经完成了骰子相似性，但被 Tanimoto/Jaccard 卡住了。有人可以帮忙吗？

3个回答

我们知道 Jaccard（在任意两列二进制数据之间计算 $\bf{X}$ ）是 $\frac{a}{a+b+c}$ , 而 Rogers-Tanimoto 是 $\frac{a+d}{a+d+2(b+c)}$ ，在哪里

a - 两列都为 1 的行数
b - 此列而不是另一列为 1 的行数
c - 另一列而不是该列为 1 的行数
d - 两列都为 0 的行数

$a+b+c+d=n$ , 中的行数 $\bf{X}$

然后我们有：

$\bf X'X=A$ 是平方对称矩阵 $a$ 所有列之间。

$\bf (not X)'(not X)=D$ 是平方对称矩阵 $d$ 在所有列之间（“not X”是在 X 中转换 1->0 和 0->1）。

所以， $\frac{\bf A}{n-\bf D}$ 是所有列之间的 Jaccard 对称方阵。

$\frac{\bf A+D}{\bf A+D+2(n-(A+D))}=\frac{\bf A+D}{2n-\bf A-D}$ 是所有列之间的 Rogers-Tanimoto 的对称方阵。

我用数字检查了这些公式是否给出了正确的结果。他们是这样。

更新。您还可以获得矩阵 $\bf B$ 和 $\bf C$ ：

$\bf B= [1]'X-A$ ，其中“[1]”表示一个矩阵，大小为 $\bf X$ . $\bf B$ 是平方非对称矩阵 $b$ 在所有列之间；它的元素ij是其中的行数 $\bf X$ 在i列中为 0 ，在j列中为1 。

最后， $\bf C=B'$ .

矩阵 $\bf D$ 当然也可以这样计算： $n \bf -A-B-C$ .

了解矩阵 $\bf A, B, C, D$ ，您可以计算为二进制数据发明的任何成对（不）相似性系数的矩阵。

如果 X 是稀疏的，则上述解决方案不是很好。因为取 !X 会产生一个密集的矩阵，占用大量的内存和计算量。

更好的解决方案是使用公式Jaccard[i,j] = #common / (#i + #j - #common)。使用稀疏矩阵，您可以按如下方式执行（注意该代码也适用于非稀疏矩阵）：

library(Matrix)
jaccard <- function(m) {
    ## common values:
    A = tcrossprod(m)
    ## indexes for non-zero common values
    im = which(A > 0, arr.ind=TRUE)
    ## counts for each row
    b = rowSums(m)

    ## only non-zero values of common
    Aim = A[im]

    ## Jacard formula: #common / (#i + #j - #common)
    J = sparseMatrix(
          i = im[,1],
          j = im[,2],
          x = Aim / (b[im[,1]] + b[im[,2]] - Aim),
          dims = dim(A)
    )

    return( J )
}

这可能对您有用，也可能没有用，具体取决于您的需求。假设您对聚类分配之间的相似性感兴趣：

Jaccard Similarity Coefficient 或Jaccard Index可用于计算两个聚类分配的相似性。

鉴于标签L1和L2，Ben-Hur、Elisseeff 和 Guyon (2002)表明，可以使用中间矩阵的点积计算 Jaccard 指数。下面的代码利用这一点快速计算 Jaccard 索引，而无需将中间矩阵存储在内存中。

代码是用 C++ 编写的，但可以使用sourceCpp命令加载到 R 中。

/**
 * The Jaccard Similarity Coefficient or Jaccard Index is used to compare the
 * similarity/diversity of sample sets. It is defined as the size of the
 * intersection of the sets divided by the size of the union of the sets. Here,
 * it is used to determine how similar to clustering assignments are.
 *
 * INPUTS:
 *    L1: A list. Each element of the list is a number indicating the cluster
 *        assignment of that number.
 *    L2: The same as L1. Must be the same length as L1.
 *
 * RETURNS:
 *    The Jaccard Similarity Index
 *
 * SIDE-EFFECTS:
 *    None
 *
 * COMPLEXITY:
 *    Time:  O(K^2+n), where K = number of clusters
 *    Space: O(K^2)
 *
 * SOURCES:
 *    Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon (2001) A stability based
 *    method for discovering structure in clustered data. Biocomputing 2002: pp.
 *    6-17. 
 */
// [[Rcpp::export]]
NumericVector JaccardIndex(const NumericVector L1, const NumericVector L2){
  int n = L1.size();
  int K = max(L1);

  int overlaps[K][K];
  int cluster_sizes1[K], cluster_sizes2[K];

  for(int i = 0; i < K; i++){    // We can use NumericMatrix (default 0) 
    cluster_sizes1[i] = 0;
    cluster_sizes2[i] = 0;
    for(int j = 0; j < K; j++)
      overlaps[i][j] = 0;
  }

  //O(n) time. O(K^2) space. Determine the size of each cluster as well as the
  //size of the overlaps between the clusters.
  for(int i = 0; i < n; i++){
    cluster_sizes1[(int)L1[i] - 1]++; // -1's account for zero-based indexing
    cluster_sizes2[(int)L2[i] - 1]++;
    overlaps[(int)L1[i] - 1][(int)L2[i] - 1]++;
  }

  // O(K^2) time. O(1) space. Square the overlap values.
  int C1dotC2 = 0;
  for(int j = 0; j < K; j++){
    for(int k = 0; k < K; k++){
      C1dotC2 += pow(overlaps[j][k], 2);
    }
  }

  // O(K) time. O(1) space. Square the cluster sizes
  int C1dotC1 = 0, C2dotC2 = 0;
  for(int i = 0; i < K; i++){
    C1dotC1 += pow(cluster_sizes1[i], 2);
    C2dotC2 += pow(cluster_sizes2[i], 2);
  }

  return NumericVector::create((double)C1dotC2/(double)(C1dotC1+C2dotC2-C1dotC2));
}

其它你可能感兴趣的问题

上一篇线性回归解释下一篇如果我们修改“有史以来的最佳统计问题”，正确的答案是什么？