How to interpret the upper triangular matrix of cosine similarities

data-mining apache-spark similarity
2022-02-26 09:50:15

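In Spark, RowMatrix.columnSimilarities() returns "an n x n sparse upper-triangular matrix of cosine similarities between columns of this matrix".

How should I read that? If I try to implement the example from https://stackoverflow.com/a/1750187 as follows: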

JavaRDD<Vector> rows = sc.parallelize(Arrays.asList(
    new DenseVector(new double[]{2, 1, 0, 2, 0, 1, 1, 1}),
    new DenseVector(new double[]{2, 1, 1, 1, 1, 0, 1, 1})
));

RowMatrix mat = new RowMatrix(rows.rdd());  
List<Vector> sims = mat.columnSimilarities().toRowMatrix().rows().toJavaRDD().collect();
for(Vector v: sims) {
    System.out.println(v);
}

I get:

(8,[6,7],[0.7071067811865475,0.7071067811865475])
(8,[1,2,3,4,5,6,7],[0.9999999999999998,0.7071067811865475,0.9486832980505137,0.7071067811865475,0.7071067811865475,0.9999999999999998,0.9999999999999998])
(8,[2,3,4,5,6,7],[0.7071067811865475,0.9486832980505137,0.7071067811865475,0.7071067811865475,0.9999999999999998,0.9999999999999998])
(8,[7],[0.9999999999999998])
(8,[4,5,6,7],[0.4472135954999579,0.8944271909999159,0.9486832980505137,0.9486832980505137])
(8,[6,7],[0.7071067811865475,0.7071067811865475])
(8,[3,4,6,7],[0.4472135954999579,1.0,0.7071067811865475,0.7071067811865475])

How should I interpret this output? How do I obtain from it the cosine similarity of 0.822 described in the referenced Stack Overflow post?

Thanks!

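1 Answer

The solution is to transpose the matrix. columnSimilarities() compares columns, so each of the two vectors has to be a column of the RowMatrix rather than a row: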

JavaRDD<Vector> rows = jsc.parallelize(Arrays.asList(
    new DenseVector(new double[]{2, 2}),
    new DenseVector(new double[]{0, 1}),
    new DenseVector(new double[]{1, 1}),
    new DenseVector(new double[]{1, 0}),
    new DenseVector(new double[]{0, 1}),
    new DenseVector(new double[]{2, 1}),
    new DenseVector(new double[]{1, 1}),
    new DenseVector(new double[]{1, 1})
));

RowMatrix mat = new RowMatrix(rows.rdd());
List<Vector> sims = mat.columnSimilarities().toRowMatrix().rows().toJavaRDD().collect();
for(Vector v: sims) {
  System.out.println(v); //(2,[1],[0.8215838362577492])
}
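
To double-check the numbers without Spark, here is a minimal plain-Java sketch that computes the same upper-triangular column similarities by hand. The class name CosineCheck and the helpers cosine, column, transpose and columnSimilarities are made up for this illustration; they are not part of Spark, and the code only mirrors what the documentation quoted in the question describes, not Spark's internal implementation.

```java
import java.util.Locale;

public class CosineCheck {

    // Cosine similarity: dot(a, b) / (|a| * |b|). Assumes neither vector is all zeros.
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Extracts column j of a matrix stored as an array of rows.
    public static double[] column(double[][] m, int j) {
        double[] col = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            col[i] = m[i][j];
        }
        return col;
    }

    // Swaps rows and columns.
    public static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    // Upper-triangular column similarities: sims[i][j] is filled only for i < j,
    // the diagonal and the lower triangle stay 0.
    public static double[][] columnSimilarities(double[][] m) {
        int n = m[0].length;
        double[][] sims = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                sims[i][j] = cosine(column(m, i), column(m, j));
            }
        }
        return sims;
    }

    public static void main(String[] args) {
        // Layout from the question: each document vector is a row (2 rows, 8 columns).
        double[][] docsAsRows = {
            {2, 1, 0, 2, 0, 1, 1, 1},
            {2, 1, 1, 1, 1, 0, 1, 1}
        };

        // 8 x 8 result: compares the 8 columns, each of which has only 2 components.
        double[][] perColumn = columnSimilarities(docsAsRows);
        System.out.println(String.format(Locale.ROOT,
            "columns 0 and 3, original layout: %.4f", perColumn[0][3]));
        // prints "columns 0 and 3, original layout: 0.9487"

        // 2 x 2 result: after transposing, the two documents are the columns.
        double[][] perDocument = columnSimilarities(transpose(docsAsRows));
        System.out.println(String.format(Locale.ROOT,
            "documents as columns: %.4f", perDocument[0][1]));
        // prints "documents as columns: 0.8216"
    }
}
```

This also explains the output in the question: with the vectors as rows, the result is 8 x 8 and each entry (i, j) with i < j compares column i with column j, so 0.9486832980505137 is the similarity of (2, 2) and (2, 1), not of the two documents. Pairs with similarity 0, such as columns 2 and 5, are simply absent from the sparse output. As far as I can tell, toRowMatrix() on the CoordinateMatrix discards the row indices, so the printed lines are not guaranteed to be in column order; iterating over columnSimilarities().entries() instead should give MatrixEntry objects that keep i(), j() and value() together.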