
Clustering and confusion matrix


2022-02-12 08:24:59

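The problem is at this link: text vs cluster, which provides an initial partition of a text collection into four clusters {c1, c2, c3, c4}. Suppose the ground-truth partition is given by: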

cacm texts belong to cluster1
cisi texts belong to cluster2
cran texts belong to cluster3
med texts belong to cluster4

% open text file
fileID = fopen('list.txt');
C = textscan(fileID,'%s %s');
fclose(fileID);

This reads the list so that C{1}{j} is the file name and C{2}{j} is the cluster name. By the way, I am using Matlab.
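For readers who want to try the same read step outside Matlab, here is a minimal Python sketch. It assumes each line of the list holds a file name and a cluster name separated by whitespace (as the '%s %s' format suggests) and, as a further assumption not stated in the question, that the collection name (cacm, cisi, cran, med) is the leading alphabetic part of the file name. The function name `parse_assignments` and the sample lines are made up for illustration.

```python
# Ground-truth mapping taken from the question; the prefix-based lookup
# is an assumption about how the file names are formed.
TRUE_CLUSTER = {"cacm": 1, "cisi": 2, "cran": 3, "med": 4}


def parse_assignments(text):
    """Return (file_name, assigned_cluster, true_cluster) triples.

    Each non-empty line is expected to hold two whitespace-separated
    fields, mirroring the '%s %s' format used with textscan.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        file_name, assigned = fields
        # Hypothetical convention: the collection name is the leading
        # alphabetic part of the file name, e.g. "cacm" in "cacm.000001".
        prefix = ""
        for ch in file_name:
            if not ch.isalpha():
                break
            prefix += ch
        # true_cluster is None when the prefix is not a known collection.
        rows.append((file_name, assigned, TRUE_CLUSTER.get(prefix.lower())))
    return rows


# Made-up sample lines, for illustration only.
sample = "cacm.000001 c1\nmed.000007 c1\ncran.000012 c3\n"
rows = parse_assignments(sample)
for row in rows:
    print(row)
```

If the real file names follow a different pattern, only the prefix extraction needs to change; the resulting (assigned, true) pairs are what a confusion matrix is counted from.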

1 Answer

Suppose your clustering result is stored in R. You have the assigned partition ID part_id and the true clust_id; here is a small example:

df <- read.table(text = "obj, part_id, clust_id
X, 1, 1
Y, 2, 2
Z, 3, 3
U, 1, 3
V, 2, 3
W, 2, 3", header = TRUE, sep = ",")

Then table() does the job for you:

> table(df[,c("part_id","clust_id")])

       clust_id
part_id 1 2 3
      1 1 0 1
      2 0 1 2
      3 0 0 1
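The cross-tabulation itself is just a count of (part_id, clust_id) pairs. As a rough sketch of that counting step in plain Python (the helper name `contingency_table` is hypothetical, and the pairs are the six rows of the small example data frame):

```python
from collections import Counter

# (part_id, clust_id) pairs copied from the small example data frame.
pairs = [(1, 1), (2, 2), (3, 3), (1, 3), (2, 3), (2, 3)]


def contingency_table(pairs):
    """Count how often each (assigned, true) label pair occurs."""
    counts = Counter(pairs)
    row_labels = sorted({p for p, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # Counter returns 0 for pairs that never occur, which fills the empty cells.
    matrix = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, matrix


row_labels, col_labels, matrix = contingency_table(pairs)
for label, row in zip(row_labels, matrix):
    print(label, row)
```

Rows are the assigned partitions and columns the true clusters, the same orientation as the table() output.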

Update

For completeness, you can also use SQL (as suggested in the comments). Below is an example query on a table holding the same data as the data frame above.

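You will see the limitation immediately: the exact column list has to be spelled out in the query, so the query must be adapted whenever the list of clusters/partitions changes.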

with clust_agg as (
select part_id, clust_id, count(*) cnt from clust
group by part_id, clust_id
)
select * from clust_agg
pivot(sum(cnt) clust  for (clust_id) in 
     (1 as "1",
      2 as "2",
      3 as "3"))
order by 1;

   PART_ID    1_CLUST    2_CLUST    3_CLUST
---------- ---------- ---------- ----------
         1          1                     1 
         2                     1          2 
         3                                1
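One way around the hard-coded column list is to generate the query text from the cluster IDs before running it. The following is only a sketch under stated assumptions: `build_pivot_query` is a hypothetical helper, it merely assembles a string in the shape of the query above without connecting to any database, and it assumes the cluster IDs are integers and the table name comes from a trusted source.

```python
def build_pivot_query(cluster_ids, table_name="clust"):
    """Build the pivot query text for a given list of cluster ids.

    Hypothetical helper: it only assembles a string and does not talk to
    a database. The ids are forced to int so that nothing else can end up
    in the generated SQL; table_name is inserted as-is and must be trusted.
    """
    ids = sorted({int(cid) for cid in cluster_ids})
    in_list = ",\n      ".join('{0} as "{0}"'.format(cid) for cid in ids)
    return (
        "with clust_agg as (\n"
        "select part_id, clust_id, count(*) cnt from {table}\n"
        "group by part_id, clust_id\n"
        ")\n"
        "select * from clust_agg\n"
        "pivot(sum(cnt) clust for (clust_id) in\n"
        "     ({in_list}))\n"
        "order by 1"
    ).format(table=table_name, in_list=in_list)


query = build_pivot_query([1, 2, 3])
print(query)
```

The list of IDs would typically come from a prior `select distinct clust_id` on the same table; that step is left out here.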
