Clustering documents and identifying the most prominent document in each cluster?

Tags: data mining, machine learning, clustering

2022-03-02 20:00:17

I have a set of documents like the examples below.

doc1 = {'Science': 0.7, 'History': 0.05, 'Politics': 0.15, 'Sports': 0.1}
doc2 = {'Science': 0.3, 'History': 0.5, 'Politics': 0.1, 'Sports': 0.1}

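I want to cluster the documents and identify the most prominent document in each cluster.

For example, if cluster 1 contains {doc1, doc4, doc5, doc8}, I want to find the most prominent document that represents this cluster (e.g., doc8), or alternatively determine the cluster's topic.

Please let me know suitable approaches for achieving this :)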

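1 Answer

A very simple approach is to find some kind of centroid for each cluster (for example, by averaging the topic distributions of the documents that belong to that cluster) and then compute the cosine distance between each document in the cluster and that centroid. The document with the smallest distance is the one closest to the centroid and therefore the most "representative" of the cluster.

Building on your example: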

import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


# Initialize some documents
doc1 = {'Science':0.8, 'History':0.05, 'Politics':0.15, 'Sports':0.1}
doc2 = {'News':0.2, 'Art':0.8, 'Politics':0.1, 'Sports':0.1}
doc3 = {'Science':0.8, 'History':0.1, 'Politics':0.05, 'News':0.1}
doc4 = {'Science':0.1, 'Weather':0.2, 'Art':0.7, 'Sports':0.1}
collection = [doc1, doc2, doc3, doc4]
df = pd.DataFrame(collection)
# Fill missing values with zeros
df.fillna(0, inplace=True)
# Get feature vectors as a NumPy array
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
feature_matrix = df.to_numpy()

# Fit DBSCAN
db = DBSCAN(min_samples=1, metric='precomputed').fit(pairwise_distances(feature_matrix, metric='cosine'))
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)

# Find the representatives
representatives = {}
for label in set(labels):
    # Find indices of documents belonging to the same cluster
    ind = np.argwhere(labels==label).reshape(-1,)
    # Select these specific documents
    cluster_samples = feature_matrix[ind,:]
    # Calculate their centroid as an average
    centroid = np.average(cluster_samples, axis=0)
    # Find the distance of each document from the centroid
    distances = [cosine(sample_doc, centroid) for sample_doc in cluster_samples]
    # Keep the document closest to the centroid as the representative
    representatives[label] = cluster_samples[np.argsort(distances),:][0]

# dict.iteritems() exists only in Python 2; items() works in Python 3
for label, doc in representatives.items():
    print("Label : %d -- Representative : %s" % (label, str(doc)))
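
The loop above keeps only the feature vector of each representative. If you also want the document's name and a rough topic for the cluster, one possible variation is sketched below. The helper name summarize_clusters, the toy matrix, and the hard-coded labels are all hypothetical and only stand in for the output of a clustering step. Taking the largest component of the centroid as the cluster "topic" is an assumption about what "topic" should mean here, not the only reasonable choice.

```python
import numpy as np
from scipy.spatial.distance import cosine


def summarize_clusters(feature_matrix, labels, doc_names, topic_names):
    """For each cluster label, return the name of the document closest to the
    centroid and the topic with the largest weight in that centroid."""
    summary = {}
    for label in sorted(set(labels)):
        # Row indices of the documents assigned to this cluster
        members = np.flatnonzero(labels == label)
        # Centroid as the plain average of the member vectors
        centroid = feature_matrix[members].mean(axis=0)
        # Cosine distance of each member to the centroid
        distances = [cosine(feature_matrix[i], centroid) for i in members]
        closest = members[int(np.argmin(distances))]
        summary[int(label)] = {
            "representative": doc_names[closest],
            "topic": topic_names[int(np.argmax(centroid))],
        }
    return summary


# Hypothetical toy data: rows are documents, columns are topic weights
topics = ["Science", "History", "Art"]
names = ["doc1", "doc2", "doc3", "doc4", "doc5"]
X = np.array([
    [0.90, 0.10, 0.0],
    [0.60, 0.40, 0.0],
    [0.75, 0.25, 0.0],
    [0.10, 0.10, 0.8],
    [0.00, 0.40, 0.6],
])
# Assumed cluster assignments, e.g. taken from db.labels_
labels = np.array([0, 0, 0, 1, 1])

result = summarize_clusters(X, labels, names, topics)
for label, info in result.items():
    print(label, info["representative"], info["topic"])
```

Working with row indices rather than sliced arrays makes it easy to map the chosen row back to a document name or any other metadata you keep alongside the matrix.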
