
KMeans document clustering

data-mining clustering k-means

2021-09-15 21:27:09

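I am still early in my data science journey, so I cannot tell on my own whether what I did is right or wrong.

I applied KMeans to a corpus to which I had added some random documents (very short sentences). The sentences were vectorized before fitting.

Looking at the clustering result, I somehow expected each vector (keyword) to fall into one cluster only, and no more. That is not the case.

In some cases a vector falls into two clusters, and I would like to know why.

Is this because KMeans is being misused on vectors built from documents?
Is this simply normal given how KMeans works (it moves the centroids, but in the end assigns each object to the nearest cluster by distance)?
Is the overlap there because, when analysing my results, I evaluated the whole set of items in a cluster rather than only, say, the top X closest to the center?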

- Example:

corpus = [
'The car is driven on the road.',
'The truck is driven on the highway.',
'The train run on the tracks.',
'The bycicle is run on the pavement.',
'The flight is conducted in the air.',
'The baloon is conducted in the air.',
'The bird is flying in the air.',
'The man is walking in the street.',
'The pedestrian is crossing the zebra.',
'The pilot flights the plane].',
'On the route, the car is driven.',
'On the road, the truck is moved.',
'The train is running on the tracks.',
'The bike is running on the pavement.',
'The flight takes place in the sky.',
'Birds don''t fly when is dark',
'The baloon is in the water.',
'The bird flies in the sky.',
'In the road, the guy walks.',
'The pedestrian is passing through the zebra.',
'The pilot is flying the plane.',    
'This is a Japanese doll.',
'I really want to go to work, but I am too sick to drive.',
'Christmas is coming.',
'With the daylight saving time turned off it''s getting dark soon.',
'The body fat may compensates for the loss of nutrients.',
'Mary plays the piano.',
'She always speaks to him in a loud voice.',
'Wow, does that work?',
'I don''t like walking when it is dark',
'Last Friday in three week’s time I saw a spotted striped blue worm shake hands with a legless lizard.',
'My Mum tries to be cool by saying that she likes all the same things that I do.',
'Mummy is saying that she loves me being a pilot when in reality she is scared all the time I take off.',    
'Where do random thoughts come from?',
'A glittering gem is not enough.',
'We need to rent a room for our party.',
'A purple pig and a green donkey flew a kite in the middle of the night and ended up sunburnt.',
'If I don’t like something, I’ll stay away from it.',
'The body may perhaps compensates for the loss of a true metaphysics.',
'Don''t step on the broken glass.',
'It was getting dark, and we weren''t there yet.', 
'Playing an instrument like the guitar takes out the stress from my day.']

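from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn each sentence into a TF-IDF vector; terms that occur in more than
# 80% of the sentences are ignored (max_df=0.8).
vectorizer = TfidfVectorizer(analyzer='word',
                             max_df=0.8,
                             max_features=50000,
                             lowercase=True)

# Sparse matrix with one row per sentence and one column per term.
X = vectorizer.fit_transform(corpus)

# Assign each sentence (row of X) to one of the 11 clusters.
num_clusters = 11
kmean = KMeans(n_clusters=num_clusters, random_state=1021)
clusters = kmean.fit_predict(X)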

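If you explore the clusters variable, you will notice the overlap I am talking about. For example, the keyword baloon appears in clusters 10 and 0.

There are 12 overlaps, which on a dataset of 33 unique keywords is roughly one third, so I cannot say I am satisfied.

Any suggestion is appreciated. Thanks.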

2 Answers

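Let's assume your corpus has n distinct keywords. For the k-means algorithm, each keyword is an axis in an n-dimensional space, and a document is a point in that n-dimensional space.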

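The k-means algorithm assigns each point (a document) to one cluster. When you say that a keyword appears in two clusters, it probably means that this particular dimension/keyword is important to both clusters.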
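To make "a keyword appears in two clusters" concrete, one could map every token to the set of cluster labels of the sentences that contain it. The following is only a minimal sketch: `clusters_per_keyword`, `toy_docs` and `toy_labels` are hypothetical names, the labels are made up for illustration, and the regular expression mirrors the default token pattern of TfidfVectorizer (words of two or more characters).

```python
import re
from collections import defaultdict


def clusters_per_keyword(docs, labels):
    """Map each keyword to the set of cluster labels of the documents containing it."""
    keyword_clusters = defaultdict(set)
    for doc, label in zip(docs, labels):
        # Same tokenization idea as TfidfVectorizer's default: lowercase, words of 2+ characters.
        for token in set(re.findall(r"\b\w\w+\b", doc.lower())):
            keyword_clusters[token].add(int(label))
    return dict(keyword_clusters)


# Hypothetical toy input: three sentences with made-up cluster labels.
toy_docs = ['The baloon is conducted in the air.',
            'The baloon is in the water.',
            'The bird is flying in the air.']
toy_labels = [0, 1, 0]

# Keep only the keywords whose sentences ended up in more than one cluster.
overlaps = {word: found for word, found in
            clusters_per_keyword(toy_docs, toy_labels).items() if len(found) > 1}
print(overlaps)
```

With the variables from the question, this would be called as `clusters_per_keyword(corpus, clusters)`. Every sentence still has a single label; it is the keyword that is shared.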

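Let's take a hypothetical example: suppose you have patients' blood pressure, cholesterol level and a bunch of other medical parameters, and you discretize blood pressure into 2 or 3 levels. If you run k-means on this data, each patient is assigned to a single cluster. But it is quite possible that two (or even more) clusters contain patients with systolic pressure > 120.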
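The patient example can be sketched with a few invented records. Everything here is an assumption made for illustration: the numbers are made up, only two parameters are used, blood pressure is left as a raw value instead of being discretized, and `n_clusters=3`, `n_init=10` and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical patients: [systolic blood pressure, cholesterol level].
patients = np.array([
    [130, 150], [135, 155], [132, 148],   # high pressure, low cholesterol
    [131, 250], [136, 255], [133, 248],   # high pressure, high cholesterol
    [105, 200], [108, 205], [102, 198],   # normal pressure
])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(patients)

# Every patient gets exactly one cluster label ...
print(labels)

# ... yet patients with systolic pressure > 120 can sit in several clusters.
high_bp_clusters = set(labels[patients[:, 0] > 120].tolist())
print(high_bp_clusters)
```

The property "systolic pressure > 120" plays the same role as a keyword in the question: it describes members of more than one cluster, while no patient belongs to two clusters.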

You may need to read the k-means results more carefully.
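One possible way to read the result more closely is to look at the heaviest terms of each centroid and at the sentences ordered by their distance to it, which also covers the "top X closest to the center" idea from the question. This is a sketch under assumptions: it uses a six-sentence subset of the corpus, and `n_clusters=3`, `n_init=10`, `random_state=0` and the cut-off of three terms are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['The car is driven on the road.',
        'On the road, the truck is moved.',
        'The bird is flying in the air.',
        'The bird flies in the sky.',
        'The train is running on the tracks.',
        'The train run on the tracks.']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
kmean = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Column index -> term, rebuilt from the fitted vocabulary.
terms = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

# Distance of every sentence to every centroid, shape (n_docs, n_clusters).
distances = kmean.transform(X)

top_terms = {}
for c in range(kmean.n_clusters):
    # Terms with the largest weights in the centroid of cluster c.
    top = kmean.cluster_centers_[c].argsort()[::-1][:3]
    top_terms[c] = terms[top].tolist()
    # Members of cluster c, closest to the centroid first.
    members = np.where(kmean.labels_ == c)[0]
    members = members[np.argsort(distances[members, c])]
    print(c, top_terms[c], [docs[i] for i in members])
```

A term such as "the" or "on" can rank high in several centroids at once, which is one way a keyword ends up "in" more than one cluster.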

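I think you may be mixing things up. In the example you provided there are 42 sentences, each of which is transformed by TfidfVectorizer, which gives us a sparse matrix of shape (42, 174). Each sentence is then represented as a vector and clustered with k-means, so each sentence is assigned to one cluster.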
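The relation between sentences, matrix shape and labels can be checked on a small scale. This is an illustrative sketch only: it uses four sentences from the corpus with default TfidfVectorizer settings, and `n_clusters=2`, `n_init=10` and `random_state=0` are assumptions, so the numbers differ from the (42, 174) case.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['The car is driven on the road.',
        'The truck is driven on the highway.',
        'The bird flies in the sky.',
        'The bird is flying in the air.']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# One row per sentence, one column per distinct term.
print(X.shape, len(vectorizer.vocabulary_))

# One cluster label per sentence, not per term.
print(labels.shape)
```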

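Individual words are not clustered, only whole sentences. If the keyword "baloon" appears in two sentences, it does not necessarily mean that both sentences will fall into the same cluster. However, I am surprised by what you report, because the sentences containing "baloon" both belong to the same cluster (#7). This makes me think you are misreading the results.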

>>> import numpy as np
>>> np.argwhere(["baloon" in sentence for sentence in corpus])
array([[ 5],
       [16]], dtype=int64)
>>> clusters[5]
7
>>> clusters[16]
7

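In any case, it can happen that sentences containing "baloon" belong to different clusters. It depends on the other words in the sentences, the number of clusters, the rest of the dataset and the clustering method. For example, this may happen if the sentences containing "baloon" are not very similar to each other.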
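How similar two sentences are can be inspected directly with cosine similarity on their TF-IDF vectors. A minimal sketch, assuming default TfidfVectorizer settings and only three sentences from the corpus, so the values will differ on the full dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['The baloon is conducted in the air.',
        'The baloon is in the water.',
        'The flight is conducted in the air.']

X = TfidfVectorizer().fit_transform(docs)

# sim[i, j] is the cosine similarity between sentence i and sentence j.
sim = cosine_similarity(X)
print(sim.round(2))

# Compare the first sentence with the other "baloon" sentence and with the "flight" sentence.
print(sim[0, 1], sim[0, 2])
```

In this small example the first sentence is closer to the "flight" sentence than to the other "baloon" sentence, because it shares more weighted terms with it ("conducted", "air"). Sharing one keyword does not by itself make two sentences neighbours.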
