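I am still in my early days of data science, so I cannot tell on my own whether the following is right or wrong.
I applied KMeans to a corpus of very short sentences to which I added some random documents. The documents were vectorized before fitting.
From the clustering result, I somehow expected each vector (keyword) to fall into exactly one cluster. That is not the case.
In some cases a vector falls into two clusters, and I would like to know why.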
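- Is this because KMeans is being misused on vectors built from documents?
- Is this simply normal given how KMeans works (centroids move, but objects are in fact assigned by distance to the nearest cluster)?
- Is the overlap there because, when analysing my results, I evaluate the whole group of items in a cluster rather than only (say) the top X closest to the centre?
- Example: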
corpus = [
'The car is driven on the road.',
'The truck is driven on the highway.',
'The train run on the tracks.',
'The bycicle is run on the pavement.',
'The flight is conducted in the air.',
'The baloon is conducted in the air.',
'The bird is flying in the air.',
'The man is walking in the street.',
'The pedestrian is crossing the zebra.',
'The pilot flights the plane].',
'On the route, the car is driven.',
'On the road, the truck is moved.',
'The train is running on the tracks.',
'The bike is running on the pavement.',
'The flight takes place in the sky.',
"Birds don't fly when is dark",
'The baloon is in the water.',
'The bird flies in the sky.',
'In the road, the guy walks.',
'The pedestrian is passing through the zebra.',
'The pilot is flying the plane.',
'This is a Japanese doll.',
'I really want to go to work, but I am too sick to drive.',
'Christmas is coming.',
"With the daylight saving time turned off it's getting dark soon.",
'The body fat may compensates for the loss of nutrients.',
'Mary plays the piano.',
'She always speaks to him in a loud voice.',
'Wow, does that work?',
"I don't like walking when it is dark",
'Last Friday in three week’s time I saw a spotted striped blue worm shake hands with a legless lizard.',
'My Mum tries to be cool by saying that she likes all the same things that I do.',
'Mummy is saying that she loves me being a pilot when in reality she is scared all the time I take off.',
'Where do random thoughts come from?',
'A glittering gem is not enough.',
'We need to rent a room for our party.',
'A purple pig and a green donkey flew a kite in the middle of the night and ended up sunburnt.',
'If I don’t like something, I’ll stay away from it.',
'The body may perhaps compensates for the loss of a true metaphysics.',
"Don't step on the broken glass.",
"It was getting dark, and we weren't there yet.",
'Playing an instrument like the guitar takes out the stress from my day.']
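from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Turn each sentence into a TF-IDF vector; terms found in more than 80% of the documents are dropped.
vectorizer = TfidfVectorizer(analyzer='word',
                             max_df=0.8,
                             max_features=50000,
                             lowercase=True)
X = vectorizer.fit_transform(corpus)

# Cluster the document vectors; fit_predict returns one cluster label per document.
num_clusters = 11
kmean = KMeans(n_clusters=num_clusters, random_state=1021)
clusters = kmean.fit_predict(X)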
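
For the second and third bullets, this is a minimal sketch of what I understand the assignment step to be: each vector gets the label of its single nearest centroid, and a "top X" view keeps only the X members closest to each centre. It uses plain Python with made-up 2-D points and fixed centroids. `assign_to_nearest` and `top_x_per_cluster` are hypothetical helpers written for illustration, not scikit-learn functions.

```python
import math


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its single nearest centroid.

    Ties go to the lowest centroid index.
    """
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def top_x_per_cluster(points, centroids, labels, x):
    """For each cluster, keep the indices of the x members closest to its centroid."""
    members = {k: [] for k in range(len(centroids))}
    for i, (p, k) in enumerate(zip(points, labels)):
        members[k].append((math.dist(p, centroids[k]), i))
    return {k: [i for _, i in sorted(m)[:x]] for k, m in members.items()}


# Hypothetical toy data (assumption: not taken from the corpus above).
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 4.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]

labels = assign_to_nearest(points, centroids)
closest = top_x_per_cluster(points, centroids, labels, x=2)

print(labels)   # [0, 0, 0, 1, 1] : exactly one label per point
print(closest)  # {0: [0, 1], 1: [3, 4]}
```

With the fitted model above, `kmean.transform(X)` returns the distance from each document to every centroid, which could feed the same top-X filter.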
--
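If you explore the clusters variable, you will notice the overlap I am talking about. For example, the keyword baloon appears in both cluster 10 and cluster 0.
There are 12 overlaps, which is about 1/3 of a dataset of 33 unique keywords, so I would not say I can be satisfied with that.
Any suggestions are appreciated. Thanks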
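
One way such an overlap could be counted is sketched below; it is not necessarily how the figures above were obtained. It assumes a keyword is any token that occurs in a document assigned to a cluster, and it approximates the default TfidfVectorizer tokenization with a regular expression (the `max_df` filter is ignored). The helper names `clusters_per_term` and `overlapping_terms`, the toy sentences, and the labels are illustrative only.

```python
import re
from collections import defaultdict

# Tokens of two or more word characters, mirroring TfidfVectorizer's default pattern.
TOKEN = re.compile(r"\b\w\w+\b")


def clusters_per_term(docs, labels):
    """Map each lowercase term to the sorted cluster labels of the documents containing it."""
    seen = defaultdict(set)
    for doc, label in zip(docs, labels):
        for term in TOKEN.findall(doc.lower()):
            seen[term].add(int(label))
    return {term: sorted(ids) for term, ids in seen.items()}


def overlapping_terms(docs, labels):
    """Keep only the terms that show up in more than one cluster."""
    return {
        term: ids
        for term, ids in clusters_per_term(docs, labels).items()
        if len(ids) > 1
    }


# Hypothetical toy input: three sentences with made-up cluster labels.
toy_docs = [
    "The baloon is conducted in the air.",
    "The baloon is in the water.",
    "The bird flies in the sky.",
]
toy_labels = [10, 0, 3]

overlaps = overlapping_terms(toy_docs, toy_labels)
print(overlaps["baloon"])  # [0, 10]
```

Calling `overlapping_terms(corpus, clusters)` on the variables above would list every term shared across clusters. Under this reading, a shared word only means that two sentences containing it were assigned to different clusters; each sentence still carries a single label.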