Clustering based on distance between points

Data mining, Python, clustering, unsupervised learning

2022-02-20 11:18:12

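I am trying to cluster geographical locations in such a way that all the locations inside each cluster are within a maximum of 25 miles of each other. For this I am using agglomerative clustering, with a custom distance function that calculates the distance between each pair of locations. I don't want to specify the number of clusters. Instead, I want the model to keep clustering until all the locations in each cluster are within 25 miles of one another. I have tried doing this in both SciPy and scikit-learn, but have not made any progress. Below is what I tried; it gives me only one cluster. Please help. Thanks in advance.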

from scipy.cluster.hierarchy import fclusterdata 
max_dist = 25
# dist is a custom function that calculates the distance (in miles) between two locations using the geographical coordinates

fclusterdata(locations_in_RI[['Latitude', 'Longitude']].values, t=max_dist, metric=dist, criterion='distance')

1 Answer

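I think that for HAC (hierarchical agglomerative clustering) it is always helpful to obtain the linkage matrix first, since it gives you insight into how the clusters are formed iteratively. In addition, scipy provides a dendrogram function for visualizing how the clusters form, which helps you avoid treating the clustering process as a "black box".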

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# generate the linkage matrix
X = locations_in_RI[['Latitude', 'Longitude']].values
# method='complete': the dissimilarity between two clusters is the max
# distance across all pairs of records between the two clusters
Z = linkage(X, method='complete', metric='euclidean')
# you can peek into the Z matrix to see how clusters are merged at each
# iteration of the algorithm

# calculate full dendrogram and visualize it
plt.figure(figsize=(30, 10))
dendrogram(Z)
plt.show()

# retrieve clusters with `max_d`
from scipy.cluster.hierarchy import fcluster
# I assume that your `Latitude` and `Longitude` columns are both in units of miles
max_d = 25
clusters = fcluster(Z, max_d, criterion='distance')
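
To make "peeking into the Z matrix" concrete, here is a minimal sketch on four made-up points on a line (the array X_toy and its values are hypothetical, not the asker's data). Each row of the linkage matrix records one merge: the two cluster ids that were joined, the distance at which they were joined, and the size of the resulting cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical 1-D points (illustrative only).
X_toy = np.array([[0.0], [1.0], [5.0], [7.0]])
Z_toy = linkage(X_toy, method='complete', metric='euclidean')

# Each row of Z: [cluster a, cluster b, merge distance, size of new cluster].
# Original points have ids 0..n-1; the cluster created by row i gets id n + i.
n = len(X_toy)
for i, (a, b, dist, size) in enumerate(Z_toy):
    print(f"step {i}: merge {int(a)} and {int(b)} at distance {dist:.1f} "
          f"-> cluster {n + i} ({int(size)} points)")
```

With complete linkage the merge distance in the third column is the largest pairwise distance between the two clusters being joined, which is why cutting the tree at a threshold bounds the diameter of every resulting cluster.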

Here clusters is an array of cluster IDs, one per location, which is what you want.
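
The linkage call in this answer uses Euclidean distance and assumes the coordinates are already in miles. If the columns actually hold degrees, one option is to build the pairwise distances in miles first and pass them to linkage. The sketch below is an illustration under assumptions: haversine_miles is a hypothetical stand-in for the asker's custom dist function, the Earth radius is an approximation, and the coordinates are made up. It may also be worth noting that fclusterdata defaults to method='single' when no method is given, and single linkage can chain nearby points into one large cluster, whereas complete linkage bounds the largest pairwise distance inside each cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)

def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([p[0], p[1], q[0], q[1]])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Hypothetical (lat, lon) pairs: three close together, one far to the north-east.
coords = np.array([
    [41.824, -71.413],
    [41.780, -71.437],
    [41.700, -71.416],
    [42.360, -71.059],
])

max_d = 25  # miles
D = pdist(coords, metric=haversine_miles)  # condensed pairwise distances in miles
Z = linkage(D, method='complete')          # linkage accepts a condensed distance matrix
labels = fcluster(Z, t=max_d, criterion='distance')

# Sanity check: the largest pairwise distance inside each cluster stays within max_d.
square = squareform(D)
diameters = {}
for label in np.unique(labels):
    members = np.where(labels == label)[0]
    diameters[int(label)] = square[np.ix_(members, members)].max()
print(labels, diameters)
```

With real data, coords would be replaced by locations_in_RI[['Latitude', 'Longitude']].values, and the custom dist function from the question could be passed to pdist in place of haversine_miles.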

There is a very helpful (though somewhat long) post on HAC that is worth a read.
