k-modes: optimal k

Data mining, Machine learning, Python, Clustering, k-means
2022-02-27 14:24:54

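I have categorical data, and I am trying to implement k-modes using the GitHub package available here. I want to create clusters in my (large) dataset of, say, 5-7 records each, with the records in each cluster being as similar to one another as possible.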
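Since k-modes takes the number of clusters rather than a cluster size, a target of 5-7 records per cluster only pins down a rough range for k. A small sketch of that arithmetic follows; the function name and the figure of 10,000 records are made up for illustration, and k-modes will not force clusters to have these sizes.

```python
import math

def candidate_k_range(n_records, min_size=5, max_size=7):
    """Range of k for which the *average* cluster size lies in [min_size, max_size]."""
    k_low = math.ceil(n_records / max_size)   # fewest clusters: all of maximum size
    k_high = n_records // min_size            # most clusters: all of minimum size
    return k_low, k_high

# Hypothetical dataset size, for illustration only.
k_low, k_high = candidate_k_range(10_000)
print(k_low, k_high)
```

Any search over k (silhouette or elbow) could then be restricted to this interval instead of starting from k = 2.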

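However, so far I have not found a way to choose the best k, which ideally would be the one that gives the highest silhouette score. The silhouette would be a natural fit because k-modes uses a dissimilarity measure as its distance. I therefore assume the silhouette would be computed with that same dissimilarity-based distance, measuring how close each record is to its own cluster compared with the other clusters. I could not find an implementation of this.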
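For reference, the standard silhouette definition can be written with the simple matching dissimilarity in place of a Euclidean distance. The notation here is assumed for illustration and is not taken from the package: $x_i$ is record $i$ with $m$ categorical attributes, and $C_i$ is the cluster that record $i$ was assigned to.

```latex
d(x_i, x_j) = \sum_{l=1}^{m} \mathbf{1}\left[ x_{il} \neq x_{jl} \right]

a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\; j \neq i} d(x_i, x_j)

b(i) = \min_{C \neq C_i} \; \frac{1}{|C|} \sum_{j \in C} d(x_i, x_j)

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

The score for a given k is the mean of $s(i)$ over all records, with $s(i) = 0$ by convention when $|C_i| = 1$.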

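Can I use the elbow method here? If so, I do not understand how to determine the elbow programmatically, without looking at a plot, since I have to repeat this process many times. My current idea is: find the k at which the cost drops sharply, then check whether the next few values of k still reduce the cost. If yes, choose this as k; if not... then what? I am a bit confused at this point.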
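One common way to locate an elbow without a plot is to take the point on the cost curve that lies farthest below the straight line joining its first and last points. The sketch below is a heuristic under stated assumptions, not the package's method: `elbow_k` is a made-up helper name, the cost values are invented, and the curve is assumed to be decreasing with at least two distinct k values.

```python
import numpy as np

def elbow_k(ks, costs):
    """Return the k whose (k, cost) point lies farthest below the chord
    joining the first and last points of the cost curve."""
    ks = np.asarray(ks, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Rescale both axes to [0, 1] so the result does not depend on units.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (costs - costs[-1]) / (costs[0] - costs[-1])
    # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. y = 1 - x.
    # The elbow is taken as the point with the largest gap below that chord.
    gap = (1.0 - x) - y
    return int(ks[np.argmax(gap)])

# Invented costs for illustration; in practice these would be the
# clustering cost recorded for each fitted k.
ks = [1, 2, 3, 4, 5, 6, 7, 8]
costs = [1000, 600, 350, 300, 280, 265, 255, 250]
print(elbow_k(ks, costs))
```

Because the result depends on the range of k that was scanned, it is worth keeping that range fixed across repeated runs.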

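I also searched online and found this, which I could not work out how to apply to k-modes. I am looking for any code or suggestions to put me on the right path.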

3 Answers

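Rather than trying to find somewhere to download source code, why not implement it yourself, e.g. the silhouette?

A lot of the code you find online in blogs and repositories is broken.

I have seen many GitHub repositories with bad code, and people like you wondering why it does not work. Relying on anonymous others not to make mistakes is a bad idea. At some point, you are better off writing the code yourself!

Of course, you can rely on large open-source projects such as sklearn, R, ELKI, and Weka. These have code review, discussion of pull requests, and dozens of people looking at the code, using it, and trying to find and fix bugs (though even there, bugs exist in the code).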
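As one way to lean on a reviewed library here, scikit-learn's `silhouette_score` accepts a precomputed distance matrix, so a matching dissimilarity can be supplied directly. This is a minimal sketch, assuming the full n×n matrix fits in memory; the toy array `X`, the `labels`, and the helper name `matching_distance_matrix` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def matching_distance_matrix(X):
    """Pairwise count of mismatching attributes between all rows of X."""
    X = np.asarray(X)
    # Broadcasting compares every row with every other row: shape (n, n, m).
    mismatches = X[:, None, :] != X[None, :, :]
    return mismatches.sum(axis=2).astype(float)

# Toy categorical records and labels (hypothetical; in practice the
# labels would come from the k-modes fit).
X = np.array([["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]])
labels = np.array([0, 0, 1, 1])

D = matching_distance_matrix(X)
score = silhouette_score(D, labels, metric="precomputed")
print(score)
```

For a large dataset the intermediate (n, n, m) comparison array becomes expensive, so this is better suited to a sample of the records or to cross-checking a hand-written implementation.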

import numpy as np

def matching_dissimilarity(a, b):
    # Count the attributes on which each record in a differs from record b.
    return np.sum(a != b, axis=-1)

def average_silhouette(m_array, cluster_labels):
    # m_array: 2-D array of categorical records (one row per record).
    # cluster_labels: predicted cluster label for each row (needs >= 2 clusters).
    m_array = np.asarray(m_array)
    cluster_labels = np.asarray(cluster_labels)
    distinct_cluster_labels = np.unique(cluster_labels)
    silhouette = np.zeros(len(m_array))

    for idx, i in enumerate(m_array):
        # Dissimilarity between i and every record (0 for i itself).
        dist_i = matching_dissimilarity(m_array, i)
        in_cluster = cluster_labels == cluster_labels[idx]
        n_others = in_cluster.sum() - 1
        if n_others == 0:
            # i is alone in its cluster: a(i) is undefined, so s(i) = 0 by convention.
            continue

        # a(i): average dissimilarity between i and all other points of the
        # cluster to which i belongs.
        a = dist_i[in_cluster].sum() / n_others

        # b(i): average dissimilarity to the nearest other cluster, i.e. the
        # smallest per-cluster average. Ties need no special handling because
        # only the value is used, not the identity of the cluster.
        b = min(dist_i[cluster_labels == c].mean()
                for c in distinct_cluster_labels if c != cluster_labels[idx])

        if a < b:
            sil = 1 - a / b
        elif a == b:
            sil = 0
        else:
            sil = b / a - 1
        silhouette[idx] = sil

    return silhouette.mean()

# m_array and cluster_labels come from your data and your k-modes fit.
average_silhouette_score = average_silhouette(m_array, cluster_labels)

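Typically, you would choose the number of clusters associated with the highest silhouette value, but this can be tricky because the difference in silhouette value between X and Y clusters can be negligible. Have you tried generating a silhouette plot? A silhouette plot lets you visualize how close the data in each cluster are to their assigned cluster, on a scale from -1 to 1, with the cluster number on the vertical axis.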
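When the selection has to be automated, one way to handle near-ties is to treat scores within a small tolerance of the best as equivalent and apply a tie-breaking rule. The sketch below prefers the smallest such k; that preference, the tolerance of 0.01, the helper name `pick_k`, and the scores are all assumptions made for illustration.

```python
def pick_k(scores, tol=0.01):
    """Return the smallest k whose average silhouette is within `tol` of the best.

    scores: dict mapping k -> average silhouette score for that k.
    """
    best = max(scores.values())
    return min(k for k, s in scores.items() if s >= best - tol)

# Invented scores for illustration: k = 4 is best, but k = 3 is
# negligibly worse, so the tolerant rule picks k = 3.
scores = {2: 0.41, 3: 0.515, 4: 0.52, 5: 0.47}
print(pick_k(scores))           # tolerant choice
print(pick_k(scores, tol=0.0))  # strict argmax
```

If larger k is preferable, as it might be when small clusters are wanted, replacing `min` with `max` reverses the tie-break.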

https://github.com/nicodv/kmodes/issues/46