data mining - How to tune/select the preference parameter of AffinityPropagation?

I have a large dictionary of pairwise similarity matrices. Each one looks like this:

similarities['group1']:

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.09      , 0.09      , 0.        ],
       [0.        , 0.09      , 1.        , 0.94535157, 0.        ],
       [0.        , 0.09      , 0.94535157, 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ]])

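In short, each element of the matrix above is the probability that record_i and record_j are similar (values range from 0 to 1 inclusive), with 1 meaning exactly similar and 0 meaning completely different.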
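To make that description concrete, here is a minimal sketch of a sanity check on one such matrix. The helper name `is_valid_similarity` and the tolerance are assumptions for illustration, not part of the original code. It only checks the properties stated above: square, symmetric, values in [0, 1], and ones on the diagonal.

```python
def is_valid_similarity(matrix, tol=1e-9):
    """Return True if `matrix` is square, symmetric, within [0, 1], with 1s on the diagonal.

    Hypothetical helper; works on nested lists or a 2-D NumPy array.
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False  # not square
    for i in range(n):
        if abs(matrix[i][i] - 1.0) > tol:
            return False  # a record must be fully similar to itself
        for j in range(n):
            value = matrix[i][j]
            if not (-tol <= value <= 1.0 + tol):
                return False  # not a probability
            if abs(value - matrix[j][i]) > tol:
                return False  # similarity must be symmetric
    return True


# The example matrix from the question.
group1 = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.09, 0.09, 0.0],
    [0.0, 0.09, 1.0, 0.94535157, 0.0],
    [0.0, 0.09, 0.94535157, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

print(is_valid_similarity(group1))
```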

I then feed each similarity matrix into an AffinityPropagation algorithm in order to group/cluster similar records:

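from sklearn.cluster import AffinityPropagation

sim = similarities['group1']

# affinity='precomputed' means `sim` is used directly as the similarity matrix
clusterer = AffinityPropagation(affinity='precomputed',
                                damping=0.5,
                                max_iter=25000,
                                convergence_iter=2500,
                                preference=????)  # ISSUE here

affinity = clusterer.fit(sim)

# indices of the exemplar records, and the cluster label assigned to each record
cluster_centers_indices = affinity.cluster_centers_indices_
labels = affinity.labels_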

However, since I run the above on many similarity matrices, I need a general value for the preference parameter, and I can't seem to tune it.

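The documentation says it defaults to the median of the similarity matrix, but with that setting I get a lot of false positives. The mean works sometimes, but at other times it gives too many clusters, and so on.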
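For reference, the candidate values mentioned here can be computed directly from a matrix. This is a sketch using only the standard library. The function name `preference_candidates` is hypothetical, and treating the default as the median over all entries (diagonal included) is an assumption based on the documentation's wording. The off-diagonal variants are shown alongside, since the self-similarities of 1 can shift both statistics.

```python
from statistics import mean, median


def preference_candidates(matrix):
    """Return a few summary statistics that could serve as a preference value.

    Hypothetical helper; `matrix` is a square nested list or 2-D array.
    """
    n = len(matrix)
    all_values = [matrix[i][j] for i in range(n) for j in range(n)]
    # Exclude self-similarities, which are always 1.
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        "median_all": median(all_values),
        "median_off_diagonal": median(off_diagonal),
        "mean_off_diagonal": mean(off_diagonal),
    }


# The example matrix from the question.
group1 = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.09, 0.09, 0.0],
    [0.0, 0.09, 1.0, 0.94535157, 0.0],
    [0.0, 0.09, 0.94535157, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

for name, value in preference_candidates(group1).items():
    print(name, value)
```

On a sparse matrix like group1, where most off-diagonal entries are 0, both medians come out as 0 while the off-diagonal mean does not, which may be part of why a single statistic does not behave the same way across matrices.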

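For example, these are the results I get from one similarity matrix when playing with the preference parameter:

preference = default # which is the median (value 0.2) of the similarity matrix: (incorrect results; record 18 should not be there, because its similarity to the other records is very low):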

 # Indexes of the elements in Cluster n°5: [15, 18, 22, 27]

 {'15_18': 0.08,
 '15_22': 0.964546229533378,
 '15_27': 0.6909703138051403,
 '18_22': 0.12,    # Not Ok, the similarity is too low
 '18_27': 0.19,    # Not Ok, the similarity is too low
 '22_27': 0.6909703138051403}

preference = 0.2 in fact from 0.11 to 0.26: (correct results, since the records are similar):

 # Indexes of the elements in Cluster n°5: [15, 22, 27]

 {'15_22': 0.964546229533378,
 '15_27': 0.6909703138051403,
 '22_27': 0.6909703138051403}
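
The judgment applied to the two results above (record 18 does not belong because all of its similarities are low) can be automated as a post-clustering check. This is a rough sketch, not a substitute for tuning preference: `weak_members` and the 0.5 threshold are assumptions chosen for illustration, and the pair keys follow the 'i_j' format shown in the output.

```python
def weak_members(pairs, members, threshold=0.5):
    """Return the members whose best similarity to any other member is below `threshold`.

    Hypothetical helper. `pairs` maps 'i_j' (with i < j) to a similarity value.
    """
    weak = []
    for m in members:
        best = max(
            (pairs[f"{min(m, o)}_{max(m, o)}"] for o in members if o != m),
            default=1.0,  # a singleton cluster has nothing to disagree with
        )
        if best < threshold:
            weak.append(m)
    return weak


# Pairwise similarities of cluster 5, copied from the first result.
pairs = {
    '15_18': 0.08,
    '15_22': 0.964546229533378,
    '15_27': 0.6909703138051403,
    '18_22': 0.12,
    '18_27': 0.19,
    '22_27': 0.6909703138051403,
}

print(weak_members(pairs, [15, 18, 22, 27]))
print(weak_members(pairs, [15, 22, 27]))
```

A check like this only flags suspicious assignments after the fact. It does not say which preference value would have avoided them.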

My question is about preference: how should I choose this parameter in a generic way?