Data mining - Anomaly detection using k-means clustering in Python

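I am working on an anomaly detection task in Python.
The dataset is a collection of time series coming from sensors, so each record consists of a timestamp and the corresponding value.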

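To find anomalies, I use the k-means clustering algorithm. I have split the dataset into a training part and a test part, and the test part is further split into individual days.
The model is trained on the training part of the dataset, and predictions are made one day at a time.
I do it this way because that is how it will be used in production.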
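The workflow described above can be sketched roughly as follows. This is only an illustration on synthetic data: the column name `value`, the 7-day split point, and `n_clusters=3` are assumptions, not values from my real setup.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic sensor readings: one value per minute over 10 days (assumption).
rng = np.random.default_rng(0)
index = pd.date_range("2021-01-01", periods=10 * 24 * 60, freq="min")
df = pd.DataFrame({"value": rng.normal(20.0, 2.0, len(index))}, index=index)

# First 7 days for training, the remaining days for testing (arbitrary split).
split = index[0] + pd.Timedelta(days=7)
train = df[df.index < split]
test = df[df.index >= split]

# Fit once on the training part.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train[["value"]])

# Predict one calendar day at a time, as it would run in production.
labels_per_day = {}
for day_start, day in test.groupby(test.index.normalize()):
    labels_per_day[day_start] = km.predict(day[["value"]])
```

Each entry of `labels_per_day` holds the cluster label of every record of that day; the model itself is never refitted on test data.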

To decide whether a record is anomalous, I compute the distance between each point and its nearest centroid:

import pandas as pd
from scipy import spatial

# Cluster assigned to each record of the day (not used below).
_clusters = self.km.predict(day)
centroids = self.km.cluster_centers_

# Distance between every record and every centroid:
# rows are records, columns are centroid ids.
distance_matrix = spatial.distance_matrix(day, centroids)

# For each record, keep only the distance to its nearest centroid.
nearest_distances = pd.Series(distance_matrix.min(axis=1))
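As a side note, the same nearest-centroid distances can probably be obtained without building the matrix by hand, since `KMeans.transform` returns the Euclidean distance from each sample to every cluster centre. A minimal sketch on synthetic 2-D data (the arrays `train` and `day` here are made up for the comparison):

```python
import numpy as np
from scipy import spatial
from sklearn.cluster import KMeans

# Synthetic stand-ins for the training data and one day of records.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))
day = rng.normal(size=(100, 2))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(train)

# Row-wise minimum of the record-to-centroid distance matrix.
nearest_via_matrix = spatial.distance_matrix(day, km.cluster_centers_).min(axis=1)

# KMeans.transform gives the distance of each sample to every cluster centre,
# so its row-wise minimum is the distance to the nearest centroid.
nearest_via_transform = km.transform(day).min(axis=1)
```

Both arrays should agree up to floating-point error, so either one can feed the thresholding step.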

Then, using a threshold, I flag the anomalies:

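self.outliers_fraction = 0.01
# Number of records to flag: a fixed fraction of the day's records.
number_of_outliers = int(self.outliers_fraction * len(nearest_distances))
# Threshold: the smallest of the number_of_outliers largest distances.
threshold = nearest_distances.nlargest(number_of_outliers).min()

# .to_numpy() assigns by position, in case day_df's index differs from
# the default RangeIndex of nearest_distances.
day_df['anomaly'] = (nearest_distances >= threshold).astype(int).to_numpy()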

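This code works, but I get a large number of false positives.
The dataset is not labeled, but this is evident from inspecting the results.
The cause is that the threshold is derived from outliers_fraction, which is set to 0.01, a completely arbitrary value.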
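To make the problem concrete: a per-day fraction flags roughly 1% of every day's records, even on a day with no anomalies at all. The sketch below shows this on synthetic distances, next to one possible alternative that is only an assumption on my part: fixing the threshold once from the training distances as median plus `k` times the median absolute deviation (MAD). The value `k = 6.0` is hypothetical and would need tuning.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for nearest-centroid distances (no real anomalies).
rng = np.random.default_rng(0)
train_dist = pd.Series(np.abs(rng.normal(0.0, 1.0, 10_000)))
clean_day = pd.Series(np.abs(rng.normal(0.0, 1.0, 1_000)))

# Per-day fraction: flags n_out records (barring ties) although the day is clean.
outliers_fraction = 0.01
n_out = int(outliers_fraction * len(clean_day))
per_day_threshold = clean_day.nlargest(n_out).min()
flags_per_day = int((clean_day >= per_day_threshold).sum())

# Alternative (assumption): one threshold computed from the training
# distances and reused unchanged for every day.
k = 6.0  # hypothetical multiplier
median = train_dist.median()
mad = (train_dist - median).abs().median()
fixed_threshold = median + k * mad
flags_fixed = int((clean_day >= fixed_threshold).sum())
```

With a fixed threshold, the number of flagged records depends on how far the day's distances actually deviate from what was seen in training, rather than being forced to a constant share of each day.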

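Since I cannot know the "right" threshold in advance, I would like to ask whether there is a better way to find anomalies in this case, still using the k-means clustering algorithm.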