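I don't know whether my approach is more accurate than yours, but I think you can draw some insights from it and use them to improve your results further.
Unlike your approach of using an ensemble of models over the whole dataset, I tried to exploit the fact that the given dataset contains clusters of land (e.g. continents), so I fit one OneClassSVM per cluster: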
- Data preparation:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM
plt.rcParams.update({"figure.facecolor": "w"})
earth_df = pd.read_csv("Earth.txt", sep=' ', names=['X', 'Y'], header=None)
- Clustering with KMeans(n_clusters=5):
kmeans = KMeans(n_clusters=5)
kmeans.fit(earth_df)
clusters = kmeans.predict(earth_df)
centroids = kmeans.cluster_centers_
earth_df['cluster_no'] = clusters
earth_df.head()
Output: (table of the first five rows, with columns X, Y and cluster_no)
- Visualization with color coding:
colors = ['green', 'red', 'black', 'yellow', 'maroon']
clusters = range(5)
_, ax = plt.subplots(figsize=(7,7))
ax.axis('off')
ax.set_title("Color-Coded Clusters")
for cluster_no, color in zip(clusters, colors):
    cluster = earth_df[earth_df['cluster_no'] == cluster_no]
    ax.scatter(cluster.X, cluster.Y, color=color, marker='.')
ax.legend(clusters)
ax.scatter(centroids[:,0], centroids[:,1], marker='x', color='cyan', s=150)
plt.show();
Output: (scatter plot "Color-Coded Clusters", with the centroids marked as cyan crosses)
- Fit one SVM per cluster:
svms = []
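for cluster_no in clusters:
    # Same hyperparameters for every cluster; max_iter=-1 means no iteration limit
    svm = OneClassSVM(kernel='rbf', gamma=0.0025, nu=0.2,
                      tol=0.001, shrinking=True, max_iter=-1)
    # Train only on this cluster's points, without the label column
    cluster = earth_df[earth_df['cluster_no'] == cluster_no]
    cluster = cluster.drop(columns='cluster_no')
    svm.fit(cluster)
    svms.append(svm)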
- Visualize the results as decision function and prediction, separately for each SVM:
data = earth_df.drop(columns='cluster_no')
_, axs = plt.subplots(5, 2, figsize=(14, 30))
for i in clusters:
    # Score every point in the dataset with the SVM trained on cluster i
    df = svms[i].decision_function(data)
    prediction = svms[i].predict(data)
    ax_df, ax_pred = axs[i]
    ax_df.axis('off')
    ax_df.set_title(f"Decision Function for Cluster-{i} SVM")
    ax_df.scatter(data.X, data.Y, c=df, cmap='coolwarm')
    ax_pred.axis('off')
    ax_pred.set_title(f"Prediction for Cluster-{i} SVM")
    ax_pred.scatter(data.X, data.Y, c=prediction, cmap='coolwarm')
plt.show();
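Output: (5x2 grid of scatter plots, one row per cluster SVM: decision function on the left, prediction on the right)
- Visualize two dummy locations: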
fig, ax = plt.subplots(figsize=(6, 6))
ax.axis('off')
ax.set_title("Points for earth")
custom_points = np.array([[-12, -36], [2, 7]])
ax.scatter(earth_df['X'], earth_df['Y'], color='black', marker='.')
ax.scatter(custom_points[:,0], custom_points[:,1], color='cyan')
plt.show();
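Output: (scatter plot "Points for earth", with the two dummy locations in cyan)
- Predict for the dummy locations: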
for i, svm in enumerate(svms):
    print(f"For SVM-{i}:", svm.decision_function(custom_points))
Output:
For SVM-0: [-5.5578652 -5.55743803]
For SVM-1: [-5.12195068 -5.1219504 ]
For SVM-2: [-7.12232844 -6.28086617]
For SVM-3: [-5.38072626 -5.43922668]
For SVM-4: [-4.0920019 0.02235967]
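I did not tune any hyperparameters and simply used the ones you provided. A point that represents land should hopefully score well on at least one of the SVMs, and by thresholding the decision function (instead of using .predict() directly) you can control how far from the land a point may lie before it starts being predicted as an outlier.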
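To make the "score well on at least one of the SVMs" idea concrete, here is a minimal sketch that takes each point's best score across the cluster SVMs and compares it with a cutoff. The helper names best_scores and is_land and the threshold parameter are my own and not part of the code above; the numbers are the scores printed for the two dummy locations. A cutoff of 0 should roughly mirror what .predict() decides for a single SVM.

```python
def best_scores(per_svm_scores):
    """Return, for each point, its highest score across all cluster SVMs."""
    # per_svm_scores[i][j] is the decision-function value of SVM i for point j,
    # e.g. [svm.decision_function(custom_points) for svm in svms].
    return [max(point_scores) for point_scores in zip(*per_svm_scores)]


def is_land(per_svm_scores, threshold=0.0):
    """Flag a point as land if at least one SVM scores it above `threshold`."""
    return [score > threshold for score in best_scores(per_svm_scores)]


# Scores printed above for the two dummy locations (rows: SVM-0 .. SVM-4).
printed_scores = [
    [-5.5578652, -5.55743803],
    [-5.12195068, -5.1219504],
    [-7.12232844, -6.28086617],
    [-5.38072626, -5.43922668],
    [-4.0920019, 0.02235967],
]

print(best_scores(printed_scores))
print(is_land(printed_scores))
print(is_land(printed_scores, threshold=0.05))
```

With the default cutoff, the second dummy point (2, 7) is accepted through SVM-4 while the first (-12, -36) is rejected by all five; raising the cutoff to 0.05 rejects both.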
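Ideally we would choose that cutoff with a validation set, but I skipped that part. This approach uses more memory than the basic 3-model ensemble, but it gives each model a simpler subtask.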
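One way the skipped validation step could look, purely as an assumption on my part: hold out some known land points, compute each one's best score across the SVMs, and pick the cutoff that still accepts a chosen fraction of them. The function threshold_for_recall, its target_recall parameter, and the held-out scores below are hypothetical and made up for illustration.

```python
import math


def threshold_for_recall(land_scores, target_recall=0.95):
    """Pick a cutoff so that at least `target_recall` of the held-out
    land points have score >= cutoff."""
    if not land_scores:
        raise ValueError("need at least one held-out score")
    ordered = sorted(land_scores, reverse=True)
    # Number of held-out points that must remain accepted (between 1 and all).
    keep = min(len(ordered), max(1, math.ceil(target_recall * len(ordered))))
    return ordered[keep - 1]


# Made-up best-over-SVMs scores for ten held-out land points.
held_out_scores = [0.30, 0.12, 0.05, -0.02, -0.40, 0.21, 0.08, -0.10, 0.15, 0.02]

cutoff = threshold_for_recall(held_out_scores, target_recall=0.8)
accepted = sum(score >= cutoff for score in held_out_scores)
print(cutoff, accepted)
```

A lower target_recall gives a stricter cutoff; if labeled non-land points are available as well, they could be used to check how many of them the chosen cutoff lets through.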