Theory
Clustering is unlikely to work here because your red points are separated from one another by green points. You could use more clusters, but that would require a lot of manual inspection and fiddling.
The standard way to solve this kind of problem is with a non-linear support vector machine (SVM). Briefly, the idea is that even though your data may not be separable in two dimensions, projecting it into a higher-dimensional space guarantees a clean separator, which can then be used to classify future points of interest. Furthermore, we can choose this separator so that its distance to the nearest data points is maximized.
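For concreteness, the radial basis function (RBF) kernel used below is conventionally written as

$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2\right),$$

and the fitted SVM classifies a new point $\mathbf{x}$ by the sign of

$$f(\mathbf{x}) = \sum_i \alpha_i y_i \, K(\mathbf{x}_i, \mathbf{x}) + b,$$

where the sum runs only over the support vectors (the circled points in the figures below).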
Here are a few pictures illustrating the idea:
Results
I extracted your data with WebPlotDigitizer, in arbitrary but aspect-ratio-preserving units, and then ran it through a support vector machine with a radial basis function kernel. This gives the following separator:
Note that, as promised, this separates your data cleanly into two classes. The separator itself is the solid gray line, while the margins around it (the distance between the separator and the data) are indicated by dashed lines. The distance between the separator and these margins is the maximum achievable by this kernel compared with all other choices of separator. The data points that give rise to the margins (the support vectors) are circled.
If we add a hundred new points and classify them with our SVM, we get:
Implementation
I used Python to implement the above idea.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm
# Your data
data = [
[ 0.7490599550363091 , 9.24443743612264 , 0],
[ 4.12034765121126 , 8.885539916739752 , 0],
[ 7.6539862433328665 , 9.612037680925432 , 0],
[ 7.914094745271155 , 0.9253511681985884 , 0],
[ 8.994538029638242 , 0.5077641947442704 , 0],
[ 9.371974940345583 , 9.608153819994236 , 0],
[12.320128383367402 , 6.088728506174032 , 0],
[12.411426660363531 , 4.715352126892222 , 0],
[12.558352293037347 , 11.229593835418646 , 0],
[13.114336626127825 , 5.703722810531428 , 0],
[13.381386496538893 , 9.854563219073547 , 0],
[13.886274645038084 , 8.480251836234228 , 0],
[14.13510342320811 , 6.595572357695316 , 0],
[14.463275899337807 , 9.915985760466931 , 0],
[15.05984795641495 , 7.296033868971857 , 0],
[ 3.096302985685499 , 6.907935469254409 , 1],
[ 4.878692156862327 , 7.159379502874164 , 1],
[ 7.4911533079089345 , 8.366972559074298 , 1],
[ 7.986689890548966 , 3.89506631355318 , 1],
[ 8.498712223311847 , 4.883868537295852 , 1],
[ 9.869150457208352 , 5.679125024633843 , 1],
[10.527327159481406 , 2.3884159656508306 , 1],
[10.671071331605198 , 7.848836741512311 , 1],
[12.788477939489049 , 3.1497246315161327 , 1],
[12.86531503216689 , 7.524534353757313 , 1],
]
# Load data into numpy
data = np.array(data)
# Separate data into x and y values; predictors and observations
data_x = data[:,0:2] # x-value/predictor
data_y = data[:,2] # y-value/observation
# In these "learning" style problems you often want to avoid over-fitting, so
# it might make sense to split your data into training and test sets. Here we
# simply use all data as the training data.
x_train = data_x[:,:]
y_train = data_y[:]
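# As a sketch only (not used in the rest of this script): scikit-learn's
# train_test_split is the usual way to hold out a test set, e.g.
#   from sklearn.model_selection import train_test_split
#   x_train, x_test, y_train, y_test = train_test_split(
#       data_x, data_y, test_size=0.2, random_state=0)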
# Using the RBF (radial basis function) kernel with an SVM projects the points
# into a higher-dimensional space where a clean boundary is possible and then
# lowers that boundary back into the space the points live in.
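# Note: C=1e6 is a deliberately large penalty on misclassified training points,
# which pushes the fit toward a hard margin; a smaller C would tolerate some
# misclassification in exchange for a wider margin.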
model = svm.SVC(kernel='rbf', C=1e6)
model.fit(x_train, y_train)
def plot_model(data_x, data_y, predict_x=None, predict_y=None) -> None:
    # Let's plot some stuff
    fig, ax = plt.subplots(figsize=(12, 7))
    # Create grid to evaluate model
    xx = np.linspace(-1, max(data_x[:,0]) + 1, len(data_x))
    yy = np.linspace(0, max(data_x[:,1]) + 1, len(data_x))
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    # Assigning different colors to the classes
    colors = data_y
    colors = np.where(colors == 1, '#8C7298', '#4786D1')
    ax.scatter(data_x[:,0], data_x[:,1], c=colors)
    if predict_x is not None:
        assert predict_y is not None
        colors = predict_y
        colors = np.where(colors == 1, '#8C7298', '#4786D1')
        ax.scatter(predict_x[:,0], predict_x[:,1], c=colors, s=4)
    # Get the separator
    Z = model.decision_function(xy).reshape(XX.shape)
    # Draw the decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
    # Highlight support vectors with a circle around them
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=100, linewidth=1, facecolors='none', edgecolors='k')
plot_model(data_x, data_y)
plt.show()
# Generate some previously-unseen data
new_x = np.random.uniform(low=min(data_x[:,0]), high=max(data_x[:,0]), size=(100,1))
new_y = np.random.uniform(low=min(data_x[:,1]), high=max(data_x[:,1]), size=(100,1))
new_data = np.hstack((new_x, new_y))
predictions_for_new_data = model.predict(new_data)
plot_model(data_x, data_y, new_data, predictions_for_new_data)
plt.show()
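Note that SVC's gamma parameter is left at its default here; if you had more data, tuning gamma together with C (for example with scikit-learn's GridSearchCV and cross-validation) would be the natural next step, though the right values are a judgment call for your particular problem.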