Loop problem in Python

Data Mining, Python, Clustering, Plotting

2022-03-07 22:53:42

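I am trying to plot the results of DBSCAN clustering. My data was grouped into two clusters, but when I run the code to plot them I get a NameError, and I can't work out what the problem is. Here is the code that produces the error: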

for i in range(0, reduced_data.shape[0]):
    if dbscan.labels_[i] == 0:
        c1 = plt.scatter(reduced_data[i,0],reduced_data[i,1],c='r',marker='+')
    elif dbscan.labels_[i] == 1:
        c2 = plt.scatter(reduced_data[i,0],reduced_data[i,1],c='g',marker='o')
    elif dbscan.labels_[i] == -1:
        c3 = plt.scatter(reduced_data[i,0],reduced_data[i,1],c='b',marker='*')
    plt.legend([c1, c2, c3], ['Cluster 1', 'Cluster 2','Noise'])
    plt.title('DBSCAN finds 2 clusters and noise')
    plt.show()

Edit:

The rest of my code:

feature_cols = ['age','workclass','fnlwgt','education','education num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country']
X = train[feature_cols]
y = train['label']

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=10)

X_train_scale = scale(X_train)
X_test_scale = scale(X_test)

reduced_data = PCA(n_components=2).fit_transform(X_train_scale)
reduced_data_test = PCA(n_components=2).fit_transform(X_test_scale)

from pylab import *
xx, yy = zip(*reduced_data)
scatter(xx,yy)
show()

dbscan = DBSCAN(eps=0.3, min_samples=10).fit(reduced_data)
labels=dbscan.labels_
print(labels)

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print (n_clusters_)

The value of n_clusters_ comes out as 2.

The main data before the PCA reduction looks like this:

1 Answer

In your particular case you have only 2 clusters, but that will not always be so. I would allow for more flexibility.
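As for the NameError itself, it most likely comes from the indentation in the question: plt.legend([c1, c2, c3], ...) sits inside the for loop, so it runs on the very first point, before all three of c1, c2 and c3 have been assigned. Below is a minimal sketch of that situation without matplotlib. The label list and the string "handles" are hypothetical stand-ins for dbscan.labels_ and the objects returned by plt.scatter.

```python
# Hypothetical labels standing in for dbscan.labels_
labels = [0, 1, -1, 0]

error_message = None
try:
    for label in labels:
        if label == 0:
            c1 = "handle for cluster 1"
        elif label == 1:
            c2 = "handle for cluster 2"
        elif label == -1:
            c3 = "handle for noise"
        # Like plt.legend(...) in the question, this line is inside the loop.
        # On the first pass only c1 exists, so looking up c2 fails.
        handles = [c1, c2, c3]
except NameError as exc:
    error_message = str(exc)

print(error_message)
```

Moving the legend, title and show calls out of the loop avoids the failure on the first pass, but the same error can still appear if one of the three labels never occurs in the data, which is one more reason to loop over the labels that are actually present.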

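From your example code I assume you are following what is shown in the documentation. Based on what is done there, you should have the following: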

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# `train` is assumed to be a pandas DataFrame that is already loaded,
# with any categorical columns encoded as numbers
feature_cols = ['age','workclass','fnlwgt','education','education num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country']
X = train[feature_cols]
y = train['label']

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=10)

X_train_scale = scale(X_train)
X_test_scale = scale(X_test)

# Fit PCA on the training data only, then reuse the same projection for
# the test data so that both sets share the same two axes
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X_train_scale)
reduced_data_test = pca.transform(X_test_scale)

# Quick look at the projected training data
plt.scatter(reduced_data[:, 0], reduced_data[:, 1])
plt.show()

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(reduced_data)
# Boolean mask marking which points are core samples
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

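I only changed your variable name dbscan to db so that the code is easier to compare with the source. From here you can plot all of the different clusters identified by DBSCAN. You should also keep the labels and the core-sample mask as arrays, so that we can access them easily when plotting.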
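To illustrate why keeping the labels as a NumPy array is convenient, here is a small sketch of boolean-mask selection. The points and labels are hypothetical values made up for the example, standing in for reduced_data and db.labels_.

```python
import numpy as np

# Hypothetical stand-ins for reduced_data and db.labels_
points = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [9.0, -3.0]])
labels = np.array([0, 0, 1, -1])

# Comparing the label array with a scalar yields one boolean per point
noise_mask = (labels == -1)

# Indexing with a boolean mask keeps only the rows where the mask is True
cluster0 = points[labels == 0]

print(noise_mask)
print(cluster0.shape)  # (2, 2)
```

This only works element-wise because labels is an array. With a plain Python list, labels == 0 would compare the whole list with 0 and give a single False.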

We identify the unique labels found by DBSCAN and map a color to each one.

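# Plot result
import numpy as np
import matplotlib.pyplot as plt

# One color per unique label, spread evenly over the Spectral colormap.
# Black is not taken from the colormap; it is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]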

For each cluster, as determined by its unique label, we plot all of the points associated with it:

for k, col in zip(unique_labels, colors):
    # -1 is an identifier for noise
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    # Find out what instances belong to this cluster, k
    class_member_mask = (labels == k)

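    # Pull out these instances from the 2-D PCA projection that DBSCAN
    # was fitted on (labels has one entry per row of reduced_data, not of X)
    xy = reduced_data[class_member_mask]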

    # Plot all of these instances
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
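The question also wanted a legend with entries such as 'Cluster 1', 'Cluster 2' and 'Noise'. One way to get that with the loop above is to pass a label to each plot call and call legend once after the loop. The sketch below assumes synthetic points and hand-written labels in place of reduced_data and db.labels_, and uses the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for this sketch
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for reduced_data and db.labels_
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(20, 2)),   # cluster 0
    rng.normal(loc=(3, 3), scale=0.2, size=(20, 2)),   # cluster 1
    rng.uniform(low=-2, high=5, size=(5, 2)),          # noise
])
labels = np.array([0] * 20 + [1] * 20 + [-1] * 5)

fig, ax = plt.subplots()
unique_labels = sorted(set(labels.tolist()))
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black and a dedicated legend entry for noise
        col, name = (0, 0, 0, 1), "Noise"
    else:
        name = f"Cluster {k + 1}"
    xy = points[labels == k]
    # label= registers this group of points as one legend entry
    ax.plot(xy[:, 0], xy[:, 1], "o", markerfacecolor=col,
            markeredgecolor="k", markersize=6, label=name)

ax.legend()  # called once, after every cluster has been drawn
ax.set_title("Estimated number of clusters: %d" % (len(unique_labels) - 1))

legend_texts = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_texts)
```

In an interactive session you would drop the matplotlib.use("Agg") line and finish with plt.show(). Because the legend is built from whatever labels are present, the same code works for any number of clusters.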
