I would use all the features and check how well the clusters separate according to some metric, e.g. the silhouette score.
Also, scaling the data before clustering is very important, since k-means is a distance-based algorithm.
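A minimal sketch of why scaling matters for a distance-based method (the rows and values here are made up for illustration, not taken from the heart dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. age (years)
# and platelets (counts), as in the heart dataset.
X_demo = np.array([[40.0, 250000.0],
                   [80.0, 251000.0],
                   [41.0, 400000.0]])

# Unscaled: the large-magnitude feature dominates the Euclidean distance.
d01 = np.linalg.norm(X_demo[0] - X_demo[1])   # rows differ mostly in age
d02 = np.linalg.norm(X_demo[0] - X_demo[2])   # rows differ mostly in platelets
print(d02 / d01)  # huge ratio: platelets alone decide who is "close"

# Scaled: both features contribute comparably.
Xs = StandardScaler().fit_transform(X_demo)
s01 = np.linalg.norm(Xs[0] - Xs[1])
s02 = np.linalg.norm(Xs[0] - Xs[2])
print(s02 / s01)  # ratio near 1
```

Without scaling, k-means would effectively cluster on platelets only.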
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

heart_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv")
Features = heart_data.drop(["DEATH_EVENT"], axis = 1).columns
X = heart_data[Features]
sc = []
for i in range(2, 25):
    kmeans = Pipeline([("scaling", StandardScaler()),
                       ("clustering", KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0))]).fit(X)
    # score the labels on the scaled features the model actually clustered on
    score = silhouette_score(kmeans["scaling"].transform(X), kmeans["clustering"].labels_)
    sc.append(score)
plt.plot(range(2, 25), sc, marker = "o")
plt.title('Silhouette')
plt.xlabel('Number of clusters')
plt.ylabel('Score')
plt.show()
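Instead of eyeballing the plot, you can also pick the best k programmatically; a small sketch with stand-in scores (the `sc_demo` values are hypothetical, in the real run use the `sc` list filled by the loop above):

```python
import numpy as np

# stand-in silhouette scores for illustration only
sc_demo = [0.30, 0.42, 0.38, 0.35]
ks = range(2, 2 + len(sc_demo))  # the k tried for each score

best_k = list(ks)[int(np.argmax(sc_demo))]  # k with the highest silhouette
print(best_k)  # -> 3 for these stand-in values
```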
You can also try different combinations of features to maximize the score.
For visualization purposes you can use a decomposition technique such as PCA.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8-whitegrid")  # "seaborn-whitegrid" on matplotlib < 3.6
# refit k-means with the number of clusters chosen from the silhouette plot (e.g. 3),
# otherwise `kmeans` is still the last model from the loop (k=24)
kmeans = Pipeline([("scaling", StandardScaler()),
                   ("clustering", KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0))]).fit(X)
pca = Pipeline([("scaling", StandardScaler()), ("decompositioning", PCA(n_components=2))]).fit(X)
X2D = pca.transform(X)
plt.scatter(X2D[:, 0], X2D[:, 1], c=kmeans["clustering"].labels_, cmap="RdYlBu")
plt.colorbar();
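One caveat with this kind of scatter plot: a 2-D PCA projection only shows part of the variance, so clusters that overlap on screen may still be well separated in the full feature space. You can check how much variance the two plotted components retain (shown here on the iris data as a stand-in, since it ships with scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)
pca2 = PCA(n_components=2).fit(X_demo)

# fraction of the total variance the 2-D plot actually displays
retained = pca2.explained_variance_ratio_.sum()
print(round(retained, 3))
```

If this fraction is low for your data, treat the 2-D picture as a rough sketch rather than the true cluster geometry.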
Last but not least, I suggest trying a manifold projection such as UMAP on your data; it may help your task by producing better-defined clusters (not guaranteed, but worth a try).

See how, using UMAP, the results seem to improve:

Code:
# pip install umap-learn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from umap import UMAP

heart_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv")
Features = heart_data.drop(["DEATH_EVENT"], axis = 1).columns
X = heart_data[Features]
sc = []
for i in range(2, 25):
    kmeans = Pipeline([("scaling", StandardScaler()),
                       ("umap", UMAP()),
                       ("clustering", KMeans(n_clusters=i, init='k-means++', random_state=0))]).fit(X)
    # score the labels in the UMAP embedding, where the clustering actually happened
    score = silhouette_score(kmeans[:-1].transform(X), kmeans["clustering"].labels_)
    sc.append(score)
plt.plot(range(2, 25), sc, marker = "o")
plt.title('Silhouette')
plt.xlabel('Number of clusters')
plt.ylabel('Score')
plt.show()
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8-whitegrid")  # "seaborn-whitegrid" on matplotlib < 3.6
kmeans = Pipeline([("scaling", StandardScaler()),
                   ("umap", UMAP()),
                   ("clustering", KMeans(n_clusters=3, init='k-means++', random_state=0))]).fit(X)
# reuse the embedding the clusters were fit on instead of fitting a second,
# independent UMAP, so the plotted points and the colors refer to the same projection
X2D = PCA(n_components=2).fit_transform(kmeans[:-1].transform(X))
plt.scatter(X2D[:, 0], X2D[:, 1], c=kmeans["clustering"].labels_, cmap="RdYlBu")
plt.colorbar();

The plot shows the first and second principal components of the UMAP projection (it is simply a 2-D projection of what all the features look like).
The colors are the cluster IDs, i.e. each color shows which cluster the algorithm (k-means) assigned each observation to.