How to extract all the information using ids

data-mining machine-learning python clustering jupyter

2022-02-19 15:03:23

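I need to apply a classifier after clustering. After clustering, I found which ids belong to which cluster. I grouped the data into 2 clusters.

Now I need to use those ids to collect the corresponding data, but I don't know how to gather all the information using these ids.

I am using Jupyter Notebook. When I load the data from the main data file, it has no attribute named id; the ids are the ones assigned in Jupyter Notebook.

Here is my main data

Here is my code for finding which data points belong to which cluster.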

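import numpy as np

x = 0.10  # fraction of each cluster to sample
i = 0     # cluster label
# (1) indices of the points that belong to cluster i
C_i = np.where(labels == i)[0].tolist()
n_i = len(C_i)  # number of points in cluster i

# (2) indices of the points from X to be sampled from cluster i
#     (np.random.choice samples with replacement by default)
sample_i = np.random.choice(C_i, int(x * n_i))
print(i, sample_i)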

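After clustering I found these ids

Added:

Suppose my loaded file is named train. With the command train.loc[26] I get the information for that particular id.

But I need to collect all of that information into a new data frame, like the train data frame.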

2 Answers

Define a new column, then use the ids to select the relevant rows and set that column to the appropriate cluster id:

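from pandas import DataFrame, Series
from numpy.random import rand

df = DataFrame(rand(10, 10)).assign(cluster=0)
clusters = Series([[1, 3, 5], [0, 2, 4, 6, 7, 8, 9]])
# Series.iteritems() was removed in pandas 2.0; items() yields (cluster id, row list)
for cluster, rows in clusters.items():
    # a single .loc call with row and column labels assigns in place;
    # df.loc[rows]["cluster"] = ... would write to a temporary copy
    df.loc[rows, "cluster"] = cluster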

Now you can groupby on the "cluster" column and carry out your operations per cluster.
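As a minimal sketch of that groupby step: the frame below is random stand-in data, the column names f0, f1, f2 are made up for illustration, and the two row lists are the same hypothetical ones used in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 rows, 3 feature columns (names are illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=["f0", "f1", "f2"])

# Row ids per cluster, same two lists as in the answer
clusters = {0: [1, 3, 5], 1: [0, 2, 4, 6, 7, 8, 9]}

df["cluster"] = -1  # placeholder until every row is assigned
for cluster, rows in clusters.items():
    df.loc[rows, "cluster"] = cluster

# Per-cluster summaries, and one sub-frame per cluster
cluster_sizes = df.groupby("cluster").size()
cluster_means = df.groupby("cluster").mean()
frames = {cluster: group for cluster, group in df.groupby("cluster")}
```

Each entry of frames is an ordinary data frame holding only the rows of one cluster, so it can be handed to a classifier on its own.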

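Solution:

Build a list from these index numbers. Suppose I need the index numbers for 10% of the data. First I collect the index numbers for cluster 0 (i=0), then the index numbers for cluster 1 (i=1).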

x = 0.10
i = 0
C_i = np.where(labels == i)[0].tolist()
n_i = len(C_i)
# indices of the points from X to be sampled from cluster i
sample_i = np.random.choice(C_i, int(x * n_i))
print(i, sample_i)
list1 = sample_i

x = 0.10
i = 1
C_i = np.where(labels == i)[0].tolist()
n_i = len(C_i)
# indices of the points from X to be sampled from cluster i
sample_i = np.random.choice(C_i, int(x * n_i))
print(i, sample_i)
list2 = sample_i

After obtaining the two lists for the 2 clusters, I merge them into a single list and use it to select the rows:

new_list = np.concatenate((list1, list2))
new_train_data = train.loc[new_list]
new_train_data.head()
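The same steps could also be folded into one loop over the cluster labels. The sketch below is only an illustration under stated assumptions: train and labels are synthetic stand-ins for the loaded data and the clustering output, the column names are made up, and replace=False is a choice made here rather than part of the solution above.

```python
import numpy as np
import pandas as pd

# Stand-ins for the real objects (assumptions for illustration only):
# `train` is the loaded data, `labels` the cluster label of each row.
rng = np.random.default_rng(42)
train = pd.DataFrame(rng.random((100, 4)), columns=["a", "b", "c", "d"])
labels = np.array([0] * 60 + [1] * 40)

x = 0.10  # fraction to sample from each cluster
sampled = []
for i in np.unique(labels):
    C_i = np.where(labels == i)[0]  # positions of the rows in cluster i
    n_sample = int(x * len(C_i))
    # replace=False keeps the sampled positions distinct; the default
    # behaviour of choice() is to sample with replacement
    sampled.append(rng.choice(C_i, size=n_sample, replace=False))

new_list = np.concatenate(sampled)
# np.where returns positions, so iloc is used here; with a default
# RangeIndex, train.loc[new_list] selects the same rows
new_train_data = train.iloc[new_list].assign(cluster=labels[new_list])
```

Keeping the cluster label as a column in new_train_data means the sampled rows can be passed straight to a classifier with the label as the target.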
