Hierarchical clustering: extracting observations from a large heatmap

data-mining python scikit-learn seaborn

2021-10-02 14:12:02

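I am currently trying to visualize a large dataset as a heatmap. That part works well, but I am struggling to extract insights from the clusters that look interesting.

Specifically, I have two closely related questions:

First, I have found interesting clusters of features and am looking for a systematic way to extract flat clusters at a specific level (the fcluster function seems to do something different, and cut_tree does not work on these trees). I would like a slice of the hierarchical clustering at a specified depth of the dendrogram. This is probably encoded in the linkage matrix Z, but I am having trouble understanding how exactly to extract it from Z.

Second, in the complex heatmap shown in the figure below, the row names on the right are the gene names of every 100th data point (gene). I would now like to see which genes are in some of the small clusters, for example the small black square of features labeled MF: LIHC. I know the IDs of the genes labeled on the right, so I would like to know the following:

Which genes are in the same cluster as CAPN7 at level 5?

Thanks for your help!

Roman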

2 Answers

A similar question has been asked on Stack Overflow. The suggestion there is to use the criterion='maxclust' option of the fcluster function:

from scipy.cluster.hierarchy import fcluster
# Z is the linkage matrix, k the desired number of flat clusters
clust = fcluster(Z, t=k, criterion='maxclust')

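The description of this option in the fcluster documentation is a bit confusing, but this is how you cut the tree into a flat clustering with at most t=k clusters.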
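As a rough illustration of what Z holds and what the 'maxclust' cut returns, here is a minimal sketch on synthetic data. The data, the 'average' linkage method and k = 3 are assumptions made for the example, not something taken from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data (an assumption): three well-separated groups of 10 points in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Z has one row per merge: [cluster_a, cluster_b, merge_distance, new_cluster_size].
Z = linkage(X, method="average")

k = 3  # arbitrary choice for this toy data
# One label per row of X, in the original row order; labels start at 1.
labels = fcluster(Z, t=k, criterion="maxclust")

print(Z.shape)            # (n_samples - 1, 4)
print(np.unique(labels))  # ids of the flat clusters that were formed
```

With 'maxclust' you choose the number of clusters rather than a cut height; fcluster picks the threshold for you.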

You should be able to use the resulting array to answer your second question.
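One way to do that lookup, sketched on made-up data: apart from CAPN7, the gene names are hypothetical, and the values, the linkage method and t=3 are arbitrary choices for the toy example. The point is only that fcluster returns its labels in the order of the rows passed to linkage (not in the reordered heatmap order), so they can be attached to the gene index directly.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: 8 "genes" (rows) x 4 features; every name except CAPN7 is made up.
genes = ["CAPN7", "GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F", "GENE_G"]
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(3, 4)),   # CAPN7, GENE_A, GENE_B
    rng.normal(loc=5.0, scale=0.1, size=(3, 4)),   # GENE_C, GENE_D, GENE_E
    rng.normal(loc=-5.0, scale=0.1, size=(2, 4)),  # GENE_F, GENE_G
])
df = pd.DataFrame(data, index=genes)

# Cluster the rows and cut the tree into at most 3 flat clusters.
Z = linkage(df.values, method="average")
labels = pd.Series(fcluster(Z, t=3, criterion="maxclust"), index=df.index)

# All genes that carry the same flat-cluster label as the target gene.
target = "CAPN7"
same_cluster = labels.index[labels == labels[target]].tolist()
print(same_cluster)
```

On real data, Z should be the same linkage matrix that produced the row dendrogram of the heatmap, so that the flat clusters match what is drawn.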

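I'm not sure what you are referring to in your first question, but for the second it looks like you are trying to drill down into the items with low, negative correlations, right? I can't reproduce your exact example (I don't have the data), but I'll give you a generic example that you can easily adapt to your specific scenario.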

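import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# df is your own DataFrame; work on a 10% random sample (drawn with replacement)
df = df.sample(frac=0.1, replace=True, random_state=1)
# get only numerics from your dataframe; correlations work on values not labels
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)

# list the numeric columns that remain
for col in newdf.columns:
    print(col)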

# Compute the correlation matrix
# no statistically significant correlations between any numeric features...
corrmat = newdf.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(15,15))
#plot heat map
g=sns.heatmap(newdf[top_corr_features].corr(),annot=True,cmap="RdYlGn")

# Identify Highly Negatively Correlated Features
# Create correlation matrix
corr_matrix = newdf.corr()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation less than 0
to_keep = [column for column in upper.columns if any(upper[column] < 0)]

Result: ['rating', 'num_comments', 'list_price', 'lowest_price_new_condition']
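To see what the upper-triangle mask does, here is a small sketch on a toy DataFrame. The column names and values are invented for illustration: b rises with a, and c falls with a.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption): b rises with a, c falls with a.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [9.0, 7.0, 4.0, 1.0],
})

corr_matrix = toy.corr()
# True strictly above the diagonal, so each pair of columns is looked at once.
mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
upper = corr_matrix.where(mask)  # entries outside the mask become NaN

# Keep a column if any entry above the diagonal in that column is negative.
to_keep = [col for col in upper.columns if (upper[col] < 0).any()]
print(to_keep)  # ['c']
```

Note that a is not listed even though it is negatively correlated with c: with the upper-triangle mask, each pair is attributed only to the later of its two columns.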

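Finally, suppose you want to drop everything with a correlation > .2 and keep only the uncorrelated or negatively correlated features; you could do this...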

# Create correlation matrix
corr_matrix = newdf.corr()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.2
to_drop = [column for column in upper.columns if any(upper[column] > .2)]
# Drop features
finaldf = newdf.drop(columns=to_drop)

corrmat = finaldf.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(15,15))
#plot heat map
g=sns.heatmap(finaldf[top_corr_features].corr(),annot=True,cmap="RdYlGn")
