我不确定您在第一季度指的是什么,但对于第二季度,您似乎正试图深入挖掘较低的负相关项目,对吧。我无法重现您的 exace 示例(我没有数据),但我会给您一个通用示例,您可以轻松地适应您的特定场景。
# get only numerics from your dataframe; correlations work on values not labels
df = df.sample(frac=0.1, replace=True, random_state=1)
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
for col in newdf.columns:
print(col)
# Compute the correlation matrix
# no statistically significant correlations between any numeric features...
corrmat = newdf.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(15,15))
#plot heat map
g=sns.heatmap(newdf[top_corr_features].corr(),annot=True,cmap="RdYlGn")

# Identify Highly Negatively Correlated Features
# Create correlation matrix
corr_matrix = newdf.corr()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
# Find index of feature columns with correlation less than 0
to_keep = [column for column in upper.columns if any(upper[column] < 0)]
回答: ['rating', 'num_comments', 'list_price', 'lowest_price_new_condition']
最后,假设您要删除所有具有 >.2 相关性的内容,只保留不相关的特征或负相关的特征,您可以这样做...
# Create correlation matrix
corr_matrix = newdf.corr()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > .2)]
# Drop features
finaldf = newdf.drop(newdf[to_drop], axis=1)
corrmat = finaldf.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(15,15))
#plot heat map
g=sns.heatmap(finaldf[top_corr_features].corr(),annot=True,cmap="RdYlGn")
