Data mining - How to measure the correlation between categorical and continuous variables

I have the following list of categorical column names in my dataset:

categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
                       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                       'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
                       'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']

In this function definition, I one-hot encode each of these columns with a for loop:

import pandas as pd

def feature_encoding(df, categorical_list):

    # One-hot encode the columns gathered in categorical_columns
    for col in categorical_list:

        # one-hot encode this column only; the prefix keeps the source column name
        OHE_sdf = pd.get_dummies(df[col], prefix = col)

        # drop the old categorical column (returns a new frame, so the caller's df is not mutated)
        df = df.drop(col, axis = 1)

        # attach the one-hot encoded columns, keeping their column names
        df = pd.concat([df, OHE_sdf], axis = 1)

    # hand the encoded frame back to the caller
    return df
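
As an alternative to the loop above, `pd.get_dummies` also accepts a `columns` argument and can encode several columns in one call. The sketch below uses a small made-up DataFrame; only the column names `Street`, `CentralAir` and `SalePrice` come from the dataset described here, and the values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values); the column names mirror the dataset above.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "N", "Y"],
    "SalePrice": [208500, 181500, 223500],
})

# Encode both categorical columns in one call. The original columns are
# dropped and the new ones are named <column>_<level>.
encoded = pd.get_dummies(toy, columns=["Street", "CentralAir"])
print(list(encoded.columns))
```

This keeps readable column names, which makes it easier to trace an encoded column back to its source variable later.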

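I don't want to use all of these columns in training, so I am now at the dimensionality-reduction stage of my work. I want to measure the correlation between each of these columns and my SalePrice variable (numeric), and drop the columns with low correlation.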

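I have read that the chi-squared test is commonly used to measure association between categorical variables, but I have not seen it implemented for a list of categorical variables against a continuous variable.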
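
To make the goal concrete, here is a minimal sketch of one measure that is sometimes used for a categorical column against a numeric one: the correlation ratio (eta squared), the share of the numeric variable's variance explained by the category groups. It is the effect size behind a one-way ANOVA, whose F-test is available as `scipy.stats.f_oneway`. The helper name `correlation_ratio`, the toy values and the 0.1 cut-off are all assumptions made for illustration, not part of the dataset or of any library.

```python
import pandas as pd


def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    frame = pd.DataFrame({"cat": categories, "val": values}).dropna()
    grand_mean = frame["val"].mean()
    groups = frame.groupby("cat")["val"]
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Total variation of the numeric variable.
    ss_total = ((frame["val"] - grand_mean) ** 2).sum()
    return ss_between / ss_total if ss_total > 0 else 0.0


# Toy data (hypothetical values); the column names mirror the dataset above.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "Street": ["Pave", "Grvl", "Pave", "Grvl", "Pave", "Grvl"],
    "SalePrice": [100, 110, 200, 210, 300, 310],
})

# Score each categorical column against the numeric target.
scores = {
    col: correlation_ratio(toy[col], toy["SalePrice"])
    for col in ["Neighborhood", "Street"]
}

# Keep columns above an arbitrary, illustrative threshold.
threshold = 0.1
keep = [col for col, score in scores.items() if score >= threshold]
print(scores, keep)
```

Scoring the original categorical columns, rather than each one-hot column separately, gives one number per variable, so a whole variable can be kept or dropped at once.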