How to use a random forest for dimensionality reduction

data-mining machine-learning python feature-selection random-forest kaggle
2022-03-14 19:44:47

I am taking part in the Boston housing competition on Kaggle, and I am currently trying to use a random forest to find the columns most strongly correlated with the target variable SalePrice. However, the implementation returns almost every variable in the dataset:

       0   1      2      3     4     5    6    ... 252 253 254 255 256 257 258
0        1  RL   65.0   8450  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
1        2  RL   80.0   9600  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
2        3  RL   68.0  11250  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
3        4  RL   60.0   9550  Pave   NaN  IR1  ...   0   0   0   0   1   0   1
4        5  RL   84.0  14260  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
5        6  RL   85.0  14115  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
6        7  RL   75.0  10084  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
7        8  RL    NaN  10382  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
8        9  RM   51.0   6120  Pave   NaN  Reg  ...   0   0   0   0   1   0   1
9       10  RL   50.0   7420  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
10      11  RL   70.0  11200  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
11      12  RL   85.0  11924  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
12      13  RL    NaN  12968  Pave   NaN  IR2  ...   0   1   0   0   1   0   1
13      14  RL   91.0  10652  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
14      15  RL    NaN  10920  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
15      16  RM   51.0   6120  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
16      17  RL    NaN  11241  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
17      18  RL   72.0  10791  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
18      19  RL   66.0  13695  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
19      20  RL   70.0   7560  Pave   NaN  Reg  ...   0   0   0   0   1   0   1
20      21  RL  101.0  14215  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
21      22  RM   57.0   7449  Pave  Grvl  Reg  ...   0   1   0   0   1   0   1
22      23  RL   75.0   9742  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
23      24  RM   44.0   4224  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
24      25  RL    NaN   8246  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
25      26  RL  110.0  14230  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
26      27  RL   60.0   7200  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
27      28  RL   98.0  11478  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
28      29  RL   47.0  16321  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
29      30  RM   60.0   6324  Pave   NaN  IR1  ...   0   1   0   0   1   1   0
...    ...  ..    ...    ...   ...   ...  ...  ...  ..  ..  ..  ..  ..  ..  ..
1430  1431  RL   60.0  21930  Pave   NaN  IR3  ...   0   1   0   0   1   0   1
1431  1432  RL    NaN   4928  Pave   NaN  IR1  ...   0   1   0   0   1   0   1

Not only that, some of those columns also come back with NaN values, even though I had already handled the NaN values before this step.

Caveat: I run the random forest immediately after one-hot encoding the categorical variables, which is part of the reason the dimensionality is so high.

Here is my implementation so far.

I collected the names of the categorical, ranked, and numerical variables in separate lists:

categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
                       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                       'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
                       'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']

ranked_columns = ['Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                  'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond',
                  'PoolQC', 'OverallQual', 'OverallCond']

numerical_columns = ['LotArea', 'LotFrontage', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                     'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
                     'BsmtHalfBath', 'FullBath', 'HalfBath', 'Bedroom', 'Kitchen', 'TotRmsAbvGrd', 'Fireplaces',
                     'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
                     '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

I created a function, def feature_encoding(df, categorical_list):, and the following code comes from that function.

Here I one-hot encode every categorical variable in categorical_columns and re-attach the encoded columns to the dataframe:

def feature_encoding(df, categorical_list):

    # one-hot encode all the categorical columns in a single pass
    # (calling pd.get_dummies on the full list inside a per-column loop
    # would re-encode every column on each iteration)
    OHE_sdf = pd.get_dummies(df[categorical_list])

    # drop the original categorical columns from the dataframe
    df = df.drop(categorical_list, axis=1)

    # attach the one-hot encoded columns; passing ignore_index=True here
    # would throw away the column names and replace them with integers
    df = pd.concat([df, OHE_sdf], axis=1)

    return df
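As a side note, the numeric column headers (0 … 258) in the output pasted above are exactly what `pd.concat` produces when `ignore_index=True` is passed along `axis=1`: the original column labels are discarded and replaced with integers. A minimal sketch (column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"LotArea": [8450, 9600], "MSZoning": ["RL", "RM"]})

# one-hot encode the single categorical column
ohe = pd.get_dummies(df["MSZoning"], prefix="MSZoning")
print(list(ohe.columns))  # ['MSZoning_RL', 'MSZoning_RM']

# ignore_index=True along axis=1 discards all column labels
merged = pd.concat([df.drop("MSZoning", axis=1), ohe], axis=1,
                   ignore_index=True)
print(list(merged.columns))  # [0, 1, 2]
```

Dropping `ignore_index=True` keeps the real column names, which also makes the output of the feature selector readable.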

Here I encode my ranked values (e.g. Excellent, Good, Average) as integers:

df['Utilities'] = df['Utilities'].replace(['AllPub', 'NoSeWa'], [2, 1])  # Utilities
df['ExterQual'] = df['ExterQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [4, 3, 2, 1])  # Exterior Quality
df['LandSlope'] = df['LandSlope'].replace(['Gtl', 'Mod', 'Sev'], [3, 2, 1])  # Land Slope
df['ExterCond'] = df['ExterCond'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0])  # Exterior Condition
df['HeatingQC'] = df['HeatingQC'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0])  # Heating Quality and Condition
df['KitchenQual'] = df['KitchenQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [3, 2, 1, 0])  # Kitchen Quality
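The repeated replace calls could also be expressed with one shared mapping dictionary per quality scale, which makes it harder for the integer codes to drift apart between columns (note, for instance, that Ex maps to 4 in ExterQual but to 3 in KitchenQual above). A sketch, with a hypothetical `quality_scale` helper:

```python
import pandas as pd

# one shared code table for the Ex/Gd/TA/Fa/Po scale (my own naming)
quality_scale = {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "Po": 0}

df = pd.DataFrame({"ExterQual": ["Ex", "TA", "Gd"],
                   "KitchenQual": ["Gd", "Fa", "Ex"]})

# apply the same scale to every ranked column
for col in ["ExterQual", "KitchenQual"]:
    df[col] = df[col].map(quality_scale)

print(df["ExterQual"].tolist())    # [4, 2, 3]
print(df["KitchenQual"].tolist())  # [3, 1, 4]
```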

Some columns contain values abbreviated as NA, meaning something like "no pavement", but pandas interprets them as NaN. To avoid this, I replaced each of those abbreviations with XX:

# Replacing the NA values of each column with XX to avoid pandas from listing them as NaN
na_data = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
           'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

for i in na_data:
    df[i] = df[i].fillna('XX')

# Replaced the NaN values of LotFrontage and MasVnrArea with the mean of their column
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['MasVnrArea'] = df['MasVnrArea'].fillna(df['MasVnrArea'].mean())
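Since the selection step still seems to see NaN values, it may be worth verifying that nothing slipped through after these replacements, for example with `isna()` (a small self-contained check, not the full dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0],
                   "Alley": [np.nan, "Grvl", np.nan]})

# same fill strategy as above: sentinel for categoricals, mean for numerics
df["Alley"] = df["Alley"].fillna("XX")
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())

# list any columns that still contain missing values
remaining = df.columns[df.isna().any()].tolist()
print(remaining)  # []
```

If `remaining` is non-empty on the real dataframe, those columns are the ones feeding NaN into the model.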

Finally, here is my random forest implementation for finding the relevant variables:

x_train, x_test, y_train, y_test = train_test_split(df, df['SalePrice'], test_size=0.3, random_state=42)

sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(x_train, y_train)
sel.get_support()

selected_feat = x_train.columns[sel.get_support()]
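Before filtering, it can also help to look at the fitted forest's `feature_importances_` directly; sorting them shows how the importance mass is spread across the columns. A sketch on synthetic data (the column names are made up, and I use a regressor here on the assumption that a continuous target like SalePrice is being predicted — swap in `RandomForestClassifier` if you stay with the classifier):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 4),
                 columns=["GrLivArea", "LotArea", "noise_a", "noise_b"])
# the target is driven only by the first two columns
y = 3 * X["GrLivArea"] + X["LotArea"]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# rank columns by importance, highest first
ranked = pd.Series(rf.feature_importances_,
                   index=X.columns).sort_values(ascending=False)
print(ranked.index[0])  # GrLivArea
```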

Apologies for such a long post; I wanted to make my problem as clear as possible. If you want to see the whole .py file, it is in the same repository as the hyperlinked dataset.

1 Answer

Expanding on my comment:

SelectFromModel selects the best features according to some importance criterion. When the estimator (a random forest in your case) is fitted, SelectFromModel computes the feature importance of every feature of the fitted estimator.

SelectFromModel then "filters out" the features that do not meet a given criterion on their feature_importances_ values. This criterion (named threshold in scikit-learn) has a large influence on how many features get filtered out.

From your question alone it is hard to tell whether all the features really do contribute to the quality of the estimator's fit. One way to check whether this is a code-related problem is to try different values for the threshold parameter.

You would expect the number of selected features (the support) to decrease as the threshold increases. If that works as expected, you can then think about how to determine the threshold that best fits your needs.
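As a rough illustration of that last point, here is a sketch on synthetic data in which only the first two features carry signal; raising the threshold can only shrink the support, since a feature is kept exactly when its importance is at least the threshold (a regressor is used because the target is continuous):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(42)
X = rng.rand(300, 10)
# only the first two columns drive the target; the other eight are noise
y = 5 * X[:, 0] + 3 * X[:, 1] + 0.01 * rng.randn(300)

for thresh in [0.0, 0.05, 0.2]:
    sel = SelectFromModel(
        RandomForestRegressor(n_estimators=100, random_state=0),
        threshold=thresh,
    ).fit(X, y)
    # with threshold 0.0 every feature passes; the count then decreases
    print(thresh, sel.get_support().sum())
```

If the count does not drop as the threshold grows on your data, the importances are likely spread almost uniformly across the one-hot columns, which points back at the encoding rather than at SelectFromModel.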