How to use a random forest for dimensionality reduction

data-mining machine-learning python feature-selection random-forest kaggle
2022-03-14 19:44:47

I am taking part in the Boston housing competition on Kaggle, and I am currently trying to use a random forest to find the columns most strongly correlated with the target variable SalePrice. However, the implementation returns almost every variable in the dataset:

       0   1      2      3     4     5    6    ... 252 253 254 255 256 257 258
0        1  RL   65.0   8450  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
1        2  RL   80.0   9600  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
2        3  RL   68.0  11250  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
3        4  RL   60.0   9550  Pave   NaN  IR1  ...   0   0   0   0   1   0   1
4        5  RL   84.0  14260  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
5        6  RL   85.0  14115  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
6        7  RL   75.0  10084  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
7        8  RL    NaN  10382  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
8        9  RM   51.0   6120  Pave   NaN  Reg  ...   0   0   0   0   1   0   1
9       10  RL   50.0   7420  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
10      11  RL   70.0  11200  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
11      12  RL   85.0  11924  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
12      13  RL    NaN  12968  Pave   NaN  IR2  ...   0   1   0   0   1   0   1
13      14  RL   91.0  10652  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
14      15  RL    NaN  10920  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
15      16  RM   51.0   6120  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
16      17  RL    NaN  11241  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
17      18  RL   72.0  10791  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
18      19  RL   66.0  13695  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
19      20  RL   70.0   7560  Pave   NaN  Reg  ...   0   0   0   0   1   0   1
20      21  RL  101.0  14215  Pave   NaN  IR1  ...   0   0   1   0   1   0   1
21      22  RM   57.0   7449  Pave  Grvl  Reg  ...   0   1   0   0   1   0   1
22      23  RL   75.0   9742  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
23      24  RM   44.0   4224  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
24      25  RL    NaN   8246  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
25      26  RL  110.0  14230  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
26      27  RL   60.0   7200  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
27      28  RL   98.0  11478  Pave   NaN  Reg  ...   0   1   0   0   1   0   1
28      29  RL   47.0  16321  Pave   NaN  IR1  ...   0   1   0   0   1   0   1
29      30  RM   60.0   6324  Pave   NaN  IR1  ...   0   1   0   0   1   1   0
...    ...  ..    ...    ...   ...   ...  ...  ...  ..  ..  ..  ..  ..  ..  ..
1430  1431  RL   60.0  21930  Pave   NaN  IR3  ...   0   1   0   0   1   0   1
1431  1432  RL    NaN   4928  Pave   NaN  IR1  ...   0   1   0   0   1   0   1

Not only that, some of those columns also come back with NaN values, even though I had already handled the NaN values before this step.

Caveat: I run the random forest immediately after one-hot encoding the categorical variables, which is part of the reason the dimensionality is so high.

Here is my implementation so far.

I collected the names of the categorical, ranked, and numerical variables in separate lists:

categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
                       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                       'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
                       'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']

ranked_columns = ['Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                  'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond',
                  'PoolQC', 'OverallQual', 'OverallCond']

numerical_columns = ['LotArea', 'LotFrontage', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                     'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
                     'BsmtHalfBath', 'FullBath', 'HalfBath', 'Bedroom', 'Kitchen', 'TotRmsAbvGrd', 'Fireplaces',
                     'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
                     '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

I created a function, def feature_encoding(df, categorical_list):, and the following code comes from that function.

Here I one-hot encode every categorical variable in categorical_columns and re-attach the encoded columns to the dataframe:

def feature_encoding(df, categorical_list):

    # one-hot encode all the categorical columns in a single pass
    # (calling pd.get_dummies on the full list inside a per-column loop
    # would re-encode every column on each iteration)
    OHE_sdf = pd.get_dummies(df[categorical_list])

    # drop the original categorical columns from the dataframe
    df = df.drop(categorical_list, axis=1)

    # attach the one-hot encoded columns; passing ignore_index=True here
    # would throw away the column names and replace them with integers
    df = pd.concat([df, OHE_sdf], axis=1)

    return df
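As a side note, the numeric column headers (0 … 258) in the output pasted above are exactly what `pd.concat` produces when `ignore_index=True` is passed along `axis=1`: the original column labels are discarded and replaced with integers. A minimal sketch (column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"LotArea": [8450, 9600], "MSZoning": ["RL", "RM"]})

# one-hot encode the single categorical column
ohe = pd.get_dummies(df["MSZoning"], prefix="MSZoning")
print(list(ohe.columns))  # ['MSZoning_RL', 'MSZoning_RM']

# ignore_index=True along axis=1 discards all column labels
merged = pd.concat([df.drop("MSZoning", axis=1), ohe], axis=1,
                   ignore_index=True)
print(list(merged.columns))  # [0, 1, 2]
```

Dropping `ignore_index=True` keeps the real column names, which also makes the output of the feature selector readable.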

Here I encode my ranked values (e.g. Excellent, Good, Average) as integers:

df['Utilities'] = df['Utilities'].replace(['AllPub', 'NoSeWa'], [2, 1])  # Utilities
df['ExterQual'] = df['ExterQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [4, 3, 2, 1])  # Exterior Quality
df['LandSlope'] = df['LandSlope'].replace(['Gtl', 'Mod', 'Sev'], [3, 2, 1])  # Land Slope
df['ExterCond'] = df['ExterCond'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0])  # Exterior Condition
df['HeatingQC'] = df['HeatingQC'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0])  # Heating Quality and Condition
df['KitchenQual'] = df['KitchenQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [3, 2, 1, 0])  # Kitchen Quality
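The repeated replace calls could also be expressed with one shared mapping dictionary per quality scale, which makes it harder for the integer codes to drift apart between columns (note, for instance, that Ex maps to 4 in ExterQual but to 3 in KitchenQual above). A sketch, with a hypothetical `quality_scale` helper:

```python
import pandas as pd

# one shared code table for the Ex/Gd/TA/Fa/Po scale (my own naming)
quality_scale = {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1, "Po": 0}

df = pd.DataFrame({"ExterQual": ["Ex", "TA", "Gd"],
                   "KitchenQual": ["Gd", "Fa", "Ex"]})

# apply the same scale to every ranked column
for col in ["ExterQual", "KitchenQual"]:
    df[col] = df[col].map(quality_scale)

print(df["ExterQual"].tolist())    # [4, 2, 3]
print(df["KitchenQual"].tolist())  # [3, 1, 4]
```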

Some columns contain values abbreviated as NA, meaning something like "no pavement", but pandas interprets them as NaN. To avoid this, I replaced each of those abbreviations with XX:

# Replacing the NA values of each column with XX to avoid pandas from listing them as NaN
na_data = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
           'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

for i in na_data:
    df[i] = df[i].fillna('XX')

# Replaced the NaN values of LotFrontage and MasVnrArea with the mean of their column
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['MasVnrArea'] = df['MasVnrArea'].fillna(df['MasVnrArea'].mean())
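Since the selection step still seems to see NaN values, it may be worth verifying that nothing slipped through after these replacements, for example with `isna()` (a small self-contained check, not the full dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0],
                   "Alley": [np.nan, "Grvl", np.nan]})

# same fill strategy as above: sentinel for categoricals, mean for numerics
df["Alley"] = df["Alley"].fillna("XX")
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())

# list any columns that still contain missing values
remaining = df.columns[df.isna().any()].tolist()
print(remaining)  # []
```

If `remaining` is non-empty on the real dataframe, those columns are the ones feeding NaN into the model.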

Finally, here is my random forest implementation for finding the relevant variables:

x_train, x_test, y_train, y_test = train_test_split(df, df['SalePrice'], test_size=0.3, random_state=42)

sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(x_train, y_train)
sel.get_support()

selected_feat = x_train.columns[sel.get_support()]
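Before filtering, it can also help to look at the fitted forest's `feature_importances_` directly; sorting them shows how the importance mass is spread across the columns. A sketch on synthetic data (the column names are made up, and I use a regressor here on the assumption that a continuous target like SalePrice is being predicted — swap in `RandomForestClassifier` if you stay with the classifier):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 4),
                 columns=["GrLivArea", "LotArea", "noise_a", "noise_b"])
# the target is driven only by the first two columns
y = 3 * X["GrLivArea"] + X["LotArea"]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# rank columns by importance, highest first
ranked = pd.Series(rf.feature_importances_,
                   index=X.columns).sort_values(ascending=False)
print(ranked.index[0])  # GrLivArea
```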

Apologies for such a long post; I wanted to make my problem as clear as possible. If you want to see the whole .py file, it is in the same repository as the hyperlinked dataset.

1 Answer

Expanding on my comment:

SelectFromModel selects the best features according to some importance criterion. When the estimator (a random forest in your case) is fitted, SelectFromModel computes the feature importance of every feature of the fitted estimator.

SelectFromModel then "filters out" the features that do not meet a given criterion on their feature_importances_ values. This criterion (named threshold in scikit-learn) has a large influence on how many features get filtered out.

From your question alone it is hard to tell whether all the features really do contribute to the quality of the estimator's fit. One way to check whether this is a code-related problem is to try different values for the threshold parameter.

You would expect the number of selected features (the support) to decrease as the threshold increases. If that works as expected, you can then think about how to determine the threshold that best fits your needs.
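As a rough illustration of that last point, here is a sketch on synthetic data in which only the first two features carry signal; raising the threshold can only shrink the support, since a feature is kept exactly when its importance is at least the threshold (a regressor is used because the target is continuous):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(42)
X = rng.rand(300, 10)
# only the first two columns drive the target; the other eight are noise
y = 5 * X[:, 0] + 3 * X[:, 1] + 0.01 * rng.randn(300)

for thresh in [0.0, 0.05, 0.2]:
    sel = SelectFromModel(
        RandomForestRegressor(n_estimators=100, random_state=0),
        threshold=thresh,
    ).fit(X, y)
    # with threshold 0.0 every feature passes; the count then decreases
    print(thresh, sel.get_support().sum())
```

If the count does not drop as the threshold grows on your data, the importances are likely spread almost uniformly across the one-hot columns, which points back at the encoding rather than at SelectFromModel.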