Variable selection with random forest returns the entire dataframe

data-mining machine-learning python feature-selection random-forest dimensionality-reduction
2022-02-26 08:04:09

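I am doing dimensionality reduction. I am using a random forest to find the columns that are most strongly related to the target column, SalePrice.

The problem is that the output is far too large, and definitely not what I want. It returns 259 columns. Some of these come from one-hot encoding the categorical variables and adding them back to the dataframe, which naturally increases the dimensionality of the dataset. However, I only want the columns most strongly related to the target variable SalePrice, not the whole damn dataframe.

Here is the output: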

       0   1     2      3     4    5    6    ... 252 253 254 255 256 257 258
0        1  RL  65.0   8450  Pave  NaN  Reg  ...   0   1   0   0   1   0   1
1        2  RL  80.0   9600  Pave  NaN  Reg  ...   0   1   0   0   1   0   1
2        3  RL  68.0  11250  Pave  NaN  IR1  ...   0   1   0   0   1   0   1
3        4  RL  60.0   9550  Pave  NaN  IR1  ...   0   0   0   0   1   0   1
4        5  RL  84.0  14260  Pave  NaN  IR1  ...   0   1   0   0   1   0   1
...    ...  ..   ...    ...   ...  ...  ...  ...  ..  ..  ..  ..  ..  ..  ..
1455  1456  RL  62.0   7917  Pave  NaN  Reg  ...   0   1   0   0   1   0   1
1456  1457  RL  85.0  13175  Pave  NaN  Reg  ...   0   1   0   0   1   0   1
1457  1458  RL  66.0   9042  Pave  NaN  Reg  ...   0   1   0   0   1   0   1
1458  1459  RL  68.0   9717  Pave  NaN  Reg  ...   0   1   0   0   1   0   1
1459  1460  RL  75.0   9937  Pave  NaN  Reg  ...   0   1   0   0   1   0   1

[1460 rows x 259 columns]

Here is my code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
                       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                       'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
                       'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']

ranked_columns = ['Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                  'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond',
                  'PoolQC', 'OverallQual', 'OverallCond']

numerical_columns = ['LotArea', 'LotFrontage', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                     'BsmtUnfSF','TotalBsmtSF', '1stFlrSF', '2ndFlrSf', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
                     'BsmtHalfBath', 'FullBath', 'HalfBath', 'Bedroom', 'Kitchen', 'TotRmsAbvGrd', 'Fireplaces',
                     'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
                     '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']


def feature_encoding(df, categorical_list):

    # take one-hot encoding
    OHE_sdf = pd.get_dummies(df[categorical_list])

    # drop the old categorical column from original df
    df.drop(columns = categorical_list, inplace = True)

    # attach one-hot encoded columns to original dataframe
    df = pd.concat([df, OHE_sdf], axis = 1, ignore_index = True)

    # Integer Encoding
    df['Utilities'] = df['Utilities'].replace(['AllPub', 'NoSeWa'], [2, 1])  # Utilities
    df['ExterQual'] = df['ExterQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [4, 3, 2, 1])  # Exterior Quality
    df['LandSlope'] = df['LandSlope'].replace(['Gtl', 'Mod', 'Sev'], [3, 2, 1])  # Land Slope
    df['ExterCond'] = df['ExterCond'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0])  # Exterior Condition
    df['HeatingQC'] = df['HeatingQC'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0])  # Heating Quality and Condition
    df['KitchenQual'] = df['KitchenQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [3, 2, 1, 0])  # Kitchen Quality

    # Replacing the NA values of each column with XX to avoid pandas from listing them as NaN
    na_data = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
               'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

    for i in na_data:
        df[i] = df[i].fillna('XX')

    # Replaced the NaN values of LotFrontage and MasVnrArea with the mean of their column
    df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
    df['MasVnrArea'] = df['MasVnrArea'].fillna(df['MasVnrArea'].mean())

    x_train, x_test, y_train, y_test = train_test_split(df, df['SalePrice'], test_size = 0.3, random_state = 42)

    sel = SelectFromModel(RandomForestClassifier(n_estimators = 100), threshold = 300 * "mean")
    sel.fit(x_train, y_train)
    sel.get_support()

    selected_feat = x_train.columns[sel.get_support()]

    return selected_feat


print(feature_encoding(train, categorical_columns))

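The random forest code comes right after the train/test split.

Update

After changing the code to the version above, I get the following error: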

Traceback (most recent call last):
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexes\base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: 'Utilities'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Boston.py", line 66, in <module>
    print(feature_encoding(train, categorical_columns))
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Boston.py", line 37, in feature_encoding
    df['Utilities'] = df['Utilities'].replace(['AllPub', 'NoSeWa'], [2, 1])  # Utilities
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\pandas\core\frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: 'Utilities'
2 Answers

I think the problem is in the for loop you are using:

for col in categorical_list:

        # take one-hot encoding
        OHE_sdf = pd.get_dummies(df[categorical_list])

        # drop the old categorical column from original df
        df.drop(col, axis = 1, inplace = True)

        # attach one-hot encoded columns to original dataframe
        df = pd.concat([df, OHE_sdf], axis = 1, ignore_index = True)

        return df

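You have put a return statement inside the for loop, so the function exits on the first iteration. That is why you get the dataframe back when you call the function. Just remove the return statement from the for loop.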
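A minimal sketch of why a return inside a loop body ends the function early. The function names `first_only` and `all_items` and the toy list are made up for illustration and are not part of the original code.

```python
def first_only(items):
    # The return sits inside the loop body, so the function exits
    # during the first iteration and never reaches the other items.
    for item in items:
        return item.upper()


def all_items(items):
    # Collect results inside the loop and return once, after it finishes.
    results = []
    for item in items:
        results.append(item.upper())
    return results


early = first_only(["a", "b", "c"])
complete = all_items(["a", "b", "c"])
print(early)     # A
print(complete)  # ['A', 'B', 'C']
```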

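To get the columns most strongly related to the target variable, you can zip the list of feature names with their relevance values and sort the pairs in descending order by that value.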
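The zip-and-sort idea could look roughly like the sketch below. Several things here are assumptions rather than part of the original post: the data is synthetic, the column names `signal`, `noise_a` and `noise_b` are hypothetical, the "relevance value" is taken to be the forest's impurity-based `feature_importances_` (which is not a correlation coefficient), and `RandomForestRegressor` is swapped in for the question's `RandomForestClassifier` because a price is a continuous target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the housing data: one informative column, two noise columns.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y = 5.0 * X["signal"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Pair each column name with its importance and sort from most to least important.
ranked = sorted(
    zip(X.columns, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

# Keep only the names of the top-k features (k = 1 in this toy example).
top_features = [name for name, _ in ranked[:1]]
print(ranked)
print(top_features)
```

Slicing `ranked` gives explicit control over how many columns are kept, which is an alternative to tuning the `threshold` argument of `SelectFromModel`.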

You don't need the for loop at all.

def feature_encoding(df, categorical_list):

    # One Hot Encoding the columns gathered in categorical_columns
    # take one-hot encoding
    OHE_sdf = pd.get_dummies(df[categorical_list])

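    # drop the old categorical columns from the original df
    df.drop(columns=categorical_list, inplace=True)

    # attach the one-hot encoded columns to the original dataframe;
    # ignore_index is left out because, with axis=1, it would relabel
    # every column as 0..n-1 and break lookups such as df['Utilities']
    df = pd.concat([df, OHE_sdf], axis=1)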
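The `KeyError: 'Utilities'` in the question's update, and the integer column headers 0 to 258 in the printed output, are both consistent with how `pd.concat` treats `ignore_index=True` when `axis=1`: the column labels are discarded and replaced with 0..n-1. The sketch below shows this on a two-row toy frame. The data is made up and only the column names `Utilities` and `Street` are borrowed from the question.

```python
import pandas as pd

# Toy frame with one column to keep and one categorical column to encode.
df = pd.DataFrame({
    "Utilities": ["AllPub", "NoSeWa"],
    "Street": ["Pave", "Grvl"],
})
dummies = pd.get_dummies(df[["Street"]])
rest = df.drop(columns=["Street"])

# With ignore_index=True the concatenation axis (the columns) is relabelled 0..n-1.
relabelled = pd.concat([rest, dummies], axis=1, ignore_index=True)

# Without it, the original column names survive.
named = pd.concat([rest, dummies], axis=1)

print(list(relabelled.columns))  # [0, 1, 2]
print(list(named.columns))       # ['Utilities', 'Street_Grvl', 'Street_Pave']
```

After the relabelling, `relabelled['Utilities']` no longer exists, so any later lookup by name fails in the way the traceback shows.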