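I'm taking part in the Boston competition on Kaggle, and I'm currently trying to use a random forest to find the columns most strongly correlated with the target variable, SalePrice. However, my implementation returns almost every variable in the dataset: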
0 1 2 3 4 5 6 ... 252 253 254 255 256 257 258
0 1 RL 65.0 8450 Pave NaN Reg ... 0 1 0 0 1 0 1
1 2 RL 80.0 9600 Pave NaN Reg ... 0 1 0 0 1 0 1
2 3 RL 68.0 11250 Pave NaN IR1 ... 0 1 0 0 1 0 1
3 4 RL 60.0 9550 Pave NaN IR1 ... 0 0 0 0 1 0 1
4 5 RL 84.0 14260 Pave NaN IR1 ... 0 1 0 0 1 0 1
5 6 RL 85.0 14115 Pave NaN IR1 ... 0 1 0 0 1 0 1
6 7 RL 75.0 10084 Pave NaN Reg ... 0 1 0 0 1 0 1
7 8 RL NaN 10382 Pave NaN IR1 ... 0 1 0 0 1 0 1
8 9 RM 51.0 6120 Pave NaN Reg ... 0 0 0 0 1 0 1
9 10 RL 50.0 7420 Pave NaN Reg ... 0 1 0 0 1 0 1
10 11 RL 70.0 11200 Pave NaN Reg ... 0 1 0 0 1 0 1
11 12 RL 85.0 11924 Pave NaN IR1 ... 0 0 1 0 1 0 1
12 13 RL NaN 12968 Pave NaN IR2 ... 0 1 0 0 1 0 1
13 14 RL 91.0 10652 Pave NaN IR1 ... 0 0 1 0 1 0 1
14 15 RL NaN 10920 Pave NaN IR1 ... 0 1 0 0 1 0 1
15 16 RM 51.0 6120 Pave NaN Reg ... 0 1 0 0 1 0 1
16 17 RL NaN 11241 Pave NaN IR1 ... 0 1 0 0 1 0 1
17 18 RL 72.0 10791 Pave NaN Reg ... 0 1 0 0 1 0 1
18 19 RL 66.0 13695 Pave NaN Reg ... 0 1 0 0 1 0 1
19 20 RL 70.0 7560 Pave NaN Reg ... 0 0 0 0 1 0 1
20 21 RL 101.0 14215 Pave NaN IR1 ... 0 0 1 0 1 0 1
21 22 RM 57.0 7449 Pave Grvl Reg ... 0 1 0 0 1 0 1
22 23 RL 75.0 9742 Pave NaN Reg ... 0 1 0 0 1 0 1
23 24 RM 44.0 4224 Pave NaN Reg ... 0 1 0 0 1 0 1
24 25 RL NaN 8246 Pave NaN IR1 ... 0 1 0 0 1 0 1
25 26 RL 110.0 14230 Pave NaN Reg ... 0 1 0 0 1 0 1
26 27 RL 60.0 7200 Pave NaN Reg ... 0 1 0 0 1 0 1
27 28 RL 98.0 11478 Pave NaN Reg ... 0 1 0 0 1 0 1
28 29 RL 47.0 16321 Pave NaN IR1 ... 0 1 0 0 1 0 1
29 30 RM 60.0 6324 Pave NaN IR1 ... 0 1 0 0 1 1 0
... ... .. ... ... ... ... ... ... .. .. .. .. .. .. ..
1430 1431 RL 60.0 21930 Pave NaN IR3 ... 0 1 0 0 1 0 1
1431 1432 RL NaN 4928 Pave NaN IR1 ... 0 1 0 0 1 0 1
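Not only that, some of the returned columns also contain NaN values, even though I handled the NaN values before returning anything.
Caveat: I run the random forest right after one-hot encoding the categorical variables, which is part of the reason the result has such high dimensionality.
Here is my implementation so far:
I collected the names of the categorical, continuous, and binary variables in separate lists: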
categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']
ranked_columns = ['Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond',
'PoolQC', 'OverallQual', 'OverallCond']
numerical_columns = ['LotArea', 'LotFrontage', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtUnfSF','TotalBsmtSF', '1stFlrSF', '2ndFlrSf', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
'BsmtHalfBath', 'FullBath', 'HalfBath', 'Bedroom', 'Kitchen', 'TotRmsAbvGrd', 'Fireplaces',
'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
'3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
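I created a function defined as def feature_encoding(df, categorical_list):, and the following code comes from that function:
Here I loop over each categorical variable in categorical_columns to one-hot encode each of them. At the end, I attach the encoded columns back to the dataframe: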
for col in categorical_list:
    # take one-hot encoding
    OHE_sdf = pd.get_dummies(df[categorical_list])
    # drop the old categorical column from original df
    df.drop(col, axis = 1, inplace = True)
    # attach one-hot encoded columns to original dataframe
    df = pd.concat([df, OHE_sdf], axis = 1, ignore_index = True)
return df
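For comparison, here is a minimal sketch of one-hot encoding all the listed columns in a single pd.get_dummies call. It is not the code from the repository: the function name feature_encoding_sketch and the toy frame are made up for illustration. One relevant detail is that pd.concat with axis=1 and ignore_index=True relabels the columns as 0, 1, 2, ..., which would be consistent with the integer column headers in the output above.

```python
import pandas as pd

def feature_encoding_sketch(df, categorical_list):
    """Hypothetical helper: one-hot encode the listed columns in one call."""
    # With `columns=`, get_dummies replaces each listed column by its dummy
    # columns and leaves every other column, and its name, untouched.
    return pd.get_dummies(df, columns=categorical_list)

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "Street": ["Pave", "Pave", "Grvl"],
    "LotArea": [8450, 9600, 11250],
})

encoded = feature_encoding_sketch(toy, ["MSZoning", "Street"])
print(list(encoded.columns))
```

Because the column names survive, the features picked by a selector later on can be read back as names such as MSZoning_RL rather than as bare integers.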
Here I encode my ranked values (for example: Excellent, Good, Average) as integers:
df['Utilities'] = df['Utilities'].replace(['AllPub', 'NoSeWa'], [2, 1]) # Utilities
df['ExterQual'] = df['ExterQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [4, 3, 2, 1]) # Exterior Quality
df['LandSlope'] = df['LandSlope'].replace(['Gtl', 'Mod', 'Sev'], [3, 2, 1]) # Land Slope
df['ExterCond'] = df['ExterCond'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0]) # Exterior Condition
df['HeatingQC'] = df['HeatingQC'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0]) # Heating Quality and Condition
df['KitchenQual'] = df['KitchenQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [3, 2, 1, 0]) # Kitchen Quality
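The same kind of ordinal encoding can also be written with a single dictionary shared by every quality column, so that each label gets the same integer everywhere. This is only a sketch: the name QUALITY_SCALE and the integers chosen are assumptions, and they differ from some of the per-column values used above (for instance, 'Fa' is 1 for ExterQual but 0 for KitchenQual in the code above).

```python
import pandas as pd

# Hypothetical shared scale; the exact integers are an assumption.
QUALITY_SCALE = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex"],
    "KitchenQual": ["Gd", "Fa", "TA"],
})

for col in ["ExterQual", "KitchenQual"]:
    # Series.map looks each label up in the dictionary; labels missing
    # from the dictionary would become NaN, which makes gaps easy to spot.
    toy[col] = toy[col].map(QUALITY_SCALE)

print(toy)
```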
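Some columns use the abbreviation NA as a value, meaning something like "No pavement", but pandas interprets it as NaN. To avoid this, I replaced each of these abbreviations with XX: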
# Replace the NA values of each column with XX to keep pandas from listing them as NaN
na_data = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
for i in na_data:
    df[i] = df[i].fillna('XX')
# Replaced the NaN values of LotFrontage and MasVnrArea with the mean of their column
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['MasVnrArea'] = df['MasVnrArea'].fillna(df['MasVnrArea'].mean())
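To see whether any column still contains NaN after these replacements, a quick check along the following lines may help. It is a sketch on a made-up toy frame; which columns of the real data still hold NaN is not something this example can show.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0],
    "GarageYrBlt": [2003.0, np.nan, 2001.0],
})

# Same two kinds of fill as above: a placeholder label and a column mean.
toy["Alley"] = toy["Alley"].fillna("XX")
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].mean())

# Count the NaN values left per column and keep only the non-zero counts.
remaining = toy.isna().sum()
remaining = remaining[remaining > 0]
print(remaining)
```

Any column listed in remaining was not covered by either fill and would still reach the model with NaN in it.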
Finally, here is my random forest implementation for finding the relevant variables:
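from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df, df['SalePrice'], test_size=0.3, random_state=42)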
sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(x_train, y_train)
sel.get_support()
selected_feat = x_train.columns[sel.get_support()]
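Two details of the snippet above stand out: the feature matrix passed to train_test_split is df itself, so it still contains SalePrice, and RandomForestClassifier treats every distinct price as a separate class. The sketch below shows one possible variant on synthetic data. It swaps in RandomForestRegressor and drops the target from the features. The toy columns noise_a and noise_b are invented for illustration, and whether this alone resolves the behaviour on the real data is an open question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: one informative column, two noise columns.
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "noise_a": rng.normal(0, 1, n),
    "noise_b": rng.normal(0, 1, n),
})
toy["SalePrice"] = 100 * toy["GrLivArea"] + rng.normal(0, 1000, n)

# Keep the target out of the feature matrix.
X = toy.drop(columns="SalePrice")
y = toy["SalePrice"]
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# SalePrice is continuous, so a regressor is used here instead of a classifier.
sel = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42))
sel.fit(x_train, y_train)

# By default, features with importance above the mean importance are kept.
selected_feat = x_train.columns[sel.get_support()]
print(list(selected_feat))
```

SelectFromModel also accepts a threshold argument (for example threshold="median") if the default cut-off keeps too many or too few columns.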
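I apologize for such a long post; I wanted to make my question as clear as possible. If you would like to see the whole .py file, it is in the same repository as the hyperlinked dataset.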