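The way you are imputing the feature cannot be replicated on the test set, because it requires knowledge of the target class!
You need to choose a different imputation strategy, one that does not depend on the target.
Assuming you use another feature the same way you were using the target, you need to store the values you imputed for each group on the training set, and then impute the test set with those same training-set values. This would look like: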
import pandas as pd

# we have two dataframes, train_df and test_df
# per-group means, computed on the training set only
impute_values = train_df.groupby('Another Feature')['Feature'].mean()
train_df['Feature'] = train_df['Feature'].fillna(train_df['Another Feature'].map(impute_values))
# train your model ...
# fill the test set with the means learned from the training set
test_df['Feature'] = test_df['Feature'].fillna(test_df['Another Feature'].map(impute_values))
Example:
import numpy as np
import pandas as pd

# float values, so that assigning NaN does not change the column dtype
train_df = pd.DataFrame({'f1': ['a'] * 5 + ['b'] * 5, 'f2': np.arange(10, dtype=float)})
test_df = pd.DataFrame({'f1': ['a'] * 3 + ['b'] * 7, 'f2': np.arange(10, 20, dtype=float)})
train_df.loc[[1, 6], 'f2'] = np.nan
test_df.loc[[1, 6], 'f2'] = np.nan
# per-group means come from the training set only: a -> 2.25, b -> 7.25
impute_values = train_df.groupby('f1')['f2'].mean()
train_df['f2'] = train_df['f2'].fillna(train_df['f1'].map(impute_values))
# the test set is filled with the same training means
test_df['f2'] = test_df['f2'].fillna(test_df['f1'].map(impute_values))
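One case the example does not cover is a group that appears in the test set but never in the training set: it has no stored mean, so its missing values would stay NaN. The following is a minimal sketch of one way to handle that, assuming a fallback to the overall training mean is acceptable for your data. The helper names `fit_group_means` and `apply_group_means` are hypothetical, and the fallback is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def fit_group_means(train_df, group_col, value_col):
    """Learn per-group means and a global fallback mean from training data only."""
    group_means = train_df.groupby(group_col)[value_col].mean()
    global_mean = train_df[value_col].mean()
    return group_means, global_mean


def apply_group_means(df, group_col, value_col, group_means, global_mean):
    """Fill missing values with the stored training statistics; returns a copy."""
    out = df.copy()
    # Look up each row's group mean; groups unseen in training map to NaN.
    per_row = out[group_col].map(group_means)
    # Assumed fallback: use the global training mean where no group mean exists.
    out[value_col] = out[value_col].fillna(per_row).fillna(global_mean)
    return out


train_df = pd.DataFrame({'f1': ['a'] * 5 + ['b'] * 5, 'f2': np.arange(10, dtype=float)})
train_df.loc[[1, 6], 'f2'] = np.nan
# Group 'c' never occurs in the training data.
test_df = pd.DataFrame({'f1': ['a', 'b', 'c'], 'f2': [np.nan, np.nan, np.nan]})

group_means, global_mean = fit_group_means(train_df, 'f1', 'f2')
train_filled = apply_group_means(train_df, 'f1', 'f2', group_means, global_mean)
test_filled = apply_group_means(test_df, 'f1', 'f2', group_means, global_mean)
```

Both statistics are computed once on the training set and then reused unchanged, so nothing from the test set (or from the target) leaks into the imputed values.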