I am working through a solution to the house prices competition on Kaggle (Human Analog's Kernel on House Prices: Advanced Regression Techniques) and came across this part:
# Transform the skewed numeric features by taking log(feature + 1).
# This will make the features more normal.
import numpy as np
from scipy.stats import skew
skewed = train_df_munged[numeric_features].apply(lambda x: skew(x.dropna().astype(float)))
skewed = skewed[skewed > 0.75]
skewed = skewed.index
train_df_munged[skewed] = np.log1p(train_df_munged[skewed])
test_df_munged[skewed] = np.log1p(test_df_munged[skewed])
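A minimal sketch of what the snippet above does to a single feature, using synthetic log-normal data as a stand-in (an assumption; this is not the competition data, and the variable names are hypothetical). It also standardizes the same data for comparison, since standardizing only shifts and rescales the values and leaves skewness unchanged:

```python
import numpy as np
from scipy.stats import skew

# Synthetic, right-skewed data standing in for a numeric feature (assumption).
rng = np.random.default_rng(0)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Skewness of the raw feature and of the log1p-transformed feature.
skew_before = skew(feature)
skew_after = skew(np.log1p(feature))

# Standardizing (zero mean, unit variance) is a linear rescaling,
# so the shape of the distribution, and hence its skewness, stays the same.
scaled = (feature - feature.mean()) / feature.std()
skew_scaled = skew(scaled)

print(f"skew of raw feature:          {skew_before:.2f}")
print(f"skew after log1p:             {skew_after:.2f}")
print(f"skew after standardizing:     {skew_scaled:.2f}")
```

On data like this, the log1p transform should pull the long right tail in, while standardizing should report the same skewness as the raw feature.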
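I am not sure why a skewed distribution needs to be transformed into a normal one. Could someone please explain in detail:
- Why is this being done here? How does it help?
- How is this different from feature scaling?
- Is this a necessary step in feature engineering? What is likely to happen if I skip it?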