Resampling with SMOTE in Python

data-mining machine-learning neural-network deep-learning smote
2022-01-22 19:56:35

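I am trying to apply a simple resampling method to my ML data after the train/test split. However, when I do so, it throws the error below. Can you help me understand what this error means?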

KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'

Here is the code:

# split into training and testing datasets
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 2, shuffle = True, stratify = y)
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())   # error is thrown here

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

Here is the full error message:

KeyError                                  Traceback (most recent call last)
<ipython-input-216-af83b63865ac> in <module>
      3 
      4 sm = SMOTE(random_state=2)
----> 5 X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())
      6 
      7 print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     86         if self._X_columns is not None:
     87             X_ = pd.DataFrame(output[0], columns=self._X_columns)
---> 88             X_ = X_.astype(self._X_dtypes)
     89         else:
     90             X_ = output[0]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5863                     results.append(
   5864                         col.astype(
-> 5865                             dtype=dtype[col_name], copy=copy, errors=errors, **kwargs
   5866                         )
   5867                     )

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5846                 if len(dtype) > 1 or self.name not in dtype:
   5847                     raise KeyError(
-> 5848                         "Only the Series name can be used for "
   5849                         "the key in Series dtype mappings."
   5850                     )

KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'
2 Answers

Do this without the ravel (or any other kind of reshaping), i.e. pass y_train as it is.

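Alternatively, if you want to keep the ravel, convert the DataFrame X_train to a matrix as well. That is the correct input format for fit_sample.

Convert your DataFrame to a matrix: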

sm.fit_sample(X_train.as_matrix(), y_train.ravel())