Balancing a dataset with imblearn undersampling, oversampling, and combined methods?

data-mining python class-imbalance smote imbalanced-learn smotenc
2022-02-19 22:22:52

I have an imbalanced dataset:

data['Class'].value_counts()
Out[22]: 
0    137757
1      4905
Name: Class, dtype: int64
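from collections import Counter
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for validation; the split is not stratified.
X_train, X_valid, y_train, y_valid = train_test_split(input_x, input_y, test_size=0.20, random_state=seed)
print(sorted(Counter(y_train).items()))
[(0, 110215), (1, 3914)]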
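
For reference, the counts above imply the class proportions computed below. This is only an arithmetic sketch on the printed numbers: the validation counts are derived by subtraction (full minus train), and the names `full`, `train`, `valid`, and `minority_share` are illustrative, not part of the original code.

```python
# Class counts copied from the output above.
full = {0: 137757, 1: 4905}
train = {0: 110215, 1: 3914}

# Validation counts follow by subtraction, assuming the validation split
# holds exactly the samples that are not in the training split.
valid = {label: full[label] - train[label] for label in full}


def minority_share(counts):
    """Fraction of samples that belong to class 1."""
    return counts[1] / sum(counts.values())


for name, counts in [("full", full), ("train", train), ("valid", valid)]:
    ratio = counts[0] / counts[1]
    print(f"{name}: {counts}, minority share = {minority_share(counts):.4f}, "
          f"majority:minority = {ratio:.1f}:1")
```

The minority class makes up roughly 3 to 4 percent of each split, so the unstratified split happened to preserve the overall proportion fairly closely.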

I tried different imblearn samplers:

from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import CondensedNearestNeighbour, EditedNearestNeighbours, RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN, InstanceHardnessThreshold, NeighbourhoodCleaningRule, TomekLinks

smote_enn = SMOTEENN(random_state=27)
smote_tomek = SMOTETomek(random_state=27)

adasyn = ADASYN(random_state=27)
borderline = BorderlineSMOTE(random_state=27)
ran_oversample = RandomOverSampler(random_state=27)
smote = SMOTE(random_state=27)

cnn = CondensedNearestNeighbour(random_state=27)
# The cleaning samplers below are deterministic; recent imbalanced-learn releases do not accept random_state for them.
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
allknn = AllKNN()
iht = InstanceHardnessThreshold(random_state=0)
ncr = NeighbourhoodCleaningRule()
tomek = TomekLinks()

Then I created the different resampled training sets:

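def BalancingData(function):
    # fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide.
    # Only the training split is resampled; the validation split is left untouched.
    X_train_resampled, y_train_resampled = function.fit_resample(X_train, y_train)
    print(sorted(Counter(y_train_resampled).items()))
    return X_train_resampled, y_train_resampled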

X_train_smote_enn, y_train_smote_enn = BalancingData(smote_enn)
X_train_smote_tomek, y_train_smote_tomek = BalancingData(smote_tomek)

X_train_adasyn, y_train_adasyn = BalancingData(adasyn)
X_train_borderline, y_train_borderline = BalancingData(borderline)
X_train_ran_oversample, y_train_ran_oversample = BalancingData(ran_oversample)
X_train_smote, y_train_smote = BalancingData(smote)

X_train_cnn, y_train_cnn = BalancingData(cnn)
X_train_enn, y_train_enn = BalancingData(enn)
X_train_renn, y_train_renn = BalancingData(renn)
X_train_allknn, y_train_allknn = BalancingData(allknn)
X_train_iht, y_train_iht = BalancingData(iht)
X_train_ncr, y_train_ncr = BalancingData(ncr)
X_train_tomek, y_train_tomek = BalancingData(tomek)
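
To make concrete what the simplest of these samplers does, here is a minimal plain-Python sketch of random oversampling: minority rows are duplicated at random until every class matches the largest one. This is not imblearn's implementation of `RandomOverSampler`; the function `random_oversample` and the toy data are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=27):
    """Duplicate randomly chosen rows of each smaller class until all
    classes have as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Rows of this class in the original (not yet resampled) data.
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): 8 majority rows and 2 minority rows.
X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X_toy, y_toy)
print(sorted(Counter(y_res).items()))  # [(0, 8), (1, 8)]
```

The synthetic methods (SMOTE, ADASYN, BorderlineSMOTE) interpolate new minority points instead of duplicating existing ones, and the undersamplers remove majority rows, but in every case only the training split changes.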

Then I built a Keras model:

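from keras.models import Sequential
from keras.layers import Dense
from keras.constraints import MaxNorm  # called maxnorm in older Keras releases

def my_model(X, y):
    # One hidden layer of 33 sigmoid units and one sigmoid output unit for the binary label.
    Keras_model = Sequential()
    Keras_model.add(Dense(33, activation='sigmoid', kernel_initializer='glorot_uniform', kernel_constraint=MaxNorm(12), input_shape=(input_len,)))
    Keras_model.add(Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform'))
    Keras_model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    Keras_model.fit(X, y, validation_data=(X_valid, y_valid), batch_size=1000, epochs=10, verbose=0)
    # [loss, accuracy] on the validation set, on the training set passed in, and on the original training set.
    scores_valid = Keras_model.evaluate(X_valid, y_valid, verbose=1)
    scores_train = Keras_model.evaluate(X, y, verbose=1)
    scores_full = Keras_model.evaluate(X_train, y_train, verbose=1)
    return scores_valid, scores_train, scores_full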

I called it with the original and the different resampled training sets, always evaluating on the same validation set:

my_model(X_train, y_train)
my_model(X_train_smote_enn, y_train_smote_enn)
my_model(X_train_smote_tomek, y_train_smote_tomek)
# ... and so on for the remaining resampled sets

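I found only a very small improvement in accuracy with AllKNN. This is an initial baseline, and I know we can still tune the AllKNN parameters and try to improve it. My question is: why does the base model, trained without any balancing function, already show such good accuracy, and why do the balancing functions not bring a large improvement over it?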
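
One thing worth checking is how much room accuracy leaves for improvement on this data. The sketch below assumes the validation split holds exactly the samples not in the training split (27542 negatives and 991 positives, obtained by subtracting the training counts from the full counts) and scores a trivial classifier that always predicts class 0. The helper `metrics` and its argument names are illustrative, not taken from the code above.

```python
# Validation counts derived from the numbers in the question:
# 137757 - 110215 negatives and 4905 - 3914 positives.
n_neg, n_pos = 27542, 991


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# A classifier that always predicts the majority class (0):
# every negative is a true negative, every positive is a false negative.
always_zero = metrics(tp=0, fp=0, tn=n_neg, fn=n_pos)

for name, value in always_zero.items():
    print(f"{name}: {value:.4f}")
```

Under that assumption the trivial classifier already exceeds 96% accuracy while its recall on class 1 is zero, so accuracy alone can hide both the problem and any gain from resampling. Reporting recall, precision, or F1 for class 1 alongside accuracy would make the resampling strategies easier to compare.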


0 Answers
No answers yet.