当决策树更加平衡和准确时,多类问题的神经网络分类器如何仅在一个类中分类?

数据挖掘 分类 决策树 多类分类 阶级失衡
2022-02-23 12:10:00

我想为具有四个类的数据框创建一个分类器。每行只能有一个类。我有两个预测模型:一个神经网络和一个树分类器。但是他们在培训期间以及因此在测试期间将每个人都放在一个班级中。

神经网络仅分类为一类

问题是我的神经网络的分类是:

水图像介绍

我在这里调用模型:

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

model = create_model(x_train.shape[1], y_train.shape[1])
epochs =  30
batch_sz = 64

print("Beginning model training with batch size {} and {} epochs".format(batch_sz, epochs))

checkpoint = ModelCheckpoint("lc_model.h5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto', period=1)


# train the model
history = model.fit(x_train.to_numpy(),
                y_train.to_numpy(),
                validation_split=0.2,
                epochs=epochs,  
                batch_size=batch_sz, 
                verbose=2,
#                 class_weight = weights, # class_weight tells the model to "pay more attention" to samples from an under-represented grade class.
#                 callbacks=[checkpoint]
                   )

# revert to the best model encountered during training
model = load_model("lc_model.h5", compile=False)

这是模型的架构:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm
# from tensorflow.python.compiler.tensorrt import trt_convert as trt

def create_model(input_dim, output_dim):
    print(output_dim)
    # create model
    model = Sequential()
    print("sequential")
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))

    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))

    # output layer
    model.add(Dense(output_dim, activation='softmax'))

    # Compile model
    model.compile(loss=focal_loss(alpha=1), loss_weights=None, optimizer='nadam', metrics=['accuracy'])

    return model

这里是一部分x_train

id reg  0.0_x   1.0_x   17.0    21.0    30.0    40.0    50.0    60.0    70.0    Célibataire     Divorcé(e)  Marié et j'ai des enfants à charge  Marié et je n'ai pas encore d'enfants à charge  Refus de répondre   Veuf (ve)   1er cycle universitaire / Licence   2e cycle universtaire / Master  3e cycle universtaire / Doctorat    BTS     Je n'ai jamais été à l'école    Niveau collège  Niveau lycée    Niveau primaire     Autre. Merci de préciser :@NS$  Infirme     J'ai une société    Je ne travaille pas     Je suis commerçant  Je suis encore étudiant     Je suis independent     Je suis journalier, je travaille de temps à autre   Je suis retraité    Je travaille dans le secteur privé  Je travaille dans le secteur public     0.0_y   250.0   3750.0  7500.0  8750.0  11250.0     11500.0     18750.0     25000.0     35000.0     45000.0     50000.0     0.0_x.1     1.0_y   0.0_y.1     1.0_x.1     Je ne suis pas d'accord     Je suis d'accord    False_x     True_y  False_y     True_x  False_x.1   True_y.1    False_y.1   True_x.1    False_x.2   True_y.2    False_y.2   True_x.2    False_x.3   True_y.3    False_y.3   True_x.3    False_x.4   True_y.4    False_y.4   True_x.4    False_x.5   True_y.5    False_y.5   True_x.5    0.0_x.2     1.0_y.1     0.0_y.2     1.0_x.2     0.0_x.3     1.0_y.2     0.0_y.3     1.0
0   NaN     0   1   0   0   1   0   0   0   0   1   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0   1   1   0   1   0   0   1   1   0   1   0   1   0   1   0   1   0   1   0   1   0   0   1   1   0   0   1   0   1
1   NaN     0   1   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   1   1   0   1   0   0   1   1   0   1   0   1   0   0   1   1   0   1   0   1   0   1   0   0   1   1   0   1   0   1   0   1   0   1   0   0   0
...

这是一部分y_train

   Voting intention in 2021_Cast a blank vote   Voting intention in 2021_I know who I will be voting for in 2021    Voting intention in 2021_I won't vote   Voting intention in 2021_I'm going to vote in 2021 but don't know for who
0                0                                  1                                                                       0                                                                       0   
1                0                                  0                                                                       0                                                                       1
...

所以当我尝试测试这个模型时,它并不比随机更好:

sequential
Beginning model training with batch size 64 and 30 epochs
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of samples seen.
Train on 768 samples, validate on 192 samples
Epoch 1/30
768/768 - 1s - loss: -inf - acc: 0.2448 - val_loss: -inf - val_acc: 0.2708
Epoch 2/30
768/768 - 0s - loss: -inf - acc: 0.2409 - val_loss: -inf - val_acc: 0.2708
...
Epoch 30/30
768/768 - 0s - loss: -inf - acc: 0.2409 - val_loss: -inf - val_acc: 0.2708

事实上,准确率略低于 25%,这是我从随机选择类时所期望的结果。它似乎永远不会学到任何东西,因为损失总是如此-inf

所以我在测试集上计算模型的准确率,结果更糟。确实使用以下代码:

import numpy as np
from sklearn.metrics import f1_score

y_pred = model.predict(x_test.to_numpy())

# Revert one-hot encoding to classes
y_pred_classes = pd.DataFrame((y_pred.argmax(1)[:,None] == np.arange(y_pred.shape[1])),
                              columns=y_test.columns,
                              index=y_test.index)

y_test_vals = y_test.idxmax(1)
y_pred_vals = y_pred_classes.idxmax(1)

# F1 score
# Use idxmax() to convert back from one-hot encoding
f1 = f1_score(y_test_vals, y_pred_vals, average='weighted')
print("Test Set Accuracy: {:.2%}   (But results would have been better if trained on the FULL dataset)".format(f1))

我不明白,这是我设法解决另一个贷款分类问题的架构。

我得到:Test Set Accuracy: 10.92%

带重量:

所有之前的模型化都未加权或没有焦点损失。

我试图以不同的方式来应对类不平衡,例如重新采样。没有重新采样,我用权重做了:

weights = df_en2['Voting intention in 2021'].value_counts(normalize=True)
weights = weights.sort_index().tolist()
weights = {0: 1 / weights[0],
           1: 1 / weights[1], 
           2: 1 / weights[2],
           3: 1 / weights[3]}

dfen_2提供x_train, y_train, x_test,y_test功能的数据框在哪里spilt_data(),您可以在这里找到(它基本上是相同的架构,但针对贷款分类问题)。

多类树分类器

相比之下,使用树分类器,如果我离开max_depthto None,叶子会被扩展,直到所有叶子都是纯的或直到所有叶子包含少于 min_samples_split 样本。我的平均准确率为 42%。

# importing necessary libraries 
from sklearn import datasets 
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split 

# dividing X, y into train and test data 
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

# training a DescisionTreeClassifier 
from sklearn.tree import DecisionTreeClassifier 
dtree_model = DecisionTreeClassifier().fit(x_train, y_train) 
dtree_predictions = dtree_model.predict(x_test) 

# creating a confusion matrix 
cm = confusion_matrix(y_test.values.argmax(axis = 1), dtree_predictions.argmax(axis = 1))

它返回:

array([[0.19047619, 0.15873016, 0.45454545, 0.27118644],
       [0.15873016, 0.38095238, 0.2       , 0.30508475],
       [0.15873016, 0.15873016, 0.4       , 0.22033898],
       [0.19047619, 0.19047619, 0.21818182, 0.38983051]])
1个回答

对于多标签分类,您希望在最后一层使用 binary_crossentropy 损失和 sigmoid 激活,因为每个类可以取 0 到 1 之间的值。在您当前的情况下,您的模型假设最后一层的所有值的总和应该总计 1,而实际上总数可能超过 1。这是一个很好的参考:https ://towardsdatascience.com/multi-label-image-classification-with-neural-network-keras-ddc1ab1afede