How to improve classification when some classes are under-represented?

data-mining classification feature-selection multiclass-classification multilabel-classification
2022-03-07 16:13:01

I have a multi-class classification problem. It performs well, except on the least represented classes. Indeed, here is the class distribution:

[Image: class distribution]

And here are the classification results (I removed the numbers from the labels):

[Image: classification results]

So, how can I improve the classification when some classes are under-represented?

I thought about duplicating a few rows of the poorly predicted classes in the training sample. But maybe that assumption is completely wrong; maybe those classes are not misclassified because they are under-represented. Maybe I should look at the feature selection I did by hand, rather than doing a PCA?
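For the duplication idea, here is a minimal sketch of random oversampling, assuming a DataFrame df with a 'grade' target column (the helper name and defaults are mine, and it should only be applied to the training split):

import pandas as pd

# Randomly duplicate rows of the rarer classes until every class has as many
# rows as the most frequent one (random oversampling).
def oversample_minorities(df, target='grade', random_state=42):
    max_count = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        n_extra = max_count - len(group)
        if n_extra > 0:
            group = pd.concat([group, group.sample(n_extra, replace=True,
                                                   random_state=random_state)])
        parts.append(group)
    # shuffle so the duplicated rows are not clustered together
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)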

Update

Class weights from inverse frequencies

I passed a class_weight argument to model.fit(), built from the inverse frequencies of the classes in the dataset:

>>> lossWeights = df['grade'].value_counts(normalize=True)
>>> lossWeights = lossWeights.sort_index().tolist()
>>> print(lossWeights)
[0.204064039408867, 0.2954361054766734, 0.29536185163720663, 0.13638619240799768, 0.04878839466821211, 0.014684149521877717, 0.0052792668791654595]
weights = {0: 1 / 0.204064,
       1: 1 / 0.295436, 
       2: 1 / 0.295362,
       3: 1 / 0.136386, 
       4: 1 / 0.048788,
       5: 1 / 0.014684,
       6: 1 / 0.005279}
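
The same dict can also be built directly from the frequencies instead of hard-coding it (a small sketch using the same df['grade'] as above):

# Equivalent, programmatic version of the dict above (up to rounding)
freqs = df['grade'].value_counts(normalize=True).sort_index()
weights = {i: 1.0 / f for i, f in enumerate(freqs)}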

history = model.fit(x_train.to_numpy(),   # .to_numpy() replaces the deprecated .as_matrix()
                y_train.to_numpy(),
                validation_split=0.2,
                epochs=epochs,
                batch_size=batch_sz, # Can I tweak the batch here to get evenly distributed data ?
                verbose=2,
                class_weight=weights,
                callbacks=[checkpoint])

It lowered the test-set accuracy a bit: 86.57% (it was 88.54% before), but the confusion matrix is much better balanced:

[Image: confusion matrix with inverse-frequency class weights]

Inverse-frequency class weights + focal loss

Focal loss is designed to address class imbalance by down-weighting inliers (easy examples) so that, even if they are numerous, their contribution to the total loss is small. It focuses training on a sparse set of hard examples.

import tensorflow as tf

def focal_loss(gamma=2., alpha=4.):

    gamma = float(gamma)
    alpha = float(alpha)

    def focal_loss_fixed(y_true, y_pred):
        """Focal loss for multi-classification
        FL(p_t)=-alpha(1-p_t)^{gamma}ln(p_t)
        Notice: y_pred is probability after softmax
        gradient is d(Fl)/d(p_t) not d(Fl)/d(x) as described in paper
        d(Fl)/d(p_t) * [p_t(1-p_t)] = d(Fl)/d(x)
        Focal Loss for Dense Object Detection
        https://arxiv.org/abs/1708.02002

        Arguments:
            y_true {tensor} -- ground truth labels, shape of [batch_size, num_cls]
            y_pred {tensor} -- model's output, shape of [batch_size, num_cls]

        Keyword Arguments:
            gamma {float} -- (default: {2.0})
            alpha {float} -- (default: {4.0})

        Returns:
            [tensor] -- loss.
        """
        epsilon = 1.e-9
        y_true = tf.convert_to_tensor(y_true, tf.float32)
        y_pred = tf.convert_to_tensor(y_pred, tf.float32)

        model_out = tf.add(y_pred, epsilon)
        ce = tf.multiply(y_true, -tf.math.log(model_out))  # tf.math.log: tf.log was removed in TF 2.x
        weight = tf.multiply(y_true, tf.pow(tf.subtract(1., model_out), gamma))
        fl = tf.multiply(alpha, tf.multiply(weight, ce))
        reduced_fl = tf.reduce_max(fl, axis=1)
        return tf.reduce_mean(reduced_fl)
    return focal_loss_fixed

model.compile(loss=focal_loss(alpha=1),
              optimizer='nadam',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=3, batch_size=1000)

I just had to add it to my model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import max_norm as maxnorm

def create_model(input_dim, output_dim):
    print(output_dim)
    # create model
    model = Sequential()
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    
    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    
    # output layer
    model.add(Dense(output_dim, activation='softmax'))
    
    # Compile model
    # model.compile(loss='categorical_crossentropy', loss_weights=None, optimizer='adam', metrics=['accuracy'])
    model.compile(loss=focal_loss(alpha=1), loss_weights=None, optimizer='adam', metrics=['accuracy'])
    
    return model

It gives me 88% overall accuracy. However, it classifies the least represented classes very poorly:

[Image: confusion matrix with class weights + focal loss]

Focal loss only

It has a decent test-set accuracy, 88.27%, and the classification is more balanced:

[Image: confusion matrix with focal loss only]

Now I have to ask: I am still not satisfied. How can I improve this classification further? And between the first and the last update, which model should I use?

Resampling

I tried to undersample the majority classes:

import os
import statistics

from sklearn.utils import resample

# rebalance data
#df = resample_data(df)

if True:

    # sort_index() so the unpacked counts line up with grades A..G
    # (value_counts() alone orders by frequency, not by label)
    count_class_A, count_class_B, count_class_C, count_class_D, \
        count_class_E, count_class_F, count_class_G = df.grade.value_counts().sort_index()
    count_df = df.shape[0] 
    class_dict = {"A": count_class_A,"B" :count_class_B,"C": count_class_C,"D": count_class_D,"E": count_class_E, "F": count_class_F, "G": count_class_G}
    counts = [count_class_A, count_class_B,count_class_C, count_class_D,count_class_E, count_class_F, count_class_G]
    median = statistics.median(counts)


    for key in class_dict:
        if class_dict[key]>median:
            print(key)
            # Caveat: assigning a sampled subset back through a boolean mask aligns on
            # the index, so masked rows that were not sampled become NaN rather than
            # being dropped -- a likely cause of the collapsed results below.
            df[df.grade == key] = df[df.grade == key].sample(int(count_df/7), replace = False) 
                                             #replace=False,    # sample without replacement
                                             #n_samples=int(count_df/7),     # to match minority class
                                             #random_state=123)
# Divide the data set into training and test sets
x_train, x_test, y_train, y_test = split_data(df, APPLICANT_NUMERIC + CREDIT_NUMERIC,
                  APPLICANT_CATEGORICAL,
                  TARGET,
                  test_size = 0.2,
                  #row_limit = os.environ.get("sample"))
                  row_limit = 552160)

# Inspect our training data
print("x_train contains {} rows and {} features".format(x_train.shape[0], x_train.shape[1]))
print("y_train contains {} rows and {} features".format(y_train.shape[0], y_train.shape[1]))

print("x_test contains {} rows and {} features".format(x_test.shape[0], x_test.shape[1]))
print("y_test contains {} rows and {} features".format(y_test.shape[0], y_test.shape[1]))

# Loan grade has been one-hot encoded
print("Sample one-hot encoded 'y' value: \n{}".format(y_train.sample()))

However, the results are disastrous. The model accuracy and model loss curves look problematic:

[Image: model accuracy and model loss curves]

On the test set, everything gets classified as "A".
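
I suspect the in-place assignment above is part of the problem: assigning the sampled subset back through a boolean mask aligns on the index, so the unsampled rows become NaN instead of being dropped. A sketch of what I would try instead, rebuilding the frame with pd.concat (same df and 'grade' column; names are mine):

import pandas as pd

# Cap every majority class at the median class count and keep the other classes untouched.
median = int(df['grade'].value_counts().median())
parts = []
for label, group in df.groupby('grade'):
    parts.append(group.sample(median, replace=False, random_state=123) if len(group) > median else group)
df_balanced = pd.concat(parts).sample(frac=1, random_state=123).reset_index(drop=True)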

2 Answers

Class imbalance is a common problem, and there are several ways people tackle it. Here are a few:

  1. If possible, try to augment the classes that have less data (some people call this oversampling).
  2. You can use class weighting so that the under-represented classes count for more during training.
  3. Use focal loss; an example can be found in "Using focal loss for fraud detection".

There are other possible approaches as well; try whichever best fits your scenario and is most convenient for you.

"Maybe I should look at the feature selection I did by hand, rather than doing a PCA?"

That is an option, and it might work.
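
If you go down that route, here is a minimal sketch of a univariate feature-selection step with scikit-learn's SelectKBest (variable names follow the question's code; k is a tuning choice):

from sklearn.feature_selection import SelectKBest, f_classif

# y_train is one-hot encoded in the question, so take the argmax to recover class ids.
selector = SelectKBest(score_func=f_classif, k=20)
x_train_sel = selector.fit_transform(x_train, y_train.to_numpy().argmax(axis=1))
x_test_sel = selector.transform(x_test)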

Alternatively, you could try the following:

  1. Train the neural network with mini-batch gradient descent and artificially construct every batch so that all the classes are evenly represented (see the sketch after this list).

  2. If you can, break the problem down into sub-problems. Classifier performance degrades as the number of classes grows. If some classes represent a separate problem, you can train separate models on subsets of the data so that each one learns to classify over fewer classes. This usually leads to better predictions.
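
For the first option, a rough sketch of a balanced batch generator for Keras, assuming one-hot labels as in the question (all names are illustrative):

import numpy as np

def balanced_batch_generator(X, y_onehot, batch_size):
    # Yield batches that draw (roughly) the same number of samples from every class.
    labels = y_onehot.argmax(axis=1)
    idx_per_class = [np.where(labels == c)[0] for c in np.unique(labels)]
    per_class = max(1, batch_size // len(idx_per_class))
    while True:
        batch_idx = np.concatenate([np.random.choice(idx, per_class, replace=True)
                                    for idx in idx_per_class])
        np.random.shuffle(batch_idx)
        yield X[batch_idx], y_onehot[batch_idx]

# model.fit(balanced_batch_generator(x_train.to_numpy(), y_train.to_numpy(), batch_sz),
#           steps_per_epoch=len(x_train) // batch_sz, epochs=epochs)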