I have a multi-class classification problem. It performs well overall, but not on the least represented classes. Indeed, here is the distribution:
Here are the classification results (I stripped the numbers from the labels):
So, how can I improve the classification when some classes are under-represented?
I thought about duplicating, in the training samples, a few rows of the classes it does not predict well (see the sketch right below). But maybe this assumption is completely wrong; maybe they are not misclassified because they are under-represented. Maybe I should take a look at my hand-made feature selection instead of doing a PCA?
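A minimal sketch of that duplication idea, assuming my DataFrame df with the label in a grade column (the resampling below is plain pandas, not my actual pipeline):

import pandas as pd

# Oversample each class up to the size of the largest one by duplicating rows.
max_count = df['grade'].value_counts().max()
df_oversampled = pd.concat(
    [g.sample(max_count, replace=True, random_state=123)  # replace=True duplicates rows
     for _, g in df.groupby('grade')],
    ignore_index=True,
).sample(frac=1, random_state=123)  # shuffle so duplicates are not contiguous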
Update
Class weights with inverse frequency
I passed a class_weight argument to model.fit(), built from the list of the inverse frequencies of the classes in the dataset:
>>> lossWeights = df['grade'].value_counts(normalize=True)
>>> lossWeights = lossWeights.sort_index().tolist()
>>> print(lossWeights)
[0.204064039408867, 0.2954361054766734, 0.29536185163720663, 0.13638619240799768, 0.04878839466821211, 0.014684149521877717, 0.0052792668791654595]
weights = {0: 1 / 0.204064,
           1: 1 / 0.295436,
           2: 1 / 0.295362,
           3: 1 / 0.136386,
           4: 1 / 0.048788,
           5: 1 / 0.014684,
           6: 1 / 0.005279}
history = model.fit(x_train.as_matrix(),  # as_matrix() is deprecated; .to_numpy() in recent pandas
                    y_train.as_matrix(),
                    validation_split=0.2,
                    epochs=epochs,
                    batch_size=batch_sz,  # Can I tweak the batch here to get evenly distributed data?
                    verbose=2,
                    class_weight=weights,
                    callbacks=[checkpoint])
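For what it's worth, the same weights dict can be derived instead of hard-coded, assuming the lossWeights list printed above:

# Same mapping as above, computed from the normalized value_counts
weights = {i: 1.0 / freq for i, freq in enumerate(lossWeights)}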
It lowered the test set accuracy a bit: 86.57% (it was 88.54% before), but it balanced the confusion matrix results much better:
Class weights with inverse frequency + focal loss
Focal loss is designed to address class imbalance by down-weighting inliers (easy examples), so that even when their number is large, their contribution to the total loss is small. It focuses training on a sparse set of hard examples.
def focal_loss(gamma=2., alpha=4.):
    gamma = float(gamma)
    alpha = float(alpha)
    def focal_loss_fixed(y_true, y_pred):
        """Focal loss for multi-classification
        FL(p_t) = -alpha * (1 - p_t)^gamma * ln(p_t)
        Notice: y_pred is the probability after softmax;
        the gradient is d(FL)/d(p_t), not d(FL)/d(x) as described in the paper,
        and d(FL)/d(p_t) * [p_t * (1 - p_t)] = d(FL)/d(x)
        Focal Loss for Dense Object Detection
        https://arxiv.org/abs/1708.02002
        Arguments:
            y_true {tensor} -- ground truth labels, shape of [batch_size, num_cls]
            y_pred {tensor} -- model's output, shape of [batch_size, num_cls]
        Keyword Arguments:
            gamma {float} -- (default: {2.0})
            alpha {float} -- (default: {4.0})
        Returns:
            [tensor] -- loss.
        """
        epsilon = 1.e-9
        y_true = tf.convert_to_tensor(y_true, tf.float32)
        y_pred = tf.convert_to_tensor(y_pred, tf.float32)
        model_out = tf.add(y_pred, epsilon)  # avoid log(0)
        # standard cross-entropy term, masked to the true class by the one-hot y_true
        ce = tf.multiply(y_true, -tf.math.log(model_out))  # tf.log in TF 1.x
        # modulating factor (1 - p_t)^gamma down-weights easy examples
        weight = tf.multiply(y_true, tf.pow(tf.subtract(1., model_out), gamma))
        fl = tf.multiply(alpha, tf.multiply(weight, ce))
        # with one-hot y_true, only the true-class entry is non-zero per row
        reduced_fl = tf.reduce_max(fl, axis=1)
        return tf.reduce_mean(reduced_fl)
    return focal_loss_fixed
model.compile(loss=focal_loss(alpha=1),
              optimizer='nadam',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=3, batch_size=1000)
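Since this update is meant to combine the focal loss with the inverse-frequency class weights, the fit call also has to receive the weights dict from the first update; a sketch of that combination:

model.fit(X_train, y_train,
          epochs=3,
          batch_size=1000,
          class_weight=weights)  # inverse-frequency weights from the first update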
I just had to add it to my model:
def create_model(input_dim, output_dim):
    print(output_dim)
    # create model
    model = Sequential()
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    # output layer
    model.add(Dense(output_dim, activation='softmax'))
    # Compile model
    # model.compile(loss='categorical_crossentropy', loss_weights=None, optimizer='adam', metrics=['accuracy'])
    model.compile(loss=focal_loss(alpha=1), loss_weights=None, optimizer='adam', metrics=['accuracy'])
    return model
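Wiring it up then looks like this (a sketch; the dimensions are simply read off my encoded frames):

# Hypothetical call: input/output sizes come from the training data
model = create_model(input_dim=x_train.shape[1], output_dim=y_train.shape[1])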
It gave me 88% overall accuracy. However, it classified the least represented class very poorly:
Focal loss
It has a decent test set accuracy: 88.27%, and the classification is more balanced:
Now I have to ask: I am still not satisfied. How can I improve this classification further? And which model should I use, the one from the first update or the one from the last?
Resampling
I tried to undersample the majority classes:
import os
import statistics
from sklearn.utils import resample

# rebalance data
#df = resample_data(df)
if True:
    # value_counts() sorts by frequency, so sort_index() is needed
    # for the counts to line up with grades A..G
    count_class_A, count_class_B, count_class_C, count_class_D, count_class_E, count_class_F, count_class_G = df.grade.value_counts().sort_index()
    count_df = df.shape[0]
    class_dict = {"A": count_class_A, "B": count_class_B, "C": count_class_C, "D": count_class_D,
                  "E": count_class_E, "F": count_class_F, "G": count_class_G}
    counts = [count_class_A, count_class_B, count_class_C, count_class_D, count_class_E, count_class_F, count_class_G]
    median = statistics.median(counts)
    for key in class_dict:
        if class_dict[key] > median:
            print(key)
            df[df.grade == key] = df[df.grade == key].sample(int(count_df/7), replace=False)
            #replace=False,              # sample without replacement
            #n_samples=int(count_df/7),  # to match minority class
            #random_state=123)

# Divide the data set into training and test sets
x_train, x_test, y_train, y_test = split_data(df, APPLICANT_NUMERIC + CREDIT_NUMERIC,
                                              APPLICANT_CATEGORICAL,
                                              TARGET,
                                              test_size=0.2,
                                              #row_limit=os.environ.get("sample"))
                                              row_limit=552160)

# Inspect our training data
print("x_train contains {} rows and {} features".format(x_train.shape[0], x_train.shape[1]))
print("y_train contains {} rows and {} features".format(y_train.shape[0], y_train.shape[1]))
print("x_test contains {} rows and {} features".format(x_test.shape[0], x_test.shape[1]))
print("y_test contains {} rows and {} features".format(y_test.shape[0], y_test.shape[1]))

# Loan grade has been one-hot encoded
print("Sample one-hot encoded 'y' value: \n{}".format(y_train.sample()))
However, the results were disastrous; the model accuracy and model loss curves both look wrong:
On the test set, everything is classified as "A".
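In hindsight, the in-place assignment df[df.grade == key] = ... is a likely culprit: pandas realigns the sampled rows on the index, so the majority-class rows that were sampled out are left behind as all-NaN rows instead of being dropped. A safer rebuild, as a sketch on the same df and its grade column:

import pandas as pd

# Undersample by rebuilding the frame: cap every class at the median class
# count, instead of assigning sampled rows back into df in place.
median = int(df['grade'].value_counts().median())
df_balanced = pd.concat(
    [g.sample(min(len(g), median), replace=False, random_state=123)
     for _, g in df.groupby('grade')],
    ignore_index=True,
).sample(frac=1, random_state=123)  # shuffle the rebuilt frame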