In Keras, I'm building a topic-modeling example loosely based on the Keras IMDB example. Unlike that example's single positive/negative classification, though, I have over a hundred topics that are not mutually exclusive. Each training sample has a corresponding output vector containing 3 or 4 ones, for example: [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0 ..... 0]
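As a minimal sketch of the target encoding described above (the helper name `multi_hot` and the topic count of 120, matching the Dense layer below, are my assumptions):

```python
import numpy as np

# Hypothetical helper: turn a list of topic indices into a multi-hot
# target vector of length num_topics (assumed 120 to match Dense(120)).
def multi_hot(topic_indices, num_topics=120):
    vec = np.zeros(num_topics, dtype="float32")
    vec[topic_indices] = 1.0
    return vec

y = multi_hot([7, 21, 22])
# y has exactly three 1s, at positions 7, 21, and 22
```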
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(120, activation='sigmoid'))
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
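As a sanity check on what 'accuracy' means in this setup: with `binary_crossentropy` and sigmoid outputs, Keras reports element-wise binary accuracy averaged over all 120 labels, so an all-zero prediction already scores very high on sparse targets. A small arithmetic sketch (the indices are illustrative):

```python
import numpy as np

# With ~3 positives out of 120 labels, a model that predicts all zeros
# is already right on 117/120 entries per sample under element-wise
# binary accuracy.
y_true = np.zeros(120)
y_true[[7, 21, 22]] = 1          # 3 positive topics
y_pred = np.zeros(120)           # degenerate all-zero prediction
binary_accuracy = (y_true == y_pred).mean()
# → 0.975
```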
Sure enough, the model's accuracy quickly climbs to 95-97%, but when I inspect the output, it predicts nothing but zeros. Clearly the class imbalance (every class has far more negative samples than positive ones) is pinning my predictions at 0. Is there a way to adjust the model so it can learn from sparse binary targets?
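One way to confirm the imbalance and derive per-class weights is to compute each class's positive rate and weight positives by the negative-to-positive ratio; such weights are commonly fed into a weighted binary cross-entropy (e.g. a `pos_weight`-style loss). A sketch on synthetic labels (the data and the remedy are my assumptions, not part of the question):

```python
import numpy as np

# Synthetic multi-hot labels: 100 samples, 5 topics, topic 0 deliberately rare.
rng = np.random.default_rng(0)
y_train = (rng.random((100, 5))
           < np.array([0.02, 0.3, 0.5, 0.1, 0.25])).astype("float32")

# Per-class positive rate exposes the imbalance directly.
pos_rate = y_train.mean(axis=0)

# One common remedy: weight each class's positive term by n_neg / n_pos,
# then use these as per-class pos_weight values in a weighted
# binary cross-entropy loss.
n_pos = y_train.sum(axis=0)
pos_weight = (len(y_train) - n_pos) / np.maximum(n_pos, 1.0)
```

The rarest class ends up with the largest weight, so its few positives contribute as much to the loss as the many negatives do.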