Data mining - Eliminating low-quality predictions in a classification task

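Here is some background on the problem. My goal is to classify text into a set of categories. I only want high-quality predictions from the model. If the model is not confident, I want to classify the text manually.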

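Let us consider the example provided at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html so that the problem is reproducible. In the following example, a classification model is trained and then applied to new test documents. One of the test documents is 'what the heck is this?'. I know the model returns the class with the highest probability. However, when the model is unsure, I would like to label the text as "unable to classify".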

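from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Setup as in the linked tutorial: four newsgroup categories, word counts, then TF-IDF.
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(count_vect.fit_transform(twenty_train.data))

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)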

docs_new = ['God is love', 'OpenGL on the GPU is fast', 'what the heck is this?']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

Output

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'what the heck is this?' => soc.religion.christian

Predicted probabilities

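Here are the predicted probabilities. Documents 1 and 2 have fairly clear winners. The third document does not: its top two classes are almost tied. I have about 100 classes, so I am hesitant to set a manual threshold.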

clf.predict_proba(X_new_tfidf)
array([[ 0.16297502,  0.03828016,  0.03737814,  0.76136668],
       [ 0.16387956,  0.36874738,  0.2364763 ,  0.23089675],
       [ 0.28288106,  0.17035852,  0.2484853 ,  0.29827513]])
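
One possible way to express the "unable to classify" rule is to look at the gap between the best and second-best class instead of an absolute probability cutoff. The sketch below is only an illustration, not a recommended solution: the helper name `predict_with_reject`, the `-1` reject label, and the `min_margin=0.1` default are all assumptions made for this example, and the margin would still need tuning on held-out data. Naive Bayes probabilities are also often poorly calibrated, so the raw numbers should be treated with caution.

```python
import numpy as np

# Probabilities copied from the predict_proba output above (one row per document).
proba = np.array([
    [0.16297502, 0.03828016, 0.03737814, 0.76136668],
    [0.16387956, 0.36874738, 0.2364763, 0.23089675],
    [0.28288106, 0.17035852, 0.2484853, 0.29827513],
])


def predict_with_reject(proba, min_margin=0.1):
    """Hypothetical helper: top class index per row, or -1 when the top two are too close.

    min_margin is an assumed, untuned value chosen only for illustration.
    """
    top_two = np.sort(proba, axis=1)[:, -2:]   # two largest probabilities in each row
    margin = top_two[:, 1] - top_two[:, 0]     # winner minus runner-up
    labels = proba.argmax(axis=1)              # usual highest-probability class
    return np.where(margin >= min_margin, labels, -1), margin


labels, margin = predict_with_reject(proba)
print(labels)   # -1 marks documents to route to manual classification
print(margin)
```

With these three rows, the first two documents keep their predicted class and the third is rejected, because its top two probabilities differ by only about 0.015. In the original example, any non-negative label could be mapped back through `twenty_train.target_names`, and `-1` could be reported as "unable to classify".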