How to create a ROC-AUC curve for a multiclass text classification problem in Python

data-mining machine-learning text-mining
2022-01-24 11:31:55

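I am working on a multiclass text classification problem and trying to plot a ROC curve, but so far without success. I have tried many of the available solutions, and none of them worked. Could someone help me plot the ROC curve for the code below? I am classifying text into five different classes.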

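# Imports assumed from the names used below; `dt` is taken to be sklearn.datasets.
import sklearn.datasets as dt
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Category names must match the folder names on disk, so they are left unchanged.
categories = ['Philonthropists', 'Politcians', 'Showbiz', 'Sportsmen', 'Writers']
train = dt.load_files(r'C:\Users\...\Learning\Train', categories=categories, encoding='ISO-8859-1')
test = dt.load_files(r'C:\Users\...\Learning\Test', categories=categories, encoding='ISO-8859-1')
# Fit the vocabulary and the TF-IDF weights on the training set only.
count_vector = CountVectorizer()
x_train_tf = count_vector.fit_transform(train.data)
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_tf)
learn = MultinomialNB().fit(x_train_tfidf, train.target)
# Reuse the fitted transformers on the test set.
x_test_tf = count_vector.transform(test.data)
x_test_tfidf = tfidf_transformer.transform(x_test_tf)
prediction = learn.predict(x_test_tfidf)
print("Accuracy of Multinomial Naive Bayes classifier:", accuracy_score(test.target, prediction) * 100)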
1 Answer

First, look at the binary classification example in the documentation. With scikit-learn it is this simple:

from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay
y_score = clf.decision_function(X_test)

fpr, tpr, _ = roc_curve(y_test, y_score, pos_label=clf.classes_[1])
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
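
The snippet above relies on `clf`, `X_test` and `y_test` already existing. As a minimal, self-contained sketch of the same binary case, the version below builds them from synthetic data and also computes the area under the curve. The dataset, the `LogisticRegression` model and all variable names are assumptions made for illustration; they are not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Continuous scores for the positive class, clf.classes_[1].
y_score = clf.decision_function(X_test)

# Points of the ROC curve, one per decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_score, pos_label=clf.classes_[1])

# Area under the curve, computed two equivalent ways.
roc_auc = auc(fpr, tpr)
roc_auc_direct = roc_auc_score(y_test, y_score)
print(f"AUC = {roc_auc:.3f}")
```

The resulting `fpr` and `tpr` arrays can be passed to `RocCurveDisplay` exactly as in the snippet above.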

In the multiclass case it is not that simple. If you have 3 classes, you could draw the ROC-AUC curve in 3D. See the resources here.

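What you can do instead, and what is simpler, is to draw 5 one-vs-rest curves, one per class. Basically, each class gets its own binary problem: that class against all the others.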
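
To make the one-vs-rest idea concrete, here is a small sketch of what "one binary problem per class" means for the labels. The label array is hypothetical; `label_binarize` is scikit-learn's helper for this transformation.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Hypothetical integer labels for a 5-class problem (classes 0 to 4).
y_true = np.array([0, 2, 1, 4, 3, 2, 0])

# One column per class: 1 where the sample belongs to that class, 0 otherwise.
y_bin = label_binarize(y_true, classes=[0, 1, 2, 3, 4])

# Column 2 is the binary target for "class 2 versus the rest".
print(y_bin[:, 2])
```

Each column of `y_bin` is then treated as an ordinary binary target and paired with the predicted probability of the same class.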

In your case:

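import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import RocCurveDisplay, roc_curve

# all the same up until now
prediction = learn.predict(x_test_tfidf)
proba = learn.predict_proba(x_test_tfidf)  # one probability column per class
print("Accuracy of Multinomial Naive Bayes classifier:", accuracy_score(test.target, prediction) * 100)

# Five classes need more than a 2x2 grid; a 2x3 grid leaves one panel unused.
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for i in range(len(categories)):
    y_test_bin = np.int32(test.target == i)  # 1 for class i, 0 for the rest
    y_score = proba[:, i]
    # The positive label is 1, because y_test_bin marks class i with 1.
    fpr, tpr, _ = roc_curve(y_test_bin, y_score, pos_label=1)
    ax = axes.ravel()[i]
    RocCurveDisplay(fpr=fpr, tpr=tpr).plot(ax=ax)  # draw into this subplot
    ax.set_title(test.target_names[i])
axes.ravel()[-1].axis("off")  # hide the unused sixth panel
plt.tight_layout()
plt.show()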
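
If a single figure is easier to read than one panel per class, the same one-vs-rest curves can be overlaid on one set of axes and summarised with a macro-averaged AUC. The following is only a sketch under stated assumptions: it uses synthetic 5-class data in place of the TF-IDF features, and `GaussianNB` in place of `MultinomialNB`, because the synthetic features can be negative. All variable names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 5-class data standing in for the TF-IDF features (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = GaussianNB().fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes)

fig, ax = plt.subplots()
per_class_auc = {}
for i, label in enumerate(clf.classes_):
    # Binary target: 1 for the current class, 0 for all other classes.
    y_bin = (y_test == label).astype(int)
    fpr, tpr, _ = roc_curve(y_bin, proba[:, i])
    per_class_auc[label] = auc(fpr, tpr)
    ax.plot(fpr, tpr, label=f"class {label} (AUC = {per_class_auc[label]:.2f})")

# Diagonal reference line for a classifier that guesses at random.
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")

# Macro average: the unweighted mean of the per-class one-vs-rest AUCs.
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(f"macro one-vs-rest AUC = {macro_auc:.3f}")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the figure. With the question's model, `proba` would come from `learn.predict_proba(x_test_tfidf)` and `y_test` from `test.target`.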