How to interpret a confusion matrix

machine-learning  validation  predictive-models  prediction  confusion-matrix
2022-02-14 21:25:11

I am using a confusion matrix to check the performance of my classifier.

I am using Scikit-Learn, and I am a little confused. How do I interpret the result?

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

How can I decide whether these predicted values are good or not?

3 Answers

A confusion matrix is a way of tabulating the number of misclassifications, i.e., the number of predicted classes that ended up in the wrong classification bin based on the true classes.

While sklearn.metrics.confusion_matrix provides a numeric matrix, I find it more useful to generate a "report" using the following:

import pandas as pd
y_true = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

This results in:

Predicted  0  1  2  All
True                   
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12

This allows us to see:

  1. The diagonal elements show the number of correct classifications for each class: 3, 1, and 3 for classes 0, 1, and 2.
  2. The off-diagonal elements give the misclassifications: for example, 2 samples of class 2 were misclassified as 0, no samples of class 0 were misclassified as 2, and so on.
  3. The total number of samples in each class in y_true and y_pred, from the "All" subtotals.
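Points 1 and 3 together also give per-class recall (the fraction of each true class classified correctly), which can be read straight off the crosstab. A minimal sketch using the same toy data:

```python
import numpy as np
import pandas as pd

y_true = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

ct = pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'])

# Diagonal entries are the correct classifications per class;
# row sums are the total true samples of each class
recall = pd.Series(np.diag(ct), index=ct.index) / ct.sum(axis=1)
print(recall)
```

Here class 0 is recovered perfectly (recall 1.0), while only half of class 2 is classified correctly.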

This method also works for text labels, and for large numbers of samples in the dataset it can be extended to give percentage reports.

import numpy as np
import pandas as pd

# create some data
lookup = {0: 'biscuit', 1:'candy', 2:'chocolate', 3:'praline', 4:'cake', 5:'shortbread'}
y_true = pd.Series([lookup[_] for _ in np.random.randint(0, 6, size=100)])
y_pred = pd.Series([lookup[_] for _ in np.random.randint(0, 6, size=100)])

pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted']).apply(lambda r: 100.0 * r/r.sum())

The output is then:

Predicted     biscuit  cake      candy  chocolate    praline  shortbread
True                                                                    
biscuit     23.529412    10  23.076923  13.333333  15.384615    9.090909
cake        17.647059    20   0.000000  26.666667  15.384615   18.181818
candy       11.764706    20  23.076923  13.333333  23.076923   31.818182
chocolate   11.764706     5  15.384615   6.666667  15.384615   13.636364
praline     17.647059    10  30.769231  20.000000   0.000000   13.636364
shortbread  17.647059    35   7.692308  20.000000  30.769231   13.636364

The numbers now represent the percentage of classification results, rather than the number of cases.
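As an aside, pd.crosstab also takes a normalize argument (assuming a reasonably recent pandas version), which gives the same column-percentage view without the apply:

```python
import pandas as pd

y_true = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

# normalize='columns' divides each column by its total;
# 'index' would normalize by true class instead
pct = 100.0 * pd.crosstab(y_true, y_pred, rownames=['True'],
                          colnames=['Predicted'], normalize='columns')
print(pct)
```

Each column now sums to 100, i.e., the table shows how each predicted class breaks down over the true classes.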

Though note that the sklearn.metrics.confusion_matrix output can be visualized directly with:

import sklearn.metrics
import matplotlib.pyplot as plt

conf = sklearn.metrics.confusion_matrix(y_true, y_pred)
plt.imshow(conf, cmap='binary', interpolation='None')
plt.show()

On the y-axis the confusion matrix has the actual values, and on the x-axis the values given by the predictor. Thus, the counts on the diagonal are the number of correct predictions, while the off-diagonal elements are the incorrect predictions.
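That observation gives a one-line overall accuracy: the sum of the diagonal counts divided by the total. A minimal sketch using the question's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

conf = confusion_matrix(y_true, y_pred)

# Diagonal = correct predictions; trace / total = overall accuracy
accuracy = np.trace(conf) / conf.sum()
print(accuracy)  # 4 of the 6 predictions are correct
```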

In your case:

>>> confusion_matrix(y_true, y_pred)
    array([[2, 0, 0],  # two zeros were predicted as zeros
           [0, 0, 1],  # one 1 was predicted as 2
           [1, 0, 2]]) # two 2s were predicted as 2, and one 2 was predicted as 0
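If you want per-class precision and recall rather than raw counts, scikit-learn's classification_report summarizes the same predictions. Note that class 1 receives no predictions here, so its precision is undefined; passing zero_division=0 (available in reasonably recent scikit-learn versions) silences the warning:

```python
from sklearn.metrics import classification_report

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Per-class precision, recall, and F1 from the same predictions
print(classification_report(y_true, y_pred, zero_division=0))
```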

I would like to stress the need to understand this graphically. It is a simple matrix, but it needs to be well understood before drawing conclusions. So here is a simplified, more interpretable version of the answers above.

        0  1  2   <- Predicted
     0 [2, 0, 0]  
TRUE 1 [0, 0, 1]  
     2 [1, 0, 2] 

# At 0,0: True value was 0, Predicted value was 0, - 2 times predicted
# At 1,1: True value was 1, Predicted value was 1, - 0 times predicted
# At 2,2: True value was 2, Predicted value was 2, - 2 times predicted
# At 1,2: True value was 1, Predicted value was 2, - 1 time predicted
# At 2,0: True value was 2, Predicted value was 0, - 1 time predicted...
# ...and so on

And, as my friend @fu DL asked, here is the code:

from sklearn.metrics import confusion_matrix

Y_true = [0,0,0,1,1,1,2,2,0,1,2]
Y_pred = [0,0,1,1,1,2,2,2,0,0,0]

confusion = confusion_matrix(Y_true, Y_pred)

# PUT YOUR DESIRED LABELS HERE... 
row_label = "True"
col_label = "Predicted"

# For printing the row label next to the middle row
col_space = len(row_label)
index_middle = len(set(Y_true)) // 2

# Print the header row of class labels
print(" " * (col_space + 4), "  ".join(str(i) for i in sorted(set(Y_true))), " <-  {}".format(col_label))

# Print the rest of the table
for index in range(len(set(Y_true))):
    if index == index_middle:
        print(row_label, " ", index, confusion[index])
    else:
        print(" " * (col_space + 2), index, confusion[index])