How to print a confusion matrix from a random forest in Python

Tags: data mining, random forest

2021-09-17 10:51:57

I applied this random forest algorithm to predict a specific crime type. I took the example from this article.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib
import matplotlib.pyplot as plt
import sklearn
from scipy import stats
from sklearn.cluster import KMeans
import seaborn as sns
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'


features = pd.read_csv('prueba2.csv',sep=';')
print (features.head(5))

# Labels are the values we want to predict
labels = np.array(features['target'])
# Remove the labels from the features
# axis 1 refers to the columns
features= features.drop('target', axis = 1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)


# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)



baseline_preds = test_features[:, feature_list.index('Violent crime')]
# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - test_labels)
print('Error: ', round(np.mean(baseline_errors), 2))

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Promedio del error absoluto:', round(np.mean(errors), 2), ' Porcentaje.')


# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Precision:', round(accuracy, 2), '%.')

# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];

# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')

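So my question is: how can I add a confusion matrix to measure the accuracy? I tried the example from here, but it does not work. I get the following error:

Any suggestions?

3 Answers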

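Given your current code and task, a confusion matrix does not make sense. A confusion matrix shows how well a model classifies samples, that is, how often it assigns them to the correct class. Your problem (as the author in your link states) is a regression problem, because you are predicting a continuous variable (temperature). See more information here.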
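To make the distinction concrete, here is a minimal sketch on made-up numbers (the arrays are illustrative, not the data from the question). Passing continuous regression output to confusion_matrix raises a ValueError, while a regression metric such as mean_absolute_error accepts it. The error message in the question is not shown, so this may or may not be the same failure.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, mean_absolute_error

# Made-up continuous targets and regressor outputs (illustrative only)
y_true = np.array([10.5, 12.1, 9.8, 11.3])
y_pred = np.array([10.9, 11.7, 10.2, 11.0])

# A confusion matrix needs discrete class labels, so this call fails
try:
    confusion_matrix(y_true, y_pred)
    raised = False
except ValueError as err:
    raised = True
    print("confusion_matrix failed:", err)

# A regression metric is the appropriate way to score continuous predictions
mae = mean_absolute_error(y_true, y_pred)
print("Mean absolute error:", round(mae, 3))
```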

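In general, if you do have a classification task, printing the confusion matrix is as simple as calling the sklearn.metrics.confusion_matrix function.

As input, it takes the correct values and your predictions: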

from sklearn.metrics import confusion_matrix

# labels: true classes; predictions: predicted classes for the same samples
conf_mat = confusion_matrix(labels, predictions)
print(conf_mat)
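Since the question asks about measuring accuracy, a small worked example may help in reading the result. The six labels below are made up for illustration. Rows correspond to the true class and columns to the predicted class, so the diagonal counts the correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted class labels for six samples
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

# Rows are the true classes, columns are the predicted classes
conf_mat = confusion_matrix(true_labels, predicted_labels)
print(conf_mat)

# Correct predictions sit on the diagonal, so accuracy = trace / total
accuracy = np.trace(conf_mat) / conf_mat.sum()
print("Accuracy:", round(accuracy, 3))
```

Here four of the six samples land on the diagonal, giving an accuracy of 4/6.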

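You could consider changing your task into a classification problem, for example by grouping the temperatures into classes that each cover a given range.

You could, say, convert the target temperature into new_target_class, then change your code to use RandomForestClassifier.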

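I did a quick and dirty conversion on the same data that is linked in that article, see here. I basically used the minimum and maximum of the target variable to set a range, aimed for 10 different temperature classes, and created a new column in the table that assigns a class to each row. The top of the table looks like this:

[Image: first rows of the converted table]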
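The conversion described above could be sketched as follows. This is an assumption about how such a binning might be done, not the answerer's actual code: the data frame is made up, the column name target follows the question's code, and new_target_class is the name used in this answer.

```python
import numpy as np
import pandas as pd

# Made-up continuous target values standing in for the real column
df = pd.DataFrame({"target": [41.0, 55.5, 63.2, 70.1, 48.7, 92.0, 77.4, 35.0]})

n_classes = 10
# Ten equal-width intervals between the smallest and largest target value
bin_edges = np.linspace(df["target"].min(), df["target"].max(), n_classes + 1)

# labels=False returns the integer bin index (0..9) instead of an interval;
# include_lowest=True keeps the minimum value inside the first bin
df["new_target_class"] = pd.cut(
    df["target"], bins=bin_edges, labels=False, include_lowest=True
)
print(df)
```

Equal-width bins are the simplest choice; if the target is heavily skewed, some classes may end up with very few rows.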

If you can make these predictions with a RandomForestClassifier, you can run the confusion matrix code above on the results.
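Put together, the classification workflow might look like the sketch below. The data is synthetic (generated with make_classification as a stand-in for a binned target), and the variable names mirror those in the question. Note that the predictions are compared with the labels of the same test samples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the binned target (illustrative only)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)
train_features, test_features, train_labels, test_labels = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A classifier, not a regressor, so the predictions are discrete classes
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(train_features, train_labels)

# Compare predictions with the true labels of the same (test) samples
predictions = clf.predict(test_features)
conf_mat = confusion_matrix(test_labels, predictions)
print(conf_mat)
print("Accuracy:", clf.score(test_features, test_labels))
```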

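Adding to @n1k31t4's answer: it helps to inspect the confusion matrix visually as a heatmap, especially when you are dealing with a multi-class classification task: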

# Compute the classical confusion matrix
from sklearn.metrics import confusion_matrix
CM = confusion_matrix(labels, predictions)
print(CM)

# Visualize it as a heatmap
import matplotlib.pyplot as plt
import seaborn
seaborn.heatmap(CM)
plt.show()

Heatmaps are very useful for large, complex confusion matrices. They let you take in the data more easily than a plain table of numbers.
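A heatmap is often easier to read with cell annotations and row-normalized values, so that each cell shows the share of a true class assigned to each predicted class. The following is a sketch on made-up labels; the output file name and the Agg backend are assumptions chosen so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for ten samples
true_labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]
predicted_labels = [0, 0, 1, 1, 1, 2, 2, 0, 2, 2]

cm = confusion_matrix(true_labels, predicted_labels)
# Divide each row by its total so every cell is a share of that true class
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

# annot=True writes the value into each cell; fmt controls its formatting
ax = sns.heatmap(cm_normalized, annot=True, fmt=".2f", cmap="Blues", vmin=0, vmax=1)
ax.set_xlabel("Predicted class")
ax.set_ylabel("True class")
plt.savefig("confusion_matrix_heatmap.png")  # hypothetical output file name
```

In an interactive session, drop the matplotlib.use line and call plt.show() instead of plt.savefig.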

Below is an example found here:

[Image: example confusion matrix heatmap]

You can also plot it with scikit-learn's built-in plot_confusion_matrix (this function was deprecated in scikit-learn 1.0 and removed in 1.2).

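import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

# Row-normalized matrix: each row (true class) sums to 1
conf_mat = confusion_matrix(labels, predictions, normalize='true')
print(conf_mat)

# classifier is a fitted classifier; X_test and y_test are the held-out data;
# class_names lists the class labels to show on the axes
disp = plot_confusion_matrix(classifier, X_test, y_test,
                             display_labels=class_names, cmap=plt.cm.Blues, normalize='true')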
