Visualizing feature selection results across multiple classifiers and feature subset sizes

data-mining python visualization matplotlib

2022-02-12 18:41:14

I am using the information gain feature selection technique to obtain feature subsets of different sizes for my dataset, as follows:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Subset 1: the 10 features with the highest mutual information
fs1 = SelectKBest(score_func=mutual_info_classif, k=10)
fs1.fit(X_train, y_train)
X_train_fs1 = fs1.transform(X_train)
X_test_fs1 = fs1.transform(X_test)

# Subset 2: the top 20 features
fs2 = SelectKBest(score_func=mutual_info_classif, k=20)
fs2.fit(X_train, y_train)
X_train_fs2 = fs2.transform(X_train)
X_test_fs2 = fs2.transform(X_test)

# Subset 3: the top 30 features
fs3 = SelectKBest(score_func=mutual_info_classif, k=30)
fs3.fit(X_train, y_train)
X_train_fs3 = fs3.transform(X_train)
X_test_fs3 = fs3.transform(X_test)
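
The three blocks above differ only in k, so they could be collapsed into a loop. The following is a minimal sketch, not the asker's code: the helper name select_subsets is hypothetical, and synthetic data from make_classification stands in for the real X_train and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split


def select_subsets(X_train, y_train, X_test, ks):
    """Return {k: (X_train_k, X_test_k)} for each subset size in ks (hypothetical helper)."""
    subsets = {}
    for k in ks:
        fs = SelectKBest(score_func=mutual_info_classif, k=k)
        # Fit on the training split only, then apply the same selection to both splits.
        fs.fit(X_train, y_train)
        subsets[k] = (fs.transform(X_train), fs.transform(X_test))
    return subsets


# Synthetic stand-in data (an assumption; replace with your own dataset).
X, y = make_classification(n_samples=200, n_features=40, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

subsets = select_subsets(X_train, y_train, X_test, ks=[10, 20, 30])
for k, (X_tr, X_te) in subsets.items():
    print(k, X_tr.shape, X_te.shape)
```

Keying the result by k keeps each subset size attached to its data, which helps later when the subset sizes become the x-axis of a plot.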

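I then use the feature subsets of different sizes to test the performance of 4 different algorithms (logistic regression, SVM, AdaBoost, and decision tree). Subset 1 has k=10, so 10 features; subset 2 has 20 features; and so on. To evaluate the models, I compute precision, recall, and AUC as follows: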

from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

def compareAlgorithms(X_train, y_train, score):
    # Compare algorithms with 5-fold stratified cross-validation
    seed = 7

    # prepare models
    models = [
        ('LR', LogisticRegression()),
        ('SVM', SVC()),
        ('Linear SVC', LinearSVC()),
        ('ADABOOST', AdaBoostClassifier()),
        ('DT', DecisionTreeClassifier()),
    ]

    # random_state only has an effect together with shuffle=True;
    # recent scikit-learn versions raise a ValueError otherwise
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # evaluate each model in turn
    results = []
    names = []
    print(score, ":")

    for name, model in models:
        cv_results = cross_val_score(model, X_train, y_train, cv=skf, scoring=score)
        results.append(cv_results)
        names.append(name)
        print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))

    return results, names
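
Since precision, recall, and AUC are all wanted, calling the function once per metric repeats the cross-validation three times. As a possible alternative, scikit-learn's cross_validate accepts a list of scorers and returns all of them from one run. A minimal sketch, assuming synthetic data and a single model for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (an assumption; replace with your own X_train, y_train).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
metrics = ["precision", "recall", "roc_auc"]

# One cross-validation run that scores every metric on the same folds.
scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=skf, scoring=metrics)

# cross_validate stores each metric's per-fold scores under the key "test_<metric>".
mean_scores = {m: scores["test_" + m].mean() for m in metrics}
for m, value in mean_scores.items():
    print("%s: %f" % (m, value))
```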

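Since I now have a lot of results, I am trying to plot them to better see which algorithm performs better on which subset. I would like to create a plot similar to the one I found in this paper: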

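I tried to do this with matplotlib, but found it difficult because I am trying to plot different classifiers across different feature subsets. I can (sort of) draw a line plot of the algorithms' performance on a single data subset with this function: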

import matplotlib.pyplot as plt

def plot(results, names, score):
    # plot for algorithm comparison
    fig = plt.figure()
    fig.suptitle(score)
    ax = fig.add_subplot(111)
    ax.plot(results)
    # fix the tick positions before labelling them, one tick per model
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names)
    plt.show()

This produces the following plot:

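The problem with the plot above (apart from the overlapping model names, which I am fixing) is that it covers only one feature subset.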

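Can anyone help me make a plot like the one in the paper I attached, and perhaps point me to useful resources for someone who is just starting to learn data visualization?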

Many thanks.

4 Answers

You will need to learn and understand Matplotlib and its customization options.

This code will do the basic job, and you can extend it. Also, please go through the references added at the end.

import matplotlib.pyplot as plt

# example scores per metric and model, one value per feature subset
data = {'AUC': {'RF': [0.7, 0.2, 0.5, 0.9, 0.4],
                'LR': [0.9, 0.25, 0.35, 0.99, 0.55],
                'SVM': [0.3, 0.5, 0.8, 0.6, 0.7]}}
x = ['S1', 'S2', 'S3', 'S4', 'S5']

plt.plot(x, data['AUC']['RF'], marker='^', linestyle='solid', label='RF')
plt.plot(x, data['AUC']['LR'], marker='o', color='r', linestyle='dashed', label='LR')
plt.plot(x, data['AUC']['SVM'], marker='s', color='b', linestyle='dashdot', label='SVM')
plt.legend()
plt.show()

References:
Matplotlib official tutorials
Python Data Science Handbook by Jake VanderPlas
Matplotlib Lines
Matplotlib Markers
Matplotlib Colors

Try making these modifications to your function; it will probably look better.

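import matplotlib.pyplot as plt
# optional but I like this style
# (named "seaborn-whitegrid" before matplotlib 3.6)
# plt.style.use("seaborn-v0_8-whitegrid")

def plot(results, names, score):
    # line plot for algorithm comparison
    fig = plt.figure()
    fig.suptitle(score)
    ax = fig.add_subplot(111)
    # one line per model, so that each legend entry matches its own scores
    for name, model_scores in zip(names, results):
        ax.plot(model_scores, label=name, marker="o", linestyle="--")
    ax.set_ylabel(score)
    ax.legend(loc="best")
    plt.show()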

I am not sure what types score and names are here, but if both hold strings, this will work as I suggested.

If you restructure your data slightly, this is relatively simple:

import matplotlib.pyplot as plt

data = {
    "LR": [0.6, 0.7, 0.8, 0.7],
    "SVM": [0.7, 0.6, 0.8, 0.5],
    "Linear SVC": [0.8, 0.5, 0.7, 0.6],
    "ADABOOST": [0.7, 0.8, 0.6, 0.7],
    "DT": [0.6, 0.8, 0.5, 0.7]
}
subsets = [3, 5, 10, 20]

for model in data:
    plt.plot(subsets, data[model])

You can add a legend with plt.legend and set the axis titles with plt.xlabel / plt.ylabel.
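
As a sketch of how the legend and axis titles might be added, the snippet below reuses the placeholder scores from this answer and uses the equivalent Axes methods (ax.legend, ax.set_xlabel, ax.set_ylabel). The axis label texts are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt

# Placeholder scores from the answer above: one list per model, one value per subset size.
data = {
    "LR": [0.6, 0.7, 0.8, 0.7],
    "SVM": [0.7, 0.6, 0.8, 0.5],
    "Linear SVC": [0.8, 0.5, 0.7, 0.6],
    "ADABOOST": [0.7, 0.8, 0.6, 0.7],
    "DT": [0.6, 0.8, 0.5, 0.7],
}
subsets = [3, 5, 10, 20]

fig, ax = plt.subplots()
for model, scores in data.items():
    # The label passed here is what the legend displays for this line.
    ax.plot(subsets, scores, marker="o", label=model)

ax.set_xlabel("Number of selected features")  # assumed label text
ax.set_ylabel("Score")                        # assumed label text
ax.set_xticks(subsets)                        # tick only at the subset sizes actually used
ax.legend(loc="best")
```

Call plt.show() or fig.savefig(...) afterwards to display or store the figure.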

Thank you for all the suggested answers; they helped. What I ended up doing is the following:

First, I changed the function that compares the different models, as follows:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

def compareAlgorithmsFeatureSelection(X_train, y_train):

    # prepare configuration for cross validation test harness
    seed = 7

    # prepare models
    models = [
        ('LR', LogisticRegression()),
        ('SVM', SVC()),
        ('Linear SVC', LinearSVC()),
        ('ADABOOST', AdaBoostClassifier()),
        ('DT', DecisionTreeClassifier()),
    ]

    names = []
    precision_results = []
    recall_results = []
    auc_results = []

    NumOfKFeatures = [10, 20, 30, 40, 50]

    for name, model in models:
        names.append(name)
        precision_model_results = []
        recall_model_results = []
        auc_model_results = []

        for k in NumOfKFeatures:
            # select the top-k features; only the training data is needed
            # for cross-validation, so X_test is not transformed here
            fs = SelectKBest(score_func=mutual_info_classif, k=k)
            fs.fit(X_train, y_train)
            X_train_fs = fs.transform(X_train)

            # make splits for cross-validation; random_state requires shuffle=True
            skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

            # calculate scores
            precision_cv_results = cross_val_score(model, X_train_fs, y_train,
                                                   cv=skf, scoring='precision')
            recall_cv_results = cross_val_score(model, X_train_fs, y_train,
                                                cv=skf, scoring='recall')
            auc_cv_results = cross_val_score(model, X_train_fs, y_train,
                                             cv=skf, scoring='roc_auc')

            precision_model_results.append(precision_cv_results.mean())
            recall_model_results.append(recall_cv_results.mean())
            auc_model_results.append(auc_cv_results.mean())

        # append scores to the final results lists only after an algorithm
        # has been evaluated on all the feature subsets
        precision_results.append(precision_model_results)
        recall_results.append(recall_model_results)
        auc_results.append(auc_model_results)

    return precision_results, recall_results, auc_results, names
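
One caveat with the function above: SelectKBest is fitted on all of X_train before cross-validation, so every validation fold has already influenced which features were kept, and the scores may be somewhat optimistic. A possible remedy, sketched here with synthetic data and a single model (both assumptions), is to put the selector inside a scikit-learn Pipeline so that it is refitted on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; replace with your own X_train, y_train).
X, y = make_classification(n_samples=200, n_features=40, n_informative=10, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

auc_by_k = {}
for k in [10, 20, 30]:
    # The selector is a pipeline step, so cross_val_score refits it inside every fold.
    pipe = Pipeline([
        ("select", SelectKBest(score_func=mutual_info_classif, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    auc_by_k[k] = cross_val_score(pipe, X, y, cv=skf, scoring="roc_auc").mean()

print(auc_by_k)
```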

Then, using the tips from the answers by @Oxbowerce, @10xAI, and @Julio Jesus, I did the following to plot the scores for each feature subset:


import matplotlib.pyplot as plt
# named "seaborn-whitegrid" before matplotlib 3.6
plt.style.use("seaborn-v0_8-whitegrid")

subsets = [10, 20, 30, 40, 50]

fig = plt.figure()
fig.suptitle("Precision")
ax = fig.add_subplot(111)

# one line per model: precision_results holds one list of scores per model
for name, model_scores in zip(names, precision_results):
    ax.plot(subsets, model_scores, label=name, marker="o", linestyle="--")

ax.set_ylabel("Precision")
ax.set_xlabel("Feature Selection Subsets")
ax.legend(loc="best")
plt.show()

To plot recall and AUC, I simply replace precision_results with the corresponding results list.
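
Instead of editing the script once per metric, the three plots could be drawn side by side from a single helper. This is a minimal sketch: the function name plot_metrics is hypothetical, and the numbers are made-up placeholders standing in for precision_results, recall_results, and auc_results.

```python
import matplotlib.pyplot as plt


def plot_metrics(results_by_metric, names, subsets):
    """Draw one panel per metric and one line per model (hypothetical helper)."""
    n = len(results_by_metric)
    # squeeze=False keeps axes two-dimensional even when there is only one metric.
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 4), sharey=True, squeeze=False)
    for ax, (metric, results) in zip(axes[0], results_by_metric.items()):
        for name, scores in zip(names, results):
            ax.plot(subsets, scores, label=name, marker="o", linestyle="--")
        ax.set_title(metric)
        ax.set_xlabel("Feature Selection Subsets")
        ax.set_xticks(subsets)
    axes[0][0].set_ylabel("Mean CV score")
    axes[0][0].legend(loc="best")
    return fig, axes


# Made-up placeholder scores: one inner list per model, one value per subset size.
names = ["LR", "DT"]
subsets = [10, 20, 30]
results_by_metric = {
    "Precision": [[0.60, 0.70, 0.80], [0.50, 0.60, 0.60]],
    "Recall": [[0.50, 0.60, 0.70], [0.60, 0.60, 0.50]],
    "AUC": [[0.70, 0.80, 0.90], [0.60, 0.70, 0.70]],
}
fig, axes = plot_metrics(results_by_metric, names, subsets)
```

With the real data, results_by_metric would map each metric name to the matching list returned by compareAlgorithmsFeatureSelection; call plt.show() afterwards to display the figure.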
