Inconsistency when using cross_val_score in Python

data-mining python scikit-learn dataset svm

2022-03-09 15:46:34

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.svm import SVC

#import data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names) 
y = iris.target

#split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
 test_size = 0.2, shuffle=False)

#modeling using SVM
model = SVC(kernel='linear', C = 1)
model.fit(X_train, y_train)
model.score(X_test, y_test)

# here I get score: 0.8666666

# Now I use cross_validation with cv = 5,

from sklearn.model_selection import cross_val_score
model = SVC(kernel='linear', C = 1)
scores = cross_val_score(model, X, y, cv = 5, scoring = "accuracy")
scores

#here I got array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

None of the numbers in the array above equals 0.866666, and I would like to know why this inconsistency occurs (since `cv = 5` matches the condition `test_size = 0.2`).

2 Answers

One reason may be your train_test_split() settings. Using shuffle=False means you simply train on the first 80% of the examples, with no randomness. Look at your labels:

>>> y
Out: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

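As you can see, the iris dataset is sorted by class from 0 to 2. This means the training set produced by your train_test_split() is imbalanced, because class 2 is underrepresented: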

>>> y_train
Out: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Even worse, your test data contains only one class:

>>> y_test
Out: 
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2])
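
The class counts behind the two arrays above can be checked directly. This is a minimal sketch, assuming the same iris data and the same unshuffled split as in the question; np.bincount and the variable names train_counts and test_counts are only used here for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the question: no shuffling, last 20% held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

# Count how many samples of each class (0, 1, 2) end up in each part.
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)

print("train:", train_counts)  # 50 of class 0, 50 of class 1, 20 of class 2
print("test: ", test_counts)   # 30 of class 2 only
```

So the model is evaluated only on class 2, of which it has seen just 20 of the 50 available examples.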

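Cross-validation, on the other hand, uses stratified splits that preserve the class distribution in every fold, as described in the sklearn User Guide: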

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
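
To see what this rule means for the iris data, the folds can be inspected by hand. The sketch below assumes that cv = 5 with an SVC resolves to an unshuffled StratifiedKFold(n_splits=5), as the quoted passage states; the fold_counts list is a hypothetical helper for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# The splitter that cv=5 is assumed to resolve to for a classifier.
skf = StratifiedKFold(n_splits=5)

fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Class counts (0, 1, 2) in the test part of this fold.
    counts = np.bincount(y[test_idx], minlength=3)
    fold_counts.append(counts.tolist())
    print(counts)
```

Every test fold contains 10 samples of each class, so each fold is scored on a balanced sample rather than on class 2 alone.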

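I would expect this to lead to better results. Giving train_test_split() the same advantage of balanced classes (by stratifying on the labels) also gives me a different result: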

>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
 test_size = 0.2, stratify=y)

runfile(...)

>>> model.score(X_test, y_test)
Out: 1.0

The inconsistency occurs because cross_val_score may compute different train/test splits than train_test_split, especially since you use shuffle=False.

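Whether they should be the same is a different question. To make sure the splits are identical, first, I would use a random state to control the randomness. Second, cross_val_score has a cv parameter that lets you define the splits yourself.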
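
One way to line the two evaluations up, sketched under the assumption that an unshuffled, unstratified KFold(n_splits=5) is passed as cv: its last fold tests on the final 30 rows, which is the same hold-out set that train_test_split(test_size=0.2, shuffle=False) produces. The names holdout_score and cv_splitter are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold-out evaluation exactly as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)
holdout_score = SVC(kernel="linear", C=1).fit(X_train, y_train).score(X_test, y_test)

# Plain KFold instead of the default StratifiedKFold: no shuffling, no stratification,
# so the fifth fold trains on rows 0..119 and tests on rows 120..149.
cv_splitter = KFold(n_splits=5, shuffle=False)
scores = cross_val_score(SVC(kernel="linear", C=1), X, y,
                         cv=cv_splitter, scoring="accuracy")

# The last fold uses the same rows as the hold-out split above.
print(holdout_score, scores[-1])
```

If shuffling is wanted instead, KFold(n_splits=5, shuffle=True, random_state=...) with a fixed integer seed keeps the folds reproducible between runs.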
