I have been playing with a toy problem to compare the performance and behaviour of several scikit-learn classifiers.
In short, I have one continuous variable X (containing two samples of size N, each drawn from a different normal distribution) and a corresponding label y (either 0 or 1).
X is built as follows:
# Subpopulation 1
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)
# Subpopulation 2
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)
# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))
n1, n2: number of data points in each subpopulation;
mu1, sigma1, mu2, sigma2: mean and standard deviation of each population the samples are drawn from.
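For example, with the values I use later (n1 = n2 = 250), the merged data ends up as one feature column plus one label vector; a quick self-contained shape check:
import numpy as np
n1 = n2 = 250
mu1 = mu2 = 7.0
sigma1 = sigma2 = 3.0
s1 = np.random.normal(mu1, sigma1, n1)
s2 = np.random.normal(mu2, sigma2, n2)
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)  # single-feature column vector
y = np.concatenate((np.zeros(n1), np.ones(n2)))      # binary labels
print(X.shape, y.shape)  # (500, 1) (500,)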
I then split X and y into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
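Since I pass stratify=y, both splits should preserve the 50/50 label balance; a quick way to confirm that (np.bincount needs integer labels, hence the cast):
import numpy as np
print(np.bincount(y_train.astype(int)))  # roughly [188 187] out of 375 training points
print(np.bincount(y_test.astype(int)))   # roughly [62 63] out of 125 test points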
然后我拟合了一系列模型,例如:
from sklearn import svm
clf = svm.SVC()
# Fit
clf.fit(X_train, y_train)
or, alternatively (the full list is in the table at the end):
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# Fit
rfc.fit(X_train, y_train)
For all the models, I then compute the accuracy on the training set and the test set. To do so, I implemented the following function:
def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return accuracies
    return a_train, a_test
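(The same numbers can also be obtained with scikit-learn's built-in sklearn.metrics.accuracy_score, which the function above reimplements by hand; an equivalent version, with the helper name just for illustration:)
from sklearn.metrics import accuracy_score
def apply_model_and_calc_accuracies_sk(model):
    # accuracy_score returns a fraction in [0, 1]; scale to percent
    a_train = 100 * accuracy_score(y_train, model.predict(X_train))
    a_test = 100 * accuracy_score(y_test, model.predict(X_test))
    return a_train, a_test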
I compare the algorithms by changing n1, n2, mu1, sigma1, mu2 and sigma2 and checking the accuracies on the training and test set. I initialise the classifiers with their default parameters.
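In sketch form, the whole comparison looks roughly like this (a minimal sketch; the classifier list matches the table below, all with default parameters):
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(),
    "Stochastic Gradient Descent": SGDClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Multi-layer Perceptron": MLPClassifier(),
}
for name, model in models.items():
    # Fit on the training split only, then score both splits
    model.fit(X_train, y_train)
    a_train, a_test = apply_model_and_calc_accuracies(model)
    print(f"{name}: train = {a_train:.1f}%, test = {a_test:.1f}%")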
Long story short: no matter which parameters I set, the Random Forest classifier always achieves 100% accuracy on the test set.
For instance, if I test the following parameters:
n1 = n2 = 250
mu1 = mu2 = 7.0
sigma1 = sigma2 = 3.0
then I am merging two completely overlapping subpopulations into X (each still carrying its correct label in y). My expectation for this experiment is that the classifiers can do no better than guessing, so I would expect test accuracies around 50%.
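Spelling that expectation out (the standard Bayes-error argument, assuming equal class priors):

$$p(x \mid y=0) = p(x \mid y=1) \quad\Rightarrow\quad P(y=1 \mid x) = P(y=1) = \tfrac{1}{2},$$

so x carries no information about y, and no classifier should beat 50% test accuracy except by chance fluctuations.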
Instead, this is what I get:
| Algorithm | Training accuracy % | Test accuracy % |
|-----------------------------|-------|---------|
| Support Vector Machine | 56.3 | 42.4 |
| Logistic Regression | 49.1 | 52.8 |
| Stochastic Gradient Descent | 50.1 | 50.4 |
| Gaussian Naive Bayes | 50.1 | 52.8 |
| Decision Tree | 100.0 | 51.2 |
| Random Forest | 100.0 | *100.0* |
| Multi-layer Perceptron | 50.1 | 49.6 |
I do not understand how this is possible. The Random Forest classifier never sees the test set during training, yet still classifies it with 100% accuracy.
Thanks for any insight!
As requested, I am pasting my code here (with only the first two classifiers I tested, and less verbose output):
import numpy as np
import sklearn
import matplotlib.pyplot as plt
# Seed
np.random.seed(42)
# Subpopulation 1
n1 = 250
mu1 = 7.0
sigma1 = 3.0
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)
# Subpopulation 2
n2 = 250
mu2 = 7.0
sigma2 = 3.0
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)
# Display the data
plt.plot(s1, np.zeros(n1), 'r.')
plt.plot(s2, np.ones(n2), 'b.')
# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))
# Split in training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
print(f"Train set contains {X_train.shape[0]} elements; test set contains {X_test.shape[0]} elements.")
# Display the test data
X_test_0 = X_test[y_test == 0]
X_test_1 = X_test[y_test == 1]
plt.plot(X_test_0, np.zeros(X_test_0.shape[0]), 'r.')
plt.plot(X_test_1, np.ones(X_test_1.shape[0]), 'b.')
# Define a convenience function
def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return accuracies
    return a_train, a_test
# Classify
# Use Decision Tree
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
# Fit
dtc.fit(X_train, y_train)
# Calculate accuracy on training and test set
a_train_dtc, a_test_dtc = apply_model_and_calc_accuracies(dtc)
# Report
print(f"Training accuracy = {a_train_dtc}%; test accuracy = {a_test_dtc}%")
# Use Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# Fit
rfc.fit(X, y)
# Calculate accuracy on training and test set
a_train_rfc, a_test_rfc = apply_model_and_calc_accuracies(rfc)
# Report
print(f"Training accuracy = {a_train_rfc}%; test accuracy = {a_test_rfc}%")