I have been playing with a toy problem to compare the performance and behaviour of several scikit-learn classifiers.
In short, I have one continuous variable X (containing two samples of size N, each drawn from a different normal distribution) and a corresponding label y (either 0 or 1).
X is built as follows:
# Subpopulation 1
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)
# Subpopulation 2
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)
# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))
n1, n2: number of data points in each subpopulation;
mu1, sigma1, mu2, sigma2: mean and standard deviation of each population the samples are drawn from.
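For example, with the values I use later (n1 = n2 = 250), the merged data ends up as one feature column plus one label vector; a quick self-contained shape check:
import numpy as np
n1 = n2 = 250
mu1 = mu2 = 7.0
sigma1 = sigma2 = 3.0
s1 = np.random.normal(mu1, sigma1, n1)
s2 = np.random.normal(mu2, sigma2, n2)
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)  # single-feature column vector
y = np.concatenate((np.zeros(n1), np.ones(n2)))      # binary labels
print(X.shape, y.shape)  # (500, 1) (500,)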
I then split X and y into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
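Since I pass stratify=y, both splits should preserve the 50/50 label balance; a quick way to confirm that (np.bincount needs integer labels, hence the cast):
import numpy as np
print(np.bincount(y_train.astype(int)))  # roughly [188 187] out of 375 training points
print(np.bincount(y_test.astype(int)))   # roughly [62 63] out of 125 test points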
然后我拟合了一系列模型,例如:
from sklearn import svm
clf = svm.SVC()
# Fit
clf.fit(X_train, y_train)
or, alternatively (the full list is in the table at the end):
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# Fit
rfc.fit(X_train, y_train)
For all the models, I then compute the accuracy on the training set and the test set. To do so, I implemented the following function:
def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return accuracies
    return a_train, a_test
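(The same numbers can also be obtained with scikit-learn's built-in sklearn.metrics.accuracy_score, which the function above reimplements by hand; an equivalent version, with the helper name just for illustration:)
from sklearn.metrics import accuracy_score
def apply_model_and_calc_accuracies_sk(model):
    # accuracy_score returns a fraction in [0, 1]; scale to percent
    a_train = 100 * accuracy_score(y_train, model.predict(X_train))
    a_test = 100 * accuracy_score(y_test, model.predict(X_test))
    return a_train, a_test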
I compare the algorithms by changing n1, n2, mu1, sigma1, mu2 and sigma2 and checking the accuracies on the training and test set. I initialise the classifiers with their default parameters.
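In sketch form, the whole comparison looks roughly like this (a minimal sketch; the classifier list matches the table below, all with default parameters):
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(),
    "Stochastic Gradient Descent": SGDClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Multi-layer Perceptron": MLPClassifier(),
}
for name, model in models.items():
    # Fit on the training split only, then score both splits
    model.fit(X_train, y_train)
    a_train, a_test = apply_model_and_calc_accuracies(model)
    print(f"{name}: train = {a_train:.1f}%, test = {a_test:.1f}%")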
Long story short: no matter which parameters I set, the Random Forest classifier always achieves 100% accuracy on the test set.
For instance, if I test the following parameters:
n1 = n2 = 250
mu1 = mu2 = 7.0
sigma1 = sigma2 = 3.0
then I am merging two completely overlapping subpopulations into X (each still carrying its correct label in y). My expectation for this experiment is that the classifiers can do no better than guessing, so I would expect test accuracies around 50%.
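Spelling that expectation out (the standard Bayes-error argument, assuming equal class priors):

$$p(x \mid y=0) = p(x \mid y=1) \quad\Rightarrow\quad P(y=1 \mid x) = P(y=1) = \tfrac{1}{2},$$

so x carries no information about y, and no classifier should beat 50% test accuracy except by chance fluctuations.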
Instead, this is what I get:
| Algorithm | Training accuracy % | Test accuracy % |
|-----------------------------|-------|---------|
| Support Vector Machine | 56.3 | 42.4 |
| Logistic Regression | 49.1 | 52.8 |
| Stochastic Gradient Descent | 50.1 | 50.4 |
| Gaussian Naive Bayes | 50.1 | 52.8 |
| Decision Tree | 100.0 | 51.2 |
| Random Forest | 100.0 | *100.0* |
| Multi-layer Perceptron | 50.1 | 49.6 |
I do not understand how this is possible. The Random Forest classifier never sees the test set during training, yet still classifies it with 100% accuracy.
Thanks for any insight!
As requested, I am pasting my code here (with only the first two classifiers I tested, and less verbose output):
import numpy as np
import sklearn
import matplotlib.pyplot as plt
# Seed
np.random.seed(42)
# Subpopulation 1
n1 = 250
mu1 = 7.0
sigma1 = 3.0
s1 = np.random.normal(mu1, sigma1, n1)
l1 = np.zeros(n1)
# Subpopulation 2
n2 = 250
mu2 = 7.0
sigma2 = 3.0
s2 = np.random.normal(mu2, sigma2, n2)
l2 = np.ones(n2)
# Display the data
plt.plot(s1, np.zeros(n1), 'r.')
plt.plot(s2, np.ones(n2), 'b.')
# Merge the subpopulations
X = np.concatenate((s1, s2), axis=0).reshape(-1, 1)
y = np.concatenate((l1, l2))
# Split in training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
print(f"Train set contains {X_train.shape[0]} elements; test set contains {X_test.shape[0]} elements.")
# Display the test data
X_test_0 = X_test[y_test == 0]
X_test_1 = X_test[y_test == 1]
plt.plot(X_test_0, np.zeros(X_test_0.shape[0]), 'r.')
plt.plot(X_test_1, np.ones(X_test_1.shape[0]), 'b.')
# Define a convenience function
def apply_model_and_calc_accuracies(model):
    # Calculate accuracy on training set
    y_train_hat = model.predict(X_train)
    a_train = 100 * sum(y_train == y_train_hat) / y_train.shape[0]
    # Calculate accuracy on test set
    y_test_hat = model.predict(X_test)
    a_test = 100 * sum(y_test == y_test_hat) / y_test.shape[0]
    # Return accuracies
    return a_train, a_test
# Classify
# Use Decision Tree
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
# Fit
dtc.fit(X_train, y_train)
# Calculate accuracy on training and test set
a_train_dtc, a_test_dtc = apply_model_and_calc_accuracies(dtc)
# Report
print(f"Training accuracy = {a_train_dtc}%; test accuracy = {a_test_dtc}%")
# Use Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# Fit
rfc.fit(X, y)
# Calculate accuracy on training and test set
a_train_rfc, a_test_rfc = apply_model_and_calc_accuracies(rfc)
# Report
print(f"Training accuracy = {a_train_rfc}%; test accuracy = {a_test_rfc}%")