Do unnecessary features hurt tree-based models?

Data mining, feature selection, random forest, decision tree, predictive importance
2021-10-04 20:28:24

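Is it necessary to remove noise features (for example, a column of random numbers) from the features of a tree-based model? I don't think so. Keeping them might occasionally help, but it should never harm the model, because at each split the model checks which feature reduces impurity the most. Sometimes a random column may happen to be that feature.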
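To make the split criterion mentioned above concrete, here is a minimal sketch of how the Gini impurity decrease of a candidate split can be computed, and how a random binary column can show a non-zero decrease on a small node purely by chance. The helper names `gini` and `impurity_decrease` are hypothetical and written for illustration; they are not scikit-learn functions, and the node size of 10 samples is an arbitrary assumption.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels (hypothetical helper)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(feature, labels):
    """Gini decrease from splitting a node on a binary (0/1) feature."""
    left, right = labels[feature == 0], labels[feature == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=10)  # a small node: 10 samples, 3 classes (assumed)
noise = rng.integers(0, 2, size=10)   # random binary feature, unrelated to the labels
gain = impurity_decrease(noise, labels)
print(f"Impurity decrease from the noise feature: {gain:.4f}")
```

Because the decrease is never negative, a noise column always looks at least harmless on the training data at a given node; whether the resulting split generalizes to test data is a separate question.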

1 Answer

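This is not a direct answer to your question, but more of an experiment. I wrote a simple Python script that repeatedly fits a decision tree to the Iris dataset, once with only the regular columns and once with 4 extra columns of random numbers. I then store the difference in accuracy between the two models and plot its distribution. If you try it yourself, you will see that there are cases where the "clean" dataset gives better accuracy, even though most of the results are exactly the same.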

from sklearn.datasets import load_iris
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import random

warnings.filterwarnings("ignore")

iris = load_iris()

# Build a DataFrame with the four Iris features plus the target column
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

# Add four noise columns of random 0/1 values
df['rand1'] = np.random.randint(0, 2, df.shape[0])
df['rand2'] = np.random.randint(0, 2, df.shape[0])
df['rand3'] = np.random.randint(0, 2, df.shape[0])
df['rand4'] = np.random.randint(0, 2, df.shape[0])

all_inputs = df[iris['feature_names']].values
all_inputs_with_random = df[iris['feature_names'] + ['rand1', 'rand2', 'rand3', 'rand4']].values
all_classes = df['target'].values

# Accuracy difference (clean minus noisy) for each run
dif = []

for i in range(100):
    # Use the same seed for both splits so both models see the same train/test rows
    a = random.randint(0, 1000)

    (train_inputs, test_inputs, train_classes, test_classes) = train_test_split(all_inputs, all_classes, train_size=0.7, random_state=a)

    # Model 1: original features only.
    # No random_state is set on the trees, so tie-breaking between equally good splits can also vary between runs.
    dtc1 = DecisionTreeClassifier()
    dtc1.fit(train_inputs, train_classes)

    a1 = dtc1.score(test_inputs, test_classes)

    (train_inputs, test_inputs, train_classes, test_classes) = train_test_split(all_inputs_with_random, all_classes, train_size=0.7, random_state=a)

    # Model 2: original features plus the four noise columns
    dtc2 = DecisionTreeClassifier()
    dtc2.fit(train_inputs, train_classes)

    a2 = dtc2.score(test_inputs, test_classes)

    dif.append(a1 - a2)


# sns.distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
sns.histplot(dif, kde=True)
plt.show()

[Figure: distribution of the accuracy difference between the clean dataset and the dataset with random columns]
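As a possible follow-up to the experiment, one could also check how much a fitted tree actually relies on the noise columns by inspecting `feature_importances_`. The sketch below is an illustration under assumptions rather than part of the original experiment: it fits a single tree on the full dataset, and the seeds (`0`) are arbitrary choices made only for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Four random 0/1 columns, analogous to rand1..rand4 in the script above
rng = np.random.default_rng(0)  # arbitrary seed (assumption)
noise = rng.integers(0, 2, size=(iris.data.shape[0], 4))

X = np.hstack([iris.data, noise])
names = list(iris.feature_names) + ['rand1', 'rand2', 'rand3', 'rand4']

# Fit one tree on all rows and read the impurity-based importances
tree = DecisionTreeClassifier(random_state=0).fit(X, iris.target)
importances = dict(zip(names, tree.feature_importances_))

for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

A non-zero importance for any of the `rand` columns would indicate that the tree split on noise at least once; the importances are normalized, so they sum to 1 across all eight columns.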