Is it necessary to remove noise features (for example, a column of random numbers) from the features of a tree model? I would say no. Removing them may sometimes help, but they should never do the model any real harm: at each split the model checks which feature reduces impurity the most, and only occasionally will a random feature happen to be that one.
Do unnecessary features harm tree-based models?
data-mining
feature-selection
random-forest
decision-trees
predictor-importance
2021-10-04 20:28:24
1 Answer
This is not a direct answer to your question, but more of an experiment. I wrote a simple Python script that repeatedly fits the Iris dataset, once with the regular columns and once with 4 extra columns of random numbers. I then store the accuracy difference between the two models and plot its distribution. If you try it yourself, you will see there are cases where the "clean" dataset achieves better accuracy, even though most runs come out exactly the same.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
import seaborn as sns
import random

iris = load_iris()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])

# Add four columns of pure noise.
for col in ['rand1', 'rand2', 'rand3', 'rand4']:
    df[col] = np.random.randint(0, 2, df.shape[0])

all_inputs = df[iris['feature_names']].values
all_inputs_with_random = df[iris['feature_names'] + ['rand1', 'rand2', 'rand3', 'rand4']].values
all_classes = df['target'].values

dif = []
for i in range(100):
    # Reuse the same split seed for both datasets so only the features differ.
    a = random.randint(0, 1000)

    train_inputs, test_inputs, train_classes, test_classes = train_test_split(
        all_inputs, all_classes, train_size=0.7, random_state=a)
    dtc1 = DecisionTreeClassifier()
    dtc1.fit(train_inputs, train_classes)
    a1 = dtc1.score(test_inputs, test_classes)

    train_inputs, test_inputs, train_classes, test_classes = train_test_split(
        all_inputs_with_random, all_classes, train_size=0.7, random_state=a)
    dtc2 = DecisionTreeClassifier()
    dtc2.fit(train_inputs, train_classes)
    a2 = dtc2.score(test_inputs, test_classes)

    # Positive values mean the clean dataset scored higher.
    dif.append(a1 - a2)

# distplot is deprecated in recent seaborn; histplot is its replacement.
sns.histplot(dif, kde=True)
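The premise in the question, that a feature only gets used when it actually reduces impurity at some split, can also be checked directly through the tree's impurity-based feature importances. A minimal sketch along those lines (the extra noise column and its name are my own addition, not part of the script above):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Load Iris and append one column of pure noise.
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.column_stack([X, rng.integers(0, 2, X.shape[0])])

# Impurity-based importances reflect how much each feature
# actually contributed to impurity reduction across splits.
clf = DecisionTreeClassifier(random_state=0).fit(X_noisy, y)
for name, imp in zip(load_iris().feature_names + ['random'], clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On most runs the noise column's importance is near zero, which matches the intuition that the tree only picks it up when it happens, by chance, to reduce impurity.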