Consider the following simple classification problem (Python, scikit-learn):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
def get_product_data(size):
    '''
    Given a size (int), draws `log10(size)` features as uniform random
    variables `Xi` in [-1, 1] and a target `y` that is 1 if their
    product `P` is larger than 0.0 and 0 otherwise.
    Returns a pandas DataFrame.
    '''
    n_features = int(max(2, np.log10(size)))
    features = dict(('x%d' % i, 2*np.random.rand(size) - 1) for i in range(n_features))
    y = np.prod(list(features.values()), axis=0)
    y = y > 0.0
    features.update({'y': y.astype(int)})
    return pd.DataFrame(features)
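# quick, illustrative sanity check (not part of the original question): the frame
# holds int(max(2, log10(size))) uniform feature columns plus the integer target
# 'y'; about half of the targets are 1, since the product of symmetric uniforms
# is positive with probability 1/2
_sample = get_product_data(1000)
print(_sample.columns.tolist())   # e.g. ['x0', 'x1', 'x2', 'y']
print(_sample['y'].mean())        # roughly 0.5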
# create random data
df = get_product_data(1000)
X = np.array(df.drop(df.columns[-1], axis=1))
y = df['y']
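# the split below is fixed by random_state=1, so with X and y held fixed, any
# difference between the predict() calls further down comes from the classifiers themselves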
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)
def predict(clf):
    '''
    Fits `clf` on the fixed train split above and returns its
    accuracy on the test split.
    '''
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
and the following classifiers:
foo10 = RandomForestClassifier(10, max_features=None, bootstrap=False)
foo100 = RandomForestClassifier(100, max_features=None, bootstrap=False)
foo200 = RandomForestClassifier(200, max_features=None, bootstrap=False)
Why does
predict(foo10) # 0.906060606061
predict(foo100) # 0.933333333333
predict(foo200) # 0.915151515152
give different scores?
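The numbers above are from a single run; re-fitting freshly constructed, unseeded forests of the same size typically yields slightly different accuracies each time, which is the behaviour in question. A minimal check (the exact values will differ from run to run):

for _ in range(3):
    clf = RandomForestClassifier(10, max_features=None, bootstrap=False)
    print(predict(clf))   # typically a slightly different accuracy on each iteration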
Specifically, given that

- max_features=None selects all features for every tree,
- bootstrap=False means there is no bootstrapping of the samples, and
- max_depth=None (the default) lets every tree grow to its maximum depth,

I would expect every tree to be exactly the same. Therefore, no matter how many trees the forest has, the predictions should be equal. Where does the variability between the trees come from in this example?
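One way to see whether the trees really coincide is to compare the per-tree predictions inside a fitted forest directly (a diagnostic sketch, not part of the setup above):

foo10.fit(X_train, y_train)
per_tree = np.array([tree.predict(X_test) for tree in foo10.estimators_])
# if every tree were identical, every row would match the first one
print(np.all(per_tree == per_tree[0]))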
Which further arguments do I have to pass to RandomForestClassifier.__init__ so that the foo* classifiers all obtain the same score?