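Short version:
I am comparing several classifiers on a dataset from Kaggle, looking at accuracy and running time before and after applying PCA (from sklearn). For some reason, after PCA the classifiers (XGBoost and AdaBoost, to take two examples) take roughly 3 times as long to run as they did before PCA. My question is: why? Am I doing something wrong, or is this possible?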
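Long version:
My understanding of how to use PCA:
- Split the standardized, cleaned dataset into training and test sets (using train_test_split).
- Fit PCA on X_train, transform it, and save the result to a new df.
- Using the fitted PCA, transform (without fitting) X_test.
- Run the classifiers on the transformed X_train and X_test.
PS: I checked that the number of dimensions really is reduced (from 21 to the number required: 17 in the case of 90% of the variance). The dataset has about 130,000 entries and is taken from Kaggle. The code I wrote for this: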
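from sklearn.decomposition import PCA
# Keep as many components as are needed to explain 90% of the variance.
pca = PCA(n_components=0.9)
# Fit on the training set only, then reuse the fitted PCA on the test set.
X_train_Reduced = pca.fit_transform(X_train)
X_test_Reduced = pca.transform(X_test)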
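The four steps and the dimensionality check described above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in (random values, 1000 rows, 21 features) rather than the Kaggle dataset, and the names n_kept and explained are introduced here for the example. The fitted attributes n_components_ and explained_variance_ratio_ of sklearn's PCA report how many components were kept and how much variance they cover.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption): 1000 rows, 21 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 21))
y = rng.integers(0, 2, size=1000)

# Mirrors the order in the post: standardize first, then split.
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit PCA on the training set only, then apply the same projection to the test set.
pca = PCA(n_components=0.9)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)

# Check how many components were kept and how much variance they explain.
n_kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print("components kept:", n_kept, "of", X_train.shape[1])
print("explained variance:", explained)
```

With random data almost every component is needed to reach 90%, so the reduction here is small; the point is only where to read the two numbers.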
Classifier before PCA (XGBoost):
import time
import warnings
from sklearn import metrics
from xgboost import XGBClassifier
# The timer covers fitting, prediction and scoring.
start_timeXGBoost = time.time()
warnings.filterwarnings('ignore')
modelXGBoost = XGBClassifier(learning_rate=0.2, n_estimators=200, verbosity=0, use_label_encoder=False, n_jobs=-1)
modelXGBoost.fit(X_train, y_train)
predictionsXGBoost = modelXGBoost.predict(X_test)
accuracyXGBoost = metrics.accuracy_score(y_test, predictionsXGBoost)
print("Accuracy (XGBoost): ", accuracyXGBoost)
timeXGBoost = time.time() - start_timeXGBoost
print("Time taken to achieve result: %s seconds" % (timeXGBoost))
Code output:
Accuracy (XGBoost):  0.9655066214967662
Time taken to achieve result: 3.33561372756958 seconds
Classifier after PCA (XGBoost):
# Same model and timer as before, but trained on the PCA-reduced features.
start_timeXGBoost = time.time()
warnings.filterwarnings('ignore')
modelXGBoost = XGBClassifier(learning_rate=0.2, n_estimators=200, verbosity=0, use_label_encoder=False, n_jobs=-1)
modelXGBoost.fit(X_train_Reduced, y_train)
predictionsXGBoost = modelXGBoost.predict(X_test_Reduced)
accuracyXGBoost = metrics.accuracy_score(y_test, predictionsXGBoost)
print("Accuracy (XGBoost): ", accuracyXGBoost)
timeXGBoost = time.time() - start_timeXGBoost
print("Time taken to achieve result: %s seconds" % (timeXGBoost))
Code output:
Accuracy (XGBoost):  0.93032029565753
Time taken to achieve result: 10.376214981079102 seconds
Another example (AdaBoost)
Classifier before PCA (AdaBoost):
import time
from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# The timer covers fitting, prediction and scoring.
start_timeAdaBoost = time.time()
modelDecTree = DecisionTreeClassifier(random_state=0, max_depth=2)
modelAdaBoost = AdaBoostClassifier(modelDecTree, n_estimators=1000, random_state=0, learning_rate=1)
modelAdaBoost.fit(X_train, y_train)
predictionsAdaBoost = modelAdaBoost.predict(X_test)
accuracyAdaBoost = metrics.accuracy_score(y_test, predictionsAdaBoost)
print("Accuracy (AdaBoost): ", accuracyAdaBoost)
timeAdaBoost = time.time() - start_timeAdaBoost
print("Time taken to achieve result: %s seconds" % (timeAdaBoost))
Code output:
Accuracy (AdaBoost):  0.9575762242069603
Time taken to achieve result: 103.38761949539185 seconds
Classifier after PCA (AdaBoost):
# Same model and timer as before, but trained on the PCA-reduced features.
start_timeAdaBoost = time.time()
modelDecTree = DecisionTreeClassifier(random_state=0, max_depth=2)
modelAdaBoost = AdaBoostClassifier(modelDecTree, n_estimators=1000, random_state=0, learning_rate=1)
modelAdaBoost.fit(X_train_Reduced, y_train)
predictionsAdaBoost = modelAdaBoost.predict(X_test_Reduced)
accuracyAdaBoost = metrics.accuracy_score(y_test, predictionsAdaBoost)
print("Accuracy (AdaBoost): ", accuracyAdaBoost)
timeAdaBoost = time.time() - start_timeAdaBoost
print("Time taken to achieve result: %s seconds" % (timeAdaBoost))
Code output:
Accuracy (AdaBoost):  0.9141515244841392
Time taken to achieve result: 295.6763050556183 seconds
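Each timing above wraps fitting, prediction and scoring in a single interval. A minimal sketch of timing one stage at a time, assuming a small helper named timed (hypothetical, not part of any library) and a stand-in workload where a call such as modelXGBoost.fit would go:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload (assumption): summing a range instead of fitting a model.
total, seconds = timed(sum, range(1_000_000))
print("result:", total)
print("elapsed: %.6f seconds" % seconds)
```

In the code from the post, the same helper could wrap the fit call and the predict call separately, for example timed(modelXGBoost.fit, X_train_Reduced, y_train), so the two durations can be compared on their own.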
I would really appreciate any help in understanding what I am doing wrong (or right).
Thanks, everyone.