Data mining - Determining the number of components in PCA

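I have run my model many times. Each time I get a different result depending on the values I enter for the range of PCA component counts (in the code I type the numbers out literally rather than using the range function).

If I search from 1 up to the maximum number of components (for example 100), I get a certain accuracy, say 60%, and the number of components selected is 80. So: 60% at 80 components.

Now, if I repeat the run over the range 1 to 79, I get 62% accuracy and the number of components selected is 45.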

If I run the whole process again over the range 1 to 100 but with a step of 10 (instead of 5 or 1), e.g. range(1, 100, 10), I get yet another accuracy.
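For reference, the hand-typed grids described so far correspond roughly to the `range` calls below. This is only a small sketch: the variable names are illustrative and do not appear in the original code.

```python
# Hypothetical names; the original code types these lists out by hand.
max_components = 100

# Step of 5 plus the maximum itself. range() excludes the stop value,
# so 100 has to be appended explicitly.
grid_step5 = list(range(1, max_components, 5)) + [max_components]

# Step of 10, i.e. range(1, 100, 10); the last value is 91, not 100.
grid_step10 = list(range(1, max_components, 10))

# Every integer from 1 to 79 inclusive.
grid_1_to_79 = list(range(1, 80))

print(grid_step5)
print(grid_step10)
print(len(grid_1_to_79))
```

Each grid offers a different set of candidate values, and a search can only ever select a value that is in its own list.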

The accuracy fluctuates rather than changing linearly: increasing the number of components does not necessarily improve it.
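One way to look at this directly is to compute the cross-validated accuracy for each candidate number of components and print the whole curve instead of only the winner. The sketch below assumes synthetic stand-in data from `make_classification` and a short, made-up list of component counts. It is not the data or the grid from this question, and the shape of the curve will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; not X_balanced / y_balanced).
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)

scores = {}
for n in [1, 5, 10, 15, 20, 25, 30]:
    pipe = Pipeline(steps=[('pca', PCA(n_components=n)),
                           ('logreg', LogisticRegression(max_iter=5000))])
    # Mean accuracy over 5 cross-validation folds for this component count.
    scores[n] = cross_val_score(pipe, X, y, cv=5).mean()

for n, score in scores.items():
    print(f"n_components={n:2d}  mean CV accuracy={score:.3f}")

# The candidate with the highest mean score on this particular grid.
best_n = max(scores, key=scores.get)
print("best n_components on this grid:", best_n)
```

Comparing the printed scores side by side shows how close neighbouring candidates are, which is hard to see from the single selected value.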

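So what should I do?

Should I run the analysis over the components from 1 to the maximum with a step of 1 (e.g. range(1, max)), and then, each time a number of components is selected, investigate the range below it? Can anyone help?

Here is my code: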

# Search for the best combination of PCA truncation
# and class reg (LogReg).

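from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import pandas as pd

# `pca` is not defined in the posted snippet; an unconfigured PCA() is assumed here.
pca = PCA()
logreg = LogisticRegression(random_state=42, class_weight='balanced', max_iter=5000)
pipe_logreg = Pipeline(steps=[('pca', pca), ('logreg', logreg)])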


# Parameters of pipelines can be set using '__' separated parameter names:
parameters_logreg = [{'pca__n_components': [1, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 76, 81, 86, 91, 96, 100]}, 
                     {'logreg__C':[0.5, 1, 10, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 400, 500], 
                    'logreg__penalty':['l2'],
                    'logreg__warm_start':[False, True],
                    'logreg__solver': ['newton-cg', 'lbfgs', 'sag'],
                    'logreg__multi_class': ['ovr', 'multinomial', 'auto']},
                     {'logreg__C':[0.5, 1, 10, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 400, 500], 
                    'logreg__penalty':['l1'],
                    'logreg__warm_start':[False, True],
                    'logreg__solver': ['liblinear', 'saga'],
                    'logreg__multi_class': ['ovr', 'auto'],
                    }]

# 10-fold cross-validated grid search over the pipeline parameters
clflogreg = GridSearchCV(pipe_logreg, param_grid=parameters_logreg, cv=10,
                         return_train_score=False)
clflogreg.fit(X_balanced, y_balanced)


# Plot the PCA spectrum (logreg)
pca.fit(X_balanced)

fig1, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6)) #(I added 1 to fig)
ax0.plot(pca.explained_variance_ratio_, linewidth=2)
ax0.set_ylabel('PCA explained variance')

ax0.axvline(clflogreg.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))

# For each number of components, find the best classifier results
results_logreg = pd.DataFrame(clflogreg.cv_results_) #(Added _logreg to all variable def)
components_col_logreg = 'param_pca__n_components'
best_clfs_logreg = results_logreg.groupby(components_col_logreg).apply(
    lambda g: g.nlargest(1, 'mean_test_score'))

best_clfs_logreg.plot(x=components_col_logreg, y='mean_test_score', yerr='std_test_score',
               legend=False, ax=ax1)
ax1.set_ylabel('Classification accuracy (val)')
ax1.set_xlabel('n_components')

plt.tight_layout()
plt.show()
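One detail of the grid above that may be worth checking: `parameters_logreg` is a list of dictionaries, and scikit-learn expands each dictionary in such a list as its own separate grid rather than combining them. The sketch below uses `ParameterGrid` with shortened, made-up value lists (not the ones from the code above) to show the difference between the two forms.

```python
from sklearn.model_selection import ParameterGrid

# Shortened, illustrative value lists (assumptions, not the grid above).
separate = ParameterGrid([
    {'pca__n_components': [1, 6, 11]},
    {'logreg__C': [0.5, 1], 'logreg__penalty': ['l2']},
])
combined = ParameterGrid({
    'pca__n_components': [1, 6, 11],
    'logreg__C': [0.5, 1],
    'logreg__penalty': ['l2'],
})

print(len(separate))  # 3 + 2 = 5 candidates
print(len(combined))  # 3 * 2 * 1 = 6 candidates

# In the list form, no candidate sets both the PCA and the LogReg parameters.
for params in separate:
    print(params)
```

In the list form, any parameter not named in a dictionary keeps the value the pipeline was constructed with, so the number of components is varied only with the classifier's construction-time settings, and the classifier settings are varied only with the PCA's construction-time setting.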