I am trying to build a 2-level stacking model for a multi-class classification problem with 8 classes. My base (level-1) models and their micro F1 scores on the test set are:
- Random forest classifier (0.51)
- XGBoost classifier (0.54)
- LightGBM classifier (0.54)
- Logistic regression (0.44)
- Keras neural network (0.57)
- Keras neural network (0.56)

As the level-2 model I use an untuned XGBClassifier, and I use 7-fold cross-validation to generate its meta-features. The code I use to generate the meta-features for the plain (sklearn-style) classifiers is:
import numpy as np
from sklearn.model_selection import StratifiedKFold

ntrain = X_train.shape[0]
ntest = X_test.shape[0]
seed = 0
nfolds = 7
# random_state only takes effect when shuffle=True
kf = StratifiedKFold(n_splits=nfolds, shuffle=True, random_state=seed)

def get_meta(clf, X_train, y_train, X_test):
    meta_train = np.zeros((ntrain,))
    meta_test = np.zeros((ntest,))
    for i, (train_index, test_index) in enumerate(kf.split(X_train, y_train)):
        X_tr = X_train.iloc[train_index]
        y_tr = y_train.iloc[train_index]
        X_te = X_train.iloc[test_index]
        # out-of-fold predictions become the level-2 training feature
        clf.fit(X_tr, y_tr)
        meta_train[test_index] = clf.predict(X_te)
    # refit on the full training set to predict the test set
    clf.fit(X_train, y_train)
    meta_test = clf.predict(X_test)
    return meta_train.reshape(-1, 1), meta_test.reshape(-1, 1)
and for the Keras neural networks it is:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

def get_meta_keras(clf, X_train, y_train, X_test, epochs=200, batch_size=70, class_weight=class_weights):
    meta_train = np.zeros((ntrain,))
    meta_test = np.zeros((ntest,))
    encoder = LabelEncoder()
    encoder.fit(y_train)
    encoded_Y = encoder.transform(y_train)
    # convert integers to dummy variables (i.e. one-hot encoded)
    dummy_y = np_utils.to_categorical(encoded_Y)
    for i, (train_index, test_index) in enumerate(kf.split(X_train, y_train)):
        X_tr = X_train.iloc[train_index]
        y_tr = dummy_y[train_index]
        X_te = X_train.iloc[test_index]
        clf.fit(X_tr, y_tr, epochs=epochs, batch_size=batch_size, class_weight=class_weight)
        meta_train[test_index] = clf.predict_classes(X_te)
    clf.fit(X_train, dummy_y, epochs=epochs, batch_size=batch_size, class_weight=class_weight)
    meta_test = clf.predict_classes(X_test)
    return meta_train.reshape(-1, 1), meta_test.reshape(-1, 1)
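For context, this is roughly how I assemble the meta-features and train the level-2 model. This is a self-contained sketch on synthetic data: the dataset, the two base models, and the use of sklearn's GradientBoostingClassifier as a stand-in for my untuned XGBClassifier are all illustrative, not my actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Illustrative data standing in for my real dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_tr, X_te, y_tr = pd.DataFrame(X_tr), pd.DataFrame(X_te), pd.Series(y_tr)

kf = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)

def get_meta(clf, X_train, y_train, X_test):
    # Same idea as above: out-of-fold predictions on the training set,
    # full-fit predictions on the test set
    meta_train = np.zeros((X_train.shape[0],))
    for train_index, test_index in kf.split(X_train, y_train):
        clf.fit(X_train.iloc[train_index], y_train.iloc[train_index])
        meta_train[test_index] = clf.predict(X_train.iloc[test_index])
    clf.fit(X_train, y_train)
    return meta_train.reshape(-1, 1), clf.predict(X_test).reshape(-1, 1)

# One meta-feature column per base model, stacked side by side
base_models = [RandomForestClassifier(random_state=0),
               LogisticRegression(max_iter=1000)]
cols = [get_meta(m, X_tr, y_tr, X_te) for m in base_models]
meta_train = np.hstack([c[0] for c in cols])  # shape (ntrain, n_models)
meta_test = np.hstack([c[1] for c in cols])   # shape (ntest, n_models)

# Level-2 model trained on the meta-features only
level2 = GradientBoostingClassifier(random_state=0)
level2.fit(meta_train, y_tr)
score = f1_score(y_te, level2.predict(meta_test), average="micro")
```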
My final micro F1 score is 0.54, which is lower than my best base model's score. My base models' predictions are not highly correlated (corr < 0.55). I tried adding simpler models such as kNN and naive Bayes, but the score dropped even further. Why doesn't my stacking approach improve the score?