我想为具有四个类的数据框创建一个分类器。每行只能有一个类。我有两个预测模型:一个神经网络和一个树分类器。但是他们在培训期间以及因此在测试期间将每个人都放在一个班级中。
神经网络仅分类为一类
问题是我的神经网络的分类是:
我在这里调用模型:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model
model = create_model(x_train.shape[1], y_train.shape[1])
epochs = 30
batch_sz = 64
print("Beginning model training with batch size {} and {} epochs".format(batch_sz, epochs))
checkpoint = ModelCheckpoint("lc_model.h5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto', period=1)
# train the model
history = model.fit(x_train.to_numpy(),
y_train.to_numpy(),
validation_split=0.2,
epochs=epochs,
batch_size=batch_sz,
verbose=2,
# class_weight = weights, # class_weight tells the model to "pay more attention" to samples from an under-represented grade class.
# callbacks=[checkpoint]
)
# revert to the best model encountered during training
model = load_model("lc_model.h5", compile=False)
这是模型的架构:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm
# from tensorflow.python.compiler.tensorrt import trt_convert as trt
def create_model(input_dim, output_dim):
print(output_dim)
# create model
model = Sequential()
print("sequential")
# input layer
model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=MaxNorm(3)))
model.add(Dropout(0.2))
# hidden layer
model.add(Dense(60, activation='relu', kernel_constraint=MaxNorm(3)))
model.add(Dropout(0.2))
# output layer
model.add(Dense(output_dim, activation='softmax'))
# Compile model
model.compile(loss=focal_loss(alpha=1), loss_weights=None, optimizer='nadam', metrics=['accuracy'])
return model
这里是一部分x_train
。
id reg 0.0_x 1.0_x 17.0 21.0 30.0 40.0 50.0 60.0 70.0 Célibataire Divorcé(e) Marié et j'ai des enfants à charge Marié et je n'ai pas encore d'enfants à charge Refus de répondre Veuf (ve) 1er cycle universitaire / Licence 2e cycle universtaire / Master 3e cycle universtaire / Doctorat BTS Je n'ai jamais été à l'école Niveau collège Niveau lycée Niveau primaire Autre. Merci de préciser :@NS$ Infirme J'ai une société Je ne travaille pas Je suis commerçant Je suis encore étudiant Je suis independent Je suis journalier, je travaille de temps à autre Je suis retraité Je travaille dans le secteur privé Je travaille dans le secteur public 0.0_y 250.0 3750.0 7500.0 8750.0 11250.0 11500.0 18750.0 25000.0 35000.0 45000.0 50000.0 0.0_x.1 1.0_y 0.0_y.1 1.0_x.1 Je ne suis pas d'accord Je suis d'accord False_x True_y False_y True_x False_x.1 True_y.1 False_y.1 True_x.1 False_x.2 True_y.2 False_y.2 True_x.2 False_x.3 True_y.3 False_y.3 True_x.3 False_x.4 True_y.4 False_y.4 True_x.4 False_x.5 True_y.5 False_y.5 True_x.5 0.0_x.2 1.0_y.1 0.0_y.2 1.0_x.2 0.0_x.3 1.0_y.2 0.0_y.3 1.0
0 NaN 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1
1 NaN 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0
...
这是一部分y_train
:
Voting intention in 2021_Cast a blank vote Voting intention in 2021_I know who I will be voting for in 2021 Voting intention in 2021_I won't vote Voting intention in 2021_I'm going to vote in 2021 but don't know for who
0 0 1 0 0
1 0 0 0 1
...
所以当我尝试测试这个模型时,它并不比随机更好:
sequential
Beginning model training with batch size 64 and 30 epochs
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of samples seen.
Train on 768 samples, validate on 192 samples
Epoch 1/30
768/768 - 1s - loss: -inf - acc: 0.2448 - val_loss: -inf - val_acc: 0.2708
Epoch 2/30
768/768 - 0s - loss: -inf - acc: 0.2409 - val_loss: -inf - val_acc: 0.2708
...
Epoch 30/30
768/768 - 0s - loss: -inf - acc: 0.2409 - val_loss: -inf - val_acc: 0.2708
事实上,准确率略低于 25%,这是我从随机选择类时所期望的结果。它似乎永远不会学到任何东西,因为损失总是如此-inf
。
所以我在测试集上计算模型的准确率,结果更糟。确实使用以下代码:
import numpy as np
from sklearn.metrics import f1_score
y_pred = model.predict(x_test.to_numpy())
# Revert one-hot encoding to classes
y_pred_classes = pd.DataFrame((y_pred.argmax(1)[:,None] == np.arange(y_pred.shape[1])),
columns=y_test.columns,
index=y_test.index)
y_test_vals = y_test.idxmax(1)
y_pred_vals = y_pred_classes.idxmax(1)
# F1 score
# Use idxmax() to convert back from one-hot encoding
f1 = f1_score(y_test_vals, y_pred_vals, average='weighted')
print("Test Set Accuracy: {:.2%} (But results would have been better if trained on the FULL dataset)".format(f1))
我不明白,这是我设法解决另一个贷款分类问题的架构。
我得到:Test Set Accuracy: 10.92%
带重量:
所有之前的模型化都未加权或没有焦点损失。
我试图以不同的方式来应对类不平衡,例如重新采样。没有重新采样,我用权重做了:
weights = df_en2['Voting intention in 2021'].value_counts(normalize=True)
weights = weights.sort_index().tolist()
weights = {0: 1 / weights[0],
1: 1 / weights[1],
2: 1 / weights[2],
3: 1 / weights[3]}
dfen_2
提供x_train
, y_train
, x_test
,y_test
功能的数据框在哪里spilt_data()
,您可以在这里找到(它基本上是相同的架构,但针对贷款分类问题)。
多类树分类器
相比之下,使用树分类器,如果我离开max_depth
to None
,叶子会被扩展,直到所有叶子都是纯的或直到所有叶子包含少于 min_samples_split 样本。我的平均准确率为 42%。
# importing necessary libraries
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# dividing X, y into train and test data
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# training a DescisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier().fit(x_train, y_train)
dtree_predictions = dtree_model.predict(x_test)
# creating a confusion matrix
cm = confusion_matrix(y_test.values.argmax(axis = 1), dtree_predictions.argmax(axis = 1))
它返回:
array([[0.19047619, 0.15873016, 0.45454545, 0.27118644],
[0.15873016, 0.38095238, 0.2 , 0.30508475],
[0.15873016, 0.15873016, 0.4 , 0.22033898],
[0.19047619, 0.19047619, 0.21818182, 0.38983051]])