I am trying to create a 1:N speaker identification model with Keras on the TensorFlow backend. I use the LibriSpeech corpus as training data and preprocess it by first converting each file from .FLAC to .WAV and then computing Mel-frequency cepstral coefficients (MFCCs) from the first three seconds of audio (a sketch of this step appears after the model code below). The MFCCs are then fed into a convolutional neural network (CNN) created with the following function:
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.models import Model, load_model

def createModel(self, model_input, n_outputs, first_session=True):
    # On subsequent sessions, reload the previously saved model instead
    if not first_session:
        model = load_model('SI_ideal_model_fixed.hdf5')
        return model
    # Define Input Layer
    inputs = model_input
    # Define First Conv2D Layer
    conv = Conv2D(filters=32,
                  kernel_size=(5, 5),
                  activation='relu',
                  padding='same',
                  strides=3,
                  name='conv_1A')(inputs)
    conv = Conv2D(filters=32,
                  kernel_size=(5, 5),
                  activation='relu',
                  padding='same',
                  strides=3,
                  name='conv_1B')(conv)
    conv = MaxPooling2D(pool_size=(3, 3), padding='same', name='maxpool_1')(conv)
    conv = Dropout(0.3, name='dropout_1')(conv)
    # Define Second Conv2D Layer
    conv = Conv2D(filters=64,
                  kernel_size=(3, 3),
                  activation='relu',
                  padding='same',
                  strides=3,
                  name='conv_2A')(conv)
    conv = Conv2D(filters=64,
                  kernel_size=(3, 3),
                  activation='relu',
                  padding='same',
                  strides=3,
                  name='conv_2B')(conv)
    conv = MaxPooling2D(pool_size=(3, 3), padding='same', name='maxpool_2')(conv)
    conv = Dropout(0.3, name='dropout_2')(conv)
    # Define Third Conv2D Layer
    conv = Conv2D(filters=128,
                  kernel_size=(3, 3),
                  activation='relu',
                  padding='same',
                  strides=3,
                  name='conv_3A')(conv)
    conv = Conv2D(filters=128,
                  kernel_size=(3, 3),
                  activation='relu',
                  padding='same',
                  strides=3,
                  name='conv_3B')(conv)
    conv = MaxPooling2D(pool_size=(3, 3), padding='same', name='maxpool_3')(conv)
    conv = Dropout(0.3, name='dropout_3')(conv)
    # Define Flatten Layer
    conv = Flatten(name='flatten')(conv)
    # Define First Dense Layer
    conv = Dense(256, activation='relu', name='dense_a')(conv)
    conv = Dropout(0.2, name='dropout_4')(conv)
    # Define Second Dense Layer
    conv = Dense(128, activation='relu', name='dense_b')(conv)
    conv = Dropout(0.2, name='dropout_5')(conv)
    # Define Output Layer
    outputs = Dense(n_outputs, activation='softmax', name='output')(conv)
    # Create Model
    model = Model(inputs, outputs)
    model.summary()
    return model
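For reference, the preprocessing step described above looks roughly like this (a minimal sketch assuming librosa; the sample rate and number of coefficients here are illustrative choices, not fixed requirements):

import librosa
import numpy as np

def wav_to_mfcc(wav_path, duration=3.0, sr=16000, n_mfcc=40):
    # Load only the first three seconds of audio at a fixed sample rate
    signal, sr = librosa.load(wav_path, sr=sr, duration=duration)
    # Compute MFCCs; the result has shape (n_mfcc, n_frames)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Add a channel axis so the array matches a Conv2D input: (height, width, 1)
    return mfcc[..., np.newaxis]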
The model is meant to determine whether a speaker is one of 15 people of interest, so it was subsequently retrained with transfer learning on a database of those 15 speakers (a rough sketch of this stage follows the training log below). For testing purposes, the database consists of 14 LibriSpeech speakers the model had never seen before, plus myself. During this training stage the model reached a maximum validation accuracy of 0.9416:
Epoch 10/100
547/547 [==============================] - 0s 498us/step - loss: 0.1778 - accuracy: 0.9452 - val_loss: 0.2544 - val_accuracy: 0.9416
Epoch 00010: val_accuracy improved from 0.91971 to 0.94161, saving model to SI_ideal_model_fixed.hdf5
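In outline, the retraining stage does something like the following (a simplified sketch: freezing the convolutional layers and replacing the output head is one standard reading of transfer learning here, and the details are approximate):

from tensorflow.keras.models import load_model, Model
from tensorflow.keras.layers import Dense

base = load_model('SI_ideal_model_fixed.hdf5')
# Freeze the convolutional feature extractor so only the head is retrained
for layer in base.layers:
    if layer.name.startswith(('conv', 'maxpool')):
        layer.trainable = False
# Replace the output layer with one sized for the 15 speakers of interest
features = base.get_layer('dense_b').output
outputs = Dense(15, activation='softmax', name='output_15')(features)
model = Model(base.input, outputs)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])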
Finally, I recorded a live sample of my own voice and asked the model to predict whether it was me or one of the LibriSpeech speakers, using the following function:
import numpy as np

def predict(self, audio_path):
    # Import Model
    model, data = self.importModel(audio_path)
    # Prepare Audio for Prediction
    pla = ProcessLiveAudio()
    file = pla.getMFCCFromRec()
    # Interpret Prediction
    prob = model.predict(file)   # Make prediction
    print(prob)
    index = np.argmax(prob)      # Decode one-hot vector
    prob_max = prob[0][index]    # Answer confidence
    prediction = data[2][index]  # Determine corresponding speaker
    # Print Results
    print('Speaker: ' + prediction)
    print('Confidence: ' + str(prob_max * 100) + ' %')
The results were as follows (the speakers are in the first list, where ZH is myself; the corresponding probabilities are in the second list):
['1743', '1992', '2182', '2196', '2277', '2412', '2428', '2803', '2902', '3000', '3081', '3170', '3536', '3576', 'ZH']
[[1.0116849e-04 9.0038668e-08 9.9856061e-01 5.8844932e-07 2.0543277e-05
5.1232328e-06 3.5524553e-07 5.9479582e-08 7.4726445e-06 6.2108990e-10
2.0075995e-10 2.6086704e-08 1.0949866e-03 2.0887743e-04 7.1733335e-12]]
So not only does the model predict, with 99.86% confidence, that I am a completely different person, it also ranks me as the least likely class for the audio signal. The same problem appears in every subsequent test of the model, sometimes with a different wrong speaker. I am at a loss as to how to improve the model, since it apparently performed excellently during training and avoided severe overfitting. Does the problem stem from training, is there something wrong with my prediction function, or is it something else entirely?
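One diagnostic I can think of before blaming training: checking whether the live features from getMFCCFromRec() match the training features in shape and scale, since a mismatch there would hand the model out-of-distribution input and could explain confident wrong predictions. A rough sketch (train_batch and live_sample are placeholders for my actual arrays):

import numpy as np

def compare_features(train_batch, live_sample):
    # Shapes must match exactly (excluding the batch dimension)
    print('train shape:', train_batch.shape[1:], '| live shape:', live_sample.shape[1:])
    # Large differences in statistics hint at inconsistent normalization
    print('train mean/std: %.3f / %.3f' % (train_batch.mean(), train_batch.std()))
    print('live  mean/std: %.3f / %.3f' % (live_sample.mean(), live_sample.std()))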
TL;DR: What are the best steps to fix a multiclass Keras model that performs well during training but then confidently predicts the wrong class?
I'm new to ML/Keras, so any help would be greatly appreciated.