High model accuracy vs. very low validation accuracy

Tags: data mining, Python, deep learning, classification, Keras, overfitting

2021-09-15 02:54:23

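I am building a sentiment analysis program in Python, using a Keras Sequential model for deep learning.

My data is about 20,000 tweets:

Positive tweets: 9152 tweets
Negative tweets: 10849 tweets

I wrote a Sequential model script for binary classification, as follows: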

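from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

model=Sequential()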
model.add(Embedding(vocab_size, 100, input_length=max_words))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model

print(model.summary())
history=model.fit(X_train[train], y1[train], validation_split=0.30,epochs=2, batch_size=128,verbose=2)

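But I get very strange results! The model's training accuracy is high (about 90%), while the validation accuracy is extremely low (below 4%), as shown below: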

Train on 9417 samples, validate on 4036 samples
 Epoch 1/2
- 13s - loss: 0.5478 - acc: 0.7133 - val_loss: 3.6157 - val_acc: 0.0243
 Epoch 2/2
- 11s - loss: 0.2287 - acc: 0.8995 - val_loss: 5.4746 - val_acc: 0.0339

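I tried increasing the number of epochs, but that only increases the training accuracy and lowers the validation accuracy.

Any suggestions on how to overcome this problem?

Update:

This is how I process the data: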

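import numpy
from numpy import asarray, zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#read training data
pos_file=open('pos2.txt', 'r', encoding="Latin-1")
neg_file=open('neg3.txt', 'r', encoding="Latin-1")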
# Load data from files
pos = list(pos_file.readlines())
neg = list(neg_file.readlines()) 
x = pos + neg
docs = numpy.array(x)
#read Testing Data
pos_test=open('posTest2.txt', 'r',encoding="Latin-1")
posT = list(pos_test.readlines())
neg_test=open('negTest2.txt', 'r',encoding="Latin-1")
negT = list(neg_test.readlines())
xTest = posT + negT
total2 = numpy.array(xTest)

CombinedDocs=numpy.append(total2,docs)

# Generate labels
positive_labels = [1 for _ in pos]
negative_labels = [0 for _ in neg]
labels = numpy.concatenate([positive_labels, negative_labels], 0)

# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(CombinedDocs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
#print(encoded_docs)

# pad documents to a max length of 140 words
max_length = 140
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

Here I used Google's public word2vec embeddings:

# load the whole embedding into memory
embeddings_index = dict()
f = open('Google28.bin',encoding="latin-1")
for line in f:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='str')
    embeddings_index[word] = coefs
f.close()

# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))

for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector


#Convert to numpy
NewTraining=numpy.array(padded_docs)
NewLabels=numpy.array(labels)
encoded_docs2 = t.texts_to_sequences(total2)

# pad documents to a max length of 140 words

padded_docs2 = pad_sequences(encoded_docs2, maxlen=max_length, padding='post')


# Generate labels
positive_labels2 = [1 for _ in posT]
negative_labels2 = [0 for _ in negT]
yTest = numpy.concatenate([positive_labels2, negative_labels2], 0)
NewTesting=numpy.array(padded_docs2)
NewLabelsTsting=numpy.array(yTest)

4 Answers

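When a machine learning model has high training accuracy and very low validation accuracy, the situation is usually called overfitting. Common causes are:

The hypothesis function you are using is too complex, so your model fits the training data almost perfectly but fails to perform on test/validation data.
The number of learnable parameters in your model is too large, so instead of generalizing from the examples the model memorizes them, and it performs poorly on test/validation data.

To address these problems, you can try several solutions, depending on your dataset:

Use a simpler cost/loss function.
Use regularization that helps reduce overfitting, e.g. Dropout.
Reduce the number of learnable parameters in your model.

These are the 3 solutions most likely to improve your model's validation accuracy. If they do not help, check that your inputs have the correct shape and size.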
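As a rough illustration of the Dropout and parameter-reduction suggestions, the question's network could be slimmed down along these lines. This is only a sketch: the function name `build_regularized_model`, the dropout rate of 0.5, the 32-unit dense layer, and the swap from Flatten to GlobalMaxPooling1D are all assumptions chosen for illustration, not values taken from the thread.

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense


def build_regularized_model(vocab_size, embedding_dim=100):
    """Hypothetical smaller variant of the question's model with Dropout added."""
    model = Sequential()
    model.add(Embedding(vocab_size, embedding_dim))
    model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
    # Global pooling keeps one value per filter, so the dense layer that follows
    # needs far fewer weights than Flatten + Dense(250) would.
    model.add(GlobalMaxPooling1D())
    model.add(Dropout(0.5))   # assumed rate: randomly zero half of the features while training
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
```

It would be called as `model = build_regularized_model(vocab_size)` in place of the original model definition. Regularization of this kind only helps if the validation set is representative of the training data to begin with.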

You should try shuffling all of the data, splitting it into training, validation, and test sets, and then training again.
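One way to do such a shuffled three-way split with plain NumPy is sketched below. The helper name `shuffled_three_way_split`, the default fractions, and the toy arrays are assumptions made for the example; the key point is that a single permutation is applied to both the inputs and the labels so they stay aligned.

```python
import numpy as np


def shuffled_three_way_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle X and y together, then cut them into train/validation/test parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))      # one permutation keeps X and y aligned
    X, y = X[order], y[order]
    n_val = int(len(X) * val_frac)
    n_test = int(len(X) * test_frac)
    n_train = len(X) - n_val - n_test
    train = (X[:n_train], y[:n_train])
    val = (X[n_train:n_train + n_val], y[n_train:n_train + n_val])
    test = (X[n_train + n_val:], y[n_train + n_val:])
    return train, val, test


# Toy data ordered like the question's: all positives first, then all negatives.
X_demo = np.arange(20).reshape(20, 1)
y_demo = np.array([1] * 10 + [0] * 10)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = shuffled_three_way_split(
    X_demo, y_demo, val_frac=0.2, test_frac=0.2)
print(len(train_X), len(val_X), len(test_X))
```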

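It seems that validation accuracy is not working properly with validation_split. Instead of using validation_split in the model's fit function, split your training data into training and validation data before calling fit, and then pass the validation data to fit as shown here.

Instead of doing this: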

history=model.fit(X_train[train], y1[train], validation_split=0.30,epochs=2, batch_size=128,verbose=2)

split your training data into training and validation data by any method. If your validation data is (X_val,Y_val), replace the line above with this one:

history=model.fit(X_train[train], y1[train], validation_data=(X_val,Y_val),epochs=2, batch_size=128,verbose=2)
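The answer leaves the splitting method open. One possible choice, sketched here with NumPy only, is a stratified hold-out that takes the same fraction from each class, so the validation set keeps the class balance of the full data. The helper name `stratified_holdout` and the toy arrays are hypothetical.

```python
import numpy as np


def stratified_holdout(X, y, val_frac=0.3, seed=0):
    """Hold out val_frac of each class so both parts keep the class balance."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    val_mask = np.zeros(len(y), dtype=bool)
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        n_val = int(round(len(members) * val_frac))
        val_mask[members[:n_val]] = True
    # Shuffle both index sets so neither part is ordered by class.
    train_idx = rng.permutation(np.flatnonzero(~val_mask))
    val_idx = rng.permutation(np.flatnonzero(val_mask))
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]


# Toy data: 10 positives followed by 10 negatives.
X_all = np.arange(20).reshape(20, 1)
y_all = np.array([1] * 10 + [0] * 10)
X_tr, y_tr, X_val, Y_val = stratified_holdout(X_all, y_all, val_frac=0.3)
print("validation size:", len(Y_val), "positives in validation:", int(Y_val.sum()))
```

The resulting arrays would then be passed as `model.fit(X_tr, y_tr, validation_data=(X_val, Y_val), ...)`.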

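I had the same situation: high acc and low val_acc.

It was caused by the validation_split argument of Keras model.fit.

It sets aside the last portion of the data as validation data. So if your data is ordered by class, your validation data will all belong to the same class. Try shuffling the training data.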
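The effect can be reproduced without training anything. The sketch below builds a label array laid out like the question's (all positives, then all negatives, using the class counts given there) and imitates a tail split with a hypothetical helper `tail_split`. The question actually passed a subset (`X_train[train]`) to fit, so the exact numbers are illustrative only.

```python
import numpy as np

# Labels laid out as in the question: every positive first, then every negative.
n_pos, n_neg = 9152, 10849
labels = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])


def tail_split(y, validation_split):
    """Imitate validation_split: hold out the last fraction, with no shuffling first."""
    split_at = int(len(y) * (1.0 - validation_split))
    return y[:split_at], y[split_at:]


train_y, val_y = tail_split(labels, 0.30)
print("ordered  -> positive share: train %.3f, validation %.3f"
      % (train_y.mean(), val_y.mean()))

# Shuffling once before the split makes both parts representative.
rng = np.random.default_rng(0)
shuffled = labels[rng.permutation(len(labels))]
train_s, val_s = tail_split(shuffled, 0.30)
print("shuffled -> positive share: train %.3f, validation %.3f"
      % (train_s.mean(), val_s.mean()))
```

With ordered labels, the validation part contains only negative tweets while the training part is mostly positive, so validation accuracy tells you little about how well the model generalizes.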
