我正在使用 Keras Sequential 模型在 python 中构建情感分析程序以进行深度学习
我的数据是 20,000 条推文:
- 正面推文:9152 条推文
- 负面推文:10849 条推文
我编写了一个顺序模型脚本来进行二进制分类,如下所示:
model=Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_words))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
print(model.summary())
history=model.fit(X_train[train], y1[train], validation_split=0.30,epochs=2, batch_size=128,verbose=2)
但是我得到了非常奇怪的结果!模型准确度几乎完美(>90),而验证准确度非常低(<1)(如下所示)
Train on 9417 samples, validate on 4036 samples
Epoch 1/2
- 13s - loss: 0.5478 - acc: 0.7133 - val_loss: 3.6157 - val_acc: 0.0243
Epoch 2/2
- 11s - loss: 0.2287 - acc: 0.8995 - val_loss: 5.4746 - val_acc: 0.0339
我试图增加 epoch 的数量,它只会增加模型的准确性并降低验证的准确性
关于如何克服这个问题的任何建议?
更新:
这就是我处理数据的方式
#read training data
pos_file=open('pos2.txt', 'r', encoding="Latin-1")
neg_file=open('neg3.txt', 'r', encoding="Latin-1")
# Load data from files
pos = list(pos_file.readlines())
neg = list(neg_file.readlines())
x = pos + neg
docs = numpy.array(x)
#read Testing Data
pos_test=open('posTest2.txt', 'r',encoding="Latin-1")
posT = list(pos_test.readlines())
neg_test=open('negTest2.txt', 'r',encoding="Latin-1")
negT = list(neg_test.readlines())
xTest = posT + negT
total2 = numpy.array(xTest)
CombinedDocs=numpy.append(total2,docs)
# Generate labels
positive_labels = [1 for _ in pos]
negative_labels = [0 for _ in neg]
labels = numpy.concatenate([positive_labels, negative_labels], 0)
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(CombinedDocs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
#print(encoded_docs)
# pad documents to a max length of 140 words
max_length = 140
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
这里我使用了 Google public word2vec
# load the whole embedding into memory
embeddings_index = dict()
f = open('Google28.bin',encoding="latin-1")
for line in f:
values = line.split()
word = values[0]
coefs = asarray(values[1:], dtype='str')
embeddings_index[word] = coefs
f.close()
# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
embedding_matrix[i] = embedding_vector
#Convert to numpy
NewTraining=numpy.array(padded_docs)
NewLabels=numpy.array(labels)
encoded_docs2 = t.texts_to_sequences(total2)
# pad documents to a max length of 140 words
padded_docs2 = pad_sequences(encoded_docs2, maxlen=max_length, padding='post')
# Generate labels
positive_labels2 = [1 for _ in posT]
negative_labels2 = [0 for _ in negT]
yTest = numpy.concatenate([positive_labels2, negative_labels2], 0)
NewTesting=numpy.array(padded_docs2)
NewLabelsTsting=numpy.array(yTest)