Creating labels for text classification with Keras

data-mining python deep-learning nlp keras text-mining
2022-02-15 07:15:33

I have a text file containing information that needs to be classified based on keywords. The file consists of many paragraphs, and some paragraphs contain the keywords we want (for example, salary amount, interest rate, etc.).

I want to build a model that extracts the paragraphs (or 3 to 4 lines of text) containing the keywords I want. How do I create labels in this situation? All I have is raw text.

I am new to NLP. Any suggestions on how I can approach this problem?

1 Answer

You can build a text classification application with a CNN using the Keras library. Take a look at this git repository: here

As you can see there, you create the training and test data by loading the polarity data from files, splitting it into words, generating the labels, and returning the split sentences together with their labels. The convolutional neural network itself can be built with Keras's Dense, Embedding, Conv2D, and MaxPool2D layers.
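Since your question is specifically about creating labels from raw text, here is a minimal, hypothetical sketch of one way to do it: split the text into paragraphs and assign a one-hot label based on whether a paragraph contains one of your keywords. The `KEYWORDS` set and the blank-line paragraph split are assumptions for illustration, not part of the linked repository.

```python
# Hypothetical sketch: generate labels from raw text by keyword matching.
# KEYWORDS and the paragraph-splitting rule are assumptions -- adapt
# them to your own data.
KEYWORDS = {"salary", "interest rate"}

def label_paragraphs(raw_text):
    """Split raw text into paragraphs and label each one:
    [1, 0] if it contains a keyword, [0, 1] otherwise
    (one-hot, matching the two-column y used in training)."""
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    labels = []
    for p in paragraphs:
        text = p.lower()
        has_keyword = any(k in text for k in KEYWORDS)
        labels.append([1, 0] if has_keyword else [0, 1])
    return paragraphs, labels

paragraphs, labels = label_paragraphs(
    "The salary amount is 5000 USD.\n\nAn unrelated paragraph."
)
# labels -> [[1, 0], [0, 1]]
```

The paragraphs labeled this way play the role of the sentences in the repository's `load_data`, and the one-hot labels correspond to its two-column `y`.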

Here is the final model-training snippet.

from keras.layers import Input, Dense, Embedding, Conv2D, MaxPool2D
from keras.layers import Reshape, Flatten, Dropout, Concatenate
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.models import Model
from sklearn.model_selection import train_test_split
from data_helpers import load_data

print('Loading data')
x, y, vocabulary, vocabulary_inv = load_data()

# x.shape -> (10662, 56)
# y.shape -> (10662, 2)
# len(vocabulary) -> 18765
# len(vocabulary_inv) -> 18765

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# X_train.shape -> (8529, 56)
# y_train.shape -> (8529, 2)
# X_test.shape -> (2133, 56)
# y_test.shape -> (2133, 2)


sequence_length = x.shape[1] # 56
vocabulary_size = len(vocabulary_inv) # 18765
embedding_dim = 256
filter_sizes = [3,4,5]
num_filters = 512
drop = 0.5

epochs = 100
batch_size = 30

# this returns a tensor
print("Creating Model...")
inputs = Input(shape=(sequence_length,), dtype='int32')
embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim, input_length=sequence_length)(inputs)
reshape = Reshape((sequence_length,embedding_dim,1))(embedding)

conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)

maxpool_0 = MaxPool2D(pool_size=(sequence_length - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(sequence_length - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(sequence_length - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_2)

concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)
dropout = Dropout(drop)(flatten)
output = Dense(units=2, activation='softmax')(dropout)

# this creates a model that includes the input layer and all the layers above
model = Model(inputs=inputs, outputs=output)

checkpoint = ModelCheckpoint('weights.{epoch:03d}-{val_acc:.4f}.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
print("Training Model...")
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, callbacks=[checkpoint], validation_data=(X_test, y_test))  # starts training

Running this code will give you a trained model saved in HDF5 format. Finally, you can use your model to make predictions.
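To make a prediction on new text, it has to be encoded the same way as the training data: words mapped to vocabulary indices and padded to `sequence_length`. Below is a hypothetical sketch of that encoding step; the `"<PAD/>"` padding token and the exact `vocabulary` mapping are assumptions based on common conventions in such repositories, so check how `data_helpers.load_data` builds them in your case.

```python
import numpy as np

# Hypothetical sketch of inference-time encoding: map words to
# vocabulary indices, truncate/pad to sequence_length, and return
# a batch of shape (1, sequence_length). The "<PAD/>" token is an
# assumption -- use whatever padding token your vocabulary contains.
def encode(sentence, vocabulary, sequence_length, pad_token="<PAD/>"):
    tokens = sentence.lower().split()
    # unknown words fall back to the padding index here for simplicity
    ids = [vocabulary.get(t, vocabulary[pad_token]) for t in tokens]
    ids = ids[:sequence_length]
    ids += [vocabulary[pad_token]] * (sequence_length - len(ids))
    return np.array([ids])

# With the trained model from the snippet above, prediction would look like:
# x_new = encode("the salary amount is 5000", vocabulary, sequence_length)
# probs = model.predict(x_new)               # shape (1, 2) softmax scores
# predicted_class = int(np.argmax(probs, axis=1)[0])
```

The `argmax` over the two softmax outputs then gives you the predicted class, matching the two-column one-hot labels used during training.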