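I got very good results by treating this as a classification problem, using embeddings (50-dimensional GloVe vectors for the word embeddings) and a bidirectional LSTM. I know this looks more like an entity recognition problem, but in my use case I only need to classify against a known subset of merchants, so it works well. Because the training data is heavily imbalanced, I also used data synthesis to improve accuracy.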
My Keras model:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
words_input (InputLayer)        (None, None)         0
__________________________________________________________________________________________________
casing_input (InputLayer)       (None, None)         0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 50)     20000000    words_input[0][0]
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 9)      81          casing_input[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, None, 59)     0           embedding_1[0][0]
                                                                 embedding_2[0][0]
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 400), (None, 416000      concatenate_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1591)         637991      bidirectional_1[0][0]
==================================================================================================
Total params: 21,054,072
Trainable params: 1,053,991
Non-trainable params: 20,000,081
__________________________________________________________________________________________________
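For anyone who wants to reproduce something similar, here is a rough sketch of how a model with this summary could be built. It is my reading of the summary, not the exact original code. The layer sizes are inferred from the parameter counts (a 400,000-word vocabulary, 9 casing classes, 200 LSTM units per direction, 1591 merchant classes). The softmax activation, the constant names, and the helper functions are assumptions made for illustration. The arithmetic helpers also show where each number in the summary comes from.

```python
# Sizes inferred from the model summary; all names here are illustrative.
VOCAB_SIZE = 400_000    # 20,000,000 params / 50 dims
WORD_DIM = 50           # GloVe 50-dimensional vectors
CASING_CLASSES = 9      # 81 params / 9 dims
CASING_DIM = 9
LSTM_UNITS = 200        # 400 outputs / 2 directions
NUM_MERCHANTS = 1591    # output classes


def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


def expected_param_counts():
    # Both embeddings are frozen, which matches the non-trainable count.
    frozen = (embedding_params(VOCAB_SIZE, WORD_DIM)
              + embedding_params(CASING_CLASSES, CASING_DIM))
    bilstm = 2 * lstm_params(WORD_DIM + CASING_DIM, LSTM_UNITS)
    dense = dense_params(2 * LSTM_UNITS, NUM_MERCHANTS)
    return {
        "bilstm": bilstm,
        "dense": dense,
        "trainable": bilstm + dense,
        "non_trainable": frozen,
        "total": frozen + bilstm + dense,
    }


def build_model():
    # Imported here so the arithmetic above runs without TensorFlow installed.
    from tensorflow.keras import Model, layers

    words_input = layers.Input(shape=(None,), dtype="int32", name="words_input")
    casing_input = layers.Input(shape=(None,), dtype="int32", name="casing_input")

    # Frozen embeddings; in practice the word embedding would be
    # initialised from the GloVe vectors (loading is not shown here).
    words = layers.Embedding(VOCAB_SIZE, WORD_DIM, trainable=False)(words_input)
    casing = layers.Embedding(CASING_CLASSES, CASING_DIM, trainable=False)(casing_input)

    merged = layers.Concatenate()([words, casing])
    encoded = layers.Bidirectional(layers.LSTM(LSTM_UNITS))(merged)

    # Assumption: a softmax over the known merchants.
    output = layers.Dense(NUM_MERCHANTS, activation="softmax")(encoded)
    return Model(inputs=[words_input, casing_input], outputs=output)
```

Calling `expected_param_counts()` gives the per-layer numbers to compare against the summary, and `build_model().summary()` prints the layer table for the sketch. The bidirectional layer in the summary reports a list of outputs (the line is cut off), so the original model may also have returned the LSTM states. The sketch uses only the single concatenated output.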