How to match a sentence against a set of keywords?

data-mining python data word2vec similarity
2022-02-12 23:17:21

I have a classification problem. I have clusters named "Experience", "Education", and "Ability". The labeled data has two columns (72,000+ entries across all clusters) and looks like this:

year of education               Education
years education                 Education
years of educational            Education
two years of education          Education
years of education beyond       Education
education four year             Education
..........
of proven sales experience      Ability
knowledge of and                Ability
experience or education high    Ability
assigned knowledge skills       Ability
accountable for driving         Ability
..........
administrative and leadership skills    Experience
advanced negotiations skills            Experience
must have keyboarding skills            Experience
must have skills                        Experience
activities preferred skills             Experience
of clinical skill                       Experience

I need to take a string and, based on the trained model, determine whether it belongs to Experience, Education, or Ability. Example strings:

string1 = "There is a requirement of four-year professional degrees"
string2 = "Able to drive the teams to higher levels"
string3 = "Must have programming experience in C, C++"

When I test these strings, the model should assign each one to one of the clusters.

  • What are the possible approaches to training my model?
  • Would models such as word2vec and doc2vec work here?

I couldn't find any relevant examples of training a word model and then testing strings against it. Any ideas on how this could work?

2 Answers

I'm a little confused by this question, but I'm guessing you're asking which model to use and how to represent the words.

This looks like a fairly introductory example, so you may want to try a simple bag-of-words representation first. Feed it into a naive Bayes classifier as a first pass; if you're not happy with the results, you can move on to word2vec and more sophisticated approaches.
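As a rough sketch of that first pass, here is a minimal bag-of-words + multinomial naive Bayes pipeline with scikit-learn. It trains on a handful of rows taken from the question purely for illustration; a real run would fit on all 72,000 labeled entries.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A few toy training rows taken from the question's table
docs = [
    "year of education", "years education", "two years of education",
    "of proven sales experience", "knowledge of and", "accountable for driving",
    "administrative and leadership skills", "advanced negotiations skills",
    "must have keyboarding skills",
]
labels = ["Education"] * 3 + ["Ability"] * 3 + ["Experience"] * 3

# Bag-of-words counts feeding a multinomial naive Bayes classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["There is a requirement of four-year professional degrees"]))
# → ['Education']
```

Unknown words in a new string are simply ignored by the vectorizer, so classification rests on the vocabulary seen during training; with the full dataset that coverage would be much better than in this toy example.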

One approach is to frame this as text classification: given a corpus of documents, learn features that predict a discrete category. This can be done by learning word embeddings and training a model on top of them to classify the documents.

Something like this:

from tensorflow.keras.layers import Activation, Dense, Embedding, Convolution1D, Flatten, MaxPooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

docs = ['year of education',
        'years education',
        'years of educational',
        'two years of education',
        'years of education beyond',
        'education four year',
        'of proven sales experience',
        'knowledge of and',
        'experience or education high',
        'assigned knowledge skills',
        'accountable for driving',
        'administrative and leadership skills',
        'advanced negotiations skills',
        'must have keyboarding skills',
        'must have skills',
        'activities preferred skills',
        'of clinical skill']

# Transform string documents into numeric sequences
tokenizer = Tokenizer(num_words=100)  # Tokenizer lowercases and strips punctuation
tokenizer.fit_on_texts(docs)
X = tokenizer.texts_to_sequences(docs)  # Convert texts to token sequences
X = pad_sequences(X, maxlen=4)  # Pad to a fixed length for the Embedding layer

y = ['Education']*6
y += ['Ability']*5
y += ['Experience']*6

# Transform string categories into numeric categories
mapping = {label: i for i, label in enumerate(sorted(set(y)))}  # e.g. {'Ability': 0, 'Education': 1, 'Experience': 2}
vec = [mapping[label] for label in y]  # Encode each string label as an integer
y = to_categorical(vec)  # Convert integer labels into one-hot vectors

# Create the model
model = Sequential()

# The Embedding layer learns dense word vectors, similar in spirit to word2vec
model.add(Embedding(input_dim=100,
                    output_dim=3,
                    input_length=4))

# Convolution layer
model.add(Convolution1D(filters=64,
                        kernel_size=2,
                        activation='relu'))
model.add(MaxPooling1D())
model.add(Flatten())  # Collapse the pooled feature maps into a single vector

# The class-specific output layer (one unit per cluster)
model.add(Dense(units=3))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

# Train the model
model.fit(X,
          y,
          batch_size=256,
          epochs=100)

# Predict the categories for new data
docs_new = ["There is a requirement of four-year professional degrees",
            "Able to drive the teams to higher levels",
            "Must have programming experience in C, C++"]

# texts_to_sequences expects a list of texts, and the input must be
# padded the same way as the training data
X_new = pad_sequences(tokenizer.texts_to_sequences(docs_new), maxlen=4)
probs = model.predict(X_new)  # one row of class probabilities per document