Data mining - Extracting relevant vocabulary from documents

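I am training a DSSM model for QnA. I have 200 queries and their corresponding answers. Each answer describes what kind of information an article relevant to the query should contain. For example: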

Title: Literacy rates in Africa

Description: What are the literacy rates of African countries

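I have trained my model with the entire vocabulary, but validation did not give very good results. By "entire vocabulary" I mean a list containing every word used; I kept them all because I think removing prepositions, conjunctions, and so on could cause a loss of semantics.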

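Now I am trying to find a way to extract more relevant vocabulary from my documents. I did some research and came across n-grams. In fact, there is a similar example in CNTK, where the vocabulary used for the answers is made up of both single words and n-grams, but I have not found a way to build one myself.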

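So far I have found a way to generate n-grams, but it is not what I want. For example, for the sentence:

the cow jumps over the moon

I get the following 3-grams: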

the_cow_jumps
cow_jumps_over
jumps_over_the
over_the_moon

whereas what I am interested in is a single 4-gram:

cow_jumps_over_moon

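Keep in mind that even after removing articles (or stop words), I would still get more than one n-gram per sentence, which is not what I want, because my main goal is to obtain a final vocabulary to train my model.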
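To make the difference concrete, here is a minimal sketch of sliding-window n-grams joined with underscores, with and without articles. The helper name `ngrams` and the tiny `ARTICLES` set are illustrative assumptions, not part of any library or of the CNTK example.

```python
# Illustrative sketch: sliding-window n-grams joined with underscores.
ARTICLES = {"a", "an", "the"}  # assumed minimal stop list, for illustration only


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the cow jumps over the moon"
tokens = sentence.split()

# Plain 3-grams over all tokens: one n-gram per window position.
trigrams = ngrams(tokens, 3)

# Drop the articles first, then take 4-grams over what is left.
content_tokens = [t for t in tokens if t not in ARTICLES]
fourgrams = ngrams(content_tokens, 4)

print(trigrams)
print(fourgrams)
```

For this short sentence the filtered tokens happen to form exactly one 4-gram, but any longer sentence still produces one n-gram per window position.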

As an example of what I want, the vocabulary could contain terms like these:

book 
book_character 
book_editions_published 
book_subject 
books_published

instead of:

book
character
editions
published
subject
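One possible way to get a mixed vocabulary like the first list is to count every 1-gram up to some maximum length across all documents and keep multi-word n-grams only when they recur. The sketch below assumes that heuristic; `build_vocabulary`, `max_n`, `min_count`, and the toy documents are all hypothetical, and this is not necessarily how the CNTK example builds its vocabulary.

```python
from collections import Counter


def build_vocabulary(documents, max_n=3, min_count=2):
    """Keep every single word, plus longer n-grams seen at least min_count times."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                # Count n-grams as tuples so tokens containing '_' stay unambiguous.
                counts[tuple(tokens[i:i + n])] += 1
    kept = [gram for gram, c in counts.items() if len(gram) == 1 or c >= min_count]
    return sorted("_".join(gram) for gram in kept)


# Toy documents, made up for illustration.
docs = [
    "book editions published",
    "book editions published by year",
    "book subject",
]
vocab = build_vocabulary(docs)
print(vocab)
```

With these toy documents, `book_editions_published` is kept because it occurs twice, while `book_subject` is dropped because it occurs only once; lowering `min_count` to 1 would keep every n-gram.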

OUTPUT:
('This', 'is', 'random', 'text')
('is', 'random', 'text', 'to')
('random', 'text', 'to', 'demonstrate')
('text', 'to', 'demonstrate', 'the')
('to', 'demonstrate', 'the', 'use')
('demonstrate', 'the', 'use', 'of')
('the', 'use', 'of', 'n-grams')
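The code that produced this output is not included here. A minimal sketch that yields the same tuples, assuming the input sentence was the one that can be read off the output, could look like this (plain Python, no library required):

```python
# Assumed input, reconstructed from the tuples shown in the output above.
text = "This is random text to demonstrate the use of n-grams"
words = text.split()
n = 4

# Zipping n staggered views of the list yields each run of n consecutive words.
four_grams = list(zip(*(words[i:] for i in range(n))))

for gram in four_grams:
    print(gram)
```

Because `zip` stops at the shortest view, the last tuple ends on the final word and no partial n-gram is produced.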

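words = 'This is random text we’re going to split apart'
x = []
# Collect words into non-overlapping chunks of four.
for word in words.split():
    x.append(word)
    if len(x) == 4:
        print(x)
        x = []
# Print the leftover words, if any (avoids an empty list when the count divides evenly).
if x:
    print(x)

OUTPUT:
['This', 'is', 'random', 'text']
['we’re', 'going', 'to', 'split']
['apart']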