Doc2vec cosine similarity - wildly inaccurate

data-mining Python nlp similarity text gensim
2021-10-10 19:58:43

I am trying to modify the Doc2vec tutorial to compute cosine similarity and to take a Pandas dataframe instead of .txt documents. I want to find the sentence in my data that is most similar to a new sentence I provide. After training, however, even when I supply a sentence that is almost identical to one already in the dataset, I get results with low similarity scores, and none of them is the sentence I adapted. For example, the dataset I trained Doc2vec on contains the sentence "This is a good cat you have."; when I then use the new sentence "This cat you have is quite nice." as input, it does not report the first sentence as similar.
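For concreteness, here is a toy version of the check I expect to work, using the same pre-4.0 gensim API as my code below (size, iter, model.docvecs). With a corpus this tiny the vectors are mostly noise, since Doc2vec needs far more training data, so this only illustrates the intended lookup:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Two tiny tagged documents, just to show the shape of the data
toy_corpus = [
    TaggedDocument("this is a good cat you have".split(), ["cat"]),
    TaggedDocument("the weather is rather bad today".split(), ["weather"]),
]
toy_model = Doc2Vec(toy_corpus, size=100, min_count=1, iter=20)

# Infer a vector for the paraphrase and look up the nearest tagged document
query_vec = toy_model.infer_vector("this cat you have is quite nice".split())
print(toy_model.docvecs.most_similar(positive=[query_vec], topn=1))  # ideally "cat"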

The data comes from an Excel sheet and looks roughly like this:

  Description                                                                                                | Group       | Number
0 Sent: This is a sentence                                                                                   | Regular     | NUM1234
1 Sent: Another sentence                                                                                     | Regular     | NUM1243
2 Sent: Basically all the input                                                                              | Other group | NUM1278
3 Sent: Creating a test case to validate the routing between applications.  No action needed at this moment  | Other group | NUM1287
...etc...

I have the following code (with some code that is not needed for understanding trimmed out):

import datetime
import multiprocessing
from collections import namedtuple, OrderedDict

import gensim
import pandas as pd
from gensim.models.doc2vec import Doc2Vec

df = pd.read_excel("my_data.xls")

df["Description"] = df["Description"].apply(lambda x: removeGeneric(x))  # removeGeneric() just strips "Sent:" from the beginning of each sentence
# Note: assigning into the `row` yielded by iterrows() does not write back to
# the dataframe, so lowercase and normalize the column directly instead
df["Description"] = df["Description"].str.lower()
df["Description"] = df["Description"].apply(normalize_text)  # normalize_text() removes stopwords defined in the nltk package and words shorter than 2 characters

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

alldocs = []  
for index, row in df.iterrows():
    words = gensim.utils.to_unicode(row["Description"]).split()
    tags = [row["Number"]]
    alldocs.append(SentimentDocument(words, tags))

doc_list = alldocs[:]
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM w/ concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

from random import shuffle

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)  # shuffle between passes to vary the training order

    for name, train_model in models_by_name.items():
        # Train for one pass at a fixed learning rate
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:  # elapsed_timer() is the timing helper from the tutorial
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)

    # Decay the learning rate linearly between passes (this is what alpha_delta above is for)
    alpha -= alpha_delta

for model in simple_models:
    # Note: testing with a sentence very similar to one in the original dataset
    new_sentence = "Test case creation to validation of routing between applications.  No action needed"
    new_sentence = removeGeneric(new_sentence)
    new_sentence = normalize_text(new_sentence)
    print(model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)], topn=2))

For this I get the following output:

[('NUM1254', 0.3154909014701843), ('NUM5247', 0.2487245500087738)]
[('NUM3875', 0.20226456224918365), ('NUM3793', 0.1970052272081375)]
[('NUM3585', 0.13086965680122375), ('NUM3857', 0.1298370361328125)]
creating test case validate routing applications action needed moment

All of the suggestions are completely unrelated; sentences like "site id factory address good owner power request approved number al region Province" come up, while the sentence it is actually close to (the one from the dataset, "Creating a test case to validate the routing between applications. No action needed at this moment") is not in the list.

Can you see what I am doing wrong? Is there anything I can do to improve accuracy? Has anyone else seen this kind of inaccuracy in doc2vec's cosine-similarity predictions? If I hand-code the implementation (like this, for example), it does give the correct answer, which is completely different from doc2vec's answer (but is actually accurate).

1 Answer

I think you are missing model.infer_vector(new_sentence). You need to infer a vector for the new sentence from your trained model; you can find more details in the "Evaluating the Model" section here.

The similarity is computed between vectors, not between the normalized tokens, so you first have to infer vectors for them using your model.
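A minimal sketch of that inference step, assuming the pre-4.0 gensim API the question uses (size, model.docvecs) and assuming normalize_text() returns a plain string rather than a token list: infer_vector() expects a list of words, so an unsplit string gets iterated character by character, and raising the number of inference passes (steps) from its small default tends to give more stable vectors for short texts.

# Sketch: infer a vector for the new sentence, then query the trained
# doc-vectors (gensim < 4.0 API, matching the question's code)
new_sentence = "Test case creation to validation of routing between applications.  No action needed"
new_sentence = removeGeneric(new_sentence)     # asker's helper: strips "Sent:"
tokens = normalize_text(new_sentence).split()  # infer_vector() needs a token list, not a string

for model in simple_models:
    # More inference passes than the default; short documents usually
    # benefit from extra passes when inferring a vector
    vec = model.infer_vector(tokens, alpha=0.025, steps=50)
    print(model.docvecs.most_similar(positive=[vec], topn=2))

If normalize_text() already returns a list of tokens, the .split() is unnecessary, but the larger steps value still helps.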