Doc2vec cosine similarity - wildly inaccurate

data-mining Python nlp similarity text gensim
2021-10-10 19:58:43

I am trying to modify the Doc2vec tutorial to compute cosine similarity and to take a Pandas dataframe instead of .txt documents. I want to find the sentence in my data that is most similar to a new sentence I provide. After training, however, even when I supply a sentence that is almost identical to one already in the dataset, I get results with low similarity scores, and none of them is the sentence I adapted. For example, the dataset I trained Doc2vec on contains the sentence "This is a good cat you have."; when I then use the new sentence "This cat you have is quite nice." as input, it does not report the first sentence as similar.
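For concreteness, here is a toy version of the check I expect to work, using the same pre-4.0 gensim API as my code below (size, iter, model.docvecs). With a corpus this tiny the vectors are mostly noise, since Doc2vec needs far more training data, so this only illustrates the intended lookup:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Two tiny tagged documents, just to show the shape of the data
toy_corpus = [
    TaggedDocument("this is a good cat you have".split(), ["cat"]),
    TaggedDocument("the weather is rather bad today".split(), ["weather"]),
]
toy_model = Doc2Vec(toy_corpus, size=100, min_count=1, iter=20)

# Infer a vector for the paraphrase and look up the nearest tagged document
query_vec = toy_model.infer_vector("this cat you have is quite nice".split())
print(toy_model.docvecs.most_similar(positive=[query_vec], topn=1))  # ideally "cat"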

The data comes from an Excel sheet and looks roughly like this:

  Description                                                                                                | Group       | Number
0 Sent: This is a sentence                                                                                   | Regular     | NUM1234
1 Sent: Another sentence                                                                                     | Regular     | NUM1243
2 Sent: Basically all the input                                                                              | Other group | NUM1278
3 Sent: Creating a test case to validate the routing between applications.  No action needed at this moment  | Other group | NUM1287
...etc...

I have the following code (with some code that is not needed for understanding trimmed out):

import datetime
import multiprocessing
from collections import namedtuple, OrderedDict

import gensim
import pandas as pd
from gensim.models.doc2vec import Doc2Vec

df = pd.read_excel("my_data.xls")

df["Description"] = df["Description"].apply(lambda x: removeGeneric(x))  # removeGeneric() just strips "Sent:" from the beginning of each sentence
# Note: assigning into the `row` yielded by iterrows() does not write back to
# the dataframe, so lowercase and normalize the column directly instead
df["Description"] = df["Description"].str.lower()
df["Description"] = df["Description"].apply(normalize_text)  # normalize_text() removes stopwords defined in the nltk package and words shorter than 2 characters

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

alldocs = []  
for index, row in df.iterrows():
    words = gensim.utils.to_unicode(row["Description"]).split()
    tags = [row["Number"]]
    alldocs.append(SentimentDocument(words, tags))

doc_list = alldocs[:]
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM w/ concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

from random import shuffle

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)  # shuffle between passes to vary the training order

    for name, train_model in models_by_name.items():
        # Train for one pass at a fixed learning rate
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:  # elapsed_timer() is the timing helper from the tutorial
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)

    # Decay the learning rate linearly between passes (this is what alpha_delta above is for)
    alpha -= alpha_delta

for model in simple_models:
    # Note: testing with a sentence very similar to one in the original dataset
    new_sentence = "Test case creation to validation of routing between applications.  No action needed"
    new_sentence = removeGeneric(new_sentence)
    new_sentence = normalize_text(new_sentence)
    print(model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)], topn=2))

For this I get the following output:

[('NUM1254', 0.3154909014701843), ('NUM5247', 0.2487245500087738)]
[('NUM3875', 0.20226456224918365), ('NUM3793', 0.1970052272081375)]
[('NUM3585', 0.13086965680122375), ('NUM3857', 0.1298370361328125)]
creating test case validate routing applications action needed moment

All of the suggestions are completely unrelated; sentences like "site id factory address good owner power request approved number al region Province" come up, while the sentence it is actually close to (the one from the dataset, "Creating a test case to validate the routing between applications. No action needed at this moment") is not in the list.

Can you see what I am doing wrong? Is there anything I can do to improve accuracy? Has anyone else seen this kind of inaccuracy in doc2vec's cosine-similarity predictions? If I hand-code the implementation (like this, for example), it does give the correct answer, which is completely different from doc2vec's answer (but is actually accurate).

1 Answer

I think you are missing model.infer_vector(new_sentence). You need to infer a vector for the new sentence from your trained model; you can find more details in the "Evaluating the Model" section here.

The similarity is computed between vectors, not between the normalized tokens, so you first have to infer vectors for them using your model.
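A minimal sketch of that inference step, assuming the pre-4.0 gensim API the question uses (size, model.docvecs) and assuming normalize_text() returns a plain string rather than a token list: infer_vector() expects a list of words, so an unsplit string gets iterated character by character, and raising the number of inference passes (steps) from its small default tends to give more stable vectors for short texts.

# Sketch: infer a vector for the new sentence, then query the trained
# doc-vectors (gensim < 4.0 API, matching the question's code)
new_sentence = "Test case creation to validation of routing between applications.  No action needed"
new_sentence = removeGeneric(new_sentence)     # asker's helper: strips "Sent:"
tokens = normalize_text(new_sentence).split()  # infer_vector() needs a token list, not a string

for model in simple_models:
    # More inference passes than the default; short documents usually
    # benefit from extra passes when inferring a vector
    vec = model.infer_vector(tokens, alpha=0.025, steps=50)
    print(model.docvecs.most_similar(positive=[vec], topn=2))

If normalize_text() already returns a list of tokens, the .split() is unnecessary, but the larger steps value still helps.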