Data mining - How to use paraphrase_mining with a pre-trained sentence-transformers model

I am trying to use a pre-trained sentence-transformers model to find the similarity between sentences. I am following the code here: https://www.sbert.net/docs/usage/paraphrase_mining.html

In my first attempt, I used two nested for loops to collect the similarity of each sentence to every other sentence. Here is the code:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')


# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']

#Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

print(len(pairs))
6

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))

A man is playing guitar          Do you like pizza?          Score: 0.1080
The new movie is awesome         Do you like pizza?          Score: 0.0829
A man is playing guitar          The new movie is awesome        Score: 0.0652
The cat sits outside         Do you like pizza?          Score: 0.0523
The cat sits outside         The new movie is awesome        Score: -0.0270
The cat sits outside         A man is playing guitar         Score: -0.0530

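This works as expected: 4 sentences form 6 unique sentence pairs, so there are 6 similarity scores. The documentation page notes that this approach does not scale well because of its quadratic complexity, and recommends the paraphrase_mining() method instead.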
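As a sanity check on the count of 6, here is a minimal sketch that enumerates the unordered index pairs using only the standard library. It does not call sentence-transformers at all; the names index_pairs and expected_pairs are just illustrative.

```python
from itertools import combinations

sentences = [
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome",
    "Do you like pizza?",
]

# Every unordered pair (i, j) with i < j, the same pairs the nested loops visit
index_pairs = list(combinations(range(len(sentences)), 2))

# Closed form for the number of unordered pairs among n items: n * (n - 1) / 2
n = len(sentences)
expected_pairs = n * (n - 1) // 2

print(len(index_pairs), expected_pairs)
```

Both numbers come out as 6 for 4 sentences, which is the count I expected from paraphrase_mining() as well.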

But when I try that method, I get only 5 pairs instead of 6. Why is that?

Here is the sample code where I try the paraphrase_mining() method:

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome',
             'Do you like pizza?']


paraphrases = util.paraphrase_mining(model, sentences)
print(len(paraphrases))
5

k = 0
for paraphrase in paraphrases:
    print(k)
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))
    print()
    k = k + 1

0
A man is playing guitar          Do you like pizza?          Score: 0.1080

1
The new movie is awesome         Do you like pizza?          Score: 0.0829

2
A man is playing guitar          The new movie is awesome        Score: 0.0652

3
The cat sits outside         Do you like pizza?          Score: 0.0523

4
The cat sits outside         The new movie is awesome        Score: -0.0270
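To pin down which pair is absent, here is a small sketch that diffs the two result sets. The scores are copied by hand from the two outputs above and keyed by sentence indices; the dictionary names brute_force and mined are my own, and nothing here calls the library.

```python
sentences = [
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome",
    "Do you like pizza?",
]

# Scores from the nested-loop output, keyed by (i, j) sentence indices
brute_force = {
    (1, 3): 0.1080,
    (2, 3): 0.0829,
    (1, 2): 0.0652,
    (0, 3): 0.0523,
    (0, 2): -0.0270,
    (0, 1): -0.0530,
}

# Scores from the paraphrase_mining() output
mined = {
    (1, 3): 0.1080,
    (2, 3): 0.0829,
    (1, 2): 0.0652,
    (0, 3): 0.0523,
    (0, 2): -0.0270,
}

# Pairs present in the first result but absent from the second
missing = sorted(set(brute_force) - set(mined))
for i, j in missing:
    print("{} | {} | Score: {:.4f}".format(sentences[i], sentences[j], brute_force[(i, j)]))
```

The only missing pair is "The cat sits outside" / "A man is playing guitar", which is also the lowest-scoring pair in the first output. The five scores that do appear match the first output exactly.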

Is there a difference in how paraphrase_mining() works?