Distance between each word in given strings

Artificial intelligence, natural language processing, similarity
2021-10-30 00:31:32

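Calculating the Levenshtein distance between two strings in Python makes it possible to get the distance and similarity between two given strings (sentences).

The code from "Levenshtein Distance and Text Similarity in Python" returns the matrix for each character and the distance between the two strings.

Is there any way to calculate the distance and similarity between each word in the strings, and to print the matrix for each word in a string (sentence)?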

a = "This is a dog."
b = "This is a cat."

import numpy as np

def levenshtein(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros ((size_x, size_y))
    for x in range(size_x):
        matrix [x, 0] = x
    for y in range(size_y):
        matrix [0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            if seq1[x-1] == seq2[y-1]:
                matrix [x,y] = min(
                    matrix[x-1, y] + 1,
                    matrix[x-1, y-1],
                    matrix[x, y-1] + 1
                )
            else:
                matrix [x,y] = min(
                    matrix[x-1,y] + 1,
                    matrix[x-1,y-1] + 1,
                    matrix[x,y-1] + 1
                )
    print (matrix)
    return (matrix[size_x - 1, size_y - 1])

levenshtein(a, b)

Output

>> 3

Matrix

[[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
 [ 1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13.]
 [ 2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12.]
 [ 3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11.]
 [ 4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
 [ 5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
 [ 6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.]
 [ 7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.]
 [ 9.  8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.]
 [10.  9.  8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.]
 [11. 10.  9.  8.  7.  6.  5.  4.  3.  2.  1.  1.  2.  3.  4.]
 [12. 11. 10.  9.  8.  7.  6.  5.  4.  3.  2.  2.  2.  3.  4.]
 [13. 12. 11. 10.  9.  8.  7.  6.  5.  4.  3.  3.  3.  3.  4.]
 [14. 13. 12. 11. 10.  9.  8.  7.  6.  5.  4.  4.  4.  4.  3.]]

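The general character-level Levenshtein distance is shown in the figure below. [image]

Is it possible to calculate the Levenshtein distance at the word level?

Desired matrix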

          This is a cat

This
is
a
dog

2 Answers

Maybe try this:

from functools import lru_cache
from itertools import product

@lru_cache(maxsize=4095)
def ld(s, t):
    """
    Levenshtein distance memoized implementation from Rosetta code:
    https://rosettacode.org/wiki/Levenshtein_distance#Python
    """
    if not s: return len(t)
    if not t: return len(s)
    if s[0] == t[0]: return ld(s[1:], t[1:])
    l1 = ld(s, t[1:])      # Deletion.
    l2 = ld(s[1:], t)      # Insertion.
    l3 = ld(s[1:], t[1:])  # Substitution.
    return 1 + min(l1, l2, l3)


a = "this is a sentence".split()
b = "yet another cat thing".split()

# To get the triplets.
for i, j in product(a, b):
    print((i, j, ld(i, j)))

To get the matrix:

from scipy.sparse import coo_matrix
import numpy as np

a = "this is a sentence".split()
b = "yet another cat thing , yes".split()

tripets = np.array([(i, j, ld(w1, w2)) for (i, w1) , (j, w2) in product(enumerate(a), enumerate(b))])
row, col, data = [np.squeeze(splt) for splt in np.hsplit(tripets, tripets.shape[-1])]
coo_matrix((data, (row, col))).toarray()

[out]:

array([[4, 5, 4, 2, 4, 3],
       [3, 7, 3, 4, 2, 2],
       [3, 6, 2, 5, 1, 3],
       [6, 7, 7, 7, 8, 7]])
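
If SciPy is not available, the same pairwise matrix can probably be built with a nested list comprehension. The following is only a sketch: `edit_distance` and `pairwise_distances` are hypothetical helper names, not part of the code above, and the distance is computed iteratively instead of with the memoized recursion.

```python
import numpy as np

def edit_distance(s, t):
    """Iterative Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, start=1):
        curr = [i]
        for j, y in enumerate(t, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def pairwise_distances(words_a, words_b):
    """Character-level distance for every (word in a, word in b) pair."""
    return np.array([[edit_distance(w1, w2) for w2 in words_b]
                     for w1 in words_a])

a = "this is a sentence".split()
b = "yet another cat thing , yes".split()

# Rows follow the words of a, columns follow the words of b.
matrix = pairwise_distances(a, b)
print(matrix)
```

Row i, column j holds the character-level distance between a[i] and b[j], so this is a word-by-word comparison table and not the dynamic-programming matrix of a single word-level alignment.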

Well... just add a .split() at the end of the first two lines:

a = "This is a dog.".split()
b = "This is a cat.".split()

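Your algorithm works on any iterable, and a string is iterated character by character. Once you split, a and b are lists of words, so the same algorithm works at the word level.

Output for your example: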

[[0. 1. 2. 3. 4.]
 [1. 0. 1. 2. 3.]
 [2. 1. 0. 1. 2.]
 [3. 2. 1. 0. 1.]
 [4. 3. 2. 1. 1.]]

1.0
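
Since the question also asks about similarity and a labelled matrix, here is a minimal sketch of both at the word level. The names `word_levenshtein` and `similarity` are made up for illustration, and dividing the distance by the length of the longer sentence is just one common normalization, assumed here and not taken from the question.

```python
def word_levenshtein(words_a, words_b):
    """Return the full Levenshtein DP matrix computed over tokens."""
    rows, cols = len(words_a) + 1, len(words_b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        matrix[i][0] = i
    for j in range(cols):
        matrix[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if words_a[i - 1] == words_b[j - 1] else 1
            matrix[i][j] = min(matrix[i - 1][j] + 1,         # drop a word
                               matrix[i][j - 1] + 1,         # add a word
                               matrix[i - 1][j - 1] + cost)  # keep or replace
    return matrix

def similarity(words_a, words_b):
    """Assumed normalization: 1 - distance / length of the longer sentence."""
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0
    return 1.0 - word_levenshtein(words_a, words_b)[-1][-1] / longest

a = "This is a dog.".split()
b = "This is a cat.".split()

matrix = word_levenshtein(a, b)
distance = matrix[-1][-1]

# Print the matrix with the words of b as column labels and a as row labels.
print("\t".join(["", ""] + b))
for label, row in zip([""] + a, matrix):
    print("\t".join([label] + [str(value) for value in row]))

print(distance)
print(similarity(a, b))
```

For the two example sentences only the last word differs, so the distance is 1 over 4 words and this normalization gives a similarity of 0.75.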