Damerau-Levenshtein 在 Python 中编辑距离

数据挖掘 Python 执行
2022-02-12 20:10:39

我通过谷歌在 Damerau Levensthein 编辑距离上找到了一些 python 代码,但是当我查看他们的评论时,很多人说算法不正确。我很困惑。

有人可以在 Damerau Levensthein Distance 上分享正确的 python 代码吗?

哪一个是正确的?

https://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/

第一个:

"""Compute the Damerau-Levenshtein distance between two given strings (s1 and s2)"""


    def damerau_levenshtein_distance(s1, s2):
    
    
        d = {}
        lenstr1 = len(s1)
        lenstr2 = len(s2)
        for i in xrange(-1,lenstr1+1):
            d[(i,-1)] = i+1
        for j in xrange(-1,lenstr2+1):
            d[(-1,j)] = j+1
    
        for i in xrange(lenstr1):
            for j in xrange(lenstr2):
                if s1[i] == s2[j]:
                    cost = 0
                else:
                    cost = 1
                d[(i,j)] = min(
                               d[(i-1,j)] + 1, # deletion
                               d[(i,j-1)] + 1, # insertion
                               d[(i-1,j-1)] + cost, # substitution
                              )
                if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
                    d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition
    
        return d[lenstr1-1,lenstr2-1]

我尝试将“zx”改为“xyz”,算法答案为 3,但正确答案是 2。所以这不起作用。

第二个:

https://gist.github.com/pombredanne/0d83ad58f45986ddeb0917266e106be0

Damerau-Levenshtein 编辑远程实现

基于维基百科的伪代码:https ://en.wikipedia.org/wiki/Damerau-Levenshtein_distance

通过处理转置字符之间的 1 次添加 + 1 次删除 = 1 次替换可能会有所改进:

"abcdef" 和 "abcfad" = 3 的 Damerau-Levenshtein 距离:

  1. 用“d”代替“f”
  2. 用“e”代替“a”
  3. 用“f”代替“d”

或者:

  1. 转置“d”和“f”
  2. 删除“一”
  3. 插入“e”

很明显,第二个分析中的 (2) 和 (3) 实际上只是一种替换:

  1. 转置“d”和“f”
  2. 用“e”代替“a”

使用这个变体,“abcdef”和“abcfad”之间的距离实际上是 2。

def damerau_levenshtein_distance_improved(a, b):

    # "Infinity" -- greater than maximum possible edit distance
    # Used to prevent transpositions for first characters

    INF = len(a) + len(b)

    # Matrix: (M + 2) x (N + 2)
    matrix  = [[INF for n in xrange(len(b) + 2)]]
    matrix += [[INF] + range(len(b) + 1)]
    matrix += [[INF, m] + [0] * len(b) for m in xrange(1, len(a) + 1)]

    # Holds last row each element was encountered: DA in the Wikipedia pseudocode
    last_row = {}

    # Fill in costs
    for row in xrange(1, len(a) + 1):
        # Current character in a
        ch_a = a[row-1]

        # Column of last match on this row: DB in pseudocode
        last_match_col = 0

        for col in xrange(1, len(b) + 1):
            # Current character in b
            ch_b = b[col-1]

            # Last row with matching character
            last_matching_row = last_row.get(ch_b, 0)

            # Cost of substitution
            cost = 0 if ch_a == ch_b else 1

            # Compute substring distance
            matrix[row+1][col+1] = min(
                matrix[row][col] + cost, # Substitution
                matrix[row+1][col] + 1,  # Addition
                matrix[row][col+1] + 1,  # Deletion

                # Transposition
                # Start by reverting to cost before transposition
                matrix[last_matching_row][last_match_col]
                    # Cost of letters between transposed letters
                    # 1 addition + 1 deletion = 1 substitution
                    + max((row - last_matching_row - 1),
                          (col - last_match_col - 1))
                    # Cost of the transposition itself
                    + 1)

            # If there was a match, update last_match_col
            if cost == 0:
                last_match_col = col

        # Update last row for current character
        last_row[ch_a] = row

    # Return last element
    return matrix[-1][-1]

This code is not working.

The psedo code in wikipedia below is not working too. Strings cannot taken as index for k := da[b[j]] and da[a[i]] := i


algorithm DL-distance is
    input: strings a[1..length(a)], b[1..length(b)]
    output: distance, integer
    
    da := new array of |Σ| integers
    for i := 1 to |Σ| inclusive do
        da[i] := 0
    
    let d[−1..length(a), −1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2
    // note that d has indices starting at −1, while a, b and da are one-indexed.
    
    maxdist := length(a) + length(b)
    d[−1, −1] := maxdist
    for i := 0 to length(a) inclusive do
        d[i, −1] := maxdist
        d[i, 0] := i
    for j := 0 to length(b) inclusive do
        d[−1, j] := maxdist
        d[0, j] := j
    
    for i := 1 to length(a) inclusive do
        db := 0
        for j := 1 to length(b) inclusive do
            k := da[b[j]]
            ℓ := db
            if a[i] = b[j] then
                cost := 0
                db := j
            else
                cost := 1
            d[i, j] := minimum(d[i−1, j−1] + cost,  //substitution
                               d[i,   j−1] + 1,     //insertion
                               d[i−1, j  ] + 1,     //deletion
                               d[k−1, ℓ−1] + (i−k−1) + 1 + (j-ℓ−1)) //transposition
        da[a[i]] := i
    return d[length(a), length(b)]

Thank you. 
1个回答

我一直在使用以下代码,到目前为止它对我很有帮助:

#Calculates the normalized Levenshtein distance of 2 strings
def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    matrix = [list(range(l1 + 1))] * (l2 + 1)
    for zz in list(range(l2 + 1)):
      matrix[zz] = list(range(zz,zz + l1 + 1))
    for zz in list(range(0,l2)):
      for sz in list(range(0,l1)):
        if s1[sz] == s2[zz]:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
        else:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
    distance = float(matrix[l2][l1])
    result = 1.0-distance/max(l1,l2)
    return result

如果您不需要对其进行规范化,则应该很容易删除代码的最后一部分。