数据挖掘 - Smith-Waterman-Gotoh 算法 - 如何确定总体相似度百分比 - 吾爱随笔录

Smith-Waterman-Gotoh 算法 - 如何确定总体相似度百分比

数据挖掘算法

2021-09-21 17:41:02

使用Smith-Waterman-Gotoh 算法，我想获得两个序列之间的总体相似性百分比。最好的方法是什么？

例如。比较字符串COELACANTH，PELICAN 在此示例中，对齐得分为 4：

ELACAN
ELICAN

然后，我将如何确定两者之间的总体相似性百分比COELACANTH并PELICAN以此为基础？

1个回答

我不知道如何像在Smith-Waterman 的维基百科页面上那样编写数学代数，所以我将使用伪代码。我在SmithWatermanGotoh java 代码中找到了 SimMetrics 的逻辑。

str1 = PELICAN
str2 = COELACANTH
matchValue = 1 #in the comparisons below, when characters are equal, assign this value
mismatchValue = -2 #in the comparisons below, when characters are not equal, assign this value
gapValue = -0.5 #the gap penalty used in smith-waterman 

# get the maxDistance which is the smallest number of characters between str1 and str2 multiplied by 
# the largest of matchValue and gapValue 
maxDistance = min(length(str1), length(str2)) x max(matchValue, gapValue);

# function to compare character at index aIndex of string a with character at index bIndex of string b 
function compareCharacters(a, aIndex, b, bIndex, matchValue, mismatchValue) {
  if a[aIndex] === b[bIndex] 
    return matchValue 
  else 
    return mismatchValue 
}

v0 = an array 
v1 = an array 

lengthOfStr1 = number of characters in str1 
lengthOfStr2 = number of characters in str2 

# do the smith waterman similarity measure (currentMax)
currentMax = v0[0] = max(0, gapValue, compareCharacters(str1, 0, str2, 0, matchValue, mismatchValue))

for (j = 1; j < lengthOfStr2; j++) {
  v0[j] = max(0, v0[j - 1] + gapValue,
            compareCharacters(str1, 0, str2, j, matchValue, mismatchValue))

  currentMax = max(currentMax, v0[j])
}

for (i = 1; i < lengthOfStr1; i++) {
  v1[0] = max(0, v0[0] + gapValue, compareCharacters(str1, i, str2, 0, matchValue, mismatchValue))

  currentMax = max(currentMax, v1[0])

  for (j = 1; j < lengthOfStr2; j++) {
    v1[j] = max(0, v0[j] + gapValue, v1[j - 1] + gapValue,
            v0[j - 1] + compareCharacters(str1, i, str2, j, matchValue, mismatchValue))

    currentMax = max(currentMax, v1[j])
  }

  for (j = 0; j < lengthOfStr2; j++) {
    v0[j] = v1[j]
  }
}

# calculate the overallSimilarity between the strings 
overallSimilarity = currentMax / maxDistance #<- 0.4767 for COELACANTH vs PELICAN

其它你可能感兴趣的问题

上一篇深度学习中的建议和缺失数据下一篇创建特定领域的问答系统