Smith-Waterman-Gotoh 算法 - 如何确定总体相似度百分比

数据挖掘 算法
2021-09-21 17:41:02

使用Smith-Waterman-Gotoh 算法,我想获得两个序列之间的总体相似性百分比。最好的方法是什么?

例如。比较字符串COELACANTHPELICAN 在此示例中,对齐得分为 4:

ELACAN
ELICAN

然后,我将如何确定两者之间的总体相似性百分比COELACANTHPELICAN以此为基础?

1个回答

我不知道如何像在Smith-Waterman 的维基百科页面上那样编写数学代数,所以我将使用伪代码。我在SmithWatermanGotoh java 代码中找到了 SimMetrics 的逻辑。

str1 = PELICAN
str2 = COELACANTH
matchValue = 1 #in the comparisons below, when characters are equal, assign this value
mismatchValue = -2 #in the comparisons below, when characters are not equal, assign this value
gapValue = -0.5 #the gap penalty used in smith-waterman 

# get the maxDistance which is the smallest number of characters between str1 and str2 multiplied by 
# the largest of matchValue and gapValue 
maxDistance = min(length(str1), length(str2)) x max(matchValue, gapValue);

# function to compare character at index aIndex of string a with character at index bIndex of string b 
function compareCharacters(a, aIndex, b, bIndex, matchValue, mismatchValue) {
  if a[aIndex] === b[bIndex] 
    return matchValue 
  else 
    return mismatchValue 
}

v0 = an array 
v1 = an array 

lengthOfStr1 = number of characters in str1 
lengthOfStr2 = number of characters in str2 

# do the smith waterman similarity measure (currentMax)
currentMax = v0[0] = max(0, gapValue, compareCharacters(str1, 0, str2, 0, matchValue, mismatchValue))

for (j = 1; j < lengthOfStr2; j++) {
  v0[j] = max(0, v0[j - 1] + gapValue,
            compareCharacters(str1, 0, str2, j, matchValue, mismatchValue))

  currentMax = max(currentMax, v0[j])
}

for (i = 1; i < lengthOfStr1; i++) {
  v1[0] = max(0, v0[0] + gapValue, compareCharacters(str1, i, str2, 0, matchValue, mismatchValue))

  currentMax = max(currentMax, v1[0])

  for (j = 1; j < lengthOfStr2; j++) {
    v1[j] = max(0, v0[j] + gapValue, v1[j - 1] + gapValue,
            v0[j - 1] + compareCharacters(str1, i, str2, j, matchValue, mismatchValue))

    currentMax = max(currentMax, v1[j])
  }

  for (j = 0; j < lengthOfStr2; j++) {
    v0[j] = v1[j]
  }
}

# calculate the overallSimilarity between the strings 
overallSimilarity = currentMax / maxDistance #<- 0.4767 for COELACANTH vs PELICAN