Given a string S = GATCAC
and another string T = GTCACC
, what is the minimum number of additions/deletions/substitutions required to convert S
to T
?
Build up a bank of solutions based on substrings that eventually increase to both strings. (Dynamic Programming— and a video here.)
G | A | T | C | A | C | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
G | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
T | 2 | 1 | 1 | 1 | 2 | 3 | 4 |
C | 3 | 2 | 2 | 2 | 1 | 2 | 3 |
A | 4 | 3 | 2 | 3 | 2 | 1 | 2 |
C | 5 | 4 | 3 | 3 | 3 | 2 | 1 |
C | 6 | 5 | 4 | 4 | 3 | 3 | 2 |
It's very mechanical:
- Build a table
E
of size |S+1| × |T+1| - Fill the first row with the numbers 1 to |S|
- For every row
i
(skipping the first row):- For every column
j
:- If
T[i] = S[j]
— that is, if the characters are the same:- Set
E[i][j]
=E[i-1][j-1]
- Set
- Otherwise, set it to the minimum of either:
E[i-1][j-1]
(one cell up and left)E[i][j-1]
(one cell left)E[i-1][j]
(one cell up)
- If
- For every column
- Return the bottom-right-most value as the edit distance
The key technique is:
A | B |
---|---|
C | = min(A,B,C) |
Otherwise, just A if the character is the same.
def edit_distance(S,T):
E = [ [None for char in range(len(S)+1)] ] * (len(T) + 1)
for i in range(len(S)):
E[0][i] = i
for i in range(1,len(S)):
for j in range(len(T)):
if S[i] = T[j]:
E[i][j] = E[i-1][j-1]
else:
a = E[i-1][j-1]
b = E[i][j-1]
c = E[i-1][j]
E[i][j] = min(a,b,c)
return E[len(S)][len(T)]