In the Artificial Intelligence Technology, mostly in the field of Natural Language Processing (NLP), Computer Linguistics and in other fields of Computer Science, The Edit Distance Method is a way of quantifying how dissimilar two text far from each other in char comparison by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in NLP where automatic spelling corrections can be determined. Determines candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question This method also used in In bioinformatics to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
As described in Wikipedia, different types of edit distance methods allow different sets of string operations. For instance:
- The Levenshtein distance allows deletion, insertion and substitution.
- The Longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.
- The Hamming distance allows only substitution, hence, it only applies to strings of the same length.
- The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters.
- The Jaro distance allows only transposition.
In The minimum edit distance method, for example between ‘intention’ word and ‘execution’ word can be visualized using their alignment and this alignment is a correspondence between substrings of the two sequences. In this method each operation has cost of 1 and in Levenshtein Distance method substitutions cost 2.
Mostly examples in C++ about minimum edit distance are written for char arrays, ASCII strings. This example is good to understand Minimum Edit Distance method.
In this post we modified this Minimum Edit Distance method to Unicode Strings for the C++ Builder.
Basically, we use two unicode strings (source and dest) in this method, and for these two string inputs,
• We define T[i][j] as the edit distance matrix between source[i] and dest[j] chars
• Here we compare all characters of source string and all characters of dest string
• The edit distance between source and dest is
that we retun as an output of this function.
Here is the full Example of Minimum Edit Distance method,
int MinEditDistance(UnicodeString source, UnicodeString dest)
int sol1, sol2, sol3;
int i, j;
for ( i = 0; i <= source.Length(); i++ ) T[i] = i;
for ( j = 0; j <= dest.Length(); j++ ) T[j] = j;
for ( i = 1; i <= source.Length(); i++ )
for ( j = 1; j <= dest.Length(); j++ )
if ( source[1+i-1] == dest[1+j-1] )
sol1 = T[i-1][j-1];
T[i][j] = sol1;
sol1 = T[i-1][j];
sol2 = T[i][j-1];
sol3 = T[i-1][j-1];
sol1 = sol1 + 1;
sol2 = sol2 + 1;
sol3 = sol3 + 1;
if ( sol1 <= sol2 && sol1 <= sol3 ) T[i][j] = sol1;
if ( sol2 <= sol1 && sol2 <= sol3 ) T[i][j] = sol2;
if ( sol3 <= sol1 && sol3 <= sol2 ) T[i][j] = sol3;
We can use this function as given below,
distance = MinEditDistance(L"intention" , L"execution" );
distance = MinEditDistance(L"intention" , L"intentoin" );
This distance will give us how far both words, if they match distance is equal to 0.