Author: Fujimoto Seiji <fujimoto at ceptord.net>
Date: 2015-01-24 (Last Update: 2016-09-02)
The key concept of fastcomp is edit pattens. As an example, consider the statement:
LevenshteinDistance(str1, str2) = 1. This statement can be rephrased as "We can transform str1 to str2 with one insertion, deletion or substitution." This means that we can verify the original statement by checking if str1 can be transformed into str2 by applying any of these edit patterns (insertion, deletion, or substitution).
This is exactly how fastcomp computes edit distance. Internally, fastcomp holds the computational models of possible edit patterns, and try to apply each pattern to the inputs. The diagram below shows how it measures the Levenshtein distance between 'kitten' and 'bitten' when the maximum distance is 1:
So how does it make the computation fast? The crucial point is that we can reject the most of edit patterns without actually applying them. Back to the diagram above. If we focus on the fact that the input words 'kitten' and 'bitten' have the same string length, it becomes obvious that a substitution is the only possible option in this context at all, as applying an insertion (or deletion) will change the length of a string.
The same argument can be applied to the other cases as well. For the sake of brevity, let us denote the difference in length between the input strings by w and the maximum Levenshtein distance by md. Below is the matrix table that shows all the possible edit patterns when md = 1:
|w = 0||one substitution|
|w = 1||one insertion|
|w > 1||NA|
(Note that insertion and deletion are essentially interchangeable in this context. If str1 can be transformed into str2 by one insertion, it implies that str2 can be transformed into str1 by one deletion. Therefore, checking only insertion does suffice here.)
We can easily expand the table above to the case when md = 2:
|w = 0||(two substitutions) or (one insertion + one deletion)|
|w = 1||(one insertion + one replacement)|
|w = 2||two insertions|
|w > 2||NA|
Thus, by observing the length of input strings carefully, we can reduce the task of computing the Levenshtein distance into testing very limited number of edit patterns.