Table of Contents
hide
1. Introduction to Minimum Edit Distance:
- Minimum edit distance is a measure of string similarity. It quantifies how similar two strings are by calculating the minimum number of single-character edits required to change one word into the other.
- Applications:
- Spell Correction: Identifying the closest correct word to a misspelled input (e.g., “graffe” “graph,” “grail,” or “giraffe”).
- Computational Biology: Aligning sequences of nucleotides (A, C, G, T) to understand similarities and differences between biological samples.
- Machine Translation: Evaluating the quality of machine-translated text by comparing it to human-generated reference translations.
- Information Extraction: Identifying variations of named entities (e.g., “IBM Inc.” vs. “IBM”).
- Speech Recognition: Assessing the accuracy of transcribed speech by comparing it to the actual spoken words.
2. Definition of Edit Distance:
- The minimum edit distance between two strings is the minimum count of editing operations needed to transform one string into the other.
- Standard Editing Operations:
- Insertion: Adding a character.
- Deletion: Removing a character.
- Substitution: Replacing one character with another.
- More complex operations (transposition, long-distance movements) are generally avoided for simplicity.
- Example: Transforming “intention” to “execution” requires 5 edits (assuming each operation costs 1): delete ‘i’, substitute ‘n’ with ‘e’, substitute ‘t’ with ‘x’, insert ‘c’, substitute ‘n’ with ‘u’.
- Levenshtein Distance: A specific type of edit distance where insertion and deletion have a cost of 1, and substitution has a cost of 2. In the “intention” to “execution” example with Levenshtein distance, the cost would be 8 (1 + 2 + 2 + 1 + 2).
3. Computing Minimum Edit Distance using Dynamic Programming:
- The standard algorithm to compute minimum edit distance is dynamic programming.
- Dynamic Programming Approach: A tabular method that solves the problem by breaking it down into smaller, overlapping subproblems and storing their solutions to avoid redundant computations.
- Distance Matrix D(i, j): Defined as the edit distance between the first i characters of string X (length n) and the first j characters of string Y (length m).
- The edit distance between the entire strings X and Y is given by D(n, m).
- Initialization:
- The distance between the first i characters of X and an empty string is i (cost of i deletions).
- The distance between an empty string and the first j characters of Y is j (cost of j insertions).
- Recurrence Relation (for Levenshtein Distance):
D(i, j) = min { D(i-1, j) + 1 (Deletion) D(i, j-1) + 1 (Insertion) D(i-1, j-1) + cost (Substitution, cost = 0 if X[i] == Y[j] else 2) }
- The distance D(i, j) is the minimum cost to reach the state (i, j) from the three possible preceding states (deletion, insertion, substitution/match).
- The distance matrix is filled iteratively, and the final value at D(n, m) represents the minimum edit distance.
4. Alignment using Backtrace:
- Knowing the edit distance is often insufficient; we need the alignment between the two strings to understand which characters correspond.
- Backtrace: A pointer stored in each cell of the distance matrix that records the cell from which the minimum cost to reach the current cell was derived.
- By starting at the final cell D(n, m) and following the backtrace pointers, we can reconstruct the sequence of operations (insertions, deletions, substitutions, matches) that resulted in the minimum edit distance.
- The path traced back reveals how each character in string X aligns with a character (or a gap representing insertion/deletion) in string Y.
- Backtrace Construction: When computing D(i, j), if the minimum value comes from:
D(i-1, j)
(deletion), point from (i, j) to (i-1, j).D(i, j-1)
(insertion), point from (i, j) to (i, j-1).D(i-1, j-1)
(substitution/match), point from (i, j) to (i-1, j-1).
- Performance:
- Time Complexity: O(n*m) due to filling the n x m distance matrix.
- Space Complexity: O(n*m) to store the distance matrix and backtrace pointers.
- Backtrace Path Length: At most O(n + m) in the worst case (all insertions and deletions).
5. Weighted Edit Distance:
- In many applications, the cost of different editing operations (insertion, deletion, substitution of specific characters) might not be uniform.
- Weighted Edit Distance: Modifies the standard algorithm to incorporate weights for each operation based on the likelihood or cost of that specific edit.
- Applications:
- Spell Correction: Using a confusion matrix of common spelling errors to assign lower costs to likely mistakes (e.g., confusing vowels).
- Computational Biology: Assigning different costs to mismatches or gaps based on biological plausibility.
- Algorithm Modification: The recurrence relation is adjusted to include the specific costs for deletion of character X[i], insertion of character Y[j], and substitution of X[i] with Y[j]. These costs are typically looked up from weight tables or matrices.
- Initialization and Termination: Similar to the standard algorithm, but using the specific weighted costs.
- Dynamic Programming Name Origin: Richard Bellman, the inventor of dynamic programming, coined the term partly as a “public relations move” to make the algorithm sound more exciting.
6. Advanced Variants in Computational Biology:
- Computational biology often employs specialized variants of edit distance for aligning nucleotide or protein sequences. These variants may include different scoring schemes for matches, mismatches, and gaps, as well as affine gap penalties (different costs for opening and extending a gap).
- The core dynamic programming approach remains similar, but the recurrence relations and cost functions are tailored to the specific biological context.
- The goal is to produce an optimal alignment that reflects the evolutionary relationships or functional similarities between the sequences.
References
- CS 124: From Languages to Information, By Dan Jurafsky.
- Basic Text Processing by Dan Jurafsky 2025
CITE THIS AS:
“CS 124: From Languages to Information – Edit Distance” From NotePub.io – Publish & Share Note! From Languages to Information – Edit Distance