Approximate String Matching
-- An Introduction
Finding close matches to words in text,
searching a database for entries, and homology (similarity) searches on DNA
sequences are but a few of the applications of string matching, which extends
to many instances of Pattern Recognition problems. The problem of
approximate string matching can be formalized as the problem of finding in
a given text of length n all text positions which match a pattern
p of length m with up to k errors, where allowed errors
are character deletions, insertions and substitution. Less formally, this
can be viewed as the problem of transforming a part of the input text to obtain
p using k insertion, deletion and substitution operations.
The error level
is defined to be a = k/m, for
which typical values range from 0.1 to 0.3.
The fastest classical algorithm for low error levels is a
, and as such its performance degrades quickly as error levels are increased.
The algorithm shown here improves upon previous techniques for finding potential
matches, and improves the filter for biased texts such as natural languages
. These new features allow the algorithm to continues to perform well even
on moderate error levels, making it the fastest algorithm in the field for
string matching in all "interesting" cases.