introduction.html

Approximate String Matching -- An Introduction

Finding close matches to words in text, searching a database for entries, and homology (similarity) searches on DNA sequences are but a few of the applications of string matching, which extends to many instances of Pattern Recognition problems. The problem of approximate string matching can be formalized as the problem of finding in a given text of length n all text positions which match a pattern p of length m with up to k errors, where allowed errors are character deletions, insertions and substitution. Less formally, this can be viewed as the problem of transforming a part of the input text to obtain p using k insertion, deletion and substitution operations. The error level is defined to be a = k/m, for which typical values range from 0.1 to 0.3.
The fastest classical algorithm for low error levels is a filtration algorithm , and as such its performance degrades quickly as error levels are increased. The algorithm shown here improves upon previous techniques for finding potential matches, and improves the filter for biased texts such as natural languages . These new features allow the algorithm to continues to perform well even on moderate error levels, making it the fastest algorithm in the field for string matching in all "interesting" cases.