New

A New Verification Technique

In the original algorithm the whole pattern is checked for a match each time a match is found, which can be improved upon by instead trying to eliminate a potential match from candidacy by quickly determining that the match of a small piece is not in fact part of a complete match. In order for this improvement to be achieved, a stronger lemma must be used:

Stronger Lemma: If segm = Text[a...b] matches pat with k errors, and pat = p₁...p_j ( a concatenation of subpatterns), then segm includes a substring that matches at least one of the pi's, with floor(a_i k/A) errors, where the ai ≥ 0 are arbitrary and A = Σ^j_i=1 a_i .

Proof:

    The proof proceeds by construction. The first case is where k + 1 is a power of 2, where the lemma is used with A = 2, j = 2, a₁ = a₂ = 1. So the pattern is split in two (along with the number of errors), and the stronger lemma states that at least one half must match with floor(k/2) errors. The splitting is recursively continued until the resulting pieces are to be directly searched with no errors. This defines a tree, where the leaves represent the pattern pieces that are to be directly searched, , the internal nodes are larger subpatterns used for intermediate verification, and the root corresponds to the complete pattern. The search remains the same in this case, but the verification is more complex, with the (tree)parents of complete matches being checked rather than the root of the tree. After a complete match (0 errors) is found at a leaf node, the search is continued at the leaf's parents k = 1 , then doubles every time an approximate substring match is found within the allowed error interval, as the parent of a given node is approximately twice the size of the child that reported to the parent. This continues upwards through the tree until either the root is reached or the pattern is found to contain more than the allowed number of errors at a given level of the tree, and the solutions output by the algorithm are all searches that reach the root node of the tree and do not have more than the allowed number of errors at that level.

    The above construction is valid because the lemma holds at every level of the tree, and so any approximate match reported by the root must include an approximate substring match reported by one of its two children, and so each child is searched, This applies recursively to each of the subtrees rooted at the root's child nodes, and all of the subtrees below them. Figure 1 above shows an example of the formation of the tree using this algorithm and the pattern "aaabbbcccddd". If a text being searched is "xxxbbbxxxxxx", and this pattern is presented with the tree above, the leaf with labe "bbb" will be found as a complete match, and the search can continue on to its parent ("aaabbb") to determine whether an approximate substring match to the subpattern represented by that node is not found. This is in contrast to the original algorithm, which would find the leaf node "bbb" and attempt to prove a complete match (rather than disproving one at the soonest possible instant).

    In the case where k + 1 is not a power of two, the tree is constructed as balanced as possible to avoid the long linear searches inherent with large leaf nodes (created by the presence of small ones). If the pattern is broken up into a left child with i pieces and a right child with j pieces, where i + j = k, the left subtree is searched with floor( ik /(k+1)) errors, and the right subtree is searched with floor( jk / (k +1) ) errors, yielding an error tolerance of 0 when the leaf nodes are reached. The lemma again applies to each level of the node, and a search can only have continued to a node if a child of that node contained less than the maximum number of allowed errors for its level.

This new technique of verification yields a faster performance for the algorithm on medium error levels with its elimination of candidates, and remains as fast as the original on low error levels. Using this hierarchichal verification algorithm, the verification cost per piece is L² / s^floor(^m^/(^k^{+1) )} . Since L = m / (k + 1) ≈ 1 / a , and since there are k + 1 ≈ ma pieces to search, the total verification cost is m / (a s^{1 /}^s). In order to leave the total cost of the algorithm unaffected by the new verification method, we must have a verification cost of O( a ), which occurs for a<1/log_sm.

The improvement thus obtained can be seen in Figure 2 below, where on random text(left), the algorithm works well up to a < 1/2, whereas the previous method only works well up to a < 1/3. The hierarchical qualities of the new method carries with it an increased number of calculations when the error level is high, however, and its performance is much poorer at high error levels than the old algorithm. This is ecpected, however, since this verification method relies on some of the characters being found in the text(of which there is a high probability), and then verifying that these characters are not part of a match.

Relative performance of the old and new algorithms

Figure 2. The old (dashed line) versus the new (solid line) verification techniques. We use m = 30 and show random (left) and English text (right).