Optimize

Optimizing the Partition

    The choice available when partitioning the pattern into k + 1 pieces can be used to minimize the expected number of verifications by using the occurrence frequency of letters in the alphabet under consideration(and this is not limited to the English alphabet). Once the probability of an occurrence of each letter in the alphabet is known, the probability of the occurrence of a substring can be computed by simply taking the product of the occurrence probabilities of the letters that make up the substring. Thus if we wish to know the occurrence probability of the substring "yes", and we know the probability of 'y' is, 0.1, 'e' is 0.3, and 's' is 0.3, then the probability of that substring being verified is (0.1)X(0.3)X(0.3) = 0.009. This probability of a sequence can be used in inbuilding the tree, as a partition that minimizes the sum of the probabilities of the pieces is desired, since this is directly related to the number of verifications that will be carried out.

This leads to the folllowing dynammic programming algorithm:

Let R[i,j] = _r=iΠ ^j^-1Pr (pat[r]) for every 0 ≤ i ≤ j ≤ m. It can be observed that R can be computed in O(m²) time since R[i, j+1] = R[i,j] X Pr(pat[j]).

Using R we build two matrices,

P[i,k] = sum of the probabilities of the pieces in the best partition for pat[i...(m-1)] with k errors.
C[i,k] = where the next piece must start in order to obtain P[i,k].

    This requires O(m²) space, and the following algorithm computes the optimal partition in O(m²) time. The final probability of verification is P[0,k], and the pieces start at L₀ = 0, L₁ = C[L₀ , k ], L₂ = C[L₁, k-1], ..., L_k = C[L_k-1, 1].

for( i = 0 ; i < m ; increment i by 1 each iteration)
    P[i,0] = R[i,m];
    C[i,0] = m;
for(r=1; r ≤ k ; increment r by 1 each iteration)
    for(i=0; i < m - r; increment i by 1 each iteration)
        P[i,r] = minimum((R[i,j] + P[j, r - 1])) for all j in the interval [(i+1), ..., (m-r)]
        C[i,r] = j that minimizes the expression above(that is, it minimizes P[i,r])

    The speedup obtained by implementing this modification to the partition is very slight, and sometimes even counterproductive, since it only takes the probability of verifying a letter sequence into consideration. The search time of the extended algorithm degrades as the length of the shortest piece is reduced (as in an uneve n partition), and so a cost model that is closer to the real search cost is considered by optimizing 1/(minimum length) + Pr(verifying) X m².

    Figure 3 (above) shows the result of running 100 experiments on English text using the optimized and normal splitting methods, each using the original verification method and not the hierarchical one. The small degree of improvement is visible in both runs, and is especially noticeable in the medium error range (upper right on both graphs).