Markov Methods in Compound Decision Theory


 Context in pattern recognition

Markov versus dictionary look-up

More on natural languages


Raviv's algorithm

Hidden Markov Models



Context in pattern recognition

    In many pattern recognition systems there are obvious dependencies between the patterns to be recognized, and we can often use this supplemental contextual information in the identification process.

   For example, our visual perception is highly context-dependent, as is our comprehension of speech and of handwritten or printed text. In many cases, we first identify familiar patterns and then deduce the rest from the context. We are usually able to "parse" reasonably unclear handwriting: some characters we recognize, and in ambiguous cases we guess from the context with a high degree of certainty. Consider the classical "cat" vs. "the" example: the same written character is perceived as an "a" or an "h"; our decision depends solely on the context.


Our listening comprehension is also highly contextual; try to write down a list of unfamiliar geographic locations or last names dictated over the phone - the results are unlikely to be impressive. As a rule, we do not necessarily hear all the bits of information correctly; rather, we use contextual information and match what we have heard to what we expected to hear, recognizing the whole phrase first and only then breaking it into syllables.

     In many highly contextual recognition systems (natural language processing, speech recognition, etc.), humans perform significantly better than machines. It is often hard to analyze exactly how we do it, which makes such tasks extremely hard to automate.

     So, how do we incorporate contextual information into machine learning systems? There are two main approaches: dictionary look-up methods and Markov methods.

Markov versus dictionary look-up methods

    How dictionary look-up methods work is clear: we maintain a table of legal patterns; the patterns to be recognized are compared against those templates, and the best match wins. This is what we do in the "cat" vs. "the" case: both are legal words from our dictionary, the first pattern is closer to "cat", the second to "the", hence the decision.

Main advantage: it is simple to implement.

Disadvantages: it is only applicable to systems with a relatively small vocabulary; time complexity quickly becomes an issue.
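The trade-off above can be seen in a small sketch of the dictionary look-up idea; the dictionary, the per-character confidence scores, and the best_match helper are all hypothetical toy values, not part of any real system:

```python
# Dictionary look-up: score every legal word against the observed
# pattern and pick the best match.  The "pattern" is a list of
# per-position candidate confidences, as an isolated-character
# recognizer might produce (toy numbers for the "cat"/"the" case).

def best_match(pattern_scores, dictionary):
    """pattern_scores: one dict per position, mapping char -> confidence."""
    def word_score(word):
        if len(word) != len(pattern_scores):
            return 0.0
        score = 1.0
        for ch, scores in zip(word, pattern_scores):
            score *= scores.get(ch, 0.0)
        return score
    # Every dictionary entry is scored -- this linear scan is exactly
    # why look-up methods scale poorly with vocabulary size.
    return max(dictionary, key=word_score)

# The ambiguous middle character looks equally like "a" and "h":
scores = [{"c": 0.9, "t": 0.1}, {"a": 0.5, "h": 0.5}, {"t": 0.9, "e": 0.1}]
print(best_match(scores, ["cat", "the", "car"]))  # -> cat
```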

How Markov methods work is best seen from the case study example (see Raviv's algorithm for contextual character recognition). Generally no dictionary is maintained; statistics on pattern dependencies are used instead. The recognition process is modeled as a Markov chain. Its architecture, of course, differs from case to case, but the idea remains the same: statistical information on pattern dependencies is represented via state transition probabilities.

    There also exist various combinations of Markov methods and dictionary look-up methods.

   Hidden Markov Models

     (This section is incomplete, to be written as a separate page)

     In the pattern recognition literature you will often come across Hidden Markov Models (HMMs). There is an excellent tutorial on HMMs written by L. Rabiner (not for our course :))) ). It is downloadable from ; definitely read it if you are interested in HMMs or wish to learn more about Markov methods in pattern recognition.

        A Hidden Markov Model is a regular Markov chain with one significant difference: the actual state of the system is hidden from us; instead, we are given a sequence of observations, from which we are typically required to recover the underlying state of the system.
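As a concrete illustration (not from these notes), here is a minimal Viterbi decoder that recovers the most likely hidden state sequence for a toy two-state HMM; all state names, observations, and probabilities are invented numbers:

```python
# Viterbi decoding for a toy 2-state HMM: given a sequence of
# observations, recover the most probable hidden state sequence.
import math

states = ["Rain", "Dry"]
start = {"Rain": 0.6, "Dry": 0.4}
trans = {"Rain": {"Rain": 0.7, "Dry": 0.3},
         "Dry":  {"Rain": 0.3, "Dry": 0.7}}
emit  = {"Rain": {"umbrella": 0.9, "no_umbrella": 0.1},
         "Dry":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def viterbi(obs):
    # best[s] = (log-probability, path) of the best path ending in s
    best = {s: (math.log(start[s] * emit[s][obs[0]]), [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(
            (lp + math.log(trans[prev][s] * emit[s][o]), path + [s])
            for prev, (lp, path) in best.items())
            for s in states}
    return max(best.values())[1]

print(viterbi(["umbrella", "umbrella", "no_umbrella"]))
# -> ['Rain', 'Rain', 'Dry']
```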

     HMMs have many applications in bioinformatics: gene finders that perform sequence alignments on DNA sequences are based on HMMs. (You might also want to go to Kaleigh's page for readings on HMMs and gene finders; she has a great summary. Apparently, if you want to learn more about HMMs in bioinformatics, cs562 is the way to go.)

     HMMs are also extensively applied in speech recognition, computer vision and handwriting recognition systems.

More on natural language dependencies

      Natural languages in general are very non-random. Consider that, in English, for example:

   The four vowels A, E, I, O together with the four consonants N, R, S, T form 2/3 of the standard plaintext;

   Runs of four or more consecutive consonants are very rare (borschts, latchstring, ?);

   Certain combinations of vowels are also hard to come across (queueing is rather unique) ;

   The letter Q in English is almost always followed by a U (the only exceptions I can think of are proper names of foreign origin);

   The letter Z is almost always preceded by an A, I, U or another Z (randomized, puzzle, lazy).

   Given an extract from a meaningful text with missing letters, we can often fill in the gaps from the context, provided the number of missing letters is relatively small.

   To measure "degree of language non-randomness", computational linguists use the notion of entropy (first introduced by C.Shannon) and language redundancy.

The entropy HL of a natural language L is defined as

HL = lim (n → ∞) Hn / n

where Hn is the entropy of the nth order approximation to L, defined as

Hn = - Σ (over all n-grams) P(n-gram) * log2 P(n-gram)

The redundancy RL of a natural language L is defined as

RL = 1 - HL / log2 |alphabet|

      For a completely random system the ratio HL / log2 |alphabet| would be 1, so language redundancy represents "the fraction of excessive characters". For English, redundancy is approximately 0.75. This does not mean that we can remove three letters out of every four; it only means that English text can be compressed by about 75%, and it gives some indication of the strength of language dependencies.
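As a rough numerical illustration of these formulas, the sketch below computes the first-order entropy H1 from approximate English letter frequencies; the table is illustrative (rare letters are lumped together), and H1 only upper-bounds HL, since the true HL requires the limit over n-grams:

```python
# First-order letter entropy versus the maximal entropy log2(26).
# The frequency table is approximate and lumps rare letters into
# "other", so the numbers are only indicative.
import math

freqs = {"e": .127, "t": .091, "a": .082, "o": .075, "i": .070,
         "n": .067, "s": .063, "h": .061, "r": .060, "other": .304}

H1 = -sum(p * math.log2(p) for p in freqs.values())
Hmax = math.log2(26)  # entropy of a uniformly random 26-letter alphabet
print(f"H1 = {H1:.2f} bits, Hmax = {Hmax:.2f} bits")
```

Even this crude first-order estimate lands well below log2(26) ≈ 4.7 bits, showing that single-letter statistics alone already capture some of the language's non-randomness.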

      Suppose we are given a meaningful string and must predict the next letter. As a first strategy, consider choosing a letter from the alphabet at random. Would our guess be correct with probability 1/26? Only if all letters were equally likely to appear in the language, which is clearly not true, as seen from the examples above. We can do better if we know the frequencies with which letters occur in the language. Would our guess then be correct with probability P(a), the probability of observing a in nature? Not quite: we could do better still if we knew the bigram statistics and could compute P(a | b) as the ratio P(a, b) / P(b), where b is the previous letter. Is that exact? Again no; we could do better by considering trigrams. More information means more knowledge and better prediction ability.
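The uniform, unigram, and bigram strategies above can be compared on a tiny corpus; the corpus and the helper names below are purely illustrative:

```python
# Letter-guessing strategies estimated from a toy corpus:
# unigram picks the most frequent letter overall; bigram picks
# arg-max over a of P(a | prev) = count(prev, a) / count(prev).
from collections import Counter

corpus = "the cat sat on the mat the cat ran"
letters = [c for c in corpus if c != " "]

unigram = Counter(letters)
bigram = Counter(zip(letters, letters[1:]))

def guess_unigram():
    return unigram.most_common(1)[0][0]

def guess_bigram(prev):
    candidates = {a: n for (b, a), n in bigram.items() if b == prev}
    return max(candidates, key=candidates.get)

print(guess_unigram())    # most frequent letter overall -> t
print(guess_bigram("t"))  # most likely letter after "t"  -> h
```

With real text the bigram guesser benefits from exactly the dependencies listed above (Q followed by U, and so on).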

     Our model is called the kth-order language approximation, where k is the maximal length of a sequential group of letters considered. If we view the nth character identity as the state of the system at time n, our language model becomes a (k - 1)th order Markov source: the next state depends only on the (k - 1) most recently seen states and is independent of the states prior to that. When k = 2 (i.e. under the bigram model), we have a regular Markov chain: the next state (= character) depends only on the current state (= character); all prior history is irrelevant.

   So, how do we obtain n-gram frequency tables, and which approximations work best? Letter statistics are typically generated from the Encyclopaedia Britannica or the Bible; the longer the source, the better. Usually only 2nd and 3rd order approximations are considered, though it has been shown that each subsequent approximation is better than the previous one up to the 32nd order language approximation. (I don't know how that result was achieved; computing statistics on 32-grams seems at best problematic.)



 SmartHangman: run the applet in auto mode and see how much better (and "closer to human performance") it guesses the word when it uses statistical contextual information.





Wait for the applet to load and click on it to start. To begin a new game, click on the word in the lower left corner; right-click if you want to restart the game with the old word. If the automode option is checked, the game starts in automatic mode (= the program plays against itself); otherwise it starts in single mode (= you play against the program). To change the mode, the game must be restarted by clicking on the word. In automatic mode there are two possible strategies, depending on whether the uniform or the bigram model is selected: with the uniform model, Hangman opens the next letter at random; with the bigram model, Hangman adopts the 2nd-order approximation to the English language (= first-order Markov chain) and selects the next letter to open according to bigram probabilities.

Raviv's algorithm

Suppose we have a string of meaningful English text to be recognized. The string is read sequentially, so that by the time we get to the nth character we have observed the previous (n -1) characters.

Let li be the true identity of the ith character and let xi be the measurement vector for the ith character.

In the case of an isolated character, we decide on the character identity that maximizes P(l = k | x), the a posteriori probability of observing a particular character given the measurement vector; in the contextual case we are looking to maximize P(ln = k | x1, ..., xn).

How far should we go back in history? How many previous characters are relevant when deciding on the current character identity? Generally all are, but we might want to simplify the model by considering each time only the last k characters seen (together, of course, with the current character); this corresponds to the (k + 1)th order approximation to the natural language. If k = 1, our language model becomes a classical Markov source (the future depends on the past only through the present); if k > 1, we have a kth order Markov source.

To see the recognition process as a discrete Markov chain, assume that in each time unit a single character is read and that the state of the system at time n is the nth character identity. The decision on the nth character is made at time n:

ln = k, if P(ln = k | x1, ..., xn) = max (over i) P(ln = i | x1, ..., xn)

How do we compute P(ln = k | x1, ..., xn)? Typically we know P(x | l), the probability of a measurement vector given the character identity (as we did for isolated characters), and the transition probabilities P(ln | ln-k, ..., ln-1).

The reduction is quite simple: it is based only on the definition of conditional probability and the law of total probability.

(1) Given that event B has occurred, conditional probability of A happening is defined as

P(A | B) = P(A, B) / P(B)

where P(A, B), sometimes also denoted P(A ∩ B), is the probability of both events happening.

Now suppose we have 3 events A, B, C and we are interested in the conditional probability of A occurring given that B and C occurred. After a bit of consideration we can write:

P(A | B, C) = P(A, B | C) / P(B | C)


Since P(A | B, C) can be seen as P(A | B | C), intuitively the expression must hold, as we have simply added "given C" to both parts. Formally we have

P(A | B, C) = P(A, B, C) / P(B, C) = [P(A, B | C) * P(C)] / [P(B | C) * P(C)] = P(A, B | C) / P(B | C)


(2) Suppose we have a disjoint set of events {E1, ..., En} that spans the set of all possible outcomes; then

P(A) = P(A|E1) * P(E1) + ... + P(A|En)*P(En)

If event C comes into the picture, then similarly to conditional probability we have

P(A | C) = P(A |E1,C) * P(E1|C) + ... + P(A|En,C)*P(En | C)
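The conditioned form of (2) can be checked numerically; the joint distribution over binary events A, E, C below is an arbitrary toy example (the partition is E1 = {E=0}, E2 = {E=1}):

```python
# Numeric check of the conditioned law of total probability:
# P(A | C) = sum over the partition of P(A | Ei, C) * P(Ei | C).
from itertools import product

probs = [0.05, 0.10, 0.15, 0.20, 0.08, 0.12, 0.13, 0.17]
joint = dict(zip(product([0, 1], repeat=3), probs))  # keys: (a, e, c)

def P(pred):
    """Probability of the event described by the predicate."""
    return sum(p for outcome, p in joint.items() if pred(outcome))

pC = P(lambda o: o[2] == 1)
# Left-hand side: P(A=1 | C=1)
lhs = P(lambda o: o[0] == 1 and o[2] == 1) / pC
# Right-hand side: sum over E of P(A=1 | E, C=1) * P(E | C=1)
rhs = sum(
    P(lambda o, e=e: o[0] == 1 and o[1] == e and o[2] == 1)
    / P(lambda o, e=e: o[1] == e and o[2] == 1)
    * (P(lambda o, e=e: o[1] == e and o[2] == 1) / pC)
    for e in [0, 1])
print(abs(lhs - rhs) < 1e-12)  # -> True
```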


Back to P(ln = k | x1, ..., xn): how do we express it in terms of probabilities we know, or at least know how to compute?

First note that we can expand conditional probability, applying (1)

P(ln = k | x1, ..., xn) = P(ln = k, xn | x1, ..., xn-1) / P(xn | x1, ..., xn-1)

If you absolutely cannot see how this follows from the definition of conditional probability, expand P(ln = k | xn) as P(ln = k, xn) / P(xn), consider x1, ..., xn-1 to be our event C, and then apply (1).

Next we reduce the denominator. By the law of total probability (2), we have

P(xn | x1, ..., xn-1) = Σ (over all k) P(xn | ln = k, x1, ..., xn-1) * P(ln = k | x1, ..., xn-1)


Since the measurement vector depends only on the character measured and not on the previous measurements, P(xn | ln, x1, ..., xn-1) = P(xn | ln), and the previous expression becomes

P(xn | x1, ..., xn-1) = Σ (over all k) P(xn | ln = k) * P(ln = k | x1, ..., xn-1)


Finally, expanding the numerator gives:

P(ln = k, xn | x1, ..., xn-1) = P(ln = k | x1, ..., xn-1) * P(xn | ln = k)


Combining expressions for numerator and denominator, we get:

P(ln = k | x1, ..., xn) = [P(ln = k | x1, ..., xn-1) * P(xn | ln = k)] / [Σ (over all i) P(ln = i | x1, ..., xn-1) * P(xn | ln = i)]
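A minimal sketch of this recursive update, assuming a first-order Markov source over a toy two-character alphabet; the transition probabilities and the likelihoods P(x | l) are invented numbers, not statistics from any real text:

```python
# Raviv-style recursive update of P(l_n = k | x_1..x_n) for a
# first-order Markov source: predict with the transition model,
# multiply by the likelihood, normalize by the denominator sum.

chars = ["a", "h"]
# Toy bigram statistics P(l_n | l_{n-1}): "a" after "h" is common here.
trans = {"a": {"a": 0.1, "h": 0.9}, "h": {"a": 0.9, "h": 0.1}}
prior = {"a": 0.5, "h": 0.5}

def raviv_step(posterior_prev, likelihood):
    """One step: posterior_prev is P(l_{n-1} | x_1..x_{n-1}) or None,
    likelihood maps k -> P(x_n | l_n = k)."""
    if posterior_prev is None:
        predicted = prior  # base case: no context yet
    else:
        # Predict P(l_n = k | x_1..x_{n-1}) from the previous posterior.
        predicted = {k: sum(posterior_prev[j] * trans[j][k] for j in chars)
                     for k in chars}
    # Multiply by the likelihood and normalize (the denominator above).
    unnorm = {k: predicted[k] * likelihood[k] for k in chars}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

# First measurement slightly favours "h"; the second character is a
# perfectly ambiguous 50/50 shape, so context alone must decide it.
post = raviv_step(None, {"a": 0.4, "h": 0.6})
post = raviv_step(post, {"a": 0.5, "h": 0.5})
print(max(post, key=post.get))  # -> a  (context resolves the ambiguity)
```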



J. Raviv, "Decision making in Markov chains applied to the problem of pattern recognition"

G. T. Toussaint, "The use of context in pattern recognition"

various resources on the web (see links below)


HMMs applications (in speech processing and bioinformatics)


Letter statistics

Linguistics fun

Information theory


Submitted by Irina Guilman

File translated from TEX by TTH, version 3.13.
On 26 Sep 2002, 23:07.










