Markov Methods in Compound Decision Theory

 

 
 Context in pattern recognition

Markov versus dictionary look-up

More on natural languages

Applet

Raviv's algorithm

Hidden Markov Models

References

Links

Context in pattern recognition

    In many pattern recognition systems there are obvious dependencies between the patterns to be recognized. We often use this supplemental contextual information in the identification process.

   For example, our visual perception is highly context-dependent, as is our comprehension of speech and of handwritten or printed text. In many cases we first identify familiar patterns and then deduce the rest from the context. We are usually able to "parse" reasonably unclear handwriting: some characters we recognize outright, and in ambiguous cases we guess from the context with a high degree of certainty. Consider the classical "cat" vs. "the" example: the same written character is perceived as an "a" or an "h"; our decision depends solely on the context.


Our listening comprehension is also highly contextual; try to write down a list of unfamiliar geographic locations or last names dictated over the phone - the results are likely to be unimpressive. As a rule, we do not necessarily hear all the bits of information correctly; rather, we use contextual information and match what we have heard to what we expected to hear, recognizing the whole phrase first and only then breaking it into syllables.

     In many highly contextual recognition tasks (natural language processing, speech recognition, etc.), humans perform significantly better than machines. It is often hard to analyze exactly how we do it, which makes such tasks extremely hard to automate.

     So, how do we incorporate contextual information into machine learning systems? There are two main approaches: dictionary look-up methods and Markov methods.

Markov versus dictionary look-up methods

    How dictionary look-up methods work is clear: we maintain a table of legal patterns, the patterns to be recognized are compared against those templates, and the best match wins. This is what we do in the "cat" vs. "the" case: both are legal words from our dictionary, the first pattern is closer to "cat" and the second to "the", hence the decision.

Main advantage: it is simple to implement.

Disadvantages: only applicable to systems with a relatively small vocabulary; time complexity quickly becomes an issue.
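
A toy sketch of the look-up idea, with a hypothetical three-word dictionary and a per-character mismatch count standing in for a real distance between measurement vectors:

# Toy dictionary look-up: compare a noisy pattern against every legal word
# and pick the closest template. The "distance" here is just the number of
# mismatched characters; a real recognizer would compare feature vectors.

DICTIONARY = ["cat", "the", "hat"]          # hypothetical vocabulary

def distance(pattern, word):
    """Count positions where the observed pattern disagrees with the template."""
    if len(pattern) != len(word):
        return float("inf")
    return sum(p != w for p, w in zip(pattern, word))

def recognize(pattern):
    """Return the dictionary word closest to the observed pattern."""
    return min(DICTIONARY, key=lambda word: distance(pattern, word))

print(recognize("cas"))   # -> 'cat'
print(recognize("tha"))   # -> 'the'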

How Markov methods work is best seen from the case study example (see Raviv's algorithm for contextual character recognition). Generally no dictionary is maintained; statistics on pattern dependencies are used instead. The recognition process is modeled as a Markov chain. Its architecture is, of course, different in each case, but the idea remains the same: statistical information on pattern dependencies is represented via state transition probabilities.

    There also exist various combinations of Markov methods and dictionary look-up methods.

   Hidden Markov Models

     (This section is incomplete, to be written as a separate page)

     In the pattern recognition literature you will often come across Hidden Markov Models (HMMs). There is an excellent tutorial on HMMs written by L. Rabiner (not for our course :)) ); it is available online, and you should definitely read it if you are interested in HMMs or wish to learn more about Markov methods in pattern recognition.

        A Hidden Markov Model is a regular Markov chain with one significant difference: the actual state of the system is hidden from us. Instead, we are given a sequence of observations, from which we are typically required to recover the underlying state of the system.
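
As a minimal illustration of that recovery problem, here is a sketch of Viterbi decoding on a tiny invented HMM; the two states, three observation symbols and all probabilities below are made up for the example.

# Viterbi decoding on a tiny hypothetical HMM: recover the most likely hidden
# state sequence from a sequence of observations. All numbers are invented.

states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}                          # P(first state)
trans = {"A": {"A": 0.7, "B": 0.3},                   # P(next state | state)
         "B": {"A": 0.4, "B": 0.6}}
emit  = {"A": {"x": 0.5, "y": 0.4, "z": 0.1},         # P(observation | state)
         "B": {"x": 0.1, "y": 0.3, "z": 0.6}}

def viterbi(observations):
    # best[s] = probability of the best path ending in state s; path[s] = that path
    best = {s: start[s] * emit[s][observations[0]] for s in states}
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_path = {}, {}
        for s in states:
            # choose the predecessor that maximizes the path probability
            prev = max(states, key=lambda p: best[p] * trans[p][s])
            new_best[s] = best[prev] * trans[prev][s] * emit[s][obs]
            new_path[s] = path[prev] + [s]
        best, path = new_best, new_path
    final = max(states, key=lambda s: best[s])
    return path[final], best[final]

print(viterbi(["x", "y", "z", "z"]))   # most likely hidden states and their probability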

     HMMs have many applications in bioinformatics: gene finders that analyze DNA sequences, as well as tools that perform sequence alignments, are often based on HMMs. Also, you might want to go to Kaleigh's page for readings on HMMs and gene finders - she has a great summary; apparently, if you want to learn more about HMMs in bioinformatics, cs562 is the way to go.

     HMMs are also extensively applied in speech recognition, computer vision and handwriting recognition systems.

More on natural language dependencies

      Natural languages in general are very non-random. Consider that, in English, for example:

   The four vowels A, E, I, O together with the four consonants N, R, S, T form roughly 2/3 of standard English plaintext;

   Runs of four or more consecutive consonants are very rare (borschts, latchstring, ?);

   Certain combinations of vowels are also hard to come across (queueing is rather unique) ;

   The letter Q in English is almost always followed by a U (the only exceptions I can think of are proper names of foreign origin);

   The letter Z is almost always preceded by an A, I, U or another Z (randomized, puzzle, lazy).

   Given an extract from a meaningful text with some letters missing, we can often fill in the gaps from the context, provided that the number of missing letters is relatively small.

   To measure the "degree of non-randomness" of a language, computational linguists use the notions of entropy (first introduced by C. Shannon) and language redundancy.

The entropy HL of a natural language L is defined as

HL = lim_{n → ∞} Hn / n

where Hn is the entropy of the nth-order approximation to L, defined as

Hn = - ∑_{all n-grams} P(n-gram) * log2 P(n-gram)

The language redundancy RL of a natural language L is defined as

RL = 1 - HL / log2 |alphabet|

      For a completely random source the ratio HL / log2 |alphabet| would be 1, so the redundancy would be 0; language redundancy thus represents "the fraction of excessive characters". For English the redundancy is approximately 0.75. This does not mean that we can simply remove three letters out of every four; it only means that English text can be compressed by about 75%, and it gives some indication of the strength of language dependencies.
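
To make the definitions concrete, here is a small sketch of how the first-order entropy H1 and a crude redundancy estimate could be computed from raw letter counts; the sample text and the 26-letter alphabet size are assumptions, and a serious estimate would need a much larger corpus and higher-order n-grams.

import math
from collections import Counter

def first_order_entropy(text, alphabet_size=26):
    """H1 = -sum P(letter) * log2 P(letter), estimated from raw letter counts."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    h1 = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Crude redundancy estimate, using H1 in place of the true limit HL.
    redundancy = 1 - h1 / math.log2(alphabet_size)
    return h1, redundancy

sample = "the quick brown fox jumps over the lazy dog " * 50   # toy corpus
h1, r = first_order_entropy(sample)
print(f"H1 ~ {h1:.2f} bits/letter, crude redundancy ~ {r:.2f}")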

      Suppose we are given a meaningful string and are to predict the next letter. For our first strategy, consider choosing a letter from the alphabet uniformly at random; our guess is then correct with probability 1/26. That is the best we can do only if all letters are equally likely to appear in the language, which is clearly not true, as seen from the examples above. We can do better if we know the frequencies with which letters occur in the language. Now, would our guess be correct with probability P(a), the probability of observing the letter a in the language? Not quite: we could do better still if we knew the bigram statistics and could compute P(a | b) = P(a, b) / P(b), where b is the previous letter. Is that exact? Again no - we could do better yet by considering trigrams. More information means more knowledge and better prediction ability.

     Our model is called the kth-order language approximation, where k is the maximal length of a sequential group of letters considered. If we view the nth character identity as the state of the system at time n, our language model becomes a (k-1)th-order Markov source: the next state depends only on the most recently seen (k-1) states and is independent of the states prior to that. When k = 2 (i.e., under the bigram model), we have a regular Markov chain: the next state (= character) depends only on the current state (= character); all prior history is irrelevant.

   So, how do we obtain the n-gram frequency tables, and which approximations work best? Letter statistics are typically generated from the Encyclopedia Britannica or the Bible; the longer the source, the better. Usually only 2nd- and 3rd-order approximations are considered, although it has been shown that each subsequent approximation is better than the previous one, up to the 32nd-order language approximation. (Though I don't know how that result was achieved; computing statistics on 32-grams seems problematic at best.)
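
A sketch of the bigram (2nd-order approximation) strategy described above - estimating bigram frequencies from a corpus and using them to predict the next letter; the training text below is just a stand-in for a real corpus such as those mentioned above.

from collections import Counter, defaultdict

# Estimate bigram statistics from a (toy) corpus and use them to predict the
# next letter given the previous one.

corpus = "the theory of the markov chain is the thing " * 20   # hypothetical corpus
letters = [c for c in corpus.lower() if c.isalpha()]

by_prev = defaultdict(Counter)
for prev, nxt in zip(letters, letters[1:]):
    by_prev[prev][nxt] += 1

def predict_next(prev_letter):
    """Return the most likely next letter and P(next | prev) under the bigram model."""
    following = by_prev[prev_letter]
    nxt, count = following.most_common(1)[0]
    return nxt, count / sum(following.values())

print(predict_next("t"))   # with this toy corpus, 'h' with high probability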

  

Applet

 SmartHangman: run the applet in auto mode and see how much better (and "closer to human performance") it can guess the word when using statistical contextual information.

  alt="Your browser understands the <APPLET> tag but isn't running the applet, for some reason." Your browser is completely ignoring the <APPLET> tag!                      

 

 

 

Wait for the applet to load and click on it to start. To begin a new game, click on the word in the lower left corner; right-click if you want to restart the game with the old word. If the automode option is checked, the game is started in automatic mode (= the program plays against itself); otherwise it is started in single mode (= you play against the program). In order to change the mode, the game must be restarted by clicking on the word. In automatic mode there are two possible strategies, depending on whether the bigram or the uniform model is selected: with the uniform model, Hangman opens the next letter at random; with the bigram model, Hangman adopts the "2nd-order approximation to the English language" (= first-order Markov chain) and selects the next letter to be opened according to bigram probabilities.

Raviv's algorithm

Suppose we have a string of meaningful English text to be recognized. The string is read sequentially, so that by the time we get to the nth character we have observed the previous (n -1) characters.

Let li be the true identity of the ith character and let xi be the measurement vector for the ith character.

In the case of an isolated character, we make the decision on the character identity that maximizes P(l = k | x), the a posteriori probability of a particular character given the measurement vector; in the contextual case we are looking to maximize P(ln = k | x1, ... , xn).

How far should we go back in history? How many previous characters are relevant when deciding on the current character identity? Generally all of them are, but we may want to simplify the model by considering each time only the last m characters seen (together, of course, with the current character); this corresponds to the (m+1)th-order approximation to the natural language discussed above. If m = 1, our language model becomes a classical Markov source (the future depends on the past only through the present); if m > 1, we have an mth-order Markov source.

To see the recognition process as a discrete Markov chain, assume that in each time unit a single character is read and that the state of the system at time n is the nth character identity. The decision on the nth character is made at time n:

Decide ln = k if P(ln = k | x1, ... , xn) = max_i P(ln = i | x1, ... , xn)

How do we compute P(ln = k | x1, ... , xn)? Typically we know the probability of observing a particular measurement vector given the character, P(x | l) (as we did for isolated characters), and the transition probabilities P(ln | ln-m, ... , ln-1).

The reduction is quite simple; all it is based on is the definition of conditional probability and the law of total probability.

(1) Given that event B has occurred, the conditional probability of A happening is defined as

P(A | B) = P(A, B) / P(B)

where P(A, B), sometimes also denoted P(A ∩ B), is the probability of both events happening.

Now suppose we have three events A, B, C, and we are interested in the conditional probability of A occurring given that B and C have occurred. After a bit of consideration we can write:

P(A | B, C) = P(A, B | C) / P(B | C)    (1)

Since P(A | B, C) can be seen as P((A | B) | C), intuitively the expression must hold, as we have simply added "given C" to both parts. Formally, we have

P(A | B, C) = P(A, B, C) / P(B, C) = [P(A, B | C) * P(C)] / [P(B | C) * P(C)] = P(A, B | C) / P(B | C)

(2) Suppose we have a set of disjoint events {E1, ... , En} that spans the set of all possible outcomes; then

P(A) = P(A | E1) * P(E1) + ... + P(A | En) * P(En)

If an event C comes into the picture then, conditioning everything on C as above, we similarly have

P(A | C) = P(A | E1, C) * P(E1 | C) + ... + P(A | En, C) * P(En | C)    (2)
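
Both identities are easy to sanity-check numerically; the sketch below verifies (1) and (2) on an arbitrary made-up joint distribution over three binary events A, B, C (with E1, E2 taken to be the two possible values of B).

from itertools import product

# Hypothetical joint distribution P(A, B, C) over three binary events,
# used only to verify identities (1) and (2) numerically.
joint = {(a, b, c): p for (a, b, c), p in zip(
    product([0, 1], repeat=3),
    [0.02, 0.08, 0.10, 0.05, 0.20, 0.15, 0.25, 0.15])}

def prob(**fixed):
    """Marginal/joint probability of the specified event values."""
    return sum(p for (a, b, c), p in joint.items()
               if all({"a": a, "b": b, "c": c}[k] == v for k, v in fixed.items()))

# Identity (1): P(A | B, C) = P(A, B | C) / P(B | C)
lhs = prob(a=1, b=1, c=1) / prob(b=1, c=1)
rhs = (prob(a=1, b=1, c=1) / prob(c=1)) / (prob(b=1, c=1) / prob(c=1))
print(abs(lhs - rhs) < 1e-12)   # True

# Identity (2): P(A | C) = sum_k P(A | E_k, C) * P(E_k | C), with E_k = {B=0, B=1}
lhs = prob(a=1, c=1) / prob(c=1)
rhs = sum((prob(a=1, b=b, c=1) / prob(b=b, c=1)) * (prob(b=b, c=1) / prob(c=1))
          for b in [0, 1])
print(abs(lhs - rhs) < 1e-12)   # True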

Back to P(ln = k | x1, ... , xn): how do we express it in terms of probabilities we know, or at least know how to compute?

First note that we can expand the conditional probability by applying (1):

P(ln = k | x1, ... , xn) = P(ln = k, xn | x1, ... , xn-1) / P(xn | x1, ... , xn-1)    (3)

If you absolutely cannot see how this follows from the definition of conditional probability, expand P(ln = k | xn) as P(ln = k, xn) / P(xn), consider x1, ... , xn-1 to be our event C, and then apply (1).

Next we reduce the denominator. By the law of total probability (2), we have

P(xn | x1, ... , xn-1) = ∑_{all k} P(xn | ln = k, x1, ... , xn-1) * P(ln = k | x1, ... , xn-1)    (4)
Since the measurement vector depends only on the character being measured and not on the previous measurements, P(xn | ln, x1, ... , xn-1) = P(xn | ln), and the previous expression becomes

P(xn | x1, ... , xn-1) = ∑_{all k} P(xn | ln = k) * P(ln = k | x1, ... , xn-1)    (5)

Finally, expanding the numerator and using the same independence assumption gives:

P(ln = k, xn | x1, ... , xn-1) = P(ln = k | x1, ... , xn-1) * P(xn | ln = k)    (6)

Combining the expressions for the numerator and the denominator, we get:

P(ln = k | x1, ... , xn) = [P(ln = k | x1, ... , xn-1) * P(xn | ln = k)] / ∑_i [P(ln = i | x1, ... , xn-1) * P(xn | ln = i)]    (7)
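
To make the recursion concrete, here is a hedged sketch of how expression (7) could be applied sequentially for a first-order Markov source. The two-letter alphabet, the transition probabilities and the likelihoods P(x | l) are invented for illustration, and the prediction step P(ln = k | x1, ... , xn-1) = ∑_j P(ln = k | ln-1 = j) * P(ln-1 = j | x1, ... , xn-1) uses the first-order Markov assumption rather than following from the derivation above.

# A sketch of the contextual decision rule (7) for a first-order Markov source.
# The alphabet, transition probabilities and likelihoods P(x | l) are made up;
# in a real recognizer P(x | l) would come from the isolated-character classifier.

alphabet = ["a", "h"]

# P(ln = k | ln-1 = j): transition probabilities of the language model (invented)
transition = {"a": {"a": 0.2, "h": 0.8},
              "h": {"a": 0.7, "h": 0.3}}

prior = {"a": 0.5, "h": 0.5}    # P(l1 = k) for the first character

def contextual_decisions(likelihood_sequence):
    """likelihood_sequence[n] is a dict k -> P(xn | ln = k) for each observed character.
    Returns the decisions made by maximizing P(ln = k | x1, ... , xn)."""
    decisions = []
    # predicted[k] = P(ln = k | x1, ... , xn-1); for n = 1 this is just the prior
    predicted = dict(prior)
    for likelihood in likelihood_sequence:
        # Expression (7): posterior is proportional to prediction times likelihood
        unnormalized = {k: predicted[k] * likelihood[k] for k in alphabet}
        total = sum(unnormalized.values())
        posterior = {k: p / total for k, p in unnormalized.items()}
        decisions.append(max(alphabet, key=lambda k: posterior[k]))
        # Prediction step for the next character (first-order Markov assumption)
        predicted = {k: sum(transition[j][k] * posterior[j] for j in alphabet)
                     for k in alphabet}
    return decisions

# An ambiguous measurement (similar likelihoods) is resolved by context:
observations = [{"a": 0.1, "h": 0.9},    # clearly an 'h' on its own
                {"a": 0.5, "h": 0.5}]    # ambiguous on its own
print(contextual_decisions(observations))   # -> ['h', 'a'] with these numbers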

 References

J. Raviv, "Decision making in Markov chains applied to the problem of pattern recognition"

G. T. Toussaint, "The use of context in pattern recognition"

various resources on the web (see links below)

 Links

HMMs applications (in speech processing and bioinformatics)

http://cslu.cse.ogi.edu/HLTsurvey/ch1node7.html

http://www.cs.mcgill.ca/~kaleigh/work/hmm/hmm_paper.html

http://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html

 

Letter statistics

http://raphael.math.uic.edu/~jeremy/crypt/freq.html

http://www.unimainz.de/~pommeren/Kryptologie/Klassisch/6_Transpos/Bigramme.html

Linguistics fun

http://www.askoxford.com/asktheexperts/faq/aboutwords/frequency

http://www.ojohaven.com/fun/trivia.html

Information theory

http://libox.net/infotheory.htm

http://www.math.cudenver.edu/~wcherowi/courses/m5410/m5410lc1.html

 

Submitted by Irina Guilman (iguilm@cs.mcgill.ca)


