Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process. That is, it is a "memoryless" system whose next state depends only on its current state. The HMM is considered "hidden" because we do not (or cannot) observe the states of the system directly.
Here is an example of a 3-state Markov model:
As we move from state to state (node to node, or circle to circle), each edge carries a weight giving the probability that we move from one node to the other. The weights on the edges leaving any one node sum to 1.
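To make the edge weights concrete, here is a minimal sketch of a 3-state chain stored as a nested dictionary. The state names (`A`, `B`, `C`), the weights, and the helper names are illustrative assumptions, not values read from the figure. The sketch checks that each node's outgoing weights form a probability distribution and multiplies the weights along a path.

```python
# Hypothetical 3-state chain: names and weights are assumptions for illustration.
transition = {
    "A": {"A": 0.5, "B": 0.3, "C": 0.2},
    "B": {"A": 0.1, "B": 0.6, "C": 0.3},
    "C": {"A": 0.4, "B": 0.4, "C": 0.2},
}


def rows_are_distributions(matrix, tol=1e-9):
    """Each row holds the outgoing edge weights of one node and must sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in matrix.values())


def path_probability(path, matrix):
    """Probability of following `path`, given that we start in path[0]."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= matrix[current][nxt]
    return prob


print(rows_are_distributions(transition))
print(path_probability(["A", "B", "B", "C"], transition))  # 0.3 * 0.6 * 0.3
```

Because the chain is memoryless, the probability of a whole path is simply the product of the individual edge weights.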
A Markov chain is useful when we need to compute a probability for a sequence of observable events. In many cases, however, the events we are interested in are hidden: we don’t observe them directly. For example, we don’t normally observe part-of-speech tags in a text. Rather, we see words and must infer the tags from the word sequence. We call the tags hidden because they are not observed.
HMMs have applications in all sorts of areas including thermodynamics, economics, speech, pattern recognition, bioinformatics, and more. They provide a foundation for probabilistic models of linear sequence ‘labeling’ problems.
Mathematically, if we consider a sequence of state variables, a Markov model is specified by the following quantities:
- Transition probabilities: $$ A = (a_{ij}), \quad a_{ij} = P(s_j \mid s_i) $$
- Initial probabilities ($\pi$): $$ \pi = \left\{P(s_1), P(s_2), \dots, P(s_i)\right\} $$

A hidden Markov model requires one more mathematical definition. We need to know the probability of observing an event given a state:
$$ B = (b_i(v_m)), \quad b_i(v_m) = P(v_m \mid s_i) $$
These are known as emission probabilities: the probability that, given a state $s_i$, the model "emits" the observation $v_m$.
Given the above, we can alter the graph model above to represent a hidden Markov model:
The following example problem is taken from Wikipedia:
Consider two friends, Alice and Bob, who live far apart from each other and who talk together daily over the telephone about what they did that day. Bob is only interested in three activities: walking in the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather on a given day. Alice has no definite information about the weather, but she knows general trends. Based on what Bob tells her he did each day, Alice tries to guess what the weather must have been like.
Alice believes that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but she cannot observe them directly, that is, they are hidden from her. On each day, there is a certain chance that Bob will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice about his activities, those are the observations. The entire system is that of a hidden Markov model (HMM).
Alice knows the general weather trends in the area, and what Bob likes to do on average. In other words, the parameters of the HMM are known. They can be represented as follows in Python:
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
In this piece of code, `start_probability` represents Alice's belief about which state the HMM is in when Bob first calls her (all she knows is that it tends to be rainy on average). The particular probability distribution used here is not the equilibrium one, which is (given the transition probabilities) approximately `{'Rainy': 0.57, 'Sunny': 0.43}`. The `transition_probability` represents the change of the weather in the underlying Markov chain. In this example, there is only a 30% chance that tomorrow will be sunny if today is rainy. The `emission_probability` represents how likely Bob is to perform a certain activity on each day. If it is rainy, there is a 50% chance that he is cleaning his apartment; if it is sunny, there is a 60% chance that he is outside for a walk.
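The equilibrium distribution quoted above can be checked numerically. The sketch below repeatedly pushes a distribution through the transition probabilities until it stops changing. The function name `stationary_distribution` and the iteration count are assumptions made for this illustration.

```python
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}


def stationary_distribution(transition, start, iterations=100):
    """Apply the chain to `start` repeatedly; the result settles at equilibrium."""
    dist = dict(start)
    for _ in range(iterations):
        dist = {
            nxt: sum(dist[cur] * transition[cur][nxt] for cur in transition)
            for nxt in transition
        }
    return dist


equilibrium = stationary_distribution(
    transition_probability, {'Rainy': 0.6, 'Sunny': 0.4}
)
print({state: round(p, 2) for state, p in equilibrium.items()})
# {'Rainy': 0.57, 'Sunny': 0.43}
```

The exact values are 4/7 for "Rainy" and 3/7 for "Sunny", which follow from solving $\pi_{Rainy} = 0.7\,\pi_{Rainy} + 0.4\,\pi_{Sunny}$ together with $\pi_{Rainy} + \pi_{Sunny} = 1$.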
There are many computational problems associated with HMMs. Below are just a few. In general, they rely on dynamic programming and iterative optimization to find the most likely sequence of states, or the most likely model parameters, given the observations. The probabilities in these algorithms are often represented in log space, which turns long products into sums and prevents floating-point underflow (the product of many small probabilities quickly becomes too small for a computer to represent).
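The underflow issue is easy to reproduce. In this sketch the per-step probability `p` and the number of steps are assumed values chosen only to trigger the problem, and `log_sum_exp` is a hypothetical helper for adding probabilities that are stored as logs.

```python
import math

p = 1e-5      # illustrative per-step probability (assumed value)
steps = 100

# Direct multiplication: the true value, 1e-500, is too small for a float.
product = 1.0
for _ in range(steps):
    product *= p

# Log space: the same quantity is stored as a sum of logs.
log_product = steps * math.log(p)


def log_sum_exp(log_values):
    """Return log(sum(exp(v))) without leaving log space for large magnitudes."""
    largest = max(log_values)
    return largest + math.log(sum(math.exp(v - largest) for v in log_values))


print(product)      # 0.0
print(log_product)  # roughly -1151.29
print(log_sum_exp([math.log(0.2), math.log(0.3)]))  # close to math.log(0.5)
```

Products become sums of logs, and sums of probabilities go through a helper such as `log_sum_exp`, so intermediate values stay within a representable range.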
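Given the HMM $M=(A,B,\pi)$ and an observation sequence $O = O_1, O_2, \dots, O_T$, the decoding problem is to find the most likely sequence of hidden states $S = S_1, S_2, \dots, S_T$.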
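One standard dynamic-programming approach to decoding is the Viterbi algorithm. The sketch below applies it to the Alice and Bob parameters from the example; the function name `viterbi` and its argument names are choices made for this illustration, and it works with raw probabilities rather than logs to stay short.

```python
states = ('Rainy', 'Sunny')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: the state preceding s on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = (best[t - 1][prev] * trans_p[prev][s]
                          * emit_p[s][observations[t]])
            back[t][s] = prev
    # Start from the best final state and follow the back-pointers.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


prob, path = viterbi(('walk', 'shop', 'clean'), states, start_probability,
                     transition_probability, emission_probability)
print(path)  # ['Sunny', 'Rainy', 'Rainy']
print(prob)  # approximately 0.01344
```

If Bob reports walking, then shopping, then cleaning, Alice's best guess is that the weather was sunny on the first day and rainy on the next two.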
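Similar to the above decoding problem, given the HMM $M=(A,B,\pi)$ and an observation sequence $O = O_1, O_2, \dots, O_T$, the evaluation problem is to compute the likelihood of the observations, $P(O \mid M)$. If the state sequence $S = S_1, S_2, \dots, S_T$ were known, the observations would be conditionally independent given the states:
$$ P(O \mid S) = \prod_{i=1}^{T} P(O_i \mid S_i) $$
However, the state sequence is unknown, so the likelihood has to account for every possible state sequence.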
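Because the state sequence is unknown, the likelihood sums over all of them. A common way to do this efficiently is the forward algorithm. The sketch below computes the likelihood for the Alice and Bob example in two ways: a brute-force sum over every state sequence, and the forward recursion. Both function names are assumptions made for this illustration.

```python
from itertools import product

states = ('Rainy', 'Sunny')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}


def brute_force_likelihood(obs, states, start_p, trans_p, emit_p):
    """Sum P(O, S) over every state sequence S; cost grows exponentially in len(obs)."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = start_p[seq[0]] * emit_p[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans_p[seq[t - 1]][seq[t]] * emit_p[seq[t]][obs[t]]
        total += p
    return total


def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Same quantity via dynamic programming; cost is linear in len(obs)."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


obs = ('walk', 'shop', 'clean')
args = (states, start_probability, transition_probability, emission_probability)
print(brute_force_likelihood(obs, *args))  # approximately 0.033612
print(forward_likelihood(obs, *args))      # approximately 0.033612
```

The two results agree, but the brute-force version visits every one of the $2^T$ state sequences, while the forward recursion only keeps one running value per state.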
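Given an observation sequence, the learning problem is to estimate the model parameters $A$, $B$, and $\pi$ that maximize the likelihood of the observations.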
The most well-known algorithm for this is the Baum-Welch algorithm, an expectation-maximization (EM) algorithm. It is not guaranteed to find a globally optimal solution, only a local maximum of the likelihood. It can also be computationally expensive.
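As a rough illustration of the idea behind Baum-Welch, the sketch below runs a few re-estimation steps on a single observation sequence, starting from the Alice and Bob parameters. The observation sequence, the function names, and the number of iterations are all assumptions made for this example. It uses raw probabilities with no scaling, so it is only suitable for short sequences.

```python
def forward(obs, states, pi, A, B):
    """alpha[t][s] = P(obs[0..t], state at time t is s)."""
    alpha = [{s: pi[s] * B[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: B[s][o] * sum(prev[r] * A[r][s] for r in states)
                      for s in states})
    return alpha


def backward(obs, states, A, B):
    """beta[t][s] = P(obs[t+1..] | state at time t is s)."""
    beta = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(A[s][r] * B[r][o] * nxt[r] for r in states)
                        for s in states})
    return beta


def baum_welch_step(obs, states, symbols, pi, A, B):
    """One EM update; also returns the likelihood under the input parameters."""
    alpha, beta = forward(obs, states, pi, A, B), backward(obs, states, A, B)
    likelihood = sum(alpha[-1].values())
    T = len(obs)
    # E-step. gamma[t][s]: posterior probability of being in s at time t.
    gamma = [{s: alpha[t][s] * beta[t][s] / likelihood for s in states}
             for t in range(T)]
    # xi[t][r][s]: posterior probability of moving from r at t to s at t + 1.
    xi = [{r: {s: alpha[t][r] * A[r][s] * B[s][obs[t + 1]] * beta[t + 1][s]
               / likelihood for s in states} for r in states}
          for t in range(T - 1)]
    # M-step: re-estimate each parameter from expected counts.
    new_pi = dict(gamma[0])
    new_A = {r: {s: sum(x[r][s] for x in xi) / sum(g[r] for g in gamma[:-1])
                 for s in states} for r in states}
    new_B = {s: {v: sum(g[s] for g, o in zip(gamma, obs) if o == v)
                 / sum(g[s] for g in gamma) for v in symbols} for s in states}
    return new_pi, new_A, new_B, likelihood


states = ('Rainy', 'Sunny')
symbols = ('walk', 'shop', 'clean')
pi = {'Rainy': 0.6, 'Sunny': 0.4}
A = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3}, 'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
B = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
     'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
# Hypothetical week of reports from Bob (assumed data).
obs = ('walk', 'walk', 'shop', 'clean', 'clean', 'shop', 'walk')

likelihoods = []
for _ in range(5):
    pi, A, B, likelihood = baum_welch_step(obs, states, symbols, pi, A, B)
    likelihoods.append(likelihood)
print(likelihoods)  # each value is at least as large as the one before it
```

Each step never lowers the likelihood of the training sequence, but where the procedure ends up depends on the starting parameters, which is why only a local maximum is guaranteed.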
HMMs offer strong prediction and modeling potential in the form of a highly interpretable and statistically sound model. They can be applied to many real-world problems and are often computationally efficient when making inferences. They still, however, have both pros and cons:
Pros:
- HMMs are well studied, statistically sound, and highly interpretable.
- They are easy to implement and analyze.
- They incorporate prior knowledge into the model architecture.
- They can be initialized close to something believed to be correct.
- They are widely applicable.

Cons:
- They are bounded by the Markov assumption: the next state is determined only by the current state, not by earlier ones.
- For EM learning problems, the number of parameters to estimate can be large, so training needs a large data set.
- Training an HMM can be computationally challenging.