Hidden Markov Models for Speech Recognition
by B. H. Juang and L. R. Rabiner, Technometrics, Vol. 33, No. 3 (Aug. 1991)
http://luthuli.cs.uiuc.edu/~daf/courses/Signals%20AI/Papers/HMMs/0.pdf
- Introduction
- 1. MEASUREMENTS AND MODELING OF SPEECH
- 2. THE STATISTICAL METHOD OF THE HIDDEN MARKOV MODEL
- 3. STRENGTHS OF THE METHOD OF HIDDEN MARKOV MODELS AS APPLIED TO SPEECH RECOGNITION
- 4. HIDDEN MARKOV MODEL ISSUES FOR FURTHER CONSIDERATION
- Hidden Markov models (HMMs) have become predominant in speech recognition research in the last decade due to:
- Inherent statistical framework
- Ease and availability of training algorithms
- Flexibility of resulting recognition systems
- Ease of implementation
Performance:
- Speech recognition technology: Can now achieve almost perfect accuracy in speaker-independent isolated-digit recognition, with a 2-3% error rate for connected digit strings spoken by arbitrary (unknown) talkers.
- In a continuous speech environment with 1,000-word vocabulary and grammatical constraints, advanced HMM systems can achieve 96% word accuracy, rivaling human performance.
- However, the general problem of fluent, speaker-independent speech recognition is still not solved, and no system reliably recognizes unconstrained conversational speech or infers language structure from limited spoken sentences.
Theory of HMMs:
- HMMs are a statistical model used to represent a sequence of hidden states that generate observable data through a transition matrix and an emission probability distribution.
- The Baum-Welch algorithm is the standard method for estimating the parameters (transition probabilities and emission distributions) of an HMM from observed data.
- Maximum a posteriori (MAP) decoding is used to recover the most likely hidden state sequence given the observed data.
Hidden Markov Models:
- Hidden states represent phonetic or linguistic features that cannot be directly observed in speech.
- Observed data represents speech features like pitch, amplitude, and spectral characteristics.
- The transition matrix models the probability of moving from one hidden state to another, while the emission probabilities model the likelihood of observing a particular feature given a hidden state.
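As a concrete (and entirely hypothetical) illustration of these ingredients, a small discrete-observation HMM can be written down directly; the state names, codebook size, and probability values below are made up for the example and are not taken from the paper.

```python
import numpy as np

# Hypothetical 3-state discrete HMM: states might stand for coarse phonetic
# classes, observations for vector-quantized spectral labels (codebook indices).
states = ["silence", "vowel", "fricative"]           # hidden states (illustrative)
symbols = 8                                           # size of the VQ codebook

pi = np.array([0.8, 0.1, 0.1])                        # initial state distribution
A = np.array([[0.6, 0.3, 0.1],                        # A[i, j] = P(q_{t+1}=j | q_t=i)
              [0.1, 0.7, 0.2],
              [0.2, 0.2, 0.6]])
rng = np.random.default_rng(0)
B = rng.dirichlet(np.ones(symbols), size=len(states)) # B[i, k] = P(O_t=k | q_t=i)

def sample(T):
    """Generate a hidden state path and an observation sequence of length T."""
    q = [rng.choice(len(states), p=pi)]
    for _ in range(T - 1):
        q.append(rng.choice(len(states), p=A[q[-1]]))
    O = [rng.choice(symbols, p=B[s]) for s in q]
    return q, O

q, O = sample(10)   # the states q are "hidden"; only O would be observed
```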
Challenges and Future Work:
- There are many theoretical and practical issues that need to be addressed to further advance speech recognition research:
- Acoustic modeling: Improving accuracy by modeling noise, variability in speech patterns, and speaker adaptation.
- Language modeling: Incorporating statistical language models to improve word accuracy and handle grammatical constraints.
- Feature extraction: Developing new features that better capture speech characteristics, like prosody and intonation.
- Speaker adaptation: Adapting the model to individual speakers for improved recognition performance.
Speech Signal Modeling
- Speech is a nonstationary signal
- Articulatory apparatus (lips, jaw, tongue, velum) modulates air pressure and flow to produce sounds
- Spectral content of individual sounds may include frequencies up to several thousand Hz
- Short-time spectral properties analyzed at 10 ms intervals
- Long-term development of sound sequences characterized on the order of 100 ms due to articulatory configuration changes
Digital Processing of Speech Signal
- Analog speech signal sampled at 8-20 kHz, with 16-bit quantization
- Short-time spectral properties analyzed using windows and spectral analysis methods (FFT, LPC, autoregressive/moving average models)
- Auditory models incorporated to emphasize frequencies important to human listeners
Observation Vector or Observation
- Short-time spectral vector obtained by analyzing speech signal with a window
Spectral Analysis Methods
- Discrete Fourier transform (FFT)
- All-pole minimum-phase linear prediction (LPC) methods
- Autoregressive/moving average models
- Filter bank method of spectral analysis
- Spectral properties emphasized using auditory models (e.g., Cohen 1985, Ghitza 1986)
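A minimal sketch of the short-time analysis described above, using a plain FFT front end; the 25 ms window, 10 ms hop, and 16 kHz sampling rate are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def short_time_spectra(x, fs=16000, frame_ms=25.0, hop_ms=10.0):
    """Frame a speech waveform, apply a Hamming window, and return per-frame
    FFT magnitude spectra (one observation vector per 10 ms hop)."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame) // hop)
    window = np.hamming(frame)
    spectra = []
    for n in range(n_frames):
        seg = x[n * hop : n * hop + frame] * window
        spectra.append(np.abs(np.fft.rfft(seg)))
    return np.array(spectra)

# Example: 1 second of noise at 16 kHz yields roughly 100 spectral vectors.
x = np.random.randn(16000)
S = short_time_spectra(x)
```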
Cepstrum
- Inverse Fourier transform of the log magnitude spectrum, here computed from the LPC model spectrum
- Can be computed efficiently via a recursion for stable all-pole (LPC) systems
- Figure 4 shows histograms of the first 12 LPC cepstral coefficients obtained from a speech data base.
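The efficient computation alluded to above relies on a recursion from LPC predictor coefficients to cepstral coefficients rather than an explicit transform of the log spectrum. A sketch (for the convention A(z) = 1 - Σ a_k z^{-k}, gain term omitted; illustrative code, not from the paper):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=12):
    """Convert LPC predictor coefficients a[0..p-1] to cepstral coefficients
    using the standard recursion for stable all-pole models."""
    p = len(a)
    c = np.zeros(n_ceps + 1)            # c[0] (gain term) left out here
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if 1 <= n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```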
HMM Formulation
First-Order N-State Markov Chain:
- Illustrated for N = 3 states in Figure 5
- System can be described as being in one of the N distinct states (1, 2, ..., N) at any discrete time t
- State variable: q_t (state at time t)
- State transition probability matrix: A = [aij]
- Axiomatic constraints: a_ij ≥ 0 and Σ_j a_ij = 1 for all i
Observation Representation:
- Cepstral vector: represents a short time speech spectrum
- Discrete symbol representation of spectral vector obtained through spectral labeling process
- Sequence of observations: O = {O_1, O_2, ..., O_T} (T = length of the speech sequence)
Markov Modeling:
- Spectral sequence can be modeled as a Markov chain describing how one sound changes to another
- Relation between direct spectral representation and Markov modeling discussed in Juang (1984) and Bridle (1984)
Hidden Markov Model:
- Discussed here is the case of an explicit probabilistic structure imposed on the sound sequence representation
- Observation probability measures: B = {b_i(O_t)}
- Joint probability of an observation sequence O and a state sequence q: P(O, q | π, A, B) = π_{q_1} b_{q_1}(O_1) ∏_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t)
- The probability of the observations alone is obtained by summing over all state sequences: P(O | π, A, B) = Σ_q P(O, q | π, A, B)
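A direct transcription of this joint probability into code (log domain, discrete observations; a sketch, not code from the paper; pi, A, B are arrays like those in the earlier snippet):

```python
import numpy as np

def joint_log_prob(O, q, pi, A, B):
    """log P(O, q | pi, A, B) for a discrete-observation HMM:
    pi[q_1] * b_{q_1}(O_1) * prod_{t=2}^T a_{q_{t-1} q_t} * b_{q_t}(O_t)."""
    logp = np.log(pi[q[0]]) + np.log(B[q[0], O[0]])
    for t in range(1, len(O)):
        logp += np.log(A[q[t - 1], q[t]]) + np.log(B[q[t], O[t]])
    return logp
```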
2. THE STATISTICAL METHOD OF THE HIDDEN MARKOV MODEL
Problems Addressed by the Hidden Markov Model:
- Evaluation Problem: Given an observation sequence O and a model λ, how do we efficiently evaluate the probability P(O | λ) of O being produced by the source model λ?
- Estimation Problem: Given an observation sequence O, how do we solve the inverse problem of estimating the parameters of λ?
- Decoding Problem: How do we deduce the most likely state sequence q from the observation sequence O?
- Direct calculation of P(O | λ) requires computation that grows exponentially with the sequence length
- The forward-backward procedure reduces this to a computation linear in the sequence length
- The forward variable α_t(i) is defined as the probability of the partial observation sequence up to time t together with being in state i at time t
- The desired result is P(O | λ) = Σ_i α_T(i)
- Efficient scoring mechanism for classification of unknown observation sequence
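A sketch of the forward recursion for a discrete-observation HMM (illustrative code, not from the paper):

```python
import numpy as np

def forward(O, pi, A, B):
    """Forward procedure: alpha[t, i] = P(O_1..O_t, q_t = i | model).
    Returns alpha and P(O | model) = sum_i alpha[T-1, i]."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]   # induction step
    return alpha, alpha[-1].sum()
```

The cost is O(N²T), in contrast to the exponential cost of enumerating every state sequence.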
- The Maximum Likelihood (ML) method is used to find the "right" model parameters, i.e., those that maximize P(O | λ)
- The Baum-Welch algorithm accomplishes this with a two-step procedure:
  - Transform the objective function into an auxiliary function Q(λ, λ'), and maximize it over λ' to obtain an improved model
  - Repeat the two steps until some stopping criterion is met
- The Baum-Welch algorithm is an instance of the classical EM algorithm; in practice it converges only to a local optimum, with no guarantee of global optimality (Paul 1985)
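To make the two-step structure concrete, here is a single re-estimation iteration for a discrete-observation HMM over one training sequence; this is an illustrative sketch of the standard forward-backward re-estimation formulas in my own notation, not code from the paper.

```python
import numpy as np

def baum_welch_step(O, pi, A, B):
    """One Baum-Welch (EM) re-estimation step for a discrete-observation HMM.
    Returns updated (pi, A, B); iterating this never decreases P(O | model)."""
    T, N = len(O), len(pi)
    # E-step: forward and backward variables
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    evidence = alpha[-1].sum()                        # P(O | model)
    gamma = alpha * beta / evidence                   # P(q_t = i | O)
    xi = np.zeros((T - 1, N, N))                      # P(q_t = i, q_{t+1} = j | O)
    for t in range(T - 1):
        xi[t] = alpha[t, :, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :]
    xi /= evidence
    # M-step: re-estimate parameters from the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(O) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```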
The Decoding Problem
Most Likely State Sequence:
- Uncovering the most likely state sequence that led to an observation sequence
- Useful for:
- Determining speech segment correspondences to word sounds
- Duration of individual speech segments provides useful information for speech recognition
Decoding Objectives:
- Maximize the instantaneous a posteriori probability of the state at time t
- Extend to pairs, triples, and so on
- Maximum a posteriori (MAP) rule: the instantaneous MAP decision minimizes the expected number of incorrectly decoded states; the pairwise extension minimizes incorrectly decoded state pairs, and so on
Global Decoding:
- Maximize Pr(q | O, λ)
- Equivalent to maximizing Pr(q, O | λ)
- Suitable for solution by dynamic programming methods such as the Viterbi algorithm
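A log-domain Viterbi sketch for a discrete-observation HMM (illustrative, not code from the paper):

```python
import numpy as np

def viterbi(O, pi, A, B):
    """Most likely state sequence argmax_q P(q, O | model), found by
    dynamic programming in the log domain with backtracking."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N)); psi = np.zeros((T, N), dtype=int)
    logA = np.log(A)
    delta[0] = np.log(pi) + np.log(B[:, O[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA        # scores[i, j]: best path ending i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, O[t]])
    q = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack
        q.append(int(psi[t, q[-1]]))
    return q[::-1], delta[-1].max()
```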
Speech Recognition Using HMMs:
- Define a set of L sound classes (V)
- Collect labeled training sets for each class
- Solve the estimation problem to obtain models for each class
- Evaluate Pr(O | λ_v) for an unknown utterance O and identify the speech class v with the maximum probability
3. STRENGTHS OF THE METHOD OF HIDDEN MARKOV MODELS AS APPLIED TO SPEECH RECOGNITION
Mathematical Framework:
- Consistent statistical methodology
- Combines modeling of stationary stochastic processes and temporal relationships in a well-defined probability space
- Decomposable measure into observation, state sequence, and joint probability
- Flexibility to study short-time static characterization and long-term transitions independently
- Relatively easy and straightforward training from labeled data
- Baum-Welch algorithm: iterative hill-climbing to maximum likelihood criterion
- Segmental k-means algorithm: segmentation and optimization steps for state sequence and model parameters
- Separate optimization of components (state transition probability matrix and observation distributions) leads to simpler implementation
Observation Distributions:
- Accommodates various types: strictly log-concave densities, elliptically symmetric densities, mixtures, and discrete distributions.
HMM Modeling Flexibility
- Flexibility of Basic HMM: Manifested in:
- Model Topology
- Observation Distributions
- Decoding Hierarchy
- Model Topology: Many topological structures studied for speech modeling
- Isolated Utterances: Left-to-Right HMM (Figure 7a) is appropriate
- Utterance begins and ends at well-identified time instants
- Sequential behavior of speech is represented by sequential HMM
- Other Speech-Modeling Tasks: Ergodic models (Figure 7b) are more appropriate
- Topological configuration and number of states reflect a priori knowledge of the speech source
- Isolated Utterances: Left-to-Right HMM (Figure 7a) is appropriate
- Observation Distributions: Rich class of distributions can be accommodated
- No analytical problems make use of any distribution impractical
- Speech displays quite irregular probability distributions (Jayant and Noll 1984; Juang et al. 1985)
- Mixture Distributions: Used to characterize observations in each state
- Mixture density function: a weighted sum of simpler component densities (see the sketch after this list)
- Useful for modeling spectral observations
- Mixture distribution approximates densities of virtually any shape
- Variants: Vector quantizer HMM, semicontinuous HMM, continuous HMM
- Observation Distributions in HMM: Can also be an HMM probability measure
- Basis for subword unit-based speech recognition algorithms (Lee et al. 1989; Lee et al. 1988)
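A sketch of a per-state mixture observation density; the diagonal-covariance Gaussian components are an assumption made here for illustration, while the paper allows a broader class of component densities.

```python
import numpy as np

def mixture_log_density(o, weights, means, variances):
    """log b_i(o) for one state: a weighted sum of diagonal-covariance
    Gaussians evaluated on a cepstral observation vector o."""
    o = np.asarray(o)
    log_comps = []
    for c, mu, var in zip(weights, means, variances):
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)
        log_comps.append(np.log(c) + ll)
    m = max(log_comps)                                # log-sum-exp for stability
    return m + np.log(sum(np.exp(lc - m) for lc in log_comps))
```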
Ease of Implementation
- Potential Numerical Difficulties: Result from multiplicative terms in the HMM probability measure
- Normalization and interpolation techniques can help prevent numerical problems
- Scaling algorithm (Levinson et al. 1983; Juang and Rabiner 1985) normalizes partial probabilities
- Deleted interpolation scheme proposed by Jelinek and Mercer (1980) deals with sparse data problems
- Computational Complexity: Relatively low due to grammatical constraints limiting the average number of words following a given word to around 100
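The scaling idea cited above (Levinson et al. 1983; Juang and Rabiner 1985) can be sketched as follows; the exact bookkeeping in those papers differs, and this shows only the core idea of normalizing the partial probabilities at every frame so that products of many small terms never underflow.

```python
import numpy as np

def scaled_forward(O, pi, A, B):
    """Forward procedure with per-frame normalization (scaling);
    returns log P(O | model) as the sum of the log scale factors."""
    T = len(O)
    alpha = pi * B[:, O[0]]
    log_prob = 0.0
    for t in range(T):
        if t > 0:
            alpha = (alpha @ A) * B[:, O[t]]
        scale = alpha.sum()                           # c_t, the frame normalizer
        alpha = alpha / scale
        log_prob += np.log(scale)                     # log P(O) = sum_t log c_t
    return log_prob
```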
4. HIDDEN MARKOV MODEL ISSUES FOR FURTHER CONSIDERATION
Background:
- Theory of hidden Markov modeling developed over last two decades
- Application to speech recognition still has some issues
- Original method: maximize the likelihood P(O | λ), where λ is the assumed source model
- ML method is optimal according to this criterion
- Baum-Welch reestimation algorithm used for solving the ML HMM estimation problem
Classification Error Rate:
- Classical Bayes rule (Duda and Hart, 1973) requires minimum classification error rate
- Decision rule: C*(O) = arg max_C Pr(C | O), where O is the observation sequence and C is a class
- MAP decoder: the decision rule can be written as C*(O) = arg max_C Pr(O | C) Pr(C)
- The problem lies in estimating the unknown prior distribution Pr(C) and conditional distribution Pr(O | C) from a finite training set
Strategies to Overcome Difficulty:
- Choose a reduced set of subword units instead of words for classes
- Obtain reliable estimates by using larger training sets or reducing the vocabulary size
- Decompose large sound classes into smaller ones (subword units)
- Use lexical description to obtain class probability Pr(C) independently of spoken training set
- Estimate P(O|C) from each subword unit in the lexical entry for words
Complete Label Case:
- Training data and class association known a priori (hand labeled)
- Illustration: isolated or connected word tasks, hand-segmented continuous speech recognition
Incomplete Label Case:
- Partial knowledge of data: only phoneme sequence known but not the correspondence between each phoneme and segment of speech
- Illustration: estimating models of subword speech units from continuous speech, words in connected word tasks without prior word segmentation.
Markovian Structure of HMM Model
Speech Signals and Markov Property:
- Speech signals may not be Markovian
- Need to account for potential errors in modeling with a mechanism called "model-mismatch case"
Model-Mismatch Case:
- When the chosen class model (conditional and a priori probability) is incorrect, ML method may lead to suboptimal recognition performance
Corrective Training:
- Alternative approach that minimizes recognition error rate during training by identifying sources of errors or near-misses
- Uses HMM as a discriminant function rather than focusing on modeling accuracy
Complete-Label Case:
- In this situation, the association between training data and class is precisely known (Fig. 8a)
- The usual concerns of supervised learning apply; what is distinctive in speech recognition is the use of a prescribed class prior model Pr(C_i)
Maximum Mutual Information (MMI) Estimator:
- When the class prior Pr(Ci) is obtained independent of spoken training set, MMI estimator is used
- Maximizes mutual information between input observations and target classes
Conditional Maximum Likelihood Estimation (CMLE):
- CMLE leads to a set of equations that optimize the conditional probability Pr(O_i | C_i)
- CMLE (Eq. 25) compensates for an inaccurate class prior model Pr(C_i) by maximizing the joint log-likelihood of the observations and the target classes
Concerns with CMLE:
- Lack of a convenient and robust algorithm for obtaining the CMLE estimate of Pr(O | C_i)
- Previous attempts at using the MMI criterion did not guarantee convergence to an optimal solution
- The CMLE has larger variance than the MLE, which can undermine the potential gain from offsetting an inaccurate class prior when the decoder is applied to test data
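For reference, the MMI criterion contrasted with ML above is commonly written as follows; the notation is mine (with the class priors held fixed) and this is the standard textbook form, not a formula copied from the paper.

```latex
\lambda_{\mathrm{MMIE}}
  \;=\; \arg\max_{\lambda}\;
  \Bigl[\,\log \Pr(O \mid C,\lambda)\,\Pr(C)
        \;-\; \log \sum_{C'} \Pr(O \mid C',\lambda)\,\Pr(C')\Bigr]
```

Maximizing this quantity is equivalent to maximizing the a posteriori probability Pr(C | O, λ) of the correct class; plain ML keeps only the first term.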
Incomplete Label Case
- Result of practical difficulties in labeling continuous speech data
- Ambiguity among sound classes due to class definition similarities and time uncertainty in signals
- Difficulties in training prescribed subword unit models without definitive labeling
Handling Incomplete Labeling:
- Retain known class sequence constraints: Solve for optimal set of class models sequentially
- Optimize sound models simultaneously with decoding: Constrained approach to iteratively refine segmentation and improve unit models
- Theoretical shortcomings: Segmentation results depend on number of segments; inconsistent segmentations possible
Alternative Approach:
- Combine segmentation, decoding, and modeling: Solve for both class models and segmentations simultaneously without label-sequence constraint
- Iteratively improve data segmentation and model estimation via the segmental MDI modeling approach
Model-Mismatch Issues:
- Choose HMM model before enough source information known
- Optimality in decoding meaningful only when actual source matches assumed model
- Improve model based on new data or revise decoding rule if source inconsistent with assumed model
Minimum Discrimination Information (MDI) Approach:
- Allow search into general set of models for better matching under MDI measure
- Find HMM parameter set that minimizes MDI on a sequence of constraints R
Corrective Training:
- Design a classifier based on estimates of Pr(C_i) and Pr(O | C_i) to achieve minimum error rate on the training set
- Use the HMM likelihood Pr(O | λ) as a contrasting reference (discriminant function) for improved generalization to independent data sets.
Integration of Nonspectral and Spectral Features
HMM Use in Speech Recognition:
- HMMs have been limited to modeling short-time speech spectral information (smoothed representation of the speech spectrum)
- Spectral feature vector is extremely useful and led to successful recognizer designs
- Success can be attributed to spectral analysis techniques and understanding of the importance of the speech spectrum
Nonspectral Speech Features:
- Prosody (segmental and supra-segmental levels)
- Physical manifestations: normalized energy, differential energy, pitch (fundamental frequency)
Integrating Nonspectral and Spectral Features:
- Performance Improvement:
- Incorporating log energy (and differential) into feature vector or likelihood calculation with moderate success
- Performance improvement smaller than anticipated based on importance of prosody in speech
- Temporal Rate of Change:
- Spectral parameters must be computed at a high frame rate (about 100 Hz) to characterize vocal-tract variations
- Prosodic features evolve at roughly the syllabic rate (about 10 Hz), much slower than the spectral parameters
- Combining Feature Sets:
- Need to combine two feature sets with fundamentally different time scales and sequential characteristics
- Challenges: performing optimum modeling and decoding
- Statistical Characterization:
- Need to know joint distribution of the two feature sets
- For correlated (not independent) features, correction for correlation is required
- Proposed method: principal component analysis on joint feature set before hidden Markov modeling
- Alleviates difficulties but not a total solution due to lack of physical significance and open-set problem
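A minimal sketch of the principal-component step mentioned above, assuming the spectral and prosodic features have simply been stacked frame by frame into one matrix (that stacking is my assumption for illustration, not a procedure specified in the paper).

```python
import numpy as np

def decorrelate(features, n_components=None):
    """PCA on a joint (spectral + prosodic) feature matrix, rows = frames;
    returns decorrelated feature vectors for subsequent HMM training."""
    X = features - features.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # sort descending
    W = eigvecs[:, order[:n_components]]
    return X @ W
```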
Duration Modeling in HMMs
Inherent Limitations of HMM Approach:
- Temporal duration within a state is implicitly modeled by a geometric (discrete exponential) distribution, Pr(d) = a_ii^(d-1) (1 - a_ii), as sketched below
- This duration model is inappropriate for most speech events
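For illustration, the implied duration law can be computed directly; the self-loop probability value below is hypothetical.

```python
import numpy as np

def implicit_duration_pmf(a_ii, d_max=50):
    """State-duration distribution implied by a standard HMM with self-loop
    probability a_ii: Pr(d) = a_ii**(d-1) * (1 - a_ii), a geometric law
    that always peaks at d = 1."""
    d = np.arange(1, d_max + 1)
    return a_ii ** (d - 1) * (1 - a_ii)

# Example: with a_ii = 0.9 the most probable dwell time is a single frame,
# which rarely matches the duration statistics of actual speech sounds.
pmf = implicit_duration_pmf(0.9)
```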
Alternatives to Exponential Duration Model:
- Semi-Markov Chain:
- State transitions do not occur at regular time intervals
- Durations between states are modeled as independent random variables (Pr(d_i))
- Joint probability of the observation and state sequences factors into state-transition, state-duration, and within-state observation terms
- Duration Modeling in Semi-Markov HMMs:
- Duration model Pr(d) is treated as a discrete distribution over possible dwell times (1 ≤ d_i ≤ D_max)
- Computational complexity increases due to irregular transition timing
- Drawbacks of Duration Model for Speech Recognition:
- Increased computational complexity, especially in decoding lattices
- Complications with search algorithms like beam search and stack algorithm
Model Clustering and Splitting
Assumptions in Statistical Modeling:
- Variability in observations from an information source can be modeled by statistical distributions
- Source could be a single word, subword unit like phoneme, or word sequence
Motivations for Multiple HMM Approach:
- Lumping all variability leads to unnecessarily complex models with lower modeling accuracy
- Some variability may be known a priori and warrant separate modeling of source data sets
Clustering Algorithms:
- k-means clustering algorithm
- Generalized Lloyd algorithm in vector quantizer designs
- Greedy growing algorithm in set-partition or decision-tree designs
- Suitable for separating inconsistent training data into homogeneous subgroups
Clustering Algorithms Application:
- Successful application to speech recognition using ML criterion
- Interaction with other estimation criteria and Bayes minimum-error classifier design remains open question
Alternative: Model Merging
- Subdivide source into large number of subclasses, merge based on source likelihood considerations
- Examples: Build context sensitive units (left-context and right-context) for speech recognition
- Determine which pairs of models to merge based on a small change in entropy (ΔH); see the sketch below
- Practical questions about what change ΔH is acceptable remain
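One common way to score a candidate merge is the increase in weighted entropy when two empirical observation distributions are pooled; the sketch below reflects that generic idea under my own assumptions and is not necessarily the exact criterion used in the cited work.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def merge_entropy_change(p1, n1, p2, n2):
    """Increase in weighted entropy (delta H) when two discrete observation
    distributions, estimated from n1 and n2 samples, are pooled into one."""
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    merged = w1 * np.asarray(p1) + w2 * np.asarray(p2)
    return entropy(merged) - (w1 * entropy(p1) + w2 * entropy(p2))
```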
Parameter Significance and Statistical Independence
A. Discrimination Capability of a's vs. b's:
- In theory, the state-transition coefficients (a's) and the observation densities (b's) are equally important
- In practice, this is not usually the case due to:
- Limited dynamic range of a's, especially for left-to-right models
- Nearly unlimited dynamic range of b's, particularly with continuous density functions
B. Combining a's and b's:
- When combined, the a's and b's jointly define the HMM probability measure
- In practice, the a_ij's can be neglected in a left-to-right model with no effect on recognition performance
C. Unbalanced Numerical Significance:
- In training, the estimates of the a's affect the estimates of the b_j(·), and vice versa
- The apparent paradox stems from the mismatch in dynamic range between the a's and the b's
D. Usefulness of Markov Chain Contribution:
- Transition probability plays an important role in parameter estimation
- Introduction of semi-Markov models can enhance Markov chain contribution significance
E. Statistical Independence Assumption:
- The product form of Pr(O | q) implies statistical independence of the observations given a known state sequence
- Argument against assuming independence within a stationary sound state
- Problems: difficult to choose appropriate form and estimate parameters from limited data
F. Higher Order HMMs
G. Summary of First-Order Model Limitations:
- Inadequate for grammatical structure modeling in higher level processing
- Need for further understanding and practical implementation advances
H. Performance of HMM Systems:
- Capable of achieving over 95% word accuracy in certain speaker-independent tasks with vocabularies up to 1,000 words
- Further progress expected to make technology usable in everyday life.