Overview

In this machine problem, you will develop a simple bi-modal speech recognizer using a Hidden Markov Model (HMM). Bi-modal speech recognition uses both the speech features and the tracking of the lips to recognize speech. The feature extraction has already been done: the provided data, mp5_av_data.tar, was extracted from video recordings of subjects speaking the digit 2 or 5.

You are provided with code that learns the HMM (mp5_learn_hmm.zip). However, you are required to write the code that computes the likelihood of a sequence given the model parameters. That is, your code should compute the probability that a particular sequence came from a particular HMM (e.g., using the forward or backward algorithm). The data set consists of 10 sequences of the digit 2 and 10 sequences of the digit 5. To evaluate the speech recognizer, we use a leave-one-out scheme, which divides the 20 utterances into two sets: 19 utterances for training and the one remaining utterance for testing. Repeating this procedure 20 times, once per held-out utterance, yields the average accuracy of the recognizer.
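As a starting point, here is a minimal sketch of a scaled forward algorithm in MATLAB. It assumes the model is represented by an initial-state distribution prior, a transition matrix A, and a precomputed matrix B of per-frame observation likelihoods; these names are illustrative and are not part of the provided mp5_learn_hmm code.

    function loglik = forward_loglik(prior, A, B)
    % Scaled forward algorithm for one HMM and one utterance.
    %   prior : N x 1 initial-state probabilities
    %   A     : N x N transition matrix, A(i,j) = P(s_t = j | s_{t-1} = i)
    %   B     : N x T observation likelihoods, B(i,t) = p(o_t | s_t = i)
    % Returns log P(o_1,...,o_T | model); the per-frame scaling factors
    % are accumulated in the log domain to avoid numerical underflow.
    [N, T]  = size(B);
    alpha   = prior(:) .* B(:,1);         % alpha_1(i) = prior(i) * b_i(o_1)
    c       = sum(alpha);                 % scaling factor for frame 1
    alpha   = alpha / c;                  % rescale so alpha sums to 1
    loglik  = log(c);
    for t = 2:T
        alpha  = (A' * alpha) .* B(:,t);  % induction step
        c      = sum(alpha);
        alpha  = alpha / c;
        loglik = loglik + log(c);
    end
    end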

Experiment

There are three parts to this MP.

  1. Use only the audio feature for speech recognition.
  2. Use only the visual feature for speech recognition.
  3. Concatenate the audio and visual features, and use the joint feature for speech recognition. (The evaluation loop, shared by all three parts, is sketched below.)
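Whatever the feature set, the leave-one-out evaluation loop is the same. Below is a rough sketch; train_hmm and loglik_of are hypothetical helper names standing in for the provided training code and your forward-algorithm likelihood, and the assumption that utterances 1-10 are "2"s and 11-20 are "5"s should be matched to the actual data layout.

    % seqs{k}: feature matrix for utterance k (audio, visual, or joint)
    labels   = [2*ones(1,10), 5*ones(1,10)];          % assumed ordering
    nCorrect = 0;
    for k = 1:20
        trainIdx = setdiff(1:20, k);                  % hold out utterance k
        hmm2 = train_hmm(seqs(trainIdx(labels(trainIdx) == 2)));
        hmm5 = train_hmm(seqs(trainIdx(labels(trainIdx) == 5)));
        % classify the held-out utterance by comparing log-likelihoods
        if loglik_of(hmm2, seqs{k}) > loglik_of(hmm5, seqs{k})
            guess = 2;
        else
            guess = 5;
        end
        nCorrect = nCorrect + (guess == labels(k));
    end
    accuracy = nCorrect / 20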

BONUS

Compute the recognition rate AND/OR the state alignment, using the Viterbi algorithm instead of the forward algorithm. This means that, for each test utterance and for each HMM, find the best state sequence via the Viterbi algorithm, and compute the likelihood given that state sequence. Then do either or both of: (1) test to see which model has the best likelihood; (2) compute the state alignment.
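Here is a log-domain Viterbi sketch, using the same prior/A/B conventions as the forward-algorithm sketch above (again, illustrative names, not the provided code; the implicit expansion in the max step requires MATLAB R2016b or later):

    function [loglik, path] = viterbi_decode(prior, A, B)
    % Log-domain Viterbi: returns the log-likelihood of the single best
    % state sequence and the sequence itself (the state alignment).
    [N, T] = size(B);
    logA   = log(A + eps);                       % eps guards against log(0)
    delta  = log(prior(:) + eps) + log(B(:,1) + eps);
    psi    = zeros(N, T);                        % back-pointers
    for t = 2:T
        [best, prev] = max(delta + logA, [], 1); % best predecessor for each j
        delta        = best' + log(B(:,t) + eps);
        psi(:,t)     = prev';
    end
    [loglik, last] = max(delta);
    path    = zeros(1, T);
    path(T) = last;
    for t = T-1:-1:1
        path(t) = psi(path(t+1), t+1);           % trace back the best path
    end
    end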

If you do (1) Recognition: for this extra credit you don't need to do cross-validation if you don't want to; you may train using all of the data, then test using the training data. You can also do recognition using just one modality (audio, video, or audiovisual). Report your results, and compare the Viterbi results to the forward-algorithm results.

If you do (2) Alignment: find some way to plot the state alignment overlaid on top of a plot of either the cepstrogram or the spectrogram.

(a) The feature matrix you're given is a cepstrogram, meaning it contains T consecutive cepstral vectors. Plot it using imagesc; then either use hold on to hold the image and plot the state index on top of it, or create another subplot below and show the state index as a function of t there.

(b) If you'd like something that looks more like an audio spectrogram, you can get one by zero-padding each cepstral vector (to a length of 256 or 512, say) and then using idct to convert it back to a log-magnitude FFT. This gives you a Tx512 spectrogram that you can plot using imagesc. The frequencies will actually be warped to the mel scale rather than Hertz, so it should look like a spectrogram that has been smoothed and then stretched vertically.
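One possible version of this plot is sketched below, combining (b) with the subplot variant of (a). It assumes cep is the D x T cepstrogram of one utterance and path is the 1 x T Viterbi state sequence from the sketch above; idct requires the Signal Processing Toolbox.

    [D, T]  = size(cep);
    padded  = [cep; zeros(512 - D, T)];   % zero-pad each cepstral vector
    spec    = idct(padded);               % column-wise IDCT back to a
                                          % mel-warped log-magnitude spectrum
    subplot(2,1,1);
    imagesc(1:T, 1:512, spec); axis xy;   % smoothed, mel-warped "spectrogram"
    xlabel('frame t'); ylabel('mel-warped frequency bin');
    subplot(2,1,2);
    stairs(1:T, path);                    % state index as a function of t
    xlabel('frame t'); ylabel('state index'); ylim([0.5 5.5]);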

Things to note

  1. Use a left-to-right, non-skip HMM for speech recognition (this transition structure is illustrated below).
  2. Learn each HMM with 5 hidden states.
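For reference, a left-to-right, non-skip topology with 5 states constrains the transition matrix so that each state can only stay put or advance to the next state. The values below are illustrative; the actual probabilities are learned from the data.

    A = [0.8 0.2 0   0   0  ;     % each row: stay, or step to the next state
         0   0.8 0.2 0   0  ;
         0   0   0.8 0.2 0  ;
         0   0   0   0.8 0.2;
         0   0   0   0   1.0];    % final state only self-loops
    prior = [1; 0; 0; 0; 0];      % the model always starts in state 1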