ECE 417/MP4: Audiovisual Person Identification

Overview

In this machine problem, you will build on the results of the previous MPs to develop a multimodal person recognition system. The two modalities you are going to work with are speech and vision. Each has its weaknesses. In a noisy acoustic environment vision is the better choice, but vision generally requires a frontal face, which is difficult to obtain. Moreover, in low-bandwidth operation, audio seems to be the better choice. When both modalities are noisy, we argue that it is beneficial to use them together.

Methods

In this MP, you are going to use probabilistic models to fuse the two modalities. There are three parts to the problem.

Modify the speech-based person ID algorithm of MP3. This time, use a Gaussian mixture model (GMM) to estimate the feature distribution of each person, and then compute the probability of the observed data under each GMM.
1. Feature extraction: Compute the 12 cepstral coefficients according to the method in MP3. Use a window size of 500. Represent each audio file as a data matrix of dimensions 12 by N, where N is the number of windows. Note that this time you should NOT stack the columns of the data matrix into a long vector. Save the matrix for each audio file.
2. GMM estimation: Use the training feature matrices of each person (15 audio files per person) to estimate a GMM with two mixture components (code for learning the GMM is provided). Assume that each component has a diagonal covariance matrix.
3. Probability calculation: Compute the probability of each test audio file under each person's GMM.
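The probability calculation in step 3 can be sketched as follows. This is a minimal pure-Python illustration, not the provided course code: the function names `log_gauss_diag` and `gmm_log_likelihood` and the parameter layout (lists of weights, means, and variances) are hypothetical. It also assumes that the N columns of a feature matrix are independent frames, so that the per-frame log-probabilities can be summed. The work is done in the log domain because a product of hundreds of small densities would underflow.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total

def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log sum_k w_k N(x | mean_k, diag(var_k)).

    frames: list of feature vectors (the columns of the 12-by-N matrix).
    Frames are treated as independent given the speaker (an assumption).
    """
    total = 0.0
    for x in frames:
        comp = [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(comp)  # log-sum-exp trick for numerical stability
        total += top + math.log(sum(math.exp(c - top) for c in comp))
    return total

# Toy 2-D example with two components (all values are made up).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
near = gmm_log_likelihood([[0.1, -0.2], [2.9, 3.1]], weights, means, variances)
far = gmm_log_likelihood([[10.0, 10.0], [-8.0, 9.0]], weights, means, variances)
```

In this toy example, `near` comes out larger than `far` because its frames lie close to the component means. For the MP, each test file would be scored this way under each of the 4 persons' GMMs.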
Modify the face recognition algorithm of MP2 to output the probability of each class. Use Principal Component Analysis (PCA) based features and the K-nearest neighbor (K-NN) algorithm with K=10 for face recognition. However, instead of just picking the winner, compute the probability of each class. For example, if 3 of the 10 nearest neighbors belong to class 1, then the probability of class 1 is 0.3.
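The K-NN probability output can be sketched as below. This is an illustration under stated assumptions: the PCA features are taken as already computed, Euclidean distance is used, and the function name `knn_class_probabilities` and its arguments are hypothetical. The toy data are made up so that the result matches the 3-of-10 example.

```python
import math
from collections import Counter

def knn_class_probabilities(train_feats, train_labels, query, classes, k=10):
    """Return {class: fraction of the k nearest training points in that class}."""
    # Sort training points by Euclidean distance to the query.
    dists = sorted(
        (math.dist(f, query), label)
        for f, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return {c: votes.get(c, 0) / k for c in classes}

# Toy 2-D "PCA features" (made-up values): 3 points of class 1 nearest the
# query, then 7 points of class 2, then 5 points of class 3 far away.
train_feats = ([[0.1 * i, 0.0] for i in range(1, 4)]
               + [[1.0 + 0.1 * i, 0.0] for i in range(7)]
               + [[5.0 + 0.1 * i, 0.0] for i in range(5)])
train_labels = [1] * 3 + [2] * 7 + [3] * 5
probs = knn_class_probabilities(train_feats, train_labels, [0.0, 0.0], [1, 2, 3, 4])
```

Here class 1 receives 3 of the 10 votes and class 2 receives 7, while classes 3 and 4 receive none. A class with no neighbors among the 10 gets probability exactly 0, which matters when the probabilities are later multiplied.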
Fusion: Compute the probability of each class by multiplying the probabilities computed in the above two cases. The class with the highest probability is returned as the winner.
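A minimal sketch of this product rule follows. It assumes the audio score is available as a log-likelihood per class and the face score as a K-NN probability per class. The function name `fuse_and_classify` and the dictionary layout are hypothetical. Multiplying probabilities is done by adding logarithms, and a face probability of 0 is treated as ruling the class out.

```python
import math

def fuse_and_classify(audio_loglik, face_prob):
    """Pick the class maximizing p(audio | class) * P(class | image).

    audio_loglik: {class: log-likelihood of the audio under that class's GMM}
    face_prob:    {class: K-NN probability of that class for the face image}
    """
    def score(c):
        p = face_prob[c]
        # A zero face probability makes the product zero for this class.
        return audio_loglik[c] + math.log(p) if p > 0 else float("-inf")
    return max(audio_loglik, key=score)

# Made-up scores for 4 classes: audio slightly prefers class 2, but no
# face neighbors belong to class 2, so the product is zero for it.
audio_loglik = {1: -1200.0, 2: -1195.0, 3: -1300.0, 4: -1250.0}
face_prob = {1: 0.6, 2: 0.0, 3: 0.3, 4: 0.1}
winner = fuse_and_classify(audio_loglik, face_prob)
```

In this toy case the audio alone would pick class 2, but the fused decision is class 1, since class 2 is vetoed by the face modality.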

Data

The data are divided into two parts: a training set and a test set. There are 4 people in total. For each person, there are 10 face images and 15 audio files in the training set, and 10 face images and 10 audio files in the test set.

Analysis

The parts of the MP are:

Work with audio alone. Use a mixture of 2 Gaussians for each class. Report the percentage recognition rate for each person. Also report the average percentage recognition rate over all persons.
Work with images alone. Report the results of the face recognition experiment for each person (percentage recognition rate). Also report the average percentage recognition rate over all persons. Use K-NN with K=10 as the classification algorithm.
Fuse the audio-based and vision-based recognition (without weights). Test every audio-image combination in the test set of each person (10 face images x 10 audio files = 100 combinations per person). Report the percentage recognition rate for each person. Also report the average percentage recognition rate over all persons.
You are also asked to try different weights during fusion: P(audio, class i | image) = P(audio | class i)^w * P(class i | image)^(1-w), for w ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Identify and report the weight that gives the best overall average recognition performance. For this best weight, report the percentage recognition rate for each person. Also report the average percentage recognition rate over all persons.
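The weighted fusion and the weight sweep can be sketched as follows. This is an illustration only: the names `fused_winner`, `recognition_rate`, and `best_weight`, the dictionary layout, and the toy scores are all hypothetical. Taking logarithms turns the weighted product into w * log P(audio | class i) + (1-w) * log P(class i | image). Ties between weights are broken here by keeping the first weight that attains the maximum, which is an arbitrary choice.

```python
import math

def fused_winner(audio_loglik, face_prob, w):
    """argmax over classes of w*log p(audio|c) + (1-w)*log P(c|image)."""
    def score(c):
        p = face_prob[c]
        if p <= 0.0:
            return float("-inf")  # zero face probability rules the class out
        return w * audio_loglik[c] + (1.0 - w) * math.log(p)
    return max(audio_loglik, key=score)

def recognition_rate(audio_tests, face_tests, true_class, w):
    """Percent correct over every audio-image pair for one person."""
    pairs = [(a, f) for a in audio_tests for f in face_tests]
    correct = sum(fused_winner(a, f, w) == true_class for a, f in pairs)
    return 100.0 * correct / len(pairs)

def best_weight(tests, weights):
    """tests: {person: (audio_tests, face_tests)}. Returns (w, average rate)."""
    best = None
    for w in weights:
        rates = [recognition_rate(a, f, person, w)
                 for person, (a, f) in tests.items()]
        avg = sum(rates) / len(rates)
        if best is None or avg > best[1]:
            best = (w, avg)
    return best

# Made-up scores for two persons, one audio file and one image each.
tests = {
    "A": ([{"A": -100.0, "B": -101.0}], [{"A": 0.4, "B": 0.6}]),
    "B": ([{"A": -100.0, "B": -100.5}], [{"A": 0.2, "B": 0.8}]),
}
weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
w_best, avg_best = best_weight(tests, weights)
```

With the real data, `audio_tests` and `face_tests` would each hold 10 entries per person, giving the 100 combinations described above. Note that a log-likelihood summed over many frames is typically much larger in magnitude than the log of a K-NN probability, so whether to rescale the audio score before weighting is a design choice the formula leaves open.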
