MP1: Speech and Speaker Recognition


Due date: February 4 before class starts.

Overview

In this machine problem, you will develop a simple speech recognition system and then reuse it for person recognition (speaker identification). Along the way you will learn the concepts of audio feature extraction and nearest-neighbor classification. The audio features you are going to use are the "raw features" and the "Cepstrum coefficients." You are provided with the data (waveforms) of four people, each speaking 5 different digits (1, 2, 3, 4, and 5) five times. Download the data from here (speechdata.zip or speechdata.tar.gz). You are encouraged to use Matlab for all your experiments; the instructions below refer to the Matlab commands you are expected to use. You can also take the Matlab file MP1_Walkthrough.m as a starting point.

System Development

  1. Feature Extraction. You are going to work with the raw features and the Cepstrum coefficients.
    1. The first step is to read the wave files into Matlab (wavread). Choose the left channel of the signals (note: the wave files are stereo). Since the files may be of slightly different sizes, you must resize (imresize) the data using linear interpolation so that every signal has the same length (say, 10000 samples). Thus each audio file is represented as a long vector. Save this vector for each audio file. All such vectors comprise a data set, and a vector in the data set can be referred to as a datum.
    2. The second step is to compute the Cepstrum coefficients (rceps). In Cepstrum analysis, one of the main parameters is the window size. In this MP, you are also going to explore the effect of the window size on the performance of the system. Use the resized data from the previous step and try three window sizes, W={100, 500, 10000}. For each window size, choose an overlap of 10% and compute the features (Cepstrum coefficients). For each window, keep the first 12 Cepstrum coefficients. Now you can represent each audio file as a data matrix whose dimensions are 12 by N, where N is the number of windows. To compute the distance between two audio signals, you first need to stack the columns of the data matrix into a long vector of length 12*N. Save this vector for each audio file. All such vectors comprise a data set, and a vector in the data set can be referred to as a datum.
  2. Pattern Matching. We are going to use the nearest neighbor (NN) algorithm for speech and speaker recognition. The NN algorithm works as follows: given some labeled data (the training set) and a new datum to classify (the test datum), we compute the distance from the test datum to each datum in the training set and assign the test datum the label of the nearest one. In all experiments we are going to use the Euclidean distance as our distance metric.

    The problem with the above NN algorithm is that it is prone to error if the data are noisy. An alternative approach is the K-Nearest Neighbor (KNN) algorithm. The idea is the same as that of the NN algorithm, except that instead of finding the single nearest datum you now look at the k nearest data and choose the label by majority vote. In case of a tie, you may increase k to resolve the tie.

  3. Data Partition. To test the algorithm, you need to partition the data in a way that allows you to evaluate the performance of your algorithm. We have the data of four people. Let's call the people A, B, C, and D. Let's refer to each audio file (or the corresponding feature vector) as Zij, where
    1. Z={A, B, C, D} refers to the person,
    2. i={1,2,3,4,5} refers to the digit spoken, and
    3. j={a, b, c, d, e} refers to the particular utterance of the digit.
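The MP itself is to be done in Matlab, but purely as an illustration of the KNN rule described above (including the tie-breaking by increasing k), here is a minimal Python sketch; the function name and data layout are hypothetical, not part of the deliverable:

```python
from collections import Counter

def knn_label(test_vec, train, k):
    """train: list of (feature_vector, label) pairs. Returns the majority
    label among the k nearest training vectors (Euclidean distance).
    Ties are resolved by increasing k, as described above."""
    def sqdist(u, v):
        # Squared Euclidean distance; the ordering of neighbors is the
        # same as with the true Euclidean distance, so no sqrt is needed.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    ranked = sorted(train, key=lambda item: sqdist(test_vec, item[0]))
    while k <= len(ranked):
        counts = Counter(label for _, label in ranked[:k])
        (best, n1), *rest = counts.most_common()
        if not rest or rest[0][1] < n1:
            return best          # unique majority found
        k += 1                   # tie: enlarge the neighborhood
    return ranked[0][1]          # fall back to the single nearest neighbor
```

With k=1 this reduces to the plain NN classifier; with k=5 it is the KNN variant required in the experiments below.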

Experiments

The various experiments that you are required to do are as follows. Note that you are going to do these experiments with all four kinds of features, namely the raw features and the Cepstrum coefficients with window sizes W = 100, 500, and 10000, respectively.

  1. Speech Recognition Experiments. Take one datum out of your data set, say A1a. Remove from the data set all the data corresponding to the same person (Aij for all i, j). Compute its distance from all the remaining data. Let Zij be the nearest point. If i=1, then you have recognized the word correctly; otherwise you have made an error. Repeat this for all the data in your data set. Report the performance (the percentage of the data that are correctly recognized) – overall, for each digit, and for each person. Also, for each case, report the results for both the nearest neighbor (NN) algorithm and the K-nearest neighbor (KNN) algorithm with k=5.
  2. Speaker Recognition Experiments. Take one datum out of your data set, say A1a. Remove from the data set all the data corresponding to the same digit (Z1j for all Z, j). Compute its distance from all the remaining data. Let Zij be the nearest point. If Z=A, then you have recognized the person correctly; otherwise you have made an error. Repeat this for all the points. Report the performance – overall, for each digit, and for each person. Also, for each case, report the results for both the nearest neighbor (NN) algorithm and the k-nearest neighbor (KNN) algorithm with k=5.
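The leave-one-out protocol of the speech recognition experiment can be sketched as follows. This is again a hypothetical Python illustration, not part of the required Matlab code: each datum is held out in turn, all data from the same speaker are excluded from the training set, and the predicted digit is compared with the true one.

```python
def nn_classify(test_vec, train):
    """Nearest neighbor: return the label of the closest training vector.
    train is a list of (feature_vector, label) pairs."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(train, key=lambda item: sqdist(test_vec, item[0]))[1]

def speech_recognition_accuracy(dataset, classify):
    """dataset: dict mapping (person, digit, utterance) -> feature vector.
    classify(test_vec, train) -> predicted digit. For each held-out test
    datum, all data from the same speaker are removed from the training
    set, per the protocol above. Returns the overall fraction of data
    whose digit is recognized correctly."""
    correct = 0
    for (person, digit, utt), vec in dataset.items():
        # Training set: every datum spoken by a *different* person.
        train = [(v, d) for (p, d, u), v in dataset.items() if p != person]
        if classify(vec, train) == digit:
            correct += 1
    return correct / len(dataset)
```

The speaker recognition experiment follows the same loop, except that the exclusion is on the digit (`d != digit`) and the prediction is compared against the person label; the per-digit and per-person breakdowns are obtained by accumulating the same counts grouped by digit or by person.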

What to submit

  1. The results of the speech recognition and speaker recognition experiments. Note that you must report the overall recognition accuracy as well as the recognition accuracy for each person and each digit (i.e., in table form) for the 2 recognition tasks, 4 feature choices, and 2 recognition algorithm choices.
  2. Please explain why we remove from the data set all the data corresponding to the same person in the speech recognition experiments, and why we remove all the data corresponding to the same digit in the speaker recognition experiments.
  3. Please comment on how the performance varies with respect to the feature and algorithm choices.
  4. Include all your Matlab code in the submission. Also, please give us a README file (in any format you like, e.g., doc/pdf/html/text) including your MP narrative and telling us how to run your code to obtain the same results as you did. Compress your report, code, and README file into an xxx.zip or xxx.tar.gz file, where xxx is your Net ID. For example, if your Net ID is chang87, then your compressed file should be named chang87.zip or chang87.tar.gz. Please upload this file on Compass.
  5. BONUS 1: Identify the combination of audio feature and algorithm that gives the highest average speech recognition accuracy, and the combination that gives the highest average speaker recognition accuracy. For each of these combinations, re-do the experiments with the number of Cepstrum coefficients set to 2, 4, 6, 8, 10, and 12, respectively. Plot the average speech recognition accuracy and the average speaker recognition accuracy with respect to the number of Cepstrum coefficients.
  6. BONUS 2: You are encouraged to implement the Mel-frequency cepstral coefficients (MFCCs) as the speech feature and compare their performance with the other features. For more information on MFCCs, please refer to the Wikipedia page.

Note that you must report the overall results as well as the results for each person and each digit, with both the nearest neighbor (NN) and the k-nearest neighbor (KNN) algorithms, and for each of the four feature choices.