Overview

In this machine problem, you will develop a simple speech recognition system, which you will then also use for person recognition (speaker identification). Along the way you will learn the concepts of audio feature extraction and nearest neighbor classification. The audio features you are going to use are the "raw features," the Cepstrum coefficients, and the mel-frequency cepstral coefficients (MFCCs). You are provided the data (waveforms) of four people, each speaking 5 different digits (1, 2, 3, 4, and 5), each digit five times. Download the data from here (speechdata.zip or speechdata.tar.gz). You are encouraged to use Matlab for all your experiments; I will refer to the Matlab commands that you are expected to use along the way.

Useful Files

  1. speechdata.zip / speechdata.tar.gz (the speech data)
  2. mp3_mfcc.zip (optional Matlab files for the MFCC step)

System Development

  1. Feature Extraction. You are going to work with the raw features, the Cepstrum coefficients, and the mel-frequency cepstral coefficients.
    1. The first step is to read the wave files in Matlab (wavread). Choose the left channel of the signals (note: the wave files are stereo). Since the files may be of slightly different lengths, resize (imresize) the data using linear interpolation so that each signal has the same length (say, 10000 samples). Each audio file is thus represented as one long vector; save this vector for each audio file. All such vectors comprise a data set, and a vector in the data set is referred to as a datum. (A sketch of this step appears below, after step 3.)
    2. The second step is to compute the Cepstrum coefficients (rceps). In Cepstrum analysis, one of the main parameters is the window size, and in this MP you are going to explore its effect on the performance of the system. Use the resized data from the previous step and try three window sizes, W = {100, 500, 10000}. For each window size, use an overlap of 10% and compute the Cepstrum coefficients; for each window, keep the first 12 coefficients. Each audio file is then represented as a 12-by-N data matrix, where N is the number of windows. To compute the distance between two audio signals, first stack the columns of the data matrix into one long vector of length 12*N. Save this vector for each audio file. All such vectors comprise a data set, and a vector in the data set is referred to as a datum. (See the second sketch below.)
    3. The third step is to compute the mel-frequency cepstral coefficients (MFCCs). To do this, you can either use the provided Matlab files (mp3_mfcc.zip) or follow this procedure:
      (1) Divide the audio file into frames.
      (2) Compute the magnitude spectrum X = abs(fft(x)) of each frame.
      (3) Multiply the magnitude FFT by a matrix to compute the filterbank coefficients, F = W*X. The matrix W should be NBANDS-by-NFFT, where NBANDS is the number of mel-frequency bands and NFFT is the number of FFT bins. Each row of the matrix should contain a triangle (if you plot the row, it looks like a triangle), centered at the center frequency of the band and hitting zero at the two edges of the band; the center frequency of one band should be the edge of the next band. The center frequencies should be spaced so that their mel frequencies are uniformly spaced between hz2mel(0 Hz) and hz2mel(Fs), where Fs is the sampling frequency of the speech waveform. If you wish, you can use the melbankm, frq2mel, and/or mel2frq functions from the voicebox toolbox to help you compute the W matrix, but DO NOT USE THE MELCEPST FUNCTION FROM VOICEBOX OR ANY SIMILAR END-TO-END MFCC FUNCTION FROM ANY ON-LINE TOOLBOX. In other words, you may use on-line toolboxes to help you design the filterbank matrix, but the pathway from speech samples to spectrum to filterbank coefficients to MFCCs should be entirely your own code, with no pre-packaged functions at any level of abstraction higher than fft and idct.
      (4) Use c = idct(log(max(1e-6, F))) to take the inverse discrete cosine transform of the log of the mel-spaced filterbank coefficients. The threshold of 1e-6 is arbitrary, but changing its value probably won't change your results very much. (See the filterbank sketch below.)
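    As a rough guide, here is a minimal Matlab sketch of step 1 (raw features). The file name 'A1a.wav' is hypothetical; substitute the actual names from speechdata.zip.

        % Step 1 sketch: read a stereo wave file, keep the left channel,
        % and resize to a fixed length of 10000 samples.
        [x, fs] = wavread('A1a.wav');            % use audioread in newer Matlab
        x = x(:, 1);                             % left channel only
        x = imresize(x, [10000 1], 'bilinear');  % linear interpolation
        rawDatum = x(:);                         % one long vector (the datum)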
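    The windowed Cepstrum of step 2 might look like the following (a sketch only, shown for W = 500, with variable names of my own choosing):

        % Step 2 sketch: first 12 real Cepstrum coefficients per window,
        % with 10% overlap between consecutive windows. x is the resized
        % 10000-sample signal from the previous sketch.
        W   = 500;                       % window size (also try 100 and 10000)
        hop = W - round(0.1 * W);        % 10% overlap between windows
        N   = floor((length(x) - W) / hop) + 1;   % number of windows
        C   = zeros(12, N);
        for n = 1:N
            frame   = x((n-1)*hop + (1:W));
            c       = rceps(frame);      % real Cepstrum of this window
            C(:, n) = c(1:12);           % keep the first 12 coefficients
        end
        cepsDatum = C(:);                % stack columns into a 12*N vector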
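    And a sketch of the filterbank construction for step 3, writing the Hz-to-mel conversion inline rather than using voicebox (NFFT and NBANDS are illustrative choices of mine, not part of the assignment):

        % Step 3 sketch: build an NBANDS x NFFT triangular mel filterbank
        % and compute MFCCs for one frame. Band centers are spaced
        % uniformly in mel between 0 and Fs, as specified above.
        NFFT   = 512;
        NBANDS = 26;
        hz2mel = @(f) 2595 * log10(1 + f / 700);
        mel2hz = @(m) 700 * (10.^(m / 2595) - 1);
        melPts = linspace(hz2mel(0), hz2mel(fs), NBANDS + 2); % band edges (mel)
        bin    = 1 + round(mel2hz(melPts) / fs * (NFFT - 1)); % edges as FFT bins
        Wmat   = zeros(NBANDS, NFFT);
        for b = 1:NBANDS
            lo = bin(b); ce = bin(b+1); hi = bin(b+2);
            Wmat(b, lo:ce) = linspace(0, 1, ce - lo + 1);  % rising edge
            Wmat(b, ce:hi) = linspace(1, 0, hi - ce + 1);  % falling edge
        end
        X = abs(fft(frame, NFFT));       % magnitude FFT of one frame
        F = Wmat * X(:);                 % mel filterbank coefficients
        c = idct(log(max(1e-6, F)));     % MFCCs; keep the low-order ones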
  2. Pattern Matching. We are going to use the nearest neighbor (NN) algorithm for speech and speaker recognition. The NN algorithm works as follows: given some labeled data (the training set) and a new datum to classify (the test datum), we compute the distance from the test datum to each datum in the training set and assign the test datum the label of the nearest training datum. In all experiments we are going to use the Euclidean distance as our distance metric.

    The problem with the above NN algorithm is that it is prone to error when the data are noisy. An alternative is the K-nearest neighbor (KNN) algorithm. The idea is the same as in the NN algorithm, except that instead of finding the single nearest datum you look at the k nearest data and choose the label by majority vote. In case of a tie, you may increase k until the tie is resolved. (A sketch of both follows below.)
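    A minimal sketch of NN/KNN classification with Euclidean distance (trainX, trainLabels, and testX are my own placeholder names):

        % trainX: D x M matrix, one training vector per column.
        % trainLabels: 1 x M label vector. testX: D x 1 test vector.
        k = 5;                                    % k = 1 gives plain NN
        diffs = bsxfun(@minus, trainX, testX);    % subtract test vector
        dist  = sqrt(sum(diffs.^2, 1));           % Euclidean distances
        [~, order] = sort(dist);                  % nearest first
        votes = trainLabels(order(1:k));          % labels of the k nearest data
        label = mode(votes);                      % majority vote; note that mode
                                                  % breaks ties toward the smaller
                                                  % label, while the assignment
                                                  % suggests increasing k instead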

  3. Data Partition. To test the algorithm, you need to partition the data in a way that lets you evaluate the performance of your algorithm. We have the data of four people; call them A, B, C, and D. Refer to each audio file (or the corresponding feature vector) as Zij, where
    1. Z = {A, B, C, D} refers to the person,
    2. i = {1, 2, 3, 4, 5} refers to the digit spoken, and
    3. j = {a, b, c, d, e} refers to the particular utterance of the digit.

Experiments

The various experiments that you are required to do are as follows. Note that you are going to run each experiment with every kind of feature described above: the raw features, the Cepstrum coefficients with window sizes W = 100, 500, and 10000, and the MFCCs.

  1. Speech Recognition Experiments. Take one datum out of your data set, say A1a. Remove from the data set all the data from the same person (Aij for all i, j), so that recognition is speaker-independent. Compute the distance of the held-out datum from all the remaining data, and let Zij be the nearest point. If i = 1, you have recognized the word correctly; otherwise you have made an error. Repeat this for every datum in your data set. Report the performance (the percentage of data that are correctly recognized) overall, for each digit, and for each person. In each case, report results for both the nearest neighbor (NN) algorithm and the K-nearest neighbor (KNN) algorithm with k = 5. (A sketch of this loop appears after this list.)
  2. Speaker Recognition Experiments. Take one datum out of your data set, say A1a. Remove from the data set all the data for the same digit (Z1j for all Z, j), so that recognition is text-independent. Compute the distance of the held-out datum from all the remaining data, and let Zij be the nearest point. If Z = A, you have recognized the person correctly; otherwise you have made an error. Repeat this for every datum. Report the performance overall, for each digit, and for each person. In each case, report results for both the NN algorithm and the KNN algorithm with k = 5.
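  A sketch of the leave-one-out loop for the speech recognition experiment (X, person, and digit are my own bookkeeping arrays; the speaker experiment swaps the exclusion rule and the correctness test as noted in the comments):

      % X: D x 100 matrix of feature vectors (4 people x 5 digits x 5
      % utterances). person: 1 x 100 labels in {1..4}; digit: 1 x 100
      % labels in {1..5}.
      correct = 0;
      for t = 1:size(X, 2)
          keep = (person ~= person(t));   % speech: drop the same person
          % keep = (digit ~= digit(t));   % speaker: drop the same digit
          d    = sqrt(sum(bsxfun(@minus, X(:, keep), X(:, t)).^2, 1));
          lab  = digit(keep);             % speaker: use person(keep)
          [~, order] = sort(d);
          guess = mode(lab(order(1:5)));  % KNN with k = 5; order(1) for NN
          correct = correct + (guess == digit(t));  % speaker: person(t)
      end
      accuracy = correct / size(X, 2);    % overall recognition rate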

Extra Credit