ECE 417/MP7: Audio-to-Video Animation

Overview

In this machine problem, you will become familiar with face modeling and animation techniques, learn to use ANNs (artificial neural networks) to map features of speech signals to facial animation parameters, and produce facial animation sequences from audio tracks.

Useful Files

The Audio-Visual Database

The pre-processed database is provided as a MATLAB MAT-file, ECE417_MP5_AV_DATA.mat. This file contains the following four MATLAB variables:

av_train
av_train is a structure variable storing the audio-visual data for training the ANN. It has the following elements:
- av_train.audio is a matrix of audio features. Its kth column, av_train.audio(:,k), is the audio feature vector of frame k of the audio.
- av_train.visual is a matrix of visual features. Its kth column, av_train.visual(:,k), is the visual feature vector of frame k of the video. A visual feature vector contains three numbers: the change in lip width (Δw=w-w0), the change in upper-lip height (Δh1=h1-h10), and the change in lower-lip height (Δh2=h2-h20), where w0, h10, and h20 are the width, upper-lip height, and lower-lip height of the neutral lips (see Figure 1).
  
  Figure 1. The visual features.
av_validate
av_validate is a structure variable storing the audio-visual data for validation of the ANN. It has the following elements:
- av_validate.audio is the audio feature matrix. Each column av_validate.audio(:,k) represents the audio feature vector of frame k.
- av_validate.visual is the visual feature matrix. Each column av_validate.visual(:,k) represents the visual feature vector of frame k.
testAudio
testAudio is the test audio data matrix. Each column is an audio feature vector.
silenceModel
silenceModel will be used to decide if an audio frame corresponds to silence.
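
The column layout of the feature matrices described above can be illustrated with a small stand-in structure. This is only a sketch: the struct toy, the audio feature dimension of 12, and the neutral-lip values w0, h10, h20 are made-up numbers for illustration and are not taken from ECE417_MP5_AV_DATA.mat.

```matlab
% Toy stand-in with the same field layout as av_train (all values are made up).
toy.audio  = rand(12, 5);                            % 12 is an arbitrary feature dimension
toy.visual = [6 4 0 -2 1; 2 1 0 0 3; 3 2 0 -1 2];    % rows: dw, dh1, dh2

k = 1;
audioVec  = toy.audio(:, k);     % audio feature vector of frame k
visualVec = toy.visual(:, k);    % [dw; dh1; dh2] of frame k

% Recover absolute lip dimensions from the deltas, given neutral values.
% w0, h10, h20 are illustrative numbers, not values from the database.
w0 = 60; h10 = 10; h20 = 12;
lipDims = visualVec + [w0; h10; h20];   % [w; h1; h2]
```

With the real data, the same indexing applies to av_train.audio, av_train.visual, and the av_validate fields after loading the MAT-file.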

Data for Producing Animation

Figure 2: Mouth image (left); Mouth image with triangular mesh (right)

In this machine problem, facial animation is achieved by image warping. Two files are provided for image warping. In addition, a waveform file is provided as the sound track corresponding to testAudio for making the final movie file.

mouth.jpg
A neutral mouth image will be provided (see Figure 2). You will use it to generate a mouth animation image sequence.
mesh.txt
A triangular mesh that triangulates the mouth area in mouth.jpg (see Figure 2). You will use this mesh and the mouth image to generate new mouth images through image warping. The format of the file mesh.txt is:
1. Number of vertices
2. x coordinate of vertex 1, y coordinate of vertex 1
3. x coordinate of vertex 2, y coordinate of vertex 2
4. ...
5. Number of triangles
6. vertex 1, vertex 2, vertex 3 (of the 1st triangle)
7. ...
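
The file layout listed above could be parsed along the following lines. This is a minimal sketch under stated assumptions: the exact separator in mesh.txt (comma or whitespace) is not specified here, so commas are converted to spaces before parsing, and the triangle entries are assumed to be 1-based vertex indices. The variable names and the toy file contents are illustrative only.

```matlab
% Write a toy mesh file in the layout described above (two triangles sharing an edge).
% With the real data, set meshFile = 'mesh.txt' and skip the fprintf/delete lines.
meshFile = [tempname, '.txt'];
fid = fopen(meshFile, 'w');
fprintf(fid, '4\n10, 10\n50, 10\n50, 40\n10, 40\n2\n1, 2, 3\n1, 3, 4\n');
fclose(fid);

% Read every number in the file; commas are treated as whitespace (assumption).
txt = fileread(meshFile);
txt(txt == ',') = ' ';
vals = sscanf(txt, '%f');

numVerts = vals(1);
verts = reshape(vals(2 : 1 + 2*numVerts), 2, numVerts)';       % numVerts x 2, rows are [x y]
numTris = vals(2 + 2*numVerts);
first = 3 + 2*numVerts;
tris = reshape(vals(first : first + 3*numTris - 1), 3, numTris)';   % numTris x 3 vertex indices

delete(meshFile);   % remove the toy file
```

Storing vertices as rows of [x y] and triangles as rows of three indices makes it easy to look up the corners of triangle t as verts(tris(t,:), :).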
test.wav
The waveform file corresponding to the audio feature matrix testAudio.

Tasks

Write your image warping code in MATLAB. The code takes the visual features as input and synthesizes new mouth images. (We recommend doing this part first.)
Load pre-processed training data from ECE417_MP5_AV_DATA.mat.
Use the training data set to train a set of ANNs as the mapping from audio features to visual features. The MATLAB code ECE417_MP5_AV_train is provided.
Apply the mapping to the test audio features and obtain synthetic visual features. The MATLAB code ECE417_MP5_AV_test is provided.
Produce an image sequence from the synthesized visual features.

Detailed Description

Image warping: First, deform the mesh according to the visual features. The positions of the deformed mesh vertices are obtained by interpolation from the visual features; a MATLAB function “interpVert” that uses linear interpolation will be provided. Then write a warpimg function that generates the deformed mouth image from the given mouth image (see Figure 2). Leave pixels outside the mesh, and pixels in the holes inside the lips, black.
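
One common way to warp a single triangle is inverse mapping with barycentric coordinates: each pixel of the deformed triangle is expressed as a weighted sum of the deformed vertices, and the same weights applied to the neutral vertices give the pixel to copy from. The sketch below shows this for one triangle on toy data. It is not necessarily the approach the course expects; the variable names, the toy image, and the nearest-neighbour sampling are all assumptions, and the interface of interpVert is not shown here, so the deformed vertices are simply made up.

```matlab
% Toy data: a grayscale ramp stands in for mouth.jpg, and one triangle
% stands in for one triangle of the mesh. All names are illustrative.
img = uint8(repmat(1:60, 40, 1));          % 40 x 60 image, pixel value equals x
srcTri = [10 10; 50 10; 10 30];            % neutral vertices, rows are [x y]
dstTri = srcTri + repmat([5 3], 3, 1);     % deformed vertices (made up)
out = zeros(size(img), class(img));        % black background

[H, W, C] = size(out);
% Pixels in the bounding box of the deformed triangle, clipped to the image.
xs = max(1, floor(min(dstTri(:,1)))) : min(W, ceil(max(dstTri(:,1))));
ys = max(1, floor(min(dstTri(:,2)))) : min(H, ceil(max(dstTri(:,2))));
[X, Y] = meshgrid(xs, ys);
P = [X(:)'; Y(:)'; ones(1, numel(X))];

% Barycentric coordinates of each pixel with respect to the deformed triangle.
lambda = [dstTri'; 1 1 1] \ P;
inside = all(lambda >= -1e-9, 1);          % small tolerance keeps edge pixels

% The same weights applied to the neutral triangle give the source position.
src = srcTri' * lambda(:, inside);
sx = min(max(round(src(1,:)), 1), size(img, 2));
sy = min(max(round(src(2,:)), 1), size(img, 1));
dx = P(1, inside);
dy = P(2, inside);

% Copy with nearest-neighbour sampling, one colour channel at a time.
for c = 1:C
    dstIdx = dy + (dx - 1) * H + (c - 1) * H * W;
    srcIdx = sy + (sx - 1) * size(img, 1) + (c - 1) * size(img, 1) * size(img, 2);
    out(dstIdx) = img(srcIdx);
end
```

Repeating these steps for every triangle of the mesh fills only the pixels covered by the mesh, so pixels outside it, and in any region the mesh does not cover, stay black. A degenerate (zero-area) triangle makes the 3 x 3 system singular and would need to be skipped.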
Load pre-processed data.
ANN training and testing:
1. The MATLAB function ECE417_MP5_train will be provided. One parameter (the number of hidden units) can be adjusted to get good mapping results.
2. The MATLAB function ECE417_MP5_test will be provided.
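
One way to compare different numbers of hidden units is to measure the error between the predicted and the true visual features on the validation set. The sketch below computes a mean squared error on toy matrices. The names pred and target and their values are assumptions for illustration; how predictions are obtained depends on the provided training and testing functions, whose interfaces are not shown here.

```matlab
% Toy stand-ins: 3 x N matrices of visual features (columns are frames).
% With real data, target would be av_validate.visual and pred the ANN output.
target = [1 2 3 4; 0 1 0 1; 2 2 2 2];
pred   = [1 2 3 5; 0 1 1 1; 2 2 2 0];

err = pred - target;
msePerFeature = mean(err .^ 2, 2);     % 3 x 1, one value each for dw, dh1, dh2
mseTotal = mean(sum(err .^ 2, 1));     % mean squared Euclidean error per frame
```

A lower validation error for one setting than another suggests a better mapping, although the final judgment is how the resulting animation looks.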
Use the visual features estimated from the test data, the triangular mesh, and the mouth image to generate the mouth image sequence.
Produce an animation movie file.
1. First, generate the face images from the visual features.
2. Save the images in JPEG format and name them test_####.jpg, where #### is the frame number of the image, starting from 0. For example, the file name for the 15th frame is test_0014.jpg.
3. Use the provided executable DxBMP.exe to convert the image sequence into a movie file. If DxBMP.exe is in the same directory as the images, the command line is "DxBMP -framerate 30 test_*.jpg test.avi". The output movie file is test.avi. More information about DxBMP.exe can be found in the provided DxBMP.htm. Alternatively, use MATLAB's VideoWriter object and the writeVideo function to create the movie inside MATLAB.
4. Open Windows Movie Maker, VirtualDub, iMovie, or any other movie program of your choice. Import the video test.avi and the audio test.wav. Save the resulting file, with both audio and video.
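
The file naming and the VideoWriter alternative from the steps above can be sketched as follows. This is a minimal example on toy frames: the cell array frames, the temporary output folder, and the frame contents are assumptions for illustration, and the frame rate of 30 simply mirrors the DxBMP command line. The resulting test.avi has no sound, so the audio still has to be added in a separate movie program.

```matlab
% Toy frames stand in for the warped mouth images (uint8 RGB arrays).
numFrames = 5;
frames = cell(1, numFrames);
for k = 1:numFrames
    frames{k} = uint8(40 * k * ones(48, 64, 3));
end

outDir = tempname;               % hypothetical output folder
mkdir(outDir);

v = VideoWriter(fullfile(outDir, 'test.avi'));   % default profile for .avi
v.FrameRate = 30;                % must be set before open
open(v);
names = cell(1, numFrames);
for k = 1:numFrames
    names{k} = sprintf('test_%04d.jpg', k - 1);  % frame numbers start at 0
    imwrite(frames{k}, fullfile(outDir, names{k}));
    writeVideo(v, frames{k});
end
close(v);
```

The %04d format pads the frame number with zeros to four digits, so the 15th frame (k = 15) is written as test_0014.jpg.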

What to submit:

Submit the movie file mp5.wmv that you generated, your code, a README file, and a narrative report that briefly describes how you did this machine problem and includes all the sections in the MP rubric. Compress everything into a zip file named xxx.zip, where xxx is your NetID, and upload this file on Compass.