MP7b: Shot Boundary Detection in Videos

Due date: May 5 before midnight

Overview

In this machine problem, you will use vision and speech features to develop a shot boundary detection algorithm. A shot is a continuous video segment with no significant content change between successive frames; frame pairs with a large content change are termed shot boundaries. Most existing methods detect shot boundaries by measuring the frame-to-frame content change with some distance measure and flagging a boundary whenever the distance exceeds a predefined threshold.

Data

In this MP, you will work independently with the audio and the video to determine the shot boundaries. The features you will work with are:

  1. Video: Primarily, you are expected to use color features. Divide each image into four parts. Each pixel is represented by its r, g, b values, but you are advised to work in the normalized rgb domain: define nr = r/(r+g+b) and ng = g/(r+g+b). These are the two color components corresponding to each pixel, and under this representation the color is more robust to lighting variations. For each of the four parts of the image, form an 8x8-bin histogram over (nr, ng): both nr and ng lie between 0 and 1, so divide each range into 8 parts and count the number of pixels falling in each bin. Each of the four parts of the image is then represented by an 8x8 = 64-dimensional vector, and the four vectors combined give a 256-dimensional vector per frame. The distance between two consecutive frames is measured by the Euclidean distance between their vectors. If the distance exceeds some threshold (that you have to choose), that pair of frames is declared a shot boundary. (A sketch of this feature computation is given after this list.)
  2. Audio: For audio, you are expected to use 12 cepstral coefficients, energy (the sum of squares of the raw signal values), and the zero-crossing rate (ZCR). Given a signal S(1), ..., S(N), ZCR = sum(|sign(S(2:N)) - sign(S(1:N-1))|). Form a 14-dimensional feature vector for each window and detect shot boundaries by measuring the distance between successive (non-overlapping) windows. In general, the energy term may dominate, so you may want to normalize it. Since you will report the shot boundaries in terms of video frame numbers, choose the window size appropriately, e.g., so that one window spans the audio samples of one video frame. (A sketch of this feature computation is also given after this list.)
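
The following is a minimal MATLAB sketch of the video feature and frame-to-frame distance described in item 1. The frame file naming pattern, frame count, and threshold value are assumptions, not part of the assignment; adapt them to the frames you download (on MATLAB versions older than R2016b, put quadrant_hist in its own file).

    % Sketch: 2-D (nr, ng) histograms over four image quadrants, per frame.
    % File name pattern, frame count, and threshold are placeholders.
    nbins   = 8;
    thresh  = 0.5;                         % to be tuned; argue for your choice
    nframes = 2200;                        % approximate count from the handout
    feat = @(f) quadrant_hist(double(imread(f)), nbins);

    prev = feat('frame0001.jpg');
    boundaries = [];
    for k = 2:nframes
        cur = feat(sprintf('frame%04d.jpg', k));
        if norm(cur - prev) > thresh       % Euclidean distance between frames
            boundaries(end+1) = k;         %#ok<AGROW>
        end
        prev = cur;
    end

    function v = quadrant_hist(img, nbins)
    % 256-dim feature: an nbins-by-nbins (nr, ng) histogram from each quadrant.
    [H, W, ~] = size(img);
    s  = sum(img, 3) + eps;                % r+g+b, eps avoids division by zero
    nr = img(:,:,1) ./ s;
    ng = img(:,:,2) ./ s;
    rows = {1:floor(H/2), floor(H/2)+1:H};
    cols = {1:floor(W/2), floor(W/2)+1:W};
    v = [];
    for i = 1:2
        for j = 1:2
            a = nr(rows{i}, cols{j});  b = ng(rows{i}, cols{j});
            ia = min(floor(a * nbins), nbins - 1);   % bin indices 0..nbins-1
            ib = min(floor(b * nbins), nbins - 1);
            h  = accumarray(ia(:) * nbins + ib(:) + 1, 1, [nbins * nbins, 1]);
            v  = [v; h / numel(a)];        % normalized counts (a choice)
        end
    end
    end

Normalizing each quadrant's counts by its pixel count is a design choice, not part of the assignment text; it makes the distance independent of image resolution and the threshold easier to tune.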

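Similarly, a minimal MATLAB sketch of the audio feature from item 2, assuming a WAV file name and a video frame rate of 30 fps. The 12 cepstral coefficients are computed here as a simple real cepstrum via the FFT, which you may replace with coefficients from a toolbox if you prefer.

    % Sketch: 14-dim audio feature (12 cepstral coeffs, energy, ZCR) per
    % non-overlapping window; one window per video frame.
    [s, fs] = audioread('video1.wav');     % placeholder file name
    s   = mean(s, 2);                      % mix to mono if stereo
    fps = 30;                              % assumed video frame rate
    win = floor(fs / fps);                 % samples per video frame
    nwin = floor(length(s) / win);

    F = zeros(14, nwin);
    for k = 1:nwin
        x = s((k-1)*win + 1 : k*win);
        c = real(ifft(log(abs(fft(x)) + eps)));   % real cepstrum
        energy = sum(x.^2);                % consider normalizing this term
        zcr = sum(abs(sign(x(2:end)) - sign(x(1:end-1))));
        F(:, k) = [c(2:13); energy; zcr];  % 12 cepstral coeffs + energy + ZCR
    end

    % Euclidean distance between successive windows; peaks suggest boundaries.
    d = sqrt(sum(diff(F, 1, 2).^2, 1));
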
You are given two videos, each a little over one minute long. For your convenience, the frames have been extracted. You can download all the frames (one zip file per video, about 2200 frames each) and the audio files (in WAV format) from the course website (links below).

Download:

Video 1; Video 2;

What to hand in?

  1. Narrative report and run.m package, as standard for all MPs.
  2. First, browse the video frames and try to locate the shot boundaries manually (for example, by viewing the directory contents in thumbnail mode in Windows XP). Report the boundaries in terms of video frame numbers.
  3. Using the different feature vectors (video only, audio only, and combined audio-video), develop a shot boundary detection algorithm. Compare the shot boundaries located manually with those detected by your algorithms, and argue for your particular choice of threshold (a simple adaptive-threshold sketch is given after this list). Expect around 16 shot boundaries in video 1; for video 2, you will have to decide the number of shot boundaries yourself. Can you explain the differences between the audio-based and video-based shot boundaries?
  4. Please submit your report and code to compass.
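
As one possible way to choose the threshold in item 3, you can set it adaptively from the statistics of the frame-to-frame distances rather than fixing it by hand. This is only a sketch; the multiplier kappa is an assumption you would tune and justify in your report.

    % Sketch: adaptive threshold on a vector d of frame-to-frame distances.
    kappa  = 3;                            % design choice to tune and justify
    thresh = mean(d) + kappa * std(d);
    boundaries = find(d > thresh) + 1;     % d(i) compares frames i and i+1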