In this MP, you will construct an LSTM layer using PyTorch for use with the system described in Tulyakov, Sergey, et al., "MoCoGAN: Decomposing Motion and Content for Video Generation" (CVPR 2018). The file you actually need to complete, the unit tests, and the tests/ directory are all available as part of the code package.

How to debug

A number of blocks below come in pairs: one shows your own results, and the other shows the distributed solutions. To choose which one you see, simply comment out one line of each specified pair.

PyTorch basics

For this MP, we will be using the PyTorch machine learning framework in addition to NumPy.

PyTorch has quickly gained widespread use, both in research and in production, thanks to the great depth of its repertoire of functionality and to how much low-level backend work the library handles on the user's behalf (including, and especially, its simple automatic gradient calculation interface).

A number of comparisons to NumPy may hint at the ease with which PyTorch may be used. To start with the basics, just as the primary object of manipulation in NumPy is the N-dimensional array (ndarray), PyTorch's object of choice is the N-dimensional tensor. The two behave very similarly, since many methods used in PyTorch are designed with their direct equivalents in NumPy in mind:
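For instance (an illustrative sketch; the specific functions you need for the MP may differ):

```python
import numpy as np
import torch

# Many PyTorch tensor operations mirror their NumPy counterparts one-to-one.
a_np = np.ones((2, 3))                   # 2x3 array of ones
a_pt = torch.ones(2, 3)                  # 2x3 tensor of ones

b_np = np.arange(6).reshape(2, 3)
b_pt = torch.arange(6).reshape(2, 3)

# Elementwise arithmetic and reductions look the same in both libraries:
print((a_np * b_np).sum())               # 15.0
print((a_pt * b_pt).sum())               # tensor(15.)

# Conversion between the two is straightforward:
c_pt = torch.from_numpy(b_np)
c_np = b_pt.numpy()
```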

Of course, as useful as these simple functions may be, they're certainly not the entire story.

A typical PyTorch model consists of, at minimum, a class with two procedures: 1) an initialization method (as with any Python class), in which one assembles a number of layers into a coherent whole, and 2) a forward method, in which said layers are used in the forward propagation of a number of inputs.
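As a minimal illustration (not the MP's actual model), a module with exactly these two methods might look like:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        # 1) Initialization: assemble the layers into a coherent whole.
        super().__init__()
        self.linear = nn.Linear(4, 2)    # maps 4 input features to 2 outputs

    def forward(self, x):
        # 2) Forward: use those layers to propagate the input.
        return self.linear(x)

model = TinyModel()
out = model(torch.randn(3, 4))           # a batch of 3 inputs
print(out.shape)                         # torch.Size([3, 2])
```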

The weight and bias of the linear layer, as you saw, could be manipulated to be arbitrary parameters, which we can verify visually:
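For example, with a small 2-to-1 linear layer (the sizes here are arbitrary, chosen only for the demonstration):

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)

# Overwrite the parameters with arbitrary values of our choosing.
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[1.0, -1.0]]))
    layer.bias.copy_(torch.tensor([0.5]))

# Verify visually:
print(layer.weight)   # Parameter containing: tensor([[ 1., -1.]], ...)
print(layer.bias)     # Parameter containing: tensor([0.5000], ...)
```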

This linear layer, by itself, can now be treated as an entire model if we wanted, as you will see below.

To obtain an output from this new model, we can just call the model itself with an input (the arguments after self in the forward method):
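A quick sketch of this calling convention, using a bare linear layer as the "model":

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)   # a single linear layer, treated as the entire model

x = torch.tensor([[3.0, 4.0]])
y = model(x)              # calls forward(x) under the hood; prefer this call form
print(y.shape)            # torch.Size([1, 1])
```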

Now how do we meaningfully train this model? We can use any of the loss functions provided in torch.nn, such as a mean square error loss, to obtain a metric for model performance:
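For instance, with the mean square error loss (the prediction and target values here are made up for illustration):

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()

prediction = torch.tensor([[0.5]], requires_grad=True)
target = torch.tensor([[1.0]])

loss = criterion(prediction, target)
print(loss.item())   # 0.25, i.e. (0.5 - 1.0) ** 2
```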

To then obtain the gradients of this loss, all we need to do is call the loss tensor's backward method, after which every parameter in our model will have the gradient of the error with respect to it defined:
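A minimal sketch of this (the model and data are toy stand-ins):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
criterion = nn.MSELoss()

x = torch.randn(4, 2)
target = torch.randn(4, 1)

loss = criterion(model(x), target)
loss.backward()   # populates .grad for every parameter that contributed to loss

for name, param in model.named_parameters():
    print(name, param.grad.shape)   # weight: (1, 2); bias: (1,)
```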

We can then use an optimizer to adjust those parameters based on the resulting gradients, such as with stochastic gradient descent (with a learning rate lr and momentum):

Now we can print the parameters of our model after this step:
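Putting the loss, backward pass, and optimizer step together in one toy iteration (again a sketch, with made-up data and hyperparameters):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

before = model.weight.detach().clone()

loss = nn.MSELoss()(model(torch.randn(4, 2)), torch.randn(4, 1))

optimizer.zero_grad()   # clear any stale gradients first
loss.backward()         # compute fresh gradients
optimizer.step()        # adjust the parameters using those gradients

print(before)
print(model.weight)     # the weights moved, though we never touched them directly
```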

Notice anything different? Think you should have done more to get this change? :-)

The layers of a neural network are all organized into modules, which when combined form a graph of computations (think the graph in lecture 19 slide 5) based on which deeper and deeper gradients can be computed.

If you wish to probe specific layers in your model later, you can of course declare them separately and compose them in your forward method:
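For example (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Declared separately, so each layer can be probed later by name.
        self.hidden = nn.Linear(8, 16)
        self.activation = nn.ReLU()
        self.output = nn.Linear(16, 2)

    def forward(self, x):
        return self.output(self.activation(self.hidden(x)))

model = TwoLayerNet()
print(model.hidden.weight.shape)   # torch.Size([16, 8]) -- directly accessible
out = model(torch.randn(3, 8))
```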

If you aren't particularly concerned about probing the layers later, you can compose them on initialization and cut the size of your forward method:
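The same toy network, composed once on initialization with nn.Sequential:

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        # The layers are chained together at construction time.
        self.layers = nn.Sequential(
            nn.Linear(8, 16),
            nn.ReLU(),
            nn.Linear(16, 2),
        )

    def forward(self, x):
        return self.layers(x)   # the forward method shrinks to one line

out = TwoLayerNet()(torch.randn(3, 8))
print(out.shape)                # torch.Size([3, 2])
```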

The remaining four methods you will be implementing pertain to a module, named here simply LSTM, which uses both the linear layer and the LSTM cell layer you implemented:
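The code package defines the exact interface you must implement; purely as a rough sketch of how a linear layer and an LSTM cell might be composed into such a module (all names and shapes below are illustrative, and nn.LSTMCell stands in for your own cell):

```python
import torch
import torch.nn as nn

class LSTM(nn.Module):
    """Illustrative sketch only; the MP's LSTM module has its own interface."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)   # stand-in for your cell
        self.linear = nn.Linear(hidden_size, output_size)  # stand-in for your layer

    def forward(self, inputs):
        # inputs: (timesteps, batch, input_size)
        batch = inputs.shape[1]
        h = torch.zeros(batch, self.hidden_size)
        c = torch.zeros(batch, self.hidden_size)
        outputs = []
        for x_t in inputs:                  # propagate over timesteps
            h, c = self.cell(x_t, (h, c))
            outputs.append(self.linear(h))
        return torch.stack(outputs)         # (timesteps, batch, output_size)

out = LSTM(4, 8, 2)(torch.randn(5, 3, 4))
print(out.shape)                            # torch.Size([5, 3, 2])
```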

Now what might we be using this module for?

What are GANs?

You have seen in lecture a discussion of Deep Voxel Flow, which has been used to bilinearly interpolate between two images in time using a procedure quite similar to barycentric coordinates. For this MP, however, we will be generating (hallucinating?) video frames from scratch using a simple GAN.

The typical GAN consists of a generator of data, which is ultimately a transformation of some vector in what can be considered a latent space of possible candidate data, and a discriminator, which attempts to discern real data drawn from a dataset from data produced by the generator. For images and other high-dimensional data, the discriminator models are usually implemented as convolutional neural networks (remember lectures 9 and 10?), while the generator models are conversely implemented as deconvolutional neural networks (that is, the reverse process).

The networks are called adversarial because these two models are trained in an alternating fashion over some period, with the generator producing better quality images (as an "attempt" to "fool" the discriminator) and the discriminator getting better at classifying real images from fake (as an "attempt" to "not be fooled" by the generator).
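One alternating iteration of this adversarial training might be sketched as follows (a bare-bones toy on 1-D "data", not MoCoGAN's actual training code; all sizes and hyperparameters here are made up):

```python
import torch
import torch.nn as nn

latent_dim = 16

# Tiny toy networks purely to illustrate the alternation.
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 8))
discriminator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                              nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.SGD(generator.parameters(), lr=0.01)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=0.01)
bce = nn.BCELoss()

real = torch.randn(4, 8)          # stand-in for a batch of real data
z = torch.randn(4, latent_dim)    # latent vectors

# 1) Discriminator step: push real toward 1, fake toward 0.
opt_d.zero_grad()
d_loss = bce(discriminator(real), torch.ones(4, 1)) + \
         bce(discriminator(generator(z).detach()), torch.zeros(4, 1))
d_loss.backward()
opt_d.step()

# 2) Generator step: try to make the discriminator output 1 on fakes.
opt_g.zero_grad()
g_loss = bce(discriminator(generator(z)), torch.ones(4, 1))
g_loss.backward()
opt_g.step()
```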

(A more detailed discussion of GANs in general is to come in Thursday's lecture; stay tuned!)


In MoCoGAN, there are in fact two discriminators: one, $D_V$, which discriminates between real and fake videos, and another, $D_I$, which discriminates between real and fake images (that is, the frames of those videos). Since the generator $G_I$ used in MoCoGAN only produces frames, which are later concatenated into videos, any improvements in the generator in this regard must come from whatever produces the input to the generator which corresponds to the animation in the output, such as the LSTM (the unlabeled pink element in the diagram).


The input to the generator in this case consists of a sequence of vectors $\{\mathbf{z}_M^{(i)}\}_{i=1}^{K}$ from what is considered the motion latent subspace of the overall latent space of candidate videos, together with a vector $\mathbf{z}_C$ from a similar content latent subspace, which is kept constant across frames. The former is generated by feeding an input, such as random noise, to an LSTM and propagating it over a number of timesteps.

The developers of MoCoGAN demonstrated the effectiveness of this GAN with, among other datasets as a basis, a collection of videos of tai chi (some sample frames from the resulting generator are shown below). The specific implementation of this GAN that you are provided has been trained on the classification database from "Actions as Space-Time Shapes" (ICCV 2005, PAMI 2007) and outputs 96x96 videos roughly 2-4 seconds long. (If they were any larger, though not necessarily if they were any longer, you might run into problems running this if you lack a GPU.)