In this MP, you will construct an LSTM layer using PyTorch for use with the system described in Tulyakov, Sergey, et al. "MoCoGAN: Decomposing motion and content for video generation" (CVPR 2018). The file you actually need to complete is mp6.py. The unit tests are provided in run_tests.py and tests/test_visible.py. All of these are available as part of the code package, https://courses.engr.illinois.edu/ece417/fa2020/ece417_20fall_mp6.zip.

In [ ]:

```
import numpy as np
import matplotlib.figure
import matplotlib.pyplot as plt
%matplotlib inline
import torch
torch.manual_seed(417)
np.random.seed(417)
```

In [ ]:

```
import mp6
import importlib
importlib.reload(mp6)
```

A number of blocks below have two options: you can either show your own results, or you can show the distributed solutions. To choose which one you want to see, uncomment exactly one line of each indicated pair.

In [ ]:

```
import json
with open('solutions.json') as f:
    solutions = json.load(f)
```

For this MP, we will be using the PyTorch machine learning framework in addition to NumPy.

PyTorch has quickly gained widespread use, both in research and in production, thanks to the depth of its functionality and to how much low-level backend work the library handles on the user's behalf, including and especially its simple automatic gradient calculation interface.

A number of comparisons to NumPy may hint at the ease with which PyTorch may be used. To start with the basics, just as the primary object of manipulation in NumPy is the N-dimensional array (ndarray), PyTorch's object of choice is the N-dimensional *tensor*. The two behave very similarly, since many methods used in PyTorch are designed with their direct equivalents in NumPy in mind:

In [ ]:

```
print(torch.zeros(5), np.zeros(5))
print(torch.ones(5), np.ones(5))
randa, randb = torch.randn(5), np.random.randn(5)
print(randa, randb)
print(torch.stack([torch.randn(4) for _ in range(3)]), np.stack([np.random.randn(4) for _ in range(3)]))
print(torch.cos(torch.sin(randa)), np.cos(np.sin(randb)))
```

Of course, as useful as these simple functions may be, they're certainly not the entire story.

A typical PyTorch model consists of, at minimum, a class with two methods: 1) an initialization method (as with any Python class), in which one assembles a number of layers into a coherent whole, and 2) a `forward` method, in which said layers are used in the forward propagation of a number of inputs.
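This two-method pattern can be sketched with a toy module (the name `TinyModel` is purely illustrative and not part of the MP):

```python
import torch

class TinyModel(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(TinyModel, self).__init__()
        # assemble the layers in the initializer...
        self.linear = torch.nn.Linear(input_size, output_size)

    def forward(self, x):
        # ...and apply them in forward
        return self.linear(x)

model = TinyModel(3, 2)
y = model(torch.randn(4, 3))   # calling the model invokes forward
print(y.shape)                 # torch.Size([4, 2])
```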

- The first two methods you will be implementing are the initialization and `forward` methods of a fully connected layer, `MyLinear`:

In [ ]:

```
input_dim = solutions["linear_input_size"]
output_dim = solutions["linear_output_size"]
# Uncomment exactly one of the following lines:
# linear_layer = torch.nn.Linear(input_dim,output_dim)
# linear_layer = mp6.MyLinear(input_dim,output_dim)
linear_layer.weight = torch.nn.Parameter(torch.Tensor(solutions["linear_weight"]))
linear_layer.bias = torch.nn.Parameter(torch.Tensor(solutions["linear_bias"]))
input_array = torch.Tensor(solutions["linear_input"])
```

As you saw, the weight and bias of the linear layer can be set to arbitrary parameters, which we can verify visually:

In [ ]:

```
for parameter_name, parameter_value in linear_layer.named_parameters():
    print(parameter_name, parameter_value)
```

This linear layer can, by itself, be treated as an entire model, as you will see below.

To obtain an output from this new model, we can just call the model itself with an input (the arguments after `self` in the `forward` method):

In [ ]:

```
current_output = linear_layer(input_array)
print(current_output)
```

Now how do we meaningfully train this model? We can use any of the loss functions provided in `torch.nn`, such as a mean square error loss, to obtain a metric for model performance:

In [ ]:

```
# Uncomment exactly one of the following lines:
# desired_output = torch.randn_like(current_output)
# desired_output = torch.Tensor(solutions["linear_output"])
current_loss = torch.nn.MSELoss()(current_output,desired_output)
print(current_loss)
```
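As a sanity check, `torch.nn.MSELoss` with its default `reduction='mean'` is just the mean of the elementwise squared differences, which is easy to verify by hand:

```python
import torch

a = torch.randn(6)
b = torch.randn(6)
# MSELoss averages the squared differences over all elements
loss_module = torch.nn.MSELoss()(a, b)
loss_manual = ((a - b) ** 2).mean()
print(torch.allclose(loss_module, loss_manual))  # True
```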

To then obtain the gradients of this loss, all we need to do is call the loss tensor's `backward` method, and then the gradient of the loss with respect to every parameter in our model will be defined (in each parameter's `.grad` attribute):

In [ ]:

```
current_loss.backward()
for parameter_name, parameter_value in linear_layer.named_parameters():
    print(parameter_name, parameter_value.grad)
```
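The same autograd machinery works on any tensor with `requires_grad=True`, not just model parameters; a minimal standalone example:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x    # y = x^2 + 3x
y.backward()          # populates x.grad with dy/dx = 2x + 3
print(x.grad)         # tensor(7.)
```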

We can then use an optimizer, such as stochastic gradient descent (with a learning rate `lr` and momentum), to adjust those parameters based on the resulting gradients:

In [ ]:

```
optimizer = torch.optim.SGD(linear_layer.parameters(),lr=0.5,momentum=0.)
optimizer.step()
```
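With zero momentum, the SGD step is just `w ← w − lr · grad`, which we can confirm on a self-contained toy tensor:

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                          # grad = 2w
expected = w.detach() - 0.5 * w.grad     # plain SGD update, computed by hand
torch.optim.SGD([w], lr=0.5, momentum=0.).step()
print(torch.allclose(w.detach(), expected))  # True
```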

Now we can print the parameters of our model after this step:

In [ ]:

```
for parameter_name, parameter_value in linear_layer.named_parameters():
    print(parameter_name, parameter_value)
```

Notice anything different? Think you *should* have done more to get this change? :-)
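One behavior worth knowing here (and possibly relevant to the question above): PyTorch *accumulates* gradients across `backward` calls rather than overwriting them, which is why training loops normally call `optimizer.zero_grad()` before each backward pass. A minimal demonstration on a standalone tensor:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
(x ** 2).backward()    # grad = 2x = 6
(x ** 2).backward()    # gradients ACCUMULATE: grad is now 12, not 6
print(x.grad)          # tensor(12.)
x.grad = None          # what optimizer.zero_grad() does for each parameter
(x ** 2).backward()
print(x.grad)          # tensor(6.)
```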

- The next two methods you will be implementing are the initialization and `forward` methods of a single LSTM cell, `MyLSTMCell`:

In [ ]:

```
input_dim = solutions["lstmcell_input_size"]
hidden_dim = solutions["lstmcell_hidden_size"]
lstmcell_weight_ih = solutions["lstmcell_weight_ih"]
lstmcell_weight_hh = solutions["lstmcell_weight_hh"]
lstmcell_bias_ih = solutions["lstmcell_bias_ih"]
lstmcell_bias_hh = solutions["lstmcell_bias_hh"]
# Uncomment exactly one of the following lines:
# lstmcell_layer = torch.nn.LSTMCell(input_dim, hidden_dim)
# lstmcell_layer = mp6.MyLSTMCell(input_dim, hidden_dim)
lstmcell_layer.weight_ih = torch.nn.Parameter(torch.Tensor(lstmcell_weight_ih))
lstmcell_layer.weight_hh = torch.nn.Parameter(torch.Tensor(lstmcell_weight_hh))
lstmcell_layer.bias_ih = torch.nn.Parameter(torch.Tensor(lstmcell_bias_ih))
lstmcell_layer.bias_hh = torch.nn.Parameter(torch.Tensor(lstmcell_bias_hh))
h_in = torch.Tensor(solutions["lstmcell_h_init"])
c_in = torch.Tensor(solutions["lstmcell_c_init"])
input_array = torch.Tensor(solutions["lstmcell_input"])
```

In [ ]:

```
lstmcell_layer(input_array, (h_in, c_in))
```
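Under the hood, a call like the one above computes four gates from the input and hidden state. Per PyTorch's documented layout, `weight_ih` and `weight_hh` stack the input, forget, cell, and output gate weights in that order, which we can verify against `torch.nn.LSTMCell` directly:

```python
import torch

torch.manual_seed(0)
cell = torch.nn.LSTMCell(3, 4)   # input_size=3, hidden_size=4
x = torch.randn(2, 3)            # batch of 2
h = torch.zeros(2, 4)
c = torch.zeros(2, 4)

# All four gates at once; PyTorch stacks them in the order i, f, g, o
gates = x @ cell.weight_ih.t() + cell.bias_ih + h @ cell.weight_hh.t() + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
h_new = torch.sigmoid(o) * torch.tanh(c_new)

h_ref, c_ref = cell(x, (h, c))
print(torch.allclose(h_new, h_ref, atol=1e-5), torch.allclose(c_new, c_ref, atol=1e-5))
```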

The layers of a neural network are all organized into *modules*, which when combined form a graph of computations (think the graph in lecture 19 slide 5) based on which deeper and deeper gradients can be computed.

If you wish to probe specific layers in your model later, you can of course declare them separately and compose them in your forward layer:

In [ ]:

```
class MyFirstModule(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(MyFirstModule, self).__init__()
        self.linear1 = torch.nn.Linear(input_size, 35)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(35, 100)
        self.relu6 = torch.nn.ReLU6()
        self.linear3 = torch.nn.Linear(100, output_size)
        self.selu = torch.nn.SELU()

    def forward(self, module_input):
        return self.selu(
            self.linear3(
                self.relu6(
                    self.linear2(
                        self.relu(
                            self.linear1(module_input))))))
```

If you aren't particularly concerned about probing the layers later, you can compose them on initialization and cut the size of your `forward` method:

In [ ]:

```
class MyFirstModule(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(MyFirstModule, self).__init__()
        # Note: torch.nn.Sequential takes modules as separate arguments
        # (or an OrderedDict), not a list
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(input_size, 35),
            torch.nn.ReLU(),
            torch.nn.Linear(35, 100),
            torch.nn.ReLU6(),
            torch.nn.Linear(100, output_size),
            torch.nn.SELU()
        )

    def forward(self, module_input):
        return self.layers(module_input)
```
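A quick standalone check of the `Sequential` pattern (layer sizes here are arbitrary, chosen only for illustration):

```python
import torch

torch.manual_seed(0)
# Sequential chains the modules in order; each output feeds the next input
net = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 2),
)
out = net(torch.randn(5, 8))
print(out.shape)  # torch.Size([5, 2])
```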

The remaining four methods you will be implementing pertain to a module, named here simply `LSTM`, which uses both the linear layer and the LSTM cell layer you implemented:

- the initialization method;
- a method which initializes all of the parameters of the model;
- a method which sets an initial hidden state and memory cell; and
- the `forward` method.
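The core of such a module is unrolling the cell over time and projecting each hidden state through the linear layer. A rough sketch of that loop, using the built-in layers and made-up dimensions (this is not the exact `mp6.LSTM` interface; consult the assignment for the real signatures):

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim, batch_size, n_frames = 3, 4, 2, 5
cell = torch.nn.LSTMCell(input_dim, hidden_dim)
linear = torch.nn.Linear(hidden_dim, input_dim)

h = torch.zeros(batch_size, hidden_dim)
c = torch.zeros(batch_size, hidden_dim)
x = torch.randn(batch_size, input_dim)   # e.g. a noise input
outputs = []
for _ in range(n_frames):
    h, c = cell(x, (h, c))               # one recurrent step
    outputs.append(linear(h))            # project hidden state to an output
result = torch.stack(outputs)            # (n_frames, batch_size, input_dim)
print(result.shape)                      # torch.Size([5, 2, 3])
```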

Now what might we be using this module for?

In [ ]:

```
input_dim = solutions["lstmcell_input_size"]
hidden_dim = solutions["lstmcell_hidden_size"]
batch_size = solutions["lstm_batch_size"]
forget_bias = solutions["lstm_forget_bias"]
n_frames = solutions["lstm_n_frames"]
lstmcell_weight_ih = torch.Tensor(solutions["lstmcell_weight_ih"])
lstmcell_weight_hh = torch.Tensor(solutions["lstmcell_weight_hh"])
lstmcell_bias_ih = torch.Tensor(solutions["lstmcell_bias_ih"])
lstmcell_bias_hh = torch.Tensor(solutions["lstmcell_bias_hh"])
lstm_linear_weight = torch.Tensor(solutions["lstm_linear_weight"])
lstm_linear_bias = torch.Tensor(solutions["lstm_linear_bias"])
lstm_inputs = torch.Tensor(solutions["lstm_inputs"])
lstm_h = torch.Tensor(solutions["lstm_h"])
lstm_c = torch.Tensor(solutions["lstm_c"])
lstmlayer = mp6.LSTM(input_dim, hidden_dim)
lstmlayer.lstm.weight_ih = torch.nn.Parameter(lstmcell_weight_ih)
lstmlayer.lstm.bias_ih = torch.nn.Parameter(lstmcell_bias_ih)
lstmlayer.lstm.weight_hh = torch.nn.Parameter(lstmcell_weight_hh)
lstmlayer.lstm.bias_hh = torch.nn.Parameter(lstmcell_bias_hh)
lstmlayer.linear.weight = torch.nn.Parameter(lstm_linear_weight)
lstmlayer.linear.bias = torch.nn.Parameter(lstm_linear_bias)
lstmlayer.h = torch.autograd.Variable(torch.Tensor(lstm_h))
lstmlayer.c = torch.autograd.Variable(torch.Tensor(lstm_c))
# Uncomment the following lines to run your LSTM and check it against the distributed solution:
# lstm_output = lstmlayer(lstm_inputs, n_frames)
# lstm_output == torch.Tensor(solutions["lstm_output"])
```

You have seen in lecture a discussion of Deep Voxel Flow, which has been used to bilinearly interpolate between two images in time using a procedure quite similar to barycentric coordinates. For this MP, however, we will be generating (hallucinating?) video frames from scratch using a simple GAN.

The typical GAN consists of a *generator*, which transforms a vector from what can be considered a *latent space* of possible candidate data into data, and a *discriminator*, which attempts to distinguish real data drawn from a dataset from data produced by the generator. For images and other high-dimensional data, the discriminator is usually implemented as a convolutional neural network (remember lectures 9 and 10?), while the generator is conversely implemented as a *de*convolutional neural network (that is, the reverse process).

The networks are called *adversarial* because the two models are trained in alternation over some period: the generator learns to produce better-quality images (an "attempt" to "fool" the discriminator), while the discriminator gets better at classifying real images from fake (an "attempt" to "not be fooled" by the generator).
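That alternation can be sketched with tiny linear networks on 1-D data. This is purely illustrative (MoCoGAN's actual architectures and losses differ); it only shows the two-step update pattern:

```python
import torch

torch.manual_seed(0)
G = torch.nn.Linear(2, 1)                                            # toy generator: latent -> "data"
D = torch.nn.Sequential(torch.nn.Linear(1, 1), torch.nn.Sigmoid())   # toy discriminator
opt_G = torch.optim.SGD(G.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)
bce = torch.nn.BCELoss()

for _ in range(3):                                  # alternate the two updates
    real = torch.randn(8, 1) + 2.0                  # "real" samples
    fake = G(torch.randn(8, 2))                     # generated samples
    # discriminator step: push real toward 1, fake toward 0
    opt_D.zero_grad()
    d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
    d_loss.backward()
    opt_D.step()
    # generator step: try to make the discriminator output 1 on fakes
    opt_G.zero_grad()
    g_loss = bce(D(G(torch.randn(8, 2))), torch.ones(8, 1))
    g_loss.backward()
    opt_G.step()
print(d_loss.item(), g_loss.item())
```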

(A more detailed discussion of GANs in general is to come in Thursday's lecture; stay tuned!)

In MoCoGAN, there are in fact two discriminators; one which discriminates between real and fake videos $D_V$ and another which discriminates between real and fake images (that is, the frames of those videos) $D_I$. Since the generator $G_I$ used in MoCoGAN only produces frames—which are later concatenated into videos—any improvements in the generator in this regard must come from whatever produces the input to the generator which corresponds to the animation in the output, such as the LSTM (the unlabeled pink element in the diagram).

The input to the generator in this case consists of a sequence of vectors $\{\mathbf{z}_M^{(k)}\}_{k=1}^{K}$ from what is considered the *mo*tion latent subspace of the overall latent space of candidate videos, and a single vector $\mathbf{z}_C$ from a corresponding *co*ntent latent subspace, which is kept constant across frames. The former is generated by feeding an input, such as random noise, to an LSTM and propagating it over a number of timesteps.
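Shape-wise, the fixed content code is repeated across frames and concatenated with the per-frame motion codes (all dimensions below are made up for illustration):

```python
import torch

torch.manual_seed(0)
batch, n_frames, d_content, d_motion = 2, 4, 6, 3
z_C = torch.randn(batch, d_content)                        # one content code per video
z_C_rep = z_C.unsqueeze(1).expand(batch, n_frames, d_content)  # repeat it for every frame
z_M = torch.randn(batch, n_frames, d_motion)               # stand-in for the LSTM's per-frame outputs
z = torch.cat([z_C_rep, z_M], dim=2)                       # per-frame generator input
print(z.shape)  # torch.Size([2, 4, 9])
```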

The developers of MoCoGAN demonstrated the effectiveness of this GAN with, among other datasets as a basis, a collection of videos of tai chi (some sample frames from the resulting generator are shown below). The specific implementation of this GAN that you are provided has been trained on the classification database from "Actions as Space-Time Shapes" (ICCV 2005, PAMI 2007) and outputs 96x96 videos roughly 2-4 seconds long. (If they were any larger, though not necessarily if they were any longer, you might run into problems running this if you lack a GPU.)