Corresponding TA: Raymond Yeh
Release Date: 10/19/17
Due Date: 11/09/17
In this assignment you will:
Derive and learn the concepts of expectation maximization (EM) and the k-means algorithm.
Implement GMM training with k-means or EM.
Implement a semi-supervised learning algorithm with k-means on hand-written digits.
Implement a VAE using TensorFlow to generate hand-written digits.
Problem 1. Given a training dataset $\{x_i\}_{i=1}^N$, let $r_{ik} \in \{0, 1\}$ be the indicator of whether the $i^{th}$ example is in the $k^{th}$ cluster, and let $\mu_k$ be the $k^{th}$ cluster center.
a) What cost function does kMeans optimize?
b) Assume that the cluster assignments $r_{ik}$ are fixed. Show that $\mu_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}$ (the mean of the examples assigned to cluster $k$) minimizes the cost function you wrote in a).
c) Write out the pseudo code of the kMeans algorithm.
d) Show that at each update step of the kMeans algorithm, the cost function you wrote in a) decreases monotonically.
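For reference, parts a)-d) can be sanity-checked against a minimal NumPy sketch of the two alternating steps. This is a standalone illustration (with a farthest-first initialization chosen here for determinism), not the interface required by the MP:

```python
import numpy as np

def kmeans(x, k, n_iters=100):
    """Minimal k-means: alternate the assignment and mean-update steps."""
    # Farthest-first initialization: start from x[0], then repeatedly
    # add the point farthest from the current set of centers.
    mu = [x[0]]
    for _ in range(k - 1):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in mu], axis=0)
        mu.append(x[d.argmax()])
    mu = np.array(mu)
    for _ in range(n_iters):
        # Assignment step: r[i] is the index of the nearest center.
        dists = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned examples.
        new_mu = np.array([x[r == j].mean(axis=0) if (r == j).any() else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # converged: cost can no longer decrease
            break
        mu = new_mu
    return mu, r
```

Both steps can only decrease the cost in a), which is the monotonicity argument part d) asks for.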
Problem 2. Recall that Jensen's Inequality states: for a convex function $f$, numbers $x_1, \dots, x_n$ in its domain, and positive weights $a_i$ with $\sum_{i=1}^n a_i = 1$, then $f\!\left(\sum_{i=1}^n a_i x_i\right) \le \sum_{i=1}^n a_i f(x_i)$.
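A quick numerical sanity check of the inequality, taking the convex function $f(x) = x^2$ as an example (the particular numbers are arbitrary):

```python
import numpy as np

# Check f(sum_i a_i x_i) <= sum_i a_i f(x_i) for convex f(x) = x^2
# with positive weights a_i summing to 1.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
a = rng.random(5)
a /= a.sum()          # positive weights summing to 1
f = lambda t: t ** 2  # a convex function
lhs = f(np.dot(a, x))
rhs = np.dot(a, f(x))
assert lhs <= rhs
```

The same inequality (applied to the concave $\log$, with the direction reversed) is the key step in deriving the EM lower bound.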
ssh (netid)@remlnx.ews.illinois.edu
You may need a VPN if you are not on the campus network.
module load python/3.4.3
source ~/ece544na_fa2017/bin/activate
svn cp https://subversion.ews.illinois.edu/svn/fa17-ece544/_shared/mp3 .
cd mp3
mkdir data
wget --user (netid) --ask-password https://courses.engr.illinois.edu/ece544na/fa2017/secure/assignment3_data.zip -O data/assignment3_data.zip
unzip data/assignment3_data.zip -d .
svn propset svn:ignore data .
In this exercise you will build a system to classify between hand-written digits (0 to 9), using labeled and unlabeled data. The main motivation for this approach is that labeled data tends to be expensive to collect.
To use the unlabeled data, we first formulate the problem as a clustering problem (i.e., arrange the data into $k$ groups according to some distance metric). Ideally, the distance between examples with the same label is small, so we only need one labeled example per cluster.
In `main.py`, the overall program structure is provided for you.
The data has been provided to you in csv format.
Each row is a hand-written digit: the first element is the label (0-9), and the remaining elements are the pixel intensities of a 28x28 grayscale image. For the unlabeled data, a dummy label of -1 is included.
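As an illustration of this format, a minimal reader sketch (the function name `read_digits_csv` is hypothetical; the required interface lives in `utils/io_tools.py`):

```python
import numpy as np

def read_digits_csv(path):
    """Read the csv: first column is the label (-1 for unlabeled),
    the remaining 784 columns are pixel intensities of a 28x28 image."""
    data = np.loadtxt(path, delimiter=',', ndmin=2)
    labels = data[:, 0].astype(int)        # -1 marks unlabeled examples
    images = data[:, 1:].reshape(-1, 28, 28)
    return labels, images
```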
Relevant Files: `utils/io_tools.py`
Review Lecture 13 for the k-means formulation; more details are provided in the function docstring.
Relevant Files: `models/kmeans.py`
For the supervised learning part, we simply count! We assign each labeled example to the cluster that maximizes the posterior probability, then count how many times each digit shows up in each cluster. (e.g., if the most frequent digit in cluster 1 is digit '5', then our model predicts that any example assigned to cluster 1 has the label of digit '5'.)
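The counting step above can be sketched as follows (the helper name `cluster_to_label` is hypothetical, not the MP's required interface):

```python
import numpy as np

def cluster_to_label(assignments, labels, k):
    """Map each cluster to its most frequent digit among labeled examples.

    assignments: cluster index per labeled example; labels: digits 0-9.
    """
    mapping = {}
    for j in range(k):
        digits = labels[assignments == j]
        if len(digits) == 0:
            mapping[j] = -1  # no labeled example landed in this cluster
            continue
        counts = np.bincount(digits, minlength=10)  # votes per digit
        mapping[j] = int(counts.argmax())           # majority vote wins
    return mapping
```

At prediction time, an example is assigned to its most likely cluster and given that cluster's digit.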
Relevant Files: `models/kmeans.py`, `models/gaussian_mixture_model.py`
In this exercise, we will learn to generate examples from the data distribution using VAE. We have provided the training loop, visualization, and data reader for you.
Review Lecture 17 for the formulation of the Variational AutoEncoder. More details are provided in the function docstring.
For ease of visualization, we have chosen the latent space to be 2-dimensional. The visualized results will look similar to the figure below.
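One common way to produce such a visualization is to decode an evenly spaced grid of 2-D latent codes; a sketch of building that grid (the [-3, 3] range and 20x20 size are assumptions, and the decoder itself comes from `models/variational_autoencoder.py`):

```python
import numpy as np

# Build an n-by-n grid of 2-D latent codes spanning [-3, 3] on each axis;
# each row of z is one latent point to feed through the VAE decoder.
n = 20
lin = np.linspace(-3.0, 3.0, n)
zx, zy = np.meshgrid(lin, lin)
z = np.stack([zx.ravel(), zy.ravel()], axis=1)  # shape (n*n, 2)
```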
Relevant Files: `models/variational_autoencoder.py`, `main_tf.py`
Submitting your code is simply committing your code. This can be done with the following command:
svn commit -m "Some meaningful comment here."
Note: The assignment will be autograded. It is important that you do not use additional libraries or change the provided functions' inputs and outputs.