• Lecture 1

    • Principal Component Analysis, I. Jolliffe, 2nd Edition

    • Probabilistic Principal Component Analysis, Tipping and Bishop, Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 61, No. 3 (1999), pp. 611-622.

    • A theoretical introduction to algorithms for the SVD (and applications to PCA) is Chapter 3 of the book Foundations of Data Science by Blum, Hopcroft and Kannan. A free online copy is available here; a minimal numpy sketch of PCA via the SVD appears at the end of this lecture's list.

    • Project idea: Sparse PCA, algorithms, theoretical guarantees.
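
    • As a companion to the SVD reading above, here is a minimal numpy sketch of PCA via the SVD; the toy data matrix and the number of components kept are made up for illustration.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 10))      # toy data: 200 samples, 10 features
      k = 3                               # number of principal components to keep

      Xc = X - X.mean(axis=0)             # center each feature
      U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

      components = Vt[:k]                 # top-k principal directions (rows)
      scores = Xc @ components.T          # projections of the data onto them
      explained_var = S[:k] ** 2 / (len(X) - 1)   # variance captured by each direction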

  • Lecture 2

    • An initial introduction is available in the slides of Prof. Ryan Tibshirani here; a small numpy sketch of the basic CCA computation appears at the end of this lecture's list.

    • An early paper highlighting the practical importance of CCA in fusing multimodal signals is this paper.

    • The probabilistic view of CCA is via the inter-battery factor models studied in this paper by Browne in 1979. This model and interpretation of CCA were rediscovered by Bach and Jordan in this manuscript a quarter century later.

    • Project idea: Modern applications of CCA. Examples: SVCCA as a way of understanding what neural network layers are doing (see if you can spot the bug in this NIPS 2017 paper).
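
    • A small numpy sketch of classical CCA as an SVD of the whitened cross-covariance; the toy two-view data and the small ridge term are made up for illustration.

      import numpy as np

      rng = np.random.default_rng(1)
      n = 500
      X = rng.normal(size=(n, 5))                                              # first view
      Y = X[:, :3] @ rng.normal(size=(3, 4)) + 0.5 * rng.normal(size=(n, 4))   # correlated second view

      Xc, Yc = X - X.mean(0), Y - Y.mean(0)
      Sxx = Xc.T @ Xc / n + 1e-8 * np.eye(X.shape[1])   # small ridge for numerical stability
      Syy = Yc.T @ Yc / n + 1e-8 * np.eye(Y.shape[1])
      Sxy = Xc.T @ Yc / n

      def inv_sqrt(S):
          # inverse matrix square root via an eigendecomposition
          w, V = np.linalg.eigh(S)
          return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

      M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
      U, rho, Vt = np.linalg.svd(M, full_matrices=False)
      print(rho)                        # canonical correlations, largest first
      a = inv_sqrt(Sxx) @ U             # canonical directions for X (columns)
      b = inv_sqrt(Syy) @ Vt.T          # canonical directions for Y (columns)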

  • Lecture 3

    • An initial introduction is available in the slides of Prof. Ryan Tibshirani here.

    • Maximal correlation was introduced by Hirschfeld, Gebelein and Rényi in independent works. The SVD solution to maximal correlation was provided by Witsenhausen here. The discrete alphabet case is nicely recovered in these notes; a small numpy sketch of that computation appears at the end of this lecture's list.

    • Kernel CCA was introduced by Bach and Jordan in this paper.

    • Project idea: Generalizing CCA to multiple random variables; see this paper.
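
    • A small numpy sketch of the discrete-alphabet computation referenced above: the maximal correlation is the second singular value of the matrix with entries P(x,y)/sqrt(P(x)P(y)). The joint pmf here is made up.

      import numpy as np

      # Toy joint pmf of (X, Y) on a 3 x 4 alphabet.
      P = np.array([[0.10, 0.05, 0.05, 0.05],
                    [0.05, 0.20, 0.05, 0.05],
                    [0.05, 0.05, 0.20, 0.10]])
      px = P.sum(axis=1)          # marginal of X
      py = P.sum(axis=0)          # marginal of Y

      B = P / np.sqrt(np.outer(px, py))
      s = np.linalg.svd(B, compute_uv=False)
      # s[0] is always 1 (constant functions); s[1] is the maximal correlation.
      print(s[1])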

  • Lecture 4

    • ACE was developed by Breiman and Friedman in 1985 in this paper. A global proof of convergence was the key technical contribution. The end of these slides also provides a high-level description.

    • ISOMAP was introduced in this Science paper by Tenenbaum et al.

    • SNE (Stochastic neighbor embedding) was introduced in this paper and t-SNE here.

    • Project idea: Literature survey of the power method (the idea behind ACE), starting with this paper and arriving at this one.
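
    • A minimal numpy sketch of the power method itself; the toy matrix and iteration count are made up.

      import numpy as np

      rng = np.random.default_rng(2)
      A = rng.normal(size=(6, 6))
      A = A @ A.T                       # symmetric PSD matrix, so the top eigenvector is well defined

      v = rng.normal(size=6)
      for _ in range(100):              # power iteration: repeatedly apply A and renormalize
          v = A @ v
          v /= np.linalg.norm(v)

      top_eigenvalue = v @ A @ v        # Rayleigh quotient at convergence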

  • Lecture 5

    • Nonparametric density estimation from iid samples is a very classical topic in statistics. The book `All of Nonparametric Statistics' by Wasserman is an excellent resource; Chapter 6, especially Sections 6.2 and 6.3, is a fine introduction to kernel density estimation.

    • Sample-specific kernel shapes are considered in this excellent book titled `Local Regression and Likelihood' by C. R. Loader.

    • Nearest-neighbor density estimation, along with the use of nearest-neighbor distances as bandwidths and sample-specific kernel shapes, is considered here in the context of estimating a functional of the density (differential entropy); a minimal sketch of the nearest-neighbor bandwidth idea appears at the end of this lecture's list.

    • Project idea: Estimating information measures (entropy and mutual information) in continuous spaces is a topical subject. Take a look at this and this paper.
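
    • A minimal numpy sketch of a Gaussian kernel density estimate, first with a fixed bandwidth and then with a per-sample bandwidth set by the k-th nearest-neighbor distance, as in the reading above. The sample, bandwidth and k are made up.

      import numpy as np

      rng = np.random.default_rng(3)
      x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 0.5, 50)])   # 1-D toy sample

      def kde_fixed(t, x, h):
          # classical KDE: average of Gaussian kernels of width h centered at the samples
          K = np.exp(-0.5 * ((t[:, None] - x[None, :]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
          return K.mean(axis=1)

      def kde_knn(t, x, k=10):
          # per-sample bandwidth h_i = distance from x_i to its k-th nearest neighbor
          d = np.abs(x[:, None] - x[None, :])
          h = np.sort(d, axis=1)[:, k]
          K = np.exp(-0.5 * ((t[:, None] - x[None, :]) / h[None, :]) ** 2) / (h[None, :] * np.sqrt(2 * np.pi))
          return K.mean(axis=1)

      grid = np.linspace(-6, 6, 200)
      f_fixed = kde_fixed(grid, x, h=0.4)
      f_knn = kde_knn(grid, x, k=10)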

  • Lecture 6

    • A very elementary introduction to the EM algorithm in the context of Gaussian mixtures is here. A textbook on the EM algorithm is the one by McLachlan and Krishnan; a toy EM iteration for a mixture of two Gaussians is sketched at the end of this lecture's list.

    • Convergence analyses of the EM algorithm are few, but very recent works on global convergence when recovering the means of a mixture of two Gaussians are here and here.

    • Gaussian mixtures were first studied by Pearson in 1894 in this paper titled `Contributions to the mathematical theory of evolution', which is extremely readable even today. This blogpost, covering a very recent result on mixtures of two Gaussians, is as informative as it is readable.
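
    • A toy numpy sketch of EM for a mixture of two 1-D Gaussians; the data, initialization and iteration count are made up.

      import numpy as np

      rng = np.random.default_rng(4)
      x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 200)])

      # initial guesses for weights, means, variances
      w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

      def gauss(x, mu, var):
          return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

      for _ in range(100):
          # E-step: posterior responsibility of each component for each point
          r = w[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
          r /= r.sum(axis=1, keepdims=True)
          # M-step: re-estimate the parameters from the responsibilities
          nk = r.sum(axis=0)
          w = nk / len(x)
          mu = (r * x[:, None]).sum(axis=0) / nk
          var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk

      print(w, mu, var)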

  • Lecture 7

    • A nice introduction to tensor methods in machine learning is this blogpost.

    • Tensor methods for Gaussian mixtures are well covered in Section 3.2 of this paper. These student-scribed notes also cover them, though not as thoroughly or accurately.

    • Gaussian mixture models and algorithms are comprehensively covered from a theoretical perspective in Chapter 6 of this research monograph.

    • Project idea: A survey of modern tensor decomposition algorithms; a good starting point is this ICML 2017 paper.
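
    • A minimal numpy sketch of tensor power iteration, a basic primitive behind many of the decomposition algorithms above; the orthogonally decomposable rank-3 tensor here is constructed purely for illustration.

      import numpy as np

      rng = np.random.default_rng(5)
      # Build a symmetric, orthogonally decomposable tensor T = sum_i w_i a_i (x) a_i (x) a_i.
      A, _ = np.linalg.qr(rng.normal(size=(5, 3)))      # three orthonormal components a_i (columns)
      w = np.array([3.0, 2.0, 1.0])
      T = np.einsum('i,ji,ki,li->jkl', w, A, A, A)

      v = rng.normal(size=5)
      v /= np.linalg.norm(v)
      for _ in range(50):                               # tensor power iteration: v <- T(I, v, v), renormalized
          v = np.einsum('jkl,k,l->j', T, v, v)
          v /= np.linalg.norm(v)

      # v converges (generically) to one of the components a_i; T(v, v, v) then recovers its weight w_i.
      print(np.einsum('jkl,j,k,l->', T, v, v, v))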

  • Lecture 8

    • A survey of various approaches to the mixture of experts model is here.

    • Early papers on the mixture of experts model are here and here.

    • Recent works on the mixture of experts as sparsely gated neural networks are here and here.
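
    • A minimal numpy sketch of the mixture-of-experts forward pass: a softmax gate weighting the outputs of the experts (linear experts here). The dimensions and parameters are made up.

      import numpy as np

      rng = np.random.default_rng(6)
      d, n_experts, n_out = 8, 4, 3

      W_gate = rng.normal(size=(d, n_experts))            # gating network (linear here)
      W_experts = rng.normal(size=(n_experts, d, n_out))  # one linear expert per slot

      def moe_forward(x):
          logits = x @ W_gate
          gate = np.exp(logits - logits.max())            # softmax over experts
          gate /= gate.sum()
          expert_outputs = np.einsum('edo,d->eo', W_experts, x)   # each expert's prediction
          return gate @ expert_outputs                    # gate-weighted combination

      y = moe_forward(rng.normal(size=d))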

  • Lecture 9

    • A modern textbook on deep learning is the one by Goodfellow, Bengio and Courville, available here. You can learn about feedforward, convolutional, scattering and recurrent networks in more detail here than at the Zoo.

    • Interpretable deep learning is a new topic with computer vision applications here (full paper here) and NLP ones here.

    • Blogs are a nice way to learn about neural networks. There are plenty online, including a general purpose one and this one on CNNs. A general introduction to RNNs and LSTMs is here.

    • Project idea: Build and train a rudimentary neural network translation system for your native language vis-a-vis English (this is not very useful for monolingual speakers). A good starting point is here.

  • Lecture 10

    • Backpropagation is explained in detail in the textbook on deep learning by Goodfellow, Bengio and Courville, and this blogpost also does a very fine job with explanations as well as historical references. Chapter 3 of this thesis works too.

    • There are many references online for stochastic gradient descent and its variants (including adaptive step sizes, momentum methods and minibatching). This blog compares the gradient-based variants, as does this article; momentum methods for accelerated gradient descent are explained in this blog. A recent and popular variant is Adam.

    • Project idea: A careful study of Nesterov's momentum method, including recent efforts at formally understanding it. This is a simple online script for seeing momentum at play. Also interesting is the study of adaptive step-size methods, including Adagrad.
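
    • A minimal numpy sketch contrasting plain gradient descent with heavy-ball momentum on a toy quadratic; the step size, momentum coefficient and quadratic are made up.

      import numpy as np

      A = np.diag([1.0, 10.0])          # ill-conditioned quadratic f(x) = 0.5 x^T A x
      grad = lambda x: A @ x

      def plain_descent(lr=0.05, steps=200):
          x = np.array([5.0, 5.0])
          for _ in range(steps):
              x = x - lr * grad(x)
          return x

      def momentum_descent(lr=0.05, beta=0.9, steps=200):
          x, v = np.array([5.0, 5.0]), np.zeros(2)
          for _ in range(steps):
              v = beta * v - lr * grad(x)   # accumulate a velocity, then move along it
              x = x + v
          return x

      print(plain_descent(), momentum_descent())   # both approach the minimizer at the origin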

  • Lecture 11

    • Autoencoders, in their original form as unsupervised learning algorithms, are explained here.

    • Variational autoencoders were introduced concurrently by Kingma and Welling (here) and Rezende and Wierstra (see here). A survey article on variational autoencoders is here.

    • A practical implementation of a VAE on the MNIST dataset is discussed here.
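
    • A minimal numpy sketch of the two VAE-specific ingredients, the reparameterization trick and the closed-form Gaussian KL term; the encoder outputs here are placeholders, not a trained network.

      import numpy as np

      rng = np.random.default_rng(7)
      latent_dim = 2

      # Pretend these came from the encoder network for one input.
      mu = np.array([0.3, -1.2])
      log_var = np.array([-0.5, 0.1])

      # Reparameterization trick: sample z = mu + sigma * eps with eps ~ N(0, I),
      # so gradients can flow through mu and log_var.
      eps = rng.normal(size=latent_dim)
      z = mu + np.exp(0.5 * log_var) * eps

      # KL( N(mu, sigma^2) || N(0, I) ), the regularizer in the ELBO.
      kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
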
  • Lecture 12

    • The authoritative account of GANs is this tutorial by Ian Goodfellow. A corresponding reference with practical tips on training your own GAN is Soumith Chintala's git page.

    • The original GAN paper by Goodfellow et al. is here, where you will also see the connection to minimizing the Jensen-Shannon divergence (the underlying minimax objective is written out at the end of this lecture's list). This paper by Arjovsky, Chintala and Bottou introduced the Wasserstein GAN, where a connection is made to minimizing the Wasserstein distance to the true distribution.

    • There are a very large number of blogposts on the topic, but I recommend this, this and this.

    • Implementation project idea: Build your own GAN on a new image dataset (trained GANs are few, but this seems to be well done). Experiment with your trained GAN and see if you can spot interesting generative properties.
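
    • For reference, the minimax objective from the original GAN paper, and its value at the optimal discriminator, which is where the Jensen-Shannon divergence connection comes from:

      \min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

      \max_D V(D,G) = -\log 4 + 2\,\mathrm{JSD}\big(p_{\mathrm{data}} \,\|\, p_G\big)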

  • Lecture 13

    • A summary of language models as of 1998 is by Chen and Goodman. The Bayesian interpretation of Kneser-Ney (KN) smoothing via the Pitman-Yor process is by Yee Whye Teh. The modern nonparametric understanding of Good-Turing smoothing and discounting-based smoothing is in recent NIPS papers; a toy smoothed bigram model is sketched at the end of this lecture's list.

    • Neural network approaches to language modeling have been significantly more successful than the statistical smoothing approaches based on n-grams. Key papers include this and this.

    • A concrete, hands-on tutorial on building a language model via RNNs is here.

    • Implementation project idea: Use an RNN to build a language model in your native language (including English, although that is already widely available). A step-by-step approach is provided here, although you can find plenty of other good resources online.
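
    • As a much simpler baseline than the smoothing methods above, a toy add-one (Laplace) smoothed bigram model in Python; the corpus is made up.

      from collections import Counter

      corpus = "the cat sat on the mat . the dog sat on the rug .".split()
      vocab = sorted(set(corpus))
      V = len(vocab)

      unigrams = Counter(corpus)
      bigrams = Counter(zip(corpus[:-1], corpus[1:]))

      def prob(word, prev):
          # add-one smoothing: every bigram gets a pseudo-count of 1
          return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

      print(prob("cat", "the"), prob("dog", "the"), prob("rug", "cat"))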

  • Lecture 14

    • Word2vec was introduced in 2013 in this paper by Mikolov et al. This report and this blogpost, as well as this blogpost, are efforts to get to the heart of the ideas behind word2vec.

    • GloVe was introduced by Stanford NLP researchers in this paper; this site maintains the source code and downloadable vectors, as well as an interface for you to play with.

    • Implementation project idea: Play with word2vec vectors for your native language and see what syntactic and semantic similarities are caught. You will be surprised by how creative you have to get to see structure in languages with complicated morphologies. Pre-trained vectors for many languages are here, although you may want to train your own for better quality.
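
    • A minimal numpy sketch of the analogy test you would run on such vectors; the tiny vector table here is made up, and real pre-trained vectors would be loaded from disk instead.

      import numpy as np

      # Placeholder 4-dimensional "embeddings"; real word2vec/GloVe vectors are 100-300 dimensional.
      vecs = {
          "king":  np.array([0.8, 0.6, 0.1, 0.0]),
          "queen": np.array([0.8, 0.6, 0.9, 0.1]),
          "man":   np.array([0.2, 0.1, 0.1, 0.0]),
          "woman": np.array([0.2, 0.1, 0.9, 0.1]),
      }

      def analogy(a, b, c, vecs):
          # b - a + c, then return the nearest word by cosine similarity (excluding the inputs)
          target = vecs[b] - vecs[a] + vecs[c]
          cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
          candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
          return max(candidates, key=lambda w: cos(candidates[w], target))

      print(analogy("man", "king", "woman", vecs))   # expected: "queen" with these toy vectors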

  • Lecture 15

    • Sentence representations are an ongoing research topic. Easy to read and understand are this, this and this. They all use plain word embeddings in a word-order-independent manner to come up with a sentence representation; a minimal averaging sketch appears at the end of this lecture's list.

    • Neural network models, including autoencoders, are here, here and here.
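
    • A minimal sketch of the word-order-independent construction referenced above: a sentence vector as a frequency-weighted average of its word vectors (in the spirit of, but not identical to, the published weighting schemes). The embeddings and word frequencies here are made up.

      import numpy as np

      rng = np.random.default_rng(8)
      # Placeholder word embeddings; in practice these come from word2vec or GloVe.
      words = "the quick brown fox jumps over lazy dog".split()
      emb = {w: rng.normal(size=50) for w in words}
      # Placeholder unigram frequencies used for the weighting.
      freq = dict(zip(words, [0.05, 0.001, 0.001, 0.0005, 0.001, 0.01, 0.001, 0.0008]))

      def sentence_vector(sentence, a=1e-3):
          # weighted average of word vectors; rarer words get larger weight
          toks = sentence.split()
          weights = np.array([a / (a + freq[w]) for w in toks])
          return (weights[:, None] * np.array([emb[w] for w in toks])).mean(axis=0)

      v = sentence_vector("the quick brown fox")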

  • Lecture 16

    • Polysemous word vector representations are nicely covered here and here.

    • Compositionality detection is covered here.

  • Lecture 17

    • A historical view of document representations leading to a modern NLP pipeline was covered. Counts and tf-idf are textbook material in information retrieval; a small tf-idf sketch appears at the end of this lecture's list.

    • Latent Dirichlet Allocation was introduced here. There are many online videos of this topic being presented; I recommend Michael Jordan's talk at the 250th anniversary celebration of Bayes' theorem.

    • Modern NLP-based document representations are covered here and here. I recommend this Q&A contest to get a feel for a modern NLP pipeline at work.
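
    • A small pure-Python tf-idf sketch for the counts/tf-idf material above; the toy corpus is made up, and libraries such as scikit-learn provide production-grade versions.

      import math
      from collections import Counter

      docs = ["the cat sat on the mat",
              "the dog chased the cat",
              "dogs and cats are pets"]
      tokenized = [d.split() for d in docs]
      vocab = sorted({w for doc in tokenized for w in doc})

      # document frequency of each term
      df = {w: sum(w in doc for doc in tokenized) for w in vocab}
      N = len(docs)

      def tfidf(doc):
          counts = Counter(doc)
          # term frequency times inverse document frequency
          return {w: (counts[w] / len(doc)) * math.log(N / df[w]) for w in counts}

      for doc in tokenized:
          print(tfidf(doc))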

  • Lecture 18

    • Graph representations in Euclidean spaces are an old topic. The standard method is for the Euclidean distances (L_1 or L_2) to closely mirror the edge capacities. The standard result in this area is the Johnson-Lindenstrauss lemma; a minimal random-projection sketch appears at the end of this lecture's list.

    • Using the idea behind convolutional neural networks to represent graphs is new -- there are two variants, called spectral and spatial, and this is a rapidly developing area. A good reference for these representations is Appendix A of this paper, which itself develops more general representations.

    • Very recent papers (under review for ICLR 2018) fall under the rubric of `Graph Attention Networks'. Check out this and this. Also of interest is this paper, to appear in NIPS 2017.
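
    • A minimal numpy sketch of the random projection behind the Johnson-Lindenstrauss lemma; the dimensions are chosen purely for illustration. Pairwise distances are approximately preserved after projection.

      import numpy as np

      rng = np.random.default_rng(9)
      n, d, k = 100, 1000, 200                   # n points in d dimensions, projected down to k

      X = rng.normal(size=(n, d))
      R = rng.normal(size=(d, k)) / np.sqrt(k)   # random Gaussian projection matrix
      Y = X @ R

      # compare a few pairwise distances before and after projection
      for i, j in [(0, 1), (2, 3), (4, 5)]:
          before = np.linalg.norm(X[i] - X[j])
          after = np.linalg.norm(Y[i] - Y[j])
          print(before, after, after / before)   # ratios should be close to 1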

  • Lecture 19

    • Program representations are a very new topic, with significant progress made in this paper from ICLR 2016. A recent paper brings recursion to this architecture, allowing for significant generalization capabilities. These slides are a good presentation of this latter material.

    • Representing natural language text as programs is the goal of this project from the Stanford NLP group.

  • Lecture 20

    • Interpretable machine learning is a new and emerging area. We covered two topics: LIME and calibration; a toy LIME-style sketch appears at the end of this lecture's list.

    • Interpreting the weights of neural networks is explored here and here. But caveat lector -- these are works in progress and the noise level is very high. I like the overall direction of research, though.
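
    • A toy numpy sketch of the LIME idea covered in lecture: sample perturbations around an instance, query the black-box model, and fit a locally weighted linear surrogate. The stand-in black-box model, kernel width and sample count are all made up.

      import numpy as np

      rng = np.random.default_rng(10)

      # Stand-in for a black-box classifier's score (in practice: a trained model's predict function).
      black_box = lambda X: 1.0 / (1.0 + np.exp(-(np.sin(3 * X[:, 0]) + X[:, 1] ** 2)))

      x0 = np.array([0.5, -0.2])                 # the instance to explain
      Z = x0 + 0.3 * rng.normal(size=(500, 2))   # perturbations around it
      y = black_box(Z)

      # proximity weights: closer perturbations matter more
      w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / (2 * 0.3 ** 2))

      # weighted least squares fit of a linear surrogate y ~ b0 + b @ (z - x0)
      A = np.hstack([np.ones((len(Z), 1)), Z - x0])
      W = np.diag(w)
      coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
      print(coef[1:])                            # local feature attributions around x0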