MP7 option c: Neural Networks

Due date: May 5 before midnight

Useful Files: hard2d.txt (the data file described below)

I. Grokking the problem

Download the data file, hard2d.txt. This file contains one matrix, D=[ZETA',X'], where ZETA=[zeta_1,...,zeta_n], X=[x_1,...,x_n], zeta_i in {-1,+1} is a binary class label, and x_i in [0,1]×[-1,1] is a 2D real vector drawn randomly from the right half of the signum square.

Load the dataset into MATLAB; for example, use D=load('hard2d.txt'); X=D(:,2:3)'; ZETA=D(:,1)';. Note that X is then 2-by-n (one column per token) and ZETA is 1-by-n.

The first thing you should do, with any dataset, is seek to gain a little intuitive understanding of the data. With a low-dimensional dataset like this one, you can get excellent intuition by drawing a scatter plot. Do that. In MATLAB, with X stored as 2-by-n, this is plot(X(1,ZETA==1),X(2,ZETA==1),'r.'); hold on; plot(X(1,ZETA==-1),X(2,ZETA==-1),'b.'); title('Scatter plot of the hard2d dataset');

Print your scatter plot as an image (print -dpng fig1.png).
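Putting the pieces of Part I together, a minimal script might look like the following sketch (it assumes hard2d.txt is in the working directory; the figure filename is just the one suggested above):

% Part I sketch: load the data, draw the scatter plot, save it as an image.
D = load('hard2d.txt');            % n-by-3 matrix: [ZETA', X']
X = D(:,2:3)';                     % 2-by-n observations (one column per token)
ZETA = D(:,1)';                    % 1-by-n labels in {-1,+1}
n = size(X,2);

figure;
plot(X(1,ZETA==1),  X(2,ZETA==1),  'r.'); hold on;
plot(X(1,ZETA==-1), X(2,ZETA==-1), 'b.'); hold off;
title('Scatter plot of the hard2d dataset');
print('-dpng','fig1.png');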

II. Neural Networks

In this part of the problem you will create a two-layer neural net with 24 hidden nodes, with tanh activation functions on the outputs of the hidden nodes, and a tanh activation on the output node. You will try several different methods for training the network, in order to see how initial conditions can dramatically change the quality of the resulting classifier.

II.A Knowledge-Based Initialization

Write a function [Z,Y]=nnclassify(X,U,V) that takes as input: (1) a matrix of observations, X, whose dimension is p by n, (2) the first-level matrix of weights and biases, U, whose dimension is q by (p+1), and (3) the second-level vector of weights and biases, V, whose dimension is 1 by (q+1). It should compute, as output, the axon values of the hidden and output layers (Y and Z). For example, your function might include lines of code that look like A=U*[ones(1,n);X];, Y=tanh(A);, B=V*[ones(1,n);Y];, and Z=tanh(B);
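A minimal sketch of nnclassify, following the dimensions given above:

function [Z,Y] = nnclassify(X,U,V)
% NNCLASSIFY  Forward pass of the two-layer tanh network.
%   X : p-by-n observations.
%   U : q-by-(p+1) first-layer weights, with the bias in column 1.
%   V : 1-by-(q+1) second-layer weights, with the bias in element 1.
%   Y : q-by-n hidden-layer axon values; Z : 1-by-n output axon values.
n = size(X,2);
A = U*[ones(1,n); X];    % first-layer synapse values, q-by-n
Y = tanh(A);             % hidden-layer axons
B = V*[ones(1,n); Y];    % second-layer synapse values, 1-by-n
Z = tanh(B);             % output axons
end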

Look at the scatter plot (print it out, if you can). Use ginput to find some points on the boundary between the two classes. Use some basic algebra to find an equation for the line that passes through each pair of points (in the form 0=u0+u1*x1+u2*x2), thus giving yourself a piece-wise linear formula for the boundary between the two classes. Because of the symmetry of the problem, if you want, say, a 16-piece or 24-piece boundary, you should only need the equations for 2 or 3 lines (3-4 points); the other 14-21 lines should be reflections and translations of the basic 2-3. Now, take the equations for your 16-24 lines, and use the coefficients of each equation to determine one row of the U matrix.

Set V=[-THRESHOLD,ones(1,q)], where THRESHOLD is a number you choose so that the classifier gives the right answer.
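The geometry of your boundary depends on what you see in your own scatter plot, so the following is only a hypothetical sketch of the bookkeeping. PTS is a made-up q-by-4 matrix, one row per boundary line, holding two points on that line (from ginput plus your reflected and translated copies); each row is converted into the coefficients [u0,u1,u2] of one row of U, and THRESHOLD is left as a placeholder you must choose yourself.

% Hypothetical bookkeeping: PTS(k,:) = [x1a x2a x1b x2b], two points on line k.
q = size(PTS,1);
U = zeros(q,3);
for k = 1:q
    x1a = PTS(k,1); x2a = PTS(k,2); x1b = PTS(k,3); x2b = PTS(k,4);
    u1 = -(x2b - x2a);               % normal to the segment from (x1a,x2a) to (x1b,x2b)
    u2 =  (x1b - x1a);
    u0 = -(u1*x1a + u2*x2a);         % so that u0 + u1*x1a + u2*x2a = 0 on the line
    U(k,:) = [u0, u1, u2];           % one boundary line per row of U
end
% Note: the overall scale of each row controls how sharply tanh saturates;
% you may want to experiment with multiplying rows of U by a constant.
THRESHOLD = 0;                       % placeholder: choose this so the classifier gives the right answer
V = [-THRESHOLD, ones(1,q)];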

Run nnclassify to generate Z, and plot a scatter plot showing the Z labeling of each token. Make sure it looks the way you expected; if it doesn't, find out why. Compute the error rate of your classifier (ER=sum(ZETA.*Z<0)/n;), and list the error rate in the title of your figure. Print a copy of the figure as an image.
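A sketch of this evaluation step (the figure filename is arbitrary):

[Z,Y] = nnclassify(X,U,V);
ER = sum(ZETA.*Z < 0)/n;                       % fraction of tokens labeled with the wrong sign
figure;
plot(X(1,Z>=0), X(2,Z>=0), 'r.'); hold on;
plot(X(1,Z<0),  X(2,Z<0),  'b.'); hold off;
title(sprintf('Knowledge-based labeling, error rate = %.4f', ER));
print('-dpng','fig2.png');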

II.B Train Using Back-Propagation

Create a function [EPSILON,DELTA]=nnbackprop(X,Y,Z,ZETA,V) calculating the labeling mean-squared error (MSE) back-propagated to the second and first layers of the network, respectively; thus EPSILON is a 1xn vector, and DELTA is a qxn matrix. The function nnbackprop should probably contain lines of code that look something like EPSILON=2*(1-Z.^2).*(Z-ZETA); and DELTA=(1-Y.^2).*(V(:,2:(q+1))'*EPSILON);
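A minimal sketch of nnbackprop using the two lines above:

function [EPSILON,DELTA] = nnbackprop(X,Y,Z,ZETA,V)
% NNBACKPROP  Back-propagate the MSE derivative through both tanh layers.
%   EPSILON : 1-by-n error signal at the output layer.
%   DELTA   : q-by-n error signal at the hidden layer.
%   (X is unused here, but kept to match the signature given above.)
q = size(Y,1);
EPSILON = 2*(1 - Z.^2).*(Z - ZETA);               % derivative of MSE through the output tanh
DELTA   = (1 - Y.^2).*(V(:,2:(q+1))'*EPSILON);    % propagated back through V and the hidden tanh
end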

Create a function [dU,dV]=nngradient(X,Y,DELTA,EPSILON) that finds the gradient of MSE with respect to U and V. Your function will probably include lines like dU=(1/n)*DELTA*[ones(1,n);X]'; and dV=(1/n)*EPSILON*[ones(1,n);Y]';
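A minimal sketch of nngradient:

function [dU,dV] = nngradient(X,Y,DELTA,EPSILON)
% NNGRADIENT  Gradient of the average MSE with respect to U and V.
%   dU : q-by-(p+1) gradient for the first layer.
%   dV : 1-by-(q+1) gradient for the second layer.
n  = size(X,2);
dU = (1/n)*DELTA*[ones(1,n); X]';     % hidden-layer weight/bias gradient
dV = (1/n)*EPSILON*[ones(1,n); Y]';   % output-layer weight/bias gradient
end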

Create a function [U,V]=nndescent(X,ZETA,U0,V0,ETA,T) that performs T iterations of gradient descent, with a learning rate of ETA, from a starting point of U0,V0.
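A minimal sketch of nndescent, built from the three functions above:

function [U,V] = nndescent(X,ZETA,U0,V0,ETA,T)
% NNDESCENT  T iterations of batch gradient descent from the starting point (U0,V0).
U = U0;  V = V0;
for t = 1:T
    [Z,Y] = nnclassify(X,U,V);                     % forward pass
    [EPSILON,DELTA] = nnbackprop(X,Y,Z,ZETA,V);    % back-propagate the error
    [dU,dV] = nngradient(X,Y,DELTA,EPSILON);       % gradient of the average MSE
    U = U - ETA*dU;                                % step downhill
    V = V - ETA*dV;
end
end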

Start with the knowledge-based neural net of the previous subsection. Perform 1000 iterations of gradient descent, with a learning rate of 0.1. You should wind up with an error rate similar to last time, but slightly better; what is it? Create a scatter plot of the new labeling, and put the error rate in the title.
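A sketch of this step, assuming U and V still hold the knowledge-based weights (U1 and V1 are just names for the refined weights):

[U1,V1] = nndescent(X,ZETA,U,V,0.1,1000);      % 1000 iterations, learning rate 0.1
[Z,~] = nnclassify(X,U1,V1);
ER = sum(ZETA.*Z < 0)/n;
figure;
plot(X(1,Z>=0), X(2,Z>=0), 'r.'); hold on;
plot(X(1,Z<0),  X(2,Z<0),  'b.'); hold off;
title(sprintf('After back-propagation, error rate = %.4f', ER));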

II.C Back-Prop from Random Initial Conditions

Discard your previous weight matrices. Instead, randomly initialize the weight matrices, then perform 1000 iterations of gradient descent with a learning rate of 0.002. What is the error rate now? What does the scatter plot look like? Try running gradient descent for another 1000 iterations, and another. Do the results get better?
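A sketch of random initialization (the scale 0.1 on the initial weights is an assumption, not something specified above):

q = 24;  p = 2;
U0 = 0.1*randn(q, p+1);                        % random first-layer weights and biases
V0 = 0.1*randn(1, q+1);                        % random second-layer weights and bias
[U2,V2] = nndescent(X,ZETA,U0,V0,0.002,1000);  % 1000 iterations, learning rate 0.002
[Z,~] = nnclassify(X,U2,V2);
ER = sum(ZETA.*Z < 0)/n                        % display the error rate
% To continue training, call nndescent again starting from (U2,V2) and compare
% the error rates after each additional block of 1000 iterations.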

III. Extra Credit: Simulated Annealing

EXTRA CREDIT: Use simulated annealing to find the global optimum. Use either Gaussianized simulated annealing or discretized simulated annealing.

I used Gaussianized annealing with 100,000 iterations. Gaussian simulated annealing works like this: compute a step in the direction of gradient descent, using your back-propagation code from II.B, and compute the training error rate at the new location. Then add a Gaussian random component to the weight vector, and compute the training error criterion at that new random location. If the random location is better than the gradient-descent location, go there. If it's worse, then go there with probability P=exp(-(error_difference)*log(t+1)/ridge), where ridge is the difference between the largest possible error and the smallest possible error that could be achieved by any neural net applied to your training corpus, and t is the iteration number.

Every time you encounter a weight matrix that is better than the best previous weight matrix (lowest error), append that weight matrix to the back of an auxiliary rank-3 tensor, cell array, or file, and store its associated cost in an auxiliary cost vector. Remember that the very last weight matrix you test might be the best one.
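A minimal sketch of the Gaussianized annealing loop described above. The step size ETA, noise scale SIGMA, iteration count TMAX, and the value of RIDGE are all assumptions you should tune yourself, and for brevity this sketch only keeps the single best weight matrix rather than appending every new best to a cell array or file.

ETA = 0.002;  SIGMA = 0.05;  RIDGE = 0.5;  TMAX = 100000;
q = 24;  p = 2;
U = 0.1*randn(q,p+1);  V = 0.1*randn(1,q+1);
bestU = U;  bestV = V;  bestER = inf;
for t = 1:TMAX
    % 1. One gradient-descent step from the current weights.
    [Z,Y] = nnclassify(X,U,V);
    [EPSILON,DELTA] = nnbackprop(X,Y,Z,ZETA,V);
    [dU,dV] = nngradient(X,Y,DELTA,EPSILON);
    Ug = U - ETA*dU;  Vg = V - ETA*dV;
    Zg = nnclassify(X,Ug,Vg);  ERg = sum(ZETA.*Zg < 0)/n;
    % 2. A Gaussian perturbation of the descent point.
    Ur = Ug + SIGMA*randn(size(Ug));  Vr = Vg + SIGMA*randn(size(Vg));
    Zr = nnclassify(X,Ur,Vr);  ERr = sum(ZETA.*Zr < 0)/n;
    % 3. Accept the random point if it is better, or with the annealing probability.
    if ERr <= ERg || rand < exp(-(ERr - ERg)*log(t+1)/RIDGE)
        U = Ur;  V = Vr;  ER = ERr;
    else
        U = Ug;  V = Vg;  ER = ERg;
    end
    % 4. Keep track of the best weights seen so far.
    if ER < bestER
        bestER = ER;  bestU = U;  bestV = V;
    end
end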

Discretized simulated annealing is random in a discrete state space, rather than a continuous one. The basic idea is to discretize the space of all possible weight vectors (e.g., by quantizing every weight into bins of width 0.1). Compute a gradient update using back-propagation. Then move to a different grid point in the weight space, chosen uniformly at random over all possible grid locations. Compute the training error at the old location and at the new location. If the new location has lower error, go there; if not, go there anyway with probability P=exp(-(error_difference)*log(t+1)/ridge). Every time you find a local optimum (better than the iteration that preceded it, and better than the iteration that follows it), save the weight vector.
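A compact sketch of the discretized variant, assuming weights are quantized to a grid of width GRID and confined to [-WMAX, WMAX] (both assumptions), with ETA, RIDGE, and TMAX again left for you to tune; saving the local-optimum weight vectors is omitted here for brevity.

GRID = 0.1;  WMAX = 2;  ETA = 0.002;  RIDGE = 0.5;  TMAX = 100000;
q = 24;  p = 2;
U = GRID*round(0.1*randn(q,p+1)/GRID);  V = GRID*round(0.1*randn(1,q+1)/GRID);
for t = 1:TMAX
    % Gradient update, snapped back onto the grid ("old location").
    [Z,Y] = nnclassify(X,U,V);
    [EPSILON,DELTA] = nnbackprop(X,Y,Z,ZETA,V);
    [dU,dV] = nngradient(X,Y,DELTA,EPSILON);
    U = GRID*round((U - ETA*dU)/GRID);  V = GRID*round((V - ETA*dV)/GRID);
    Z = nnclassify(X,U,V);  ER = sum(ZETA.*Z < 0)/n;
    % Propose a uniformly random grid point ("new location") and apply the annealing rule.
    Ur = GRID*round(WMAX*(2*rand(size(U))-1)/GRID);
    Vr = GRID*round(WMAX*(2*rand(size(V))-1)/GRID);
    Zr = nnclassify(X,Ur,Vr);  ERr = sum(ZETA.*Zr < 0)/n;
    if ERr <= ER || rand < exp(-(ERr - ER)*log(t+1)/RIDGE)
        U = Ur;  V = Vr;
    end
end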

Create a scatter plot of the labeling function achieved in this way, and compute its error rate. If you run simulated annealing long enough, it should actually beat the knowledge-based initialization; but that might take a very, very long time.