ECE398BD: Fundamentals of Machine Learning (Labs)
Schedule
       Topic                             Lab     Assigned  Due               Feedback for the Week
Lab 1  Introduction to Python            [link]  None      Not Graded        No Feedback
Lab 2  Classification, Part 1            [link]  Jan 25    Feb 2, 12:00 AM   [link]
Lab 3  Classification, Part 2            [link]  Feb 1     Feb 9, 12:00 AM   [link]
Lab 4  Linear Regression and Clustering  [link]  Feb 8     Feb 16, 12:00 AM  [link]
Lab 5  Principal Component Analysis      [link]  Feb 15    Feb 23, 12:00 AM  [link]

Here are two sample quizzes from previous years: [sample 1], [sample 2]. Sample 2 is more representative of the level of the first quiz.
Note that 12:00 AM on a date is the start of the date, i.e. 12:00 AM on a Thursday refers to the transition between Wednesday and Thursday.
Hints, Errata and Feedback
If any changes, hints or commentary are needed for the labs, they will be provided here.
Lab 5
This is the last week of this section of the course. Please have your lab completed before the start of the next lab session, so that you do not fall behind in the second part of the course.
The CSL Student Conference has a few talks relevant to this course (which you may choose to go to – I will be in the lab for the full duration):
Opening Plenary: Dr. Arvind Krishna (SVP IBM Cloud, MSEE '87+ PhD '91), “Accelerating Technology Disruption”, Feb. 15, 4:30–5:30 NCSA Auditorium
Bioinformatics Keynote: Dr. Filippo Utro (IBM T.J. Watson), “Plant, Population and Cancer Genomics in the DNA revolution”, Feb. 16, 9:00–9:50 CSL B02
Machine Learning Keynote: Prof. Richard Baraniuk (Rice U., PhD ’92), “A Probabilistic Theory of Deep Learning”, Feb. 17, 2:00–2:50 CSL B02
Student presentations on ML, Bioinformatics, Etc. (see the website)
Lab 4
When calculating J*(K), you call kMeans 100 times (with niter=100) and take the minimum value of J_K as your estimate for J*(K). This is so that in case you get bad initial cluster centers, you still get a good estimate of J*(K).
This lab shouldn't take that long to run; if you're having issues with running time, chances are the hint from Lab 2 on scipy.spatial.distance.cdist will help.
Note that scipy.spatial.distance.cdist returns the Euclidean distance, not its square, unless you pass 'sqeuclidean'.
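To make the restart idea and the 'sqeuclidean' note concrete, here is a minimal sketch. It is not the lab's kMeans: the helper names kmeans_once and estimate_J_star, the random initialization from data points, and the handling of empty clusters are all assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist


def kmeans_once(X, K, niter=100, rng=None):
    """One k-means run from random initial centers (hypothetical helper).

    Returns the final objective J_K: the sum of squared distances from
    each point to its nearest center.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    # Assumption: initialize the centers with K distinct data points.
    centers = X[rng.choice(X.shape[0], size=K, replace=False)]
    for _ in range(niter):
        # Squared distances, shape (N, K). Without 'sqeuclidean',
        # cdist would return the unsquared Euclidean distance.
        d2 = cdist(X, centers, 'sqeuclidean')
        labels = d2.argmin(axis=1)
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[k] = members.mean(axis=0)
    d2 = cdist(X, centers, 'sqeuclidean')
    return d2.min(axis=1).sum()


def estimate_J_star(X, K, nrestarts=100, niter=100, seed=0):
    """Estimate J*(K) as the smallest J_K seen over several restarts."""
    rng = np.random.default_rng(seed)
    return min(kmeans_once(X, K, niter, rng) for _ in range(nrestarts))
```

A single run can get stuck at a poor local minimum when the initial centers are unlucky; taking the minimum over restarts is what makes the estimate robust to that.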
Lab 3
There is an interesting keynote at ECE Pulse during the lab period. You may choose to go – I will be in the lab for the full duration.
Read the lab directions carefully. Make sure you are not training on your test data! As stated at the top of the lab, this will be penalized heavily. If you are calling .fit() on something that doesn't have train in the name, you're doing something wrong.
In the last problem, your error in the second to last part may come out to be zero depending on which algorithm you pick. This is an unintentional peculiarity of this data set. It is, in fact, mentioned in the data set's documentation, but I missed it; had I noticed, I would have picked a different data set. It will rarely happen otherwise. So, for the last part of the last problem, just pretend that the error was something small but nonzero when writing your answer.
The point count totals are off in problem 1 – the first and third part of problem 1 should have a total of 25 points each.
There are many ways to split up the data into folds in problem 2. One simple way is to make a vector with indices 0,…,N-1, and remove the indices corresponding to the fold with numpy.setdiff1d, and use these to index the data. Another straightforward way is to make an array of size (4/5*N,d) and fill it in with the folds by slicing. Worst case, you can hardcode the folds and the data outside the folds.
Do not email me the data sets.
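The index-based way of splitting folds mentioned in the problem 2 hint could look roughly like the sketch below. The function name fold_indices and the use of contiguous, equal-sized folds are assumptions; the lab may define the folds differently.

```python
import numpy as np


def fold_indices(N, n_folds, fold):
    """Return (train_idx, val_idx) for one fold (hypothetical helper).

    Assumes contiguous folds of size N // n_folds; if N is not divisible
    by n_folds, the leftover points always end up in the training part.
    """
    all_idx = np.arange(N)                      # indices 0, ..., N-1
    fold_size = N // n_folds
    val_idx = all_idx[fold * fold_size:(fold + 1) * fold_size]
    # Remove the held-out fold's indices; what remains is the training part.
    train_idx = np.setdiff1d(all_idx, val_idx)
    return train_idx, val_idx
```

You would then index the data with these, e.g. data[train_idx] for fitting and data[val_idx] for evaluating that fold (data here is a placeholder name for an (N,d) array).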
Lab 2
Feedback is up.
Hints:
If you're having trouble with broadcasting, read the help page (or search the internet for examples). Basically, dimensions have to match according to a certain set of rules (described in the links prior).
A vector (in the notes, or in math in general) is a column vector. You can't just take an equation in the notes (which takes in one feature vector and classifies it) and plug in a matrix full of data and expect it to work (the dimensions of the resulting expressions will make no sense, for one thing).
In problem 2, you should get a prior of approximately [0.5,0.33,0.17]. How close these values are to simple fractions might be somewhat confusing, but that is just the nature of this particular training set.
In problem 3, scipy.spatial.distance.cdist can calculate out all the distances between training data and the testing data in one call.
Read the problems carefully, and make sure you answer each part of what needs to be done.
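As a small illustration of the broadcasting and cdist hints, here is a sketch on made-up data. The arrays and variable names are hypothetical, and the argmin step is only one example of what you might do with the distance matrix, not necessarily what problem 3 asks for.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Made-up data: each row is one feature vector (N samples, d features).
train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])   # shape (3, 2)
test = np.array([[0.9, 0.1], [0.1, 1.8]])                # shape (2, 2)

# Broadcasting: the (2,) mean is stretched across every row of the (3, 2)
# array, so no loop over the rows is needed.
mean = train.mean(axis=0)        # shape (2,)
centered = train - mean          # shape (3, 2)

# One call gives every test-to-train Euclidean distance, shape (2, 3):
# D[i, j] is the distance from test row i to train row j.
D = cdist(test, train)

# Index of the closest training row for each test row.
nearest = D.argmin(axis=1)
```

Printing the .shape of each intermediate array is a quick way to check that an expression written for a single column vector still makes sense once the data is stacked as rows of a matrix.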
Lab 1
Errata:
Hints:
Exercises 5 and 6 will be building blocks for the first problem in Lab 2 (where you can use part (a) or part (b) of both exercises). You should be able to do part (a) of both exercises in a straightforward manner. As stated in the lab, part (b) is optional, but good to know. If you're stuck on part (b), make sure to write out the matrices and you should be able to construct the appropriate matrix multiplication. If you do not solve part (b), do not worry about it. But, you really should solve part (a) of both Exercises 5 and 6.
A better hint for Exercise 6(b) might be: “You can do this with the np.dot, elementwise multiplication and np.sum (along an axis) operations.”
Please follow the Python instructions to get started with Jupyter notebooks. You should not need to install any additional packages for this portion of the course if you have installed Anaconda or Canopy.
The following other Python tutorials may be helpful:
And a few links to write code concisely:
