Homework 10

Due April 23 at 11:59PM

Instructions

You should do this homework on your own -- one submission per student, and by submitting you are certifying the homework is your work.

Submission: This homework will be submitted via Compass (you should have been signed up automatically; if not, please email Rick). Submit your answers, graphs, and other responses as a PDF.

The homework should be done using Python.

Problems

  1. 11.1 (35 points)
    1. Only do part A
  2. 11.4 (35 points)
  3. 11.7 (30 points)

This homework deviates from the textbook in multiple ways. In general, we will be using scikit-learn instead of the R packages; problem-specific changes are noted below.

--------------11.1 NOTES--------------

For 11.1 there are multiple important changes. First, we would like you to use the following data:

pima-indians-diabetes.data

The description for the data is provided here:

pima-indians-diabetes.names

For 11.1 (A) we would like you to use the first 80% of the data file for your training set, and the last 20% of the data file for your evaluation set. These two sets should be taken in the order in which the rows are listed in the data file itself, that is, without shuffling.
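The ordered split described above can be sketched as follows. This is only an illustration: the helper name `ordered_split` and the stand-in list `rows` are hypothetical, and rounding the cut point down with `int(...)` is an assumption, since the handout does not say how to handle a fractional row count.

```python
def ordered_split(rows, train_fraction=0.8):
    """Split rows into (train, evaluation) in file order, without shuffling."""
    # Assumption: the cut point is rounded down when it is not a whole number.
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical stand-in for the parsed rows of the data file.
rows = list(range(10))
train, evaluation = ordered_split(rows)
print(len(train), len(evaluation))
```

With real data, `rows` would be the records read from the data file in their original order.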

For 11.1 (A) we would like you to report the class confusion matrix you obtain when evaluating your Naive Bayes classifier on your evaluation set. In addition to the class confusion matrix, please also report the accuracy and error rate of your classifier.
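As a rough sketch of the quantities requested above, the following computes a two-class confusion matrix, the accuracy, and the error rate (taken here as one minus the accuracy). The labels are made up for illustration and are not results from the assignment data; the function names are hypothetical, and the row/column layout (rows are true classes, columns are predicted classes) is one common convention, not a requirement of the handout.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Fraction of examples on the diagonal (correctly classified)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    # float() keeps the division correct under Python 2 as well.
    return correct / float(total)


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
acc = accuracy_from_matrix(cm)
err = 1.0 - acc
print(cm, acc, err)
```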

--------------11.4 NOTES--------------

For the SVM problem, use the data file named "wdbc.data". Use the file "wdbc.names" to guide your data cleaning and preprocessing. The choice of which columns to drop will become apparent if you read "wdbc.names" carefully.

--------------11.7 NOTES--------------

For the random forest question, please use an 80-20 training/test split of your data. This means that 80% of your data will be used for fitting/training your random forest classifier, and 20% of your data will be used to determine the accuracy and the class confusion matrix of your classifier. The ordering of this split is left up to you.

Since we are not using R's random forest classifier, please use scikit-learn's random forest classifier, which can be imported using the following:

from sklearn.ensemble import RandomForestClassifier

which has an API guide here: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
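A minimal sketch of the 80-20 workflow with this classifier might look like the following. It runs on synthetic data from `make_classification` rather than the assignment data, and the choices of `n_estimators=100`, `random_state=0`, and a shuffled split via `train_test_split` are assumptions for illustration, not requirements of the homework.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 80% for training, 20% for testing (shuffled here; the ordering is up to you).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)
print(cm)
print("accuracy:", acc)
```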

 

--------------SUBMISSION/UPLOAD NOTES--------------

For 11.1, 11.4, and 11.7, please upload all Python file(s) related to your homework in addition to your PDF report. Please do not compress the PDF report along with your code files. It is up to you whether to upload the code as a compressed file, as long as the PDF file is uploaded separately from the compressed file. We will be running your code using either Python 2 or 3, so if you choose not to use a Jupyter notebook, please expect us to run your code as follows:

"python your-python-code-file-name.py"

When we're grading, we will handle modifying your code to load data correctly, so you do not need to concern yourself with standardizing where you place your data files while completing the assignment.

 

Good luck, and please get started early!