CS440/ECE448 Spring 2019

Assignment 3: Naive Bayes/Perceptron Classification

Due date: Monday April 1st, 11:59pm

Updated By: Dhruv Agarwal, Yijia Xu, Mark Craft

In this assignment you will apply machine learning techniques for image and text classification task.

Programming language

You may only import additional modules from the Python standard library.

Part 1:Image Classification
Part 2: Text Classification
Extra Credit
Provided Code Skeleton
Deliverables
Report checklist

Part 1: Fashion image classification

Data: You are provided with part of the Fashion Mnist dataset. There are 50000 training examples and 10000 test examples. The labels are from 0 to 9, representing fashion items of T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot.

Part 1.1: Naive Bayes model

Features: Each image consists of 28*28 pixels which we represent as a flattened array of size 784, where each feature/pixel F_i takes on intensity values from 0 to 255 (8 bit grayscale).
Training: The goal of the training stage is to estimate the likelihoods P(F_i | class) for every pixel location i and for every fashion item class (T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot). The likelihood estimate is defined as

P(F_i = f | class) = (# of times pixel i has value f in training examples from this class) / (Total # of training examples from this class)

In addition, as discussed in the lecture, you have to smooth the likelihoods to ensure that there are no zero counts. Laplace smoothing is a very simple method that increases the observation count of every value f by some constant k. This corresponds to adding k to the numerator above, and k*V to the denominator (where V is the number of possible values the feature can take on). The higher the value of k, the stronger the smoothing. Experiment with different values of k (say, from 0.1 to 10) and find the one that gives the highest classification accuracy.

You should also estimate the priors P(class) by the empirical frequencies of different classes in the training set.
Testing: You will perform maximum a posteriori (MAP) classification of test fashion class according to the learned Naive Bayes model. Suppose a test image has feature values f₁, f₂, ... , f₇₈₄. According to this model, the posterior probability (up to scale) of each class given the digit is given by

P(class) ⋅ P(f₁ | class) ⋅ P(f₂ | class) ⋅ ... ⋅ P(f₇₈₄ | class)

Note that in order to avoid underflow, it is standard to work with the log of the above quantity:

log P(class) + log P(f₁ | class) + log P(f₂ | class) + ... + log P(f₇₈₄ | class)

After you compute the above decision function values for all ten classes for every test image, you will use them for MAP classification.
Evaluation: Report your performance in terms of average classification rate and the classification rate for each fashion item (percentage of all test images of a given item correctly classified). Also report your confusion matrix. This is a 10x10 matrix whose entry in row r and column c is the percentage of test images from class r that are classified as class c. In addition, for each class, show the test examples from that class that have the highest and the lowest posterior probabilities according to your classifier. You can think of these as the most and least "prototypical" instances of each fashion class (and the least "prototypical" one is probably misclassified).
Likelihood visualization: When using classifiers in real domains, it is important to be able to inspect what they have learned. One way to inspect a naive Bayes model is to look at the most likely features for a given label. Another tool for understanding the parameters is to visualize the feature likelihoods for high intensity pixels of each class. Here high intensities refer to pixel values from 128 to 255. Therefore, the likelihood for high intensity pixel feature F_i of class c₁ is sum of probabilities of the top 128 intensities at pixel location i of class c₁.

\[ \mathrm{feature\, likelihood}(F_{i}, c_1) = \sum_{k=128}^{255} P(F_{i} = k | c_1) \] For each of the ten classes, plot their trained likelihoods for high intensity pixel features to see what likelihood they have learned.

Part 1.2: Perceptron model

Features: Each image consists of 28*28 pixels which we represent as a flattened array of size 784, where each feature/pixel F_i takes on intensity values from 0 to 255 (8 bit grayscale).
Training: You will train a multiclass perceptron model to classify the 10 fashion classes. The goal of the training stage is to estimate the weight W_ij where i is the feature dimension appended by a bias term and j is for 10 fashion classes. You will train the weight using the multi-class perceptron training rule described in class. Simply, you can start with zero weights, go through training examples one by one, classify that training example using the current weight. If the classification is wrong, lower the weight value of wrong predicted class and raise the weight value of the true class, by the feature value of that example appended with the bias term. You could initialize the bias term as 1.
Testing: You will perform classification of test fashion images according to the learned perceptron weight. Simply, you can go through each test examples and compute the product of its feature vector and the trained weights to get 10 goodness scores. Then choose the class with highest score as the predicted class for that test example.
Evaluation: Same as 1.1, report your performance in terms of the average classification rate and classification rate for each fashion item, your confusion matrix, and for each class, show the test examples from that class that have the highest and the lowest scores according to your classifier.
Weight visualization: Plot the ten weight vectors to inspect what your perceptron model has learned.

Part 2: Text Classification

You are given a dataset consisting of texts which belong to 14 different classes.We have split the dataset into a training set and a development dataset. The training set consists of 3865 texts and their corresponding class labels from 1-14, with instances from each of the classes and the development set consists of 483 test instances and their corresponding labels. We have already done the preprocessing of the dataset and extracted into a Python list structure in text_main.py. Using the training set, you will learn a Naive Bayes classifier that will predict the right class label given an unseen text. Use the development set to test the accuracy of your learned model. Report the accuracy, recall, and F1-Score that you get on your development set. We will have a separate (unseen) train/test set that we will use to run your code after you turn it in. No other outside non-standard python libraries can be used.

Unigram Model

The bag of words model in NLP is a simple unigram model which considers a text to be represented as a bag of independent words. That is, we ignore the position the words appear in, and only pay attention to their frequency in the text. Here each text consists of a group of words. Using Bayes theorem, you need to compute the probability of a text belonging to one of the 14 classes given the words in the text. Thus you need to estimate the posterior probabilities: \[ P( \mathrm{Class} = \mathrm{C_i} | \mathrm{Words}) = \frac{P(\mathrm{Class}=\mathrm{C_i})}{P(\mathrm{Words})} \prod_{\mathrm{All}~\mathrm{words}} P(\mathrm{Word}|\mathrm{Class}=\mathrm{C_i}) \] It is standard practice to use the log probabilities so as to avoid underflow. Also, \(P(\mathrm{words})\) is just a constant, so it will not affect your computation.

Training and Development

Training: To train the algorithm you are going to need to build a bag of words model using the texts. After you build the model, you will need to estimate the log-likelihoods \(\log P(\mathrm{Word}|\mathrm{Type}=\mathrm{C_i})\). The variable \(\mathrm{C_i}\) can only take on 14 values, 1-14. Additionally, you will need to make sure you smooth the likelihoods to prevent zero probabilities. In order to accomplish this task, use Laplace smoothing.
Development: After you have computed the log-likelihoods, you will have your model predict class labels of the text from the development set. In order to do this, you will do MAP estimatation classification using the equation shown above.

Use only the training set to learn the individual probabilities. The following results should be put in your report:

Plot your confusion matrix. This is a 14x14 matrix whose entry in row r and column c is the percentage of test text from class r that are classified as class c.
Accuracy, recall, and F1 scores for each of the classes on the development set.
Top 20 feature words of each of the classes .
Calculate your accuracy without including the class prior into the Naive Bayes equation i.e. Only computing the ML inference of each instance. Report the change in accuracy numbers, if any. Also state your reasoning for this observation. Is including the class prior always beneficial? Change your class prior to a uniform distribution. What is the change in result?

Extra Credit Suggestion

Implement the naive Bayes algorithm over a bigram model as opposed to the unigram model. Bigram model is defined as follows: \[ P(w_1..w_n) = P(w_1)P(w_2|w_1)..P(w_n|w_{n-1}) \] Then combine the bigram model and the unigram model into a mixture model defined with parameter \(\lambda\): \[ (1-\lambda)P(Y) \prod_{i=1}^n P(w_i|Y) + \lambda P(Y) \prod_{i=1}^m P(b_i|Y) \] Did the bigram model help improve accuracy? Find the best parameter \(\lambda\) that gives the highest classification accuracy. Report the optimal parameter \(\lambda\) and report your results(Accuracy number) on the bigram model and optimal mixture model, and answer the following questions:

Running naive Bayes on the bigram model relaxes the naive assumption of the model a bit. However, is this always a good thing? Why or why not?
What would happen if we did an \(N\)-gram model where \(N\) was a really large number?

Provided Code Skeleton

We have provided ( zip file) all the code to get you started on your MP.

For part 1, you are provided the following. The doc strings in the python files explain the purpose of each function.

image_main.py- This is the main file which loads the dataset and calls your Naive Bayes and perceptron algorithms.
naive_bayes.py- This is the only file that needs to be modified.
perceptron.py- This is the only file that needs to be modified.
x_train.npy, y_train.npy, x_test.npy and y_test.npy- These files contain the training and testing examples.

For part 2, you are provided the following. The doc strings in the python files explain the purpose of each function

text_main.py- This is the main file which loads the dataset and calls your Naive Bayes Algorithm.
TextClassifier.py- This is the only file that needs to be modified.
train_text.csv- This file contains the training examples.
dev_text.csv- This file contains the development examples for testing your model.
stop_words.csv- This file contains the stop words which are required for preprocessing the dataset.

Deliverables

This MP will be submitted via compass.

Please upload only the following files to compass.

naive_bayes.py - your solution python file to part 1
perceptron.py - your solution python file to part 1
TextClassifier.py - your solution python file to part 2
report.pdf - your project report in pdf format

Report Checklist

Your report should briefly describe your implemented solution and fully answer the questions for every part of the assignment. Your description should focus on the most "interesting" aspects of your solution, i.e., any non-obvious implementation choices and parameter settings, and what you have found to be especially important for getting good performance. Feel free to include pseudocode or figures if they are needed to clarify your approach. Your report should be self-contained and it should (ideally) make it possible for us to understand your solution without having to run your source code.
Kindly structure the report as follows:

Title Page:
List of all team members, course number and section for which each member is registered, date on which the report was written
Section I:
Image Classification. For both part 1.1 and part 1.2, report average classification rate, the classification rate for each class and the confusion matrix. For each class, show the test examples from that class that have the highest and lowest posterior probabilities or perceptron scores according to your classifier. Show the ten visualization plots for both feature likelihoods and perceptron weights.
Section II:
Text Classification. Report all your results, confusion matrix ,recall ,precision, F1 score for all the 14 classes. Include the top feature words for each of the classes. Also, report the change in accuracy results when the class prior changes to uniform distribution and when its removed. Provide the reasoning for these observations
Extra Credit:
If you have done any work which you think should get extra credit, describe it here
Statement of Contribution:
Specify which team-member performed which task. You are encouraged to make this a many-to-many mapping, if applicable. e.g., You can say that "Rahul and Jason both implemented the BFS function, their results were compared for debugging and Rahul's code was submitted. Jason and Mark both implemented the DFS function, Mark's code never ran successfully, so Jason's code was submitted. Section I of the report was written by all 3 team members. Section II by Mark and Jason, Section III by Rahul and Jason.".. and so on.

WARNING: You will not get credit for any solutions that you have obtained, but not included in your report! Make only ONE submission per team.

Only attach files that are the required deliverables in compass. Your report must be a formatted pdf document. Pictures and example outputs should be incorporated into the document. Exception: items which are very large or unsuitable for inclusion in a pdf document (e.g. videos or animated gifs) may be put on the web and a URL included in your report.

Extra credit:

We reserve the right to give bonus points for any advanced exploration or especially challenging or creative solutions that you implement. This includes, but is not restricted to, the extra credit suggestion given above.