In this assignment you will apply machine learning techniques to image and text classification tasks.
Programming language
You may only import additional modules from the Python standard library.
Contents
Part 1: Fashion image classification
Part 1.1: Naive Bayes model
Part 1.2: Perceptron model
Part 2: Text Classification
You are given a dataset of texts, each belonging to one of 14 classes. We have split the dataset into a training set and a development set. The training set consists of 3865 texts and their class labels (1-14), with instances from every class; the development set consists of 483 test instances and their labels. The dataset has already been preprocessed and extracted into a Python list structure in text_main.py.
Using the training set, you will learn a Naive Bayes classifier that predicts the class label of an unseen text. Use the development set to test the accuracy of your learned model, and report the accuracy, recall, and F1-score that you get on the development set. After you turn in your code, we will run it on a separate (unseen) train/test set. No libraries outside the Python standard library may be used.
Unigram Model
The bag-of-words model in NLP is a simple unigram model that represents a text as a bag of independent words. That is, we ignore the positions at which the words appear and pay attention only to their frequency in the text. Here each text consists of a group of words. Using Bayes' theorem, you need to compute the probability that a text belongs to each of the 14 classes given the words in the text. Thus you need to estimate the posterior probabilities:
\[ P( \mathrm{Class} = \mathrm{C_i} | \mathrm{Words}) = \frac{P(\mathrm{Class}=\mathrm{C_i})}{P(\mathrm{Words})} \prod_{\mathrm{All}~\mathrm{words}} P(\mathrm{Word}|\mathrm{Class}=\mathrm{C_i}) \]
It is standard practice to work with log probabilities to avoid underflow. Also, \(P(\mathrm{Words})\) is the same for every class, so it does not affect which class scores highest.
Training and Development
Use only the training set to learn the individual probabilities. The following results should be put in your report:
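As an illustration of the unigram model described in this part, here is a minimal sketch of training and evaluating a Naive Bayes classifier in log space. It is not the provided skeleton: the function names (`train_unigram_nb`, `predict`, `evaluate`) are hypothetical, the input is assumed to be a list of token lists with a parallel list of labels, Laplace smoothing with constant `alpha` is an assumption (the handout does not specify a smoothing scheme), and recall and F1 are macro-averaged over classes, which is also an assumption.

```python
import math
from collections import Counter, defaultdict


def train_unigram_nb(train_texts, train_labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from the training set.

    alpha is a Laplace smoothing constant (an assumption, not from the handout).
    """
    class_counts = Counter(train_labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(train_texts, train_labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(train_labels)
    model = {}
    for label, n_c in class_counts.items():
        total = sum(word_counts[label].values())
        # The "+ 1" reserves probability mass for words never seen in training.
        denom = total + alpha * (len(vocab) + 1)
        model[label] = {
            "log_prior": math.log(n_c / n_docs),
            "log_like": {w: math.log((word_counts[label][w] + alpha) / denom)
                         for w in vocab},
            "log_unseen": math.log(alpha / denom),
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, params in model.items():
        score = params["log_prior"]
        for w in tokens:
            score += params["log_like"].get(w, params["log_unseen"])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(model, dev_texts, dev_labels):
    """Accuracy plus macro-averaged recall and F1 (averaging is an assumption)."""
    preds = [predict(model, t) for t in dev_texts]
    pairs = list(zip(preds, dev_labels))
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    recalls, f1s = [], []
    for c in sorted(set(dev_labels)):
        tp = sum(p == c and y == c for p, y in pairs)
        fp = sum(p == c and y != c for p, y in pairs)
        fn = sum(p != c and y == c for p, y in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    return {"accuracy": accuracy,
            "recall": sum(recalls) / len(recalls),
            "f1": sum(f1s) / len(f1s)}


# Toy data for illustration only; the real data comes from text_main.py.
toy_texts = [["ball", "goal", "team"], ["goal", "match"],
             ["stock", "market"], ["market", "bank", "stock"]]
toy_labels = [1, 1, 2, 2]
toy_model = train_unigram_nb(toy_texts, toy_labels)
toy_metrics = evaluate(toy_model, [["goal"], ["bank"]], [1, 2])
```

Summing log probabilities instead of multiplying raw probabilities keeps the scores finite even for long texts, and dropping \(P(\mathrm{Words})\) is safe because it is identical for every class.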
Extra Credit Suggestion
Implement the naive Bayes algorithm over a bigram model instead of the unigram model. The bigram model is defined as follows:
\[ P(w_1 \ldots w_n) = P(w_1) P(w_2|w_1) \cdots P(w_n|w_{n-1}) \]
Then combine the bigram model and the unigram model into a mixture model with parameter \(\lambda\):
\[ (1-\lambda)P(Y) \prod_{i=1}^n P(w_i|Y) + \lambda P(Y) \prod_{i=1}^m P(b_i|Y) \]
Here \(Y\) is the class, \(w_1, \ldots, w_n\) are the words of the text, and \(b_1, \ldots, b_m\) are its bigrams. Did the bigram model help improve accuracy? Find the value of \(\lambda\) that gives the highest classification accuracy. Report the optimal \(\lambda\) and your results (accuracy) on the bigram model and on the optimal mixture model, and answer the following questions:
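The mixture formula above adds two products of probabilities, so it cannot be evaluated by simply summing logs. One possible approach, sketched here under stated assumptions, is to compute each component in log space and combine them with the log-sum-exp trick. The function names (`bigrams`, `log_mixture_score`, `pick_lambda`) are hypothetical, and `log_uni` and `log_bi` are assumed to be the already-summed, smoothed (finite) log likelihoods of a text under the unigram and bigram models for one class.

```python
import math


def bigrams(tokens):
    """Adjacent word pairs of a token list, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def log_mixture_score(log_prior, log_uni, log_bi, lam):
    """Log of (1 - lam) * P(Y) * prod P(w|Y) + lam * P(Y) * prod P(b|Y).

    log_uni and log_bi are the summed log likelihoods of the text under the
    unigram and bigram models for class Y (assumed finite, i.e. smoothed).
    """
    terms = []
    if lam < 1.0:
        terms.append(math.log(1.0 - lam) + log_uni)
    if lam > 0.0:
        terms.append(math.log(lam) + log_bi)
    # Log-sum-exp: factor out the largest term so exp() never underflows to 0
    # for both terms at once.
    m = max(terms)
    return log_prior + m + math.log(sum(math.exp(t - m) for t in terms))


def pick_lambda(dev_accuracy, candidates):
    """Return the candidate lambda with the highest accuracy.

    dev_accuracy is a hypothetical callable mapping lambda -> accuracy.
    """
    return max(candidates, key=dev_accuracy)
```

With `lam = 0` the score reduces to the unigram model and with `lam = 1` to the bigram model, so a sweep over values in between shows directly whether mixing helps.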
Provided Code Skeleton
We have provided all the code you need to get started on your MP (zip file). For Part 1, you are provided the following. The docstrings in the Python files explain the purpose of each function.
For Part 2, you are provided the following. The docstrings in the Python files explain the purpose of each function.
Deliverables
This MP will be submitted via Compass. Please upload only the following files to Compass.
Report Checklist
Your report should briefly describe your implemented solution and fully answer the questions for every part of the assignment. Your description should focus on the most "interesting" aspects of your solution, i.e., any non-obvious implementation choices and parameter settings, and what you have found to be especially important for getting good performance. Feel free to include pseudocode or figures if they are needed to clarify your approach. Your report should be self-contained and should (ideally) make it possible for us to understand your solution without having to run your source code.
WARNING: You will not get credit for any solutions that you have obtained but not included in your report! Make only ONE submission per team. Attach only the required deliverables in Compass. Your report must be a formatted PDF document. Pictures and example outputs should be incorporated into the document. Exception: items that are very large or unsuitable for inclusion in a PDF document (e.g. videos or animated GIFs) may be put on the web, with a URL included in your report.
Extra credit: We reserve the right to give bonus points for any advanced exploration or especially challenging or creative solutions that you implement. This includes, but is not restricted to, the extra credit suggestion given above.