MP 6: Spark MLlib

Introduction

This week we’ll be diving into another aspect of Spark: MLlib. Spark MLlib provides an API to run machine learning algorithms on RDDs so that we can do ML on the cluster with the benefit of parallelism / distributed computing.

Note: Spark has 2 APIs for MLlib – one based on RDDs and one based on DataFrames. You can choose to use either one for this MP. The DataFrame-based API will be better supported in the future, but the RDD API has been around longer, so you may find more useful information about it on sites like StackOverflow than you will for the DataFrame API.

Machine Learning Crash Course

We’ll be considering 3 types of machine learning in the MP this week:

  1. Classification: predicting a discrete label for a data point (e.g. whether a review is positive or negative)
  2. Regression: predicting a continuous value for a data point (e.g. a review’s helpfulness score)
  3. Clustering: grouping unlabeled data points together by similarity (e.g. businesses by location)

For most of the algorithms / feature extractors that MLlib provides, there is a common pattern:

  1. Fit: Trains the model, using training data to adjust the model’s internal parameters.
  2. Transform: Uses the fitted model to predict the label/value of novel data (data not used in the training of the model).

If you go on to do more data science work, you’ll see that this 2-phase ML pattern is common in other ML libraries, like scikit-learn.

Things are a bit more explicit in MLlib than in other ML libraries like scikit-learn, in part because of the way RDDs are handled (i.e. lazy evaluation). We often have to explicitly note when we want to predict data or when we’re piping data through different steps of our model’s setup.
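Here’s a minimal sketch of the two-phase pattern using MLlib’s RDD API (the variable names are placeholders, not part of the assignment):

from pyspark.mllib.feature import IDF

# Phase 1 (fit): learn document frequencies from a corpus of term-frequency vectors
idf_model = IDF().fit(tf_vectors)

# Phase 2 (transform): apply the fitted model to data
tfidf_vectors = idf_model.transform(tf_vectors)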

It’ll be extremely valuable to look up PySpark’s documentation and MLlib examples when working on this week’s MP. The MP examples we’ll be giving you do not require deep knowledge of Machine Learning concepts to complete. However, you will need to be a good documentation reader to navigate MLlib’s nuances.

Examples

TF-IDF Naive Bayes Yelp Review Classification

Extracting Features

Remember last week when you found out which words were correlated with negative reviews by calculating the probability of a word occurring in a review? PySpark lets you do something like this extremely easily to calculate the Term Frequency - Inverse Document Frequency (tf-idf) features of a set of texts.

Let’s define some terms:

  1. Term Frequency (TF): the number of times a term appears within a single document. Words that occur often in a document have a high TF.
  2. Inverse Document Frequency (IDF): a measure of how rare a term is across the whole collection of documents. Words that appear in nearly every document (like “the”) have a low IDF; rare words have a high IDF.

TF-IDF combines the previous two concepts. Suppose only a few sentences in a large book mention “cats”. Because the word is rare across the book, it has a high IDF; within each of those sentences, its TF tells us how prominent the word is. In general, a word gets a high TF-IDF score in a document when it appears often in that document but rarely in the corpus as a whole.

There’s a bit of extra math behind calculating TF-IDF, but for this MP it is sufficient to know that it is a relatively reliable way of guessing the relevance of a word in the context of a large body of data.
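For reference, the MLlib documentation describes its (smoothed) IDF calculation roughly as follows, where |D| is the total number of documents and df(t, D) is the number of documents containing term t:

idf(t, D) = log((|D| + 1) / (df(t, D) + 1))
tfidf(t, d, D) = tf(t, d) * idf(t, D)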

You’ll also note that we’re making use of a HashingTF. This is a really quick way to compute the term frequencies of words: instead of building a dictionary of every word in the corpus, it uses a hash function to map each word to an index in a fixed-size feature vector and counts how many times each index is hit.
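Here’s a toy illustration of what HashingTF does (the tiny numFeatures is just for demonstration):

from pyspark.mllib.feature import HashingTF

htf = HashingTF(numFeatures=16)
vec = htf.transform(["cats", "like", "cats"])
# vec is a SparseVector of length 16; barring a hash collision at this
# tiny size, the index htf.indexOf("cats") holds the count 2.0 and the
# index for "like" holds 1.0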

Classifying Features

We’ll also be using a Naive Bayes classifier. This type of classifier looks at a set of data and labels, and constructs a model to predict the label given the data using probabilistic means.
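In rough terms, a Naive Bayes classifier picks the label that maximizes the probability of the label given the features, under the (naive) assumption that the features are independent of one another:

P(label | features) ∝ P(label) * P(feature_1 | label) * ... * P(feature_n | label)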

Again, it’s not necessary to know the inner workings of Naive Bayes; just know that we’ll be using it to build a classifier.

Constructing a Model

To construct a model, we’ll need an RDD of (key, value) pairs with our labels as keys and our features as values. First, we’ll need to extract those features from the text. We’re going to use TF-IDF values as features, so we’ll calculate them for all of our text data first.

We’ll start from the assumption that you’ve transformed the data so that the RDD contains (label, array_of_words) pairs, where label is 0 if the review is negative and 1 if it is positive.
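For illustration, a tiny hand-built RDD in that shape might look like this (the reviews here are made up):

labeled_data = sc.parallelize([
    (1.0, ["great", "food", "friendly", "staff"]),
    (0.0, ["terrible", "service", "never", "again"]),
])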

Here’s how we’ll extract the TF-IDF features:

# Imports for feature extraction and labeling
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint

# Feed HashingTF the array of words
tf = HashingTF().transform(labeled_data.values())

# Fit the IDF model on the term frequencies
idf = IDF(minDocFreq=5).fit(tf)

# Use the fitted IDF model to transform the term frequencies into TF-IDF vectors
tfidf = idf.transform(tf)

# Reassemble the data into (label, feature) K,V pairs
labels = labeled_data.keys()
zipped_data = (labels.zip(tfidf)
                     .map(lambda x: LabeledPoint(x[0], x[1]))
                     .cache())

Now that we have our labels and our features in one RDD, we can train our model:

from pyspark.mllib.classification import NaiveBayes

# Do a random split so we can test our model on non-trained data
training, test = zipped_data.randomSplit([0.7, 0.3])

# Train our model with the training data
model = NaiveBayes.train(training)

Then, we can use this model to predict new data:

# Use the test data to get (actual label, predicted label) pairs from our model
test_preds = (test.map(lambda x: x.label)
                  .zip(model.predict(test.map(lambda x: x.features))))

If we look at this test_preds RDD, we’ll see pairs of each data point’s actual label and the label the model predicted.

However, if we want a more precise measurement of how our model fared, PySpark gives us MulticlassMetrics, which we can use to measure our model’s performance.

from pyspark.mllib.evaluation import MulticlassMetrics

# Build the same (actual, predicted) pairs for the training data
train_preds = (training.map(lambda x: x.label)
                       .zip(model.predict(training.map(lambda x: x.features))))

trained_metrics = MulticlassMetrics(train_preds.mapValues(float))
test_metrics = MulticlassMetrics(test_preds.mapValues(float))

print(trained_metrics.confusionMatrix().toArray())
print(trained_metrics.precision())

print(test_metrics.confusionMatrix().toArray())
print(test_metrics.precision())

Analyzing our Results

MulticlassMetrics lets us see the “confusion matrix” of our model, which shows how many times our model chose each label given the actual label of the data point.

The columns correspond to predicted labels, and the rows correspond to actual labels. So, confusion_matrix[0][1] is the number of items predicted as label 1 that were in actuality label 0.

Thus, we want our confusion matrix to have as many items on the diagonals as possible, as these represent items that were correctly predicted.

We can also get the precision, a simpler metric that tells us the fraction of items we predicted correctly.

Here are our results for this example:

# Training Data Confusion Matrix:
[[ 2019245.   115503.]
 [  258646.   513539.]]
# Training Data Accuracy:
0.8712908071840665

# Testing Data Confusion Matrix:
[[ 861056.   55386.]
 [ 115276.  214499.]]
# Testing Data Accuracy:
0.8630559525347512
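As a sanity check, the accuracy is just the diagonal of the confusion matrix divided by the total number of items: for the training data, (2019245 + 513539) / (2019245 + 115503 + 258646 + 513539) ≈ 0.8713, which matches the reported value.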

Not terrible. As you can see, the model predicts the training data slightly better than the test data, because the training data is what the model was fit on.

Extending the Example

What if, instead of just classifying positive and negative reviews, we try to classify reviews based on their 1-5 star review score? We can do this by changing the labels to be values from 1 to 5 instead of just 0 or 1.
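For instance, something like this minimal sketch, assuming each record in reviews is a (stars, array_of_words) tuple produced by your preprocessing:

# Use the 1-5 star rating itself as the label instead of a binary flag
labeled_data = reviews.map(lambda r: (float(r[0]), r[1]))

Rerunning the same pipeline with these labels, here’s what we get: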

# Training Data Confusion Matrix:
[[ 130042.   38058.   55682.  115421.  193909.]
 [  27028.   71530.   26431.   55381.   95007.]
 [  35787.   22641.  102753.   71802.  122539.]
 [  72529.   45895.   69174.  254838.  246081.]
 [ 113008.   73249.  108349.  225783.  535850.]]
# Training Data Accuracy:
0.37645263439801124

# Testing Data Confusion Matrix:
[[  33706.   20317.   27553.   54344.   90325.]
 [  15384.   10373.   14875.   28413.   46173.]
 [  18958.   13288.   19389.   37813.   59746.]
 [  36921.   25382.   37791.   76008.  120251.]
 [  57014.   37817.   55372.  112851.  194319.]]
# Testing Data Accuracy:
0.268241369417615

Ouch. What went wrong? Well, a couple of things. One thing that hurts us is that Naive Bayes is, well, naive. While we intuitively know that the ratings 1 through 5 are ordered, NB has no concept that items labeled 4 and 5 are probably going to be closer to each other than a pair labeled 1 and 5.

Also, this example shows why testing on the training data doesn’t have much utility. While an accuracy of 0.376 isn’t great, it’s still a lot better than 0.268. Validating on the training data would lead us to think that our model is substantially more accurate than it actually is.

Conclusion

The full code of the first example is in bayes_binary_tfidf.py, and the second “extended” example is in bayes_tfidf.py.

Full Example Code

MP Activities

NOTE:

1. Amazon Review Score Classification

This week, we’ll be using an Amazon dataset of food reviews. You can find this dataset in HDFS at /data/amazon/amazon_food_reviews.csv. The dataset has the following columns:

Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text

Similar to the Yelp Dataset, Amazon’s food review dataset provides you with some review text and a review score. Use MLlib to classify these reviews by score. You can use any classifiers and feature extractors that are available. You may also choose to classify either on positive/negative or the more granular stars rating.
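Here’s a hedged sketch of loading and parsing the CSV with the RDD API; the parsing details (quoting, reviews containing embedded newlines) are assumptions you may need to adjust:

import csv
from io import StringIO

raw = sc.textFile("/data/amazon/amazon_food_reviews.csv")
header = raw.first()

def parse(line):
    # Split one CSV line into its 10 columns
    return next(csv.reader(StringIO(line)))

rows = raw.filter(lambda line: line != header).map(parse)
# Column indices follow the list above: Score is index 6, Text is index 9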

Notes:

2. Amazon Review Helpfulness Regression

Amazon also gives a metric of “helpfulness”. The dataset has the number of users who marked a review as helpful, and the number of users who voted either up or down on the review.

Define a review’s helpfulness score as HelpfulnessNumerator / HelpfulnessDenominator.

Construct and train a model that uses a regression algorithm to predict a review’s helpfulness score from its text.
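As a starting point, here’s a minimal sketch of computing that score, reusing the parsed rows from the earlier sketch (skipping reviews with no votes to avoid dividing by zero is our own assumption):

# HelpfulnessNumerator is column 4, HelpfulnessDenominator is column 5
helpfulness = (rows.filter(lambda r: int(r[5]) > 0)
                   .map(lambda r: float(r[4]) / float(r[5])))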

Notes:

3. Yelp Business Clustering

Going back to the Yelp dataset, suppose we want to find clusters of businesses in the Urbana/Champaign area. Where do businesses aggregate geographically? Could we predict from a set of coordinates which cluster a given business belongs to? Use K-Means to come up with a clustering model for the U-C area.

How can we determine how good our model is? The simplest way is to just graph it and see whether the clusters match what we would expect. More formally, we can use the Within Set Sum of Squared Error (WSSSE) to determine the optimal number of clusters: if we plot the error for multiple values of k, we can see the point of diminishing returns for adding more clusters. You should pick a value of k that is around this point of diminishing returns.
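A minimal sketch of that elbow-method loop, assuming coords is an RDD of [longitude, latitude] pairs you’ve extracted from the business data:

from pyspark.mllib.clustering import KMeans

for k in range(2, 11):
    model = KMeans.train(coords, k, maxIterations=10)
    # computeCost returns the WSSSE for this clustering
    print(k, model.computeCost(coords))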

Notes:

Report Component

This assignment will not be autograded. For each problem, answer the following questions and submit your answers to Moodle as a single PDF file for the entire assignment. Note: You must submit your report on Moodle to get a grade for this MP.

  1. What features of the data did you use to create your model?
  2. How did you evaluate the performance of your model?
  3. What was the performance of your model? (i.e. using RegressionMetrics, MulticlassMetrics, or WSSSE)
  4. What parameters did you change to try to improve the performance of your model? What was the effect of changing these parameters?

In addition, be sure to include these components in your report:

Submission

MP 6 is due on Tuesday, March 13th, 2018 at 11:59PM.

Grading Note: This week’s MP will not be graded by an autograder. Your grade will come from the submitted report materials.

You can find starter code and the Git repository where you should submit your code here:

If you have issues accessing your repository, make sure that you’ve signed into Gitlab. If you still do not have access, please make a private post on the course Piazza.

Please write your solutions as indicated in the following files:

… and commit your code to Gitlab like this:

git add .
git commit -m "Completed MP" # Your commit message doesn't need to match this
git push

WARNING: Our autograder runs on Python3. If you write code for this assignment that only works on Python2, we will be unable to grade your code. Code that contains syntax errors (i.e. unable to be interpreted by the Python interpreter) will receive no credit.