This week we’ll be diving into another aspect of Spark: MLlib. Spark MLlib provides an API to run machine learning algorithms on RDDs so that we can do ML on the cluster with the benefit of parallelism / distributed computing.
Note: Spark has two APIs for MLlib – one based on RDDs and one based on DataFrames. You can choose to use either one for this MP. The DataFrame-based API will be better supported in the future, but the RDD-based API has been around longer, so you may find more useful information about it on sites like StackOverflow than you will for the DataFrame API.
We’ll be considering three types of machine learning in the MP this week: classification, regression, and clustering.
For most of the algorithms / feature extractors that MLlib provides, there is a common two-phase pattern: first fit a model or feature extractor to the data, then use the fitted object to transform data or make predictions.
If you go on to do more data science work, you’ll see that this two-phase ML pattern is common in other ML libraries, like scikit-learn. Things are a bit more explicit in MLlib than in libraries like scikit-learn because of the way RDDs are handled (i.e. lazy evaluation): we often have to explicitly note when we want to predict data, or when we’re piping data through different steps of our model’s setup.
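To make the pattern concrete, here’s a minimal sketch using one of MLlib’s feature scalers (StandardScaler is just an illustrative choice, and vectors_rdd is a hypothetical RDD of feature vectors):

from pyspark.mllib.feature import StandardScaler

# Phase 1: fit() scans the data and learns parameters (here, column statistics)
scaler_model = StandardScaler().fit(vectors_rdd)

# Phase 2: transform() applies the fitted model to produce new data
scaled_rdd = scaler_model.transform(vectors_rdd)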
It’ll be extremely valuable to look up PySpark’s documentation and MLlib examples when working on this week’s MP. The MP examples we’ll be giving you do not require deep knowledge of Machine Learning concepts to complete. However, you will need to be a good documentation reader to navigate MLlib’s nuances.
Remember last week when you found out which words were correlated with negative reviews by calculating the probability of a word occurring in a review? PySpark lets you do something like this extremely easily to calculate the Term Frequency – Inverse Document Frequency (TF-IDF) features of a set of texts.
Let’s define some terms:

- Term Frequency (TF): how often a term appears in a given document.
- Inverse Document Frequency (IDF): a measure of how rare a term is across all of the documents; words that appear in only a few documents get a high IDF.
TF-IDF combines the previous two concepts: a word scores highly when it appears often in a document (high TF) but rarely in the corpus overall (high IDF). Suppose a large book had only a few sentences that refer to “cats”. We’d rank those rare sentences by the frequency of “cats” in each of them.
There’s a bit of extra math behind calculating TF-IDF, but for this MP it is sufficient to know that it is a relatively reliable way of guessing the relevance of a word in the context of a large body of data.
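For reference, a common formulation (and the one Spark’s IDF documentation gives, with the +1 terms to avoid division by zero) is:

\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D), \qquad \mathrm{IDF}(t, D) = \log\frac{|D| + 1}{\mathrm{DF}(t, D) + 1}

where TF(t, d) is the number of times term t appears in document d, |D| is the total number of documents, and DF(t, D) is the number of documents that contain t.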
You’ll also note that we’re making use of a HashingTF. This is just a really quick way to compute the term frequency of words: it uses a hash function to map each word to an index in a fixed-size feature vector, so frequencies can be counted quickly without building a dictionary of every word in the corpus.
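For example, here’s a minimal sketch of HashingTF on its own (the toy documents and the existing SparkContext sc are assumptions):

from pyspark.mllib.feature import HashingTF

docs = sc.parallelize([
    ['cats', 'are', 'great'],
    ['dogs', 'are', 'great', 'too'],
])

# Each document becomes a SparseVector of hashed term counts
tf_vectors = HashingTF().transform(docs)
print(tf_vectors.first())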
We’ll also be using a Naive Bayes classifier. This type of classifier looks at a set of data and labels, and constructs a model to predict the label given the data using probabilistic means.
Again, it’s not necessary to know the inner workings of Naive Bayes, just know that we’ll be using it to build a classifier.
To construct a model, we’ll need an RDD of (key, value) pairs with our labels as keys and our features as values. First, we’ll need to extract those features from the text. We’re going to use TF-IDF as our features, so we’ll calculate it for all of our text data first.

We’ll start with the assumption that you’ve transformed the data into an RDD of (label, array_of_words) pairs. To begin, label will be 0 if the review is negative and 1 if the review is positive.
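If you’re starting from (score, review_text) pairs, one hypothetical way to get that shape (the reviews RDD, the score cutoff, and the whitespace tokenization are all illustrative assumptions):

# Map a 1-5 star score to a binary label and split the text into words
labeled_data = (reviews
                .map(lambda r: (0 if r[0] <= 2 else 1, r[1].lower().split()))
                .cache())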
Here’s how we’ll extract the TF-IDF features:
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint

# Feed HashingTF the array of words
tf = HashingTF().transform(labeled_data.values())

# Pipe term frequencies into the IDF
idf = IDF(minDocFreq=5).fit(tf)

# Transform the term frequencies into TF-IDF vectors
tfidf = idf.transform(tf)

# Reassemble the data into (label, features) LabeledPoints
labels = labeled_data.keys()
zipped_data = (labels.zip(tfidf)
               .map(lambda x: LabeledPoint(x[0], x[1]))
               .cache())
Now that we have our labels and our features in one RDD, we can train our model:
from pyspark.mllib.classification import NaiveBayes

# Do a random split so we can test our model on non-trained data
training, test = zipped_data.randomSplit([0.7, 0.3])

# Train our model with the training data
model = NaiveBayes.train(training)
Then, we can use this model to predict new data:
# Use the test data and get predicted labels from our model
test_preds = (test.map(lambda x: x.label)
              .zip(model.predict(test.map(lambda x: x.features))))

# Do the same for the training data, so we can compare the two later
train_preds = (training.map(lambda x: x.label)
               .zip(model.predict(training.map(lambda x: x.features))))
If we look at this test_preds RDD, we’ll see pairs of the actual label and the label the model predicted.
However, if we want a more precise measurement of how our model fared, PySpark gives us MulticlassMetrics, which we can use to measure our model’s performance.
from pyspark.mllib.evaluation import MulticlassMetrics

trained_metrics = MulticlassMetrics(train_preds.mapValues(float))
test_metrics = MulticlassMetrics(test_preds.mapValues(float))

print(trained_metrics.confusionMatrix().toArray())
print(trained_metrics.precision())

print(test_metrics.confusionMatrix().toArray())
print(test_metrics.precision())
MulticlassMetrics lets us see the “confusion matrix” of our model, which shows us how many times our model chose each label given the actual label of the data point.
The columns correspond to the predicted value and the rows to the actual value. So, we read that confusion_matrix[i][j] is the number of items predicted as having label j that were in actuality label i.
Thus, we want our confusion matrix to have as many items on the diagonal as possible, as these represent items that were correctly predicted.
We can also get the precision, a simpler metric of how many items we predicted correctly.
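In the overall, multiclass sense used here (where it coincides with accuracy), it’s just the diagonal of the confusion matrix C divided by its total:

\mathrm{precision} = \frac{\sum_{i} C_{ii}}{\sum_{i,j} C_{ij}}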
Here’s our results for this example:
# Training Data Confusion Matrix:
[[ 2019245.   115503.]
 [  258646.   513539.]]
# Training Data Accuracy: 0.8712908071840665

# Testing Data Confusion Matrix:
[[ 861056.   55386.]
 [ 115276.  214499.]]
# Testing Data Accuracy: 0.8630559525347512
Not terrible. As you can see, prediction precision is slightly better on the training data, because it’s the data the model was trained on.
What if instead of just classifying reviews as positive or negative, we try to classify them based on their 1-5 star review score? We can do this by changing the labels to be values from 1 to 5 instead of just 0 and 1:
# Training Data Confusion Matrix:
[[ 130042.   38058.   55682.  115421.  193909.]
 [  27028.   71530.   26431.   55381.   95007.]
 [  35787.   22641.  102753.   71802.  122539.]
 [  72529.   45895.   69174.  254838.  246081.]
 [ 113008.   73249.  108349.  225783.  535850.]]
# Training Data Accuracy: 0.37645263439801124

# Testing Data Confusion Matrix:
[[ 33706.   20317.   27553.   54344.   90325.]
 [ 15384.   10373.   14875.   28413.   46173.]
 [ 18958.   13288.   19389.   37813.   59746.]
 [ 36921.   25382.   37791.   76008.  120251.]
 [ 57014.   37817.   55372.  112851.  194319.]]
# Testing Data Accuracy: 0.268241369417615
Ouch. What went wrong? Well, a couple of things. One thing that hurts us is that Naive Bayes is, well, naive. While we intuitively know that the labels 1, 2, 3, 4, 5 are ordered, Naive Bayes treats them as unrelated categories: it has no concept that items labeled 4 and 5 are probably going to be closer than a pair labeled 1 and 5.
Also, in this example we see a case where testing on our training data doesn’t have much utility. While an accuracy of 0.376 isn’t great, it’s still a lot better than 0.268. Validating on the training data would lead us to think that our model is substantially more accurate than it actually is.
The full code of the first example is in bayes_binary_tfidf.py, and the second “extended” example is in …
This week, we’ll be using an Amazon dataset of food reviews. You can find this dataset in HDFS at /data/amazon/amazon_food_reviews.csv. The dataset has the following columns:
Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text
Similar to the Yelp Dataset, Amazon’s food review dataset provides you with some review text and a review score. Use MLlib to classify these reviews by score. You can use any classifiers and feature extractors that are available. You may also choose to classify either on positive/negative or the more granular stars rating.
- Use csv.reader to parse your data if you use the RDD API (a sketch follows this list). Using str.split(',') is insufficient, as there will be commas in the Text field of the review.
- … HelpfulnessDenominator for feature extraction.
- Use MulticlassMetrics to output the precision of your model. Include this output in your submission.
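Here’s a minimal sketch of that parsing, assuming the file’s first line is a header row and that no review spans multiple lines (variable names are illustrative):

import csv

def parse_line(line):
    # csv.reader respects quoting, so commas inside the Text field are handled
    return next(csv.reader([line]))

raw = sc.textFile('/data/amazon/amazon_food_reviews.csv')
header = raw.first()
rows = raw.filter(lambda line: line != header).map(parse_line)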
Amazon also gives a metric of “helpfulness”. The dataset has the number of users who marked a review as helpful, and the number of users who voted either up or down on the review.
Define a review’s helpfulness score as HelpfulnessNumerator / HelpfulnessDenominator. Construct and train a model that uses a regression algorithm to predict a review’s helpfulness score from its text; a sketch of one possible setup follows the notes below.
- Use csv.reader to parse your data. Using str.split(',') is insufficient, as there will be commas in the Text field of the review.
- … Score for feature extraction.
- Use pyspark.mllib.regression.LinearRegressionWithSGD as your regression model.
- Use pyspark.mllib.evaluation.RegressionMetrics to output the rootMeanSquaredError. You want to minimize the error.
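Here’s a minimal sketch of that pipeline, assuming rows is a parsed RDD like the one sketched earlier and tfidf is an RDD of TF-IDF vectors built from the Text of the same reviews (all names are illustrative):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from pyspark.mllib.evaluation import RegressionMetrics

# Keep only reviews with at least one helpfulness vote to avoid dividing by zero
voted = rows.filter(lambda r: float(r[5]) > 0)  # r[4], r[5] = Numerator, Denominator
helpfulness = voted.map(lambda r: float(r[4]) / float(r[5]))

data = (helpfulness.zip(tfidf)
        .map(lambda x: LabeledPoint(x[0], x[1]))
        .cache())
train, test = data.randomSplit([0.7, 0.3])
model = LinearRegressionWithSGD.train(train, iterations=100)

# RegressionMetrics expects (prediction, observation) pairs
preds = (test.map(lambda p: float(model.predict(p.features)))
         .zip(test.map(lambda p: p.label)))
print(RegressionMetrics(preds).rootMeanSquaredError)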
Going back to the Yelp dataset, suppose we want to find clusters of businesses in the Urbana/Champaign area. Where do businesses aggregate geographically? Could we predict from a set of coordinates which cluster a given business belongs to? Use K-Means to come up with a clustering model for the U-C area.
How can we determine how good our model is? The simplest way is to just graph it, and see if the clusters match what we would expect. More formally, we can use Within Set Sum of Squared Error (WSSSE) to determine the optimal number of clusters. If we plot the error for multiple values of k, we can see the point of diminishing returns to adding more clusters. You should pick a value of k that is around this point of diminishing return.
- Use pyspark.ml.clustering.KMeans as your clustering algorithm (see the sketch below).
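Here’s a minimal sketch of the elbow search, assuming Spark 2.x (where KMeansModel.computeCost returns the WSSSE) and a hypothetical DataFrame businesses_df with latitude and longitude columns:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Pack the coordinates into the 'features' vector column KMeans expects
assembler = VectorAssembler(inputCols=['latitude', 'longitude'], outputCol='features')
points = assembler.transform(businesses_df)

# Train a model for each k and print its WSSSE; look for where it stops dropping quickly
for k in range(2, 11):
    model = KMeans(k=k, seed=1).fit(points)
    print(k, model.computeCost(points))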
This assignment will not be autograded. For each problem, answer the following questions and submit your answers to Moodle as a single PDF file for the entire assignment. Note: You must submit your report on Moodle to get a grade for this MP.
In addition, be sure to include these components in your report:
MP 6 is due on Tuesday, March 13th, 2018 at 11:59PM.
Grading Note: This week’s MP will not be graded by an autograder. Your grade will come from the submitted report materials.
You can find starter code and the Git repository where you should submit your code here:
If you have issues accessing your repository, make sure that you’ve signed into Gitlab. If you still do not have access, please make a private post on the course Piazza.
Please write your solutions as indicated in the following files:
… and commit your code to Gitlab like this:
git add .
git commit -m "Completed MP"  # Your commit message doesn't need to match this
git push
WARNING: Our autograder runs on Python 3. If you write code for this assignment that only works on Python 2, we will be unable to grade your code. Code that contains syntax errors (i.e. code that cannot be parsed by the Python interpreter) will receive no credit.