MP6: MapReduce

CS 241


IMPORTANT

You will be using fork() in this MP. You should understand what it means to fork bomb.

If you fork bomb the nightly autograder, you may be excluded from all future nightly autograders for this MP.
If you fork bomb the final autograder, you will get a 0 for this MP.
...once a fork bomb is detected, the autograder and all the other processes die. So nightly autograders may not always get out overnight.

Make sure to always run all five testers before committing your code.

Introduction

In 2004, Google developed a general framework for processing large data sets on clusters of computers. You can read more about MapReduce on Wikipedia, but we will explain everything you need to know below.

To explain what MapReduce does, we'll use a small dataset (the first few lines of a famous poem by Robert Frost):

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could.


To run MapReduce, we first split the dataset into small pieces. For this example, we will split the dataset by the four lines of the poem:
Data Set #1
Input: "Two roads diverged in a yellow wood,"
Data Set #2
Input: "And sorry I could not travel both"
Data Set #3
Input: "And be one traveler, long I stood"
Data Set #4
Input: "And looked down one as far as I could."


As part of the input to any MapReduce program, a user will provide a map() function. This function will map the input into a series of (key, value) pairs. For this example, let the map() function simply count the number of each letter (a-z) in the data set.

This MapReduce algorithm will spawn 1 process per data set and run the map() function on each dataset:
Data Set #1
Input: "Two roads diverged in a yellow wood,"
pid = 1001
map(): a: 2, b: 0, c: 0, d: 4, e: 3, f: 0, g: 1, h: 0, i: 2, j: 0, k: 0, l: 2, m: 0, n: 1, o: 5, p: 0, q: 0, r: 2, s: 1, t: 1, u: 0, v: 1, w: 3, x: 0, y: 1, z: 0
Data Set #2
Input: "And sorry I could not travel both"
pid = 1002
map(): a: 2, b: 1, c: 1, d: 2, e: 1, f: 0, g: 0, h: 1, i: 1, j: 0, k: 0, l: 2, m: 0, n: 2, o: 4, p: 0, q: 0, r: 3, s: 1, t: 3, u: 1, v: 1, w: 0, x: 0, y: 1, z: 0
Data Set #3
Input: "And be one traveler, long I stood"
pid = 1003
map(): a: 2, b: 1, c: 0, d: 2, e: 4, f: 0, g: 1, h: 0, i: 1, j: 0, k: 0, l: 2, m: 0, n: 3, o: 4, p: 0, q: 0, r: 2, s: 1, t: 2, u: 0, v: 1, w: 0, x: 0, y: 0, z: 0
Data Set #4
Input: "And looked down one as far as I could."
pid = 1004
map(): a: 4, b: 0, c: 1, d: 4, e: 2, f: 1, g: 0, h: 0, i: 1, j: 0, k: 1, l: 2, m: 0, n: 3, o: 5, p: 0, q: 0, r: 1, s: 2, t: 0, u: 1, v: 0, w: 1, x: 0, y: 0, z: 0


As the map() processes produce output, the MapReduce algorithm must collect the output and reduce() the output of each process into a single data set to return to the user. The reduce() function is called every time the same key is seen from two different processes. In this example, our reduce function will simply add the values of the two keys (eg: a: 2 and a: 4 will result in a: 6).

The reduce() is being done in a worker thread of the process that called the MapReduce framework initially, and not a new or recycled map() process. Adding to our diagram:
Data Set #1
Input: "Two roads diverged in a yellow wood,"
pid = 1001
map(): a: 2, b: 0, c: 0, d: 4, e: 3, f: 0, g: 1, h: 0, i: 2, j: 0, k: 0, l: 2, m: 0, n: 1, o: 5, p: 0, q: 0, r: 2, s: 1, t: 1, u: 0, v: 1, w: 3, x: 0, y: 1, z: 0
Data Set #2
Input: "And sorry I could not travel both"
pid = 1002
map(): a: 2, b: 1, c: 1, d: 2, e: 1, f: 0, g: 0, h: 1, i: 1, j: 0, k: 0, l: 2, m: 0, n: 2, o: 4, p: 0, q: 0, r: 3, s: 1, t: 3, u: 1, v: 1, w: 0, x: 0, y: 1, z: 0
Data Set #3
Input: "And be one traveler, long I stood"
pid = 1003
map(): a: 2, b: 1, c: 0, d: 2, e: 4, f: 0, g: 1, h: 0, i: 1, j: 0, k: 0, l: 2, m: 0, n: 3, o: 4, p: 0, q: 0, r: 2, s: 1, t: 2, u: 0, v: 1, w: 0, x: 0, y: 0, z: 0
Data Set #4
Input: "And looked down one as far as I could."
pid = 1004
map(): a: 4, b: 0, c: 1, d: 4, e: 2, f: 1, g: 0, h: 0, i: 1, j: 0, k: 1, l: 2, m: 0, n: 3, o: 5, p: 0, q: 0, r: 1, s: 2, t: 0, u: 1, v: 0, w: 1, x: 0, y: 0, z: 0
pid = 1000, worker thread
reduce(), reduce(), reduce(), and more reduce()'ing
... ... ...
Result: a: 10, b: 2, c: 2, d: 12, e: 10, f: 1, g: 2, h: 1, i: 5, j: 0, k: 1, l: 8, m: 0, n: 9, o: 18, p: 0, q: 0, r: 8, s: 5, t: 6, u: 2, v: 3, w: 4, x: 0, y: 2, z: 0


In this MP, you will be provided the data sets (already split up for you) and you will need to:


What you must do...

What we provide for you:

Besides the tester programs, we provide you a libdictionary much line you made in MP1. You will find there are six functions:

Additionally, all the libdictionary except _init() and _destroy() functions are thread-safe.

You will find we also provide a read_from_fd() helper function, which we will explain later in this file.

What you must do for this MP:

You must complete the five core functions that make up libmapreduce:

Compile and Run

As always, compile using:
make clean
make
We provide five test cases: The expected outputs are:
[netid@linux1 mp6]$ ./test1
letters: 8
[netid@linux1 mp6]$ ./test2
letters: 21
[netid@linux1 mp6]$ ./test3
a: 10
b: 2
c: 2
d: 12
e: 10
f: 1
g: 2
h: 1
i: 5
j: 0
k: 1
l: 8
m: 0
n: 9
o: 18
p: 0
q: 0
r: 8
s: 5
t: 6
u: 2
v: 3
w: 4
x: 0
y: 2
z: 0
[netid@linux1 mp6]$ ./test4
the: 1804
and: 912
alice: 385
some-word-that-wont-exist: (null)
[netid@linux1 mp6]$ ./test5
Sleeping for 6 seconds (job #1)...
Sleeping for 5 seconds (job #2)...
Sleeping for 4 seconds (job #3)...
Sleeping for 3 seconds (job #4)...
value: (null)
value: (null)
value: (null)
value: 1
value: 2
value: 3
value: 4
value: 4
value: 4
value: 4
(NOTE: ./test5 may be slightly different depending on how you create your IPC mechanism.)

Grading

For valgrind memory grading, we will only be testing if your program cleaned up all memory in your original, parent process. You should run:

valgrind --child-silent-after-fork=yes --leak-check=full ./test#
...when running your valgrind tests.

Finally, remember that big red warning at the top of this page.

Lab Report

In CS 241, you will need to submit a written report to explain how you went about implementing the MP assignment. In the report for MP6, you should at least answer these questions: This lab report must be named NETID-mp6.pdf and be in PDF format. See cs241.html for full details.

Grading, Submission, and Other Details

Please fully read cs241.html for more details on grading, submission, and other topics that are shared between all MPs in CS 241.