MP4: Multithreaded Sorting Program

CS 241, Spring 2012

Due: Mar 2, Friday 11:59 PM


PURPOSE

More and more programs today are being programmed as multithread applications. This MP is to help you understand how a multithreaded program works, what the advantage of multithreading is in comparison to a single-threaded program, and the pitfalls that occur while designing a program to work in a parallel manner. You will become familiar with various pthread library functions during the MP(e.g., for thread creation/join, synchronization among multiple threads, etc.).

You are tasked with writing an application which will sort its input in a parallel fashion. Initially, you assume that the entire data set has been given to you in the form of separate independent files. You will proceed sorting these files by spawning a worker thread to sort each file. After the worker threads complete their execution, you will merge the files into one, by taking two files at a time. In the end you will have one large file which contains all the input data sorted. This method significantly improves the execution time, as the the threads tend to work in parallel.

BACKGROUND

The pthread (POSIX thread) libraries are a standard based thread API for C/C++. It allows one to spawn a new concurrent process flow. It is most effective on multi-processor or multi-core systems where the process flow can be scheduled to run on another processor thus gaining speed through parallel or distributed processing. Threads require less overhead than "forking" or spawning a new process because the system does not initialize a new system virtual memory space and environment for the process.

TASK

[Part 1]

The main thread of the program reads the input file names from the command line. Upon reading the file names, the main thread creates one worker thread for each file and passes the file name or the handle of the file to the worker thread. You may assume that all file names are valid and distinct. You should not make any assumptions on the total number of files.

[Part 2]

The file names referred to in [Part 1] are files containing a list of random numbers. Each line will contain a single integer, the integer will fit within a standard C signed `int`, and the total number of lines will fit within a standard C `unsigned int`.

While the main thread continues to spawn worker threads, the worker threads in the meanwhile read the data from the named file, sort the data in increasing numeric order (e.g., 1 is before 6, and 6 is before 14, etc), then writes the sorted sequence of numbers out to another file. The file should be named .sorted (i.e. file1.txt.sorted or file2.txt.sorted). Note that it does not modify the original file, but stores the result in a new file.

As with the input file, the sorted file is a text file with one number per line. You are required to use the C standard function qsort() for sorting. If you are not familiar with qsort(), please use "man qsort" for details. Having finished writing the sorted file, the worker thread terminates. Before termination, each file prints the number of integers it has sorted and the name of the file it writes to in the following format:

This worker thread writes XXXXX lines to "YYYYY".

This can be accomplished by the following printf() command:

printf("This worker thread writes %d lines to \"%s\".\n", ...);

[Part 3]

The multiple sorted files created in [Part 2] are now to be merged into a single sorted file. You should take two consecutively listed files (depending on the order in which the file names were entered on the command line) and merge them maintaining the sorted order of integers. This should be done by a new thread. Do this merge step iteratively until you have a single file left. When merging the two files, if a number appears in both files, it should appear in the result file only once. You should delete all the intermediate files that you create during this procedure. (The system file tmpfile() will be helpful here; it returns a file pointer to a temporary created file with "w+" access permissions and ensures that the file is deleted at the end of the program. Please refer to man pages for more details.) This single file should contain all the numbers in sorted order from different files and must be named "sorted.txt". If you choose not to use tmpfile(), all files except the orignal input files, the sorted versions of all the orignal input files, and the final sorted.txt file must be deleted.

At each particular level of the merging tree (shown in the figure below), remember to wait for all the threads of the previous level to return before spawning new ones. The merge thread should display the total lines in each of the two files being merged together and the total number of lines after the merge.

Merged XXXXX lines and YYYYY lines into ZZZZZ lines.

...which can be accomplished with the following printf() command:

printf("Merged %d lines and %d lines into %d lines.\n", ...);

[Example Run]

Let us consider an example demonstrating this problem

NOTES

COMPILING AND RUNNING MP4

To compile, run the following commands from a command prompt on a Linux machine:

%> make clean
%> make

To run the executable,

%> ./mp4 [ ...]

Grading, Submission, and Other Details

Please fully read mp_grading.html for more details on grading, submission, and other topics that are shared between all MPs in CS 241.

If pthread_exit() is used, it is acceptable to have 5 blocks still reachable in valgrind's output.