Homework 1

Posted: 2018-01-24   |   Due: 2018-02-01 (by 10am)


Homework Files

File   |   Download
Dataset   |   Link
Jupyter Template   |   Link
Example CSV   |   Link

Instructions

In this assignment, you will parse & analyze a page fault trace obtained from a home desktop computer.
Your job is to convert the raw log to a CSV file and sort/filter/slice the resulting dataframe.

Log Format

Raw page fault entries in pf.log are formatted as follows:

<timestamp>:<process>:<pid>:<address>:<read/write>:<major/minor>:<time to resolve>
	<library+offset/function addr>
	<library+offset/function addr>
	<library+offset/function addr>

where
* <timestamp> indicates the Unix time when the page fault occurred
* <process> indicates the name of the process causing the page fault
* <pid> indicates the process ID of the process causing the page fault
* <address> indicates the address (in hex) where the page fault occurred
* <read/write> indicates whether the page fault was caused by a read or write access
* <major/minor> indicates whether the page fault was major or minor
* <time to resolve> indicates the amount of time (in milliseconds) the operating system took to resolve the page fault

Additionally, a backtrace is provided for each page fault with one or more entries:
* library indicates the name and version of the library in the trace entry
* function addr indicates the address of the first instruction in the function
* offset indicates the offset, within the function, of the instruction causing the page fault
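
As a starting point, the record layout above can be read with a small line-by-line parser. The following is only a sketch under stated assumptions: the sample lines are invented (they are not taken from pf.log), the exact tokens used for read/write and major/minor are guesses, process names are assumed not to contain a colon, and each backtrace entry is kept as a raw string rather than split into library, address and offset. The helper names parse_header and parse_log are hypothetical.

```python
def parse_header(line):
    """Split one colon-separated header line into a dict of typed fields."""
    # Assumption: exactly seven fields, none of which contains a ':'.
    ts, proc, pid, addr, rw, kind, resolve = line.strip().split(":")
    return {
        "time": float(ts),          # Unix time
        "proc_name": proc,
        "pid": int(pid),
        "pfaddr": int(addr, 16),    # base 16 accepts "7f..." and "0x7f..."
        "rw": rw,
        "major_minor": kind,
        "resolve_time": float(resolve),
        "trace": [],                # raw backtrace entries, filled in later
    }


def parse_log(lines):
    """Group each header line with the indented backtrace lines after it."""
    faults = []
    for line in lines:
        if not line.strip():
            continue                # skip blank lines
        if line[0] in " \t":
            # Indented line: belongs to the most recent header.
            # Assumes the log never starts with a backtrace line.
            faults[-1]["trace"].append(line.strip())
        else:
            faults.append(parse_header(line))
    return faults


# Invented sample input; the real field values in pf.log may look different.
sample = [
    "1516800000.25:firefox:1234:7f3a2c001000:R:minor:0.004\n",
    "\tlibexample.so+0x1a2b\n",
    "\tlibother.so+0x10\n",
    "1516800001.50:bash:4321:0x55d0c0de0000:W:major:1.2\n",
]
faults = parse_log(sample)
```

A list of dictionaries, each holding a list of backtrace entries, is one possible structure; other layouts would work as well.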

Dataframe Format

You are required to convert pf.log to an intermediate CSV (Comma Separated Values) file with the following headers:
1. index - an auto-incrementing value, unique to each page fault
2. time - timestamp of page fault
3. proc_name - name of process causing page fault
4. pid - process ID of process causing page fault
5. pfaddr - page fault address (converted to int)
6. rw - read/write access
7. major_minor - major/minor page fault
8. resolve_time - time in milliseconds to resolve page fault
9. lib - full path of library causing page fault
10. addr - address of function within backtrace (converted to int)
11. offset - offset (within function) of instruction causing page fault (converted to int)

For page faults with multiple backtrace entries, write out one line per trace entry, while keeping fields 1-8 constant.
Although the file uses the .csv extension, values should be separated by tabs ('\t'), and each line should be terminated with a single newline character ('\n').
Save your file as pf.csv.
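
The row layout described above can be produced with Python's csv module. This is a minimal sketch, not a reference solution: the input records are invented, the backtrace entries are assumed to be already split into (lib, addr, offset) tuples, the index is assumed to start at 0 (check the example CSV), and write_rows is a hypothetical helper name. It writes to an in-memory buffer so the result can be inspected.

```python
import csv
import io

HEADERS = ["index", "time", "proc_name", "pid", "pfaddr", "rw",
           "major_minor", "resolve_time", "lib", "addr", "offset"]


def write_rows(faults, out):
    """Write one tab-separated row per backtrace entry of every fault."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADERS)
    for index, fault in enumerate(faults):
        # Fields 1-8 are identical for every row of the same page fault.
        common = [index, fault["time"], fault["proc_name"], fault["pid"],
                  fault["pfaddr"], fault["rw"], fault["major_minor"],
                  fault["resolve_time"]]
        for lib, addr, offset in fault["trace"]:
            writer.writerow(common + [lib, addr, offset])


# Invented record with two backtrace entries (values are placeholders).
faults = [{
    "time": 1516800000.25, "proc_name": "firefox", "pid": 1234,
    "pfaddr": 4096, "rw": "R", "major_minor": "minor", "resolve_time": 0.004,
    "trace": [("/usr/lib/libexample.so", 8192, 16),
              ("/usr/lib/libother.so", 12288, 32)],
}]

buf = io.StringIO()
write_rows(faults, buf)
text = buf.getvalue()
```

When writing to disk instead, opening the file with open("pf.csv", "w", newline="") lets the csv module control the line terminator itself.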

An example of the required CSV file is provided above. It contains entries for the first two page faults in pf.log.

Making sense of your dataframe

Import pf.csv to a pandas dataframe using the read_csv function.
Remember to convert timestamps to pandas datetimes and set the dataframe index appropriately.
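
A possible way to do this is sketched below. It reads a few invented tab-separated rows from an in-memory string rather than from pf.csv, assumes the time column holds Unix time in seconds, and uses the timestamp as the dataframe index; that choice of index is only one reasonable interpretation.

```python
import io

import pandas as pd

# Invented rows in the required layout (not taken from pf.log).
text = (
    "index\ttime\tproc_name\tpid\tpfaddr\trw\tmajor_minor\tresolve_time\tlib\taddr\toffset\n"
    "0\t1516800000.25\tfirefox\t1234\t4096\tR\tminor\t0.004\t/usr/lib/libexample.so\t8192\t16\n"
    "0\t1516800000.25\tfirefox\t1234\t4096\tR\tminor\t0.004\t/usr/lib/libother.so\t12288\t32\n"
    "1\t1516800001.5\tbash\t4321\t20480\tW\tmajor\t1.2\t/usr/lib/libexample.so\t8192\t64\n"
)

# sep="\t" is needed because the file is tab-separated despite its name.
df = pd.read_csv(io.StringIO(text), sep="\t")

# Interpret the numeric timestamps as seconds since the Unix epoch.
df["time"] = pd.to_datetime(df["time"], unit="s")

# Use the timestamp as the row index (an assumption, see above).
df = df.set_index("time")
```

With the real file, the first argument to read_csv would be the path "pf.csv" instead of the StringIO buffer.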

Before you begin answering questions below, play around with your dataframe using functions covered in this tutorial.

Questions

Answer the following questions based on your analysis above.
Format your answers in a slide show presentation as bullet points.
Remember to include a title & axes for each plot.

A. Background (~2 slides)
You may use any resources available to answer the following questions.
However, you must cite your sources at the end of your report.

a. What is a page fault? When is the page fault exception raised on x86 processors?
b. How is a page fault handled by an operating system?
c. What is the difference between major and minor page faults?
d. What does an *.so file represent on Linux?

B. Describe the data structure you used to parse the raw log file in terms of Python dictionaries, lists, sets, etc. (1 slide)

C. Data Analysis (1 or 2 slides per question)
a. What time range does this data cover?
b. How many unique programs were executed over this period? How many times was each program executed?
c. Compare the number of major & minor page faults for each program (averaged over all runs). Plot a bar chart with two categories - major & minor, to demonstrate your results.
d. Provide a distribution of the time taken to handle major & minor page faults. Report the mean and standard deviation for each program.
e. Provide a list of unique libraries present on the system being analyzed. For each library, provide a list of functions & offsets where page faults occur. Hint: Use the groupby function.
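
To illustrate the groupby pattern mentioned in the hint, here is a small sketch on an invented toy dataframe. It is not a solution to the questions above: the values are made up, it does not average over runs, and the token strings "major" and "minor" are assumptions. Because the CSV has one row per backtrace entry, rows are first reduced to one per page fault before counting.

```python
import pandas as pd

# Toy data in the CSV layout (only the columns used here); values are invented.
df = pd.DataFrame({
    "index": [0, 0, 1, 2],
    "proc_name": ["firefox", "firefox", "bash", "firefox"],
    "major_minor": ["minor", "minor", "major", "major"],
    "resolve_time": [0.004, 0.004, 1.2, 0.9],
    "lib": ["/usr/lib/libexample.so", "/usr/lib/libother.so",
            "/usr/lib/libexample.so", "/usr/lib/libother.so"],
    "offset": [16, 32, 64, 32],
})

# Keep one row per page fault so multi-entry backtraces are not double counted.
faults = df.drop_duplicates(subset="index")

# Fault counts per program, with one column per fault type.
counts = (faults.groupby(["proc_name", "major_minor"])
                .size()
                .unstack(fill_value=0))

# Mean and standard deviation of the resolve time per program.
stats = faults.groupby("proc_name")["resolve_time"].agg(["mean", "std"])

# Unique offsets seen for each library, using every backtrace row.
offsets_by_lib = df.groupby("lib")["offset"].unique()
```

A table shaped like counts can be passed to counts.plot.bar(), which returns a matplotlib axes object on which a title and axis labels can be set.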

What to turn in?

Submit your ipynb notebook and PDF report via Compass.
You do not need to submit your CSV or log file.