Mini Project 1

Posted: 2018-02-02   |   Due: 2018-02-18


Changelog

This page will be updated continuously with more information. Changes so far:
* 2/13/2018 - Task 4 Released
* 2/4/2018 - Pandas tutorial uploaded
* 2/3/2018 - Discussion material uploaded
* 2/2/2018 - Initial release

Introduction

The goal of this project is to quantify the reliability of the Blue Waters memory subsystem, i.e., whether it fails, how frequently it fails, and how it fails. Syslog data from the system (particularly data pertaining to Machine Check Exceptions, or MCEs) has been chunked into two-month periods, pre-processed (parsed and cleaned up), and tabulated for use in this project. Links to the datasets, each covering two months of log data, are provided below.

Please read this entire document and plan ahead before starting the project. In particular, Task 3 requires you to repeat all previous tasks and involves computationally expensive dataset processing. Start early and ask questions!

Project Deliverables

We will update this section with more details soon.

  1. Jupyter notebook containing code for Tasks 0, 1, and 2 by the checkpoint (11:59pm on Feb 11, 2018)
  2. Jupyter notebook containing code for all Tasks (by 11:59pm on Feb 18, 2018)
  3. A report containing summarized answers for each task of the mini project. Turn in a PDF of a PowerPoint presentation of no more than 30 slides.
  4. In-class presentation (10 mins) on Wednesday 2/21/2018. No more than 8 slides.
    Guidelines for the presentation are:
    • Slide 1: Title - Project Title, Group Members, Dataset range
    • Slide 2: Task 0
    • Slide 3: Task 1
    • Slide 4: Task 2
    • Slide 5: Coalescing your data - present your knee curve
    • Slide 6: What major changes did you see after coalescing your data?
    • Slide 7: Task 4 - Both hypothesis parts. Present two tables + your MAP decisions.
    • Slide 8: Division of labor. You can leave this slide on while the audience asks questions.

Important Dates

  1. Checkpoint 1 is due 11:59pm on Feb 11, 2018
  2. The Final Checkpoint is due 11:59pm on Feb 18, 2018. The report is due at the same time.
  3. We will announce a time and place for group presentations.

Project Materials

File Download
Datasets via email and Task 4 below
Discussion Examples Link
Pandas Tutorial Link
Introduction to the Blue-Waters system. Link
AMD Manual for XE and XK Nodes Link
AMD Manual for Service Nodes Link
Syndrome Table x8 Link
Syndrome Table x4 Link

Information about Dataset

The data provided to you has the following fields:
NOTE: These fields are explained in the AMD Processor Manuals, which give detailed descriptions of the logged errors and the architecture/system components they relate to. Numbers with ticks in the dataset are in binary.

  • NodeID - NOT USED IN THIS STUDY
  • Date Time - Date in format yyyy-mm-dd HH:MM:SS
  • Complete Node - Complete node id in the form of cX-YcZsKnT where X-Y is the cabinet coordinate, Z is the chassis in the cabinet, K the slot and T the node number
  • Cabinet - Cabinet coordinate for the node from Complete Node
  • Chassis - Chassis coordinate for the node from Complete Node
  • Slot - Slot coordinate for the node from Complete Node
  • Node - Node coordinate for the node from Complete Node
  • Node Type - Type of node generating the machine check exception
  • Processor - Processor model number (Defines which AMD manual will carry related error information)
  • Time - UNIX Timestamp when the exception was generated
  • Socket - NOT USED IN THIS STUDY
  • Apic - NOT USED IN THIS STUDY
  • Bank - Bank Generating the machine check exception
  • Err Val - This bit indicates that a valid error has been detected (NOT USED IN THIS STUDY)
  • OV - Overflow
  • UC - Error uncorrected
  • PCC - Process context corrupt
  • CECC - Correctable ECC/Chipkill error
  • UECC - Uncorrectable ECC/Chipkill error
  • DEF - Deferred Error
  • POISON - Poison Error
  • L3 Subcache - L3 Subcache in error
  • Sub Link - Bit indicates error in upper or lower byte of Sub link or DRAM channel.
  • LDT Link - For errors associated with a hypertransport link, this field indicates which link was associated with an error
  • Scrub - ECC error detected by the scrubber
  • Link - Link in error
  • Cache way in error - Indicates the cache way in error
  • Syndrome - Syndrome of the corrected ECC/Chipkill error
  • Core - ID of core in which error has occurred
  • Errorcode - The MCi_STATUS error information
  • Ext_errorcode - Logs an extended error code when an error is detected. Used in conjunction with ErrorCode
  • Error Type - Type of error
  • Addr - Address generating the machine check
  • Addr Desc - Type of address
  • Errorcode Type - Type of Machine Check Exception
  • Misc - Miscellaneous data (not used)
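The Complete Node format described above (cX-YcZsKnT) can be split into its coordinates with a regular expression. The following is a minimal sketch, assuming every coordinate is a non-negative decimal integer; the function name, the returned keys, and the example id are hypothetical, and the dataset already provides Cabinet, Chassis, Slot, and Node as separate columns, so this is mainly useful for cross-checking them.

```python
import re

# Pattern for node ids of the form cX-YcZsKnT (assumes decimal coordinates).
NODE_ID = re.compile(r"^c(\d+)-(\d+)c(\d+)s(\d+)n(\d+)$")


def parse_complete_node(node_id):
    """Split a node id such as 'c12-3c1s5n2' into its coordinates."""
    match = NODE_ID.match(node_id.strip())
    if match is None:
        return None  # malformed id; the caller decides whether to drop the row
    x, y, chassis, slot, node = match.groups()
    return {
        "cabinet": f"c{x}-{y}",  # cabinet coordinate X-Y
        "chassis": int(chassis),
        "slot": int(slot),
        "node": int(node),
    }


# Hypothetical id used only for illustration.
example = parse_complete_node("c12-3c1s5n2")
```

Rows whose id does not match the pattern return None, which may be one way to spot bad entries during filtering.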

Summary of Nodes in the System

Compute Nodes

22640 XE and 784 Service nodes

  • 2x AMD OPTERON Processor
  • 8x 8GB DIMMS DDR3
  • CHIPKILL 8x/4x

GPU Nodes

Before Aug 2013: 3072 XK nodes

  • 1x AMD OPTERON Processor
  • 4x 8GB DIMMS DDR3
  • 1 NVIDIA K20X 6GB GDDR5
  • CHIPKILL 8x/4x

After Aug 2013: 4224 XK nodes

  • 1x AMD OPTERON Processor
  • 4x 8GB DIMMS DDR3
  • 1 NVIDIA K20X 6GB GDDR5
  • CHIPKILL 8x/4x



Project Tasks

Task 0 - Get familiar with the analysis environment

  1. Import your assigned dataset and filter out bad entries.
    Hint: Take a look at the min/max timestamps. What time range does your data cover?
    Hint: You may need to filter data based on additional columns
  2. Summarize the following information:

    • Total number of entries
    • Unique number of nodes
    • Number of days
    • Unique node types
    • Total number of uncorrectable errors
    • Distinct types of machine check exceptions
  3. How would you define error and failure in this data?
    Hint: Refer to lecture 4 for an introduction to reliability engineering

  4. Count the number of MCEs per node. Provide a box plot to summarize your results.

  5. Compute the mean time between MCEs for:

    • All nodes together (the whole dataset)
    • Each of the node types (XE, XK, service)
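One way to compute a mean time between MCEs is to sort the UNIX timestamps and average the gaps between consecutive events. The sketch below works on plain Python sequences so that it runs without the dataset; the function names and the toy rows are hypothetical, and whether you average over the whole dataset or per node first is a modeling choice you should justify.

```python
from collections import defaultdict


def mean_time_between_events(timestamps):
    """Mean gap, in seconds, between consecutive UNIX timestamps."""
    ts = sorted(timestamps)
    if len(ts) < 2:
        return None  # a single event has no inter-arrival time
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps)


def mean_time_by_group(rows):
    """rows: iterable of (group_label, unix_time) pairs."""
    grouped = defaultdict(list)
    for label, unix_time in rows:
        grouped[label].append(unix_time)
    return {label: mean_time_between_events(ts) for label, ts in grouped.items()}


# Hypothetical (node_type, unix_time) pairs, not taken from the real dataset.
toy_rows = [("XE", 0), ("XK", 10), ("XE", 100), ("XK", 50), ("XE", 300)]
overall = mean_time_between_events(t for _, t in toy_rows)
per_type = mean_time_by_group(toy_rows)
```

With a pandas DataFrame, the same pairs could be produced from the Node Type and Time columns before calling mean_time_by_group.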

Task 1 - Analysis of Machine Check Exception Rates

  1. Plot the time-to-MCE distribution. Does it fit any known distribution (e.g., Gaussian, Weibull, Exponential)?

  2. What percentage of MCEs is due to memory errors?
    Hint: Which bank generates memory errors? Take a look at the AMD developers manual

  3. Provide a breakdown of the number, type (e.g., ECC, L1, L2, memory), and percentage of machine check exceptions for the entire dataset and per node type.
    Construct a bar chart to visualize your results.

  4. What are correctable, uncorrectable, and deferred errors in this dataset?

  5. Are there any uncorrectable errors?
    If yes, provide a histogram of the time between failures (TBF) for uncorrectable errors.
    Compute a separate MTBF and FIT for uncorrectable errors.

FIT is defined as the number of failures in 10^9 hours of operation.
Note: Your time range should be based on the entire dataset, not the filtered dataset with only uncorrectable errors. Make sure you are consistent with your time units.
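The FIT definition and the note on time units can be illustrated with a short calculation. This is a minimal sketch, assuming MTBF is taken as the observation window divided by the failure count and that the window endpoints come from the entire dataset; the function name and the numbers are hypothetical.

```python
def mtbf_and_fit(n_failures, t_start, t_end):
    """Return (MTBF in hours, FIT) for failures seen between two UNIX times."""
    hours = (t_end - t_start) / 3600.0  # convert seconds to hours once
    if hours <= 0:
        raise ValueError("observation window must have positive length")
    if n_failures == 0:
        return None, 0.0  # no failures observed: MTBF is undefined
    mtbf_hours = hours / n_failures
    fit = n_failures / hours * 1e9  # failures per 10^9 hours of operation
    return mtbf_hours, fit


# Hypothetical window of 60 days with 4 uncorrectable errors.
window_start = 0
window_end = 60 * 24 * 3600
mtbf, fit = mtbf_and_fit(4, window_start, window_end)
```

Under this convention FIT equals 10^9 divided by the MTBF in hours, which is a quick consistency check on your units.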

Task 2 - Assessment of the error detection and correction techniques (only memory errors)

  1. Provide a percentage breakdown of the memory errors into single-, double-, triple-, and quadruple-bit errors.

    • Use a table to summarize the data, for all node types (ALL, XE, XK, service)
    • Use the x8 syndrome table in the AMD processor manual (section 2.13.2.5) to understand how to solve this problem
  2. How frequent (in time) are multiple-bit (>1 bit) errors?

    • Provide one or two charts of your choice to motivate your answer.
    • Do different types of nodes (XE, XK, service) behave differently in terms of the frequency of multiple bit errors?
  3. Test the following hypothesis: XK nodes perform worse (have a higher rate of memory errors) than XE nodes.
    Remember to normalize rates based on memory capacities of these node types.

  4. How many uncorrectable errors would Blue Waters have if it only used SEC-DED ECC (single-error correction, double-error detection)?
    Blue Waters uses an improved version of ECC that can correct multi-bit errors (as seen in your dataset).
    How effective is this improved ECC over regular ECC?

    • Compare the FIT and MTBF (only for uncorrectable errors) considering the same system with regular ECC and improved ECC.
    • Summarize your answer in 2-3 sentences.
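For the XK versus XE comparison in item 3, the normalization by memory capacity might look like the sketch below. Node counts and DIMM configurations are taken from the node summary above (XE: 8x 8GB, XK: 4x 8GB, using the post-Aug 2013 XK count); the error counts and the observation window are hypothetical placeholders, and the sketch only computes the rates, not the hypothesis test itself.

```python
def errors_per_gb_hour(n_errors, n_nodes, gb_per_node, hours):
    """Memory error rate normalized by installed capacity and observation time."""
    return n_errors / (n_nodes * gb_per_node * hours)


HOURS = 60 * 24           # hypothetical two-month window, in hours
XE_GB_PER_NODE = 8 * 8    # 8x 8GB DIMMs per XE node
XK_GB_PER_NODE = 4 * 8    # 4x 8GB DIMMs per XK node

# Error counts below are made up for illustration only.
xe_rate = errors_per_gb_hour(2000, 22640, XE_GB_PER_NODE, HOURS)
xk_rate = errors_per_gb_hour(500, 4224, XK_GB_PER_NODE, HOURS)
rate_ratio = xk_rate / xe_rate  # >1 means XK has the higher normalized rate
```

Comparing raw counts would favor whichever node type has fewer nodes or less memory, which is why the rates are expressed per GB-hour before any test is applied.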

Task 3 - Data Coalescing

  1. Coalesce your dataset using the Sliding Window algorithm. Justify your window size by providing a knee curve. The Sliding Window Algorithm is provided below. Note: You should only coalesce entries that originate from the same node. Your knee curve should plot, for each candidate window size, the total number of tuples summed over all nodes.

    Sliding Window Algorithm:
    sliding(df_node) {
    	W = window size;
    	foreach event in the dataframe {
    		if (T(df_node.curr) - T(df_node.prev) < W)
    			add df_node.curr to the current tuple;
    		else
    			create a new tuple with df_node.curr;
    	}
    }
    
  2. Repeat Tasks 0, 1, and 2 after coalescing. How do your results differ?

  3. Are chassis location and memory errors independent? Hint: This requires a hypothesis test.
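The Sliding Window pseudocode in item 1 can be sketched in Python as follows. This is one possible reading, assuming timestamps are UNIX seconds, that each event is compared with the immediately preceding event on the same node (as in the pseudocode), and that the knee curve is the total tuple count over all nodes for each window size. The function names and the toy data are hypothetical.

```python
def coalesce(timestamps, window):
    """Group one node's timestamps into tuples using the sliding-window rule."""
    tuples = []
    prev = None
    for t in sorted(timestamps):
        if prev is not None and t - prev < window:
            tuples[-1].append(t)  # close enough to the previous event
        else:
            tuples.append([t])    # gap too large: start a new tuple
        prev = t
    return tuples


def knee_curve(per_node_timestamps, windows):
    """Total number of tuples across all nodes for each candidate window."""
    return [
        (w, sum(len(coalesce(ts, w)) for ts in per_node_timestamps.values()))
        for w in windows
    ]


# Hypothetical per-node timestamps (seconds), not taken from the real dataset.
toy = {"nodeA": [0, 5, 8, 100, 103], "nodeB": [50, 400]}
curve = knee_curve(toy, [1, 10, 100, 1000])
```

Plotting the tuple count against the window size on a logarithmic x-axis is a common way to look for the knee.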

Task 4 - Application Data Analysis

In this task, you will apply Bayesian analysis to the MCE, OS, and Application Failure model described in lecture 5.

  1. Import the dataset assigned to your group into pandas. How many entries do you have?

  2. Calculate the probabilities/conditional probabilities for each feature.

  3. Given that an application has failed, calculate the probabilities of each possible hypothesis. Apply the MAP decision rule to your hypotheses.

  4. Given each of the 4 cases below, are applications more likely to fail or continue running?

    • MCE = ‘C’, OS = ‘X’
    • MCE = ‘P’, OS = ‘R’
    • MCE = ‘U’, OS = ‘R’
    • MCE = ‘U’, OS = ‘X’
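The MAP decision in item 3 can be illustrated with empirical probabilities estimated by counting. The sketch below assumes each record is a (MCE, OS, application failed) triple and treats each (MCE, OS) pair as one hypothesis; this may not match the exact model from lecture 5, and the record layout, function names, and toy records are hypothetical. The labels 'C', 'U', 'R', and 'X' are used exactly as they appear in the task.

```python
from collections import Counter


def posterior_given_failure(records):
    """Return {(mce, os_state): P(hypothesis | application failed)}."""
    records = list(records)
    n = len(records)
    seen = Counter((mce, os_state) for mce, os_state, _ in records)
    failed = Counter((mce, os_state) for mce, os_state, f in records if f)
    joint = {}
    for hypothesis, count in seen.items():
        prior = count / n                        # P(hypothesis)
        likelihood = failed[hypothesis] / count  # P(failed | hypothesis)
        joint[hypothesis] = prior * likelihood
    evidence = sum(joint.values())               # P(failed)
    if evidence == 0:
        return {}
    return {h: p / evidence for h, p in joint.items()}


def map_decision(posterior):
    """Pick the hypothesis with the largest posterior probability."""
    return max(posterior, key=posterior.get)


# Made-up records for illustration only.
toy_records = (
    [("U", "X", True)] * 3 + [("U", "X", False)]
    + [("C", "R", True)] + [("C", "R", False)] * 5
)
posterior = posterior_given_failure(toy_records)
best = map_decision(posterior)
```

For item 4, the same counts give P(failed | MCE, OS) directly as the likelihood term, which can be compared against 0.5 for each of the four cases.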

Datasets

Group Link
Group 01 CSV
Group 02 CSV
Group 03 CSV
Group 04 CSV
Group 05 CSV
Group 06 CSV
Group 07 CSV
Group 08 CSV
Group 09 CSV
Group 10 CSV
Group 11 CSV
Group 12 CSV
Group 13 CSV