This course introduces a system-level (hardware and software) view of design issues in reliable computing. The material covers a broad spectrum of hardware and software error detection and recovery techniques. The lectures discuss how hardware and software techniques interplay: for example, what can be provided in the hardware, operating system, and network communication layers, and what can be provided by a distributed software layer or in the application itself.
The course focuses on the basic concepts that underlie the resilience of computer systems, including:
- dependability measures, including reliability and availability
- hardware and software fault models
- redundancy and coding techniques
- signature-based error checking (e.g., software-based control-flow checking)
- processor-level error detection and recovery (e.g., duplicate execution and comparison)
- reconfiguration techniques in multiprocessor systems
- checkpointing and recovery (single-process and distributed environments)
- software fault tolerance techniques (e.g., process pairs, robust data structures, recovery blocks, and N-version programming)
- network-specific issues (e.g., providing consistent data and reliable communication)

The capabilities and applicability of the techniques discussed are illustrated with examples from real applications and systems.
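As a taste of one of the software fault tolerance techniques listed above, the following is a minimal illustrative sketch (not course material) of N-version programming with majority voting. The three "versions" are hypothetical independent implementations of the same specification (integer square root), one of which is deliberately faulty so the vote can mask it:

```python
from collections import Counter

def version_a(x):
    # Version 1: Newton's method on integers
    r = x
    while r * r > x:
        r = (r + x // r) // 2
    return r

def version_b(x):
    # Version 2: linear search for the largest r with r*r <= x
    r = 0
    while (r + 1) * (r + 1) <= x:
        r += 1
    return r

def version_c(x):
    # Version 3: deliberately faulty implementation, to be outvoted
    return x // 2

def majority_vote(x, versions):
    # Run every version and accept the result produced by a majority;
    # a single faulty version is masked by the other two.
    results = Counter(v(x) for v in versions)
    value, count = results.most_common(1)[0]
    if count > len(versions) // 2:
        return value
    raise RuntimeError("no majority agreement")

print(majority_vote(10, [version_a, version_b, version_c]))  # → 3
```

The voter only needs agreement on outputs, not on internals, which is why independently developed versions can tolerate design faults that identical replicas cannot.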
The thrust of the course is to learn these techniques through measurement- and data-driven analytics for large-scale computing systems and applications. We will use data on unexpected system failures and malicious security incidents to characterize resiliency and to provide insights into the design of current and future generations of large computing systems and applications. We will discuss methods (e.g., machine learning) and tools for performing high-fidelity analytics and for automating data processing and the quantification of resiliency metrics. Examples will show the benefits of measurement-driven analytics and their role in characterizing how system failures and/or malicious attacks impact applications, and how resiliency is affected by application characteristics.
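To make "quantification of resiliency metrics" concrete, here is a minimal sketch, using hypothetical measurements, of one classic metric: steady-state availability computed from mean time between failures (MTBF) and mean time to repair (MTTR):

```python
def availability(mtbf_hours, mttr_hours):
    # Steady-state availability = MTBF / (MTBF + MTTR):
    # the long-run fraction of time the system is operational.
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A hypothetical node that fails every 1000 h on average
# and takes 2 h on average to repair:
print(round(availability(1000.0, 2.0), 5))  # → 0.998
```

In practice the course's measurement-driven approach would estimate MTBF and MTTR from failure logs of real deployed systems rather than assume them.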
Basic probability and basic computer programming skills are essential. Computer Organization and Design (ECE 411), an equivalent course, or instructor consent is required. Operating Systems (e.g., ECE 391) or an equivalent course is beneficial.
About the Class
The class will begin with lectures intended to build common knowledge and grounding, then transition to discussions of both seminal and more recent research papers that outline new challenges and opportunities in dependable computing. For sessions with student-led presentations, students who are not presenting are expected to write short reviews of the papers being presented. Students are also expected to complete data-driven assignments/projects focused on resilience assessment and design.
We will compute the final grade using the following table:
| Component | Weight | Notes |
| --- | --- | --- |
| Mini-Projects | 30% | 2 throughout the semester |
| Paper Presentation + Reviews | 15% | |
| Class Participation | 10% | May include quizzes |
Paper Presentation & Reviews
Reviews
- Description: 2 pages max; 1 paragraph on the core idea of the paper, followed by a list of pros and cons of the approach, and any questions/criticisms/thoughts about the paper.
- Grading criteria: argumentative critique (pros/cons) and creative comments on addressing issues or improving the paper.
- Due: 10 p.m. the night before class. A link will be posted on Piazza.
- Sign-up: TBD
Presentation
- Description: 15 slides max (20 min for the paper, 5 min for critique, 5 min for questions): 2-3 slides on motivation and background, 5-7 slides on the core ideas of the paper, 3-5 slides on experimental data, and 3-5 slides on your thoughts/criticisms/questions/discussion points about the paper. Include slides summarizing the Piazza discussion of the paper.