Notice

Sorry, applications are no longer being accepted for this Posting.
Posting Title
Hardware/Software Resilience Co-Design Tools for Extreme-scale High-Performance Computing
Program
HBCU/MEI Faculty
Reference Code
HBCU/MEI-13-3
Eligibility Requirements
  • Affirmation: I certify that I am a full-time member of the teaching faculty at an accredited U.S. HBCU/MEI institution of higher education. I have confirmed that my institution is eligible by visiting http://www.orau.org/ornl/faculty/documents/minority-serving-institutions.pdf.
How To Apply

You must apply through the ORNL Talent and Opportunity System.  Please note the deadline to apply for this posting is January 10, 2013.

Application deadline
1/18/2013 11:59 PM Eastern Timezone
Academic Levels
  • Faculty
Disciplines
Qualifications

Experience with hardware and/or software fault tolerance in computer systems, parallel discrete-event simulation of computer systems, and modeling of the performance and power characteristics of computer systems.

Description

 

The path to exascale computing poses several research challenges related to power, performance, resilience, productivity, programmability, data movement, and data management. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important exascale computer science challenges as systems scale up in component count (100,000-1,000,000 nodes with 1,000-10,000 cores per node by 2020) and component reliability decreases (7 nm technology with near-threshold voltage operation by 2020). Several high-performance computing (HPC) resilience technologies have been developed. However, there are currently no tools, methods, or metrics to compare them and to identify the cost/benefit trade-off between the key system design factors: performance, resilience, and power consumption.

This project focuses on developing a resilience co-design toolkit with definitions, metrics, and methods to evaluate the cost/benefit trade-off of resilience solutions, identify hardware/software resilience properties, and coordinate interfaces/responsibilities of individual hardware/software components.

The primary goal of this project is to provide the tools and data needed by HPC vendors to decide on future architectures and to enable direct feedback to HPC vendors on emerging resilience threats.
