Fault-Tolerant Algorithms and Frameworks for Extreme-Scale Computing
A joint seminar with Associate Professor Linda Stals (MSI) and Dr Josh Milthorpe (RSCS)
On future extreme-scale computers, faults will become increasingly common as the number of individual components grows without a compensating improvement in reliability. Achieving resilience is expensive, since it inevitably requires redundancy and therefore more system resources and additional energy. Traditional checkpointing techniques regularly collect data from all compute nodes and transfer it to backup memory, but this will be too expensive and too slow in extreme-scale computing.
We will explore a two-pronged, complementary approach that exploits application-specific features and framework support for resilience, both to reduce the amount of redundancy and to speed up the recovery process. Stals will concentrate on the mathematical properties of the algorithm to determine the minimal amount of information that needs to be stored in order to recover from a fault. Milthorpe will review support for resilience in frameworks and programming models for high-performance computing, with the goal of providing low-overhead resilience with minimal programmer effort.
This event is free and open to the public. It will be followed by drinks and nibbles on Level 3 of the Hanna Neumann Building (145).
The Co-Lab Seminar Series is supported by the Mathematical Sciences Institute at the College of Science, the Research School of Computer Science at the College of Engineering & Computer Science and the Co-Lab at the ANU.