Scalable Fault-tolerant PDE Solvers


Partial differential equations (PDEs) are prevalent in the mathematical analysis and modelling of natural and human systems. Solving PDEs on the new generation of petascale supercomputers (machines capable of performing 10^15 floating point operations per second) poses two key challenges: scalability and resilience. The scalability problem is twofold: scalability to large numbers of processing elements (PEs) or "chips," and scalability to higher-dimensional systems. Resilience is the ability of an algorithm to continue functioning in the event of hardware component failures. This is of increasing interest as the high-performance computing community approaches the exascale era (machines capable of performing 10^18 floating point operations per second). Exascale platforms will comprise millions of hardware components, so the machine as a whole will have a shorter mean time to failure than previously seen. An application running on an exascale machine is therefore likely to experience a hardware failure at some point during its execution. Our focus is the formulation of numerical schemes and algorithms that implement algorithm-based fault tolerance (ABFT). Our ABFT schemes employ multigrid strategies such as the sparse grid combination technique. These schemes are capable of producing solutions under conditions of hardware failure, with known error bounds.
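
To illustrate how a combination-technique scheme can absorb the loss of a component grid, the following is a minimal sketch and not the project's actual implementation. It assumes the standard inclusion-exclusion formula for combination coefficients on a downward-closed set of grid levels, and it computes only the coefficients, not the PDE solutions themselves. The function names combination_coefficients and drop_failed are hypothetical, chosen for this example.

```python
from itertools import product


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed set of grid levels.

    Each level is a tuple such as (i, j). Levels whose coefficient is zero
    are omitted from the result.
    """
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coeffs = {}
    for level in index_set:
        c = 0
        # Sum (-1)^|shift| over every 0/1 shift whose target is in the set.
        for shift in product((0, 1), repeat=dim):
            neighbour = tuple(l + s for l, s in zip(level, shift))
            if neighbour in index_set:
                c += (-1) ** sum(shift)
        if c != 0:
            coeffs[level] = c
    return coeffs


def drop_failed(index_set, failed):
    """Remove a failed grid and every grid at or above it.

    This keeps the remaining set downward closed, so the coefficient
    formula above still applies.
    """
    return {
        level
        for level in index_set
        if not all(a >= b for a, b in zip(level, failed))
    }


# Classical 2D combination: all levels (i, j) with i + j <= n.
n = 4
classical = {(i, j) for i in range(n + 1) for j in range(n + 1) if i + j <= n}
coeffs = combination_coefficients(classical)

# Assumed failure scenario: the solution on grid (2, 2) is lost.
recovered = drop_failed(classical, failed=(2, 2))
recovered_coeffs = combination_coefficients(recovered)
```

In this sketch the lost grid (2, 2) is replaced in the combination by the coarser grid (1, 1), which now carries a coefficient of -1, and the coefficients still sum to 1. The combined solution is therefore still available after the failure, at the cost of a somewhat larger approximation error.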

For people involved in the Mathematical Modelling and Computation program, research opportunities exist in the formulation of fault-tolerant numerical schemes, the implementation of these schemes on supercomputers, the simulation of hardware failure events on ultrascale supercomputers, and the application of these techniques to scientific computing.
This research is funded by Fujitsu Laboratories Europe and the Australian Research Council.