Co-Lab Seminar

Using extrapolation to improve scalability and resilience of current PDE solvers

Professor Markus Hegland, MSI

With the availability of highly parallel computers, the solution of Boltzmann equations for particle densities in six-dimensional state space (plus time) becomes feasible, allowing for the modelling of dynamics with non-Maxwellian velocity distributions. An example is given by the gyrokinetic equations of plasma physics with weak collisions, whose state space can be reduced to five dimensions.

Here I will show how a potentially unstable extrapolation method has been used with the state-of-the-art GENE software to obtain good approximations, enabling the solution of larger problems and tolerance of certain hardware faults.

I will give an introduction to the mathematics of sparse grids and, in particular, the new variants of the combination technique that were developed. This work was done in close collaboration between mathematicians and computer scientists at the ANU and the Technical University of Munich (TUM), and was supported by an ARC Linkage grant and a TUM fellowship. The work resulted in successfully completed PhD projects at the ANU (MSI & CS) and at TUM.
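As background for the talk, the classical two-dimensional combination technique approximates a fine-grid solution by adding and subtracting solutions on coarse, anisotropic component grids. A minimal sketch of the classical coefficient pattern (an illustration only, not the new variants or the GENE implementation discussed in the talk):

```python
def combination_coefficients(n):
    """Classical 2D combination technique coefficients for level n:
    u_c = sum_{i+j=n} u_{i,j} - sum_{i+j=n-1} u_{i,j},
    where u_{i,j} is the solution on the grid with 2^i x 2^j cells."""
    coeffs = {}
    for i in range(n + 1):          # grids on the diagonal i + j = n
        coeffs[(i, n - i)] = 1
    for i in range(n):              # grids on the diagonal i + j = n - 1
        coeffs[(i, n - 1 - i)] = -1
    return coeffs

coeffs = combination_coefficients(3)
# n+1 grids with coefficient +1 and n grids with -1; the coefficients
# sum to 1, so constant functions are reproduced exactly.
print(sum(coeffs.values()))
```

Because each component grid is small and independent, the grids can be solved in parallel, which is the basis of the scalability and resilience arguments in both talks.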

If time permits, I will comment on some recent PhD research at the MSI applying the sparse grid combination technique to the computation of quantities of interest relating to solutions of PDEs.

 

Two approaches to highly scalable and resilient partial differential equation solvers

Associate Professor Peter Strazdins, RSCS

There is an increasing need to make large-scale scientific simulations resilient to the shrinking and growing of compute resources arising from exascale computing and adverse operating conditions (fault tolerance). For similar reasons, this also applies to soft faults, i.e. bit flips arising in memory or in CPU calculations.

In this seminar, we first describe how the Sparse Grid Combination Technique can make such applications resilient to shrinking compute resources. We describe solutions to the non-trivial issues of data redistribution and of on-the-fly malleability of process-grid information and ULFM MPI communicators. Results on a 2D advection solver indicate that process recovery time is significantly reduced compared with the alternative strategy of replacing failed resources. Overall execution time improves over both that strategy and checkpointing, and the execution error remains small even when multiple failures occur. We will then discuss yet-to-be-resolved issues in generalizing to the context of growing as well as shrinking resources. Finally, we discuss open questions on how this technique may be applied to detect and recover from soft faults.

Secondly, we present a general technique for solving partial differential equations, called robust stencils, which makes the solvers tolerant to soft faults. We show how it can be applied to a two-dimensional Lax-Wendroff solver. The resulting 2D robust stencils are derived by an orthogonal application of their 1D counterparts, from which combinations of three to five base stencils can be created. We describe how these are implemented in a parallel advection solver and explore various robust stencil combinations, representing a trade-off between performance and robustness. The results indicate that 3-stencil robust combinations are slightly faster on large parallel workloads than Triple Modular Redundancy (TMR), with significant further improvements expected from suitable optimizations, and they have one third of TMR's memory footprint. Because faults are avoided each time new points are computed, the proposed stencils are also comparably robust to faults as TMR over a large range of error rates.
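The common ingredient of both TMR and robust stencil combinations is a vote over three candidate values for each grid point: TMR replicates one stencil three times, while a robust 3-stencil combination computes the point with three different stencils and selects among them. A minimal sketch of the voting step (a toy illustration, not the authors' solver; the candidate values are made up):

```python
def vote(a, b, c):
    """Median/majority vote over three candidate point values:
    a single corrupted candidate is discarded."""
    return sorted([a, b, c])[1]

# Suppose a bit flip corrupts one of three redundant evaluations
# of the same grid point:
candidates = [1.25, 1.25, 9.9e12]   # third value hit by a soft fault
print(vote(*candidates))
```

With TMR all three candidates come from the same stencil (three times the work and memory); with a 3-stencil robust combination they come from distinct stencils over one shared field, which is one source of the memory-footprint advantage reported above.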

 

This event is free and open to the public. It will be followed by drinks and nibbles on Level 3 of the Hanna Neumann Building (145).

The Co-Lab Seminar Series is supported by the Mathematical Sciences Institute at the College of Science, the Research School of Computer Science at the College of Engineering & Computer Science and the Co-Lab at the ANU.