Adaptive refinement recovery after fault simulation

The use of adaptive refinement techniques in combination with finite element methods is well established. Furthermore, iterative techniques that incorporate information about the grid structure, such as the multigrid method, have been shown to be very efficient for solving various types of partial differential equations. Naturally, as soon as parallel computers became available, a number of researchers studied the behaviour of these algorithms in a parallel setting, and as a consequence of these studies they now form an integral part of many sophisticated parallel software packages. However, the advent of larger and larger parallel machines adds a very modern twist to this tale: how to recover if a fault occurs in one of the processors.

In this paper we present a parallel adaptive multigrid method that uses dynamic data structures to store a nested sequence of meshes and the evolving solution. After a fail-stop fault, the data residing on the faulty processor is lost. However, the neighboring processors hold enough information to reconstruct a consistent mesh in the faulty domain, with the goal of resuming the computation without having to restart from scratch.
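
To make the recovery idea concrete, here is a minimal one-dimensional sketch in Python. It is an illustration under stated assumptions, not the method presented in the talk: the names Subdomain and recover are hypothetical, each processor is assumed to own one interval whose end nodes are shared with its neighbours, and the lost interior is simply replaced by a uniform coarse mesh with linearly interpolated values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subdomain:
    """Hypothetical per-processor data: a 1D mesh piece and its solution."""
    nodes: List[float]   # sorted node coordinates; end nodes are shared with neighbours
    values: List[float]  # current solution value at each node


def recover(parts: List[Optional[Subdomain]], failed: int, n_coarse: int = 2) -> Subdomain:
    """Rebuild parts[failed] from the interface data held by its two neighbours.

    Assumption: the failed subdomain is interior, so both neighbours exist.
    The rebuilt mesh is coarse; it only guarantees that the interface nodes
    and values match the neighbours, so the global mesh is consistent again.
    """
    if not 0 < failed < len(parts) - 1:
        raise ValueError("this sketch only handles interior subdomains")
    left, right = parts[failed - 1], parts[failed + 1]
    if left is None or right is None:
        raise ValueError("both neighbours must have survived")

    # Interface coordinates and values survive on the neighbouring processors.
    a, ua = left.nodes[-1], left.values[-1]
    b, ub = right.nodes[0], right.values[0]

    # Uniform coarse mesh on [a, b] with linearly interpolated initial values.
    nodes = [a + (b - a) * i / n_coarse for i in range(n_coarse + 1)]
    values = [ua + (ub - ua) * i / n_coarse for i in range(n_coarse + 1)]

    parts[failed] = Subdomain(nodes, values)
    return parts[failed]


# Three subdomains on [0, 0.75]; the middle one is lost in a fail-stop fault.
parts = [
    Subdomain([0.0, 0.125, 0.25], [0.0, 0.5, 1.0]),
    None,  # data on the faulty processor is gone
    Subdomain([0.5, 0.75], [2.0, 3.0]),
]
rebuilt = recover(parts, failed=1)
print(rebuilt.nodes, rebuilt.values)
```

In this toy setting the rebuilt piece would then be re-refined and the iteration resumed from the interpolated values; how much of the lost refinement can actually be recovered depends on what the neighbours store, which the sketch does not model.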

Warning: This is an algorithms talk; there will not be a single equation in the whole talk.