(PP15): EXAHD -- Current Work on Scalable and Fault-Tolerant Plasma Simulations
Clouds and Distributed Computing
Time: Wednesday, June 19th, 3:15pm - 4pm
Description: In this poster session, we give an overview of the SPPEXA project EXAHD, which focuses on solving a gyrokinetic system for plasma simulations. Although the gyrokinetic formulation used in the GENE code is already reduced to five dimensions, fully resolved simulations of tokamak geometries remain infeasible due to the curse of dimensionality.
In EXAHD, we apply the Sparse Grid Combination Technique to decouple the problem into independent problems of lower resolution. By distributing these via a manager-worker pattern, we scale GENE within our framework to up to 180225 cores. Moreover, our approach provides algorithmic fault tolerance: in case of silent errors or hardware failures, missing solutions can be reconstructed from neighboring solutions without expensive checkpoint-restart. Our results show that the Fault Tolerant Combination Technique yields accurate results in the presence of hard and soft faults while maintaining high scalability.
Because the Combination Technique enables us to scale GENE even further, we are investigating the pitfalls and possibilities of distributing the computation across multiple HPC systems.
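To illustrate how the Combination Technique decouples one high-resolution problem into independent lower-resolution ones, the following minimal sketch enumerates the component grids and their weights for the classical combination scheme. It is not taken from the EXAHD framework: the function name `combination_scheme`, the convention that levels start at 1 in every dimension, and the choice of the classical (non-truncated, non-fault-tolerant) scheme are all assumptions made for illustration.

```python
from itertools import product
from math import comb


def combination_scheme(dim, n):
    """Return {level_vector: coefficient} for the classical combination technique.

    Hypothetical helper, assuming levels start at 1 in each dimension.
    The combined solution is
        u_n^c = sum_{q=0}^{dim-1} (-1)^q * C(dim-1, q) * sum_{|l|_1 = n + dim - 1 - q} u_l,
    where each u_l is a solution on an anisotropic grid of level vector l.
    """
    scheme = {}
    for q in range(dim):
        coeff = (-1) ** q * comb(dim - 1, q)
        target_sum = n + (dim - 1) - q
        # No component of a valid level vector can exceed n, so range(1, n + 1) suffices.
        for level in product(range(1, n + 1), repeat=dim):
            if sum(level) == target_sum:
                scheme[level] = coeff
    return scheme


if __name__ == "__main__":
    # Each entry is one independent lower-resolution problem a worker group could solve.
    for level, coeff in sorted(combination_scheme(2, 3).items()):
        print(level, coeff)
```

Each level vector in the result corresponds to one independent task that a manager could hand to a group of workers; the coefficients are only needed afterwards, when the partial solutions are combined. The coefficients of the classical scheme always sum to 1.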