Petaflop Seismic Simulations in the Public Cloud
Event Type
Research Paper
Clouds and Distributed Computing
Extreme-Scale Computing
HPC workflows
Parallel Applications
Scientific Software Development
Time: Wednesday, June 19th, 9am - 9:30am CEST
Location: Substanz 1, 2
Description: During the last decade, cloud services have become a popular solution for diverse applications.
Additionally, hardware support for virtualization has closed performance gaps compared to on-premises, bare-metal systems.
This development is driven by offloaded hypervisors and full CPU virtualization. Today's cloud service providers, such as Amazon or Google,
offer the ability to assemble application-tailored clusters to maximize performance. However, from an interconnect point of view, one has to tackle a 4-5× slowdown in terms of bandwidth and a 30× slowdown in terms of latency,
compared to the latest high-speed and low-latency interconnects.
Taking into account the high per-node and accelerator-driven performance of the latest supercomputers, we observe that the network-bandwidth performance of recent cloud offerings is
within 2× of large supercomputers.
In order to address these challenges, we present a comprehensive application-centric approach for high-order seismic simulations utilizing the ADER discontinuous Galerkin finite element method.
This covers the tuning of the operating system, micro-benchmarking, and finally, the efficient execution of our solver in the cloud.
Due to this performance-oriented end-to-end workflow, we were able to achieve 1.09 PFLOPS on 768 AWS c5.18xlarge instances, offering 27,648 cores with 5 PFLOPS of theoretical computational power. This corresponds to an achieved peak efficiency of over 20% and a parallel efficiency of close to 90% in a weak-scaling setup.
In terms of strong scalability, we were able to scale a science scenario from 2 to 64 instances with 60% parallel efficiency. This work is, to the best of our knowledge, the first of its kind at such a large scale.
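
The efficiency figures quoted in the description follow from simple ratios. The short Python sketch below reproduces that arithmetic using only numbers stated in the description. The variable names are illustrative, the 5 PFLOPS peak is the rounded value from the text, and the strong-scaling speedup is inferred from the reported 60% efficiency rather than taken from measured runtimes.

```python
# Figures quoted in the description (names are illustrative).
instances = 768            # AWS c5.18xlarge instances in the largest run
total_cores = 27_648       # cores across all instances
achieved_pflops = 1.09     # sustained performance
peak_pflops = 5.0          # theoretical peak, rounded in the text

# Cores per instance and fraction of theoretical peak actually sustained.
cores_per_instance = total_cores // instances
peak_efficiency = achieved_pflops / peak_pflops

# Strong scaling: parallel efficiency = speedup / ideal speedup,
# so the speedup implied by the reported efficiency is efficiency * ideal.
base_instances = 2
max_instances = 64
strong_efficiency = 0.60   # reported parallel efficiency
ideal_speedup = max_instances / base_instances
implied_speedup = strong_efficiency * ideal_speedup

print(f"cores per instance:      {cores_per_instance}")
print(f"peak efficiency:         {peak_efficiency:.1%}")
print(f"ideal strong speedup:    {ideal_speedup:.0f}x")
print(f"implied strong speedup:  {implied_speedup:.1f}x")
```

With these inputs the peak efficiency comes out at roughly 21.8%, consistent with the "over 20%" stated in the description, and a 60% parallel efficiency over a 32-fold increase in instances implies a speedup of about 19.2 relative to the 2-instance run.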