GCS Award Winning Paper: End-to-end Resilience for HPC Applications
Event Type
Hans Meuer Award Finalists
Research Paper
System Software & Runtime Systems
Time: Monday, June 17th, 5:30pm - 6pm CEST
Location: Panorama 1
Description: A plethora of resilience techniques have been investigated,
ranging from checkpoint/restart over redundancy to algorithm-based
fault tolerance. Each technique works well for a different subset of application
kernels, and depending on the kernel, has different overheads,
resource requirements, and fault masking capabilities. If, however, such
techniques are combined and they interact across kernels, new vulnerability
windows are created.
This work contributes the idea of end-to-end resilience by protecting
windows of vulnerability between kernels guarded by different resilience
techniques. It introduces the live vulnerability factor (LVF), a new metric
that quantifies any lack of end-to-end protection for a given data structure.
The work further promotes end-to-end application protection across
kernels via a pragma-based specification for diverse resilience schemes
with minimal programming effort. This lifts the data protection burden
from application programmers, allowing them to focus solely on algorithms
and performance while resilience is specified and subsequently embedded
into the code through the compiler/library and supported by the
runtime system. Two case studies demonstrate that end-to-end resilience
meshes well with different execution paradigms and assess its overhead
and effectiveness for different codes. In experiments with case studies and
benchmarks, end-to-end resilience adds less than 3% overhead on average
relative to kernel-specific resilience and increases protection against bit
flips by a factor of three to four.
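
The vulnerability window described above can be made concrete with a small sketch. It is only an illustration and does not reproduce the paper's pragma-based specification, its runtime system, or its definition of LVF. The kernels `kernel_a` and `kernel_b`, the helper `flip_bit`, and the choice of CRC32 as the guard are all assumptions made for this example. Each kernel is assumed to be protected internally by its own technique; the checksum covers only the hand-off between them, which is where the data would otherwise be unguarded.

```python
import zlib


def checksum(data: bytearray) -> int:
    """CRC32 over the buffer; stands in for any integrity check."""
    return zlib.crc32(bytes(data))


def kernel_a(n: int) -> bytearray:
    """Hypothetical producer kernel (assumed protected internally)."""
    return bytearray(i % 256 for i in range(n))


def kernel_b(data: bytearray) -> int:
    """Hypothetical consumer kernel (assumed protected internally)."""
    return sum(data)


def flip_bit(data: bytearray, byte_index: int, bit: int) -> None:
    """Inject a single bit flip to emulate a soft error."""
    data[byte_index] ^= 1 << bit


def run(inject_fault: bool):
    """Run kernel A then kernel B, guarding the data in between."""
    data = kernel_a(1024)
    guard = checksum(data)              # taken when kernel A hands off its output
    if inject_fault:
        flip_bit(data, 10, 3)           # corruption in the window between kernels
    detected = checksum(data) != guard  # verified before kernel B consumes the data
    if detected:
        data = kernel_a(1024)           # simple recovery for this sketch: recompute
    return kernel_b(data), detected
```

Calling `run(True)` and `run(False)` returns the same result from `kernel_b`, with the detection flag set only in the faulty run. Without the guard, the flipped bit would pass silently into `kernel_b` even though both kernels are individually protected.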