(PP21): LOWAIN Project (LOW Arithmetic INtensity Specific Architectures)
Performance Analysis and Optimization
Time: Wednesday, June 19th, 3:15pm - 4pm CEST
Description: When running the HPCG benchmark, conceived as a representative of supercomputer simulations, most modern supercomputers aren’t using more than 1.5-2.0% of their peak computing power.
Extrapolating, a future Summit-like exascale supercomputer (1 EFlop/s peak) would sustain only ~15 PFlop/s when running HPCG.
The principal reason for the poor HPCG behavior is its low flop/byte ratio (counting just the bytes crossing the processor-memory boundary!).
The SpMV product, the key HPCG component, performs only one MPY+ADD for each matrix element brought from memory, i.e., the double-precision flop/byte ratio is <=0.25 (2 flops per 8-byte value) and the single-precision ratio is <=0.5.
For example, the NVIDIA Volta-100 has a memory bandwidth of 900 GB/s, enough for <=225 DP GFlop/s in SpMV, so at most 2.88% of its 7800 GFlop/s DP peak computing power can be used.
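The bound above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the LOWAIN project itself: the function names are hypothetical, and it assumes, as the abstract does, that only the matrix values count toward memory traffic (index and vector traffic would lower the ratio further).

```python
def spmv_flop_per_byte(bytes_per_value: int) -> float:
    """Upper bound on SpMV arithmetic intensity.

    Assumes one multiply and one add (2 flops) per matrix value loaded,
    and counts only the bytes of the matrix value itself.
    """
    return 2.0 / bytes_per_value


def bandwidth_bound_gflops(bandwidth_gb_s: float, flop_per_byte: float) -> float:
    """Highest GFlop/s rate that the memory bandwidth can feed."""
    return bandwidth_gb_s * flop_per_byte


# Figures quoted in the abstract for the NVIDIA Volta-100.
V100_BANDWIDTH_GB_S = 900.0
V100_DP_PEAK_GFLOPS = 7800.0

dp_ratio = spmv_flop_per_byte(8)  # double precision: 8-byte values
sp_ratio = spmv_flop_per_byte(4)  # single precision: 4-byte values
dp_bound = bandwidth_bound_gflops(V100_BANDWIDTH_GB_S, dp_ratio)
peak_fraction = dp_bound / V100_DP_PEAK_GFLOPS

print(f"DP flop/byte <= {dp_ratio}, SP flop/byte <= {sp_ratio}")
print(f"DP SpMV bound: {dp_bound} GFlop/s ({peak_fraction:.2%} of peak)")
```

Substituting the bandwidth and peak figures of another processor gives the corresponding upper bound for that machine under the same assumptions.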
The first phase of LOWAIN aims to confirm that a wide class of simulations (e.g., NWP, CFD, mechanical deformation, combustion/explosion) exhibits a flop/byte ratio not much higher than that of HPCG; this has already been done for the WRF program.
Since it is unlikely that memory bandwidth will be substantially increased in the near future (the Volta-100 memory bus is already extremely wide, at 4096 bits!), the first LOWAIN phase would justify the development of application-specific computer architectures that are more efficient for low flop/byte problems. In particular, it would support the highly heretical idea of “exascale-equivalent” computers: machines with the same low-F/B performance as a Summit-like exascale computer, but with strongly sub-exascale peak performance.
The second phase would be directed at the design of intelligent memory systems that guarantee the best use of the limited memory bandwidth, since the cache-miss behavior and rather rigid prefetch tools of existing systems are not sufficient.