(PhD03) Design of Robust Scheduling Methodologies in High Performance Computing
Clouds and Distributed Computing
Performance Analysis and Optimization
TimeMonday, June 17th1pm - 6pm CEST
DescriptionScientific applications are often irregular and characterized by large computationally-intensive parallel loops. The performance of scientific applications on high performance computing (HPC) systems may degrade due to load imbalance. Load imbalance may be caused by irregular computational load per loop iteration or irregular and unpredictable computing system characteristics. Dynamic loop scheduling (DLS) techniques improve the performance of computationally-intensive scientific applications by balancing the load during their execution.
A number of dynamic loop scheduling (DLS) techniques have been proposed between the late 1980s and early 2000s and efficiently used in scientific applications. HPC systems have significantly advanced in recent decades and are continuing to grow in terms of computing power and memory. State-of-the-art HPC systems have several million processing cores and approximately 1 Petabyte of system memory. However, achieving a balanced load execution of scientific applications on such systems is challenging due to systems heterogeneity, unpredictable performance variations, perturbations, and faults.
My Ph.D. aims to improve the performance of computationally-intensive scientific applications on HPC systems via robust load balancing under unpredictable application and system characteristics.
Given the significant advancement of HPC systems, the computing systems on which DLS techniques have initially been tested and validated are no longer available. Therefore, this work is concerned with the minimization of the sources of uncertainty in the implementation of DLS techniques to avoid unnecessary influences on the performance of scientific applications. It is essential to ensure that the DLS techniques employed in scientific applications today adhere to their original design goals and specifications and attain trust in the implementation of DLS techniques in today’s studies. To achieve this goal, verification of DLS techniques implementation via the reproduction of selected experiments  was performed via simulative and native experimentation.
Simulation alleviates a large number of exploratory native experiments required to optimize applications performance, which may not always be feasible or practical due to associated time and costs. Bridging the native and simulative executions of parallel applications are needed for attaining trustworthiness in simulation results. To this end, a methodology for bridging the native and simulative executions of parallel applications on HPC systems is devised in this work. The experiments presented in this poster confirm that the simulation reproduces the performance achieved on the past computing platform and accurately predicts the performance achieved on the present computing platform. The performance reproduction and prediction confirm that the present implementation of the DLS techniques considered both, in simulation and natively, adheres to their original description.
Using the above simulation methodology, trusted simulation of application performance was leveraged to achieve a balanced execution under perturbations via simulated assisted scheduling (SimAS). SimAS is a new control-theoretic inspired approach that predicts and selects DLS techniques that improve the performance under certain execution scenarios. The performance results confirm that the SimAS-based DLS selection delivered improved application performance in most experiments.
 S. Flynn Hummel, E. Schonberg, and L. E. Flynn, “Factoring: A method for scheduling parallel loops,” Communications of the ACM, vol. 35, no. 8, pp. 90–101, 1992.