(PP18): Harmony: A Harness Monitoring System for the Oak Ridge Leadership Computing Facility
Event Type
Project Poster
Extreme-Scale Computing
HPC workflows
Heterogeneous Systems
Parallel Applications
Time
Tuesday, June 18th, 3:15pm - 3:45pm CEST
Location
Booth N-230
Description
Summit, the latest flagship supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) and the number one system on the November 2018 Top500 list, completed its acceptance testing in 2018. Acceptance of a new system requires extensive testing and comprises hundreds of tests executed over several weeks. The acceptance test (AT) team uses the OLCF test harness to automate the launching and verification of these tests. As part of this activity, test results must be analyzed so that every test failure is understood and classified. The sheer number of tests involved makes these tasks challenging. To complete them more efficiently and to lessen the personnel burden during acceptance testing, we have developed Harmony, a harness monitoring system for the OLCF test harness.

Harmony consists of three distinct modules: monitoring, recording, and reporting. The monitoring module ensures that tests launched by the harness are progressing in the job queue and are restarted correctly after any failure. It can send alerts about the status of tests to AT personnel via multiple channels, including a custom Slack application and email. The recording module ingests results generated by the test harness into a database and automatically updates it as new results are generated. A Django-based website provides the interface for the reporting module, letting us filter through tests and analyze, describe, and categorize any test failure.
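The division of labor among the three modules can be illustrated with a minimal Python sketch. This is not Harmony's code: the `HarnessRun` record, the function names, the `results` table layout, and the one-hour idle threshold are all hypothetical, an in-memory SQLite database stands in for whatever database Harmony uses, and alert delivery (Slack, email) and the Django front end are left out.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class HarnessRun:
    """Hypothetical snapshot of one harness test (not Harmony's schema)."""
    test_id: str
    queue_state: str          # "queued", "running", or "finished"
    last_update: float        # epoch seconds of the last observed progress
    status: str = "unknown"   # "pass", "fail", or "unknown"


def find_stalled(runs, now, max_idle_s=3600.0):
    """Monitoring: IDs of unfinished runs idle for longer than max_idle_s."""
    return [r.test_id for r in runs
            if r.queue_state != "finished" and now - r.last_update > max_idle_s]


def open_db(path=":memory:"):
    """Create the (assumed) results table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "test_id TEXT PRIMARY KEY, status TEXT NOT NULL, category TEXT)")
    return conn


def record(conn, runs):
    """Recording: add unseen tests, refresh the status of known ones.

    The analyst-assigned category column is left untouched on update.
    """
    for r in runs:
        conn.execute(
            "INSERT OR IGNORE INTO results (test_id, status) VALUES (?, ?)",
            (r.test_id, r.status))
        conn.execute(
            "UPDATE results SET status = ? WHERE test_id = ?",
            (r.status, r.test_id))
    conn.commit()


def uncategorized_failures(conn):
    """Reporting: failed tests that nobody has classified yet."""
    rows = conn.execute(
        "SELECT test_id FROM results "
        "WHERE status = 'fail' AND category IS NULL ORDER BY test_id")
    return [row[0] for row in rows]


if __name__ == "__main__":
    runs = [
        HarnessRun("app-a", "running", last_update=0.0),
        HarnessRun("app-b", "finished", last_update=0.0, status="fail"),
    ]
    conn = open_db()
    record(conn, runs)
    print("stalled:", find_stalled(runs, now=10_000.0))
    print("needs triage:", uncategorized_failures(conn))
```

In this sketch, re-running `record` as new results arrive only refreshes the `status` column, so a failure category entered by an analyst is not overwritten by later ingests.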

Harmony is open source and publicly available. This poster presents Harmony and shows how its modular design allows it to be customized for other useful purposes.