My Summer Learning about High-performance Computing from Argonne National Laboratory

Jonathan Hoy | November 15, 2017

What PhD student Jonathan Hoy learned from the Argonne Training Program on Extreme-Scale Computing

The 2017 cohort for the Argonne Training Program on Extreme-Scale Computing (ATPESC), with Jonathan Hoy (bottom row, far left). Photo/Marta Garcia Martinez, Argonne National Laboratory

In the computational science community, there is a need to train engineers and scientists to use current and future high-performance computing (HPC) systems effectively. A couple of years ago, the federal government established a training program to address this need. This program, known as the Argonne Training Program on Extreme-Scale Computing (ATPESC), packs over 100 hours of lectures and hands-on training into two weeks.

This past summer, I was honored to represent USC Viterbi at ATPESC 2017, along with 80 other individuals from other prestigious institutions. This year, the training program took place in St. Charles, Illinois. I’m from Southern California, and since this was my first time in the Midwest, I was not used to seeing such vast expanses of flat land and large cornfields.

Supercomputers at Argonne National Laboratory in Lemont, Illinois. Photo/Jonathan Hoy

The training program was split into several tracks, each covering an important aspect of HPC. In addition, every evening we had a dinner talk from an accomplished speaker on a relevant topic. At the halfway point, we got to visit Argonne National Laboratory in Lemont, Illinois.

One of the most important and practical parts of ATPESC was the second track, on programming models and languages, since it involved writing code that runs on the supercomputers.

While supercomputers have a lot in common with normal desktop computers, they also have a few key differences. The most important is communication between nodes. A normal desktop represents a single node, whereas a supercomputer is a collection of nodes connected by a high-speed network. In recent years, some supercomputers have started using graphics processing units (GPUs) to accelerate certain types of computations. This has been possible in large part because of the video game industry and its desire to create more immersive and realistic gameplay.

The section on numerical libraries introduced popular libraries that solve common math problems at extreme scale. For my research, I have already been using one of them, Trilinos. Among many other capabilities, Trilinos can solve large-scale linear systems of equations on a variety of hardware (CPU, GPU, etc.) using many different methods (direct, iterative, multigrid). Behind these capabilities are millions of lines of code and over one hundred contributors, and the library has been under development for well over a decade. For this reason, it is important to use established software like Trilinos whenever possible rather than write your own implementation. For cases where pre-existing software is not sufficient, the second part of this track focused on how to write algorithms that scale to extreme levels and maximize system performance.
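
To give a sense of what an "iterative" method for a linear system looks like, here is a minimal sketch of the conjugate gradient method in plain Python. It is not Trilinos code and does not use the Trilinos API. The function names (`conjugate_gradient`, `laplacian_1d`, `dot`) and the small test matrix are assumptions chosen for illustration, and the method as written only applies to symmetric positive-definite matrices.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def laplacian_1d(v):
    """Apply the tridiagonal matrix with 2 on the diagonal and -1 beside it.

    The matrix is never stored; only its action on a vector is computed.
    This matrix is an illustrative stand-in, not one taken from the article.
    """
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given matvec(v) = A v.

    Returns the approximate solution and the number of iterations performed.
    """
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, which equals b when x is zero
    p = list(r)          # first search direction is the residual itself
    rs_old = dot(r, r)
    for iteration in range(max_iter):
        if rs_old ** 0.5 < tol:          # stop once the residual norm is small
            return x, iteration
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)      # step length along direction p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old           # keeps the new direction A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


if __name__ == "__main__":
    solution, iterations = conjugate_gradient(laplacian_1d, [1.0] * 50)
    print("iterations:", iterations)
    print("first entry of solution:", solution[0])
```

Even this toy version hints at why established libraries are preferable: a production solver also needs preconditioning, distributed data structures, and hardware-specific kernels, none of which appear here.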

ATPESC 2017 group during a lecture. Photo/Marta Garcia Martinez, Argonne National Laboratory

Best practices for software engineering were also discussed. In many academic codes, these practices are not followed. The purpose of this track was to help our research by discussing methods to improve the quality of our code and to strive for better verification and reproducibility of scientific results.

These practices can be summarized as follows:

  1. Do not reinvent the wheel. By using existing software components, you can focus more of your efforts on advancing science and less effort on writing the code to perform the specific task.
  2. Use version control in your work. Software is fragile in that a single character in the wrong place can prevent the code from performing the needed calculations. By saving working checkpoints, you can revert your code to remove errors or bugs that were unintentionally added.
  3. Add a build and test system. Automating the building and compiling of your code, along with tests that check for correctness and possible bugs, further improves your ability to find and remove errors.
  4. Document your work well, whether it is closed or open source, so that it can be used and improved.
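
As a small illustration of the automated testing described in item 3, here is a sketch using Python's standard `unittest` module. The function under test (`trapezoid`), the test class and the `run_tests` helper are hypothetical examples, not code from the training program. The idea is that each test compares the code against an answer known in advance, so a build system can run the checks after every change.

```python
import math
import unittest


def trapezoid(f, a, b, n):
    """Approximate the integral of f over [a, b] with n equal-width trapezoids."""
    if n < 1:
        raise ValueError("n must be at least 1")
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


class TrapezoidTests(unittest.TestCase):
    def test_exact_for_linear_function(self):
        # The trapezoid rule is exact for straight lines:
        # the integral of 2x + 1 over [0, 3] is 12.
        self.assertAlmostEqual(trapezoid(lambda x: 2 * x + 1, 0.0, 3.0, 4), 12.0)

    def test_matches_known_integral(self):
        # The integral of sin(x) over [0, pi] is exactly 2.
        approx = trapezoid(math.sin, 0.0, math.pi, 1000)
        self.assertAlmostEqual(approx, 2.0, places=5)

    def test_error_shrinks_with_refinement(self):
        # Doubling n should cut the error by roughly a factor of four.
        coarse = abs(trapezoid(math.sin, 0.0, math.pi, 10) - 2.0)
        fine = abs(trapezoid(math.sin, 0.0, math.pi, 20) - 2.0)
        self.assertLess(fine, coarse / 3.0)

    def test_rejects_invalid_n(self):
        with self.assertRaises(ValueError):
            trapezoid(math.sin, 0.0, 1.0, 0)


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TrapezoidTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

If this were saved in a file, running `python -m unittest` in the same directory would typically discover and run the tests, which is the kind of step an automated build can perform on every commit.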

During my time at ATPESC 2017, I was given a lot of useful information about software development, computer architectures, parallel programming frameworks, software libraries and visualization technologies. While this is just the tip of the iceberg, knowing the most relevant research tools available allows me to write better code and lets my research focus more on the physics of the problems that I investigate. Moreover, I can use the knowledge I acquired at ATPESC to help solve difficult computational problems in industry and in academia.

Jonathan Hoy is a PhD student in the Department of Aerospace and Mechanical Engineering, working with Assistant Professor Ivan Bermejo-Moreno.