At the PEARC19 conference this past summer, researchers at the USC Viterbi School of Engineering won a transformative research award in addition to the best technical paper award. The work focused on giving scientists extra assurance that their data is not inadvertently corrupted when they run their analyses on multiple computers in a distributed computing environment.
Computer Scientists Mats Rynge, Karan Vahi and Ewa Deelman from the Pegasus research team at USC Viterbi’s Information Sciences Institute, or ISI, along with researchers from Indiana University and RENCI, designed and implemented a scheme to compute and verify checksums of datasets at various points in a scientific workflow’s execution lifecycle in the Pegasus Workflow Management System.
Much like the constellation it’s named after, Pegasus connects thousands of points across the world. Instead of stars, however, Pegasus is a network of computers that interact with one another to help researchers design a workflow, or a proposed set of steps for how to efficiently analyze large data sets, thereby streamlining the scientific process.
Pegasus can give researchers access not only to large sets of data, but also to the software needed and computers powerful enough to run computations on the data. It’s a helpful tool for any researcher dealing with large sets of data. Pegasus is used, according to Rynge, “whenever you have a computational problem that’s bigger than one machine.”
In 2017, scientists from the Massachusetts Institute of Technology and the California Institute of Technology made headlines with the detection of gravitational waves, confirming Albert Einstein’s General Theory of Relativity. Pegasus is credited with facilitating the Nobel Prize-winning research along with other earth-shattering scientific developments, such as the simulation of earthquake flows in Southern California. While the Pegasus team has contributed significantly to the scientific research community with its software, the people behind Pegasus couldn’t help but wonder if their well-oiled machine was running as smoothly as everyone had thought.
In the world of big data, computers store and share large amounts of data. In distributed computing, it is common to move data from computer to computer. During those transfers, bits of data can get corrupted along the way, which would lead to incorrect computations. When computations fail, scientists require more time and resources to run them again. Or worse, they may not realize the data is corrupt, leading to inaccurate scientific results.
A major part of the problem is that users believe the computing infrastructure will automatically flag errors, correct them, or both if corruption occurs. Though some transfer protocols may offer such capabilities, their availability depends heavily on the computing environment and the tools provided.
Rather than simply making users aware of these issues, Pegasus team members decided to implement an end-to-end scheme for checksumming the various data products (both those produced as part of a workflow and those used by it), independently of the underlying transfer tools. This work is a result of the Scientific Workflow Integrity with Pegasus project, funded through a multi-year grant from the National Science Foundation.
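For readers curious what checksumming a data product looks like in practice, the idea can be sketched in a few lines of Python. This is an illustration only, not the Pegasus implementation: the function names `record_checksum` and `verify_checksum` are hypothetical, and the choice of SHA-256 is an assumption made for the example.

```python
import hashlib


def record_checksum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Hypothetical helper: a workflow system would call something like this
    when a data product is first created or registered.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Return True if the file at `path` still matches the recorded digest.

    Hypothetical helper: this check would run after a transfer and before
    a job reads the file, regardless of which transfer tool moved it.
    """
    return record_checksum(path) == expected
```

In this sketch, a checksum recorded where a file is produced travels with the workflow's metadata, and every machine that later receives the file recomputes the digest before using it. A mismatch means the bytes changed somewhere along the way, so the job can stop instead of silently computing on corrupt data.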
When this capability was rolled out, Pegasus team members expected it to act as a safety net. They did not expect it to catch many real-world errors that otherwise would have gone undetected, but it did. That finding took even the developers by surprise and raises a larger issue: computer scientists can no longer assume that large sets of data shared among computers are consistently accurate.
The Pegasus team described its findings in “Integrity Protection for Scientific Workflow Data: Motivation and Initial Experiences,” which won the Best Technical Paper award at the PEARC19 conference, along with the Phil Andrews Most Transformative Contribution Award. The annual conference, titled “Practice and Experience in Advanced Research Computing,” took place in Chicago in late July. The gathering attracted hundreds of computer scientists and scientific computing specialists from around the country for a scholarly weekend focused on applied computing research.
The discovery of inconsistent data goes to show that science isn’t always about getting things right, but also about asking the right questions. “It’s focusing on a topic that a lot of people didn’t realize maybe was an issue,” Rynge said. “Us highlighting it made people really think about it as something to pay attention to.”