What Is Reproducibility?
The scientific method – it’s the backbone of all scientific research. Everyone from third-grade science students to Nobel Prize winners uses this problem-solving method. And one of the cornerstones of the scientific method is that results must be reproducible.
What does that mean? For our purposes, “reproducibility” means being able to obtain consistent results using the same input data; computational steps, methods, and code; and conditions of analysis.
So, if your experiment is to test X under Y conditions and your result is Z, then if you repeat the experiment, you should get Z again. And when another researcher tries to test X under Y conditions, they should also get the result Z. That’s how science works!
Unfortunately, that is not always – or even often – the case.
The Big Problem in Science Right Now
In the last two decades, the science community has come to understand that many scientific studies are hard or almost impossible to accurately reproduce – it’s been termed “the reproducibility crisis.” In a 2016 survey of over 1,500 scientists conducted by the journal Nature, more than 70% of researchers reported that they had tried and failed to reproduce another scientist’s experiments, and more than half had failed to reproduce their own experiments.
This undermines the credibility of the findings – which is obviously bad – but it gets even worse when you take into account the fact that new research builds upon previous findings. That is to say, if we can’t trust that testing X under Y conditions will always result in Z, then we can’t use Z to figure out anything else.
Jay Pujara, director of the Center on Knowledge Graphs at USC’s Information Sciences Institute (ISI), put it quite simply: “People will not believe in science if we can’t demonstrate that scientific research is reproducible.”
Pujara was part of a team of researchers at ISI that included co-PIs Kristina Lerman and Fred Morstatter who, in 2019, were awarded a Defense Advanced Research Projects Agency (DARPA) contract. This DARPA program — called SCORE (Systematizing Confidence in Open Research and Evidence) — aimed to develop and deploy automated tools to assign “confidence scores” that would quantify the degree to which a particular scientific result is likely to be reproducible.
Assessing Reproducibility
It is possible to manually assess the reproducibility of a research paper, but doing so is resource- and time-intensive, and therefore unscalable. And in many scientific fields, career advancement hinges on obtaining new results, not redoing old experiments. So, with limited time available, researchers are not incentivized to run laborious replication studies to try to manually recreate results related to their work. What the science community needs, and what SCORE was seeking to find, is a way to automatically assess the reproducibility of research at scale.
In recent years, researchers have begun to apply machine learning techniques to the problem – analyzing how different variables in a paper impact reproducibility. Kexuan Sun, a PhD student in Pujara’s research group, took these methods a step further, looking not only at the variables within a research paper (which had been done before), but also taking a holistic view and analyzing the relationships between papers (which had never been done).
“We had new ideas about how to assess reproducibility using knowledge graphs and looking at the full context of a paper, instead of just trying to read it or look at the specific features within it,” Pujara said. “We weren’t just going to look at sample size and p-values, for example, but try to be more holistic to cover everything that we could learn about a paper.”
The team developed and tested their novel approach, and the resulting paper, “Assessing Scientific Research Papers with Knowledge Graphs,” was published at ACM SIGIR 2022 (the Association for Computing Machinery’s Special Interest Group on Information Retrieval conference) in July 2022.
From the Micro to the Macro
The ISI team gathered a huge amount of data from over 250 million academic papers and turned it into a vast knowledge graph (KG), a representation of a network of real-world entities that illustrates the relationships between them. During the KG construction, the team combined information at both the micro and macro levels.
On the micro level, they zoomed in and looked at the variables within a particular published article that have been shown to impact reproducibility, such as sample sizes, effect sizes, and experimental models. These micro features were attached as additional information to entities like authors or publications.
The team also zoomed out to the macro level to incorporate relationship information between entities, such as authorship and reference information. These macro relationships were used to construct the network structure.
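To make the micro/macro split concrete, here is a minimal sketch of how such a graph could be laid out in Python. It is not the ISI team’s implementation: the class names (`Entity`, `ToyKnowledgeGraph`), the relation labels (`written_by`, `cites`), and every identifier and number in the example are hypothetical and invented purely for illustration. Micro features live on the entities; macro relationships form the edges.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in the toy graph, e.g. a paper or an author."""
    entity_id: str
    kind: str  # "paper" or "author"
    # Micro-level information (sample size, effect size, ...) attached to the node.
    features: dict = field(default_factory=dict)


class ToyKnowledgeGraph:
    """Entities plus labeled, directed relationships between them."""

    def __init__(self):
        self.entities = {}
        # Macro-level structure: (source_id, relation) -> set of target ids.
        self.edges = defaultdict(set)

    def add_entity(self, entity):
        self.entities[entity.entity_id] = entity

    def add_relation(self, source_id, relation, target_id):
        self.edges[(source_id, relation)].add(target_id)

    def neighbors(self, source_id, relation):
        """Return the sorted ids reachable from source_id via one relation."""
        return sorted(self.edges.get((source_id, relation), set()))


def build_example_graph():
    """Two made-up papers by one made-up author; all values are illustrative."""
    kg = ToyKnowledgeGraph()
    kg.add_entity(Entity("paper:A", "paper", {"sample_size": 120, "effect_size": 0.4}))
    kg.add_entity(Entity("paper:B", "paper", {"sample_size": 35, "effect_size": 0.9}))
    kg.add_entity(Entity("author:1", "author"))
    kg.add_relation("paper:A", "written_by", "author:1")
    kg.add_relation("paper:B", "written_by", "author:1")
    kg.add_relation("paper:B", "cites", "paper:A")
    return kg


kg = build_example_graph()
print(kg.neighbors("paper:B", "cites"))   # macro: which papers does B cite?
print(kg.entities["paper:A"].features)    # micro: what do we know about A itself?
```

A real graph covering hundreds of millions of papers would need a dedicated graph store rather than in-memory dictionaries; the sketch only shows where each kind of information sits.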
So, Did It Work?
Short answer: yes! The novel method proved to be surprisingly synergistic. The relationship information from the macro-level analysis informed the data obtained at the micro level. Likewise, the micro-level text analysis gave insights into the macro-level KG analysis. A key finding of the research was that the method provided information that was more than just the sum of the two parts; the two ways of looking at the data worked in concert with one another.
“We found that if we could put both of these things together, combine some features from the text and some features from the knowledge graph, we could do much better than either of those methods alone,” said Pujara, calling their result “state-of-the-art performance on predicting reproducibility.”
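As a rough illustration of what “putting both of these things together” can look like in code, the sketch below concatenates text-derived features and graph-derived features into one feature set and passes it through a simple logistic scorer. This is an assumption-laden toy, not the model from the paper: the function names (`combine_features`, `confidence_score`), the feature names, and the weights are all hypothetical, and a real system would learn its weights from labeled replication outcomes.

```python
import math


def combine_features(text_features, graph_features):
    """Merge two named feature dicts into one, prefixing keys to avoid clashes."""
    combined = {f"text:{name}": float(value) for name, value in text_features.items()}
    combined.update(
        {f"graph:{name}": float(value) for name, value in graph_features.items()}
    )
    return combined


def confidence_score(features, weights, bias=0.0):
    """Logistic score in (0, 1) from a weighted sum of the features.

    Features without an entry in `weights` contribute nothing.
    """
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: one feature read from the paper's text, one from the graph.
text_features = {"p_value": 0.03}
graph_features = {"citation_count": 5}

# Hypothetical weights, chosen only so the example produces a number.
illustrative_weights = {"text:p_value": -2.0, "graph:citation_count": 0.3}

features = combine_features(text_features, graph_features)
print(features)
print(confidence_score(features, illustrative_weights))
```

The design choice here is the simplest one available: both views of a paper end up in a single feature set, so one scorer can weigh text evidence and graph evidence against each other.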
From here, the team plans to scale up and extend the KG to include more papers, as well as use it to dig deeper into specific components. The team has already used the knowledge graph to analyze the relationship between citations and gender, and the resulting paper was published in Proceedings of the National Academy of Sciences (PNAS) on September 26, 2022.
They’d also like to make their tools open to the public. “We’re trying to build portals where people can investigate research and its reproducibility,” said Pujara, “a web page where people could type in papers and explore the knowledge graph and some of the features.”
Settling the Score with Science
Pujara sees real-world opportunities for students, the scientific community and the public to use the results, which in the end would be “reproducibility scores” given to pieces of research.
“If you’re a student who’s getting started, maybe this would help you understand how to do better research.”
And for the science community, it could streamline the current system used to review new research. “When a conference or a journal decides which papers to review and which papers to potentially reject or send back for more work, sometimes there are 10 or 20 thousand papers submitted to a conference. Just think about the energy that it takes to review those. This is the way of prioritizing those twenty thousand papers.”
And finally, for the general public, the method they’ve developed could help with trustworthiness, which – as we saw during the COVID-19 pandemic – has real public health implications. Of the pandemic, Pujara said, “There was sham science, and then there was really good science,” which led to controversy over trustworthiness.
Pujara continued, “If we could use these sorts of automated tools to understand what research was trustworthy and what research we should take with a grain of salt, maybe we’d be able to avoid that type of controversy, and people would believe in the research results and avoid spreading misinformation.”
Thanks to the continued work done by Pujara and the ISI team, the reproducibility crisis in science may not be over, but it’s being addressed. And in the future, if we test X under Y conditions, we’ll be able to tell exactly how trustworthy Z really is.
This work was funded by the Defense Advanced Research Projects Agency with award W911NF-19-20271.
Published on November 2nd, 2022
Last updated on May 16th, 2024