Proteins work together in complex ways to support the processes of life, and their structures tell unique tales of functionality and purpose. Documenting these structures can unlock new avenues of biological knowledge, and researchers at USC Viterbi’s Information Sciences Institute (ISI) are currently contributing to these efforts.
The Protein Data Bank (PDB) is a globally shared repository of data on macromolecular structures and their complexes. Established in 1971, the PDB provides open access to structural biology data. PDB data was used to train AlphaFold, an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. ISI is now collaborating with members of the PDB to build PDB-Dev, a repository that expands the PDB’s capabilities to archive macromolecular structures determined by novel and emerging methods.
Carl Kesselman, Director of ISI’s Informatics Systems Research Division, and his colleagues have developed critical technology that facilitates data entry into the PDB-Dev repository. The technology, Deriva, makes data more discoverable and more useful to scientists.
Discovering protein structures
One of the core problems in structural biology, which scientists have been investigating for around 40 years, is determining the shape of a protein from its amino acid sequence.
Historically, the main way to determine the tertiary structure of proteins was X-ray crystallography. The technique also underpinned Rosalind Franklin’s breakthrough research, which contributed to the discovery of the double helix structure of DNA in 1953.
However, crystallography is slow and expensive, and some proteins do not crystallize well, which limits what researchers can study. This is where AlphaFold comes in. Trained on the PDB’s extensive archive of structures, AlphaFold can predict protein structures computationally rather than experimentally.
Even with AlphaFold’s achievements, there is a growing need to examine more complicated proteins and macromolecular structures using a new generation of techniques supported by PDB-Dev.
ISI’s technology at the forefront
ISI has a history of working with protein structures, having previously joined the GPCR Consortium to advance structural information about G protein-coupled receptors.
This experience uniquely positioned Kesselman and his team to develop the PDB-Dev Data Harvesting System, which runs on Deriva, a software platform that helps researchers manage their scientific assets.
“Our software plays a critical role in the PDB-Dev data processing infrastructure, supporting data collection, curation, validation, and dissemination. This allows the PDB-Dev repository to collect information about new structures and process and store them in the archive more easily. If you want to publish anything about protein structures in peer-reviewed journals, the structures you discovered have to be archived in PDB or PDB-Dev. ISI’s software is crucial for the data submission process,” Kesselman noted.
Kesselman is optimistic that ISI’s efforts will lead to major advances in understanding diseases and discovering new drugs, dramatically expanding the scope of scientific research.
“The technology we developed will play a huge role in expanding the Protein Data Bank,” said Kesselman. “Whenever scientists want to input their findings into the PDB-Dev system, it goes through our software. Members of the PDB recognized that ISI’s tools are well-suited to flexibly manage data and enhance collaboration.”
Impact on the quality of scientific data
It is important that the data in scientific papers remains transparent and reproducible. When a data set is foundational, like the one in PDB-Dev, every entry must be well documented and citable so that the scientific community can benefit from it.
“The New York Times recently published an article about how the Dana-Farber Cancer Center at Harvard had to withdraw 12 papers because of fraudulent data,” Kesselman recalled. “One of the ways to prevent this from happening is to have well-curated data that goes into your studies. If not, you’re prone to errors or malfeasance. That’s why repositories like the PDB-Dev are important, because they force contributors to validate their data.”
Well-curated data also matters for emerging fields such as artificial intelligence (AI), which is still in its infancy and depends on a baseline of data quality.
“More globally, there is an issue of whether people can trust the data that goes into AI,” said Kesselman. “AI is very dependent on high-quality training data. That’s why ISI is part of the team helping to ensure better data entry into PDB-Dev, which will drive future advances in research.”
Published on April 8th, 2024
Last updated on October 1st, 2024