Large language models such as ChatGPT are trained on text datasets drawn from sources like websites, news articles, scientific papers and books. Researchers have been training these models on ever-larger amounts of data to improve their performance and versatility.
But quantity does not always equal quality, says Swabha Swayamdipta, a USC assistant professor of computer science, who recently won an Intel Rising Star Faculty Award (RSA) for her research in natural language processing and machine learning.
Swayamdipta is one of 15 award recipients from around the world selected for their great promise as future academic leaders in disruptive computing technologies. This year, she also received the Allen Institute for AI’s AI2 Young Investigators Award, which recognizes outstanding early career researchers in the field of artificial intelligence.
We recently sat down with Swayamdipta to discuss her reaction to the awards, her research, and her journey in academia.
This interview has been edited for style and clarity.
Q: What was your reaction when you heard that you received both the Intel Rising Star Award and the AI2 Young Investigators Award?
A: For me it was a boost of motivation. Both awards have evaluated my entire research direction and it’s very encouraging because it means my research pursuits are supporting the interests of the broader scientific community.
Q: In addition to teaching, you also lead the Data, Interpretability, Language and Learning (DILL) Lab here at USC. What is the main focus of your research?
A: As the name of our lab suggests, the goal is to look at natural language processing (NLP) and AI from a data-centric view. Much of the progress made in NLP over the last few years can be attributed to the ability to scale neural network models on very large datasets.
But the way my lab looks at data is a little more practical. The focus of my research leans more toward analyzing the quality of datasets and how we can better curate the models that we train.
Traditionally, a lot of innovation in the field has focused on architecting better models. But data does not exist independently of models and we cannot train machine learning models without data; neither exists in a vacuum.
Instead, I try to look at data and models as a continuum to understand why certain datasets work better for certain models. We can use this as a springboard to design better datasets or synthetically generate data to train better models.
Q: What sparked your interest in NLP and computation?
A: I have always been a very curious person. While I was pursuing my undergraduate degree in India, I had limited exposure to NLP. Years ago, NLP research was not as front-and-center as it is today. Especially in India, this was not a popular topic to pursue.
But personally, I have always been interested in language. My parents and grandparents have dabbled in poetry and literature throughout their lives and even published a few books on these topics. So pursuing a computer science degree, where there is scope to look at natural language and computation, was immediately very attractive to me.
Q: What challenges have you encountered in your research? How are you tackling these in your lab today?
A: Since I began working as a professor, there has been a lot of upheaval in the field of NLP. The introduction of GPT-3.5 and GPT-4 spurred many changes in the field, leading to investment in large language models that didn’t exist previously. In retrospect, I think this is a welcome challenge because these models can solve problems that used to be considered impossible for machines.
But we are only beginning to scratch the surface of the biases and ethical challenges in generative models of natural language. We are aware of issues relating to hallucination and copyright infringement but have no concrete solutions to deal with these issues yet.
Other issues, such as social biases picked up by language models and reflected in the generated language, can be subtle and hard to detect in generative AI settings. My lab is working on how we can evaluate generative models of language at scale by automatically detecting subsets of data that require careful human inspection.
For instance, take hate speech detection: current models can easily detect slurs but struggle with microaggressions when they lack the much-needed context. So that’s one push that we are trying to make, which is to teach the model explicitly about the subjectivity of what constitutes hate speech, instead of having the model decide on its own based on the data provided.
Q: You joined the USC faculty last year. What sparked your interest in the Thomas Lord Department of Computer Science?
A: USC was one of my top choices when I decided to join academia, primarily because the department has a lot of energy. I had great interactions both with the faculty and the students I met during my interviews and beyond. I also got the impression that my colleagues were exploring a lot of different areas, some of which were connected to my own research interests. However, they were diverse enough that I could see myself learning something new while finding support for my own work.
Q: What is your advice to other early career researchers and students hoping to pursue academia?
A: There is so much room for creativity now in the field. When I first started my journey in academia, the tools and models were very limited in capacity. But today, there are so many new and exciting questions to pursue. Be as open-minded and as creative as you can, but don’t lose track of the fundamental understanding of existing models. You won’t be able to create models of tomorrow if you don’t understand the models of today. Balance these two ideals and it will take you a really long way.
Published on October 16th, 2023
Last updated on October 16th, 2023