USC ISI to Develop Translation and Information-Retrieval System for Uncommon Languages

January 8, 2018

Researchers receive $16.7 million grant to automatically translate and summarize “low-resource” language documents into English

ISI researchers are developing an automated information translation and summarization tool to quickly translate obscure languages. Photo/iStock.

A team of researchers from the Information Sciences Institute at USC Viterbi has received a $16.7 million grant from the Intelligence Advanced Research Projects Activity (IARPA) to develop an automated information translation and summarization tool to quickly translate obscure languages.

Principal investigator and ISI research team leader Scott Miller, ISI computer scientist Jonathan May, and ISI research lead Elizabeth Boschee are leading a team of about 30 researchers, including academics from the University of Massachusetts, Northeastern University, MIT, RPI, and the University of Notre Dame. Prem Natarajan, ISI’s Michael Keston executive director and research professor of computer science, and Kevin Knight, ISI research director and Dean’s professor of computer science, serve as senior advisors.

The ISI team’s project is called SARAL, which stands for Summarization and domain-Adaptive Retrieval. SARAL is also a Hindi word whose translations include “simple” and “ingenious.” The team includes experts in machine translation, speech recognition, morphology, information retrieval, representation, and summarization.

“The overall objective is to provide a Google-like capability, except the queries are in English but the retrieved documents are in a low-resource foreign language,” says Miller, who is based at ISI’s newest office in Boston, MA.

“The aim is to retrieve relevant foreign-language documents and to provide English summaries explaining how each document is relevant to the English query.”

In this project, the ISI team will initially test their systems using Tagalog and Swahili, two low-resource languages selected by IARPA for the task. Over the course of the project, the team will receive additional languages to translate using the systems.

Doing more with less

Although so-called “low-resource” languages are often spoken by millions of people worldwide, relatively little written material exists in these languages. This creates a challenge for current translation systems, which typically “learn” from seeing millions of written examples.

“Since we don’t have a lot of written data in these languages, we have to do more with less,” says May, who also holds an appointment as a research assistant professor in computer science at USC Viterbi.

“Ideally, we would use about 300 million words to train a machine translation system—and in this case, we have around 800,000 words. There are about 100,000 words per novel, so we have only eight novels’ worth of words to work from.”

Principal investigator Scott Miller, a research team leader based at ISI’s newest office in Boston, MA.

The researchers will begin the project by compiling material in the test languages, including speech, online documents, and video clips, that has previously been translated into English.

They will then develop algorithms to analyze the language patterns, such as sentence structure—subject, verb and object position, for example—and morphology, the structure of words and their relation to other words in the same language.

The system will be designed to respond to domain-specific queries, for example environmental protection in the “government and politics” domain or primary education in the “lifestyle” domain, and will produce a summarized response of about 100 words describing how the result is relevant to the search.

“You can think of the summary as something like CliffsNotes, but with the added feature that it is indexed to the precise part you want to write your essay about,” says May.

In addition to ISI, teams from Johns Hopkins University, Columbia University, and Raytheon BBN Technologies are working toward the same goal as part of the IARPA program, called MATERIAL, which stands for Machine Translation for English Retrieval of Information in Any Language.

“IARPA’s MATERIAL program is the first organized attempt at synthesizing recent advances in machine translation, speech recognition, cross-lingual retrieval and summarization into a powerful new capability that allows users to accurately access all relevant information, across languages and modalities,” says Natarajan.  “We are tremendously grateful for the opportunity to contribute to this nationally important effort.”