Words matter. And in the wake of a crisis, understanding another country’s language is vital so that information never gets lost in translation.
Enter a team of 10 researchers and students from USC’s Information Sciences Institute (ISI). Led by research director Kevin Knight, the team hopes to build machine-learning systems that can quickly decode any of the world’s 7,000 languages and automatically produce information that can be acted on.
To hone their algorithms, Knight and the team from ISI’s Natural Language Lab recently took part in a three-week assignment to translate Oromo and Tigrinya, two Ethiopian languages that are spoken by millions yet unknown to machine-translation systems. Current online translation services, including Google Translate, support only about 100 of the world’s languages.
The assignment was part of a project overseen by the Defense Advanced Research Projects Agency (DARPA), which aims to create a rapid automated language toolkit for languages currently missing from the linguistic databases that feed online translation systems.
By creating platforms that could be used in any region where international disaster relief teams have little or no local language expertise, the team hopes to get a step closer to the holy grail of machine translation: a universal translator that supports all the world’s languages.
“Let’s say there’s an earthquake in Armenia, the language they speak in that area is probably not covered with current technology,” says Knight.
“We want to be able to look at these messages and say: these are the ones that are describing the earthquake, these are the ones that are asking for food and water. That way, the aid organization knows, for example, what food and supplies to put on trucks and where to send them.”
Looking for clues
When the assignment launched on August 7, DARPA gave participating teams the names of the languages, a humanitarian disaster scenario based on real events and a data pack of text in the assigned languages.
Machine translation relies on huge annotated datasets: the bigger the dataset, the better the system learns. But since massive datasets don’t exist for low-resource languages, including Oromo and Tigrinya, the team had to forage for clues using its artificial intelligence toolkit.
The tactics included:
- Using a program, developed by an ISI team member, that converts text in any language’s writing system into the Latin alphabet.
- Using a name-finding tool that highlights the names of people, places and organizations to quickly put a document in context and identify what supplies are needed most and where.
- Using a known language from the same language family as the unknown language as a stepping stone for decoding similar words.
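The stepping-stone idea in the last tactic can be illustrated with a toy sketch. This is not the ISI team's system: it assumes the words have already been romanized, uses plain string similarity from Python's standard `difflib` module as a stand-in for whatever matching the team used, and relies on a hypothetical lexicon whose entries are invented placeholders, not real Oromo or Tigrinya words.

```python
from difflib import get_close_matches

# Hypothetical lexicon for a better-resourced related language:
# romanized word -> English gloss. All entries are invented placeholders.
RELATED_LEXICON = {
    "talor": "water",
    "nemik": "food",
    "sudra": "house",
}


def guess_gloss(unknown_word, lexicon=RELATED_LEXICON, cutoff=0.6):
    """Return (related_word, gloss) for the closest spelling match, or None.

    `cutoff` is the minimum similarity ratio (0 to 1) a candidate must reach.
    """
    matches = get_close_matches(
        unknown_word.lower(), list(lexicon), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    best = matches[0]
    return best, lexicon[best]


if __name__ == "__main__":
    # A word from the "unknown" language that resembles a known one.
    print(guess_gloss("talur"))
    # A word with no close relative in the lexicon.
    print(guess_gloss("zzzzz"))
```

Spelling similarity alone produces false matches, so a guess like this would only be a candidate for a person or another tool to confirm.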
The team was also permitted a one-hour Skype call with a native speaker, who could translate specific words and terms that proved tricky to decode using AI alone.
As the team assembled the linguistic puzzle, the newly translated data was fed back into the systems, allowing the computer to gradually “learn” the previously unknown languages.
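As a rough illustration of that feedback loop, the sketch below tracks how much of a message a word list can cover before and after newly confirmed translations are added. The function names, the dictionary-based lexicon and the placeholder words are all assumptions made for this example; the team's actual systems are not described in this level of detail.

```python
def coverage(tokens, lexicon):
    """Fraction of tokens that already have an entry in the lexicon."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


def feed_back(lexicon, confirmed_pairs):
    """Return a new lexicon extended with newly confirmed word -> gloss pairs."""
    updated = dict(lexicon)  # copy, so the earlier lexicon is left untouched
    updated.update(confirmed_pairs)
    return updated


if __name__ == "__main__":
    # Invented placeholder tokens standing in for a romanized message.
    message = ["talor", "nemik", "sudra", "talor"]
    lexicon = {"talor": "water"}
    print(coverage(message, lexicon))

    # A newly confirmed translation is fed back in, and coverage is re-measured.
    lexicon = feed_back(lexicon, {"nemik": "food"})
    print(coverage(message, lexicon))
```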
On August 28, DARPA issued a final translation test to the ISI team and to the two other participating teams, from Raytheon BBN Technologies and Carnegie Mellon University.
The final score will be revealed at the DARPA principal investigator meeting on September 13 in Raleigh, North Carolina. Ultimately, however, it’s a collaborative project. After the results are announced, the teams will come together to share their experiences and lessons learned.
Knight says: “It really allows us to put our technology to the test. It’s a way to drive the research and technology forwards and work on something that could make a huge impact.”
Published on August 22nd, 2017
Last updated on June 2nd, 2021