ISI natural language research has played a key role in the creation of a still-rudimentary but working two-way voice translation system that allows an English-speaking doctor to talk to a Persian-speaking patient.
The Transonics Spoken Dialog Translator both turns a doctor-user’s spoken English questions into spoken Persian, and translates patients’ spoken Persian replies into spoken English. Shrikanth Narayanan leads the large multidisciplinary University of Southern California team that developed Transonics. One member of this team presented a report on the system on June 25 at the Association for Computational Linguistics conference in Ann Arbor, Michigan.
A machine translation system developed by Kevin Knight and Daniel Marcu of the USC Information Sciences Institute is an integral part of Transonics.
“Fluent two-way machine voice translation is one of the holy grails of engineering,” said Narayanan, an associate professor of electrical engineering, computer science and linguistics at the USC Viterbi School of Engineering who directs the Speech Analysis and Interpretation Laboratory (SAIL) in the Viterbi School’s Integrated Media Systems Center.
“We are years away from perfecting it, but we think the choices we have made about how to go about creating such a system are working. We hope to have something that will be useful in emergency rooms or ambulances within two years or so.”
The system that exists, funded by two DARPA grants totalling $3.8 million, is a result of intensive research in information technology, critically supplemented by careful observation of patient-doctor dynamics in numerous bilingual interaction sessions staged for the project.
Narayanan noted that the Transonics approach relies not just on computer code, but also on the ability of humans to use even imperfect tools. This approach, he adds, grows directly out of the extraordinary difficulty of the technical problems involved.
“Two-way voice translation involves combining at least three highly imperfect existing disciplines, with the errors multiplying at every stage,” Narayanan explained. These include:
- Text translation. Taking a written text in one language, and translating it into another. Machine translation systems developed by ISI researchers Knight and Marcu consistently rank among the world’s best – but nevertheless still make frequent grammatical and other errors. Marcu and Knight developed a specialized system specifically for use in Transonics.
- Spoken word recognition. This is Narayanan’s specialty. Just being able to reliably recognize a large number of different single words, in a variety of regional or foreign accents, is a difficult problem that is far from solved, as anyone who has tried to use existing telephone interfaces knows. Recognizing a wide variety of words informally said in a noisy, chaotic environment (emergency room, ambulance) adds another level of difficulty.
- Extra-verbal communication. Humans express themselves in speech not just with words, but also with intonations. A rising tone at the end of a sentence to express a question is one familiar example of this, one that is extraordinarily difficult for a machine to assess. Nonsense syllables (“um, uh, ah, er”), catchphrases (“you know,” “like”) and exclamations (Wow! Hey!) in utterances are easy for humans to decode or ignore, but major stumbling blocks for machines. The insights of David Traum of the USC Institute for Creative Technologies in dialog management are aiding in this area and the others, narrowing the range of possibilities by bringing context and previous exchanges into the computer’s decision-making. Additionally, teaching computers to detect human emotions in speech is a major focus of work by researchers at the USC Speech Analysis and Interpretation Laboratory under the direction of Narayanan and his colleague, USC research assistant professor Panos Georgiou.
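Narayanan’s point about errors multiplying at every stage can be illustrated with a little arithmetic. The sketch below is not drawn from the Transonics project: it assumes each stage succeeds independently, and the per-stage accuracy figures are hypothetical placeholders chosen only to show how quickly reliability falls when imperfect stages are chained.

```python
from math import prod


def end_to_end_accuracy(stage_accuracies):
    """Chance that every stage succeeds, ASSUMING the stages fail independently."""
    return prod(stage_accuracies)


# Hypothetical accuracies for illustration only (not Transonics measurements).
stages = {
    "spoken word recognition": 0.9,
    "extra-verbal interpretation": 0.9,
    "text translation": 0.9,
}

overall = end_to_end_accuracy(stages.values())
print(f"End-to-end accuracy: {overall:.3f}")  # three 90% stages yield about 73%
```

Under these assumed numbers, three stages that are each right nine times out of ten produce a correct result only about 73 percent of the time, which is one way to see why the team leans on the human user to catch mistakes.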
The Transonics system runs on a laptop computer using the Linux operating system. Doctor and patient both wear headphones with microphones attached. A small keypad connected to the computer speeds and simplifies certain routine commands – switching from doctor mode to patient mode, for example.
When a doctor asks a question, the speech recognition software captures it – but hedges its bets by displaying not just its best guess about what was said, but a range of options. The doctor chooses the most appropriate one (some of the most used phrases can be put in a quick-access “ready menu”), and the result is a spoken Persian question in the earphones of the patient.
The same process takes place in the reverse direction.
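The confirm-then-translate flow described above can be sketched in a few lines of code. This is a minimal illustration, not the Transonics implementation: the `Hypothesis` class, the function names, the confidence scores and the stand-in translator are all hypothetical, and a real system would replace them with a speech recognizer, a machine translation engine and a speech synthesizer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    score: float  # recognizer confidence (hypothetical scale, higher is better)


def top_options(hypotheses: List[Hypothesis], n: int = 3) -> List[str]:
    """Return the n highest-scoring guesses, best first, for display to the speaker."""
    ranked = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    return [h.text for h in ranked[:n]]


def confirm_and_translate(
    hypotheses: List[Hypothesis],
    choose: Callable[[List[str]], int],
    translate: Callable[[str], str],
    n: int = 3,
) -> str:
    """Show the options, let the speaker pick one by index, then translate the pick."""
    options = top_options(hypotheses, n)
    chosen = options[choose(options)]
    return translate(chosen)


# Toy example: made-up recognizer output and a placeholder "translator"
# that only tags the text instead of producing real Persian.
guesses = [
    Hypothesis("do you have a fear", 0.30),
    Hypothesis("do you have a fever", 0.60),
    Hypothesis("do you have fever", 0.10),
]
result = confirm_and_translate(
    guesses,
    choose=lambda options: 0,           # the speaker accepts the top guess
    translate=lambda s: f"[fa] {s}",    # stand-in for a real translation engine
)
print(result)
```

Keeping the human choice as a separate step means a recognition error can be caught before it is translated and spoken, which matches the article’s description of the system hedging its bets.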
Narayanan says much of the success of the interface grows directly out of analysis of a large database of some 300 English-speaking-doctor/Persian-speaking-patient dialogs created by USC medical students and Iranian-heritage USC students and Los Angeles residents. “Rather than imagining what people might say, we analyzed what people did say,” he explained, adding that recordings of the encounters were used to train and tune the system.
USC linguistics Ph.D. candidate Shadi Ganjavi played a vital role in setting up these encounters, said Narayanan. “We are grateful to her and to the large Persian-speaking community in Los Angeles.”
The system contains about 23,000 English and 9,000 Persian words, a disproportion that exists because relatively little has so far been done in machine translation of Persian (a language also often called Farsi), either written or spoken. “In addition to our progress in the general problem of the interface,” says Narayanan, “we are also contributing to the specific problems posed by translating between English and Farsi.”
Image caption: Opening menu of the system gives users instructions.
For Narayanan, one of the striking things that have emerged so far is the dependence of the system, in its current state, on the ability of users to recognize its limits and weaknesses, and work within them.
The team has created an elaborate user manual, and as with any system, reading the manual improves performance a great deal. And common sense is critical. Narayanan ruefully describes an interaction, labeled a failure in follow-up questioning by both ‘doctor’ and ‘patient,’ that foundered because both expected the system to translate the name “Excedrin.” The drug name wasn’t in the system. It’s the same in both languages, and both sides of the interaction understood it when they heard the other pronounce the word. But rather than just moving on, both stubbornly kept trying to enter it into the system – which kept rejecting it.
“We learn from things like this,” said Narayanan. He and his colleague Georgiou estimate that if the system were tagged with the familiar release number decimal system, the system would be at “three point something” – it has gone through three radical reconstructions in its three years of development so far.
Image caption: Transonics interface displays possible message or messages captured from doctor’s speech. The doctor can choose the one he wants, and the machine will pronounce a Persian translation. Right-hand column stores heavily used questions.
More will come. The system is in a continuing process of upgrading and improvement. Simultaneously with the presentation at the ACL conference, user testing was in progress at a military facility.
In addition to the researchers and institutions already named, Malibu, California-based HRL Laboratories works with USC on the project. HRL personnel involved include USC alumni Dr. Robert Belvin and Howard Neely, and Cheryl Hein. Usability testing and interface design contributions by Scott Millward, a postdoctoral scientist at IMSC, have played a key role. Additionally, four USC electrical engineering graduate students have made large contributions: Emil Ettellaie, Dagen Wang, Ananthakrishnan Shankar, Murtaza Bulut, and Sudeep Ghande, the presenter of the paper at ACL.
More information on the system, including a video demonstration, is available at http://sail.usc.edu/transonics
Published on June 25th, 2005
Last updated on August 9th, 2021