Your voice identifies you as uniquely as your eyes and fingerprints do. You can now use it to unlock your mobile device, turn on your car and refill your fridge. Naturally, you’d want the A.I.-powered technologies in chatty assistants like Alexa, Google Assistant, and Siri to be even more trustworthy than a thumb scanner.
Not too long ago, the thought of an imposter running around with your voice sounded like something that could only happen to The Little Mermaid. But when a computer cloned the voice of late celebrity chef Anthony Bourdain in a documentary film released in 2021, and no one noticed, the world suddenly woke up to the reality of voice fakery.
When it comes to voice-controlled devices, an attack can make “turn on the lights” translate into “turn on the fire alarm.” The same tactics could also be used to fake news stories and deceive the voice recognition systems used by banks.
The Signal Analysis and Interpretation Laboratory (SAIL) at USC Viterbi is making breakthroughs in this new art of talking to our technology and having it seamlessly interact with us. The research team, led by Shrikanth Narayanan, University Professor and Niki & Max Nikias Chair in Engineering, is deepening our understanding of how vulnerable our smart technologies are to deepfakes while developing measures to make them more trusted and secure.
Their findings, published in Computer Speech and Language in February 2021, uncovered novel “Adversarial attack and defense strategies for deep speaker recognition systems.” The group revealed that speech audio can be attacked directly “over the air,” and discovered that the strongest attacks could reduce the accuracy of a speaker recognition system from 94% to as low as 0%. They also report several possible countermeasures against adversarial attacks, laying a strong foundation for future investigations of attack and defense strategies for speech technology systems.
“We’re surrounded by smart speakers that are continuously listening to us and learning,” said the paper’s main author Arindam Jati, Ph.D. ’21 electrical and computer engineering, currently a research scientist at IBM. “It’s possible for an attacker to play an audio recording imperceptible to human ears and by mixing this noise with our voice change the neural network predictions by a large margin.”
“Changing neural network predictions” means changing the outcome of what a machine is meant to learn from listening to our voice. Machine learning relies on learning patterns and categories such as who a person is (speaker recognition) and what they are saying (speech recognition). These patterns are mapped from the audio data gathered. The maps can then be used to create security features like voice biometrics that decode human speech. The same process can be used to clone voices, a useful innovation in video games and virtual assistants.
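To make the idea of a small change flipping a prediction concrete, here is a minimal sketch. It is not the model or the attack from the paper: the two-speaker linear scorer, the names `alice` and `bob`, the feature values and the `epsilon` budget are all invented for illustration. The perturbation follows the fast-gradient-sign idea from the general adversarial-example literature, in which every input value is nudged by the same small amount in whichever direction favors the wrong answer.

```python
# Toy illustration only: a linear "speaker scorer", not a real recognition model.
# All names and numbers below are hypothetical.

def scores(weights, features):
    """Return one score per enrolled speaker (dot product of weights and features)."""
    return {name: sum(w * x for w, x in zip(ws, features))
            for name, ws in weights.items()}

def predict(weights, features):
    """Pick the speaker with the highest score."""
    result = scores(weights, features)
    return max(result, key=result.get)

def sign(value):
    return (value > 0) - (value < 0)

def sign_perturb(weights, features, true_name, rival_name, epsilon):
    """Shift each feature by at most epsilon in the direction that helps the rival.

    For a linear scorer, the gradient of (rival score - true score) with
    respect to the features is the difference of the two weight vectors.
    """
    grad = [r - t for r, t in zip(weights[rival_name], weights[true_name])]
    return [x + epsilon * sign(g) for x, g in zip(features, grad)]

WEIGHTS = {
    "alice": [0.9, -0.2, 0.4, 0.1],
    "bob": [0.7, 0.1, 0.3, 0.3],
}
clean = [1.0, 0.5, 0.8, 0.2]  # features extracted from Alice's voice (made up)
adversarial = sign_perturb(WEIGHTS, clean, "alice", "bob", epsilon=0.15)

print(predict(WEIGHTS, clean))        # prediction on the clean input
print(predict(WEIGHTS, adversarial))  # prediction after the small perturbation
```

In this toy setup the clean input is attributed to `alice` and the perturbed one to `bob`, even though no feature moved by more than `epsilon`. Real speaker recognition networks are far more complex, but the underlying weakness is of the same kind.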
A noisy world filled with listening devices
One of the biggest advances in human speech and language technologies is just how little raw data is needed to analyze and recreate speech. Systems once required dozens or even hundreds of hours of audio. Today we can clone a voice from just minutes of content.
“We’ve also gotten better at removing noise,” Jati said.
If you want to improve the performance of an Android device during Google voice searches, you want it to recognize its learning target quickly, filtering out variations in vocal intonation and language as well as external noise.
Speaking of which, Google reports that 27% of the global online population is now using voice search on their mobile devices. That’s over a billion people whispering sweet nothings into their devices, and the trend is growing. It has spurred growth in the speech recognition tech market, which is expected to be worth nearly $32 billion by 2025.
“In a malicious attack, if you wanted the voice not to be recognized, you’d change the data by adding noise in ways that are imperceptible,” Narayanan said. “You change the algorithm slightly so that it will make a different inference than what it’s supposed to.”
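One rough way to put a number on how quiet added noise is relative to a voice is the signal-to-noise ratio (SNR), defined as 10·log10 of signal power over noise power, in decibels. The sketch below is only an illustration under stated assumptions: the 220 Hz test tone stands in for speech, the alternating ±0.01 pattern stands in for an attacker's noise, and the helper name `snr_db` is made up. The article does not say how the researchers measured perceptibility, and whether a given SNR is actually inaudible depends on the listener and the audio.

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

SAMPLE_RATE = 16000  # samples per second (a common rate for speech audio)

# One second of a 220 Hz tone, used here as a stand-in for a voice recording.
tone = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# A hypothetical low-amplitude perturbation: +0.01 / -0.01 on alternating samples.
noise = [0.01 if n % 2 == 0 else -0.01 for n in range(SAMPLE_RATE)]

# What a microphone would pick up: the voice with the noise mixed in.
mixed = [s + d for s, d in zip(tone, noise)]

print(round(snr_db(tone, noise), 1))  # SNR of the mix, in dB
```

A higher SNR means the added noise is weaker relative to the voice. Here the perturbation never changes a sample by more than 0.01, against a tone that swings between -1 and 1.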
Behavioral dimensions and blind spots
There’s also a behavioral dimension to all of this because emotions can play an important role in voice recognition, something Narayanan and his lab are constantly exploring in the field of behavioral biometrics. When we get excited, nervous, or scared, our vocal cords tense up, creating that higher pitch we often hear in our voice when we’re stressed or excited. “Training technology to be a better listener also means teaching it to pick up on these behavioral cues,” Narayanan said.
“It’s often a game of cat and mouse between attackers and defenders,” said study co-author Raghuveer Peri, a Ph.D. candidate in electrical and computer engineering and a Summer 2021 applied scientist intern in Amazon’s Alexa group. “We’re discovering the blind spots in the learning paradigm of smart audio technologies and we as defenders try to figure out ways to compensate for those blind spots. Because learning models share some of the same architecture you don’t even need to have access to the model itself. You can devise your attacks on entirely different models on your own laptop and then transfer them to an Amazon Alexa device and that has huge implications.”
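The transfer idea Peri describes can be sketched with two toy models. Everything here is an assumption made for illustration: `SURROGATE` plays the model an attacker can inspect on a laptop, `TARGET` plays a separately built model the attacker never sees, and both are simple linear scorers with similar but not identical weights. Nothing below reflects how any commercial device actually works.

```python
# Toy illustration of a transfer attack; all weights and inputs are hypothetical.

def margin(weights, features):
    """Positive margin: the model accepts the true speaker. Negative: it does not."""
    return sum(w * x for w, x in zip(weights, features))

def sign(value):
    return (value > 0) - (value < 0)

def craft_on_surrogate(surrogate, features, epsilon):
    """Lower the surrogate's margin by moving each feature epsilon against its weight."""
    return [x - epsilon * sign(w) for x, w in zip(features, surrogate)]

SURROGATE = [0.20, -0.30, 0.10, -0.20]  # the attacker's own model
TARGET = [0.25, -0.25, 0.05, -0.15]     # the victim's model, never inspected

clean = [1.0, 0.5, 0.8, 0.2]

# The perturbation is computed from SURROGATE only.
adversarial = craft_on_surrogate(SURROGATE, clean, epsilon=0.25)

print(margin(TARGET, clean) > 0)        # does the target accept the clean input?
print(margin(TARGET, adversarial) > 0)  # does it still accept the perturbed input?
```

In this sketch the target model accepts the clean input and rejects the perturbed one, although its weights were never used to build the attack. The perturbation carries over only because the two toy models resemble each other, which mirrors the shared-architecture point in the quote.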
The paper was co-authored by Arindam Jati, Chin-Cheng Hsu, Monisankha Pal, Raghuveer Peri, Wael Abd Almageed and Shrikanth Narayanan. The research is supported by the Department of Defense.