Researchers Develop Tool to Detect Replay Attacks Against Voice Assistants
Amazon’s Echo Dot voice assistant.
Three years ago, the fast-food chain Burger King launched a clever but ultimately misguided ad campaign.
It started as a normal, 15-second advertisement. But seeded at the end of the ad was the phrase: “OK Google. What is the Whopper burger?”
The question triggered Google Home and Amazon Echo devices within earshot to search for the Whopper’s Wikipedia entry and read it aloud, an entry Burger King appeared to have edited for maximum marketing effect.
Wikipedia pages are open for editing, so The Verge and other publications made their own changes to the entry, some to humorous effect. For a short time, the Wikipedia entry read that a Whopper is made from “100 percent medium-size child.”
The stunt didn’t last long. Google adjusted Home to stop reading the Wikipedia page.
Burger King’s stunt is among the more benign examples of how voice-controlled assistants can be manipulated. But the devices raise a range of privacy and security concerns, given that they can be used to make purchases, unlock front doors and access bank accounts.
Voice Liveness Detection
Manufacturers have taken steps to ensure strangers don’t manipulate someone’s voice assistant. Google’s Voice Match ties a user’s Google account to the user’s voice, and Amazon’s Alexa allows users to set up their own voice profiles. Apple’s Siri can be trained to recognize voices.
But what if someone records your voice and then replays it to a device?
A group of researchers with Samsung Research and Australia’s Commonwealth Scientific and Industrial Research Organization, or CSIRO, have developed a system called Void – short for Voice liveness Detection – to prevent voice-spoofing attacks. A research paper describing Void will be presented at the USENIX Security Symposium in Boston in August.
The idea is to quickly detect whether a command given to a device is live or is prerecorded. It’s a tricky proposition given that a recorded voice has characteristics similar to a live one.
“Such attacks are known as one of the easiest to perform as it simply involves recording a victim’s voice,” says Hyoungshick Kim, a visiting scientist to CSIRO. “This means that not only is it easy to get away with such an attack, it’s also very difficult for a victim to work out what’s happened.”
The impacts can range from using someone else’s credit card details to make purchases, to controlling connected devices such as smart appliances, to accessing personal information such as home addresses and financial data, he says.
Other research teams have tackled the voice-spoofing problem as well. In 2017, 49 teams submitted entries to the ASVspoof 2017 Challenge, a project aimed at developing countermeasures against automatic speaker verification spoofing. The competition produced one technology with a notably lower error rate than the rest, but it was computationally expensive and complex, according to Void’s research paper.
Another system, VAuth, is designed to be integrated into wearables such as eyeglasses, earbuds or necklaces. VAuth has a low false-positive rate, but it requires the user to keep an additional device or wearable nearby for the protection to work.
Refining a voice-spoofing defense is therefore a balancing act: the system needs to work quickly, impose low overhead on the device and, crucially, keep the false-positive rate low so that legitimate users speaking live aren’t locked out.
Quick, Low Overhead
Void examines 97 spectrogram features: characteristics of how a voice looks when its frequencies are visually mapped over time. Significant differences emerge when live voices are compared with recorded ones; played-back voices carry distortions introduced by the loudspeakers that replay them, the researchers write.
With live voices, moreover, the sum of power observed across lower frequencies is higher than the sum observed across higher frequencies, they write.
“As a result, there are significant differences in the cumulative power distributions between live-human voices and those replayed through loudspeakers,” according to their paper. “Void extracts those differences as classification features to accurately detect replay attacks.”
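To make the idea concrete, here is a minimal sketch of a single power-distribution feature of the kind the paper describes: comparing cumulative spectral power below and above a cutoff frequency. This is an illustration only, not Void’s actual 97-feature pipeline; the function name and the 2 kHz cutoff are assumptions chosen for the example.

```python
import numpy as np

def low_high_power_ratio(signal, sample_rate, cutoff_hz=2000.0):
    """Ratio of cumulative spectral power below vs. above a cutoff.

    Live speech tends to concentrate power in lower frequencies,
    while loudspeaker playback distorts that balance. This is one
    illustrative feature, not Void's full pipeline; cutoff_hz is
    an assumed value for demonstration.
    """
    # Power spectrum of the real-valued audio signal
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

    low_power = spectrum[freqs < cutoff_hz].sum()
    high_power = spectrum[freqs >= cutoff_hz].sum()
    return low_power / (high_power + 1e-12)  # avoid division by zero

# Toy check with synthetic tones: energy at 300 Hz stands in for a
# "live-like" low-frequency-heavy signal, energy at 5 kHz for a
# distorted "replay-like" one.
sr = 16000
t = np.arange(sr) / sr
live_like = np.sin(2 * np.pi * 300 * t)
replay_like = np.sin(2 * np.pi * 5000 * t)

print(low_high_power_ratio(live_like, sr) > 1.0)    # True
print(low_high_power_ratio(replay_like, sr) < 1.0)  # True
```

A real detector would feed many such features, computed over short spectrogram frames, into a trained classifier rather than thresholding a single ratio.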
Void performs well, the researchers say. It can analyze voices eight times faster than the top-performing deep-learning software and uses 153 times less memory to detect spoofs; in one example, Void used just 1.98 MB of memory. Its error rate was around 8.7 percent.
But the error rate is still too high. And Kim says that real-world error rates are likely to be higher because audio signals are affected by various environmental factors.
“Therefore, it is necessary to improve the accuracy of voice liveness verification prior to commercialization and to ensure that we provide a well-designed user experience so that consumers can manually deal with inaccuracies,” he says.
CSIRO and Samsung Research don’t have partners yet to test the technology, and further research will be needed before it can be commercialized.
“However, we would like to see Void embedded into smart devices and applications to safeguard the security and privacy of consumers once this further research is complete,” Kim says.