IT Security
How voice assistants follow inaudible commands
What sounds like a harmless piece of music to the human ear might be understood by a machine to be an instruction to perform a specific task.
Today, Alexa, Siri and the like are much better at understanding what humans tell them than they were in the early days of speech recognition technology. Sometimes they even understand things that humans can’t hear: a security gap of which the IT experts at the Horst Görtz Institute for IT Security (HGI) in Bochum are well aware. They have successfully hidden commands for voice assistants in various audio signals, for example in music, speech or birdsong. As long as such attacks are only used for research purposes, they pose no threat. A malicious attacker, however, could use this method to manipulate a song played on the radio, inserting a command to buy a certain product or even to take control of a voice-controlled home automation system.
The technical term for such attacks is adversarial examples. Lea Schönherr from the HGI research group Cognitive Signal Processing is developing them for her PhD project in the team headed by Professor Dorothea Kolossa. “We take advantage of the psychoacoustic model of hearing,” explains Schönherr. While the ear is busy processing a sound at a specific frequency, humans are unable to perceive other, quieter sounds at that frequency for a few milliseconds. It is in these masked regions that the researchers hide the secret commands for machines. To the human ear, the added information sounds like random static noise, but it changes the meaning of the message for the voice assistant: the human hears message A, the machine understands message B.
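To illustrate the principle, the sketch below shows, in simplified and hypothetical form, how a perturbation might be kept below a precomputed psychoacoustic masking threshold before it is mixed into the carrier signal. The masking_threshold_db array and the helper name are assumptions made for illustration; this is not the researchers’ actual tooling, and the perturbation itself would additionally have to be optimised so that the recognizer decodes the target transcription.

```python
# Minimal, illustrative sketch of psychoacoustic hiding (not the HGI code).
# Assumption: masking_threshold_db is a precomputed array with one value per
# time-frequency bin of the carrier, obtained from a psychoacoustic model.
import numpy as np
from scipy.signal import stft, istft

def hide_below_threshold(carrier, perturbation, masking_threshold_db, fs=16000):
    """Attenuate a perturbation so its magnitude stays below the carrier's
    masking threshold in every time-frequency bin, then mix it in."""
    _, _, P = stft(perturbation, fs=fs, nperseg=512)
    mag_db = 20 * np.log10(np.abs(P) + 1e-12)
    excess_db = np.maximum(mag_db - masking_threshold_db, 0.0)  # dB above the threshold
    P_capped = P * 10 ** (-excess_db / 20)                      # pull those bins back down
    _, p_capped = istft(P_capped, fs=fs, nperseg=512)
    n = min(len(carrier), len(p_capped))
    return carrier[:n] + p_capped[:n]
```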
Attacks performed via loudspeaker
Lea Schönherr tested her attacks using the speech recognition software Kaldi, an open-source system that is integrated into Amazon’s Alexa as well as many other voice assistants. She hid inaudible commands in various audio signals and checked what information Kaldi decoded. The speech recognition system did indeed consistently understand the secret commands.
At first, this attack couldn’t be carried out over the air; rather, Lea Schönherr had to feed the manipulated audio files directly into Kaldi. Today, the secret messages get through even when the researcher plays the audio signal to the speech recognition system through a loudspeaker. “This is more complicated,” she points out, “because the sound is affected by the room in which the file is played.” A piece of music sounds different in a cinema than it does over car loudspeakers. The size of the room, the material of the walls and the position of the loudspeaker in the room all play a role.
Taking the room into consideration
Lea Schönherr must take all these parameters into consideration if she wants to generate an audio file that a voice assistant will understand in a specific room. The so-called room impulse response helps here: it describes how a room reflects and alters sound. “When we know in which room an attack is to take place, we can simulate the room impulse response using dedicated computer programs and take the room’s properties into consideration when generating the manipulated audio file,” explains Lea Schönherr. The researcher has already demonstrated that this approach works: in a test room at RUB, Kaldi decoded secret messages that she had concealed in various audio signals.
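As a rough illustration of this step, the sketch below simulates what a microphone in the room would receive by convolving the manipulated audio with a room impulse response. A synthetic, exponentially decaying noise tail stands in here for a measured or simulated impulse response; the function names are purely illustrative and do not reflect the researchers’ software.

```python
# Illustrative sketch of simulating over-the-air playback in a room.
import numpy as np
from scipy.signal import fftconvolve

def synthetic_rir(fs=16000, rt60=0.4):
    """Hypothetical stand-in for a simulated room impulse response:
    a direct path followed by an exponentially decaying noise tail."""
    n = int(rt60 * fs)
    decay = np.exp(-6.9 * np.arange(n) / n)   # roughly 60 dB of decay over rt60 seconds
    rir = np.random.randn(n) * decay
    rir[0] = 1.0                               # direct path
    return rir / np.max(np.abs(rir))

def play_in_room(adversarial_audio, rir):
    """Approximate the signal that arrives at the microphone."""
    received = fftconvolve(adversarial_audio, rir)[: len(adversarial_audio)]
    return received / np.max(np.abs(received))
```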
“The attack can be tailored to a specific room setup in which it is played,” elaborates the communication engineer. “However, we have recently performed a generic attack, which does not need any prior information about the room but still works equally well or even better over the air.” In the future, the researchers plan to run tests with voice assistants available on the market.
Closing the security gap
Since speech recognition systems aren’t currently deployed in any safety-critical applications but are mainly used for convenience, adversarial examples cannot do a lot of damage yet. Therefore, there’s still time to close the security gap, according to the researchers from HGI in Bochum. In the Cluster of Excellence CASA, short for Cyber Security in the Age of Large-Scale Adversaries, the research group Cognitive Signal Processing, which developed the attacks, collaborates with the Chair for System Security headed by Professor Thorsten Holz, whose team is designing the countermeasures.
The IT security researchers intend to teach the speech recognition system to eliminate any ranges in the audio signals that are inaudible to humans and to hear only the rest. “Essentially, the recognition is meant to work rather like the human ear, rendering it more difficult to conceal secret messages in audio files,” says Thorsten Eisenhofer, who is researching the security of intelligent systems for his PhD project. The researchers cannot prevent attackers from manipulating audio files. But if those manipulations have to be placed in frequencies that humans can hear, because the speech recognition system weeds out the rest, the attacks can no longer be hidden so easily. “Accordingly, we want humans to be able to hear that something is wrong with an audio file,” says the researcher. “In the best-case scenario, an attacker would have to manipulate the audio file to such an extent that it would sound more like the hidden message than like its original content.”
MP3 principle as countermeasure
The idea is this: if the speech recognition system eliminates everything that is inaudible to humans, attackers have to place their commands in the audible range. To put this into practice, Thorsten Eisenhofer utilises the MP3 principle.
MP3 files are compressed by deleting ranges that are inaudible to humans – and this is exactly what the defence strategy against adversarial examples exploits. Eisenhofer therefore combined Kaldi with an MP3 encoder that cleans up the audio files before they reach the speech recognition system. Tests showed that Kaldi indeed no longer understood the secret messages unless they were moved into the audible range. “At this point, the audio files were considerably changed,” explains Thorsten Eisenhofer. “The static in which the secret commands are hidden could be distinctly heard.”
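A minimal sketch of such a preprocessing step might look as follows, assuming ffmpeg is installed and a transcribe() wrapper around the recognizer exists; neither the file names nor the helper reflect Eisenhofer’s actual setup.

```python
# Sketch of an MP3 round-trip as a preprocessing step before recognition.
# Assumes ffmpeg is available on the PATH; transcribe() is a hypothetical
# wrapper around the speech recognizer.
import os
import subprocess
import tempfile

def mp3_roundtrip(wav_in, bitrate="128k"):
    """Compress a WAV file to MP3 and decode it back to WAV, discarding
    components that the MP3 psychoacoustic model deems inaudible."""
    tmp_mp3 = tempfile.mktemp(suffix=".mp3")
    wav_out = tempfile.mktemp(suffix=".wav")
    subprocess.run(["ffmpeg", "-y", "-i", wav_in, "-b:a", bitrate, tmp_mp3],
                   check=True, capture_output=True)
    subprocess.run(["ffmpeg", "-y", "-i", tmp_mp3, wav_out],
                   check=True, capture_output=True)
    os.remove(tmp_mp3)
    return wav_out

# Usage with a hypothetical recognizer wrapper:
# text = transcribe(mp3_roundtrip("suspicious_audio.wav"))
```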
Despite the MP3 clean-up, Kaldi’s speech recognition performance remained as good as on files that weren’t cleaned up, but only if the system was trained with MP3-compressed files. “Inside Kaldi, a machine-learning model is at work,” explains Thorsten Eisenhofer. This model is trained on numerous audio files as learning material in order to learn how to interpret the meaning of audio signals. Only if Kaldi has been trained on MP3-compressed files will it be able to understand them later.
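Conceptually, the training-side change amounts to passing the training material through the same MP3 round-trip before the model ever sees it. The sketch below assumes the hypothetical mp3_roundtrip() helper from the previous example and a flat directory of WAV files; it is not the actual Kaldi training recipe.

```python
# Sketch: prepare MP3-compressed training material so the recognizer is
# trained on the same kind of audio it will see after the MP3 clean-up.
import glob
import os
import shutil

def build_mp3_training_set(wav_dir, out_dir):
    """Run every training utterance through the MP3 round-trip and store
    the cleaned copies under the same file names in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for wav_path in glob.glob(os.path.join(wav_dir, "*.wav")):
        cleaned = mp3_roundtrip(wav_path)  # hypothetical helper, see above
        shutil.move(cleaned, os.path.join(out_dir, os.path.basename(wav_path)))

# Example (hypothetical directory layout):
# build_mp3_training_set("data/train_wavs", "data/train_wavs_mp3")
```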
Using this training approach, Thorsten Eisenhofer taught the speech recognition system to understand everything it was supposed to understand – and nothing more.