Voice assistants such as Alexa contain the speech recognition software Kaldi. In it, researchers from Bochum have detected a security gap.

The researchers manipulate audio files so that machines understand a completely different message than humans.

They can conceal secret messages for voice assistants in any audio file, including speech, music and ambient noise such as birdsong.


IT Security
How voice assistants follow inaudible commands

What sounds like a harmless piece of music to the human ear might be understood by a machine to be an instruction to perform a specific task.

Today, Alexa, Siri and Co. are much better at understanding what humans tell them than they were in the early days of speech recognition technology. Sometimes, they even understand things that humans can’t hear: a security gap of which the IT experts at the Horst Görtz Institute for IT Security (HGI) in Bochum are well aware. They have successfully hidden commands for voice assistants in various audio signals, for example in music, speech or birdsong. As long as such attacks are only used for research purposes, they pose no threat. A malicious attacker, however, could use this method to manipulate a song played on the radio by inserting a command to buy a certain product or even to take control of a voice-controlled home automation system.

Lea Schönherr develops new attacks against the speech recognition software Kaldi in order to expose security gaps. Thorsten Eisenhofer develops relevant countermeasures.

The technical term for such attacks is adversarial examples. Lea Schönherr from the HGI research group Cognitive Signal Processing is developing them for her PhD project in the team headed by Professor Dorothea Kolossa. “We take advantage of the psychoacoustic model of hearing,” explains Schönherr. While the ear is busy processing a sound at a specific frequency, humans are unable to hear other, quieter sounds for a few milliseconds. These masked ranges are where the researchers hide the secret commands for machines. To the human ear, the additional information sounds like random static noise. But it changes the meaning of the message for the voice assistant: the human hears message A, the machine understands message B.
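The masking idea can be illustrated with a minimal Python sketch. It is not the researchers' method: the function name `hide_below_threshold`, the 20 dB margin and the smoothed-spectrum threshold are assumptions made for illustration, standing in for a real psychoacoustic model of hearing.

```python
import numpy as np

def hide_below_threshold(carrier, perturbation, margin_db=20.0, smooth_bins=9):
    """Scale each frequency bin of `perturbation` so that it stays below a
    crude masking threshold derived from `carrier` (illustrative only)."""
    n = len(carrier)
    carrier_spec = np.fft.rfft(carrier)
    pert_spec = np.fft.rfft(perturbation)

    # Assumed threshold: the carrier's magnitude spectrum, smoothed over
    # neighbouring bins and lowered by `margin_db` decibels.
    kernel = np.ones(smooth_bins) / smooth_bins
    smoothed = np.convolve(np.abs(carrier_spec), kernel, mode="same")
    threshold = smoothed * 10 ** (-margin_db / 20)

    # Attenuate (never amplify) each bin so its magnitude is at most the threshold.
    magnitude = np.abs(pert_spec)
    scale = np.minimum(1.0, threshold / np.maximum(magnitude, 1e-12))
    shaped = np.fft.irfft(pert_spec * scale, n=n)
    return shaped, threshold

# Usage: a pure tone as the audible signal, white noise as the hidden content.
n = 1024
k = np.arange(n)
carrier = np.sin(2 * np.pi * 100 * k / n)
noise = np.random.default_rng(0).standard_normal(n)
shaped, threshold = hide_below_threshold(carrier, noise)
manipulated = carrier + shaped
```

In this toy setting the added noise survives only close to the loud tone, where the assumed threshold is high, and is suppressed everywhere else.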

Attacks performed via loudspeaker

Lea Schönherr tested her attacks using the speech recognition software Kaldi, an open-source system that is integrated in Amazon’s Alexa as well as many other voice assistants. She hid inaudible commands in different audio signals and monitored which information Kaldi decoded. The speech recognition system did indeed consistently understand the secret commands.

In the past, the attacks could only be carried out if the manipulated files were input directly as data into the speech recognition software. Now, they work even when the audio files are played on loudspeakers.

At first, this attack couldn’t be carried out over the air; rather, Lea Schönherr had to feed the manipulated audio files directly into Kaldi. Today, the secret messages are received even if the researcher uses a loudspeaker to play the audio signal to the speech recognition system. “This is more complicated,” she points out, “because the sound is affected by the room in which the file is played.” A piece of music sounds different in a cinema than it does on car loudspeakers. The size of the room, the material of the walls and the position of the loudspeaker in the room all play a role.

Taking the room into consideration

Lea Schönherr must take all these parameters into consideration if she wants to generate an audio file that a voice assistant will understand in a specific room. The so-called room impulse response helps. It describes how a room reflects and changes the sound. “When we know in which room an attack is to take place, we can simulate the room impulse response using dedicated computer programs and take the room’s properties into consideration when generating the manipulated audio file,” explains Lea Schönherr. The researcher has already demonstrated that this approach works. In a test room at RUB, Kaldi decoded secret messages that the researcher had concealed in different audio signals.
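How a room impulse response changes a signal can be sketched in a few lines of Python: the sound arriving at the microphone is modelled as the convolution of the played audio with the impulse response. The sketch below is an assumption-laden illustration, not the simulation software the researchers used. The helper names `synthetic_rir` and `play_in_room`, the reverberation time of 0.4 seconds and the decaying-noise model of the room are all hypothetical.

```python
import numpy as np

def synthetic_rir(sample_rate=16000, rt60=0.4, length_s=0.5, seed=0):
    """Toy room impulse response: a direct path followed by noise whose
    amplitude decays by 60 dB after `rt60` seconds (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = int(sample_rate * length_s)
    t = np.arange(n) / sample_rate
    decay = 10 ** (-3 * t / rt60)   # 60 dB amplitude drop at t = rt60
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0                    # direct sound from loudspeaker to microphone
    return rir / np.max(np.abs(rir))

def play_in_room(audio, rir):
    """Model over-the-air playback as convolution with the impulse response."""
    return np.convolve(audio, rir)

# Usage: what a recogniser would "hear" when a short signal is played in the room.
audio = np.random.default_rng(1).standard_normal(1600)
heard = play_in_room(audio, synthetic_rir())
```

An attack tailored to a room would, in this picture, optimise the audio file so that the convolved signal `heard`, rather than the original file, carries the hidden command.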


“The attack can be tailored to a specific room setup in which it is played,” elaborates the communication engineer. “However, we have recently performed a generic attack, which does not need any prior information about the room, but still works equally well or even better over the air.” In the future, the researchers are planning to run tests with voice assistants available on the market.

Closing the security gap

Since speech recognition systems aren’t currently deployed in any safety-critical applications but are mainly used for convenience, adversarial examples cannot do a lot of damage yet. Therefore, there’s still time to close the security gap, according to the researchers from HGI in Bochum. In the Cluster of Excellence Casa, short for Cyber Security in the Age of Large-Scale Adversaries, the research group Cognitive Signal Processing, which developed the attacks, collaborates with the Chair for System Security headed by Professor Thorsten Holz, whose team is designing the countermeasures.

The IT security researchers intend to teach the speech recognition system to eliminate any ranges in the audio signals that are inaudible to humans and to hear only the rest. “Essentially, the recognition is meant to work rather like the human ear, rendering it more difficult to conceal secret messages in audio files,” says Thorsten Eisenhofer, who is researching the security of intelligent systems for his PhD project. The researchers cannot prevent attackers from manipulating audio files. But if those manipulations have to be placed in frequencies that humans can hear, because the speech recognition system weeds out the rest, the attacks can no longer be easily hidden. “Accordingly, we want humans to be able to hear that something is wrong with an audio file,” says the researcher. “In the best case scenario, an attacker would have to manipulate the audio file to such an extent that it would sound more like the hidden message than like its original content.”

MP3 principle as countermeasure

The idea is: if the speech recognition system eliminates everything that is inaudible to humans, the attacker would have to position his commands in the audible range. In order to put this into practice, Thorsten Eisenhofer utilises the MP3 principle.

Accordingly, the defence measures aim at revealing the secret messages by rendering them audible to humans.

MP3 files are compressed by deleting any ranges that are inaudible to humans – and this is what the defence strategy against adversarial examples is aiming at. Consequently, Eisenhofer combined Kaldi with an MP3 encoder that cleans up the audio files before they reach the speech recognition system. The tests have shown that Kaldi did indeed no longer understand the secret messages, unless they were moved into the human hearing range. “At this point, the audio files were considerably changed,” explains Thorsten Eisenhofer. “The static in which the secret commands are hidden could be distinctly heard.”
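The principle behind this countermeasure can be sketched in Python. The code below is not an MP3 encoder and not the setup used with Kaldi. It is a simplified stand-in that discards spectral components lying below a crude masking threshold before the audio would be passed on to a recogniser. The function name `drop_masked_components`, the 20 dB margin and the smoothed-spectrum threshold are assumptions made for illustration.

```python
import numpy as np

def drop_masked_components(frame, margin_db=20.0, smooth_bins=9):
    """Zero every frequency bin whose magnitude lies below an assumed masking
    threshold (smoothed spectrum minus `margin_db`); illustrative only."""
    spectrum = np.fft.rfft(frame)
    kernel = np.ones(smooth_bins) / smooth_bins
    smoothed = np.convolve(np.abs(spectrum), kernel, mode="same")
    threshold = smoothed * 10 ** (-margin_db / 20)

    # Keep only the components a listener could plausibly hear.
    audible = np.abs(spectrum) >= threshold
    return np.fft.irfft(np.where(audible, spectrum, 0.0), n=len(frame))

# Usage: a loud tone (bin 100) with a much quieter tone hidden next to it (bin 103).
n = 1024
k = np.arange(n)
loud = np.sin(2 * np.pi * 100 * k / n)
hidden = 0.001 * np.sin(2 * np.pi * 103 * k / n)
cleaned = drop_masked_components(loud + hidden)
```

After the clean-up, the quiet component next to the loud tone is gone while the tone itself is preserved. An attacker would therefore have to raise the hidden content above the threshold, which in this picture is where it becomes audible.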

Despite the MP3 clean-up, Kaldi’s speech recognition performance remained as good as it was for files that weren’t cleaned up, but only if the system had been trained with MP3-compressed files. “Inside Kaldi, a machine-learning model is at work,” explains Thorsten Eisenhofer. This model is trained using numerous audio files as learning material in order to learn how to interpret the meaning of audio signals. Only if Kaldi has been trained using MP3-compressed files will it be able to understand them later.

Using this training approach, Thorsten Eisenhofer taught the speech recognition system to understand everything it was supposed to understand – and nothing more.

Original publication

Lea Schönherr, Steffen Zeiler, Thorsten Holz, Dorothea Kolossa: Imperio: robust over-the-air adversarial examples for automatic speech recognition systems, 2019, pre-released online
