This voice assistant reacts not only to the trigger word “Amazon” but also to the phrase “and the zone.”

IT security researchers from Bochum

Part of the Bochum-based research team: Thorsten Eisenhofer, Jan Wiele, Lea Schönherr, Maximilian Golla, Dorothea Kolossa (left to right)

The researchers used their setup to analyse eleven different smart speakers, including devices by Amazon, Apple, Google, Microsoft, and Deutsche Telekom.

IT Security
When speech assistants listen even though they shouldn’t

“Alexa,” “Hey Siri,” “OK Google” – voice assistants are supposed to react to these triggers. But other words activate them, too.

Researchers from RUB and the Bochum Max Planck Institute (MPI) for Cyber Security and Privacy have investigated which words inadvertently activate voice assistants. They compiled a list of English, German, and Chinese terms that were repeatedly misinterpreted by various smart speakers as prompts. Whenever the systems wake up, they record a short sequence of what is being said and transmit the data to the manufacturer. The audio snippets are then transcribed and checked by employees of the respective corporation. Thus, fragments of very private conversations can end up in the companies’ systems.

Süddeutsche Zeitung and NDR reported on the results of the analysis on 30 June 2020. Examples yielded by the researchers’ analysis can be found at unacceptable-privacy.github.io.

For the project, Lea Schönherr from the RUB research group Cognitive Signal Processing, headed by Professor Dorothea Kolossa at the RUB Horst Görtz Institute for IT Security (HGI), collaborated with Dr. Maximilian Golla, previously at HGI and now at MPI for Security and Privacy, as well as Jan Wiele and Thorsten Eisenhofer from the HGI Chair for Systems Security, headed by Professor Thorsten Holz.

Testing all major manufacturers

The IT experts tested the voice assistants by Amazon, Apple, Google, Microsoft, and Deutsche Telekom, as well as three Chinese models by Xiaomi, Baidu, and Tencent. They played them hours of English, German, and Chinese audio material, including several seasons of the series “Game of Thrones,” “Modern Family,” and “House of Cards,” as well as news broadcasts. Professional audio data sets that are used to train smart speakers were also included.

Using light sensors, they registered when the indicator LEDs of the speakers lit up.

Each voice assistant was fitted with a light sensor that registered when the speaker’s activity indicator lit up, the visible sign that the device had switched into active mode and that a trigger had occurred. The setup also registered when a voice assistant sent data to the outside. Whenever one of the devices switched to active mode, the researchers recorded which audio sequence had caused it. They later manually evaluated which terms had triggered the assistant.
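As a rough illustration of how such a log could be turned into audio spans for manual review, here is a minimal Python sketch. It is not the researchers’ tooling: the function names, the brightness threshold, and the ten-second lookback are assumptions chosen only for the example.

```python
from typing import List, Tuple


def led_activations(samples: List[Tuple[float, float]],
                    threshold: float = 0.5) -> List[float]:
    """Return the times at which the LED switches from off to on.

    `samples` is a list of (time_in_seconds, brightness) readings from a
    hypothetical light sensor; `threshold` is an assumed cut-off.
    """
    times = []
    was_on = False
    for time_s, brightness in samples:
        is_on = brightness >= threshold
        if is_on and not was_on:  # rising edge: the indicator just lit up
            times.append(time_s)
        was_on = is_on
    return times


def review_windows(activation_times: List[float],
                   lookback_s: float = 10.0) -> List[Tuple[float, float]]:
    """For each activation, return the span of played audio to listen to."""
    return [(max(0.0, t - lookback_s), t) for t in activation_times]
```

Each returned window marks the stretch of played material that someone would then listen to in order to find the word or phrase that woke the device.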

False triggers identified and generated

Based on this data, the team created a list of over 1,000 sequences that incorrectly trigger speech assistants. Depending on the pronunciation, Alexa reacts to the words “unacceptable” and “election,” while Google reacts to “OK, cool.” Siri can be fooled by “a city,” Cortana by “Montana,” Computer by “Peter,” Amazon by “and the zone,” and Echo by “tobacco.”

In order to understand what makes these terms false triggers, the researchers broke the words down into their smallest possible sound units and identified the units that were often confused by the voice assistants. Based on these findings, they generated new trigger words and showed that these terms also activated the voice assistants.
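One simple way to compare words at the level of sound units is an edit distance over phoneme sequences. The Python sketch below illustrates that general idea only; the article does not say which method the researchers used, and the phoneme transcriptions in the usage example are rough, hand-written approximations.

```python
from typing import Sequence


def phoneme_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences.

    Counts the fewest insertions, deletions, and substitutions of single
    sound units needed to turn sequence `a` into sequence `b`.
    """
    prev = list(range(len(b) + 1))
    for i, unit_a in enumerate(a, start=1):
        curr = [i]
        for j, unit_b in enumerate(b, start=1):
            cost = 0 if unit_a == unit_b else 1
            curr.append(min(prev[j] + 1,          # delete unit_a
                            curr[j - 1] + 1,      # insert unit_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Approximate, illustrative transcriptions (assumptions, not study data).
CORTANA = ["K", "AO", "R", "T", "AE", "N", "AH"]
MONTANA = ["M", "AA", "N", "T", "AE", "N", "AH"]

print(phoneme_distance(CORTANA, MONTANA))
```

A small distance relative to the length of the wake word suggests that most of its sound units are preserved, which is one plausible reason a detector might confuse the two.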

“The devices are intentionally programmed in a somewhat forgiving manner, because they are supposed to be able to understand their humans. Therefore, they are more likely to start up once too often rather than not at all,” concludes Dorothea Kolossa.

Audio snippets are analysed in the cloud

The researchers analysed in more detail how the manufacturers evaluate false triggers. A two-stage process is most common. First, the device analyses locally whether the speech it perceives contains a trigger word. If the device suspects that it has heard the trigger word, it begins to upload the current conversation to the manufacturer’s cloud for further analysis with more computing power. If the cloud analysis identifies the term as a false trigger, the voice assistant remains silent and only its indicator LED lights up briefly. Even in this case, several seconds of audio may already have reached the corporation, where they are transcribed by humans in order to avoid such false triggers in the future.
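The two-stage process described here can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not any manufacturer’s implementation: the score, the threshold, and the `cloud_check` callback are hypothetical stand-ins for the local detector and the cloud analysis.

```python
from typing import Callable, NamedTuple


class Outcome(NamedTuple):
    uploaded: bool   # audio left the device
    responded: bool  # assistant actually answered


def two_stage_check(local_score: float,
                    cloud_check: Callable[[], bool],
                    local_threshold: float = 0.5) -> Outcome:
    """Model the local-then-cloud trigger verification.

    `local_score` is the on-device detector's confidence (assumed 0..1).
    `cloud_check` stands in for the more powerful cloud analysis and
    returns True if it confirms the trigger word.
    """
    if local_score < local_threshold:
        # Stage 1 rejects: nothing is uploaded, the device stays idle.
        return Outcome(uploaded=False, responded=False)
    # Stage 1 accepts: the upload starts before the cloud has decided.
    confirmed = cloud_check()
    return Outcome(uploaded=True, responded=confirmed)
```

The case that matters for privacy is `Outcome(uploaded=True, responded=False)`: the cloud rejects the trigger and the assistant stays silent, yet audio has already been transmitted.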

“From a privacy point of view, this is of course alarming, because sometimes very private conversations can end up with strangers,” says Thorsten Holz. “From an engineering point of view, however, this approach is quite understandable, because the systems can only be improved using such data. The manufacturers have to strike a balance between data protection and technical optimisation.”