Siri may improve accuracy by mapping the room like a HomePod does
New research from Apple and Carnegie Mellon University delves into how smart devices could learn about their surroundings to better understand requests by knowing when and where they are being talked to.
Academics from Apple and Carnegie Mellon University's Human-Computer Interaction Institute have published a research paper describing how Siri, and devices such as HomePod, could be improved by having them listen to their surroundings. While many Apple devices listen, they are explicitly waiting to hear the phrase "Hey, Siri," and anything else is ignored.
It's the same with Alexa, or at least it is in theory, but these researchers advocate having smart devices actively listen in order to determine details of their environment — and what people are doing there.
"Listen Learner," they say in their paper, "[is] a technique for activity recognition that gradually learns events specific to a deployed environment while minimizing user burden."
Currently, HomePods automatically adjust their audio output to suit the environment and space that they are in. And Apple has filed patents that would see future HomePods using the position of people in a room to direct audio to them.
The idea behind this paper's research is that similar sensors could listen for sounds and detect where they are coming from. A device could then group those sounds so that, for instance, it recognizes which direction the bleeps from a microwave are coming from. Understanding where someone is standing, and what noises are being heard from which directions, could make Siri better able to understand requests or to volunteer information.
"For example, the system can ask a confirmatory query: 'was that a doorbell?', in which the user responds with a 'yes,'" it continues. "Once a label is established, the system can offer push notifications and other actions whenever the event happens again. This interaction links both physical and digital domains, enabling experiences that could be valuable for users who are e.g., hard of hearing."
While the paper repeatedly and exclusively mentions HomePods, it is really concerned with any device with microphones. It suggests that since we all now have an ever-increasing number of devices that are capable of listening, then we already have tools to improve voice control.
In a video accompanying the paper, the researchers demonstrate how listening like this can improve accuracy, and also how it's more successful than previous attempts to train devices.
The paper, "Automatic Class Discovery and One-Shot Interactions for Acoustic Activity Recognition," proposes that a device be able to listen continuously, although "no raw audio is saved to the device or to the cloud." It keeps doing this, effectively creating labels or tags that are triggered by certain sounds, until it's basically heard enough.
"Eventually, the system becomes confident that an emerging cluster of data is a unique sound, at which point, it prompts [the user] for a label the next time it occurs," explains the paper. "The system asks: 'what sound was that?', and [the user] responds with: 'that is my faucet.' As time goes on, the system can continue to intelligently prompt [the user] for labels, thus slowly building up a library of recognized events."
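To make that loop concrete, here is a minimal Python sketch of the general idea: group similar sounds, wait until a group has been heard often enough, ask for a label once, and then report the event whenever it recurs. This is not the researchers' implementation. The `SoundClusterer` class, the distance `radius`, the `min_count` threshold and the use of short lists of numbers as stand-ins for audio features are all assumptions made for illustration.

```python
import math


class SoundClusterer:
    """Toy illustration only; names and thresholds are hypothetical."""

    def __init__(self, radius=1.0, min_count=3):
        self.radius = radius        # assumed: max distance to join a cluster
        self.min_count = min_count  # assumed: sightings needed before prompting
        self.clusters = []          # each: {"centroid", "count", "label"}

    def _nearest(self, features):
        # Index of the closest cluster within the radius, or None.
        best, best_dist = None, self.radius
        for i, cluster in enumerate(self.clusters):
            dist = math.dist(cluster["centroid"], features)
            if dist <= best_dist:
                best, best_dist = i, dist
        return best

    def observe(self, features):
        """Return (action, cluster_index); action is ignore, prompt or event."""
        idx = self._nearest(features)
        if idx is None:
            # Unfamiliar sound: start a new, unlabeled cluster.
            self.clusters.append(
                {"centroid": list(features), "count": 1, "label": None}
            )
            return ("ignore", len(self.clusters) - 1)

        cluster = self.clusters[idx]
        cluster["count"] += 1
        n = cluster["count"]
        # Running mean keeps the centroid current without storing past samples.
        cluster["centroid"] = [
            c + (x - c) / n for c, x in zip(cluster["centroid"], features)
        ]
        if cluster["label"] is not None:
            return ("event", idx)   # known sound: could trigger a notification
        if cluster["count"] >= self.min_count:
            return ("prompt", idx)  # confident enough: ask "what sound was that?"
        return ("ignore", idx)

    def label(self, idx, name):
        # Store the user's answer so later occurrences are recognized.
        self.clusters[idx]["label"] = name
```

In this sketch, only a running average per cluster is kept, which loosely echoes the paper's point that no raw audio is saved, although how the real system achieves that is not described here.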
As well as a general "what sound was that?" kind of question, it might be able to guess and so try asking a more specific question. "The system might ask: 'was that a blender?'" says the paper. "In which [case the user] responds: 'no, that was my coffee machine.'"
While the paper is chiefly concerned with the effectiveness of a device asking the user questions like this, the researchers explain that they also tried specific use cases. "We built a smart speaker application that leverages Listen Learner to label acoustic events to aid accessibility in the home," it says.
There is no indication that Apple or other firms are integrating this idea into their smart speakers yet. Instead, this was a short-term, focused test, and the team have recommendations for further research.
However, it's promising because they conclude that this test "provides accuracy levels suitable for common activity-recognition use cases," and brings "the vision of context-aware interactions closer to reality."