Bridging the Gap in Speech Recognition Interfaces
Today, something really unusual happened: Siri amazed me.
As I walked across campus this morning, I wanted to listen to one of my favorite recent albums, “Ashes” by the Bedsit Infamy. So, like a douchebag from the future, I raised my arm and spoke to my Apple Watch, whose “virtual assistant” is named “Siri”. I said:
“Hey Siri, play songs by the Bedsit Infamy”
A Recipe for Failure
Speech Recognition, as I’ve discussed before, relies heavily on guesswork, particularly when there are homophones (words which sound identical to other words) or when there’s missing information (maybe due to traffic noise overlapping speech or misarticulation).
Both of the words in this particular band’s name are, well, weird. “Bedsit” is a British term for a studio apartment, and “infamy”, although well known (infamous, even), just isn’t used very often.
I love the saying “When you hear hoofbeats, think horses, not zebras”, and it applies here: when you hear something that sounds like “bedsit infamy”, it’s deeply unlikely that those two words are what’s being said. So, I figured that Siri would “mis-hear” those words as something more common and, well, reasonable. Sure enough, she did, transcribing the band’s name as “The Bed Sitting For Me”.
But, moments later, to my absolute amazement, my phone started playing the first song from the album.
Bridging the Gap between perception and the “real world”
This means that Apple (or Nuance, or whoever’s providing Siri’s logic) has added a logical step that I’ve never seen before in a consumer-facing system, but which has long been present in humans.
Imagine that you’re sitting across the table from a friend, and she says something that you hear as “Hand me that gas”. Unless you’re sitting next to a tank of compressed air [1] or something similarly improbable, there’s really no way to complete the request as heard. This is where most natural language processing in speech recognition stops: “I tried to do exactly what I heard you ask me to do, but I can’t. Sorry!”
However, with a little bit more logic, we can bridge the gap between our mis-perception and the world around us. We might realize that on the table, there’s a glass, which sounds a lot like “gas” and is something that we could hand to her. So, without stopping to ask questions, we just hand over the glass, and the interaction continues without problems.
So, it appears that, much like humans, when a voice command doesn’t “make sense” (because I don’t own music by “The Bed Sitting For Me”), Siri will now test other phonetically similar commands to see if any of them make sense. If a similar command (“Play songs by The Bedsit Infamy”) actually can be completed, she’s programmed to do that instead! But if there’s nothing even close to what you asked for in your music library, she still gives up.
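To make that fallback step concrete, here’s a minimal sketch of the idea in Python. To be clear, this is my guess at the general shape of the logic, not Siri’s actual implementation: the function names, the made-up music library, and the 0.6 threshold are all assumptions for illustration, and I’m using plain letter-by-letter similarity (via Python’s difflib) as a crude stand-in for real phonetic similarity.

```python
import difflib
import re


def normalize(name):
    """Lowercase, drop a leading 'the', and keep only the letters."""
    name = re.sub(r"^\s*the\s+", "", name.lower())
    return re.sub(r"[^a-z]", "", name)


def similarity(heard, candidate):
    """Spelling similarity in [0, 1]; a rough proxy for 'sounds alike'."""
    matcher = difflib.SequenceMatcher(None, normalize(heard), normalize(candidate))
    return matcher.ratio()


def resolve_artist(heard, library, threshold=0.6):
    """Return the library artist to play, or None to give up."""
    # Step 1: take the transcription at face value.
    for artist in library:
        if normalize(artist) == normalize(heard):
            return artist
    # Step 2: the request can't be completed as heard, so look for
    # the closest-sounding thing the user actually owns.
    best = max(library, key=lambda a: similarity(heard, a), default=None)
    if best is not None and similarity(heard, best) >= threshold:
        return best
    # Step 3: nothing is even close, so give up.
    return None


# Hypothetical music library, invented for this example.
library = ["The Bedsit Infamy", "Radiohead", "Daft Punk", "The Beatles"]

print(resolve_artist("The Bed Sitting For Me", library))  # The Bedsit Infamy
print(resolve_artist("Mozart", library))                  # None
```

The important design choice is that the search is limited to things which actually exist in the user’s world (here, the library), which is what lets a very unlikely phrase beat a much more common one. A real system would presumably compare phoneme sequences or rescore the recognizer’s alternative hypotheses rather than compare spellings.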
Speech Recognition is still really hard
This (small) victory illustrates just how hard making a good speech recognition interface actually is. Even once they’ve factored out all the environmental noise and figured out the sounds being made (which is no small feat), they’ve still got to match the resulting commands to actual concepts and entities in the user’s life, some of which are going to be really unlikely and hard to predict.
As much as I love mocking and doing terrible things to speech recognition, even the error-prone systems we have today are amazing. And every time a bit more logic is added to the process, they’ll get better and better, and eventually, we might actually believe Siri’s looking out for us.
[1] I could have made a gas joke here, but all the good ones Argon.