
Research into Siri, Alexa, Google Assistant voice tech reveals bias in training data

Automated speech recognition systems are essential to most of the features of smart speakers and virtual assistants.


Speech recognition systems from major tech companies have a harder time understanding words spoken by black people than the same words spoken by white people, a new study finds.

These types of systems are commonly used in digital assistants like Siri, as well as in tools like closed captioning and hands-free controls. But, as with any machine learning system, their accuracy is only as good as their training data.

Automated speech recognition (ASR) systems developed by companies including Amazon, Apple, Google, IBM, and Microsoft tend to have higher error rates when transcribing speech from African Americans than from white Americans, according to a Stanford University study published in Proceedings of the National Academy of Sciences.

Researchers took 115 human-transcribed interviews and compared those transcriptions to ones produced by the speech recognition tools. Of the interviews, 73 were with black speakers, while 42 were with white speakers.

The team found that the "average word error rate" was nearly double (35%) when the ASR systems transcribed black speech, compared to 19% when they transcribed white speakers.
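For context on the metric: word error rate is the minimum number of word-level substitutions, insertions, and deletions needed to turn a system's transcript into the human reference transcript, divided by the number of words in the reference. Here is a minimal sketch of that standard calculation in Swift; this is illustrative only, not the study's tooling, and the function name and sample strings are made up:

```swift
// Word error rate (WER): the word-level edit distance between a human
// reference transcript and a machine hypothesis, divided by the number
// of words in the reference. A WER of 0.35 means roughly 35 errors for
// every 100 reference words.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    guard !hyp.isEmpty else { return 1 } // every reference word deleted

    // Classic dynamic-programming edit distance (Levenshtein) over words.
    var prev = Array(0...hyp.count)
    for i in 1...ref.count {
        var curr = [i] + Array(repeating: 0, count: hyp.count)
        for j in 1...hyp.count {
            let substitution = prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1)
            curr[j] = min(prev[j] + 1,     // deletion
                          curr[j - 1] + 1, // insertion
                          substitution)    // substitution (or match)
        }
        prev = curr
    }
    return Double(prev[hyp.count]) / Double(ref.count)
}

// One substituted word ("dog" -> "fog") in a five-word reference:
let wer = wordErrorRate(reference: "the quick brown dog jumps",
                        hypothesis: "the quick brown fog jumps")
print(wer) // 0.2, i.e. a 20% word error rate
```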

To rule out differences in vocabulary and dialect, the researchers also matched speakers by gender and age and had them say the same words. Even then, they found error rates nearly twice as high for black speakers as for white ones.

"Given that the phrases themselves have identical text, these results suggest that racial disparities in ASR performance are related to differences in pronunciation and prosody— including rhythm, pitch, syllable accenting, vowel duration, and lenition— between white and black speakers," the study reads.

Error rates tended to be higher for African American men than for African American women, though a similar gender disparity existed among white speakers. Accuracy was worst for speakers who made heavy use of African American Vernacular English (AAVE).

Of course, machine learning systems can't be biased the way people can. But a lack of diversity in the data they are trained on shows up directly in their accuracy and performance. The study concludes that the primary issue appears to be a shortage of audio data from black speakers in the corpora used to train these models.

It's worth noting that the researchers used a custom-designed iOS app that leveraged Apple's freely available speech recognition technology, and it isn't clear whether Siri uses that exact machine-learning model. The tests were also conducted last spring, so the models may have changed since then.
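Apple's publicly available recognizer is exposed to developers through the Speech framework, so a test app of this kind would presumably call something along these lines. What follows is a minimal sketch of transcribing a recorded file, not a reconstruction of the researchers' actual app; the file path is a placeholder:

```swift
import Speech

// Transcribe a recorded audio file with Apple's Speech framework.
// Whether Siri shares this exact model is, as noted above, unclear.
func transcribe(fileAt url: URL) {
    SFSpeechRecognizer.requestAuthorization { status in
        guard status == .authorized,
              let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
              recognizer.isAvailable else {
            print("Speech recognition unavailable or not authorized")
            return
        }
        let request = SFSpeechURLRecognitionRequest(url: url)
        recognizer.recognitionTask(with: request) { result, error in
            if let result = result, result.isFinal {
                // The machine transcript a study would score against a
                // human transcript, e.g. with wordErrorRate(...) above.
                print(result.bestTranscription.formattedString)
            } else if let error = error {
                print("Recognition failed: \(error.localizedDescription)")
            }
        }
    }
}

// Placeholder path; a real app would point at its own recording.
transcribe(fileAt: URL(fileURLWithPath: "/path/to/interview.wav"))
```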

While the study looked specifically at black and white speakers, digital assistants can also have a harder time interpreting other accents.

A 2018 story by The Washington Post found that digital assistants like Alexa or Google Assistant have a harder time understanding people with accents of all kinds. Generally, speakers from the West Coast — where most tech giants are located — were the best understood.

And in 2019, U.S. federal researchers found widespread evidence of racial bias in nearly 200 facial recognition algorithms, underscoring that a lack of diverse data sets can cause similar issues across all types of machine-learning platforms.



8 Comments

hmurchison 23 Years · 11824 comments

Neither of these systems is as accurate with my wife’s commands. It’s easy to see that a lot of the testing comes from males, and that these males are more highly educated than the norm, which gives them better command of English diction.

gatorguy 13 Years · 24627 comments

The worst of the group? Siri. Best was Microsoft's Cortana, so kudos MS. Google's and Amazon's voice assistants were in the middle. In fairness to all of them, the inaccuracies in general were not due to "race" as such but to dialect and speech rhythm.

EsquireCats 8 Years · 1268 comments

It would be handy if the numbers were compared with population demographics - since the machine learning systems are actively trained by their own users it stands to reason that smaller demographics will have proportionally less accurate dictation. (While larger demographics will benefit from wider variety and a higher number of samples.)

dysamoria 12 Years · 3430 comments

Unsurprising. This same issue is present with facial recognition tech and sensors on touchless soap dispensers and water faucets. I seem to remember such an article on this site...?

The tech industry is very white and very male, and their test and training audiences are usually more of the same, because, well, white male tech geeks are plentiful, and they like trying out bleeding-edge tech (as a born-again user, I’m no longer willing to be a crash-test dummy).

My girlfriend has a Droid phone. Whenever she speaks to it for dictation or commands, she talks sort of robot-like. She unnaturally enunciates and speaks slowly. I tell her she doesn’t need to do that. When I do that with speech recognition, it presents worse results, because systems aren’t trained to listen to robotic people. She says it helps...

beowulfschmidt 12 Years · 2361 comments

The team found that the "average word error rate" was nearly double (35%) when the ASR systems transcribed black speech, compared to 19% when they transcribed white speakers.

Purely because my education seems to be lacking, could you please define "black speech" for me?  I'm also curious why you chose to say "transcribed black speech" and then "transcribed white speakers."

And if it matters, boomer white cis male here.