When Men and Women talk to Siri

Just adding a short note to elaborate on the Twitter conversation I recently had on differential error rates in speech products for male and female speakers. This is not a commentary on Siri; I am using Siri in the title as eponymous to voice assistants in general.

TLDR: different gender error rates for speech products exist mainly because of 1) our lack of better models even when the data is balanced, 2) inherent hardness of the problem.

The “technical reasons” I mentioned start with the way nature made our vocal tracts and understanding some phonetics. In Speech, we measure the mean fundamental frequency (correlates with our perception of “pitch”). This is also called mean F0. The range of tones produced by our vocal tract is a function of the distribution around that. You could write a simple, rule-based, gender classifier if you had the mean F0 from audio (can be extracted from audio). From many sources, including [1], we know the mean F0 for men is around 120Hz and much higher for women (~200Hz).

Of course, in reality, you would learn all of that from data (usually the function predict_gender is a neural network). But soon as you implement your model, the problems become apparent:

1. The mean F0s for a given gender vary depending on your age, whether you smoke, your current afflictions, your ethnicity, and so on.
2. Even when the predictions make sense given data, it might not make sense to the user since we are limiting the notion of gender here to biological gender at puberty. For a recent good study of these issues (from a linguistics POV), see [2] and articles citing it.

There is sociolinguistics work, exploiting content and disfluencies to improve accuracies for female speakers. See [3] for example.

To make things worse, our choice of features can impact performance too (duh). MFCCs, popularly used in speech, can degrade the performance of speaker recognition systems for women [4].

The mel scale is almost optimal for males, but suboptimal for females in speaker recognition, because it quefrency quashes the cepstra containing the female formants.

Speech systems designed without mindfulness to the extent of this problem can make an already hard problem worse. Fortunately, with recent deep models for speech, we can build models that directly learn from raw waveforms, throw a lot of data and compute at it, and hope the models have enough capacity to reliably encode class-specific variation. This is appealing but also sort of favors large companies than smaller startups that push out new technologies all the time. But with sufficient thought, many of these over-provisioned deep models may be replaced with simpler deep models.

From a pure stats/ML POV, the mean F0 range for female voices is usually much higher than male voices. So feature vectors for women can show more variance in the estimates, which can also contribute to the lack of performance. There are two ways of dealing with this: Building really high capacity models that account for all that variation or conditioning the models based on gender. For example, speech emotion models when conditioned on gender show a significant improvement (similarly for race).

This is just a bare bones explanation. For an in-depth explanation, consult your friendly phonetician from your Linguistics/CogSci department. I am excited that this conversation is getting attention (kudos to Rachael Tatman and Aravind Narayanan for getting this conversation started on Twitter in the first place) as there is much to be discussed about the status quo speech models in production.

Also, talk to some women speech researchers about what they think about this. I am guessing they surely have given more thought to this problem than anyone else. There are many, but here’s a seed list — Tara Sainath, Erin Fitzgerald, Carolina Parada, Karen Livescu, Sharon Goldwater, Mari Ostendorf. Unrelated: I tried hard to find a list of women in speech research and couldn’t find one. Perhaps, someone can take this seed list and build a large one?

Apologize in advance for missing any key citations in my intentionally sparse citations. It’s a blog post :)

References:
1. Yukio et al. (1972), “A statistical analysis of melody curves in the intonation of American English”
2. Zimman, Voices in Transition: Testosterone, Transmasculinity, and the Gendered Voice among Female-to-Male Transgender People
3. Garera and Yarowsky, Modeling Latent Biographic Attributes in Conversational Genres, 2009, ACL
4. Mason and Thompson, Gender effects in Speaker Recognition
5. Erwan Pépiot, Male and Female speech

© 2016 Delip Rao. All Rights Reserved.