Audio Transcription Report

Updated Wednesday, October 25, 2023 at 11:18 PM

Author:

Alex Cabrera

Linked Projects:

This report explores the performance of the OpenAI Whisper transcription models on the Speech Accent Archive dataset. The Whisper models are often considered the state-of-the-art transcription models and are widely deployed. But will they serve all users equally well? Or are there hidden biases we should be aware of if we want to build fair systems? The Speech Accent Archive is a fascinating dataset that asks people from all over the world to say the same English phrase, one that contains common English sounds. The dataset has a ton of metadata about the speakers, making it great for evaluating potential biases in transcription models.

In this report we specifically look at four Whisper versions: tiny, tiny.en, base, and base.en. We use the common word error rate (WER) metric for evaluation. Note that for this metric lower is better, so smaller numbers mean better performance.
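For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model's output, normalized by the length of the reference. A minimal sketch in plain Python (this is an illustrative helper, not the report's actual metric implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is exactly what happens when a model transcribes in the wrong language.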

Overall Performance

We can first look at the overall performance of the four models on the dataset. We would expect a decreased WER for the base models, which are larger and slower, but interestingly, we only see this improvement with the English-specific model.

Holistic Performance

Looking at the data instances in which the base and base.en models differ, we see that in many cases the base model will start transcribing in the wrong language or script.

What if we exclude these mistranscriptions - how do the models compare when they pick the right language? Most examples in the wrong language had very high error rates, so we select examples with less than 85% WER.
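In Zeno this filter is a slice of the dataset; the same idea can be sketched as a simple list filter over per-example results (the records below are made up for illustration):

```python
# Hypothetical per-example results; in the report this filtering
# is done with a Zeno slice over the real dataset.
results = [
    {"id": 1, "wer": 0.12},
    {"id": 2, "wer": 2.40},  # likely transcribed in the wrong language
    {"id": 3, "wer": 0.55},
]

# Keep only examples where the model plausibly picked the right
# language, using the 85% WER cutoff from the report.
in_language = [r for r in results if r["wer"] < 0.85]
```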

< .85 WER

Interestingly, we find that the base Whisper model actually outperforms the English-specific model on these examples. In other words, the English-specific model's overall advantage comes mainly from reliably picking the correct language, not from better transcription once the language is right.

Performance by Continent

Another question we might want to ask is “which models will perform well across a broad variety of speakers?” This is important to make sure that we have models that aren’t seriously underserving users with particular accents, for example.

We can use the demographic information in the dataset to dive deeper and start to answer these questions. First, we can look at how the model performs for speakers from different continents.
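The aggregation behind such a chart is a group-by over the speaker metadata. A stdlib-only sketch with made-up records (the real dataset has one row per speaker, with the Speech Accent Archive metadata attached):

```python
from collections import defaultdict
from statistics import mean

# Illustrative records only, not real results from the report.
records = [
    {"continent": "North America", "wer": 0.10},
    {"continent": "North America", "wer": 0.14},
    {"continent": "Asia", "wer": 0.38},
    {"continent": "Asia", "wer": 0.30},
]

# Group per-speaker WER scores by continent, then average each group.
by_continent = defaultdict(list)
for r in records:
    by_continent[r["continent"]].append(r["wer"])
mean_wer = {continent: mean(scores) for continent, scores in by_continent.items()}
```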

WER by Continent

We see a clear pattern across all four models: Models perform much better for speakers from continents where English is the primary language.

Additionally, the gap between base and base.en is much larger for continents where English is not the primary language, showing that mistranscriptions are more likely for speakers whose accents are not from North America or Oceania.

But is this just because speakers from these continents are not native speakers? Or is it an accent bias? We can look specifically at speakers from all of these continents for whom English is their first language.

WER by Continent (Native)

While the models improve for native speakers from Europe, they remain worse for native speakers from the other non-English-primary continents. These results should be taken with a grain of salt, as there are only a handful of native English speakers in the dataset from these continents.

Performance by Age Learned

There are other interesting demographic aspects we can look at using this dataset. Another dimension is the age at which someone learned English. In this chart, we look at groups ranging from native speakers to late learners.
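Grouping continuous ages into chart buckets is a small piece of plumbing worth making explicit. The boundaries below are hypothetical, chosen only to illustrate the technique; the report's chart may use different groupings:

```python
def age_bucket(age_learned: int) -> str:
    """Map the age at which a speaker learned English to a chart bucket.
    Bucket boundaries are illustrative, not the report's exact ones."""
    if age_learned == 0:
        return "native"
    if age_learned <= 5:
        return "early (1-5)"
    if age_learned <= 12:
        return "childhood (6-12)"
    return "late (13+)"
```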

We see a steep drop-off in performance as the age at which speakers learned English increases, and the gap between the base English-specific model and the other models grows along with it.

Learning Age

Is this because of mistranscriptions? We can look at the same chart excluding mistranscriptions and see the same trend as above: the base model outperforms the English-specific model when it picks the correct language!

Learning Age (<.85 WER)

Conclusion

While the Whisper models significantly improve the state of the art in transcription, they are biased toward American and European speakers. Greater diversity in the training set could lead to more equitable models.

Enjoyed this report? You can make one too!

Error analysis, chart authoring, shareable reports, and more with Zeno.