This report explores the performance of the OpenAI Whisper transcription models on the Speech Accent Archive dataset. The Whisper models are often considered state-of-the-art for transcription and are widely deployed. But will they serve all users equally well? Or are there hidden biases we should be aware of if we want to build fair systems? The Speech Accent Archive is a fascinating dataset that asks people from all over the world to read the same English passage, one designed to contain common English sounds. The dataset also includes rich metadata about each speaker, making it well suited for evaluating potential biases in transcription models.
In this report we specifically look at four Whisper versions: tiny, tiny.en, base, and base.en. We evaluate them using the standard word error rate (WER) metric, for which lower values mean better performance.
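As a rough sketch of how such an evaluation can be set up, each recording can be transcribed with the openai-whisper package and scored against its reference transcript with jiwer. The audio paths, the sample list, and the `normalize` helper below are illustrative assumptions rather than the report's actual pipeline; in practice both the model output and the reference should be normalized consistently, since WER is sensitive to casing and punctuation.

```python
# Minimal evaluation sketch, assuming openai-whisper and jiwer are installed.
# The audio path and reference transcript are illustrative placeholders,
# not files or results from this report.
import string

import jiwer
import whisper

MODELS = ["tiny", "tiny.en", "base", "base.en"]

# Hypothetical (audio_path, reference_text) pairs from the Speech Accent Archive.
samples = [
    ("audio/english1.mp3",
     "Please call Stella. Ask her to bring these things with her from the store."),
]


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before scoring."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


for name in MODELS:
    model = whisper.load_model(name)
    references, hypotheses = [], []
    for audio_path, reference in samples:
        result = model.transcribe(audio_path)
        references.append(normalize(reference))
        hypotheses.append(normalize(result["text"]))
    # jiwer.wer computes (substitutions + deletions + insertions) / reference words.
    print(f"{name}: WER = {jiwer.wer(references, hypotheses):.3f}")
```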
Overall Performance
We can first look at the overall performance of the four models on the dataset. We would expect the base models, which are larger and slower, to achieve a lower WER than the tiny models, but interestingly, we only see this improvement with the English-specific model (base.en).
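For the overall comparison, a simple per-model aggregation of the per-sample WER values is enough. The sketch below assumes the results were collected into a hypothetical pandas DataFrame with `model` and `wer` columns; the numbers shown are placeholders for illustration, not measured results.

```python
# Aggregation sketch for the overall comparison; the DataFrame contents are
# placeholder values for illustration only, not results from this report.
import pandas as pd

per_sample = pd.DataFrame(
    {
        "model": ["tiny", "tiny.en", "base", "base.en"] * 2,
        "wer":   [0.31, 0.27, 0.30, 0.23, 0.35, 0.29, 0.33, 0.25],
    }
)

# Mean WER per model (lower is better), sorted from best to worst.
overall = per_sample.groupby("model")["wer"].mean().sort_values()
print(overall)
```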