
Comparing OpenAI Whisper Transcription Models

Updated Wednesday, November 15, 2023 at 5:38 PM

Author: Alex Cabrera

The OpenAI Whisper Models

In late 2022, OpenAI released the Whisper series of audio transcription models. Since then, they have quickly become the go-to open-source models used in numerous deployed applications. There has also been a series of updates and follow-on work aiming to improve their speed and accuracy.

How do these new Whisper models compare to the originals? We decided to explore two variations:

  • Distil-Whisper - A distilled version of Whisper that is 6x faster and substantially smaller while performing similarly to the base Whisper models.
  • Whisper Large v3 - An updated Whisper version trained on a larger corpus of data.

This report is a follow-up to our first transcription report, which looked at Whisper's performance across demographic groups. We use the same Speech Accent Archive dataset for this analysis, a dataset of speakers from around the world saying the same linguistically diverse phrase. If you want to explore this data further, take a look at the Whisper Accents Project.
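For reference, here is a minimal sketch of how the compared models can be loaded with the Hugging Face transformers library. The model IDs are the public Hub names, and the audio file name is a placeholder; the report's actual evaluation runs through Zeno rather than this snippet.

```python
from transformers import pipeline

model_ids = [
    "openai/whisper-medium.en",         # English-only base model
    "distil-whisper/distil-medium.en",  # distilled, English-only
    "openai/whisper-large-v2",          # multilingual
    "openai/whisper-large-v3",          # multilingual, larger training corpus
]

# Build one transcription pipeline per model.
transcribers = {
    model_id: pipeline("automatic-speech-recognition", model=model_id)
    for model_id in model_ids
}

# Transcribe a single Speech Accent Archive clip with every model.
results = {name: asr("speaker_clip.wav")["text"] for name, asr in transcribers.items()}
```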

Overall Performance

Looking only at the overall word error rate (WER) of each model, we see what appears to be a very counter-intuitive result: the larger models are significantly worse than the smaller ones!
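For context, WER can be computed with the jiwer package. A minimal sketch on one sentence of the Speech Accent Archive elicitation text (the report's scores come from the Zeno project, not this snippet):

```python
import jiwer

# Ground-truth label and a hypothetical model output for one clip.
reference = "Please call Stella. Ask her to bring these things with her from the store."
hypothesis = "Please call Stella. Ask her to bring those things with her from the store."

# WER = (substitutions + insertions + deletions) / reference word count.
print(jiwer.wer(reference, hypothesis))  # ~0.07: one substituted word out of 14
```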

[Chart: Overall Performance]

If we dig a bit deeper and look at the underlying data, we can quickly see why: unlike the medium and distil models, the large models are multilingual and often pick the wrong language for a speaker based on their accent.


If we roughly filter out the mistranscriptions (keeping only instances with a WER below .75), we see a more intuitive result: the two distil models have the highest WER, while the newly updated large-v2 and large-v3 models have the lowest.
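A sketch of that rough filter, assuming per-instance scores in a pandas DataFrame (the values below are illustrative; the real data lives in the Zeno project):

```python
import pandas as pd

# Illustrative per-instance WER scores.
df = pd.DataFrame({
    "model": ["large-v2", "large-v2", "distil-medium.en", "distil-medium.en"],
    "wer":   [0.06, 0.93, 0.11, 0.13],
})

# Drop likely mistranscriptions, then recompute the average WER per model.
filtered = df[df["wer"] < 0.75]
print(filtered.groupby("model")["wer"].mean())
```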

[Chart: WER < .75]

How common is language misidentification?

The English-specific models (unsurprisingly) never mistranscribe into non-English languages. We also see that the large-v3 model misidentifies the language far less often than the previous versions, likely due to the larger audio dataset it was trained on.
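The report slices on WER > .75 as a proxy for misidentification; one could also flag such outputs directly with a language detector such as the langdetect package. A sketch with hypothetical model outputs:

```python
from langdetect import detect  # pip install langdetect

outputs = {
    "whisper-large-v2": "Por favor, llama a Stella.",  # hypothetical Spanish mistranscription
    "whisper-medium.en": "Please call Stella.",
}

for model, text in outputs.items():
    language = detect(text)  # ISO 639-1 code, e.g. "en", "es"
    if language != "en":
        print(f"{model}: output detected as '{language}', likely misidentified")
```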

[Chart: WER > .75]

Performance by Demographics

Let's look at how the models perform for people from around the world. For this analysis, we'll only look at the English-specific models to avoid confounding the results with mistranscriptions.

The results are quite interesting: the difference in WER between the base medium model and the distil models increases for continents where English is not the primary language. Despite the trend, the differences are small and may not be noticeable in real-world use.
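A sketch of the by-continent breakdown, assuming per-instance results joined with speaker metadata from the Speech Accent Archive (the numbers below are hypothetical):

```python
import pandas as pd

# Hypothetical per-instance results with speaker continent metadata.
df = pd.DataFrame({
    "model":     ["medium.en", "distil-medium.en", "medium.en", "distil-medium.en"],
    "continent": ["Europe",    "Europe",           "Asia",      "Asia"],
    "wer":       [0.08,        0.10,               0.09,        0.15],
})

# Average WER per model, broken out by speaker continent.
print(df.pivot_table(index="continent", columns="model", values="wer"))
```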

[Chart: English Models by Continent]

How do the Distil Models Differ?

We found that the distil models overall are impressively good, even across diverse speakers. Can we identify some qualitative differences between the base and distilled models?

Missing Sentences

In rare cases, we found that the distil models can miss several sentences in a row. We quantified this behavior by looking for instances with a low ratio between output length and label length.
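A sketch of that length-ratio check; the 0.5 threshold and the example output are illustrative, not the report's exact cutoff:

```python
def length_ratio(output: str, label: str) -> float:
    """Ratio of output word count to label word count (1.0 = same length)."""
    return len(output.split()) / max(len(label.split()), 1)

label = "Please call Stella. Ask her to bring these things with her from the store."
output = "Please call Stella."  # hypothetical transcript with dropped sentences

# Flag instances where the output is much shorter than the label.
if length_ratio(output, label) < 0.5:  # threshold is illustrative
    print("possible dropped sentences")
```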


[Slice: Low length ratio]

Missing Punctuation

In some situations, the distil models outperform the base models! We found a set of instances where the base model does not insert any punctuation. Even more interestingly, the distil medium model appears to be the best at inserting correct punctuation.
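One simple way to surface this slice is to count punctuation characters per transcript; a sketch with hypothetical outputs:

```python
import string

def punctuation_count(text: str) -> int:
    """Number of punctuation characters in a transcript."""
    return sum(ch in string.punctuation for ch in text)

outputs = {
    "whisper-medium.en": "please call stella ask her to bring these things",
    "distil-medium.en":  "Please call Stella. Ask her to bring these things.",
}
for model, text in outputs.items():
    print(model, punctuation_count(text))  # 0 flags a punctuation-free output
```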


[Slice: Few special characters]

Omitting verbal repetition

Lastly, we found a few instances where the base Whisper models omit verbal repetitions by the speaker that the distil models include.
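A rough way to surface this behavior is to flag transcripts containing a word repeated back-to-back; a sketch (the regex catches only immediate single-word repetitions):

```python
import re

def has_immediate_repetition(text: str) -> bool:
    """True if the transcript contains a word repeated back-to-back."""
    return re.search(r"\b(\w+)\s+\1\b", text.lower()) is not None

print(has_immediate_repetition("ask her to bring these these things"))  # True
print(has_immediate_repetition("ask her to bring these things"))        # False
```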


Conclusion

Overall, we found that the distil family of Whisper models provides performance nearly equal to the much larger and slower base models, despite rare issues such as missed sentences.

We also found that developers should be careful when using multilingual models to transcribe English, as language misidentification is a common problem, especially for speakers with accents.

The Whisper models were a step-function improvement in audio transcription, and these exciting follow-up models continue to raise the bar on quality.


If you liked this report and want to use Zeno to evaluate your models or create a report, check out zenoml.com or reach us at hello@zenoml.com.
