This report explores the performance of large language models (LLMs) such as ChatGPT and GPT-4 on language translation tasks. Specifically, it compares these models against current state-of-the-art translation-specific models. This is a separate exploration of the experimental results from the fantastic full paper by Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig.
The paper compared the performance of translation models on the FLORES-200 dataset, a benchmark of parallel text covering 200 diverse, often under-resourced languages. While the full paper looks at all 200 languages, this interactive report explores a subset of 20 languages (due to cost constraints) on which GPT-4 was also run.
The models compared are:
- ChatGPT zero-shot (GPT-3.5-turbo)
- ChatGPT five-shot (GPT-3.5-turbo)
- GPT-4 five-shot
- NLLB-MoE, the state-of-the-art (SOTA) translation model
We evaluate with ChrF (character n-gram F-score), as implemented in sacrebleu. ChrF measures character-level overlap between the ground truth and the predicted output, and has shown reasonable correlation with human ratings.
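To make the metric concrete, here is a minimal pure-Python sketch of a ChrF computation. This is a simplified illustration, not the sacrebleu implementation: the real library also handles corpus-level aggregation, whitespace options, and an optional word n-gram component (chrF++), so its scores may differ on real data.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams; whitespace is stripped, mirroring sacrebleu's default.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level ChrF.

    Averages character n-gram F-beta scores for n = 1..max_order.
    beta = 2 weights recall twice as heavily as precision,
    matching the standard ChrF configuration.
    """
    scores = []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        scores.append(fbeta)
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

An identical hypothesis and reference score 100, disjoint strings score 0, and partial overlaps fall in between, which is the behavior the charts below rely on.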
Overall Performance
We first look at the overall performance of these models across all language pairs. We find that while GPT-4 clearly outperforms ChatGPT, it still lags significantly behind the state-of-the-art translation model, especially on under-resourced languages.