GPT MT Benchmark Report

Updated Wednesday, February 28, 2024 at 4:18 PM

Author:

Alex Cabrera

This report explores the performance of large language models (LLMs) such as ChatGPT and GPT-4 on language translation tasks. Specifically, it compares the performance of these models with current state-of-the-art translation-specific models. This is a separate exploration of the experimental results from the fantastic full paper by Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig.

The paper compared the performance of translation models on the FLORES 200 dataset, a translation benchmark covering diverse and under-resourced languages. While the full paper looks at all 200 languages, this interactive report explores a subset of 20 languages (due to cost constraints) on which GPT-4 was also run.

The models shown below are the following:

  • ChatGPT zero-shot (GPT-3.5-turbo)
  • ChatGPT five-shot (GPT-3.5-turbo)
  • GPT-4 five-shot
  • NLLB MOE, the state-of-the-art (SOTA) translation model

We use the ChrF (character n-gram F-score) metric from sacrebleu for evaluation. ChrF measures the character-level overlap between the ground-truth and predicted outputs and has shown reasonable correlation with human ratings.
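For intuition, here is a minimal pure-Python sketch of a simplified ChrF (character n-grams only, up to order 6, β = 2, whitespace removed). It is an illustration, not a replacement for the metric: in practice, use sacrebleu's ChrF implementation, which also handles corpus-level aggregation and tokenization details.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with spaces removed (ChrF ignores whitespace by default)
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    # Average the n-gram F-beta score over n = 1..max_n, scaled to 0-100
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precision, recall = overlap / hyp_total, overlap / ref_total
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta ** 2) * precision * recall
                      / (beta ** 2 * precision + recall))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

A perfect match scores 100 and completely disjoint strings score 0; real sacrebleu ChrF scores will differ slightly due to its exact normalization.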

Overall Performance

We first look at the overall performance of these models across all language pairs. We find that while GPT-4 significantly outperforms ChatGPT, it still significantly lags behind the state-of-the-art translation models, especially on under-resourced languages.

Overall Performance

Why do LLMs lag?

Can we dig a bit deeper and explore why specifically LLMs perform worse? A common pattern we found while browsing through the data was that GPT produced degenerate outputs containing the same word repeated over and over.

To quantify this behavior, we counted how many outputs from each model contained any word repeated more than three times.
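This check can be sketched as a simple word-count filter (an illustrative sketch; the report's actual slicing was done in Zeno, and its exact tokenization may differ):

```python
from collections import Counter

def is_degenerate(output: str, max_repeats: int = 3) -> bool:
    # Flag outputs in which any single word appears more than max_repeats times
    counts = Counter(output.lower().split())
    return any(count > max_repeats for count in counts.values())

print(is_degenerate("yes yes yes yes yes yes"))  # degenerate repetition
print(is_degenerate("The quick brown fox"))      # normal output
```

Note that a raw count threshold can also flag legitimate outputs (long sentences repeat function words such as "the"), so this is a heuristic rather than a precise definition of degeneration.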

>3 Word Repetitions by Model

We see that degenerate outputs are relatively common for ChatGPT, and while they are much less common for GPT-4 they do still occur.

Are these outliers skewing our results? To check, we can recompute the overall performance of our models while excluding these degenerate outputs.

< 4 repetitions

Removing the degenerate outputs changes the overall ChrF only slightly, so they are not the driving factor behind the difference in performance.
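Assuming per-example (output text, ChrF score) pairs are available, the before-and-after comparison can be sketched as follows (the function names here are hypothetical, not the report's actual code):

```python
from collections import Counter

def is_degenerate(output, max_repeats=3):
    # Any single word repeated more than max_repeats times
    counts = Counter(output.lower().split())
    return any(count > max_repeats for count in counts.values())

def mean_chrf(examples, exclude_degenerate=False):
    # examples: iterable of (output_text, chrf_score) pairs
    scores = [score for text, score in examples
              if not (exclude_degenerate and is_degenerate(text))]
    return sum(scores) / len(scores) if scores else 0.0
```

Comparing `mean_chrf(examples)` against `mean_chrf(examples, exclude_degenerate=True)` shows how much the degenerate outputs drag down the average.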

Performance Across Languages

It appears that general-purpose LLMs will not replace dedicated translation models anytime soon. But is this the case across all languages? We can dive into the data to understand where the biggest disparities in performance are.

Language Scripts

An interesting question we had was how well models do across different language scripts, especially rarer scripts that might not be as common in training data.

ChrF by Script

We see that the same pattern of model performance holds across all language scripts, with the interesting exception of Cyrillic. Perhaps there are properties of Cyrillic text that make it easier to translate to and from English?
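The per-script breakdown can be sketched as a group-by over per-example scores (the field names here are hypothetical; in the actual report this is a metadata slice in Zeno):

```python
from collections import defaultdict

def mean_chrf_by_script(rows):
    # rows: iterable of dicts with hypothetical keys "script" and "chrf"
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        totals[row["script"]][0] += row["chrf"]
        totals[row["script"]][1] += 1
    return {script: total / count for script, (total, count) in totals.items()}
```

The same aggregation could be written in pandas as a `groupby` over a script column followed by a mean.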

Languages

We can take this a step further and visualize model performance across all 20 languages in the dataset.

All Languages

This chart presents a lot of information to parse, but we can see some clear outliers and patterns.

First, there are a few languages for which the LLMs are competitive with NLLB MOE, particularly high-resource languages with lots of available data such as French, Romanian, and Ukrainian.

There are also a few less common languages for which GPT-4 actually outperforms NLLB MOE, in particular the following two:

GPT 4 > NLLB MOE

While it would take deeper investigation to uncover why this is the case, there are a few interesting hypotheses. Tok Pisin is a creole language derived from a combination of languages, and GPT-4 may be performing better because:

  • Low-resource languages written in Latin script are likely easier than those written in other scripts.
  • Tok Pisin may be more prominent on the internet as a whole than it is on Wikipedia, so we may be underestimating its level of resources in the GPT-4 training data.
  • Tok Pisin is a creole that borrows many English words, and GPT-4 may have learned to “guess” these words through transliteration.

Conclusion

General-purpose language models have made significant strides across numerous tasks. GPT-4, in particular, has shown that it can rival some state-of-the-art models in translation. Despite these improvements, translation-specific models still significantly outperform general-purpose models. This in-depth analysis shows specific areas in which LLMs can be used instead of translation models, and areas in which they can still be improved.

Enjoyed this report? You can make one too!

Error analysis, chart authoring, shareable reports, and more with Zeno.