
Gemini: Flores Translation Evaluation

Updated Tuesday, December 19, 2023 at 4:54 PM

Author: aashiqmuhamed

Linked Projects: Gemini Benchmark - Flores

In this report, we compare the performance (in chrF, %) of Gemini Pro, GPT-3.5 Turbo, and GPT-4 Turbo on the machine translation dataset Flores. We examine the overall average performance, performance by language pair, and performance on other slices. If you want to look at individual examples in more detail, you can click over to the corresponding Zeno project.
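As a reference for how the metric is computed, below is a minimal sketch of corpus-level chrF using the sacrebleu library. The example hypotheses and references are illustrative only and are not drawn from the benchmark data.

```python
# Minimal sketch: computing corpus-level chrF with sacrebleu.
# The example sentences are illustrative, not taken from Flores.
import sacrebleu

def corpus_chrf(hypotheses: list[str], references: list[str]) -> float:
    """Return corpus-level chrF (%) for one system on one language pair."""
    # sacrebleu expects a list of reference streams, hence the extra list.
    return sacrebleu.corpus_chrf(hypotheses, [references]).score

if __name__ == "__main__":
    hyps = ["Das ist ein kleiner Test.", "Guten Morgen!"]
    refs = ["Das ist ein kleiner Test.", "Guten Morgen."]
    print(f"chrF: {corpus_chrf(hyps, refs):.2f}")
```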

First, looking at the overall results in the figure below, we can see that Gemini Pro achieves lower average performance than GPT-3.5 Turbo and GPT-4 Turbo. In subsequent sections, we investigate why this might be the case.

Overall performance

Analysis by language

We first plot performance by language pair. We see that Gemini Pro performs well on one subset of pairs (competitively with GPT-4 Turbo and GPT-3.5 Turbo) and achieves near-zero chrF on another subset.

We also see that 5-shot prompts only minimally improve performance, and only on the subset of pairs on which Gemini Pro is already competitive.

Performance by language pair
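For context on the 5-shot setting, a few-shot translation prompt can be assembled along the lines sketched below. The template wording and labels are assumptions for illustration; the exact prompt used for the benchmark may differ.

```python
# Sketch of a k-shot translation prompt. The template wording and labels
# are illustrative assumptions; the benchmark's actual prompt may differ.
def build_few_shot_prompt(examples, source_sentence, src_lang, tgt_lang, k=5):
    """examples: list of (source, target) pairs used as in-context demonstrations."""
    lines = []
    for src, tgt in examples[:k]:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")  # the model completes the translation here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    examples=[("Good morning.", "Guten Morgen.")],
    source_sentence="How are you?",
    src_lang="English",
    tgt_lang="German",
)
print(prompt)
```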

If we plot the number of blocked responses per language for Gemini Pro, we see that per-language-pair performance correlates with the number of blocked responses: the pairs with near-zero chrF are largely those with many blocked responses.

Blocked responses per language

Performance per language
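The count of blocked responses can be derived from the system outputs roughly as follows. We assume here that blocked responses show up as empty predictions; the file name and the column names are hypothetical.

```python
# Sketch: counting blocked responses per language pair with pandas.
# Assumes blocked responses appear as empty predictions; the file name
# and column names below are hypothetical.
import pandas as pd

df = pd.read_json("gemini_pro_flores_outputs.jsonl", lines=True)
df["blocked"] = df["prediction"].fillna("").str.strip() == ""

blocked_per_pair = (
    df.groupby("language_pair")["blocked"]
      .sum()
      .sort_values(ascending=False)
)
print(blocked_per_pair.head(10))
```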

If we examine performance on unblocked samples only, the trend is largely similar to the combined performance above, suggesting that a language-specific block is in effect.

Unblocked performance per language

If we compare Gemini Pro's performance on unblocked samples against the other models' performance on all samples, we see that Gemini Pro significantly outperforms the other models, suggesting that the blocked examples may also be samples that are difficult or that the model is uncertain about.

Non-empty prediction performance

Examining performance on blocked samples (samples blocked by either Gemini Pro 0-shot or 5-shot) vs. unblocked samples, we see that Gemini Pro performs better on the unblocked samples than GPT-4 Turbo and GPT-3.5 Turbo. We also observe that all models perform poorly on the blocked samples, suggesting that these examples are harder for models to translate.

Performance on Gemini-blocked vs. unblocked samples
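One way to produce this comparison is to slice a long-format results table by a per-example "blocked by Gemini" flag and average chrF per system and slice, roughly as sketched below. The file name and columns are assumptions about how the results might be exported.

```python
# Sketch: mean chrF per system on Gemini-blocked vs. unblocked samples.
# The file name and columns (example_id, system, chrf, gemini_blocked)
# are assumptions about the export format.
import pandas as pd

df = pd.read_csv("flores_all_systems.csv")

slice_means = (
    df.groupby(["system", "gemini_blocked"])["chrf"]
      .mean()
      .unstack("gemini_blocked")
      .rename(columns={True: "blocked", False: "unblocked"})
)
print(slice_means.round(2))
```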

Performance by language script

If we analyze performance by language script, we see that few-shot prompts tend to raise the performance envelope for all models. In particular, they significantly enhance performance on the Devanagari script.

Gemini Pro is confident about its predictions on the Cyrillic script, but underperforms on other scripts.

Performance per language script
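The script slice can be derived directly from the FLORES-200 language codes, which embed an ISO 15924 script tag (for example `hin_Deva`, `rus_Cyrl`, `eng_Latn`). A minimal helper, assuming the codes follow this convention:

```python
# Sketch: extracting the script tag from a FLORES-200 language code.
# Assumes codes follow the "<lang>_<Script>" convention, e.g. "hin_Deva".
def script_of(flores_code: str) -> str:
    """Return the ISO 15924 script tag, e.g. 'hin_Deva' -> 'Deva'."""
    return flores_code.split("_")[-1]

assert script_of("hin_Deva") == "Deva"
assert script_of("rus_Cyrl") == "Cyrl"
assert script_of("eng_Latn") == "Latn"
```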

Repetitions and length distribution

All analysis below is on unblocked samples, i.e., samples on which the Gemini Pro model predicts confidently.

The first plot shows the number of samples with more than 3 repetitions of their most frequent word. We see that Gemini Pro and GPT-3.5 Turbo tend to produce more repetitions than GPT-4 Turbo.

Repetitions >3 on unblocked samples
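The repetition statistic can be computed with a simple word-count check like the one below; whitespace tokenisation is a simplifying assumption.

```python
# Sketch: flag predictions whose most frequent word occurs more than 3 times,
# a rough proxy for degenerate repetition. Whitespace tokenisation is a
# simplifying assumption.
from collections import Counter

def max_word_repetitions(text: str) -> int:
    """Return the count of the most frequent whitespace-separated word."""
    return max(Counter(text.split()).values(), default=0)

def is_repetitive(text: str, threshold: int = 3) -> bool:
    return max_word_repetitions(text) > threshold

print(is_repetitive("the cat sat on the mat"))                 # False
print(is_repetitive("very very very very good translation"))   # True
```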

We next examine the target length distribution (for unblocked samples), and the performance in each length bucket.

In general, we find that Gemini Pro's content-filter-based response blocks change the distribution of samples when conditioned on target length, shifting the mode of the distribution from 100 < x < 150 to 200 < x < 250.

Target length distribution (unblocked samples)

All samples target length distribution

We find that performance (in chrF) is correlated with the length of the target sentence.

Performance by target length (unblocked samples only)
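The length buckets can be computed by binning target length in characters and averaging chrF per bin, roughly as sketched below; the 50-character bin width mirrors the buckets in the plots, while the file name and column names are assumptions.

```python
# Sketch: bucket unblocked samples by target length (characters) and average
# chrF per bucket. File name and columns (target, chrf) are assumptions.
import pandas as pd

df = pd.read_csv("gemini_pro_unblocked.csv")
df["target_len"] = df["target"].str.len()
df["len_bucket"] = (df["target_len"] // 50) * 50  # 0-49 -> 0, 50-99 -> 50, ...

by_bucket = df.groupby("len_bucket").agg(
    n_samples=("chrf", "size"),
    mean_chrf=("chrf", "mean"),
)
print(by_bucket)
```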

When we examine the length distribution of predictions (for unblocked samples), we find that the distribution is largely similar across models.

However, Gemini Pro's performance is much higher than that of GPT-4 Turbo and GPT-3.5 Turbo for long predictions (prediction length x > 350) and much worse at shorter lengths (x < 50).

Length distribution of predictions

Performance by prediction length
