Gemini Benchmark - MMLU
In this report, we compare the accuracy of Gemini Pro with that of GPT 3.5 Turbo and GPT 4 Turbo on the knowledge-based QA dataset MMLU. We examine overall performance, the output choice ratio, performance by task, and performance by output length. If you want to examine the individual examples in more detail, you can click through to the corresponding Zeno project.
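As a rough sketch of the kind of summary statistics reported below (not the actual evaluation harness used here), accuracy, the output choice ratio, and per-task performance could be computed from per-example records like the following; the record fields and task names are hypothetical placeholders:

```python
from collections import Counter, defaultdict

# Hypothetical per-example records: predicted choice, gold choice, and MMLU subtask.
predictions = [
    {"task": "high_school_physics", "pred": "A", "gold": "A"},
    {"task": "high_school_physics", "pred": "C", "gold": "B"},
    {"task": "moral_scenarios",     "pred": "B", "gold": "B"},
]

def accuracy(records):
    """Fraction of examples where the predicted choice matches the gold choice."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)

# Overall accuracy.
print(f"overall accuracy: {accuracy(predictions):.3f}")

# Output choice ratio: how often the model emits each answer letter,
# regardless of correctness (a strong skew toward one letter signals bias).
choice_counts = Counter(r["pred"] for r in predictions)
total = sum(choice_counts.values())
for choice in "ABCD":
    print(f"choice {choice}: {choice_counts.get(choice, 0) / total:.2%}")

# Per-task accuracy, mirroring the performance-by-task breakdown.
by_task = defaultdict(list)
for r in predictions:
    by_task[r["task"]].append(r)
for task, records in sorted(by_task.items()):
    print(f"{task}: {accuracy(records):.3f}")
```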
First, looking at the overall results in the figure below, we can see that Gemini Pro achieves lower accuracy than GPT 3.5 Turbo, and much lower accuracy than GPT 4 Turbo. Chain-of-thought prompting may not boost model performance here, as MMLU is mostly a knowledge-based question answering task and may not benefit much from stronger reasoning-oriented prompts.
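To make the two prompting setups concrete, here is an illustrative sketch of a direct-answer prompt versus a chain-of-thought prompt for a multiple-choice, MMLU-style question; the question, choices, and wording are invented for illustration and are not the exact templates used in this report:

```python
# A made-up multiple-choice question in the MMLU format.
question = "What is the speed of light in a vacuum?"
choices = {"A": "3.0e8 m/s", "B": "3.0e6 m/s", "C": "1.5e8 m/s", "D": "9.8 m/s^2"}

formatted_choices = "\n".join(f"({letter}) {text}" for letter, text in choices.items())

# Direct-answer prompt: the model is asked to output only the answer letter.
direct_prompt = (
    f"Question: {question}\n{formatted_choices}\n"
    "Answer with a single letter (A, B, C, or D)."
)

# Chain-of-thought prompt: the model is asked to reason step by step
# before committing to a final answer letter.
cot_prompt = (
    f"Question: {question}\n{formatted_choices}\n"
    "Think step by step, then give the final answer as a single letter."
)

print(direct_prompt)
print(cot_prompt)
```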