Gemini Benchmark - Big Bench Hard
In this report, we compare the accuracy of Gemini Pro to that of GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on the general-purpose reasoning dataset BigBench Hard. We examine overall performance, performance by question complexity, and performance by task. To inspect the individual examples in more detail, you can click over to the corresponding Zeno project.
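As a rough illustration of the breakdowns above, the sketch below aggregates per-example correctness into overall and per-task accuracy. The record fields (`task`, `correct`) and task names are illustrative assumptions, not the actual schema used in the evaluation or in Zeno.

```python
# Hypothetical sketch: computing overall and per-task accuracy from
# per-example correctness records. Field names are assumptions.
from collections import defaultdict

def accuracy_by_task(examples):
    """Return (overall_accuracy, {task: accuracy}) from example records."""
    totals = defaultdict(int)   # examples seen per task
    hits = defaultdict(int)     # correct answers per task
    for ex in examples:
        totals[ex["task"]] += 1
        hits[ex["task"]] += int(ex["correct"])
    per_task = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_task

# Illustrative records, not real benchmark outputs.
examples = [
    {"task": "boolean_expressions", "correct": True},
    {"task": "boolean_expressions", "correct": False},
    {"task": "date_understanding", "correct": True},
]
overall, per_task = accuracy_by_task(examples)
```

The same grouping, keyed on a complexity label instead of the task name, yields the by-complexity breakdown.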
First, looking at the overall results in the figure below, we can see that Gemini Pro achieves an accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. Mixtral, meanwhile, trails all three by a wide margin.