Gemini Benchmark - Big Bench Hard
In this report, we compare the accuracy of Gemini Pro to that of GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on the general-purpose reasoning dataset BigBench Hard. We examine overall performance, performance by question complexity, and performance by task. To inspect the individual examples in more detail, you can click over to the corresponding Zeno project.
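As a rough illustration of the breakdowns above, the sketch below aggregates per-example correctness into overall and per-task accuracy. The record fields (`task`, `correct`) and task names are illustrative assumptions, not the actual schema used in the evaluation or in Zeno.

```python
# Hypothetical sketch: computing overall and per-task accuracy from
# per-example correctness records. Field names are assumptions.
from collections import defaultdict

def accuracy_by_task(examples):
    """Return (overall_accuracy, {task: accuracy}) from example records."""
    totals = defaultdict(int)   # examples seen per task
    hits = defaultdict(int)     # correct answers per task
    for ex in examples:
        totals[ex["task"]] += 1
        hits[ex["task"]] += int(ex["correct"])
    per_task = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_task

# Illustrative records, not real benchmark outputs.
examples = [
    {"task": "boolean_expressions", "correct": True},
    {"task": "boolean_expressions", "correct": False},
    {"task": "date_understanding", "correct": True},
]
overall, per_task = accuracy_by_task(examples)
```

The same grouping, keyed on a complexity label instead of the task name, yields the by-complexity breakdown.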
First, looking at the overall results in the figure below, we can see that Gemini Pro achieves an accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. Mixtral, meanwhile, trails all three by a wide margin.