Gemini Benchmark - GSM8K
In this report, we compare the accuracy of Gemini Pro to GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on GSM8K, a dataset of grade-school math word problems. We examine the overall performance, performance by question complexity, and performance by task. If you want to look at the individual examples in more detail, you can click over to the corresponding Zeno project.
First, looking at overall results in the figure below, we can see that Gemini Pro achieves an accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. In contrast, the Mixtral model achieves much lower accuracy.
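To make the accuracy metric concrete, here is a minimal sketch of how exact-match accuracy on GSM8K-style data could be computed. It relies on the fact that GSM8K reference solutions end with a line of the form `#### <number>`. Treating the last number in a model's output as its answer is an assumption made for illustration, not necessarily the procedure used in this report, and the function names `extract_final_number` and `accuracy` are hypothetical.

```python
import re

# Matches integers and decimals, optionally negative, with optional
# thousands separators (e.g. "18", "-3.5", "1,234").
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference answer.

    Each reference is a GSM8K-style solution string whose final answer
    follows the "####" marker. A prediction with no number counts as wrong.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = 0
    for prediction, reference in zip(predictions, references):
        gold = extract_final_number(reference.split("####")[-1])
        guess = extract_final_number(prediction)
        if guess is not None and guess == gold:
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the report.
    toy_predictions = ["So she makes $18 per day.", "The answer is 5.", "I am not sure."]
    toy_references = ["16 - 3 - 4 = 9, 9 * 2 = 18 #### 18", "#### 3", "#### 7"]
    print(accuracy(toy_predictions, toy_references))
```

Taking the last number in the output is a simple heuristic that works when the model states its answer at the end of its reasoning; a stricter setup might instead require a fixed answer format in the prompt.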