Gemini Benchmark - MMLU
In this report, we compare the accuracy of Gemini Pro with that of GPT 3.5 Turbo and GPT 4 Turbo on the knowledge-based QA dataset MMLU. We examine overall performance, the output choice ratio, performance by task, and performance by output length. If you want to examine the individual examples in more detail, you can click through to the corresponding Zeno project.
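As a rough sketch of the kind of summary statistics reported below (not the actual evaluation harness used here), accuracy, the output choice ratio, and per-task performance could be computed from per-example records like the following; the record fields and task names are hypothetical placeholders:

```python
from collections import Counter, defaultdict

# Hypothetical per-example records: predicted choice, gold choice, and MMLU subtask.
predictions = [
    {"task": "high_school_physics", "pred": "A", "gold": "A"},
    {"task": "high_school_physics", "pred": "C", "gold": "B"},
    {"task": "moral_scenarios",     "pred": "B", "gold": "B"},
]

def accuracy(records):
    """Fraction of examples where the predicted choice matches the gold choice."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)

# Overall accuracy.
print(f"overall accuracy: {accuracy(predictions):.3f}")

# Output choice ratio: how often the model emits each answer letter,
# regardless of correctness (a strong skew toward one letter signals bias).
choice_counts = Counter(r["pred"] for r in predictions)
total = sum(choice_counts.values())
for choice in "ABCD":
    print(f"choice {choice}: {choice_counts.get(choice, 0) / total:.2%}")

# Per-task accuracy, mirroring the performance-by-task breakdown.
by_task = defaultdict(list)
for r in predictions:
    by_task[r["task"]].append(r)
for task, records in sorted(by_task.items()):
    print(f"{task}: {accuracy(records):.3f}")
```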
First, looking at the overall results in the figure below, we can see that Gemini Pro achieves lower accuracy than GPT 3.5 Turbo, and much lower accuracy than GPT 4 Turbo. Chain-of-thought prompting may not boost model performance here, as MMLU is mostly a knowledge-based question answering task and may not benefit much from stronger reasoning-oriented prompts.
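To make the two prompting setups concrete, here is an illustrative sketch of a direct-answer prompt versus a chain-of-thought prompt for a multiple-choice, MMLU-style question; the question, choices, and wording are invented for illustration and are not the exact templates used in this report:

```python
# A made-up multiple-choice question in the MMLU format.
question = "What is the speed of light in a vacuum?"
choices = {"A": "3.0e8 m/s", "B": "3.0e6 m/s", "C": "1.5e8 m/s", "D": "9.8 m/s^2"}

formatted_choices = "\n".join(f"({letter}) {text}" for letter, text in choices.items())

# Direct-answer prompt: the model is asked to output only the answer letter.
direct_prompt = (
    f"Question: {question}\n{formatted_choices}\n"
    "Answer with a single letter (A, B, C, or D)."
)

# Chain-of-thought prompt: the model is asked to reason step by step
# before committing to a final answer letter.
cot_prompt = (
    f"Question: {question}\n{formatted_choices}\n"
    "Think step by step, then give the final answer as a single letter."
)

print(direct_prompt)
print(cot_prompt)
```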