Gemini Benchmark - GSM8K

In this report, we compare the accuracy of Gemini Pro to GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on the mathematical reasoning dataset GSM8K. We examine overall performance, performance by question complexity, and performance by task. For a closer look at individual examples, you can click over to the corresponding Zeno project.

First, looking at the overall results in the figure below, we can see that Gemini Pro achieves an accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. The Mixtral model, by comparison, achieves much lower accuracy.

Accuracy

Accuracy by Label Length

Accuracy by Response Length

Accuracy by Question Length

Accuracy by Answer Digit
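
The breakdowns listed above can be reproduced offline with a few lines of Python. The following is a minimal sketch, not the code behind this report: the record fields (`question`, `label`, `correct`), the bucket edges, and the helper names are all assumptions made for illustration.

```python
from collections import defaultdict


def bucket(value, edges):
    """Return a label such as '<30', '30-59' or '>=60' for value.

    edges is assumed to be a non-empty, ascending list of integers.
    """
    lower = None
    for edge in edges:
        if value < edge:
            return f"<{edge}" if lower is None else f"{lower}-{edge - 1}"
        lower = edge
    return f">={lower}"


def accuracy_by(records, key_fn):
    """Group records with key_fn and return {group: fraction correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        key = key_fn(rec)
        totals[key] += 1
        hits[key] += int(rec["correct"])
    return {key: hits[key] / totals[key] for key in totals}


def answer_digits(rec):
    """Number of digits in the integer part of the gold answer."""
    return len(str(abs(int(float(rec["label"])))))


# Hypothetical records; real ones would come from the model outputs.
records = [
    {"question": "Tom has 3 apples and buys 4 more. How many now?",
     "label": "7", "correct": True},
    {"question": "A shop sells 120 pens a day. How many in 12 days?",
     "label": "1440", "correct": False},
    {"question": "What is 15 plus 27?", "label": "42", "correct": True},
]

# Accuracy grouped by answer digit count and by question length in characters.
by_digits = accuracy_by(records, answer_digits)
by_length = accuracy_by(records, lambda r: bucket(len(r["question"]), [30, 60]))
```

Grouping by a key function keeps the same aggregation reusable for label length, response length, or any other slice; only the key changes.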

Gemini Benchmark - SVAMP

In this report, we compare the accuracy of Gemini Pro to GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on the mathematical reasoning dataset SVAMP. We examine overall performance, performance by question complexity, and performance by task. For a closer look at individual examples, you can click over to the corresponding Zeno project.

First, looking at the overall results in the figure below, we can see that Gemini Pro achieves an accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. The Mixtral model, by comparison, achieves much lower accuracy.

Accuracy with Last Number

Accuracy by Question Length

Accuracy by Answer Digit
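
The chart title "Accuracy with Last Number" suggests that the predicted answer is taken to be the last number appearing in a model's response. A rough sketch of such a scoring rule follows. The regular expression, the comma handling, the tolerance, and the function names are assumptions for illustration, not necessarily what this report uses.

```python
import re

# Assumed pattern: optional sign, digits with optional thousands commas,
# optional decimal part. Expressions like "10-3" would yield "-3".
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number found in text as a float, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(response, label, tol=1e-6):
    """Compare the last number in the response with the gold label."""
    pred = last_number(response)
    return pred is not None and abs(pred - float(label)) <= tol
```

A response such as "So the total is 1,440 pens." would be scored against the label using 1440.0, while a response with no digits at all counts as incorrect.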

Gemini Benchmark - ASDIV

In this report, we compare the accuracy of Gemini Pro to GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on the mathematical reasoning dataset ASDIV. We examine overall performance, performance by question complexity, and performance by task. For a closer look at individual examples, you can click over to the corresponding Zeno project.

First, looking at the overall results in the figure below, we can see that Gemini Pro achieves an accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. The Mixtral model, by comparison, achieves much lower accuracy.

Accuracy with Last Number

Accuracy by Question Length

Accuracy by Answer Digit

Gemini Benchmark - MAWPSMultiArith

In this report, we compare the accuracy of Gemini Pro to GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on the mathematical reasoning dataset MAWPSMultiArith. We examine overall performance, performance by question complexity, and performance by task. For a closer look at individual examples, you can click over to the corresponding Zeno project.

First, looking at the overall results in the figure below, we can see that Gemini Pro achieves an accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. The Mixtral model, by comparison, achieves much lower accuracy.

Accuracy

Accuracy by Question Length

Accuracy by Answer Digit
