
Gemini Code

Updated Monday, January 15, 2024 at 1:05 PM

Author:

zichunyu

Linked Projects:

Gemini Benchmark - Code

In this report, we compare the Pass@1 of Gemini Pro with that of GPT 3.5 Turbo and GPT 4 Turbo on two code generation tasks, HumanEval and ODEX. We present overall performance, performance by gold solution length, performance by used library, and a case study. If you want to examine individual examples in more detail, you can click through to the corresponding Zeno projects for HumanEval and ODEX.

First, from the overall results shown in the figures below, we can see that Gemini Pro achieves a Pass@1 lower than GPT 3.5 Turbo and much lower than GPT 4 Turbo on both tasks. The results demonstrate that Gemini's code generation capabilities still have room for improvement.

Performance Overview (HumanEval)

Performance Overview (ODEX)
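For reference, Pass@1 measures the fraction of problems for which a generated solution passes all unit tests. The sketch below shows the standard unbiased pass@k estimator from the HumanEval paper; the sampling setup (number of samples per problem) is an assumption for illustration, not something this report specifies.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n = samples generated for a problem, c = samples passing all tests, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem (n = 1, k = 1), this reduces to the
# fraction of problems whose one generation passes the tests.
toy_results = [{"n": 1, "c": 1}, {"n": 1, "c": 0}, {"n": 1, "c": 1}]  # illustrative data
overall = sum(pass_at_k(r["n"], r["c"], 1) for r in toy_results) / len(toy_results)
print(f"Pass@1 = {overall:.2f}")  # 0.67 on this toy set
```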

Pass@1 by Gold Solution Length

Second, we analyze the relationship between gold solution length and model performance in the figure below. Solution length can partly indicate the difficulty of the corresponding code generation task. We find that although Gemini Pro achieves Pass@1 comparable to GPT 3.5 when the solution length is below 100 (i.e., the easier cases), it falls behind by large margins when the solution becomes longer. This is an interesting contrast to the results from previous sections, where we found that Gemini Pro generally performed robustly with respect to longer inputs and outputs on English language tasks.

Pass@1 by Gold Solution Length
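This kind of slicing can be reproduced with a few lines of pandas. The sketch below assumes a hypothetical per-problem results table with columns gold_length (length of the reference solution, here treated as characters) and passed; these names and the length unit are illustrative, not the exact Zeno configuration.

```python
import pandas as pd

# Hypothetical results table: one row per problem, with the length of the
# gold (reference) solution and whether the model's generation passed all tests.
df = pd.DataFrame({
    "gold_length": [42, 85, 130, 220, 310, 95],
    "passed":      [1,  1,  0,   0,   0,   1],
})

# Bucket problems by gold solution length and compute Pass@1 per bucket.
bins = [0, 100, 200, 300, float("inf")]
labels = ["<100", "100-200", "200-300", "300+"]
df["length_bucket"] = pd.cut(df["gold_length"], bins=bins, labels=labels)
pass_at_1_by_length = df.groupby("length_bucket", observed=True)["passed"].mean()
print(pass_at_1_by_length)
```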

Pass@1 by Used Library

We also analyze how the libraries required by each solution affect model performance in the figure below. Gemini Pro performs worse than GPT 3.5 on cases involving most libraries, such as mock, pandas, numpy, and datetime. However, it outperforms both GPT 3.5 and GPT 4 on the matplotlib cases, showing stronger capabilities at producing plots and visualizations via code.

Pass@1 by Used Library
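One way to build this slice is to detect which libraries each gold solution imports and aggregate Pass@1 per library. The sketch below uses Python's ast module on hypothetical per-problem records; the record fields are assumptions for illustration, not the report's actual pipeline.

```python
import ast

def imported_libraries(code: str) -> set[str]:
    """Return the top-level module names imported by a solution."""
    libs = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            libs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            libs.add(node.module.split(".")[0])
    return libs

# Hypothetical per-problem records: gold solution source plus pass/fail outcome.
records = [
    {"gold": "import numpy as np\ndef f(x):\n    return np.mean(x)", "passed": 0},
    {"gold": "import datetime\ndef g():\n    return datetime.date.today()", "passed": 1},
]

# Aggregate Pass@1 per library used in the gold solution.
outcomes_by_lib: dict[str, list[int]] = {}
for rec in records:
    for lib in imported_libraries(rec["gold"]):
        outcomes_by_lib.setdefault(lib, []).append(rec["passed"])
for lib, outcomes in outcomes_by_lib.items():
    print(lib, sum(outcomes) / len(outcomes))
```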

Case Study

We then conduct a detailed case study to examine why Gemini performs worse at code generation than GPT 3.5. We identify two main reasons for Gemini's weaknesses:

  1. Gemini may misinterpret the intent because of lapses in understanding the natural-language description of the code. As shown in example id 26, the description asks the model to remove all elements that occur more than once, but Gemini merely extracts the unique numbers without removing the duplicated ones (see the illustrative sketch after this list).
  2. Gemini is somewhat worse at correctly choosing functions and arguments from the Python API, such as "bytearray.fromhex" in example id 0, an error rarely seen with the GPT models. This issue may stem from biases in Gemini's training data.

In contrast, GPT 3.5 handles both of these cases reasonably well.
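To make the first failure mode concrete, the sketch below contrasts the intended behavior of example id 26 (remove every element that occurs more than once) with the misreading described above (merely deduplicating the list). This is an illustrative reconstruction, not the actual prompt or model output.

```python
from collections import Counter

def remove_duplicated_elements(numbers: list[int]) -> list[int]:
    """Intended behavior: drop every value that occurs more than once,
    keeping only values that appear exactly one time, in original order."""
    counts = Counter(numbers)
    return [n for n in numbers if counts[n] == 1]

def deduplicate(numbers: list[int]) -> list[int]:
    """Misread of the intent: keep one copy of each value (unique-ify),
    which is the behavior described for Gemini's attempt."""
    return list(dict.fromkeys(numbers))

nums = [1, 2, 3, 2, 4]
print(remove_duplicated_elements(nums))  # [1, 3, 4]  <- what the description asks for
print(deduplicate(nums))                 # [1, 2, 3, 4]

# The second failure mode concerns API usage such as bytearray.fromhex
# (example id 0); the correct call looks like:
print(bytearray.fromhex("deadbeef"))  # bytearray(b'\xde\xad\xbe\xef')
```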

