In this report, we compare the accuracy of Gemini Pro to GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on WebArena, a benchmark that evaluates agents on tasks over simulated websites. If you want to examine the individual examples in more detail, you can click through to the corresponding Zeno project.
First, looking at the overall results in the figure below, we can see that Gemini Pro achieves accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. Mixtral, in turn, achieves much lower accuracy than all of the other models.
There is also a consistent trend for both gpt-3.5-turbo and gemini-pro: performance is better when the prompt includes a "UA" (unachievable) hint, i.e., when the prompt explicitly tells the model that the task may be unachievable.
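To make the setting concrete, here is a minimal sketch of how such a hint might be injected into a task prompt. The hint wording, the `build_prompt` helper, and the overall prompt layout are illustrative assumptions for this post, not WebArena's actual prompt templates.

```python
# A hypothetical UA hint; the exact wording is an assumption, not
# WebArena's real template text.
UA_HINT = (
    "If you believe the task is impossible to complete, "
    'issue the action stop("N/A").'
)


def build_prompt(task_instruction: str, with_ua_hint: bool) -> str:
    """Assemble a task prompt, optionally appending the UA hint."""
    parts = [f"Task: {task_instruction}"]
    if with_ua_hint:
        parts.append(UA_HINT)
    return "\n".join(parts)


if __name__ == "__main__":
    task = "Find the price of the cheapest blue backpack."
    # With the hint, the model is allowed to declare the task unachievable
    # instead of being forced to produce some answer.
    print(build_prompt(task, with_ua_hint=True))
```

The intuition behind the trend above is that, without such a hint, a model facing an impossible task has no sanctioned way to give up and tends to hallucinate an answer instead.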