In this report, we compare the accuracy of Gemini Pro to GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on WebArena, a benchmark that evaluates agents on tasks over simulated websites. If you want to examine the individual examples in more detail, you can click through to the corresponding Zeno project.
First, looking at the overall results in the figure below, we can see that Gemini Pro achieves accuracy slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. Mixtral, in turn, achieves much lower accuracy than all of the other models.
There is also a consistent trend for both gpt-3.5-turbo and gemini-pro: performance is better when the prompt includes a "UA" (unachievable) hint, i.e., when the prompt explicitly tells the model that the task may be unachievable.
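To make the setting concrete, here is a minimal sketch of how such a hint might be injected into a task prompt. The hint wording, the `build_prompt` helper, and the overall prompt layout are illustrative assumptions for this post, not WebArena's actual prompt templates.

```python
# A hypothetical UA hint; the exact wording is an assumption, not
# WebArena's real template text.
UA_HINT = (
    "If you believe the task is impossible to complete, "
    'issue the action stop("N/A").'
)


def build_prompt(task_instruction: str, with_ua_hint: bool) -> str:
    """Assemble a task prompt, optionally appending the UA hint."""
    parts = [f"Task: {task_instruction}"]
    if with_ua_hint:
        parts.append(UA_HINT)
    return "\n".join(parts)


if __name__ == "__main__":
    task = "Find the price of the cheapest blue backpack."
    # With the hint, the model is allowed to declare the task unachievable
    # instead of being forced to produce some answer.
    print(build_prompt(task, with_ua_hint=True))
```

The intuition behind the trend above is that, without such a hint, a model facing an impossible task has no sanctioned way to give up and tends to hallucinate an answer instead.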