DROP Benchmark Exploration

Updated Friday, December 1, 2023 at 8:56 AM

Author:

Alex Cabrera

What is the DROP Benchmark?

The DROP (Discrete Reasoning Over Paragraphs) evaluation dataset consists of English-text paragraphs followed by questions that require a series of reasoning steps to answer (including math, comparison, etc.).

You can explore a couple of example instances, along with all of the data, in the accompanying Zeno project.


The DROP benchmark was recently added to the 🤗 Open LLM Leaderboard. However, there has been major variation in performance between models, with large models we would expect to perform well (such as Falcon 180B) receiving extremely low scores.

In this report, we looked at a subset of models representing a wide range of performance:

  • High Performers - Yi 34B and TigerBot 70B
  • Middle Performers - XGLM 7.5B
  • Low Performers - Mistral 7B and Falcon 180B

Below we can see their F1 scores on the entire dataset. The F1 metric measures the bag-of-words (BoW) overlap between the model output and the ground-truth answers.
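
As a rough illustration of this metric (not the exact DROP evaluation code, which also normalizes punctuation, articles, and numbers), a bag-of-words F1 can be computed like this:

```python
from collections import Counter

# Simplified sketch of a bag-of-words F1 score; the official DROP metric
# applies additional answer normalization that is omitted here for brevity.
def bow_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection of predicted and gold tokens.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(bow_f1("Detroit Lions", "Detroit Lions"))      # 1.0
print(bow_f1("the Detroit Lions", "Detroit Lions"))  # 0.8 -- extra token lowers precision
```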

[Chart: Overall F1 Performance]

Is the performance dependent on the type of question? Not necessarily, as the same models underperform across all question types, with a slightly smaller difference for "span" or text-type answers.

[Chart: Performance by Question Type]

Numeric Outputs

Let's explore numeric outputs first, as they seem to be the hardest across models.

Floating-point Answers

When we looked at some example instances in the Zeno project, we found that for floating-point answers the model outputs often ended right after the decimal point.


It turns out this is the case for all models. Of the 1,600 questions with a decimal answer, not a single model produces a floating-point output of the form #.#!

[Chart: Floating-point Answer (Size)]

After further investigation into the implementation of the DROP benchmark, we found that "." is used as a stop token for generation. Hence, no decimal answer can be generated by any of the models.
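
As a minimal sketch of why this matters, assume the harness truncates generated text at the first stop sequence it finds (the helper below is hypothetical, not the actual leaderboard code):

```python
# Hypothetical illustration: cutting generated text at the first stop sequence.
# If "." is in the stop list, a decimal answer can never survive intact.
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("12.25", stop_sequences=["."]))   # "12"    -- decimal part lost
print(truncate_at_stop("12.25", stop_sequences=["\n"]))  # "12.25" -- survives intact
```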

While this is interesting, it does not explain the discrepancy between the models, as it is consistent across both the high- and low-performing models.

Integer Answers

We next decided to look at questions with integer output labels. When looking at the instances, we found that the underperforming models often output the correct number but were given an F1 score of 0.

[Chart: Integer Labels (F1)]

Aha! We see a huge difference in F1 score here! More importantly, this can be observed in a sizeable 2,000 instances, about one fifth of the dataset. What's happening here?

We see that Mistral and Falcon always return an integer followed by a newline character, which is not parsed correctly and is always counted as wrong. This never happens with the other, higher-performing models.
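
A toy example of the failure mode, assuming a naive string comparison between the raw generation and the gold answer (the parsing logic below is hypothetical, not the leaderboard's exact implementation):

```python
# Hypothetical example: a raw generation ending in a newline fails a naive
# comparison against the gold answer, even though the number is correct.
raw_output = "72\n"   # Mistral/Falcon-style output
gold_answer = "72"

print(raw_output == gold_answer)          # False -- counted as wrong
print(raw_output.strip() == gold_answer)  # True  -- stripping would fix it
```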


[Chart: Integer + Newline Output (size)]

String Answers

Finally, let's take a look at string answers, which also show a huge difference in performance. Here, we observed that the length of the answers varied greatly across models.

[Chart: Output Length]

When looking at the F1 score for short and long answers, we see that short answers score significantly higher. This is due to how the F1 score is calculated: the set of words in the model output is compared to the set of words in the gold label. Naturally, if the model produces many words that are not in the label, it will be scored much lower, even though those words might just be part of a potentially helpful explanation.
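
To make this concrete, here is some rough arithmetic using a bag-of-words F1 on a made-up verbose answer (the example output is invented for illustration):

```python
# Rough arithmetic, assuming a bag-of-words F1 (made-up verbose output).
gold = "Detroit Lions".lower().split()                                   # 2 tokens
verbose = ("The answer is the Detroit Lions because they scored first "
           "in the second quarter of the game").lower().split()          # 17 tokens

overlap = sum(min(verbose.count(t), gold.count(t)) for t in set(gold))   # 2
precision = overlap / len(verbose)   # 2 / 17 ≈ 0.12
recall = overlap / len(gold)         # 2 / 2  = 1.0
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))                  # ≈ 0.21 -- correct answer, heavily penalized
```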

[Chart: Non-numeric vs. Length (F1)]

Conclusion

There is a whole category of questions for which all models fail: any question with a floating-point answer. This is because model generation is stopped at the "." token, which cuts off any floating-point number.

For questions with integer answers, the underperforming models are always given an F1 score of 0 because they add a newline immediately after the output. Our friends at HuggingFace show how this is due to the way answers are tokenized.

Lastly, for open-ended text questions, the underperforming models produce much longer outputs that include a passage of justification. This causes the F1 score to be significantly lower even when the right answer is provided.

Thus, the large disparities in performance are due to two distinct effects:

  1. Incorrect parsing of outputs
  2. Artificially deflated F1 scores for answers that include justification

Enjoyed this report? You can make one too!

Error analysis, chart authoring, shareable reports, and more with Zeno.