There is a whole category of questions on which all models fail: those with floating-point answers. This is because model generation is stopped after a `.` token, which cuts off any floating-point number at the decimal point.
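To illustrate the failure mode, here is a minimal sketch of stop-string truncation as many evaluation harnesses implement it; the helper name `truncate_at_stop` is hypothetical, not taken from any particular harness:

```python
def truncate_at_stop(generation: str, stop: str = ".") -> str:
    """Cut the generation at the first occurrence of the stop string,
    mimicking a harness that treats '.' as an end-of-answer token."""
    idx = generation.find(stop)
    return generation if idx == -1 else generation[:idx]

# A floating-point answer is cut off at its decimal point:
print(truncate_at_stop("3.14"))  # -> "3"
# while an integer answer survives intact:
print(truncate_at_stop("42"))    # -> "42"
```

With this stop rule, every model scores the truncated `"3"` against the gold `"3.14"`, so the entire floating-point category fails regardless of model quality.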
For questions with integer answers, the underperforming models always receive an F1 score of 0 because they emit newlines immediately around the output. Our friends at HuggingFace show how this is caused by the way answers are tokenized.
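A small sketch of how a stray newline can zero out an otherwise correct answer, assuming (as is common in eval harnesses, though the exact convention here is an assumption) that the parser keeps only the text before the first newline:

```python
def extract_answer(generation: str) -> str:
    """Keep only the text before the first newline, a common
    answer-parsing convention in evaluation harnesses (assumed here)."""
    return generation.split("\n", 1)[0].strip()

print(extract_answer("42 is the answer"))  # -> "42 is the answer"
# A model whose tokenization emits a newline before the answer
# yields an empty string, which scores 0 F1 no matter what follows:
print(extract_answer("\n42"))  # -> ""
```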
Lastly, for open-ended text questions, the underperforming models produce much longer outputs containing a passage of justification. This significantly lowers the F1 score even when the right answer is present.
Thus, the large disparities in performance are due to two distinct effects:
- incorrect parsing of outputs
- artificially deflated F1 scores for answers that include justification.
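The second effect can be made concrete with a bag-of-words F1 in the style of SQuAD evaluation (a sketch, not the exact metric used here):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between prediction and gold answer,
    in the style of SQuAD-like evaluation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris", "Paris"))  # -> 1.0
# The same correct answer wrapped in justification is heavily penalized,
# because precision drops with every extra token:
print(token_f1("The answer is Paris because it is the capital of France",
               "Paris"))  # ~0.17
```

The verbose answer contains the gold token, so recall is perfect, but its precision is 1/11, dragging F1 down to about 0.17 despite the answer being right.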