Daniel Gross recently published a dataset of tax questions for evaluating LLMs and released results for four models: GPT-3.5, GPT-4, Claude 2, and Mistral 7B. The dataset is a fantastic testbed for evaluating LLM skills such as reading comprehension and math. Let's take a look at how the models perform.
Summary
- All of the LLMs struggle significantly more with math-based accounting and tax questions than with questions that only require a text answer.
- The GPT models are much better at instruction following than Claude 2 and Mistral 7B.
- Answers show significant positional bias (especially for Claude 2) that should be explored further; one way to probe it is sketched after this list.
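To make the positional-bias point concrete, here is a minimal sketch of one way to probe it: re-ask each multiple-choice question with the options shuffled and count which position the model picks. Both the `(question, options)` data format and the `ask_model` completion call are assumptions for illustration, not the actual harness behind these results.

```python
import random
from collections import Counter

def build_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question with lettered options."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def positional_bias_check(questions, ask_model, trials_per_q: int = 4) -> Counter:
    """Shuffle option order and count which *position* the model picks.

    `questions` is an iterable of (question_text, [option, ...]) pairs,
    and `ask_model` is a stand-in for whatever completion call you use;
    it should return the model's raw text reply.
    """
    position_counts = Counter()
    for question, options in questions:
        for _ in range(trials_per_q):
            shuffled = random.sample(options, k=len(options))
            reply = ask_model(build_prompt(question, shuffled)).strip()
            if reply[:1] in ("A", "B", "C", "D"):
                position_counts[reply[:1]] += 1
    return position_counts
```

If the counts stay heavily skewed toward one letter no matter how the options are shuffled, the model is keying on position rather than content.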
In conclusion, models that can call tools for executing math and retrieving up-to-date legal text may significantly outperform raw LLM baselines; a rough sketch of the math-tool side follows.
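The sketch below lets the model request a calculator instead of doing arithmetic in-context. The `CALC(...)` convention and the `ask_model` function are hypothetical stand-ins for whatever function-calling or completion API you use:

```python
import re

def safe_calc(expression: str):
    """Evaluate basic arithmetic only; the whitelist rejects anything
    that isn't digits, operators, parentheses, or spaces."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    return eval(expression)  # acceptable here given the character whitelist

def answer_with_calculator(question: str, ask_model) -> str:
    """One round of tool use: if the model replies with CALC(<expr>),
    run the arithmetic locally and feed the result back for a final
    answer. `ask_model` stands in for your completion call."""
    instructions = (
        "If the question needs arithmetic, reply with exactly "
        "CALC(<expression>) and nothing else; otherwise answer directly."
    )
    reply = ask_model(f"{instructions}\n\n{question}").strip()
    match = re.fullmatch(r"CALC\((.+)\)", reply)
    if match:
        result = safe_calc(match.group(1))
        reply = ask_model(f"{question}\nCalculator result: {result}\nFinal answer:")
    return reply
```

A production version would use a provider's native function-calling support and a real expression parser, but even this minimal loop separates the reasoning from the arithmetic, which is exactly where the models above fall down.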
