
Can LLMs do Your Taxes?

Updated Monday, November 27, 2023 at 11:32 PM

Author: Alex Cabrera


Daniel Gross recently published a dataset of tax questions for evaluating LLMs and released the results for four models: GPT-3.5, GPT-4, Claude 2, and Mistral 7B. The dataset is a fantastic testbed for evaluating various LLM skills, such as reading comprehension and math. Let's take a look at how the models perform.

Summary

  • All LLMs struggle significantly more with math-based accounting and tax questions than with text-based ones.
  • The GPT models are much better at instruction following.
  • There is significant positional bias in answers (especially for Claude 2) that should be explored further.

In conclusion, models that can call tools for executing math and retrieving updated legal text may significantly outperform raw LLM baselines.

What's in the Dataset?

The dataset has a diverse array of questions, from accounting to legal. About half of the questions require dollar amount answers after doing a few math operations, while the others are about general legal and accounting knowledge.

Here are representative samples of these two types of questions:

[Embedded sample questions]

Overall Performance

Overall, this dataset is quite challenging for LLMs, with the best model scoring only about 60%. The results are unsurprising: GPT-4 outperforms all the other models, Mistral 7B barely does better than chance, and Claude 2 is about on par with GPT-3.5.

Overall Accuracy
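As a rough illustration, per-model accuracy on a multiple-choice dataset like this can be tallied from a list of graded predictions. This is a minimal sketch, not the report's actual evaluation code; the record keys (`model`, `predicted`, `label`) are hypothetical.

```python
from collections import defaultdict

def accuracy_by_model(records):
    """Compute the fraction of correct answers per model.

    `records` is a list of dicts with hypothetical keys:
    'model', 'predicted' (chosen option number), 'label' (ground truth).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["model"]] += 1
        correct[r["model"]] += int(r["predicted"] == r["label"])
    return {m: correct[m] / total[m] for m in total}

# Toy example (not the real dataset):
records = [
    {"model": "gpt-4", "predicted": 2, "label": 2},
    {"model": "gpt-4", "predicted": 1, "label": 3},
    {"model": "mistral-7b", "predicted": 4, "label": 1},
]
print(accuracy_by_model(records))
```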

Performance by Question Type

How do the models do on questions with numerical versus textual answers? Here we see the major weakness of LLMs: their limited ability to reason and do math. Interestingly, the main improvement of GPT-4 is on the textual answers, not the numeric answers that require math.

Numeric vs. Textual answers

Instruction Following

An interesting pattern we found was that the GPT models followed the prompt much better than either Mistral or Claude 2. Each question ended with these instructions:

Answer the question by entering the number of the correct option. Response with a single number and nothing else.

Claude 2 actually answered every single question in the dataset with more than one number.

Output Length > 1
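One way to measure instruction following here is to check whether a response consists of a single option number and nothing else, as the prompt demands. This is a minimal sketch of such a check, not the report's actual method; the regex-based test is an assumption.

```python
import re

def follows_instructions(response: str) -> bool:
    """True if the response is exactly one integer, optionally
    surrounded by whitespace, per the prompt's instruction to
    respond with a single number and nothing else."""
    return re.fullmatch(r"\s*\d+\s*", response) is not None

print(follows_instructions("3"))                 # a lone option number: compliant
print(follows_instructions("The answer is 3."))  # extra text: not compliant
```

Running a check like this over each model's raw outputs gives the "output length > 1" counts charted above.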

Answer Positions

Another interesting pattern was the distribution of outputs. While the dataset is quite balanced across ground-truth answer positions (with slightly fewer answers in position 4), the models' output distributions varied widely.

Answer Position Frequency

Output Position Frequency

Claude 2 is the biggest offender, outputting 3 more than half of the time. We can see this more clearly by looking at a confusion matrix for Claude 2:

Confusion Matrix
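A position confusion matrix like the one above can be built by counting (ground-truth, predicted) option pairs. A minimal sketch with hypothetical toy data:

```python
from collections import Counter

def confusion_matrix(pairs, n_options=4):
    """Count (ground_truth, predicted) option pairs.

    `pairs` is an iterable of (label, predicted) option numbers,
    1-indexed. Returns a nested list: rows are ground-truth
    positions, columns are predicted positions.
    """
    counts = Counter(pairs)
    return [[counts[(t, p)] for p in range(1, n_options + 1)]
            for t in range(1, n_options + 1)]

# Toy example: a model that favors option 3 regardless of the label.
pairs = [(1, 3), (2, 3), (3, 3), (4, 3), (2, 2)]
for row in confusion_matrix(pairs):
    print(row)
```

A heavy column, like column 3 in this toy example, is the signature of positional bias: the model's predictions cluster on one option independent of the ground truth.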

Bonus: Question Topics

Lastly, I trained a simple topic model on the questions to look for any systematic differences between the models. While I did not find any large systematic patterns, there were some interesting outliers: categories where GPT-4 outperformed (government and inventory questions) and one where it underperformed (depreciation).

Accuracy by Topic
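The report doesn't specify which topic model was used. As a stand-in, here is a minimal sketch that buckets questions by hypothetical keyword lists and computes accuracy per bucket; a real analysis would assign topics with a trained model (e.g., LDA) instead.

```python
from collections import defaultdict

# Hypothetical keyword buckets standing in for learned topics.
TOPIC_KEYWORDS = {
    "depreciation": ["depreciation", "depreciate"],
    "inventory": ["inventory", "fifo", "lifo"],
    "government": ["government", "municipal", "federal"],
}

def assign_topic(question: str) -> str:
    """Return the first topic whose keywords appear in the question."""
    q = question.lower()
    for topic, words in TOPIC_KEYWORDS.items():
        if any(w in q for w in words):
            return topic
    return "other"

def accuracy_by_topic(records):
    """`records`: dicts with hypothetical keys 'question',
    'predicted', and 'label'; returns per-topic accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        t = assign_topic(r["question"])
        total[t] += 1
        correct[t] += int(r["predicted"] == r["label"])
    return {t: correct[t] / total[t] for t in total}

# Toy example (not the real dataset):
records = [
    {"question": "Using FIFO, what is ending inventory?", "predicted": 1, "label": 1},
    {"question": "Compute straight-line depreciation.", "predicted": 2, "label": 3},
]
print(accuracy_by_topic(records))
```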

Enjoyed this report? You can make one too!

Error analysis, chart authoring, shareable reports, and more with Zeno.