
What does the OpenLLM Leaderboard measure?

Updated Tuesday, November 7, 2023 at 8:50 PM

Author: Alex Bäuerle

The 🤗 Open LLM Leaderboard is a widely used benchmark for comparing LLMs. Despite its popularity, there have been questions about the validity and usefulness of the benchmark for evaluating real-world performance, with some even saying "the land of open benchmarks is pointless and measures almost nothing useful".

In this report, we used Zeno to dive into the data and explore what the benchmark actually measures. What tasks does it test? What does the data look like?

We find that it is indeed hard to gauge the real-world usability of LLMs from the results of the leaderboard, as the tasks it includes are disconnected from how LLMs are used in practice. Furthermore, we find clear ways the leaderboard can be gamed, such as by exploiting the common structure of ground truth labels. In sum, we hope that this report demonstrates the importance of testing your model in a disaggregated way on data that is representative of the downstream use-cases you care about.

Tasks

Before looking at the leaderboard's results, let's look at the data that backs the leaderboard. We used Zeno to interactively explore, visualize and analyze benchmark data, summarized in this Zeno Report.

The Open LLM Leaderboard combines scores from four NLP benchmarks:

  • MMLU, 57 tasks that cover different fields of world knowledge, over 14,000 questions in total
  • HellaSwag, ~10,000 sentences where the model must select the most likely completion
  • ARC, 1,172 challenging grade-school science questions (test split)
  • TruthfulQA, 817 questions that span 38 categories, including health, law, finance and politics

An overall score is calculated by averaging the performance of models across all four datasets.
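As a quick illustration, this aggregation is simply an unweighted mean of the four benchmark scores; the numbers below are made-up placeholders, not real leaderboard entries:

# Hypothetical per-benchmark accuracies for a single model.
scores = {"ARC": 0.61, "HellaSwag": 0.84, "MMLU": 0.64, "TruthfulQA": 0.45}

# The leaderboard's overall score is the plain average across the four datasets.
overall = sum(scores.values()) / len(scores)
print(f"Overall: {overall:.3f}")  # 0.635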

MMLU

MMLU is a question-answering task (explore the data) where each question has four potential answers, one of which is correct. Questions come from 57 categories, including elementary mathematics, US history, computer science, law, and more. MMLU uses the model's log-likelihood of outputting one of the four letters (A, B, C, D) as the next token after the question. Let's look at a few example instances:

[Embedded Zeno slice: example MMLU instances]
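To make the letter-based scoring concrete, here is a minimal sketch of how such a comparison can be done with a Hugging Face causal language model. The model, prompt template, and question are placeholders, not the evaluation harness's exact implementation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; the leaderboard evaluates much larger LLMs.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Question: Which gas makes up most of Earth's atmosphere?\n"
    "A. Oxygen\nB. Nitrogen\nC. Carbon dioxide\nD. Hydrogen\n"
    "Answer:"
)

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits

# Log-probabilities of the next token, restricted to the four answer letters.
next_token_logprobs = torch.log_softmax(logits[0, -1], dim=-1)
choices = ["A", "B", "C", "D"]
choice_ids = [tokenizer.encode(" " + c)[0] for c in choices]
scores = {c: next_token_logprobs[i].item() for c, i in zip(choices, choice_ids)}
prediction = max(scores, key=scores.get)  # letter with the highest log-likelihood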

We can also use Zeno to slice our data into different subgroups of instances and start to make sense of this benchmark. Below, we can see how the instances are distributed among the 57 categories, with some serious disparities in number of questions:

[Chart: Instances per Task]
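Outside of Zeno, a similar per-task breakdown can be sketched with pandas; the file and column names below are assumptions about an exported system output, not the project's actual schema:

import pandas as pd

# Hypothetical export with one row per MMLU question.
df = pd.read_csv("mmlu_instances.csv")  # assumed columns: "task", "correct"

# Number of instances per task, mirroring the chart above.
counts = df.groupby("task").size().sort_values(ascending=False)

# Per-task accuracy for one model, mirroring the comparison below.
accuracy = df.groupby("task")["correct"].mean().sort_values()
print(counts.head(), accuracy.head(), sep="\n")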

We can also compare the performance of models across specific categories. As seen below, while the 70b parameter Llama model outperforms the 7b Mistral model on most tasks in the MMLU benchmark, there are tasks where you might get a similar performance using the smaller model.

[Chart: Tasks, Mistral vs. Llama]

HellaSwag

The next task is HellaSwag, a common-sense inference task (explore the data). The model is given the start of a sentence and has to choose between four potential continuations. The questions are designed to be relatively easy to answer for humans but hard to get correct for an LLM. Browse some of the examples in this dataset below:

[Embedded Zeno slice: example HellaSwag instances]

Evaluation is done based on the likelihood with which one of the four options would be the next token (A, B, C, D). However, likelihoods are normalized, hence the pred_raw and pred_norm in the model output. Predictions are normalized by dividing the model's next-token prediction for the available options (A, B, C, D) by the length of the respective continuation. The reasoning behind this normalization is that maximum-likelihood trained language models tend to assign lower overall scores to longer sequences. The normalization applied here was first proposed by Cho et al.
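Here is a small sketch of how this normalization can flip the predicted answer; the log-likelihoods and continuations are made up for illustration:

import numpy as np

# Hypothetical log-likelihoods for four candidate continuations.
log_likelihoods = np.array([-9.2, -10.1, -12.4, -11.8])
continuations = [
    "closes the door.",
    "walks away quickly.",
    "carefully places the freshly baked pie on the windowsill to cool.",
    "sits down.",
]

pred_raw = int(np.argmax(log_likelihoods))  # option 0 wins on raw likelihood
# Dividing by continuation length favors the long third option.
lengths = np.array([len(c) for c in continuations])
pred_norm = int(np.argmax(log_likelihoods / lengths))  # option 2 wins after normalization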

While a sensible regularization, this can lead to situations where a choice that was not ranked highly, but represents a particularly long answer, is counted as the model's prediction. The following examples illustrate such cases:

[Embedded Zeno slice: instances where length normalization changes the predicted answer]

ARC

The third benchmark, ARC, is a question-answering task with grade-school science questions (explore the data). Like the HellaSwag task, the next token log-likelihood is used to pick between options with length normalization.

[Embedded Zeno slice: example ARC instances]

TruthfulQA

The last task, TruthfulQA, differs quite a bit from the others, as there is not a single correct answer (explore the data). Interestingly, all the true responses come before the false responses. Preconditioning an LLM with this information might help it perform better on TruthfulQA, thus effectively gaming the leaderboard.

[Embedded Zeno slice: example TruthfulQA instances]

For this benchmark, the leaderboard uses a normalized likelihood score across the true answers: the total probability assigned to the set of true answers divided by the total probability assigned to all answers. The calculation is done as follows:

import numpy as np

def mc2(question, predictions):
    # Split on the first `0` as everything before it is true (`1`).
    split_idx = list(question["labels"]).index(0)
    # `predictions` holds the model's log-likelihood for each answer option.
    ll_true, ll_false = predictions[:split_idx], predictions[split_idx:]
    # Compute the normalized probability mass assigned to the true answers.
    p_true, p_false = np.exp(np.array(ll_true)), np.exp(np.array(ll_false))
    p_true = p_true / (sum(p_true) + sum(p_false))
    return sum(p_true)
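
To show the expected input format, here is a toy invocation of the function above; the labels and log-likelihoods are made up:

# Two true answers followed by two false answers, matching TruthfulQA's ordering.
question = {"labels": [1, 1, 0, 0]}
log_likelihoods = [-1.2, -2.0, -0.8, -3.5]

print(mc2(question, log_likelihoods))  # share of probability mass on the true answers, ~0.48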

Conclusion

Benchmarks such as the Open LLM Leaderboard are awesome tools for systematic model comparison. However, benchmark results are only as expressive as the data they are based on. We find that the log-likelihood multiple-choice questions that make up most of the benchmark differ quite drastically from the real-world tasks people use LLMs for.

For example, it is rare that we want to apply LLMs in simple multiple-choice question answering scenarios where the model has to output a letter for the correct answer. After all, LLMs are so powerful because of their flexibility and their ability to generalize. Hence, we argue that you might want to evaluate different LLMs on your data before using them in production. Maybe you even want to use Zeno for this!


This post is by no means meant to speak unfavorably of evaluation benchmarks. In fact, we applaud the work that many open-source developers pour into projects such as the Eleuther AI Evaluation Harness and the 🤗 Open LLM Leaderboard. However, we encourage you to go one step further and look into the raw data, ideally in the context of your use-case.

You can do that for the tasks we talk about in this report in their respective Zeno projects linked above.

Enjoyed this report? You can make one too!

Error analysis, chart authoring, shareable reports, and more with Zeno.