Zeno AI Evaluation Platform

We were intrigued by the recent Mamba release. It seems like an exciting approach to LLMs. Also, their image of a giant snake attacking a tiny transformer is too cool to just scroll past.

[Image: giant snake attacking a tiny transformer]

Hence, we wanted to do our own evaluation in Zeno and compare the Mamba model family to some commonly used open source models.

Mamba Evaluation Results

First off, the results that the authors report in the paper seem to be pretty reproducible (yay!). The accuracy numbers we got were very similar to the ones reported in the paper.
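One way to make "very similar" concrete is to compare reported and reproduced accuracy per task against a fixed tolerance. The following is a minimal sketch, not the procedure used for this report: the function names, the 0.01 tolerance, and the task names and values are all hypothetical placeholders.

```python
def within_tolerance(reported: float, reproduced: float, tol: float = 0.01) -> bool:
    """Return True if two accuracies (fractions in [0, 1]) differ by at most tol."""
    return abs(reported - reproduced) <= tol


def compare_runs(
    reported: dict[str, float],
    reproduced: dict[str, float],
    tol: float = 0.01,
) -> dict[str, bool]:
    """Check every task that appears in both result sets."""
    shared = sorted(reported.keys() & reproduced.keys())
    return {
        task: within_tolerance(reported[task], reproduced[task], tol)
        for task in shared
    }


# Placeholder values for illustration only; these are NOT measured results.
reported = {"task_a": 0.70, "task_b": 0.55}
reproduced = {"task_a": 0.705, "task_b": 0.52}

print(compare_runs(reported, reproduced))
```

The tolerance is a judgment call: small differences can come from prompt formatting, tokenization, or evaluation-harness versions, so an exact match is usually too strict a criterion.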

Overall, the performance of the Mamba model family is impressive given the size of the models. Mamba-2.8B outperforms similar-size models such as Pythia-2.8B on all tasks it was evaluated on. However, we were interested in how Mamba would compare to even bigger models.

Comparing Mamba to 7B Models

While the authors include the scores of some 7B parameter models, they don't use the newest and hottest ones out there. We know that a 2.8B model is much cheaper and easier to run, but why not compare against some of the current hype?

Specifically, we wanted to see how Mamba compares to Falcon, Vicuna, Llama, and Mistral. The Mamba paper compares the models on six benchmarks, namely Winogrande, Piqa, Lambada, Hellaswag, Arc Easy, and Arc Challenge. Let's dive into the results.
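The charts for each benchmark report overall accuracy per model. As a rough sketch of what that metric means, the snippet below aggregates per-example outcomes into accuracy per (model, task) pair. The record format and the `overall_accuracy` helper are assumptions made for illustration, not Zeno's API, and the records are made-up placeholders.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Aggregate (model, task, correct) tuples into {(model, task): accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, task, correct in records:
        totals[(model, task)] += 1
        hits[(model, task)] += int(bool(correct))
    # Accuracy is the fraction of examples answered correctly.
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical per-example outcomes; not real evaluation data.
records = [
    ("model_a", "task_x", True),
    ("model_a", "task_x", False),
    ("model_b", "task_x", True),
    ("model_b", "task_x", True),
]

acc = overall_accuracy(records)
for (model, task), value in sorted(acc.items()):
    print(f"{model} on {task}: {value:.2f}")
```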

Piqa

On Piqa, Mamba-2.8B, Llama-7B, and Vicuna-7B are almost on par. Falcon-7B and Mistral-7B outperform the other models.

[Chart: overall accuracy by model on Piqa]

Winogrande

In the Winogrande challenge, the Mamba-2.8B model is pretty close to Llama-7B and Falcon-7B. However, Mistral-7B outperforms Mamba by a clear margin.

[Chart: overall accuracy by model on Winogrande]

Lambada

The graph for Lambada looks similar, although the falloff for the smaller Mamba models is more noticeable. Again, the models are close, with Falcon-7B and Mistral-7B pulling ahead.

[Chart: overall accuracy by model on Lambada]

Hellaswag

Hellaswag is where the performance gap between the different model sizes widens. Here we can see that, while the 7B models are pretty close, the smaller Mamba models don't really stand a chance.

[Chart: overall accuracy by model on Hellaswag]

Arc Easy

Similar to Hellaswag, Arc Easy shows how the smaller Mamba models can't keep up with the bigger 7B models in some tasks.

[Chart: overall accuracy by model on Arc Easy]

Arc Challenge

This one is interesting to look at. Notice how the scale for this graph is very different. For the Arc Challenge task, which is much harder than Arc Easy, the difference between the models becomes even bigger.

[Chart: overall accuracy by model on Arc Challenge]

Conclusion

Overall, the performance of the Mamba model family is remarkable. It is also refreshing to see an architectural change in a space that has been dominated by Transformer models for a while. Mamba-2.8B performs almost on par with some of the 7B models, so you should definitely try how well Mamba works on your task. You might be able to save some money by deploying Mamba instead of one of the larger models.

However, the Mamba model family is not magic, and the bigger models generally still outperform Mamba. Mistral in particular, which has an edge over the other 7B models on these benchmarks, performed very strongly and consistently ranked #1.

We are excited to see whether the ideas in the Mamba paper will be picked up by other model developers. We'd love to add a larger model with the Mamba architecture to this evaluation and see how scaling laws work for this architecture.
