Welcome to Zeno

GPT MT Benchmark

cabreraalex

20.2k

TruthfulQA

a13x

817

TruthfulQA (https://arxiv.org/abs/2109.07958) task in the Open-LLM-Leaderboard.

MMLU

a13x

14k

MMLU (https://arxiv.org/abs/2009.03300) tasks in the Open-LLM-Leaderboard.

HellaSwag

a13x

10k

HellaSwag (https://arxiv.org/abs/1905.07830) task in the Open-LLM-Leaderboard.

What does the OpenLLM Leaderboard measure?

a13x

An investigation of the Open LLM Leaderboard and why you should double-check before using the top-ranked model.

GPT MT Benchmark Report

cabreraalex

Explore how LLMs compare to dedicated language translation models, particularly for low-resourced languages.

Exploring the WebArena Agent Environment

cabreraalex

DiffusionDB

cabreraalex

Explore 2 million images generated by Stable Diffusion. From the DiffusionDB dataset: https://poloclub.github.io/diffusiondb/

Web Arena

cabreraalex

100

ARC

a13x

1.2k

ARC (https://arxiv.org/abs/1803.05457) task in the Open-LLM-Leaderboard.

Audio Transcription Accents

cabreraalex

2.1k

Analysis of OpenAI's Whisper transcription models across speakers of different demographic groups.

Gemini MMLU

a13x

Audio Transcription Report

cabreraalex

Analysis of OpenAI's Whisper models across demographic groups.

Whisper Audio Transcription Comparison

cabreraalex

2.1k

Test of audio transcription

Flores Translation Evaluation

aashiqmuhamed

20.2k

Gemini BBH

a13x

What's in the Updated OpenLLM Leaderboard?

cabreraalex

Exploration of the three new tasks in the HuggingFace OpenLLM Leaderboard.

GSM8k OpenLLM Leaderboard

cabreraalex

1.3k

GSM8k task in the Open-LLM-Leaderboard (https://arxiv.org/abs/2110.14168).

multimodal-reasoning

yueqis

1.2k

Gemini Evaluation - MawpsMultiArith

sakter

600

Evaluation of Gemini, GPT-4, and Mixtral on the MawpsMultiArith dataset.