AfroBench

A comprehensive benchmark that evaluates language models on African languages across tasks and datasets spanning question answering, natural language understanding (NLU), natural language generation (NLG), reasoning, and knowledge.

Rows: 12
Primary metric: average_score
Sampled: 2026-05-06

Metadata

Metrics

Average score, Dataset coverage (lower is better), Category qa, Category nlu, Category nlg, Category reasoning, Category knowledge, Task xqa, Task rc, Task ner, Task nli, Task intent, Task topic, Task senti, Task hate, Task pos, Task mt en fr xx, Task adr, Task mt xx en fr, Task summ, Task math, Task arc e, Task mmlu

Latest Results

Rows are parsed from the public AfroBench leaderboard JSON. Score is a transparent macro-average over available dataset-level source scores; per-dataset values are preserved in metadata.
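
As a rough illustration of the macro-average described above, the sketch below averages whatever dataset-level scores are present for a model and skips missing ones. The function name `macro_average`, the input shape (a mapping from dataset name to score or `None`), and the example dataset names and values are all hypothetical; the actual leaderboard JSON schema and aggregation code are not shown here and may differ.

```python
from statistics import fmean
from typing import Mapping, Optional


def macro_average(dataset_scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Unweighted mean over the dataset-level scores that are available.

    Assumption: a missing score is represented as None and is excluded
    from the average rather than counted as zero.
    """
    available = [score for score in dataset_scores.values() if score is not None]
    if not available:
        # No dataset-level scores at all: no average can be reported.
        return None
    return fmean(available)


# Hypothetical example values, not taken from the leaderboard.
example_scores = {"dataset_a": 60.0, "dataset_b": 40.0, "dataset_c": None}
example_average = macro_average(example_scores)
print(example_average)
```

Because each available dataset contributes equally, a model evaluated on fewer datasets is averaged over a smaller set, which is worth keeping in mind when comparing rows with different coverage.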

Rank  Subject         Average score  Model Match                           Provenance  Sampled
1     GPT-4o (Aug)    59.64          GPT-4o (openai-gpt-4o)                Imported    2026-05-06
2     Gemini 1.5 pro  58.49          -                                     Imported    2026-05-06
3     Gemma2 27b      47.92          Gemma 2 27B (google-gemma-2-27b-it)   Imported    2026-05-06
4     LLaMa3.1 70B    43.52          -                                     Imported    2026-05-06
5     Gemma2 9b       43.10          -                                     Imported    2026-05-06
6     Aya-101 13B     40.34          -                                     Imported    2026-05-06
7     LLaMAX3 8B      30.14          -                                     Imported    2026-05-06
8     LLaMa3.1 8B     29.53          -                                     Imported    2026-05-06
9     Gemma1.1 7b     29.09          -                                     Imported    2026-05-06
10    LLaMa3 8B       28.83          -                                     Imported    2026-05-06
11    LLaMa2 7b       22.49          -                                     Imported    2026-05-06
12    AfroLLaMa 8B    19.79          -                                     Imported    2026-05-06