AutoBench

Dynamic LLM benchmarking platform using multi-model generated agentic environments and collective LLM-as-judge scoring, with quality, cost, latency, and iteration metrics.

Rows: 32
Primary metric: quality_score
Sampled: 2026-05-06

Metadata

Metrics

Quality Score, Quality Rank (lower is better), Avg Cost (lower is better), Cost Rank (lower is better), Avg Latency (lower is better), Latency Rank (lower is better), P99 Latency (lower is better), P99 Latency Rank (lower is better), Iterations
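The rank metrics above presumably derive from their value metrics, with the sort direction depending on whether lower or higher is better. The following is a minimal sketch of that idea, not AutoBench's actual implementation. The field names (`subject`, `quality_score`, `avg_cost`, and so on), the `Metric` class, the `rank_rows` helper, and the example rows are all hypothetical; the example values are placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    key: str               # hypothetical column name
    lower_is_better: bool  # direction taken from the metric list above


# Assumed mapping of value metrics to their direction.
METRICS = {
    "quality_score": Metric("quality_score", lower_is_better=False),
    "avg_cost": Metric("avg_cost", lower_is_better=True),
    "avg_latency": Metric("avg_latency", lower_is_better=True),
    "p99_latency": Metric("p99_latency", lower_is_better=True),
}


def rank_rows(rows, metric):
    """Return {subject: rank} where rank 1 is the best row for this metric.

    Ties keep their input order because Python's sort is stable.
    """
    ordered = sorted(
        rows,
        key=lambda row: row[metric.key],
        reverse=not metric.lower_is_better,
    )
    return {row["subject"]: rank for rank, row in enumerate(ordered, start=1)}


# Placeholder rows for illustration only.
EXAMPLE_ROWS = [
    {"subject": "model-a", "quality_score": 3.1, "avg_cost": 0.20},
    {"subject": "model-b", "quality_score": 2.9, "avg_cost": 0.05},
    {"subject": "model-c", "quality_score": 3.0, "avg_cost": 0.10},
]

quality_rank = rank_rows(EXAMPLE_ROWS, METRICS["quality_score"])
cost_rank = rank_rows(EXAMPLE_ROWS, METRICS["avg_cost"])
```

In this sketch a row can rank first on quality and last on cost, which is why the leaderboard reports a separate rank per metric.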

Latest Results

Rows are ranked by Quality Score, highest first. Related Hugging Face leaderboard: https://huggingface.co/spaces/AutoBench/AutoBench-Leaderboard

Rank | Subject | Quality Score | Model Match | Provenance | Sampled
1 | Claude Opus 4.7 | 3.30 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-06
2 | Claude Opus 4.6 | 3.24 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-06
3 | Gemini 3.1 Pro Preview | 3.21 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-06
4 | Claude Sonnet 4.6 | 3.16 | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-06
5 | GLM 5.1 | 3.15 | GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-06
6 | GPT-5.4 (xhigh) | 3.13 | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-06
7 | Mimo V2 Pro | 3.10 | MiMo-V2-Pro (xiaomi-mimo-v2-pro) | Imported | 2026-05-06
8 | Qwen3.6 Plus | 3.07 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-06
9 | Kimi K2.5 | 3.02 | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-06
10 | MiniMax M2.7 | 3.01 | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-06
11 | Grok 4.20 | 3.00 | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-06
12 | Claude haiku 4.5 | 2.99 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-06
13 | Gemini 3 Flash Preview | 2.98 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-06
14 | GLM 4.7 | 2.92 | GLM 4.7 (z-ai-glm-4.7) | Imported | 2026-05-06
15 | GPT-5.4 Mini (xhigh) | 2.91 | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-06
16 | Grok 4.1 fast | 2.84 | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-06
17 | Qwen3.5 122B A10B | 2.84 | Qwen3.5-122B-A10B (qwen-qwen3.5-122b-a10b) | Imported | 2026-05-06
18 | Qwen3.5 35B A3B | 2.82 | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Imported | 2026-05-06
19 | Gemini 3.1 Flash Lite Preview | 2.82 | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Imported | 2026-05-06
20 | Nemotron 3 Super 120B A12B | 2.80 | Nemotron 3 Super (nvidia-nemotron-3-super-120b-a12b) | Imported | 2026-05-06
21 | Gemma 4 31B IT | 2.79 | Gemma 4 31B (google-gemma-4-31b-it) | Imported | 2026-05-06
22 | MiniMax M2.5 | 2.79 | MiniMax M2.5 (minimax-minimax-m2.5) | Imported | 2026-05-06
23 | GPT-5.4 Nano (xhigh) | 2.78 | GPT-5.4 Nano (openai-gpt-5.4-nano) | Imported | 2026-05-06
24 | Gpt oss 120b | 2.76 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-06
25 | Nemotron 3 Nano 30B A3B | 2.71 | Nemotron 3 Nano 30B A3B (nvidia-nemotron-3-nano-30b-a3b) | Imported | 2026-05-06
26 | Mistral Small 4 | 2.69 | Mistral: Mistral Small 4 (mistralai-mistral-small-2603) | Imported | 2026-05-06
27 | Nova 2 lite v1 | 2.66 | Nova 2 Lite (amazon-nova-2-lite-v1) | Imported | 2026-05-06
28 | Gpt oss 20b | 2.65 | gpt-oss-20b (openai-gpt-oss-20b) | Imported | 2026-05-06
29 | Deepseek v3.2 | 2.64 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-06
30 | Mistral large 2512 | 2.62 | Mistral: Mistral Large 3 2512 (mistralai-mistral-large-2512) | Imported | 2026-05-06
31 | Gemma 4 26B A4B IT | 2.61 | Gemma 4 26B A4B (google-gemma-4-26b-a4b-it) | Imported | 2026-05-06
32 | Llama 4 Maverick | 2.27 | Llama 4 Maverick (meta-llama-4-maverick) | Imported | 2026-05-06