AutoBench
Dynamic LLM benchmarking platform using multi-model generated agentic environments and collective LLM-as-judge scoring, with quality, cost, latency, and iteration metrics.
32 rows
Primary metric: quality_score
Sampled: 2026-05-06
Metadata
Metrics
Quality Score, Quality Rank (lower is better), Avg Cost (lower is better), Cost Rank (lower is better), Avg Latency (lower is better), Latency Rank (lower is better), P99 Latency (lower is better), P99 Latency Rank (lower is better), Iterations
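The rank metrics above can be read as orderings of the underlying values: ascending where lower is better (cost, latency) and descending for Quality Score. The sketch below shows one way such ranks and a P99 latency could be derived. It is not AutoBench's actual implementation. The function names, the tie handling (tied values share the best rank), and the nearest-rank percentile method are all assumptions, and the numbers are illustrative rather than taken from the leaderboard.

```python
def rank(values, lower_is_better=True):
    """Return 1-based ranks for values; tied values share the best rank (assumed)."""
    ordered = sorted(values, reverse=not lower_is_better)
    # index() finds the first position of each value in the sorted order,
    # so equal values receive the same rank.
    return [ordered.index(v) + 1 for v in values]


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    k = (99 * len(ordered) + 99) // 100
    return ordered[k - 1]


# Illustrative values only, not taken from the leaderboard.
quality = [3.1, 2.9, 3.1, 2.5]
avg_cost = [0.04, 0.01, 0.02, 0.01]

quality_rank = rank(quality, lower_is_better=False)  # higher score ranks first
cost_rank = rank(avg_cost)                           # lower cost ranks first
```

In the table below, rows with the same displayed Quality Score still receive distinct ranks, so the platform presumably ranks on unrounded scores or applies a tie-breaker that this sketch does not model.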
| Rank | Subject | Quality Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 3.30 | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-06 |
| 2 | Claude Opus 4.6 | 3.24 | Claude Opus 4.6 anthropic-claude-opus-4.6 | Imported | 2026-05-06 |
| 3 | Gemini 3.1 Pro Preview | 3.21 | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-06 |
| 4 | Claude Sonnet 4.6 | 3.16 | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-06 |
| 5 | GLM 5.1 | 3.15 | GLM 5.1 z-ai-glm-5.1 | Imported | 2026-05-06 |
| 6 | GPT-5.4 (xhigh) | 3.13 | GPT-5.4 openai-gpt-5.4 | Imported | 2026-05-06 |
| 7 | MiMo V2 Pro | 3.10 | MiMo-V2-Pro xiaomi-mimo-v2-pro | Imported | 2026-05-06 |
| 8 | Qwen3.6 Plus | 3.07 | Qwen3.6 Plus qwen-qwen3.6-plus | Imported | 2026-05-06 |
| 9 | Kimi K2.5 | 3.02 | MoonshotAI: Kimi K2.5 moonshotai-kimi-k2.5 | Imported | 2026-05-06 |
| 10 | MiniMax M2.7 | 3.01 | MiniMax M2.7 minimax-minimax-m2.7 | Imported | 2026-05-06 |
| 11 | Grok 4.20 | 3.00 | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-06 |
| 12 | Claude Haiku 4.5 | 2.99 | Claude Haiku 4.5 anthropic-claude-haiku-4.5 | Imported | 2026-05-06 |
| 13 | Gemini 3 Flash Preview | 2.98 | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-05-06 |
| 14 | GLM 4.7 | 2.92 | GLM 4.7 z-ai-glm-4.7 | Imported | 2026-05-06 |
| 15 | GPT-5.4 Mini (xhigh) | 2.91 | GPT-5.4 Mini openai-gpt-5.4-mini | Imported | 2026-05-06 |
| 16 | Grok 4.1 Fast | 2.84 | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-05-06 |
| 17 | Qwen3.5 122B A10B | 2.84 | Qwen3.5-122B-A10B qwen-qwen3.5-122b-a10b | Imported | 2026-05-06 |
| 18 | Qwen3.5 35B A3B | 2.82 | Qwen3.5-35B-A3B qwen-qwen3.5-35b-a3b | Imported | 2026-05-06 |
| 19 | Gemini 3.1 Flash Lite Preview | 2.82 | Gemini 3.1 Flash Lite Preview google-gemini-3.1-flash-lite-preview | Imported | 2026-05-06 |
| 20 | Nemotron 3 Super 120B A12B | 2.80 | Nemotron 3 Super nvidia-nemotron-3-super-120b-a12b | Imported | 2026-05-06 |
| 21 | Gemma 4 31B IT | 2.79 | Gemma 4 31B google-gemma-4-31b-it | Imported | 2026-05-06 |
| 22 | MiniMax M2.5 | 2.79 | MiniMax M2.5 minimax-minimax-m2.5 | Imported | 2026-05-06 |
| 23 | GPT-5.4 Nano (xhigh) | 2.78 | GPT-5.4 Nano openai-gpt-5.4-nano | Imported | 2026-05-06 |
| 24 | gpt-oss-120b | 2.76 | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-05-06 |
| 25 | Nemotron 3 Nano 30B A3B | 2.71 | Nemotron 3 Nano 30B A3B nvidia-nemotron-3-nano-30b-a3b | Imported | 2026-05-06 |
| 26 | Mistral Small 4 | 2.69 | Mistral: Mistral Small 4 mistralai-mistral-small-2603 | Imported | 2026-05-06 |
| 27 | Nova 2 Lite v1 | 2.66 | Nova 2 Lite amazon-nova-2-lite-v1 | Imported | 2026-05-06 |
| 28 | gpt-oss-20b | 2.65 | gpt-oss-20b openai-gpt-oss-20b | Imported | 2026-05-06 |
| 29 | DeepSeek V3.2 | 2.64 | DeepSeek V3.2 deepseek-deepseek-v3.2 | Imported | 2026-05-06 |
| 30 | Mistral Large 2512 | 2.62 | Mistral: Mistral Large 3 2512 mistralai-mistral-large-2512 | Imported | 2026-05-06 |
| 31 | Gemma 4 26B A4B IT | 2.61 | Gemma 4 26B A4B google-gemma-4-26b-a4b-it | Imported | 2026-05-06 |
| 32 | Llama 4 Maverick | 2.27 | Llama 4 Maverick meta-llama-4-maverick | Imported | 2026-05-06 |