CAIA Benchmark
Crypto AI Agent benchmark for LLM-based agents solving web3 and crypto tasks across answer quality, reasoning, and tool-use dimensions.
7 rows
Primary metric: total_score
Sampled: 2026-05-06
Metadata
Metrics: Average Score, Answer Score, Reasoning Score, Tool-use Score
| Rank | Subject | Average Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Surf-06-21 | 65.98 | — | Imported | 2026-05-06 |
| 2 | openai-o3-2025-04-16 | 50.03 | — | Imported | 2026-05-06 |
| 3 | demo_gpt_o4_mini | 34.77 | — | Imported | 2026-05-06 |
| 4 | claude-4-sonnet-nonthinking | 30.06 | — | Imported | 2026-05-06 |
| 5 | demo_gpt_4.1 | 26.40 | — | Imported | 2026-05-06 |
| 6 | Deepseek-R1 | 24.69 | — | Imported | 2026-05-06 |
| 7 | demo_deepseek-r1-250120 | 21.55 | — | Imported | 2026-05-06 |
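
For readers who want to work with these numbers programmatically, the following is a minimal sketch that re-derives the ranking from the Average Score column and reports each subject's gap to the leader. The scores are copied from the table above. The names `ROWS`, `rank_rows`, and `gap_to_leader` are hypothetical helpers introduced for illustration and are not part of any CAIA tooling.

```python
# Leaderboard rows copied from the table above: (subject, average score).
ROWS = [
    ("Surf-06-21", 65.98),
    ("openai-o3-2025-04-16", 50.03),
    ("demo_gpt_o4_mini", 34.77),
    ("claude-4-sonnet-nonthinking", 30.06),
    ("demo_gpt_4.1", 26.40),
    ("Deepseek-R1", 24.69),
    ("demo_deepseek-r1-250120", 21.55),
]


def rank_rows(rows):
    """Return (rank, subject, score) tuples sorted by score, highest first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


def gap_to_leader(rows):
    """Map each subject to how many points it trails the top score."""
    ranked = rank_rows(rows)
    top_score = ranked[0][2]
    return {subject: round(top_score - score, 2) for _, subject, score in ranked}


if __name__ == "__main__":
    gaps = gap_to_leader(ROWS)
    for rank, subject, score in rank_rows(ROWS):
        print(f"{rank}. {subject}: {score:.2f} (gap {gaps[subject]:.2f})")
```

Sorting on the score rather than trusting the stored Rank column makes it easy to confirm that the published order matches the Average Score values.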