CAIA Benchmark

The Crypto AI Agent (CAIA) benchmark evaluates LLM-based agents on web3 and crypto tasks, scoring them on answer quality, reasoning, and tool use.

Rows: 7
Primary metric: total_score
Sampled: 2026-05-06

Metadata

Metrics

Average Score, Answer Score, Reasoning Score, Tool-use Score

Latest Results

Rank  Subject                      Average Score  Model Match  Provenance  Sampled
1     Surf-06-21                   65.98                       Imported    2026-05-06
2     openai-o3-2025-04-16         50.03                       Imported    2026-05-06
3     demo_gpt_o4_mini             34.77                       Imported    2026-05-06
4     claude-4-sonnet-nonthinking  30.06                       Imported    2026-05-06
5     demo_gpt_4.1                 26.40                       Imported    2026-05-06
6     Deepseek-R1                  24.69                       Imported    2026-05-06
7     demo_deepseek-r1-250120      21.55                       Imported    2026-05-06