AMA-Bench

A benchmark of agent and model memory and reasoning across six domains (text-to-SQL, software, web, game, embodied AI, and open-world QA) that evaluates four capabilities: recall, causal inference, state updating, and state abstraction.

Rows: 22
Primary metric: ama_score
Sampled: 2026-05-06

Metadata

Metrics

AMA Score, TEXT2SQL Score, SOFTWARE Score, WEB Score, GAME Score, EMBODIED_AI Score, OPENWORLD_QA Score, Recall, Causal Inference, State Updating, State Abstraction

Latest Results

Rank  Subject                  AMA Score  Model Match                                 Provenance  Sampled
1     gpt 5.2                  0.71       GPT-5.2 (openai-gpt-5.2)                    Verified    2026-05-06
2     GPT-5 mini               0.67       GPT-5 Mini (openai-gpt-5-mini)              Verified    2026-05-06
3     AMA-agent                0.57       -                                           Verified    2026-05-06
4     Gemini 2.5 flash         0.51       Gemini 2.5 Flash (google-gemini-2.5-flash)  Verified    2026-05-06
5     Long context             0.51       -                                           Verified    2026-05-06
6     Qwen3-32B                0.51       Qwen3 32B (qwen-qwen3-32b)                  Verified    2026-05-06
7     Qwen3-14B                0.46       Qwen3 14B (qwen-qwen3-14b)                  Verified    2026-05-06
8     Qwen2.5-14B-Instruct-1M  0.46       -                                           Verified    2026-05-06
9     Memorag                  0.46       -                                           Verified    2026-05-06
10    Hipporag2                0.44       -                                           Verified    2026-05-06
11    Claude Haiku 3.5         0.43       -                                           Verified    2026-05-06
12    Qwen3-Embedding-4B       0.42       -                                           Verified    2026-05-06
13    Qwen3-8B                 0.41       Qwen3 8B (qwen-qwen3-8b)                    Verified    2026-05-06
14    Memorybank               0.34       -                                           Verified    2026-05-06
15    Memgpt                   0.33       -                                           Verified    2026-05-06
16    GRAPHRAG                 0.33       -                                           Verified    2026-05-06
17    Amem                     0.32       -                                           Verified    2026-05-06
18    Mem-alpha                0.31       -                                           Verified    2026-05-06
19    Memagent                 0.27       -                                           Verified    2026-05-06
20    Mem0                     0.21       -                                           Verified    2026-05-06
21    Simple mem               0.19       -                                           Verified    2026-05-06
22    Mem1                     0.12       -                                           Verified    2026-05-06