AMA-Bench
Benchmark for agent and model memory and reasoning across six domains (text-to-SQL, software, web, game, embodied AI, and open-world QA), scored on four capabilities: recall, causal inference, state updating, and state abstraction.
Rows: 22
Primary metric: ama_score
Sampled: 2026-05-06
Metadata
Metrics
AMA Score, TEXT2SQL Score, SOFTWARE Score, WEB Score, GAME Score, EMBODIED_AI Score, OPENWORLD_QA Score, Recall, Causal Inference, State Updating, State Abstraction
| Rank | Subject | AMA Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt 5.2 | 0.71 | GPT-5.2 openai-gpt-5.2 | Verified | 2026-05-06 |
| 2 | GPT-5 mini | 0.67 | GPT-5 Mini openai-gpt-5-mini | Verified | 2026-05-06 |
| 3 | AMA-agent | 0.57 | — | Verified | 2026-05-06 |
| 4 | Gemini 2.5 flash | 0.51 | Gemini 2.5 Flash google-gemini-2.5-flash | Verified | 2026-05-06 |
| 5 | Long context | 0.51 | — | Verified | 2026-05-06 |
| 6 | Qwen3-32B | 0.51 | Qwen3 32B qwen-qwen3-32b | Verified | 2026-05-06 |
| 7 | Qwen3-14B | 0.46 | Qwen3 14B qwen-qwen3-14b | Verified | 2026-05-06 |
| 8 | Qwen2.5-14B-Instruct-1M | 0.46 | — | Verified | 2026-05-06 |
| 9 | Memorag | 0.46 | — | Verified | 2026-05-06 |
| 10 | Hipporag2 | 0.44 | — | Verified | 2026-05-06 |
| 11 | Claude Haiku 3.5 | 0.43 | — | Verified | 2026-05-06 |
| 12 | Qwen3-Embedding-4B | 0.42 | — | Verified | 2026-05-06 |
| 13 | Qwen3-8B | 0.41 | Qwen3 8B qwen-qwen3-8b | Verified | 2026-05-06 |
| 14 | Memorybank | 0.34 | — | Verified | 2026-05-06 |
| 15 | Memgpt | 0.33 | — | Verified | 2026-05-06 |
| 16 | GRAPHRAG | 0.33 | — | Verified | 2026-05-06 |
| 17 | Amem | 0.32 | — | Verified | 2026-05-06 |
| 18 | Mem-alpha | 0.31 | — | Verified | 2026-05-06 |
| 19 | Memagent | 0.27 | — | Verified | 2026-05-06 |
| 20 | Mem0 | 0.21 | — | Verified | 2026-05-06 |
| 21 | Simple mem | 0.19 | — | Verified | 2026-05-06 |
| 22 | Mem1 | 0.12 | — | Verified | 2026-05-06 |