SpatialBench
Spatial transcriptomics agent benchmark with verifiable spatial biology analysis tasks and deterministic graders.
20 rows
Primary metric: accuracy
Sampled: 2026-05-28
Metadata
Metrics: Accuracy, Cost (lower is better)
Showing the 2 latest source slices. Ranks restart at 1 within each slice.
| Rank | Subject | Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | 53.8% | Claude Mythos Preview anthropic-claude-mythos-preview | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.8 | 53.3% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.7 | 51.4% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 4 | Claude Sonnet 4.6 | 48.7% | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Self-reported | 2026-05-28 |
| 1 | gpt-5.5 via mini-swe-agent | 57.65% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-27 |
| 2 | gpt-5.4 via mini-swe-agent | 57.44% | GPT-5.4 openai-gpt-5.4 | Imported | 2026-05-27 |
| 3 | gpt-5.5 via openai-codex | 53.67% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-27 |
| 4 | claude-opus-4-6 via mini-swe-agent | 52.83% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Imported | 2026-05-27 |
| 5 | claude-opus-4-7 via mini-swe-agent | 52.41% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-27 |
| 6 | gemini-3.1-pro-preview via mini-swe-agent | 51.57% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-27 |
| 7 | claude-opus-4-7 via claude-code | 51.36% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-27 |
| 8 | gpt-5.2 via mini-swe-agent | 50.1% | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-27 |
| 9 | grok-4.20-beta-0309-reasoning via mini-swe-agent | 45.91% | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-27 |
| 10 | claude-sonnet-4-6 via mini-swe-agent | 44.23% | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-27 |
| 11 | claude-opus-4-5 via mini-swe-agent | 42.77% | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-05-27 |
| 12 | claude-sonnet-4-5 via mini-swe-agent | 41.51% | Claude Sonnet 4.5 anthropic-claude-sonnet-4.5 | Imported | 2026-05-27 |
| 13 | gpt-5.1 via mini-swe-agent | 39.83% | GPT-5.1 openai-gpt-5.1 | Imported | 2026-05-27 |
| 14 | grok-4-1-fast-reasoning via mini-swe-agent | 33.96% | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-05-27 |
| 15 | grok-4 via mini-swe-agent | 31.87% | Grok 4 x-ai-grok-4 | Imported | 2026-05-27 |
| 16 | gemini-2.5-pro via mini-swe-agent | 28.93% | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-27 |
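
The Rank column restarts for each group of rows, so ranks are only comparable within a group. The sketch below reproduces that per-slice ranking for a few rows copied from the table. It assumes a "source slice" corresponds to a (provenance, sampled date) pair, which the page does not state; `ROWS` and `rank_within_slices` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# A few rows copied from the table: (subject, accuracy %, provenance, sampled).
ROWS = [
    ("Claude Mythos Preview", 53.8, "Self-reported", "2026-05-28"),
    ("Claude Opus 4.8", 53.3, "Self-reported", "2026-05-28"),
    ("gpt-5.5 via mini-swe-agent", 57.65, "Imported", "2026-05-27"),
    ("gpt-5.4 via mini-swe-agent", 57.44, "Imported", "2026-05-27"),
    ("gpt-5.5 via openai-codex", 53.67, "Imported", "2026-05-27"),
]


def rank_within_slices(rows):
    """Group rows by (provenance, sampled) and rank by accuracy, highest first.

    Assumption: a slice is identified by provenance plus sampled date.
    """
    slices = defaultdict(list)
    for subject, accuracy, provenance, sampled in rows:
        slices[(provenance, sampled)].append((subject, accuracy))

    ranked = {}
    for key, members in slices.items():
        ordered = sorted(members, key=lambda m: m[1], reverse=True)
        ranked[key] = [
            (rank, subject, accuracy)
            for rank, (subject, accuracy) in enumerate(ordered, start=1)
        ]
    return ranked


if __name__ == "__main__":
    for (provenance, sampled), entries in rank_within_slices(ROWS).items():
        print(f"{provenance} ({sampled})")
        for rank, subject, accuracy in entries:
            print(f"  {rank}. {subject}: {accuracy}%")
```

Under this grouping, a self-reported 53.8% and an imported 57.65% can both hold rank 1, so comparing results across slices should use the accuracy values rather than the rank numbers.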