SpatialBench

Spatial transcriptomics agent benchmark with verifiable spatial biology analysis tasks and deterministic graders.

Rows: 20
Primary metric: accuracy
Sampled: 2026-05-28

Metadata

Metrics

Accuracy (higher is better), Cost (lower is better)

Showing 2 latest source slices.

Latest Results

Provider-published system-card benchmark scores parsed from Anthropic's Claude Opus 4.8 capability evaluation tables. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

Rank | Subject | Accuracy | Model Match | Provenance | Sampled
1 | Claude Mythos Preview | 53.8% | Claude Mythos Preview (anthropic-claude-mythos-preview) | Self-reported | 2026-05-28
2 | Claude Opus 4.8 | 53.3% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28
3 | Claude Opus 4.7 | 51.4% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28
4 | Claude Sonnet 4.6 | 48.7% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Self-reported | 2026-05-28

Rank | Subject | Accuracy | Model Match | Provenance | Sampled
1 | gpt-5.5 via mini-swe-agent | 57.65% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-27
2 | gpt-5.4 via mini-swe-agent | 57.44% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-27
3 | gpt-5.5 via openai-codex | 53.67% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-27
4 | claude-opus-4-6 via mini-swe-agent | 52.83% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-27
5 | claude-opus-4-7 via mini-swe-agent | 52.41% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-27
6 | gemini-3.1-pro-preview via mini-swe-agent | 51.57% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-27
7 | claude-opus-4-7 via claude-code | 51.36% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-27
8 | gpt-5.2 via mini-swe-agent | 50.1% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-27
9 | grok-4.20-beta-0309-reasoning via mini-swe-agent | 45.91% | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-27
10 | claude-sonnet-4-6 via mini-swe-agent | 44.23% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-27
11 | claude-opus-4-5 via mini-swe-agent | 42.77% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-27
12 | claude-sonnet-4-5 via mini-swe-agent | 41.51% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-27
13 | gpt-5.1 via mini-swe-agent | 39.83% | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-27
14 | grok-4-1-fast-reasoning via mini-swe-agent | 33.96% | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-27
15 | grok-4 via mini-swe-agent | 31.87% | Grok 4 (x-ai-grok-4) | Imported | 2026-05-27
16 | gemini-2.5-pro via mini-swe-agent | 28.93% | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-27
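
Because the self-reported rows are source claims, one way to sanity-check them is to line them up against imported rows that share the same model slug. The sketch below does this for the two models that appear in both slices. The `ROWS` list and the `compare_by_slug` helper are hypothetical names introduced for illustration, not part of the benchmark; the scores are copied from the rows above. The self-reported rows do not name an agent harness, so a gap may reflect a harness difference rather than a reproduction failure.

```python
from collections import defaultdict

# (subject, accuracy %, model slug, provenance), copied from the leaderboard rows.
# Only models present in both source slices are included.
ROWS = [
    ("Claude Opus 4.7", 51.4, "anthropic-claude-opus-4.7", "Self-reported"),
    ("Claude Sonnet 4.6", 48.7, "anthropic-claude-sonnet-4.6", "Self-reported"),
    ("claude-opus-4-7 via mini-swe-agent", 52.41, "anthropic-claude-opus-4.7", "Imported"),
    ("claude-opus-4-7 via claude-code", 51.36, "anthropic-claude-opus-4.7", "Imported"),
    ("claude-sonnet-4-6 via mini-swe-agent", 44.23, "anthropic-claude-sonnet-4.6", "Imported"),
]


def compare_by_slug(rows):
    """Return {(slug, imported subject): self-reported minus imported accuracy}."""
    grouped = defaultdict(lambda: {"Self-reported": [], "Imported": []})
    for subject, accuracy, slug, provenance in rows:
        grouped[slug][provenance].append((subject, accuracy))

    gaps = {}
    for slug, by_provenance in grouped.items():
        for _, claimed in by_provenance["Self-reported"]:
            for subject, imported in by_provenance["Imported"]:
                # Positive gap: the self-reported score is higher than the imported one.
                gaps[(slug, subject)] = round(claimed - imported, 2)
    return gaps


if __name__ == "__main__":
    for (slug, subject), gap in sorted(compare_by_slug(ROWS).items()):
        print(f"{slug:30s} {subject:40s} {gap:+.2f} pts")
```

On these rows, the Claude Opus 4.7 claim sits within about one point of both imported runs, while the Claude Sonnet 4.6 claim is several points above the imported mini-swe-agent run.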