scBench
Bioinformatics agent benchmark with verifiable single-cell RNA-seq workflow tasks and deterministic graders.
20 rows
Primary metric: accuracy
Sampled: 2026-05-28
Metadata
Metrics: Accuracy; Cost (lower is better)
Showing the 2 latest source slices; ranks restart within each slice.
| Rank | Subject | Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | 58.2% | Claude Mythos Preview anthropic-claude-mythos-preview | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.8 | 58.2% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.7 | 55.3% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 4 | Claude Sonnet 4.6 | 50.4% | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Self-reported | 2026-05-28 |
| 1 | gpt-5.5 via mini-swe-agent | 57.95% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-27 |
| 2 | gpt-5.5 via openai-codex | 57.78% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-27 |
| 3 | gpt-5.4 via mini-swe-agent | 57.44% | GPT-5.4 openai-gpt-5.4 | Imported | 2026-05-27 |
| 4 | claude-opus-4-7 via mini-swe-agent | 55.21% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-27 |
| 5 | claude-opus-4-7 via claude-code | 54.02% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-27 |
| 6 | gemini-3.1-pro-preview via mini-swe-agent | 53.85% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-27 |
| 7 | claude-opus-4-6 via mini-swe-agent | 52.65% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Imported | 2026-05-27 |
| 8 | gpt-5.2 via mini-swe-agent | 52.31% | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-27 |
| 9 | claude-sonnet-4-6 via mini-swe-agent | 50.26% | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-27 |
| 10 | claude-opus-4-5 via mini-swe-agent | 47.18% | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-05-27 |
| 11 | grok-4.20-beta-0309-reasoning via mini-swe-agent | 44.44% | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-27 |
| 12 | grok-4.3 via mini-swe-agent | 44.27% | Grok 4.3 x-ai-grok-4.3 | Imported | 2026-05-27 |
| 13 | gpt-5.1 via mini-swe-agent | 38.80% | GPT-5.1 openai-gpt-5.1 | Imported | 2026-05-27 |
| 14 | claude-sonnet-4-5 via mini-swe-agent | 33.16% | Claude Sonnet 4.5 anthropic-claude-sonnet-4.5 | Imported | 2026-05-27 |
| 15 | grok-4-1-fast-reasoning via mini-swe-agent | 30.26% | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-05-27 |
| 16 | gemini-2.5-pro via mini-swe-agent | 23.59% | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-27 |
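The table lists two source slices one after the other, with the rank column starting again at 1 for the second slice. As a minimal sketch of how such per-slice ranks could be rebuilt from the rows, the Python below groups a handful of rows copied from the table and sorts each group by accuracy. Treating a slice as the pair (provenance, sampled date) is an assumption made for illustration, and `rank_by_slice` is a hypothetical helper, not part of scBench or of the leaderboard's tooling.

```python
from collections import defaultdict

# A few rows copied from the table: (subject, accuracy %, provenance, sampled date).
ROWS = [
    ("Claude Mythos Preview", 58.2, "Self-reported", "2026-05-28"),
    ("Claude Opus 4.8", 58.2, "Self-reported", "2026-05-28"),
    ("Claude Opus 4.7", 55.3, "Self-reported", "2026-05-28"),
    ("gpt-5.5 via mini-swe-agent", 57.95, "Imported", "2026-05-27"),
    ("gpt-5.5 via openai-codex", 57.78, "Imported", "2026-05-27"),
    ("claude-opus-4-7 via mini-swe-agent", 55.21, "Imported", "2026-05-27"),
]


def rank_by_slice(rows):
    """Rank rows by accuracy (highest first) within each slice.

    Assumption: a slice is identified by (provenance, sampled date).
    """
    slices = defaultdict(list)
    for subject, accuracy, provenance, sampled in rows:
        slices[(provenance, sampled)].append((subject, accuracy))

    ranked = {}
    for key, members in slices.items():
        # sorted() is stable, so rows with equal accuracy keep their input order.
        ordered = sorted(members, key=lambda m: m[1], reverse=True)
        ranked[key] = [
            (rank, subject, accuracy)
            for rank, (subject, accuracy) in enumerate(ordered, start=1)
        ]
    return ranked


if __name__ == "__main__":
    for (provenance, sampled), entries in rank_by_slice(ROWS).items():
        print(f"{provenance} ({sampled})")
        for rank, subject, accuracy in entries:
            print(f"  {rank}. {subject}: {accuracy}%")
```

Because the self-reported figures are given to one decimal place and the imported ones to two, comparing across slices is less precise than comparing within one, which is one reason to keep the rankings separate.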