LAB-Bench

Biology research-assistant benchmark spanning literature QA, database QA, protocol understanding, figure interpretation, and related lab-research tasks.

2 rows
Primary metric: open_response_mean_accuracy
Sampled: 2026-05-27

Metadata

Metrics

Open-response mean accuracy, CloningScenarios accuracy, ProtocolQA accuracy, FigQA accuracy

Latest Results

Rows are transcribed from the open-response results in Supplemental Table 5 of the public LAB-Bench arXiv paper. The primary score is a BenchmarkList-derived unweighted mean of the CloningScenarios, ProtocolQA, and FigQA open-response accuracies.

Rank | Subject | Open-response mean accuracy | Model Match | Provenance | Sampled
1 | Claude 3.5 Sonnet | 0.266667 | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-27
2 | GPT-4o | 0.233333 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
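
The primary score described above is an unweighted mean over three open-response accuracies. A minimal sketch of that calculation follows, assuming per-task accuracies arrive as a plain dictionary keyed by task name. The function name `open_response_mean_accuracy`, the six-decimal rounding (chosen to match the precision shown in the table), and the input values in the usage line are illustrative assumptions, not BenchmarkList's actual code or any model's reported scores.

```python
from statistics import fmean

# Task names taken from the Metrics list on this page.
TASKS = ("CloningScenarios", "ProtocolQA", "FigQA")


def open_response_mean_accuracy(per_task):
    """Return the unweighted mean of the three open-response accuracies.

    `per_task` maps each task name to an accuracy in [0, 1].
    Hypothetical helper: every task gets weight 1/3 regardless of
    how many questions it contains.
    """
    missing = [task for task in TASKS if task not in per_task]
    if missing:
        raise ValueError(f"missing task accuracies: {missing}")
    # Rounding to six decimals is an assumption based on the table's display.
    return round(fmean(per_task[task] for task in TASKS), 6)


# Made-up inputs for illustration only; not values from the source table.
example = {"CloningScenarios": 0.5, "ProtocolQA": 0.25, "FigQA": 0.0}
print(open_response_mean_accuracy(example))
```

Because the mean is unweighted, a task with few questions moves the primary score as much as a task with many, which is worth keeping in mind when comparing rows.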