InferenceBench

Benchmark for autonomous CLI agents optimizing OpenAI-compatible LLM inference servers under a fixed one-H100, two-hour budget, with quality and integrity gates and scenario-specific speedup metrics.

Rows: 22
Primary metric: aggregate_speedup
Sampled: 2026-05-20

Metadata

Metrics

Aggregate Speedup, Aggregate SEM (lower is better), Prefill Latency, Prefill SEM (lower is better), Decode Latency, Decode SEM (lower is better), Throughput, Throughput SEM (lower is better), All-in-one, All-in-one SEM (lower is better)
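Each metric above is paired with an SEM column. The page does not say how those values are computed, so the following is only a minimal sketch of the textbook standard error of the mean (sample standard deviation divided by the square root of the run count). The helper name `mean_and_sem` and the three sample values are hypothetical, not benchmark data.

```python
import math
import statistics


def mean_and_sem(samples):
    """Return (mean, standard error of the mean) for repeated measurements."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate the SEM")
    mean = statistics.fmean(samples)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


# Hypothetical speedups from three repeated runs of one scenario.
runs = [4.0, 4.2, 3.8]
mean, sem = mean_and_sem(runs)
print(f"{mean:.2f}x +/- {sem:.2f}")
```

A smaller SEM means the repeated runs agree more closely, which is why the SEM columns are marked lower is better.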

Latest Results

Rows are ranked by aggregate speedup across search systems, agents, default backends, and the PyTorch baseline. Source agent ranks are preserved in metadata. Scores are speedups over the PyTorch baseline on Mistral-7B-Instruct-v0.3 with one H100 80 GB and a two-hour budget per agent or search run.

| Rank | Subject | Aggregate Speedup | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | SMAC3 (search, 2 h vLLM) | 11.53x | | Imported | 2026-05-20 |
| 2 | TPE (search, 2 h vLLM) | 11.25x | | Imported | 2026-05-20 |
| 3 | Random (search, 2 h vLLM) | 10.20x | | Imported | 2026-05-20 |
| 4 | Claude Sonnet 4.6 / Claude Code | 8.08x | | Imported | 2026-05-20 |
| 5 | GLM-5 / OpenCode | 6.20x | | Imported | 2026-05-20 |
| 6 | Gemini 3.1 Pro / OpenCode | 6.16x | | Imported | 2026-05-20 |
| 7 | GPT-5.3 Codex (High) / Codex CLI | 5.48x | | Imported | 2026-05-20 |
| 8 | GPT-5.4 (High) / Codex CLI | 5.08x | | Imported | 2026-05-20 |
| 9 | GPT-5.3 Codex (Medium) / Codex CLI | 4.86x | | Imported | 2026-05-20 |
| 10 | GPT-5.5 (High) / Codex CLI | 4.22x | | Imported | 2026-05-20 |
| 11 | vLLM Default | 4.05x | | Imported | 2026-05-20 |
| 12 | SGLang Default | 3.92x | | Imported | 2026-05-20 |
| 13 | Claude Opus 4.6 / Claude Code | 3.89x | | Imported | 2026-05-20 |
| 14 | GPT-5.2 / Codex CLI | 3.82x | | Imported | 2026-05-20 |
| 15 | GPT-5.1 Codex Max / Codex CLI | 3.54x | | Imported | 2026-05-20 |
| 16 | Claude Opus 4.5 / Claude Code | 3.37x | | Imported | 2026-05-20 |
| 17 | HF TGI Default | 3.30x | | Imported | 2026-05-20 |
| 18 | Claude Sonnet 4.5 / Claude Code | 2.96x | | Imported | 2026-05-20 |
| 19 | Claude Opus 4.7 / Claude Code | 2.25x | | Imported | 2026-05-20 |
| 20 | GPT-5.2 Codex / Codex CLI | 1.55x | | Imported | 2026-05-20 |
| 21 | Claude Haiku 4.5 / Claude Code | 1.24x | | Imported | 2026-05-20 |
| 22 | PyTorch Baseline | 1.00x | | Imported | 2026-05-20 |
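One way to read these results is to compare agent runs against an untuned default backend. The sketch below transcribes the 22 rows and counts the agent runs that score above vLLM Default. The `kind` labels are an assumption: they are a manual grouping into the four categories named in the ranking note (search systems, agents, default backends, baseline), not a field published with the results. The function name `agents_vs_reference` is likewise hypothetical.

```python
# (subject, aggregate speedup over the PyTorch baseline, assumed category)
ROWS = [
    ("SMAC3 (search, 2 h vLLM)", 11.53, "search"),
    ("TPE (search, 2 h vLLM)", 11.25, "search"),
    ("Random (search, 2 h vLLM)", 10.20, "search"),
    ("Claude Sonnet 4.6 / Claude Code", 8.08, "agent"),
    ("GLM-5 / OpenCode", 6.20, "agent"),
    ("Gemini 3.1 Pro / OpenCode", 6.16, "agent"),
    ("GPT-5.3 Codex (High) / Codex CLI", 5.48, "agent"),
    ("GPT-5.4 (High) / Codex CLI", 5.08, "agent"),
    ("GPT-5.3 Codex (Medium) / Codex CLI", 4.86, "agent"),
    ("GPT-5.5 (High) / Codex CLI", 4.22, "agent"),
    ("vLLM Default", 4.05, "default"),
    ("SGLang Default", 3.92, "default"),
    ("Claude Opus 4.6 / Claude Code", 3.89, "agent"),
    ("GPT-5.2 / Codex CLI", 3.82, "agent"),
    ("GPT-5.1 Codex Max / Codex CLI", 3.54, "agent"),
    ("Claude Opus 4.5 / Claude Code", 3.37, "agent"),
    ("HF TGI Default", 3.30, "default"),
    ("Claude Sonnet 4.5 / Claude Code", 2.96, "agent"),
    ("Claude Opus 4.7 / Claude Code", 2.25, "agent"),
    ("GPT-5.2 Codex / Codex CLI", 1.55, "agent"),
    ("Claude Haiku 4.5 / Claude Code", 1.24, "agent"),
    ("PyTorch Baseline", 1.00, "baseline"),
]


def agents_vs_reference(rows, reference="vLLM Default"):
    """Split agent rows into those scoring above the reference row and the rest."""
    ref_score = next(score for name, score, _ in rows if name == reference)
    agents = [(name, score) for name, score, kind in rows if kind == "agent"]
    ahead = [name for name, score in agents if score > ref_score]
    behind = [name for name, score in agents if score <= ref_score]
    return ahead, behind


ahead, behind = agents_vs_reference(ROWS)
print(f"{len(ahead)} of {len(ahead) + len(behind)} agent runs beat vLLM Default")
```

Passing `reference="SGLang Default"` or `reference="HF TGI Default"` repeats the comparison against the other default backends.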