AgentLeak

Full-stack privacy leakage benchmark for multi-agent LLM systems across output, inter-agent, tool, memory, log, and artifact channels.

Rows: 5
Primary metric: total_leak
Sampled: 2026-05-06

Metadata

Metrics

Total Leak (lower is better), C1 Output Leak (lower is better), C2 Internal Leak (lower is better), H1 Audit Gap (lower is better)

Latest Results

Rows ranked by lowest Total Leak. Source model display names are preserved.

Rank  Subject            Total Leak  Model Match                                                 Provenance  Sampled
1     Claude-3.5-Sonnet  55.20       Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet)             Imported    2026-05-06
2     GPT-4o-mini        76.30       GPT-4o-mini (openai-gpt-4o-mini)                            Imported    2026-05-06
3     GPT-4o             77.60       GPT-4o (openai-gpt-4o)                                      Imported    2026-05-06
4     Llama-3.3-70B      89.90       Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct)  Imported    2026-05-06
5     Mistral-Large      99.30       Mistral Large (mistralai-mistral-large)                     Imported    2026-05-06
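
The ranking rule stated for this table (lowest Total Leak first) can be reproduced with a short sketch. The scores are copied from the rows listed here; the tuple layout and the function name `rank_by_total_leak` are illustrative assumptions, not part of the benchmark's own tooling.

```python
# Hypothetical in-memory copy of the leaderboard rows: (subject, total_leak).
# The values come from the table; the data structure is an assumption.
rows = [
    ("GPT-4o", 77.60),
    ("Mistral-Large", 99.30),
    ("Claude-3.5-Sonnet", 55.20),
    ("Llama-3.3-70B", 89.90),
    ("GPT-4o-mini", 76.30),
]


def rank_by_total_leak(rows):
    """Return (rank, subject, total_leak) tuples, lowest Total Leak first."""
    ordered = sorted(rows, key=lambda row: row[1])  # ascending: lower is better
    return [(rank, subject, score)
            for rank, (subject, score) in enumerate(ordered, start=1)]


ranked = rank_by_total_leak(rows)
for rank, subject, total_leak in ranked:
    print(f"{rank} {subject} {total_leak:.2f}")
```

Because Total Leak is a lower-is-better metric, the sort is ascending. Ties are not handled specially here; `sorted` is stable, so tied rows would keep their input order.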