GISA

General information-seeking assistant benchmark for structured item, set, list, and table answers from search-capable systems.

Rows: 16
Primary metric: Overall
Sampled: 2026-05-27

Metadata

Metrics

Overall, Item Exact Match, Set Exact Match, Set F1, List Exact Match, List F1, List Order, Table Exact Match, Table Row F1, Table Item F1
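As a rough illustration of how an exact-match metric and an F1 metric differ on set-valued answers, here is a minimal sketch using the standard set-overlap definitions of precision, recall, and F1. This page does not give GISA's actual scoring rules (normalization, partial credit, aggregation into Overall), so the function names `set_exact_match` and `set_f1` and the handling of empty sets are assumptions, not the benchmark's implementation.

```python
def set_exact_match(predicted, gold):
    """Return True only when both answers contain exactly the same items.

    Assumption: items are compared as-is, with no normalization.
    """
    return set(predicted) == set(gold)


def set_f1(predicted, gold):
    """Standard set-overlap F1 (harmonic mean of precision and recall).

    Assumption: two empty sets count as a perfect match (1.0).
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # share of predicted items that are correct
    recall = overlap / len(ref)      # share of gold items that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = ["a", "b", "c"]
    gold = ["b", "c", "d"]
    print(set_exact_match(predicted, gold))
    print(round(set_f1(predicted, gold), 4))
```

Under these definitions, an answer that gets most items right scores zero on exact match but still earns partial credit on F1, which is one reason a leaderboard might report both.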

Latest Results

Rows are parsed from the seed JSON of the public GISA Hugging Face Space. GISA evaluates information-seeking assistants on structured answers in item, set, list, and table form.

Rank | Subject | Overall | Model Match | Provenance | Sampled
1 | Claude 4.5 Sonnet (thinking) | 19.3 | | Imported | 2026-05-27
2 | Qwen3-Max (thinking) | 17.96 | | Imported | 2026-05-27
3 | Claude 4.5 Sonnet (non-thinking) | 16.36 | | Imported | 2026-05-27
4 | GPT-5.2 (thinking) | 15.82 | | Imported | 2026-05-27
5 | Kimi K2.5 (thinking) | 15.55 | | Imported | 2026-05-27
6 | Gemini 3 Pro (high) | 15.28 | | Imported | 2026-05-27
7 | Gemini 3 Pro (low) | 14.74 | | Imported | 2026-05-27
8 | DeepSeek-V3.2 (thinking) | 14.47 | | Imported | 2026-05-27
9 | GLM-4.7 (thinking) | 14.21 | | Imported | 2026-05-27
10 | Seed-1.8 (thinking) | 13.4 | | Imported | 2026-05-27
11 | DeepSeek-V3.2 (non-thinking) | 11.53 | | Imported | 2026-05-27
12 | Qwen3-235B-A22B (thinking) | 9.65 | | Imported | 2026-05-27
13 | Google Search AI Mode | 9.38 | | Imported | 2026-05-27
14 | OpenAI o4 Mini Deep Research | 7.78 | o4 Mini Deep Research (openai-o4-mini-deep-research) | Imported | 2026-05-27
15 | Perplexity Sonar Pro Search | 7.51 | Sonar Pro Search (perplexity-sonar-pro-search) | Imported | 2026-05-27
16 | GPT-4o Search Preview | 5.63 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27