GeoRC | BenchmarkList

Metadata

Metrics: F1, Precision, Recall, Country Accuracy
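
For readers unfamiliar with the first three metrics listed above, the following is a minimal sketch of the standard definitions of precision, recall, and F1 computed from raw match counts. It assumes the leaderboard reports scores on a 0 to 100 scale and that GeoRC uses the conventional harmonic-mean F1; the source does not state how matches are counted, so the function name `precision_recall_f1` and the `tp`/`fp`/`fn` inputs are hypothetical and purely illustrative.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) as percentages from raw counts.

    tp: items predicted and correct (true positives)
    fp: items predicted but incorrect (false positives)
    fn: correct items that were not predicted (false negatives)

    Assumption: these are the textbook definitions, not necessarily
    the exact scoring procedure used by the benchmark.
    """
    # Precision: share of predictions that were correct.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: share of correct items that were recovered.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


if __name__ == "__main__":
    # Hypothetical counts, not taken from the leaderboard.
    p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, so a subject cannot reach a high F1 by over-predicting or under-predicting alone.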

Rank	Subject	F1	Model Match	Provenance	Sampled
1	Human Expert #3 (Best Expert)	97.33	—	Imported	2026-05-27
2	Human Expert #1	96.67	—	Imported	2026-05-27
3	Human Expert #2	90.00	—	Imported	2026-05-27
4	Human Expert Average	53.92	—	Imported	2026-05-27
5	GPT-4.1	42.30	GPT-4.1 openai-gpt-4.1	Imported	2026-05-27
6	Gemini-2.5-Pro	41.51	Gemini 2.5 Pro google-gemini-2.5-pro	Imported	2026-05-27
7	Gemini-2.5-Flash	41.30	Gemini 2.5 Flash google-gemini-2.5-flash	Imported	2026-05-27
8	Gemini-3-Pro	40.98	Gemini 3 google-gemini-3	Imported	2026-05-27
9	GPT-5	40.56	GPT-5 openai-gpt-5	Imported	2026-05-27
10	Qwen2.5-VL-7B-Instruct	31.63	—	Imported	2026-05-27
11	Gemma-3-12b-it	31.21	Gemma 3 12B google-gemma-3-12b-it	Imported	2026-05-27
12	Llama-3.2-11B-Vision-Instruct	25.86	Llama 3.2 11B Vision Instruct meta-llama-llama-3.2-11b-vision-instruct	Imported	2026-05-27
13	Qwen3-VL-8B-Instruct	23.81	Qwen3 VL 8B Instruct qwen-qwen3-vl-8b-instruct	Imported	2026-05-27
14	Hallucinated	18.48	—	Imported	2026-05-27
15	Random Hallucinated	2.42	—	Imported	2026-05-27