Medmarks

Living medical-model benchmark spanning verifiable and open-ended clinical evaluation families with model/system comparisons.

83 rows
win_rate primary metric
2026-05-27 sampled

Metadata

Metrics

Win Rate, Benchmark Mean

Latest Results

Rows are parsed from the Medmarks public CSV and the model metadata JSON. Every source row is kept; the verifiable and open-ended subsets are distinguished in each row's display name and metadata, and ranks are assigned separately within each subset.
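Because the subset is carried in the display name as a trailing "(open-ended)" or "(verifiable)" suffix, the rows below can be split back into per-subset rankings. The following is a minimal sketch of that step, not the importer's actual code: the `Subject` type, the `split_subject` helper, and the in-memory `rows` list are hypothetical names introduced for illustration, and the three sample rows are copied from the table below.

```python
import re
from typing import NamedTuple, Optional


class Subject(NamedTuple):
    """Hypothetical container for a parsed display name."""
    name: str
    subset: Optional[str]


# Assumption: the subset always appears as the final parenthesised token.
_SUBSET_RE = re.compile(r"^(?P<name>.*\S)\s*\((?P<subset>open-ended|verifiable)\)$")


def split_subject(display_name: str) -> Subject:
    """Split 'GPT-5.2 (medium) (open-ended)' into model name and subset."""
    cleaned = display_name.strip()
    match = _SUBSET_RE.match(cleaned)
    if match is None:
        # No recognised suffix: keep the name and leave the subset unset.
        return Subject(cleaned, None)
    return Subject(match.group("name"), match.group("subset"))


# Sample rows taken from the table; win rates are kept at full precision.
rows = [
    ("GPT-5.2 (medium) (open-ended)", 0.6389159522138381),
    ("GPT-5.2 (medium) (verifiable)", 0.6236362195525137),
    ("Gemini 3 Pro Preview (verifiable)", 0.6627770031943667),
]

# Group by subset, then sort each group by win rate, highest first.
by_subset: dict = {}
for display_name, win_rate in rows:
    subject = split_subject(display_name)
    by_subset.setdefault(subject.subset, []).append((subject.name, win_rate))

for ranked in by_subset.values():
    ranked.sort(key=lambda row: row[1], reverse=True)
```

Grouping before sorting keeps the two subsets from being ranked against each other, which matches how the rank column restarts at 1 for the verifiable rows.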

Rank Subject Win Rate Model Match Provenance Sampled
1 GPT-5.2 (medium) (open-ended) 0.6389159522138381 GPT-5.2
openai-gpt-5.2
Imported 2026-05-27
2 GPT-5.1 (medium) (open-ended) 0.6243980841829406 GPT-5.1
openai-gpt-5.1
Imported 2026-05-27
3 Baichuan M3 235B (open-ended) 0.5677789621546633 Imported 2026-05-27
4 gpt-oss 120b (high) (open-ended) 0.5507240209717496 gpt-oss-120b
openai-gpt-oss-120b
Imported 2026-05-27
5 Qwen3 235B-A22B Thinking (open-ended) 0.5160613633727644 Qwen3 235B A22B Thinking 2507
qwen-qwen3-235b-a22b-thinking-2507
Imported 2026-05-27
6 Claude Sonnet 4.5 (open-ended) 0.49977366098330706 Claude Sonnet 4.5
anthropic-claude-sonnet-4.5
Imported 2026-05-27
7 Baichuan M2 32B (open-ended) 0.4761033570298975 Imported 2026-05-27
8 Gemini 3 Pro Preview (open-ended) 0.4712656820900838 Gemini 3
google-gemini-3
Imported 2026-05-27
9 GLM 4.7 FP8 (open-ended) 0.4518797039744835 GLM 4.7
z-ai-glm-4.7
Imported 2026-05-27
10 gpt-oss 20b (high) (open-ended) 0.4266482358575701 gpt-oss-20b
openai-gpt-oss-20b
Imported 2026-05-27
11 Qwen3 30B-A3B Thinking (open-ended) 0.41304451718604834 Imported 2026-05-27
12 Qwen3 8B (Thinking) (open-ended) 0.3634064599826541 Qwen3 8B
qwen-qwen3-8b
Imported 2026-05-27
1 Gemini 3 Pro Preview (verifiable) 0.6627770031943667 Gemini 3
google-gemini-3
Imported 2026-05-27
2 GPT-5.1 (medium) (verifiable) 0.6395161191059014 GPT-5.1
openai-gpt-5.1
Imported 2026-05-27
3 Grok 4 (verifiable) 0.6342733786197539 Grok 4
x-ai-grok-4
Imported 2026-05-27
4 Claude Sonnet 4.5 (verifiable) 0.6257561642171057 Claude Sonnet 4.5
anthropic-claude-sonnet-4.5
Imported 2026-05-27
5 GPT-5.2 (medium) (verifiable) 0.6236362195525137 GPT-5.2
openai-gpt-5.2
Imported 2026-05-27
6 GLM 4.7 FP8 (verifiable) 0.6199015428298814 GLM 4.7
z-ai-glm-4.7
Imported 2026-05-27
7 Qwen3 235B-A22B Thinking (verifiable) 0.6031541433378664 Qwen3 235B A22B Thinking 2507
qwen-qwen3-235b-a22b-thinking-2507
Imported 2026-05-27
8 Baichuan M3 235B (verifiable) 0.5982526399869947 Imported 2026-05-27
9 Qwen3 Next 80B-A3B Thinking (verifiable) 0.5887883670153007 Qwen3 Next 80B A3B Thinking
qwen-qwen3-next-80b-a3b-thinking
Imported 2026-05-27
10 MiniMax M2.1 (verifiable) 0.5882080602904762 MiniMax M2.1
minimax-minimax-m2.1
Imported 2026-05-27
11 gpt-oss 120b (high) (verifiable) 0.5864776402646992 gpt-oss-120b
openai-gpt-oss-120b
Imported 2026-05-27
12 gpt-oss 120b (medium) (verifiable) 0.5771191625196621 gpt-oss-120b
openai-gpt-oss-120b
Imported 2026-05-27
13 MiniMax M2 (verifiable) 0.5707789479326955 MiniMax M2
minimax-minimax-m2
Imported 2026-05-27
14 Qwen3 Next 80B-A3B Instruct (verifiable) 0.5687149198848898 Qwen3 Next 80B A3B Instruct
qwen-qwen3-next-80b-a3b-instruct
Imported 2026-05-27
15 Intellect 3 (verifiable) 0.565976071102146 INTELLECT-3
prime-intellect-intellect-3
Imported 2026-05-27
16 Qwen3 30B-A3B Thinking FP8 (verifiable) 0.560069094508121 Imported 2026-05-27
17 Qwen3 30B-A3B Thinking 8-bit (verifiable) 0.5591250708061003 Imported 2026-05-27
18 Qwen3 30B-A3B Thinking (verifiable) 0.5587374669826407 Imported 2026-05-27
19 gpt-oss 120b (low) (verifiable) 0.552403723762488 gpt-oss-120b
openai-gpt-oss-120b
Imported 2026-05-27
20 AntAngelMed 100B (verifiable) 0.5523687077377001 Imported 2026-05-27
21 Baichuan M2 32B (verifiable) 0.5520088670998679 Imported 2026-05-27
22 Qwen3 VL 30B-A3B Thinking (verifiable) 0.5508662829903576 Qwen3 VL 30B A3B Thinking
qwen-qwen3-vl-30b-a3b-thinking
Imported 2026-05-27
23 Qwen3 30B-A3B Thinking 4-bit (verifiable) 0.5481082240479177 Imported 2026-05-27
24 GLM 4.5 Air (verifiable) 0.5410241342400426 GLM 4.5 Air
z-ai-glm-4.5-air
Imported 2026-05-27
25 Qwen3 14B (Thinking) (verifiable) 0.5366833465791869 Qwen3 14B
qwen-qwen3-14b
Imported 2026-05-27
26 Llama 3.3 70B Instruct (verifiable) 0.5363568636656147 Llama 3.3 70B Instruct
meta-llama-llama-3.3-70b-instruct
Imported 2026-05-27
27 gpt-oss 20b (high) (verifiable) 0.5361352530208202 gpt-oss-20b
openai-gpt-oss-20b
Imported 2026-05-27
28 Qwen3 30B-A3B Instruct 8-bit (verifiable) 0.5320939419817264 Imported 2026-05-27
29 Qwen3 30B-A3B Instruct (verifiable) 0.5316817490568033 Imported 2026-05-27
30 Qwen3 30B-A3B Instruct FP8 (verifiable) 0.5310679149932498 Imported 2026-05-27
31 Qwen3 30B-A3B Instruct 4-bit (verifiable) 0.5253449626609891 Imported 2026-05-27
32 gpt-oss 20b (medium) (verifiable) 0.5197952213409743 gpt-oss-20b
openai-gpt-oss-20b
Imported 2026-05-27
33 Ling Flash 2.0 (verifiable) 0.517443364568796 Imported 2026-05-27
34 Nemotron Nano V3 30B-A3B (verifiable) 0.511383018613083 Imported 2026-05-27
35 Olmo 3.1 32B Think (verifiable) 0.5067271118560839 Imported 2026-05-27
36 MedGemma 27B (verifiable) 0.5024248457579267 Imported 2026-05-27
37 Olmo 3 32B Think (verifiable) 0.5021350069895713 Olmo 3 32B Think
allenai-olmo-3-32b-think
Imported 2026-05-27
38 Qwen3 8B (Thinking) (verifiable) 0.5019479387680824 Qwen3 8B
qwen-qwen3-8b
Imported 2026-05-27
39 Nemotron Nano 12B V2 (verifiable) 0.500617429141079 Imported 2026-05-27
40 Qwen2.5 32B Instruct (verifiable) 0.5003027449594964 Imported 2026-05-27
41 Qwen3 4B Thinking (verifiable) 0.4890572132245168 Imported 2026-05-27
42 Trinity Mini (verifiable) 0.48530192037153297 Trinity Mini
arcee-ai-trinity-mini
Imported 2026-05-27
43 Hermes 4 70B (verifiable) 0.4849971812302666 Hermes 4 70B
nousresearch-hermes-4-70b
Imported 2026-05-27
44 gpt-oss 20b (low) (verifiable) 0.4820454322540748 gpt-oss-20b
openai-gpt-oss-20b
Imported 2026-05-27
45 Phi 4 Reasoning (verifiable) 0.4814858250317761 Imported 2026-05-27
46 Ministral 3 14B Reasoning (verifiable) 0.4807102626851982 Imported 2026-05-27
47 Hermes 4 14B (verifiable) 0.4792513741037972 Imported 2026-05-27
48 Mirothinker 1.5 30B (verifiable) 0.4767840336895432 Imported 2026-05-27
49 Ministral 3 14B Instruct (verifiable) 0.47596322969546634 Imported 2026-05-27
50 Magistral Small (verifiable) 0.4745038458490505 Imported 2026-05-27
51 Gemma 3 27B (verifiable) 0.4610358850116875 Gemma 3 27B
google-gemma-3-27b-it
Imported 2026-05-27
52 Olmo 3.1 32B Instruct (verifiable) 0.45600284965867655 Olmo 3.1 32B Instruct
allenai-olmo-3.1-32b-instruct
Imported 2026-05-27
53 DASD 4B Thinking (verifiable) 0.45253644883118105 Imported 2026-05-27
54 Jamba2 Mini 52B (verifiable) 0.44576937105437486 Imported 2026-05-27
55 Granite 4.0H Small (verifiable) 0.44519142551470775 Imported 2026-05-27
56 Ministral 3 8B Instruct (verifiable) 0.4441678595421642 Imported 2026-05-27
57 Gemma 3 12B (verifiable) 0.4378666803080891 Gemma 3 12B
google-gemma-3-12b-it
Imported 2026-05-27
58 Ministral 3 8B Reasoning (verifiable) 0.431176364193834 Imported 2026-05-27
59 Olmo 3 7B Think (verifiable) 0.41298909104304243 Imported 2026-05-27
60 DASD 30B-A3B (verifiable) 0.4078997105339978 Imported 2026-05-27
61 Llama 3.1 8B Instruct (verifiable) 0.39674525595082544 Llama 3.1 8B Instruct
meta-llama-llama-3.1-8b-instruct
Imported 2026-05-27
62 Ministral 3 3B Instruct (verifiable) 0.39273238228563234 Imported 2026-05-27
63 MedGemma 4B (verifiable) 0.37785575712118635 Imported 2026-05-27
64 MedGemma 4B 1.5 (verifiable) 0.3759376783134903 Imported 2026-05-27
65 Ministral 3 3B Reasoning (verifiable) 0.37411793955645295 Imported 2026-05-27
66 Trinity Nano Preview (verifiable) 0.35961788899182445 Imported 2026-05-27
67 SmolLM3 3B (verifiable) 0.3548178824161098 Imported 2026-05-27
68 Granite 4.0H Tiny (verifiable) 0.35220342825432055 Imported 2026-05-27
69 Olmo 3 7B Instruct (verifiable) 0.3503147766281301 Imported 2026-05-27
70 Gemma 3 4B (verifiable) 0.3214340555088692 Gemma 3 4B
google-gemma-3-4b-it
Imported 2026-05-27
71 AFM 4.5B (verifiable) 0.31930428516320947 Imported 2026-05-27