UAVBench

Physically grounded benchmark for autonomous and agentic AI UAV systems, with 50,000 validated flight scenarios and 50,000 multiple-choice UAV reasoning questions spanning navigation, safety, policy, cyber-physical security, ethics, energy, and hybrid reasoning.

39 rows
Primary metric: accuracy
Sampled: 2026-05-06

Metadata

Metrics

Accuracy, Correct Answers, Evaluated Questions

Latest Results

Rows are aggregated from public per-model UAVBench_MCQ result CSV files in the GitHub repository. Accuracy is computed from the is_correct column; reasoning-style breakdowns are retained in metadata.
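
The aggregation described above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual import code: the `is_correct` column name comes from the description, while the `question_id` column, the set of truthy encodings, and the function names `score_rows` and `score_csv_text` are assumptions made for the example.

```python
import csv
import io

# Assumption: is_correct may be stored as 1/0, True/False, or yes/no.
TRUTHY = {"1", "true", "yes"}


def score_rows(rows):
    """Return (accuracy_percent, correct_answers, evaluated_questions)."""
    evaluated = 0
    correct = 0
    for row in rows:
        evaluated += 1
        value = str(row.get("is_correct", "")).strip().lower()
        if value in TRUTHY:
            correct += 1
    # Guard against an empty file so the division never fails.
    accuracy = round(100.0 * correct / evaluated, 2) if evaluated else 0.0
    return accuracy, correct, evaluated


def score_csv_text(text):
    """Score one per-model result CSV supplied as a string."""
    return score_rows(csv.DictReader(io.StringIO(text)))


# Hypothetical four-question result file for a single model.
sample = "question_id,is_correct\nq1,True\nq2,False\nq3,True\nq4,True\n"
accuracy, correct, evaluated = score_csv_text(sample)
print(accuracy, correct, evaluated)
```

The three returned values correspond to the Accuracy, Correct Answers, and Evaluated Questions metrics listed for this benchmark. Sorting the per-model tuples by accuracy in descending order would give a ranking of the kind shown in the results table.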

| Rank | Subject | Accuracy (%) | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | qwen/qwen3-235b-a22b-2507 | 83.55 | Qwen3 235B A22B Instruct 2507 (qwen-qwen3-235b-a22b-2507) | Imported | 2026-05-06 |
| 2 | openai/chatgpt-4o-latest | 80.35 |  | Imported | 2026-05-06 |
| 3 | openai/gpt-5-chat | 80.15 | GPT-5 Chat (openai-gpt-5-chat) | Imported | 2026-05-06 |
| 4 | qwen/qwen3-max | 79.85 | Qwen3 Max (qwen-qwen3-max) | Imported | 2026-05-06 |
| 5 | openai/gpt-4.1 | 79.05 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-06 |
| 6 | openai/gpt-4.1-mini | 78.10 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-06 |
| 7 | moonshotai/kimi-k2-0905 | 77.75 | MoonshotAI: Kimi K2 0905 (moonshotai-kimi-k2-0905) | Imported | 2026-05-06 |
| 8 | opengvlab/internvl3-78b | 77.10 |  | Imported | 2026-05-06 |
| 9 | anthropic/claude-haiku-4.5 | 77.05 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-06 |
| 10 | mistralai/mistral-medium-3.1 | 76.85 | Mistral: Mistral Medium 3.1 (mistralai-mistral-medium-3.1) | Imported | 2026-05-06 |
| 11 | google/gemini-2.5-flash | 76.75 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-06 |
| 12 | microsoft/phi-4-reasoning-plus | 76.75 |  | Imported | 2026-05-06 |
| 13 | qwen/qwen3-vl-8b-instruct | 75.95 | Qwen3 VL 8B Instruct (qwen-qwen3-vl-8b-instruct) | Imported | 2026-05-06 |
| 14 | deepseek/deepseek-chat-v3-0324 | 75.90 | DeepSeek V3 0324 (deepseek-deepseek-chat-v3-0324) | Imported | 2026-05-06 |
| 15 | baidu/ernie-4.5-300b-a47b | 75.45 | ERNIE 4.5 300B A47B (baidu-ernie-4.5-300b-a47b) | Imported | 2026-05-06 |
| 16 | meta-llama/llama-4-scout | 75.10 | Llama 4 Scout (meta-llama-llama-4-scout) | Imported | 2026-05-06 |
| 17 | deepseek/deepseek-v3.2-exp | 73.55 | DeepSeek V3.2 Exp (deepseek-deepseek-v3.2-exp) | Imported | 2026-05-06 |
| 18 | google/gemma-3n-e4b-it | 73.25 | Gemma 3n 4B (google-gemma-3n-e4b-it) | Imported | 2026-05-06 |
| 19 | deepseek/deepseek-v3.1-terminus | 72.70 | DeepSeek V3.1 Terminus (deepseek-deepseek-v3.1-terminus) | Imported | 2026-05-06 |
| 20 | x-ai/grok-4-fast | 72.60 | Grok 4 Fast (x-ai-grok-4-fast) | Imported | 2026-05-06 |
| 21 | liquid/lfm-2.2-6b | 69.75 |  | Imported | 2026-05-06 |
| 22 | qwen/qwen-2.5-7b-instruct | 66.05 | Qwen2.5 7B Instruct (qwen-qwen-2.5-7b-instruct) | Imported | 2026-05-06 |
| 23 | liquid/lfm2-8b-a1b | 65.80 |  | Imported | 2026-05-06 |
| 24 | allenai/olmo-2-0325-32b-instruct | 65.55 |  | Imported | 2026-05-06 |
| 25 | meta-llama/llama-3.1-8b-instruct | 65.30 | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Imported | 2026-05-06 |
| 26 | meta-llama/llama-3.2-3b-instruct | 62.00 | Llama 3.2 3B Instruct (meta-llama-llama-3.2-3b-instruct) | Imported | 2026-05-06 |
| 27 | ai21/jamba-mini-1.7 | 59.30 |  | Imported | 2026-05-06 |
| 28 | anthropic/claude-sonnet-4.5 | 58.40 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-06 |
| 29 | ibm-granite/granite-4.0-h-micro | 57.80 | Granite 4.0 Micro (ibm-granite-granite-4.0-h-micro) | Imported | 2026-05-06 |
| 30 | z-ai/glm-4.6 | 41.70 | GLM 4.6 (z-ai-glm-4.6) | Imported | 2026-05-06 |
| 31 | qwen/qwen3-30b-a3b | 5.55 | Qwen3 30B A3B (qwen-qwen3-30b-a3b) | Imported | 2026-05-06 |
| 32 | nvidia/nemotron-nano-9b-v2 | 2.40 | Nemotron Nano 9B V2 (nvidia-nemotron-nano-9b-v2) | Imported | 2026-05-06 |
| 33 | minimax/minimax-m1 | 1.75 | MiniMax M1 (minimax-minimax-m1) | Imported | 2026-05-06 |
| 34 | baidu/ernie-4.5-21b-a3b-thinking | 0.00 | ERNIE 4.5 21B A3B Thinking (baidu-ernie-4.5-21b-a3b-thinking) | Imported | 2026-05-06 |
| 35 | deepseek/deepseek-r1-0528-qwen3-8b | 0.00 |  | Imported | 2026-05-06 |
| 36 | minimax/minimax-m2 | 0.00 | MiniMax M2 (minimax-minimax-m2) | Imported | 2026-05-06 |
| 37 | minimax/minimax-m2:free | 0.00 |  | Imported | 2026-05-06 |
| 38 | nvidia/llama-3.3-nemotron-super-49b-v1.5 | 0.00 | Llama 3.3 Nemotron Super 49B V1.5 (nvidia-llama-3.3-nemotron-super-49b-v1.5) | Imported | 2026-05-06 |
| 39 | openai/gpt-oss-safeguard-20b | 0.00 | gpt-oss-safeguard-20b (openai-gpt-oss-safeguard-20b) | Imported | 2026-05-06 |