MLX Benchmark V2

A benchmark that evaluates LLM proficiency with Apple's MLX machine learning framework, using 520 questions that span 11 categories, 6 question types, and 4 difficulty levels.

Rows: 18
Primary metric: accuracy
Sampled: 2026-05-06

Metadata

Metrics

Overall: Accuracy, Correct, Total
Accuracy by question type: qa, fill_blank, mcq, true_false, coding, debug
Accuracy by difficulty: easy, medium, hard, very-hard
Accuracy by category: mlx_core, mlx_nn, mlx_optimizers, mlx_lm_lora, mlx_embeddings_lora, mlx_lm, mlx_vlm, mlx_embeddings, coding, debugging, conceptual
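
The Accuracy, Correct, and Total metrics listed above are presumably related in the usual way: accuracy is the share of correct answers out of all scored questions, expressed as a percentage. The page does not state the formula or the rounding, so the sketch below is an assumption; the function name `accuracy_percent` and the counts in the usage line are hypothetical and not taken from the results.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to two decimals.

    Assumes accuracy = 100 * correct / total, which this page does not confirm.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)


# Hypothetical counts for illustration only.
print(accuracy_percent(466, 520))  # 89.62
```

The same function could be applied to any subset (one question type, one difficulty level, one category) by passing that subset's correct and total counts.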

Latest Results

Rank  Subject                                                  Accuracy (%)  Model Match                         Provenance  Sampled
1     openrouter/anthropic/claude-sonnet-4.6                   89.62         -                                   Imported    2026-05-06
2     openrouter/google/gemini-3-flash-preview                 82.39         -                                   Imported    2026-05-06
3     openrouter/qwen/qwen3.6-max-preview                      80.13         -                                   Imported    2026-05-06
4     openrouter/google/gemma-4-26b-a4b-it                     75.19         -                                   Imported    2026-05-06
5     openrouter/openai/gpt-5.4-nano                           75.19         GPT-5.4 Nano (openai-gpt-5.4-nano)  Imported    2026-05-06
6     ollama/gemma4:31b-cloud                                  74.42         -                                   Imported    2026-05-06
7     openrouter/x-ai/grok-4.1-fast                            72.69         -                                   Imported    2026-05-06
8     ollama/nemotron-3-super:cloud                            71.15         -                                   Imported    2026-05-06
9     openrouter/google/gemini-2.5-flash-lite-preview-09-2025  67.31         -                                   Imported    2026-05-06
10    openrouter/qwen/qwen3.6-35b-a3b                          52.50         -                                   Imported    2026-05-06
11    openrouter/openai/gpt-5-nano                             41.92         GPT-5 Nano (openai-gpt-5-nano)      Imported    2026-05-06
12    ollama/deepseek-v4-flash:cloud                           24.81         -                                   Imported    2026-05-06
13    ollama/glm-5.1:cloud                                     19.23         -                                   Imported    2026-05-06
14    ollama/minimax-m2.7:cloud                                10.77         -                                   Imported    2026-05-06
15    ollama/kimi-k2.5:cloud                                    4.81         -                                   Imported    2026-05-06
16    ollama/qwen3.5:cloud                                      4.62         -                                   Imported    2026-05-06
17    ollama/ministral-3:14b-cloud                              4.42         -                                   Imported    2026-05-06
18    ollama/kimi-k2.6:cloud                                    3.10         -                                   Imported    2026-05-06