ChatRAG Bench

NVIDIA ChatRAG Bench evaluates conversational question answering over documents or retrieved context across ten derived datasets, including long-context, table reasoning, arithmetic, and unanswerable-question scenarios.

Rows: 8
Primary metric: average_all
Sampled: 2026-05-06

Metadata

Metrics

Average (all), Average (exclude HybriDial), Unanswerable Avg-Both

Latest Results

Rows are parsed from the public Hugging Face dataset README. The main conversational QA table provides the primary score; unanswerable-scenario metrics are merged for models that also appear in the main table.

Rank  Subject              Average (all)  Model Match                       Provenance  Sampled
1     ChatQA-1.5-70B       58.25          -                                 Imported    2026-05-06
2     ChatQA-1.5-8B        55.17          -                                 Imported    2026-05-06
3     ChatQA-1.0-70B       54.14          -                                 Imported    2026-05-06
4     GPT-4-Turbo          54.03          GPT-4 Turbo (openai-gpt-4-turbo)  Imported    2026-05-06
5     GPT-4-0613           53.90          GPT-4 (openai-gpt-4)              Imported    2026-05-06
6     Llama3-instruct-70b  52.52          -                                 Imported    2026-05-06
7     Command-R-Plus       50.93          -                                 Imported    2026-05-06
8     ChatQA-1.0-7B        47.71          -                                 Imported    2026-05-06
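
The merge rule stated under Latest Results (main-table score as the primary metric, unanswerable-scenario metrics attached only for models that also appear in the main table) can be sketched as a left join keyed on the model name. This is a minimal illustration, not the site's actual import code: the function name, the field names, and the unanswerable values are hypothetical placeholders. Only the two Average (all) scores are taken from the table above.

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, int, float, None]]


def merge_unanswerable(
    main_scores: Dict[str, float],
    unanswerable_scores: Dict[str, float],
) -> List[Row]:
    """Left-join unanswerable metrics onto the main table and rank by primary score.

    Models that appear only in `unanswerable_scores` are dropped, because the
    main table is the source of the primary metric.
    """
    rows: List[Row] = []
    for model, average_all in main_scores.items():
        # dict.get returns None when the model has no unanswerable entry.
        extra: Optional[float] = unanswerable_scores.get(model)
        rows.append(
            {
                "subject": model,
                "average_all": average_all,
                "unanswerable_avg_both": extra,
            }
        )

    # Rank by the primary metric, highest first.
    rows.sort(key=lambda r: r["average_all"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


# Average (all) values copied from the results table above.
main = {"ChatQA-1.5-8B": 55.17, "ChatQA-1.5-70B": 58.25}

# Placeholder values for illustration only; NOT real benchmark results.
unanswerable = {"ChatQA-1.5-70B": 1.0, "model-not-in-main-table": 2.0}

rows = merge_unanswerable(main, unanswerable)
for row in rows:
    print(row["rank"], row["subject"], row["average_all"], row["unanswerable_avg_both"])
```

A left join keeps every main-table row even when no unanswerable metric exists for it, so the ranking is never affected by missing secondary metrics.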