AI Accuracy Scoreboard
Live rankings built from multi-model consensus data.
1,260 facts checked
Last updated: March 26, 2026 at 06:11 PM UTC
| # | Model | Accuracy |
|---|---|---|
| 1 | Gemini 2.5 Flash Lite | 85% |
| 2 | GPT-4o Mini | 67% |
| 3 | Claude Haiku 4.5 | 64% |
| 4 | Grok 3 Mini | 56% |
| 5 | o1 | 33% |
| 6 | Claude Opus 4.5 | 30% |
| 7 | Grok 3 | 29% |
| — | GPT-4o | Collecting data... |
| — | Claude Sonnet 4 | Collecting data... |
| — | Gemini 2.5 Flash | Collecting data... |
Accuracy by Category
| # | Model | Accuracy | Claims evaluated |
|---|---|---|---|
| 1 | Grok 3 | 60% | 5 |
| 2 | o1 | 50% | 6 |
| 3 | Claude Opus 4.5 | 47% | 15 |
Methodology
Accuracy is measured by cross-model consensus: each question is sent to multiple AI models simultaneously, their answers are decomposed into individual claims, and the claims are compared across models using algorithmic semantic matching. A model counts as accurate when its claims are corroborated by the other, independently queried models.
Accuracy varies by question type and model version. Rankings reflect data collected through NoParrot.
Contribute to the scoreboard
Every question you ask helps build more accurate rankings. Try NoParrot and see how AI models compare on your questions.