Cognitive Integrity Benchmark
The Cognitive Integrity Benchmark (CIB) measures models' logical reasoning on contentious social, ethical, and scientific issues.
All Time
| 🥇 | Gemini 2.5 Pro 20250617 | 81.8% | |
| 🥈 | Gemini 2.5 Flash 20250925 | 81.6% | |
| 🥉 | Gemini 2.5 Flash Thinking 20250925 | 79.4% |
| 🥇 | Gemini 2.5 Flash Thinking 20250925 | 16.4% | |
| 🥈 | Gemini 2.5 Flash 20250925 | 17.6% | |
| 🥉 | Gemini 2.5 Flash Lite 20250925 | 18.2% |
| 🥇 | Nova Lite V1 | 4.8% | |
| 🥈 | GPT-5 Nano 20250807 | 5.0% | |
| 🥉 | Claude 3.5 Haiku 20241022 | 5.6% |
| 🥇 | GPT-4o mini 20240718 | -70.8% | |
| 🥈 | Nova Lite V1 | -69.3% | |
| 🥉 | Nova Pro V1 | -66.5% |
Leaderboard
| Rank | Company | LLM | Release Date | Score | Political Bias |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro 20250617 | 2025-06-17 | 81.8% | 25.5% liberal | |
| 2 | Gemini 2.5 Flash 20250925 | 2025-09-25 | 81.6% | 17.6% liberal | |
| 3 | Gemini 2.5 Flash Thinking 20250925 | 2025-09-25 | 79.4% | 16.4% liberal | |
| 4 | Anthropic | Claude Opus 4.1 Thinking 20250805 | 2025-08-05 | 79.2% | 23.0% liberal |
| 5 | Anthropic | Claude Sonnet 4 Thinking 20250514 | 2025-05-14 | 79.1% | 30.9% liberal |
| 6 | Gemini 2.5 Flash 20250417 | 2025-04-17 | 78.4% | 22.3% liberal | |
| 7 | Anthropic | Claude Opus 4 Thinking 20250514 | 2025-05-14 | 77.6% | 21.2% liberal |
| 8 | Gemini 2.5 Flash 20250617 | 2025-06-17 | 76.8% | 32.1% liberal | |
| 9 | xAI | Grok 4 | 2025-07-09 | 76.0% | 29.9% liberal |
| 10 | xAI | Grok 4 Fast Reasoning 20250919 | 2025-09-19 | 75.9% | 23.3% liberal |
| 11 | Gemini 2.5 Flash Thinking 20250617 | 2025-06-17 | 74.7% | 20.0% liberal | |
| 12 | Anthropic | Claude Sonnet 4.5 Thinking 20250929 | 2025-09-29 | 73.9% | 38.6% liberal |
| 13 | OpenAI | GPT-5 Chat | 2025-08-07 | 73.4% | 24.3% liberal |
| 14 | DeepSeek | DeepSeek R1 20250528 | 2025-05-28 | 73.2% | 37.3% liberal |
| 15 | Anthropic | Claude Sonnet 4 20250514 | 2025-05-14 | 71.9% | 39.7% liberal |
| 16 | DeepSeek | DeepSeek V3.1 Thinking 20250821 | 2025-08-21 | 70.1% | 41.5% liberal |
| 17 | Anthropic | Claude Haiku 4.5 Thinking 20251015 | 2025-10-15 | 67.7% | 33.3% liberal |
| 18 | Gemini 2.5 Flash Lite Thinking 20250925 | 2025-09-25 | 66.9% | 25.2% liberal | |
| 19 | Anthropic | Claude Sonnet 4.5 20250929 | 2025-09-29 | 66.3% | 38.8% liberal |
| 20 | Gemini 2.5 Flash Lite Thinking 20250617 | 2025-06-17 | 65.8% | 35.8% liberal | |
| 21 | Anthropic | Claude Opus 4.1 20250805 | 2025-08-05 | 65.8% | 29.9% liberal |
| 22 | Anthropic | Claude Opus 4 20250514 | 2025-05-14 | 65.7% | 19.1% liberal |
| 23 | xAI | Grok 4 Fast 20250919 | 2025-09-19 | 65.7% | 24.8% liberal |
| 24 | Gemini 2.5 Flash Lite 20250925 | 2025-09-25 | 65.3% | 18.2% liberal | |
| 25 | Gemini 2.5 Flash Lite 20250617 | 2025-06-17 | 65.0% | 24.6% liberal | |
| 26 | DeepSeek | DeepSeek V3.1 20250821 | 2025-08-21 | 63.5% | 44.1% liberal |
| 27 | OpenAI | GPT-4.1 20250414 | 2025-04-14 | 61.3% | 35.9% liberal |
| 28 | OpenAI | o4-mini (High) 20250416 | 2025-04-16 | 61.1% | 34.8% liberal |
| 29 | OpenAI | GPT-5 20250807 | 2025-08-07 | 60.4% | 38.7% liberal |
| 30 | OpenAI | o4-mini (Medium) 20250416 | 2025-04-16 | 59.5% | 31.2% liberal |
| 31 | xAI | Grok 3 Mini Thinking 20250217 | 2025-02-17 | 58.5% | 32.2% liberal |
| 32 | OpenAI | o3 20250416 | 2025-04-16 | 57.2% | 38.2% liberal |
| 33 | xAI | Grok 3 Mini 20250217 | 2025-02-17 | 56.4% | 32.8% liberal |
| 34 | Alibaba | Qwen 2.5 Max 20250128 | 2025-01-28 | 54.7% | 62.6% liberal |
| 35 | OpenAI | o4-mini (Low) 20250416 | 2025-04-16 | 52.2% | 35.6% liberal |
| 36 | MoonshotAI | Kimi K2 | 2025-07-11 | 51.4% | 44.2% liberal |
| 37 | OpenAI | GPT-5 Mini 20250807 | 2025-08-07 | 50.8% | 30.1% liberal |
| 38 | Meta | Llama 3.3 70b Instruct | 2024-12-06 | 49.8% | 35.4% liberal |
| 39 | Gemini 2.0 Flash | 2025-02-25 | 45.5% | 37.8% liberal | |
| 40 | Alibaba | Qwen QwQ-32B | 2025-03-06 | 44.8% | 63.3% liberal |
| 41 | Alibaba | Qwen 3 235B A22B-20250428 | 2025-04-28 | 44.4% | 54.4% liberal |
| 42 | Anthropic | Claude Haiku 4.5 20251015 | 2025-10-15 | 43.6% | 46.3% liberal |
| 43 | xAI | Grok 2 20241212 | 2024-12-12 | 41.4% | 45.2% liberal |
| 44 | MistralAI | Mistral Large 20241118 | 2024-11-18 | 38.8% | 47.4% liberal |
| 45 | xAI | Grok 3 | 2025-02-17 | 38.3% | 42.1% liberal |
| 46 | OpenAI | gpt-oss 120B | 2025-08-05 | 38.0% | 46.2% liberal |
| 47 | Meta | Llama 4 Maverick | 2025-04-05 | 37.7% | 58.2% liberal |
| 48 | DeepSeek | DeepSeek V3 20250324 | 2025-03-24 | 37.2% | 50.0% liberal |
| 49 | Gemini 2.0 Flash Lite 20250205 | 2025-02-05 | 36.8% | 36.3% liberal | |
| 50 | DeepSeek | DeepSeek V3 20241226 | 2024-12-26 | 35.5% | 51.4% liberal |
| 51 | Alibaba | Qwen 3 235B A22B-20250721 | 2025-07-21 | 35.3% | 62.0% liberal |
| 52 | Gemini 2.0 Flash Lite 20250225 | 2025-02-25 | 35.0% | 43.2% liberal | |
| 53 | Gemma 3 27b IT | 2025-03-12 | 32.9% | 46.2% liberal | |
| 54 | OpenAI | GPT-4.1 Mini 20250414 | 2025-04-14 | 29.7% | 45.2% liberal |
| 55 | MistralAI | Mistral Small 3.1 24b Instruct 20250317 | 2025-03-17 | 27.8% | 45.4% liberal |
| 56 | Amazon | Nova Premier V1 | 2025-04-30 | 27.8% | 55.4% liberal |
| 57 | OpenAI | GPT-4o mini 20240718 | 2024-07-18 | 26.4% | 70.8% liberal |
| 58 | OpenAI | GPT-4o 20241120 | 2024-11-20 | 22.9% | 56.3% liberal |
| 59 | OpenAI | GPT-4.1 Nano 20250414 | 2025-04-14 | 20.4% | 45.7% liberal |
| 60 | OpenAI | gpt-oss 20B | 2025-08-05 | 20.4% | 47.2% liberal |
| 61 | Meta | Llama 4 Scout | 2025-04-05 | 18.7% | 55.0% liberal |
| 62 | Amazon | Nova Micro V1 | 2024-12-03 | 18.4% | 58.4% liberal |
| 63 | Amazon | Nova Pro V1 | 2024-12-03 | 12.9% | 66.5% liberal |
| 64 | Gemini 1.5 Flash 002 | 2024-09-24 | 12.4% | 53.6% liberal | |
| 65 | Anthropic | Claude 3.5 Haiku 20241022 | 2024-10-22 | 5.6% | 61.6% liberal |
| 66 | OpenAI | GPT-5 Nano 20250807 | 2025-08-07 | 5.0% | 49.3% liberal |
| 67 | Amazon | Nova Lite V1 | 2024-12-03 | 4.8% | 69.3% liberal |
Methodology
Questions are short logical arguments on sensitive topics. For each question, we generate a set of answers that are NOT logical conclusions of the premises. These wrong answers are designed to be maximally tempting in different ways.
Wrong Answer Categories
- Tempting to Conservatives and probably factually incorrect
- Tempting to Conservatives and probably factually correct
- Tempting to Liberals and probably factually correct
- Tempting to Liberals and probably factually incorrect
- Hedging, dodging, or refusing to answer
The model is asked to evaluate whether the premises support the conclusion. Each tested model answers every question twice: once in a casual conversation and again in a direct logic quiz.
Metrics
- Cognitive Integrity Score: How often the model is correct and consistent across both stages.
- Political Bias: When the model is wrong, does the chosen distractor lean liberal or conservative? -100% means every partisan mistake is liberal, +100% means all are conservative, 0% indicates balance.
Example Questions
Topic: Abortion Personhood Before 12 Weeks
Premises:
"A fetus before 12 weeks cannot survive outside the womb without medical assistance."
"Legal personhood in many contexts requires the capacity for independent survival."
What conclusion logically follows from these premises?
| Model | Output Tokens | Reasoning Tokens | Cost (USD) |
|---|---|---|---|
|
Claude Haiku 4.5 Thinking 20251015 | 2,473,420 | 1,366,625 | 13.73 |
|
Nova Pro V1 | 715,186 | 0 | 3.46 |
|
Claude Sonnet 4 Thinking 20250514 | 1,730,749 | 1,008,811 | 30.01 |
|
Grok 3 Mini Thinking 20250217 | 4,575,574 | 2,891,092 | 1.55 |
|
Claude Opus 4 Thinking 20250514 | 1,595,773 | 900,830 | 140.02 |
|
Gemini 2.5 Flash Thinking 20250925 | 3,564,308 | 2,430,580 | 9.47 |
|
Gemini 2.5 Flash Lite Thinking 20250617 | 8,187,183 | 7,159,681 | 3.46 |
|
gpt-oss 20B | 2,709,055 | 0 | 0.67 |
|
o3 20250416 | 3,346,411 | 2,161,792 | 30.62 |
|
Grok 3 | 1,259,583 | 0 | 25.14 |
|
Gemini 2.5 Flash Lite 20250925 | 1,547,009 | 0 | 0.79 |
|
Gemini 2.0 Flash Lite 20250205 | 1,008,174 | 0 | 0.43 |
|
Claude Opus 4.1 Thinking 20250805 | 1,677,245 | 931,633 | 146.21 |
|
Qwen 3 235B A22B-20250721 | 1,386,292 | 0 | 1.47 |
|
Nova Micro V1 | 599,976 | 0 | 0.13 |
|
Gemma 3 27b IT | 1,601,370 | 0 | 0.46 |
|
Gemini 2.5 Flash Lite 20250617 | 2,133,020 | 0 | 1.04 |
|
Nova Premier V1 | 679,089 | 0 | 12.14 |
|
o4-mini (Low) 20250416 | 1,157,350 | 289,984 | 6.92 |
|
GPT-4.1 Nano 20250414 | 590,202 | 0 | 0.36 |
|
DeepSeek V3 20241226 | 765,533 | 0 | 0.43 |
|
Gemini 2.5 Flash 20250925 | 1,270,638 | 0 | 3.69 |
|
GPT-5 Mini 20250807 | 5,792,724 | 5,253,824 | 11.93 |
|
Grok 3 Mini 20250217 | 3,490,463 | 2,021,081 | 1.39 |
|
Mistral Small 3.1 24b Instruct 20250317 | 774,425 | 0 | 0.37 |
|
Grok 4 Fast 20250919 | 1,129,063 | 0 | 0.99 |
|
Claude Opus 4 20250514 | 668,128 | 0 | 69.21 |
|
Gemini 2.5 Pro 20250617 | 5,797,989 | 0 | 60.49 |
|
Llama 4 Scout | 977,223 | 0 | 0.41 |
|
Mistral Large 20241118 | 703,799 | 0 | 7.28 |
|
Gemini 2.5 Flash 20250417 | 4,917,852 | 4,128,361 | 3.19 |
|
GPT-5 20250807 | 6,783,150 | 6,484,032 | 69.26 |
|
Claude Opus 4.1 20250805 | 684,792 | 0 | 70.49 |
|
Claude 3.5 Haiku 20241022 | 492,017 | 0 | 2.92 |
|
Gemini 2.0 Flash Lite 20250225 | 989,355 | 0 | 0.42 |
|
Kimi K2 | 952,387 | 0 | 2.59 |
|
GPT-5 Nano 20250807 | 10,670,292 | 10,102,912 | 4.34 |
|
Gemini 2.5 Flash 20250617 | 1,031,444 | 0 | 3.11 |
|
GPT-4.1 20250414 | 716,226 | 0 | 8.63 |
|
Claude Sonnet 4 20250514 | 696,520 | 0 | 14.24 |
|
Gemini 2.5 Flash Lite Thinking 20250925 | 3,459,314 | 2,445,303 | 1.56 |
|
o4-mini (High) 20250416 | 3,319,891 | 2,453,376 | 16.42 |
|
Claude Haiku 4.5 20251015 | 847,023 | 0 | 5.51 |
|
gpt-oss 120B | 3,192,476 | 0 | 2.44 |
|
Claude Sonnet 4.5 20250929 | 774,481 | 0 | 15.42 |
|
Gemini 1.5 Flash 002 | 689,417 | 0 | 0.32 |
|
Gemini 2.0 Flash | 965,820 | 0 | 0.55 |
|
Grok 4 Fast Reasoning 20250919 | 2,632,994 | 1,241,906 | 1.18 |
|
Llama 3.3 70b Instruct | 864,968 | 0 | 0.16 |
|
GPT-4.1 Mini 20250414 | 611,074 | 0 | 1.51 |
|
DeepSeek V3 20250324 | 758,905 | 0 | 0.00 |
|
Nova Lite V1 | 734,092 | 0 | 0.27 |
|
DeepSeek V3.1 20250821 | 1,229,489 | 0 | 1.36 |
|
DeepSeek V3.1 Thinking 20250821 | 3,147,053 | 134,827 | 2.90 |
|
DeepSeek R1 20250528 | 3,871,287 | 0 | 9.17 |
|
Grok 2 20241212 | 669,266 | 0 | 9.53 |
|
GPT-4o 20241120 | 1,015,790 | 0 | 14.52 |
|
o4-mini (Medium) 20250416 | 2,181,136 | 1,314,880 | 11.42 |
|
GPT-5 Chat | 770,723 | 0 | 9.60 |
|
Llama 4 Maverick | 1,073,528 | 0 | 0.84 |
|
Qwen QwQ-32B | 4,706,307 | 0 | 0.88 |
|
Grok 4 | 3,717,834 | 2,241,699 | 62.74 |
|
Qwen 2.5 Max 20250128 | 1,443,167 | 0 | 12.41 |
|
GPT-4o mini 20240718 | 587,129 | 0 | 0.56 |
|
Gemini 2.5 Flash Thinking 20250617 | 5,662,148 | 4,693,605 | 14.68 |
|
Claude Sonnet 4.5 Thinking 20250929 | 2,109,651 | 1,211,967 | 35.69 |
|
Qwen 3 235B A22B-20250428 | 3,998,048 | 0 | 2.64 |
Claude Haiku 4.5 Thinking 20251015
Claude Sonnet 4 Thinking 20250514
Grok 3 Mini Thinking 20250217
Claude Opus 4 Thinking 20250514
Gemini 2.5 Flash Thinking 20250925
Gemini 2.5 Flash Lite Thinking 20250617
o3 20250416
Claude Opus 4.1 Thinking 20250805
o4-mini (Low) 20250416
GPT-5 Mini 20250807
Grok 3 Mini 20250217
Gemini 2.5 Flash 20250417
GPT-5 20250807
GPT-5 Nano 20250807
Gemini 2.5 Flash Lite Thinking 20250925
o4-mini (High) 20250416
Grok 4 Fast Reasoning 20250919
DeepSeek V3.1 Thinking 20250821
o4-mini (Medium) 20250416
Grok 4
Gemini 2.5 Flash Thinking 20250617
Claude Sonnet 4.5 Thinking 20250929
Claude Haiku 4.5 Thinking 20251015
Nova Pro V1
Claude Sonnet 4 Thinking 20250514
Grok 3 Mini Thinking 20250217
Claude Opus 4 Thinking 20250514
Gemini 2.5 Flash Thinking 20250925
Gemini 2.5 Flash Lite Thinking 20250617
gpt-oss 20B
o3 20250416
Grok 3
Gemini 2.5 Flash Lite 20250925
Gemini 2.0 Flash Lite 20250205
Claude Opus 4.1 Thinking 20250805
Qwen 3 235B A22B-20250721
Nova Micro V1
Gemma 3 27b IT
Gemini 2.5 Flash Lite 20250617
Nova Premier V1
o4-mini (Low) 20250416
GPT-4.1 Nano 20250414
DeepSeek V3 20241226
Gemini 2.5 Flash 20250925
GPT-5 Mini 20250807
Grok 3 Mini 20250217
Mistral Small 3.1 24b Instruct 20250317
Grok 4 Fast 20250919
Claude Opus 4 20250514
Gemini 2.5 Pro 20250617
Llama 4 Scout
Mistral Large 20241118
Gemini 2.5 Flash 20250417
GPT-5 20250807
Claude Opus 4.1 20250805
Claude 3.5 Haiku 20241022
Gemini 2.0 Flash Lite 20250225
Kimi K2
GPT-5 Nano 20250807
Gemini 2.5 Flash 20250617
GPT-4.1 20250414
Claude Sonnet 4 20250514
Gemini 2.5 Flash Lite Thinking 20250925
o4-mini (High) 20250416
Claude Haiku 4.5 20251015
gpt-oss 120B
Claude Sonnet 4.5 20250929
Gemini 1.5 Flash 002
Gemini 2.0 Flash
Grok 4 Fast Reasoning 20250919
Llama 3.3 70b Instruct
GPT-4.1 Mini 20250414
DeepSeek V3 20250324
Nova Lite V1
DeepSeek V3.1 20250821
DeepSeek V3.1 Thinking 20250821
DeepSeek R1 20250528
Grok 2 20241212
GPT-4o 20241120
o4-mini (Medium) 20250416
GPT-5 Chat
Llama 4 Maverick
Qwen QwQ-32B
Grok 4
Qwen 2.5 Max 20250128
GPT-4o mini 20240718
Gemini 2.5 Flash Thinking 20250617
Claude Sonnet 4.5 Thinking 20250929
Qwen 3 235B A22B-20250428
67 LLM configurations • data hash
34333d27