PsychBench — AI Psychology Benchmark

PsychBench evaluates how frontier AI models think under pressure. Using poker as a controlled environment, we analyze 21 models from 8 providers across 1,770+ games and 100,000+ individual decisions. Our ECAAMS framework classifies 19 psychological dimensions from each model's reasoning traces.

Heads-Up Tournament Leaderboard

Rank  Model                  Elo   Observed Win Rate
1     Claude Opus 4.6        1619  66.1%
2     Claude Opus 4.7        1603  63.7%
3     ChatGPT 5.4            1597  63.5%
4     Claude Opus 4.5        1590  62.2%
5     ChatGPT 5.2            1587  62.5%
6     Claude Sonnet 4.6      1578  60.6%
7     ChatGPT 5.5            1556  59.4%
8     Grok 4.2               1547  56.1%
9     Grok 4.1               1539  55%
10    Grok 4.3               1530  56.7%
11    Gemini 3.1 Pro         1485  47.2%
12    DeepSeek v3.2          1473  45.6%
13    Qwen3-235B Thinking    1457  43.2%
14    DeepSeek V4 Pro        1435  35%
15    Kimi K2.6              1435  39.4%
16    Gemini 3 Pro           1421  38.3%
17    GLM-5                  1417  37.8%
18    Kimi K2.5              1412  37.8%
19    Qwen3.6-35B-A3B        1379  36.7%
20    Qwen3-Max Thinking     1376  32.1%
21    Qwen3.5-397B           1344  28.3%
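
To help read the Elo column, here is a minimal sketch of how a rating gap maps to an expected head-to-head score. It assumes the standard logistic Elo formula with a 400-point scale; the page does not say which Elo variant or K-factor PsychBench uses, so the function name `expected_score` and the `scale` parameter are illustrative assumptions rather than the benchmark's actual method.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B under standard Elo.

    Assumption: the conventional base-10 logistic curve with a 400-point
    scale. PsychBench's exact rating procedure is not stated in the source.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings taken from the leaderboard: rank 1 (1619) versus rank 21 (1344).
top_rating, bottom_rating = 1619, 1344
p_top = expected_score(top_rating, bottom_rating)
print(f"Expected score for the top-rated model: {p_top:.3f}")
```

Under these assumptions, the 275-point gap between first and last place corresponds to an expected score of roughly 0.83 for the higher-rated model in a direct matchup. The Observed Win Rate column is a different quantity: it presumably aggregates results across all opponents, so it should not be expected to match any single pairwise prediction.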

ECAAMS Psychological Framework

ECAAMS (Emotion, Cognition, Action, Arousal, Meaning, Social) classifies 19 psychological dimensions across 6 axes from each model's internal reasoning traces. Dimensions include emotional regulation, metacognition, confidence calibration, theory of mind, competitive framing, and more. Each trace is classified by a consensus of 4 independent LLM raters.
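
One way a four-rater consensus could be computed is sketched below. The source does not describe the actual consensus rule, so the agreement threshold (`min_agree`), the function name `consensus_label`, and the example labels are all hypothetical; treat this as an illustration of the idea, not the ECAAMS implementation.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agree: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agree` raters, else None.

    Assumption: a simple agreement threshold. With 4 raters and a threshold
    of 3, at most one label can qualify, so ties never need breaking.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


# Hypothetical ratings of one trace on one dimension (e.g. confidence calibration).
print(consensus_label(["high", "high", "high", "low"]))  # three of four agree
print(consensus_label(["high", "high", "low", "low"]))   # split vote, no consensus
```

A threshold of 3 out of 4 leaves split votes unlabeled rather than forcing a decision; a real pipeline might instead average numeric scores or send disagreements to an extra rater.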

Why Poker?

Unlike chess and other perfect-information games, poker involves hidden information, deception, risk management, and opponent modeling. These properties make it an ideal proxy for real-world decision-making under uncertainty in domains such as finance, negotiation, and medicine.