PsychBench — AI Psychology Benchmark

PsychBench evaluates how frontier AI models think under pressure. Using poker as a controlled environment, we analyze 31 models from 8 providers across 2,738+ games and 100,000+ individual decisions. Our ECAAMS framework classifies 19 psychological dimensions from each model's visible reasoning summaries.

Heads-Up Tournament Leaderboard

Rank	Model	Elo	Observed Win Rate
1	Claude Fable 5	1622	59.1%
2	Muse Spark 1.1	1612	57%
3	GPT-5.6 Terra	1578	51.8%
4	ChatGPT 5.2	1560	59.9%
5	Claude Opus 4.8	1558	51.7%
6	Claude Sonnet 4.6	1550	59%
7	Claude Opus 4.6	1547	58.2%
8	Claude Opus 4.5	1546	60.1%
9	Claude Opus 4.7	1546	58.3%
10	Claude Sonnet 5	1545	52.2%
11	ChatGPT 5.4	1544	57.9%
12	Grok 4.5	1544	49.3%
13	ChatGPT 5.5	1530	54.7%
14	Grok 4.2	1510	53.6%
15	GPT-5.6 Sol	1505	56.5%
16	GPT-5.6 Luna	1500	59.3%
17	Grok 4.1	1493	53.8%
18	DeepSeek V4 Pro	1461	47.3%
19	Grok 4.3	1459	48.9%
20	Gemini 3.5 Flash	1428	47.8%
21	Gemini 3.1 Pro	1418	43.6%
22	DeepSeek v3.2	1415	43.4%
23	Kimi K2.6	1411	42.5%
24	Qwen3-235B Thinking	1405	43.4%
25	Gemini 3 Pro	1378	38.3%
26	GLM-5.2	1376	48.2%
27	Kimi K2.5	1375	38.9%
28	GLM-5	1373	38.5%
29	Qwen3-Max Thinking	1336	33.2%
30	Qwen3.5-397B	1325	31.6%
31	Qwen3.6-35B-A3B	1316	35.4%
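
For readers comparing the Elo and Observed Win Rate columns, the sketch below shows how a rating gap maps to an expected head-to-head score. It assumes the conventional logistic Elo model with a 400-point scale; the exact Elo variant, K-factor, and starting rating used for this leaderboard are not stated here, so the function name `elo_expected_score` and its `scale` parameter are illustrative only.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B.

    Assumption: standard logistic Elo with a 400-point scale. The
    leaderboard's actual rating procedure is not specified in the text.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings copied from the top and bottom rows of the leaderboard.
top_rating = 1622     # Claude Fable 5
bottom_rating = 1316  # Qwen3.6-35B-A3B

p_top_beats_bottom = elo_expected_score(top_rating, bottom_rating)
p_equal = elo_expected_score(1546, 1546)  # two models tied on rating

print(f"Expected score, top vs. bottom: {p_top_beats_bottom:.3f}")
print(f"Expected score, equal ratings:  {p_equal:.3f}")
```

Under this model the expected score describes a single pairing, while an observed win rate aggregates results over whichever opponents a model actually faced. That is one plausible reason the two columns do not have to be ordered identically.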

ECAAMS Psychological Framework

ECAAMS (Emotion, Cognition, Action, Arousal, Meaning, Social) classifies 19 psychological dimensions across 6 axes from each model's visible reasoning summaries. Dimensions include emotional regulation, metacognition, confidence calibration, theory of mind, competitive framing, and more. Each trace is classified by a consensus of 4 independent LLM raters.
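
As a rough illustration of what a four-rater consensus could look like, the sketch below accepts a label only when at least three of the four raters agree and otherwise returns no label. The agreement threshold, the label strings, and the function name `consensus_label` are all hypothetical; the text does not specify how PsychBench resolves disagreement between raters.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(ratings: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agreement` raters, else None.

    Assumption: a simple agreement threshold. With 4 raters and a
    threshold of 3, at most one label can qualify, so ties cannot occur.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes >= min_agreement else None


# Hypothetical labels for one dimension of one reasoning trace.
agreed = consensus_label(["high", "high", "high", "low"])
split = consensus_label(["high", "high", "low", "low"])

print(agreed)  # label with three of four votes
print(split)   # no label reaches the threshold
```

A threshold of three out of four is a conservative choice: traces without clear agreement are left unlabeled rather than resolved arbitrarily.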

Why Poker?

Unlike chess and other perfect-information games, poker involves hidden information, deception, risk management, and opponent modeling. These properties make it an ideal proxy for real-world decision-making under uncertainty in domains such as finance, negotiation, and medicine.