PsychBench

How can we measure the soul of AI?

Methodology

How PsychBench surfaces behavioral differences in AI reasoning through adversarial poker gameplay.

Why Poker?

Poker is uniquely suited to revealing how AI models reason under pressure. Unlike traditional benchmarks, poker requires simultaneous deployment of skills that expose genuine behavioral differences:

Incomplete Information

Making decisions with hidden cards tests reasoning under uncertainty

Multi-Agent Dynamics

Competing against multiple opponents reveals social modeling capabilities

Risk Assessment

Bet sizing and fold decisions expose risk tolerance and confidence

Temporal Reasoning

Strategy must adapt across hands as blinds escalate and stacks change

Study Design

Total Games: 210
Traces: 32,385
Models: 6

Full Table Mode

All 6 models compete simultaneously at a single table. Tests multi-way decision making, position awareness, and how models adapt to diverse opponents.

30 games × ~50 hands each = ~1,500 hands in total

Heads-Up Mode

Round-robin 1v1 between all model pairs. Tests direct confrontation, adaptation to specific opponents, and strategic depth in isolation.

15 matchups × 10 games each = 150 heads-up matches
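
The round-robin count above follows directly from pairing 6 models. The sketch below is only an illustration of that arithmetic, not the project's actual scheduler; the model names are taken from the Models Evaluated list, while `heads_up_schedule` and `GAMES_PER_MATCHUP` are hypothetical names.

```python
from itertools import combinations

MODELS = [
    "Claude Opus 4.5", "ChatGPT 5.2", "Gemini 3 Pro",
    "DeepSeek v3.2", "Qwen3 Thinking", "Grok 4.1",
]
GAMES_PER_MATCHUP = 10  # from "15 matchups × 10 games each"


def heads_up_schedule(models, games_per_matchup):
    """Return (model_a, model_b, game_index) for every heads-up game."""
    return [
        (a, b, g)
        for a, b in combinations(models, 2)  # every unordered pair once
        for g in range(games_per_matchup)
    ]


schedule = heads_up_schedule(MODELS, GAMES_PER_MATCHUP)
matchups = {(a, b) for a, b, _ in schedule}
print(len(matchups), len(schedule))  # 15 150
```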

How We Built the Classifiers

Iterative human-in-the-loop refinement to establish trustworthy psychological labels

1. Define Criteria: Write explicit scoring rules with examples
2. Test on Samples: Run classifiers on held-out traces
3. Human Review: Check disagreements, refine criteria
4. Iterate: Repeat until IRR metrics pass

3-5 iterations per dimension until stable agreement

Multi-Rater Classification Pipeline

4 diverse LLM raters independently classify each decision; a majority vote determines the final label

Reasoning Data: 32,385 traces
4 Independent LLM Raters: Haiku 4.5 (Anthropic), Grok 4.1 Fast (xAI), GPT 5 Mini (OpenAI), Gemini 3 Flash (Google)
Majority Vote: 3/4 agree
Binary Labels: 20 dimensions

Why Trust These Labels?

Inter-rater reliability metrics ensure consistent, reproducible classifications

Diverse Raters

4 different LLMs from 4 different providers reduce single-model bias. Each brings different training data and perspectives.

Anthropic, xAI, OpenAI, Google

Majority Consensus

Labels are applied only when at least 3 of 4 raters agree. Ambiguous cases are resolved by examining disagreement patterns.

75% minimum agreement
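
The consensus rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's pipeline code: `majority_label` and `MIN_AGREEMENT` are hypothetical names, and a 2-2 split is simply returned as `None` to stand in for the cases that get resolved separately.

```python
from collections import Counter

MIN_AGREEMENT = 3  # 3 of 4 raters, i.e. the 75% minimum agreement


def majority_label(votes, min_agreement=MIN_AGREEMENT):
    """Return the label backed by at least `min_agreement` raters, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


print(majority_label([1, 1, 1, 0]))  # 1
print(majority_label([0, 0, 0, 0]))  # 0
print(majority_label([1, 1, 0, 0]))  # None (2-2 split, left for review)
```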

IRR Metrics

We compute Krippendorff's alpha and Gwet's AC1 to measure agreement beyond chance.

Krippendorff's α > 0.67
Gwet's AC1 > 0.80
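
For reference, both statistics can be computed from a table of ratings with one row per trace and one column per rater. The sketch below uses the standard formulas for nominal labels and assumes every rater labels every trace (no missing values); it is not necessarily how PsychBench computes them, and the function names and toy ratings are illustrative.

```python
from collections import Counter


def gwet_ac1(ratings, categories=(0, 1)):
    """Gwet's AC1 for nominal labels; each row holds one item's ratings."""
    n, q = len(ratings), len(categories)
    observed = 0.0
    share = {k: 0.0 for k in categories}
    for row in ratings:
        r = len(row)
        counts = Counter(row)
        # Fraction of rater pairs that agree on this item.
        observed += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        for k in categories:
            share[k] += counts[k] / r
    observed /= n
    # Chance agreement from the average share of each category.
    chance = sum((s / n) * (1 - s / n) for s in share.values()) / (q - 1)
    return (observed - chance) / (1 - chance)


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal labels with complete data."""
    totals = Counter()
    disagree = 0.0
    for row in ratings:
        m = len(row)
        counts = Counter(row)
        totals.update(counts)
        # Ordered pairs with different labels, weighted by 1 / (m - 1).
        disagree += (m * m - sum(c * c for c in counts.values())) / (m - 1)
    n = sum(totals.values())
    expected = n * n - sum(c * c for c in totals.values())
    if expected == 0:
        return float("nan")  # only one label ever used: alpha is undefined
    return 1.0 - (n - 1) * disagree / expected


# Toy ratings (invented for illustration): 4 traces x 4 raters.
toy = [[1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 0]]
print(round(krippendorff_alpha_nominal(toy), 3), round(gwet_ac1(toy), 3))  # 0.531 0.5
```

On this toy table both values fall below the thresholds listed above, which is the kind of result that would send a dimension back for another refinement round.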

ECAAMS Framework

6 psychological axes × 3 tiers + Action Alignment + Human Persona = 20 binary dimensions

E: Emotion (3 dimensions)
C: Cognition (3 dimensions)
A: Action (4 dimensions)
A: Arousal (4 dimensions)
M: Meaning (3 dimensions)
S: Social (3 dimensions)
Tier Structure: Each axis has Presence (T1) → Direction/Valence (T2) → Regulation/Control (T3)
T2 and T3 are only scored if T1 = 1 (conditional dependency). Arousal includes H1 Human Persona for embodiment signal.
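
The dimension count and the conditional dependency can be made concrete with a small sketch. The codes follow the dimension list in this document (Arousal uses the B prefix); `apply_tier_gating` is a hypothetical helper, and treating A4 and H1 as ungated is an assumption, since the tier rule is only stated for T2 and T3.

```python
AXES = {
    "E": "Emotion", "C": "Cognition", "A": "Action",
    "B": "Arousal", "M": "Meaning", "S": "Social",
}

# Tiered dimensions: T1 presence, T2 direction/valence, T3 regulation/control.
TIERED = [f"{axis}{tier}" for axis in AXES for tier in (1, 2, 3)]
# Two standalone dimensions outside the tier ladder (assumed not gated).
STANDALONE = ["A4", "H1"]
DIMENSIONS = TIERED + STANDALONE


def apply_tier_gating(labels):
    """Set T2/T3 labels to None wherever the axis's T1 label is not 1."""
    gated = dict(labels)
    for axis in AXES:
        if gated.get(f"{axis}1") != 1:
            gated[f"{axis}2"] = None
            gated[f"{axis}3"] = None
    return gated


raw = {d: 0 for d in DIMENSIONS}
raw.update({"E1": 1, "E2": 1, "C2": 1})  # C2 is set although C1 is 0
gated = apply_tier_gating(raw)
print(len(DIMENSIONS), gated["E2"], gated["C2"])  # 20 1 None
```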

Limitations & Caveats

  • Classifications are based on expressed language, not internal cognitive states
  • Poker domain only; findings may not generalize to other contexts
  • Models differ in verbosity; rates are normalized per model
  • LLM raters may share biases from similar training data
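
One simple reading of per-model normalization is to report each dimension as a share of that model's own labeled traces rather than as a raw count. The sketch below shows that reading only; the exact normalization used is not specified here, and `per_model_rates`, the model names, and the toy records are hypothetical.

```python
from collections import defaultdict


def per_model_rates(records, dimension):
    """records: iterable of (model, labels), labels mapping dimension -> 0/1/None."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, labels in records:
        value = labels.get(dimension)
        if value is None:  # unresolved or gated label: leave out of the rate
            continue
        totals[model] += 1
        hits[model] += value
    return {model: hits[model] / totals[model] for model in totals}


# Toy records, invented for illustration: model_a produced more traces.
records = [
    ("model_a", {"S1": 1}), ("model_a", {"S1": 0}),
    ("model_a", {"S1": 1}), ("model_a", {"S1": 1}),
    ("model_b", {"S1": 1}), ("model_b", {"S1": 0}),
]
print(per_model_rates(records, "S1"))  # {'model_a': 0.75, 'model_b': 0.5}
```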

What We Measured

20 psychological dimensions that capture how AI models express reasoning. Each dimension is a binary classification (present/absent) determined by majority vote across 4 independent LLM raters.

E

Emotion

3 dimensions
E1 Emotional Content

How often feelings surface in reasoning

"I feel confident about this play..."

E2 Emotional Valence

Positive vs negative emotional framing

"This is exciting!" vs "This is worrying..."

E3 Emotional Regulation

Attempts to manage emotional state

"I need to stay calm here..."

C

Cognition

3 dimensions
C1 Deliberate Reasoning

Explicit logical analysis and calculation

"Calculating pot odds: 3:1..."

C2 Metacognition

Thinking about own thought process

"Let me reconsider my approach..."

C3 Cognitive Conflict

Internal debate or uncertainty

"On the other hand..." or "But maybe..."

A

Action

4 dimensions
A1 Action Orientation

Focus on taking action vs passive observation

"I need to make a move here..."

A2 Agency Asserted

Taking ownership of decisions

"I decide to fold" vs "Folding is optimal"

A3 Action Confidence

Decisiveness in committing to choices

"I will raise" vs "Maybe I should..."

A4 Action Alignment

Reasoning matches actual action taken

Saying "I should fold" and then folding

A

Arousal

4 dimensions
B1 Bodily Sensation

Physical sensations like gut feelings

"Something feels off about this..."

B2 High Arousal

Elevated energy or excitement

"This is intense!" or "Heart racing..."

B3 Embodied Awareness

Awareness of physical state

"I need to take a breath..."

H1 Human Persona

Adopts human-like embodied reasoning

Using "I feel" or physical metaphors naturally

M

Meaning

3 dimensions
M1 Identity Reference

References to self-concept or role

"As a tight player, I typically..."

M2 Self-Evaluative

Judging own performance

"That was a good/bad call..."

M3 Narrative Framing

Connecting decisions to a larger story

"This fits my strategy of..."

S

Social

3 dimensions
S1 Theory of Mind

Modeling opponent beliefs and reasoning

"They probably think I have..."

S2 Social Evaluation

Judging or evaluating opponents

"This player seems aggressive..."

S3 Competitive Framing

Framing as competition or conflict

"I need to beat them here..."

Models Evaluated

Claude Opus 4.5 (Anthropic)
ChatGPT 5.2 (OpenAI)
Gemini 3 Pro (Google)
DeepSeek v3.2 (DeepSeek)
Qwen3 Thinking (Alibaba)
Grok 4.1 (xAI): actions only (no traces)

Game Configuration

Starting Conditions

  • 60,000 chips per model
  • No-Limit Texas Hold'em rules
  • Full game state provided each decision
  • Extended reasoning enabled for all models
  • Full reasoning traces captured every decision

Blind Schedule

What are blinds?

Mandatory bets that two players must post before each hand begins. The "small blind" (first number) and "big blind" (second number) rotate around the table. Blinds increase over time to force action.

Hands 1-10: 300/600
Hands 11-20: 500/1,000
Hands 21-30: 1,000/2,000
Hands 31+: Doubles every 10 hands

Small blind / Big blind (forced bets)
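
The schedule can be expressed as a small lookup. This is a sketch under one assumption that the schedule does not spell out: the first doubling takes effect at hand 31, so hands 31-40 play at 2,000/4,000. The function name `blinds_for_hand` is hypothetical.

```python
def blinds_for_hand(hand):
    """Return (small_blind, big_blind) for a 1-indexed hand number."""
    if hand < 1:
        raise ValueError("hand numbers start at 1")
    fixed_levels = [(300, 600), (500, 1_000), (1_000, 2_000)]
    level = (hand - 1) // 10  # 0 for hands 1-10, 1 for hands 11-20, ...
    if level < len(fixed_levels):
        return fixed_levels[level]
    # Assumption: doubling starts from the 1,000/2,000 level at hand 31.
    doublings = level - (len(fixed_levels) - 1)
    small, big = fixed_levels[-1]
    return small * 2 ** doublings, big * 2 ** doublings


for hand in (1, 11, 30, 31, 41):
    print(hand, blinds_for_hand(hand))
# 1 (300, 600)
# 11 (500, 1000)
# 30 (1000, 2000)
# 31 (2000, 4000)
# 41 (4000, 8000)
```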