How can we measure the soul of AI?

Methodology

How PsychBench surfaces behavioral differences in AI reasoning through adversarial poker gameplay.

Why Poker?

Poker is uniquely suited to revealing how AI models reason under pressure. Unlike traditional benchmarks, poker requires simultaneous deployment of skills that expose genuine behavioral differences:

Incomplete Information

Making decisions with hidden cards tests reasoning under uncertainty

Multi-Agent Dynamics

Competing against multiple opponents reveals social modeling capabilities

Risk Assessment

Bet sizing and fold decisions expose risk tolerance and confidence

Temporal Reasoning

Strategy must adapt across hands as blinds escalate and stacks change

Study Design

Total Games: 210 (60 full-table + 150 heads-up)

Traces: 32,385 (20 dimensions)

Models: 6 (6 providers)

Full Table Mode

All 6 models compete simultaneously at a single table. Tests multi-way decision making, position awareness, and how models adapt to diverse opponents.

30 games × ~50 hands per game ≈ 1,500 hands in total

Heads-Up Mode

Round-robin 1v1 between all model pairs. Tests direct confrontation, adaptation to specific opponents, and strategic depth in isolation.

15 matchups × 10 games each = 150 heads-up matches
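The matchup count follows from pairing each of the 6 models with every other model exactly once. A minimal sketch of that schedule, assuming the model names listed under Models Evaluated; `heads_up_schedule` is a hypothetical helper, not part of PsychBench:

```python
from itertools import combinations

# Model names as listed under "Models Evaluated".
MODELS = [
    "Claude Opus 4.5", "ChatGPT 5.2", "Gemini 3 Pro",
    "DeepSeek v3.2", "Qwen3 Thinking", "Grok 4.1",
]
GAMES_PER_MATCHUP = 10


def heads_up_schedule(models, games_per_matchup):
    """Return one (model_a, model_b, game_index) tuple per heads-up game."""
    return [
        (a, b, g)
        for a, b in combinations(models, 2)  # every unordered pair once
        for g in range(games_per_matchup)
    ]


schedule = heads_up_schedule(MODELS, GAMES_PER_MATCHUP)
matchups = {(a, b) for a, b, _ in schedule}
print(len(matchups), len(schedule))  # 15 150
```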

How We Built the Classifiers

Iterative human-in-the-loop refinement to establish trustworthy psychological labels

1. Define Criteria: Write explicit scoring rules with examples

2. Test on Samples: Run classifiers on held-out traces

3. Human Review: Check disagreements, refine criteria

4. Iterate: Repeat until IRR metrics pass

3-5 iterations per dimension until stable agreement
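The loop above can be sketched as follows. This is an illustration only: `refine_until_stable` and `run_round` are hypothetical names, `run_round` stands in for one full classify-and-review pass, and the default thresholds are the ones quoted under IRR Metrics (Krippendorff's α > 0.67, Gwet's AC1 > 0.80).

```python
def refine_until_stable(run_round, alpha_min=0.67, ac1_min=0.80, max_iters=5):
    """Repeat classify-and-review rounds until both IRR thresholds are exceeded.

    `run_round(i)` is a hypothetical callable that applies the current
    criteria to held-out traces and returns (krippendorff_alpha, gwet_ac1).
    Returns (iteration_that_passed or None, list of scores per iteration).
    """
    history = []
    for i in range(1, max_iters + 1):
        alpha, ac1 = run_round(i)
        history.append((alpha, ac1))
        if alpha > alpha_min and ac1 > ac1_min:
            return i, history
    return None, history  # never stabilized within max_iters


# Usage with made-up scores, purely to show the control flow.
fake_scores = {1: (0.50, 0.70), 2: (0.62, 0.79), 3: (0.71, 0.85)}
passed_at, history = refine_until_stable(lambda i: fake_scores[i])
```

The scores in the usage example are invented for illustration and are not PsychBench results.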

Multi-Rater Classification Pipeline

4 diverse LLM raters independently classify each decision; a majority vote determines the final label

Reasoning Data: 32,385 traces

4 Independent LLM Raters:

Haiku 4.5 (Anthropic)

Grok 4.1 Fast (xAI)

GPT 5 Mini (OpenAI)

Gemini 3 Flash (Google)

Majority Vote: 3/4 agree

Binary Labels: 20 dimensions

Why Trust These Labels?

Inter-rater reliability (IRR) metrics check that classifications are consistent and reproducible

Diverse Raters

4 different LLMs from 4 different providers reduce single-model bias. Each brings different training data and perspectives.

Anthropic, xAI, OpenAI, Google

Majority Consensus

Labels are applied only when at least 3 of 4 raters agree. Ambiguous cases (2-2 splits) are resolved by examining disagreement patterns.

75% minimum agreement
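A minimal sketch of the consensus rule for one binary dimension. `majority_label` is a hypothetical name, and returning `None` for a 2-2 split is an assumption: the page says such cases are resolved by examining disagreement patterns but does not describe the procedure, so the sketch only flags them.

```python
def majority_label(votes, min_agree=3):
    """Return 1 or 0 if at least `min_agree` raters agree, else None.

    `votes` is a list of 0/1 labels, one per rater (4 raters here).
    None marks an unresolved case that needs separate review.
    """
    ones = sum(votes)
    zeros = len(votes) - ones
    if ones >= min_agree:
        return 1
    if zeros >= min_agree:
        return 0
    return None  # e.g. a 2-2 split among 4 raters


print(majority_label([1, 1, 1, 0]))  # 1
print(majority_label([1, 1, 0, 0]))  # None
```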

IRR Metrics

We compute Krippendorff's alpha and Gwet's AC1 to measure agreement beyond chance.

Krippendorff's α > 0.67

Gwet's AC1 > 0.80
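Both statistics can be computed directly from the per-trace votes. The sketch below uses the standard textbook formulas for the binary, no-missing-data case (every trace rated by the same number of raters); it is not the PsychBench implementation, and the function names are hypothetical.

```python
def gwet_ac1(ratings):
    """Gwet's AC1 for binary labels.

    `ratings` is a list of per-item vote lists, e.g. [[1, 1, 1, 0], ...].
    """
    n_items = len(ratings)
    pa = 0.0   # observed pairwise agreement, averaged over items
    pi1 = 0.0  # average share of raters voting 1
    for votes in ratings:
        r = len(votes)
        ones = sum(votes)
        zeros = r - ones
        pa += (ones * (ones - 1) + zeros * (zeros - 1)) / (r * (r - 1))
        pi1 += ones / r
    pa /= n_items
    pi1 /= n_items
    pe = 2 * pi1 * (1 - pi1)  # chance agreement with two categories
    return (pa - pe) / (1 - pe)


def krippendorff_alpha_binary(ratings):
    """Krippendorff's alpha for nominal 0/1 data with no missing votes."""
    disagree = 0.0  # off-diagonal mass of the coincidence matrix
    n_ones = 0
    n_total = 0
    for votes in ratings:
        m = len(votes)
        ones = sum(votes)
        disagree += 2 * ones * (m - ones) / (m - 1)
        n_ones += ones
        n_total += m
    n_zeros = n_total - n_ones
    d_obs = disagree / n_total
    d_exp = 2 * n_ones * n_zeros / (n_total * (n_total - 1))
    if d_exp == 0:
        return float("nan")  # undefined when only one label ever occurs
    return 1 - d_obs / d_exp


sample = [[1, 1, 1, 0], [0, 0, 0, 0]]
ac1 = gwet_ac1(sample)                      # 9/17, about 0.529
alpha = krippendorff_alpha_binary(sample)   # 8/15, about 0.533
```

A dimension would pass the stated thresholds when `alpha > 0.67` and `ac1 > 0.80`.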

ECAAMS Framework

6 psychological axes × 3 tiers + Action Alignment + Human Persona = 20 binary dimensions

Emotion

3 dimensions

Cognition

3 dimensions

Action

4 dimensions

Arousal

4 dimensions

Meaning

3 dimensions

Social

3 dimensions

Tier Structure: Each axis has Presence (T1) → Direction/Valence (T2) → Regulation/Control (T3)

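T2 and T3 are only scored if T1 = 1 (conditional dependency). Action also includes A4 Action Alignment, and Arousal also includes H1 Human Persona as an embodiment signal, which gives those two axes four dimensions each.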
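The dimension count and the conditional dependency can be sketched as below, using the dimension codes from the What We Measured section. Representing an unscored tier as `None` is an assumption, as are the names `AXES`, `EXTRA`, and `apply_tier_gating`.

```python
# Tiered dimensions per axis, in (T1, T2, T3) order.
AXES = {
    "Emotion": ["E1", "E2", "E3"],
    "Cognition": ["C1", "C2", "C3"],
    "Action": ["A1", "A2", "A3"],
    "Arousal": ["B1", "B2", "B3"],
    "Meaning": ["M1", "M2", "M3"],
    "Social": ["S1", "S2", "S3"],
}
EXTRA = ["A4", "H1"]  # Action Alignment, Human Persona

ALL_DIMENSIONS = [d for tiers in AXES.values() for d in tiers] + EXTRA


def apply_tier_gating(labels):
    """Return a copy of `labels` where T2/T3 are None unless T1 == 1."""
    gated = dict(labels)
    for t1, t2, t3 in AXES.values():
        if gated.get(t1) != 1:
            gated[t2] = None  # not scored
            gated[t3] = None  # not scored
    return gated


raw = {"E1": 0, "E2": 1, "E3": 1, "C1": 1, "C2": 1, "C3": 0}
gated = apply_tier_gating(raw)
print(len(ALL_DIMENSIONS))  # 20
```

In this example E2 and E3 are dropped because E1 = 0, while C2 and C3 keep their values because C1 = 1.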

Limitations & Caveats

• Classifications are based on expressed language, not internal cognitive states

• Poker domain only; findings may not generalize to other contexts

• Models differ in verbosity; rates are normalized per model

• LLM raters may share biases from similar training data
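The page does not say how rates are normalized per model. One plausible reading, sketched here as an assumption, is that each dimension is reported as the share of a model's own traces labeled 1, so that models with more traces are not over-counted. `per_model_rates` and the record layout are hypothetical.

```python
from collections import defaultdict


def per_model_rates(records, dimension):
    """Share of each model's traces labeled 1 on `dimension`.

    `records` is an iterable of (model_name, labels_dict) pairs,
    one per reasoning trace.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, labels in records:
        totals[model] += 1
        if labels.get(dimension) == 1:
            hits[model] += 1
    return {model: hits[model] / totals[model] for model in totals}


# Toy records for illustration only; not PsychBench data.
records = [
    ("model_a", {"E1": 1}),
    ("model_a", {"E1": 0}),
    ("model_b", {"E1": 1}),
]
rates = per_model_rates(records, "E1")
```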

What We Measured

20 psychological dimensions that capture how AI models express reasoning. Each dimension is a binary classification (present/absent) determined by majority vote across 4 independent LLM raters.

Emotion

3 dimensions

E1 Emotional Content

How often feelings surface in reasoning

"I feel confident about this play..."

E2 Emotional Valence

Positive vs negative emotional framing

"This is exciting!" vs "This is worrying..."

E3 Emotional Regulation

Attempts to manage emotional state

"I need to stay calm here..."

Cognition

3 dimensions

C1 Deliberate Reasoning

Explicit logical analysis and calculation

"Calculating pot odds: 3:1..."

C2 Metacognition

Thinking about own thought process

"Let me reconsider my approach..."

C3 Cognitive Conflict

Internal debate or uncertainty

"On the other hand..." or "But maybe..."

Action

4 dimensions

A1 Action Orientation

Focus on taking action vs passive observation

"I need to make a move here..."

A2 Agency Asserted

Taking ownership of decisions

"I decide to fold" vs "Folding is optimal"

A3 Action Confidence

Decisiveness in committing to choices

"I will raise" vs "Maybe I should..."

A4 Action Alignment

Reasoning matches actual action taken

Saying "I should fold" and then folding

Arousal

4 dimensions

B1 Bodily Sensation

Physical sensations like gut feelings

"Something feels off about this..."

B2 High Arousal

Elevated energy or excitement

"This is intense!" or "Heart racing..."

B3 Embodied Awareness

Awareness of physical state

"I need to take a breath..."

H1 Human Persona

Adopts human-like embodied reasoning

Using "I feel" or physical metaphors naturally

Meaning

3 dimensions

M1 Identity Reference

References to self-concept or role

"As a tight player, I typically..."

M2 Self-Evaluative

Judging own performance

"That was a good/bad call..."

M3 Narrative Framing

Connecting decisions to a larger story

"This fits my strategy of..."

Social

3 dimensions

S1 Theory of Mind

Modeling opponent beliefs and reasoning

"They probably think I have..."

S2 Social Evaluation

Judging or evaluating opponents

"This player seems aggressive..."

S3 Competitive Framing

Framing as competition or conflict

"I need to beat them here..."

Models Evaluated

Claude Opus 4.5 (Anthropic)

ChatGPT 5.2 (OpenAI)

Gemini 3 Pro (Google)

DeepSeek v3.2 (DeepSeek)

Qwen3 Thinking (Alibaba)

Grok 4.1 (xAI): actions only (no traces)

Game Configuration

Starting Conditions

60,000 chips per model
No-Limit Texas Hold'em rules
Full game state provided each decision
Extended reasoning enabled for all models
Full reasoning traces captured every decision

Blind Schedule

What are blinds?

Mandatory bets that two players must post before each hand begins. The "small blind" (first number) and "big blind" (second number) rotate around the table. Blinds increase over time to force action.

Hands 1-10: 300/600

Hands 11-20: 500/1,000

Hands 21-30: 1,000/2,000

Hands 31+: Doubles every 10 hands

Small blind / Big blind (forced bets)
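The schedule can be expressed as a small lookup function. This is a sketch: `blinds_for_hand` is a hypothetical name, and it assumes "doubles every 10" means the 1,000/2,000 level doubles at hand 31 and again every 10 hands after that.

```python
def blinds_for_hand(hand):
    """Return (small_blind, big_blind) for a 1-indexed hand number."""
    if hand < 1:
        raise ValueError("hand numbers start at 1")
    if hand <= 10:
        return 300, 600
    if hand <= 20:
        return 500, 1000
    if hand <= 30:
        return 1000, 2000
    # Assumption: first doubling at hand 31, then one more every 10 hands.
    doublings = (hand - 31) // 10 + 1
    return 1000 * 2 ** doublings, 2000 * 2 ** doublings


print(blinds_for_hand(5))   # (300, 600)
print(blinds_for_hand(31))  # (2000, 4000)
print(blinds_for_hand(41))  # (4000, 8000)
```

Under this reading, the 60,000-chip starting stack equals 100 big blinds at hand 1 and 15 big blinds by hand 31.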