How can we measure the soul of AI?
Methodology
How PsychBench surfaces behavioral differences in AI reasoning through adversarial poker gameplay.
Why Poker?
Poker is well suited to revealing how AI models reason under pressure. Unlike traditional benchmarks, it requires several skills at once, each of which exposes behavioral differences:
Incomplete Information
Making decisions with hidden cards tests reasoning under uncertainty
Multi-Agent Dynamics
Competing against multiple opponents reveals social modeling capabilities
Risk Assessment
Bet sizing and fold decisions expose risk tolerance and confidence
Temporal Reasoning
Strategy must adapt across hands as blinds escalate and stacks change
Study Design
Full Table Mode
All 6 models compete simultaneously at a single table. Tests multi-way decision making, position awareness, and how models adapt to diverse opponents.
Heads-Up Mode
Round-robin 1v1 between all model pairs. Tests direct confrontation, adaptation to specific opponents, and strategic depth in isolation.
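The round-robin pairing can be sketched as below. This is a minimal illustration, not the benchmark's actual scheduler: the model names are hypothetical placeholders, and the sketch assumes one match per unordered pair.

```python
from itertools import combinations

# Hypothetical placeholder names; the actual model list is not given in this section.
models = [f"model_{i}" for i in range(1, 7)]


def heads_up_schedule(models):
    """Return every unordered pair of models (one heads-up match per pair)."""
    return list(combinations(models, 2))


schedule = heads_up_schedule(models)
print(len(schedule))  # 15 pairings for 6 models
```

With 6 models, each model faces 5 opponents, for 15 heads-up matchups in total.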
How We Built the Classifiers
Iterative human-in-the-loop refinement to establish trustworthy psychological labels
Multi-Rater Classification Pipeline
4 diverse LLM raters independently classify each decision; a majority vote determines the final label
Why Trust These Labels?
Inter-rater reliability metrics ensure consistent, reproducible classifications
Diverse Raters
4 different LLMs from 4 different providers reduce single-model bias. Each brings different training data and perspectives.
Majority Consensus
Labels are applied only when at least 3 of 4 raters agree. Ambiguous cases are resolved by examining disagreement patterns.
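A minimal sketch of the consensus rule, assuming each rater returns a binary present/absent vote and reading "3 of 4" as "at least 3 of 4". The function name and the use of `None` for unresolved cases are illustrative choices, not part of the PsychBench pipeline.

```python
from collections import Counter


def consensus_label(votes, threshold=3):
    """Return the agreed label if at least `threshold` raters chose it, else None.

    votes: one boolean per rater (True = dimension present, False = absent).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= threshold else None


print(consensus_label([True, True, True, False]))   # True
print(consensus_label([True, True, False, False]))  # None
```

A 2-2 split returns `None`, which marks the decision as ambiguous and leaves it for the disagreement review described above.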
IRR Metrics
We compute Krippendorff's alpha and Gwet's AC1 to measure agreement beyond chance.
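For readers unfamiliar with these statistics, the sketch below computes both from the textbook definitions for nominal labels. It assumes complete data (every rater labels every decision) and binary categories coded 0/1. The function names are illustrative, and the source does not say which implementation PsychBench uses.

```python
from collections import Counter


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal labels with no missing values.

    ratings: list of units, each a list holding one label per rater.
    """
    coincidences = Counter()
    for unit in ratings:
        m = len(unit)
        counts = Counter(unit)
        for c, rc in counts.items():
            for k, rk in counts.items():
                pairs = rc * (rc - 1) if c == k else rc * rk
                coincidences[(c, k)] += pairs / (m - 1)
    totals = Counter()
    for (c, _k), value in coincidences.items():
        totals[c] += value
    n = sum(totals.values())
    observed = sum(v for (c, k), v in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # undefined when only one label ever occurs
    return 1.0 - observed / expected


def gwet_ac1(ratings, categories=(0, 1)):
    """Gwet's AC1 for multiple raters and nominal labels, no missing values."""
    n_units = len(ratings)
    q = len(categories)
    agreement = 0.0
    prevalence = {c: 0.0 for c in categories}
    for unit in ratings:
        r = len(unit)
        counts = Counter(unit)
        agreement += sum(rc * (rc - 1) for rc in counts.values()) / (r * (r - 1))
        for c in categories:
            prevalence[c] += counts[c] / r
    p_a = agreement / n_units
    p_e = sum(
        (p / n_units) * (1 - p / n_units) for p in prevalence.values()
    ) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Toy data: 2 decisions, 4 raters each (1 = present, 0 = absent).
sample = [[1, 1, 1, 0], [0, 0, 0, 0]]
print(krippendorff_alpha_nominal(sample), gwet_ac1(sample))
```

Reporting both is a common choice because alpha can drop sharply when one label dominates, while AC1 is less sensitive to that kind of prevalence skew.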
ECAAMS Framework
6 psychological axes × 3 tiers + Action Alignment + Human Persona = 20 binary dimensions
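The dimension count can be checked with a short enumeration. The axis names come from the sections on this page. The tier names are not given here, so the sketch uses numbered placeholders, and the label format is purely illustrative.

```python
AXES = ["Emotion", "Cognition", "Action", "Arousal", "Meaning", "Social"]
TIERS = [1, 2, 3]  # placeholder tier identifiers; the real tier names are not stated
EXTRA = ["Action Alignment", "Human Persona"]

# 6 axes x 3 tiers, plus the two standalone dimensions.
dimensions = [f"{axis}/tier{tier}" for axis in AXES for tier in TIERS] + EXTRA

print(len(dimensions))  # 20
```

This also appears to explain why Action and Arousal each list 4 dimensions further down: Action Alignment seems to be grouped with Action and Human Persona with Arousal.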
Limitations & Caveats
What We Measured
20 psychological dimensions that capture how AI models express reasoning. Each dimension is a binary classification (present/absent) determined by majority vote across 4 independent LLM raters.
Emotion
3 dimensions
How often feelings surface in reasoning
"I feel confident about this play..."
Positive vs negative emotional framing
"This is exciting!" vs "This is worrying..."
Attempts to manage emotional state
"I need to stay calm here..."
Cognition
3 dimensions
Explicit logical analysis and calculation
"Calculating pot odds: 3:1..."
Thinking about own thought process
"Let me reconsider my approach..."
Internal debate or uncertainty
"On the other hand..." or "But maybe..."
Action
4 dimensions
Focus on taking action vs passive observation
"I need to make a move here..."
Taking ownership of decisions
"I decide to fold" vs "Folding is optimal"
Decisiveness in committing to choices
"I will raise" vs "Maybe I should..."
Reasoning matches actual action taken
Saying "I should fold" and then folding
Arousal
4 dimensions
Physical sensations like gut feelings
"Something feels off about this..."
Elevated energy or excitement
"This is intense!" or "Heart racing..."
Awareness of physical state
"I need to take a breath..."
Adopts human-like embodied reasoning
Using "I feel" or physical metaphors naturally
Meaning
3 dimensions
References to self-concept or role
"As a tight player, I typically..."
Judging own performance
"That was a good/bad call..."
Connecting decisions to a larger story
"This fits my strategy of..."
Social
3 dimensions
Modeling opponent beliefs and reasoning
"They probably think I have..."
Judging or evaluating opponents
"This player seems aggressive..."
Framing as competition or conflict
"I need to beat them here..."
Models Evaluated
Game Configuration
Starting Conditions
- 60,000 chips per model
- No-Limit Texas Hold'em rules
- Full game state provided each decision
- Extended reasoning enabled for all models
- Full reasoning traces captured every decision
Blind Schedule
What are blinds?
Mandatory bets that two players must post before each hand begins. The "small blind" (first number) and "big blind" (second number) rotate around the table. Blinds increase over time to force action.
Small blind / Big blind (forced bets)
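To make the rotation and escalation concrete, here is a minimal sketch. The blind amounts and the number of hands per level are invented for illustration and are not the PsychBench schedule. The seat labels and the function name are likewise hypothetical.

```python
# Hypothetical schedule: (small blind, big blind) per level. Illustrative values only.
BLIND_LEVELS = [(50, 100), (100, 200), (200, 400)]
HANDS_PER_LEVEL = 10  # assumed; the real escalation rule is not stated here


def blinds_for_hand(hand_index, seats):
    """Return (small blind seat, big blind seat, small blind, big blind).

    hand_index is 0-based. The blind positions move one seat per hand, and
    the amounts step up every HANDS_PER_LEVEL hands until the last level.
    """
    level = min(hand_index // HANDS_PER_LEVEL, len(BLIND_LEVELS) - 1)
    small, big = BLIND_LEVELS[level]
    sb_seat = seats[hand_index % len(seats)]
    bb_seat = seats[(hand_index + 1) % len(seats)]
    return sb_seat, bb_seat, small, big


seats = ["A", "B", "C", "D", "E", "F"]
print(blinds_for_hand(0, seats))
print(blinds_for_hand(11, seats))
```

Because the stacks are fixed at the start while the blinds grow, each forced bet consumes a larger share of a player's chips over time, which is what pushes players into action.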