Methodology
How PsychBench surfaces behavioral differences in AI reasoning through adversarial poker gameplay.
Why Poker?
Poker is uniquely suited to revealing how AI models reason under pressure. Unlike traditional benchmarks, poker requires simultaneous deployment of skills that expose genuine behavioral differences:
Incomplete Information
Making decisions with hidden cards tests reasoning under uncertainty
Multi-Agent Dynamics
Competing against multiple opponents reveals social modeling capabilities
Risk Assessment
Bet sizing and fold decisions expose risk tolerance and confidence
Temporal Reasoning
Strategy must adapt across hands as blinds escalate and stacks change
Study Design
Full Table Mode
6 models compete simultaneously at a single table. Tests multi-way decision making, position awareness, and how models adapt to diverse opponents.
Heads-Up Mode
Round-robin 1v1 between all model pairs. Tests direct confrontation, adaptation to specific opponents, and strategic depth in isolation.
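A round-robin schedule of this kind can be sketched as follows. This is a minimal illustration, not the benchmark's scheduler: the model names are hypothetical placeholders, and it assumes each unordered pair meets once.

```python
from itertools import combinations

# Hypothetical placeholder names; the actual model list is not given here.
models = ["model_a", "model_b", "model_c", "model_d", "model_e", "model_f"]


def round_robin_pairs(players):
    """Return every unordered pair of players exactly once."""
    return list(combinations(players, 2))


pairs = round_robin_pairs(models)
# With n players there are n * (n - 1) / 2 matchups; 6 players give 15.
```

The number of matchups grows quadratically with the number of models, which is worth keeping in mind when adding models to a heads-up study.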
How We Built the Classifiers
Iterative human-in-the-loop refinement to establish trustworthy psychological labels
Multi-Rater Classification Pipeline
4 diverse LLM raters independently classify each decision; a majority vote determines the final label
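The voting step might look like the sketch below. It assumes the 3-of-4 agreement threshold stated on this page; the function name and the `"present"`/`"absent"` labels are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(rater_labels: Sequence[str], min_agree: int = 3) -> Optional[str]:
    """Return the label that at least `min_agree` raters chose, else None."""
    if not rater_labels:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(rater_labels).most_common(1)[0]
    return label if count >= min_agree else None


# Hypothetical rater outputs for one decision.
agreed = consensus_label(["present", "present", "present", "absent"])
split = consensus_label(["present", "present", "absent", "absent"])
```

With four raters and a threshold of three, at most one label can qualify, so a 2-2 split simply yields no label rather than an arbitrary tie-break.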
Why Trust These Labels?
Inter-rater reliability metrics ensure consistent, reproducible classifications
Diverse Raters
4 LLMs from 4 providers reduce single-model bias.
Majority Consensus
Labels are applied only when at least 3 of 4 raters agree.
IRR Metrics
Statistical measures for agreement beyond chance.
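This page does not name the specific agreement statistics used. As one illustration, Fleiss' kappa is a common chance-corrected measure when a fixed number of raters assign categorical labels. The sketch below is an assumption about what such a metric could look like here, with hypothetical label names.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings: list of items, each a list of category labels (one per rater).
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Share of all assignments that went to each category
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    # Observed agreement per item, then averaged over items
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # undefined when every rating is the same category
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 4 raters, 2 items, perfect agreement on each item.
perfect = fleiss_kappa([["present"] * 4, ["absent"] * 4], ["present", "absent"])
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.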
ECAAMS Framework
19 dimensions across 6 psychological axes
Limitations & Caveats
What We Measured
19 psychological dimensions that capture how AI models express reasoning.
Emotion
3 dimensions
How often feelings surface in reasoning
"I feel confident about this play..."
Positive vs negative emotional framing
"This is exciting!" vs "This is worrying..."
Attempts to manage emotional state
"I need to stay calm here..."
Cognition
3 dimensions
Explicit logical analysis and calculation
"Calculating pot odds: 3:1..."
Thinking about own thought process
"Let me reconsider my approach..."
Internal debate or uncertainty
"On the other hand..." or "But maybe..."
Action
4 dimensions
Focus on taking action vs passive observation
"I need to make a move here..."
Taking ownership of decisions
"I decide to fold" vs "Folding is optimal"
Decisiveness in committing to choices
"I will raise" vs "Maybe I should..."
Reasoning matches actual action taken
Saying "I should fold" and then folding
Arousal
3 dimensions
Physical sensations like gut feelings
"Something feels off about this..."
Elevated energy or excitement
"This is intense!" or "Heart racing..."
Awareness of physical state
"I need to take a breath..."
Meaning
3 dimensions
References to self-concept or role
"As a tight player, I typically..."
Judging own performance
"That was a good/bad call..."
Connecting decisions to a larger story
"This fits my strategy of..."
Social
3 dimensions
Modeling opponent beliefs and reasoning
"They probably think I have..."
Judging or evaluating opponents
"This player seems aggressive..."
Framing as competition or conflict
"I need to beat them here..."
Models Evaluated
Game Configuration
Starting Conditions
- 60,000 chips per model
- No-Limit Texas Hold'em rules
- Full game state provided each decision
- Extended reasoning enabled for all models
- Full reasoning traces captured every decision
Blind Schedule
Before each hand, two players must post mandatory bets called "blinds." These forced bets create action and increase over time to speed up the game.
Higher blinds = bigger pots = more pressure to make decisions
Bluff Classification
We classify each aggressive action (raise) as a bluff, semi-bluff, or value bet by comparing the hand's equity against street-specific thresholds. This approach follows the equity-based definitions commonly used in poker analytics.
Classification Definitions
Street-Based Equity Thresholds
Thresholds vary by street because later streets reveal more community cards, which changes both the information available and the risk profile of a raise.
| Street | Bluff | Semi-Bluff | Value |
|---|---|---|---|
| Pre-flop | <25% | 25-50% | >50% |
| Flop | <20% | 20-50% | >50% |
| Turn | <18% | 18-45% | >45% |
| River | <15% | 15-55% | >55% |
River raises are more polar (bluff or value) because there are no more cards to come. Earlier streets allow semi-bluffs with drawing hands.
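The threshold table can be applied with a simple lookup, sketched below. This assumes equity is supplied as a percentage and that boundary values fall in the semi-bluff band, as the table's ranges suggest. How equity itself is computed is not specified here, and the names `THRESHOLDS` and `classify_raise` are hypothetical.

```python
# (bluff_below, value_above) equity thresholds in percent, per street,
# copied from the table above.
THRESHOLDS = {
    "preflop": (25.0, 50.0),
    "flop": (20.0, 50.0),
    "turn": (18.0, 45.0),
    "river": (15.0, 55.0),
}


def classify_raise(street: str, equity_pct: float) -> str:
    """Label a raise as 'bluff', 'semi-bluff', or 'value' from hand equity."""
    bluff_below, value_above = THRESHOLDS[street]
    if equity_pct < bluff_below:
        return "bluff"
    if equity_pct > value_above:
        return "value"
    # Everything between the two thresholds, boundaries included.
    return "semi-bluff"


flop_label = classify_raise("flop", 35.0)
```

Keeping the thresholds in a single table makes it easy to audit them against the published values or to test alternative cutoffs.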