Methodology
How PsychBench surfaces behavioral differences in AI reasoning through adversarial poker gameplay.
Why Poker?
Poker is uniquely suited to revealing how AI models reason under pressure. Unlike traditional benchmarks, poker requires simultaneous deployment of skills that expose genuine behavioral differences:
Incomplete Information
Making decisions with hidden cards tests reasoning under uncertainty
Multi-Agent Dynamics
Competing against multiple opponents reveals social modeling capabilities
Risk Assessment
Bet sizing and fold decisions expose risk tolerance and confidence
Temporal Reasoning
Strategy must adapt across hands as blinds escalate and stacks change
Study Design
Public Rating Mode
The public leaderboard emphasizes 1v1 outcomes and normalized ratings so models with different evaluation coverage can still be compared clearly.
Heads-Up Mode
Round-robin 1v1 between all model pairs. Tests direct confrontation, adaptation to specific opponents, and strategic depth in isolation.
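As a minimal sketch of what a round-robin schedule looks like, the snippet below enumerates every unordered pair of models exactly once. The model names are hypothetical placeholders, and the real harness may add details the page does not describe, such as repeated games per pair or seat alternation.

```python
from itertools import combinations

def round_robin_pairs(models):
    """Return every unordered pair of models, one heads-up matchup per pair."""
    return list(combinations(models, 2))

# Hypothetical model identifiers, for illustration only.
pairs = round_robin_pairs(["model_a", "model_b", "model_c", "model_d"])
for first, second in pairs:
    print(f"{first} vs {second}")
```

With n models this yields n * (n - 1) / 2 matchups, so the schedule grows quadratically as models are added.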
Heads-Up Elo
The heads-up leaderboard is ranked by a Bradley-Terry Elo score, not raw win totals. Raw wins are useful context, but they can be misleading when models have different numbers of completed games. Elo estimates relative strength from pairwise outcomes, so beating a stronger opponent moves a model more than beating a weaker one.
Input
Completed 1v1 matchups contribute pairwise outcomes to the internal rating fit.
Estimate
A Bradley-Terry model fits one strength value per model from the full pairwise graph, then converts those strengths to an Elo-style scale.
Uncertainty
Bootstrap confidence intervals show how stable each rating is. Incomplete runs are excluded from public rankings.
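The three steps above can be sketched in a few lines of Python. This is an illustrative approximation, not PsychBench's actual implementation: the function names, the game records, the small prior (virtual wins and losses against a fixed anchor, used to keep estimates finite), and the percentile bootstrap are all assumptions of this sketch.

```python
import math
import random
from statistics import median

def fit_bradley_terry(games, models, prior=0.5, iters=200):
    """Fit strengths p_i > 0 from (winner, loser) pairs with MM updates.

    Assumption of this sketch: each model also gets `prior` virtual wins and
    losses against a fixed anchor of strength 1, so every estimate stays
    finite even if a model has no wins in a resample.
    """
    wins = {m: prior for m in models}
    pair_games = {}
    for winner, loser in games:
        wins[winner] += 1
        key = frozenset((winner, loser))
        pair_games[key] = pair_games.get(key, 0) + 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = 2 * prior / (p[i] + 1.0)  # games against the anchor
            for pair, count in pair_games.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += count / (p[i] + p[j])
            updated[i] = wins[i] / denom
        p = updated
    return p

def to_elo(p, center=1500.0):
    """Convert strengths to an Elo-style scale with the median at `center`."""
    theta = {m: math.log(v) for m, v in p.items()}
    mid = median(theta.values())
    return {m: center + 400.0 / math.log(10) * (t - mid) for m, t in theta.items()}

def bootstrap_intervals(games, models, n_boot=200, seed=0):
    """Percentile intervals (roughly 95%) from resampling games with replacement."""
    rng = random.Random(seed)
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        resample = rng.choices(games, k=len(games))
        ratings = to_elo(fit_bradley_terry(resample, models))
        for m in models:
            samples[m].append(ratings[m])
    out = {}
    for m, vals in samples.items():
        vals.sort()
        out[m] = (vals[int(0.025 * n_boot)], vals[int(0.975 * n_boot) - 1])
    return out

# Made-up results for three hypothetical models A, B, C.
models = ["A", "B", "C"]
games = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
ratings = to_elo(fit_bradley_terry(games, models))
intervals = bootstrap_intervals(games, models)
for m in models:
    low, high = intervals[m]
    print(f"{m}: {ratings[m]:.0f} (bootstrap range {low:.0f} to {high:.0f})")
```

Resampling whole games is the simplest bootstrap unit; a production pipeline might instead resample matchups or sessions if outcomes within a session are correlated.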
The math behind the ranking
Each model gets a latent strength parameter, written as theta. The Bradley-Terry model asks: given two models with strengths theta_i and theta_j, how likely is model i to beat model j?
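In its standard form (assumed here, since the page does not spell out its exact parametrization), the Bradley-Terry answer depends only on the difference between the two strengths:

```latex
\[
P(i \text{ beats } j)
  = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}
  = \frac{1}{1 + e^{-(\theta_i - \theta_j)}}
\]
```

Equal strengths give a 50% win probability, and the probability approaches 100% as theta_i grows relative to theta_j.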
The fitted theta values maximize the likelihood of the observed pairwise results. We then center the field at 1500, so 1500 means median strength within this evaluated model set. A 100-point Elo gap corresponds to about a 64% expected win rate for the higher-rated model against the lower-rated one. A 200-point gap corresponds to about 76%.
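The quoted win rates can be checked with the conventional Elo logistic curve. The 400-point base-10 scale used below is an assumption (it is the usual Elo convention and is consistent with the 64% and 76% figures); `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(elo_gap):
    """Expected win rate of the higher-rated model, given its rating lead."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))

for gap in (0, 100, 200):
    print(f"{gap}-point gap: {expected_win_rate(gap):.1%}")
```

Because the curve depends only on the rating difference, shifting every rating by the same amount (for example, to center the field at 1500) leaves all expected win rates unchanged.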
Public rankings show normalized ratings and confidence ranges; raw per-model records are retained internally for audit and calibration.
How We Built the Classifiers
Iterative human-in-the-loop refinement to establish trustworthy psychological labels
Multi-Rater Classification Pipeline
Four diverse LLM raters independently classify each decision; a label becomes final only when at least 3 of the 4 agree
Why Trust These Labels?
Inter-rater reliability (IRR) metrics check that classifications are consistent and reproducible
Diverse Raters
4 LLMs from 4 providers reduce single-model bias.
Majority Consensus
Labels are applied only when at least 3 of 4 raters agree.
IRR Metrics
Statistical measures for agreement beyond chance.
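A minimal sketch of the consensus rule and one common chance-corrected agreement statistic follows. The page does not name the IRR metrics it uses, so Fleiss' kappa is an assumption chosen because it handles more than two raters; the label values and function names are hypothetical.

```python
from collections import Counter

def consensus_label(votes, min_agree=3):
    """Return the label chosen by at least `min_agree` raters, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item vote lists (same rater count each).

    Undefined (division by zero) when every vote falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({v for item in ratings for v in item})
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Per-item agreement: share of rater pairs that chose the same label.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Agreement expected by chance, from overall category proportions.
    totals = [sum(row[k] for row in counts) for k in range(len(categories))]
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical votes from 4 raters on 4 decisions.
votes = [
    ["present", "present", "present", "absent"],
    ["absent", "absent", "absent", "absent"],
    ["present", "absent", "present", "absent"],
    ["present", "present", "present", "present"],
]
labels = [consensus_label(v) for v in votes]
kappa = fleiss_kappa(votes)
print(labels)
print(round(kappa, 3))
```

A 2-2 split returns `None`, which matches the rule that a label is applied only with 3 of 4 raters in agreement.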
ECAAMS Framework
19 dimensions across 6 psychological axes
Limitations & Caveats
What We Measured
19 psychological dimensions that capture how AI models express reasoning.
Emotion
3 dimensions
How often feelings surface in reasoning
"I feel confident about this play..."
Positive vs negative emotional framing
"This is exciting!" vs "This is worrying..."
Attempts to manage emotional state
"I need to stay calm here..."
Cognition
3 dimensions
Explicit logical analysis and calculation
"Calculating pot odds: 3:1..."
Thinking about own thought process
"Let me reconsider my approach..."
Internal debate or uncertainty
"On the other hand..." or "But maybe..."
Action
4 dimensions
Focus on taking action vs passive observation
"I need to make a move here..."
Taking ownership of decisions
"I decide to fold" vs "Folding is optimal"
Decisiveness in committing to choices
"I will raise" vs "Maybe I should..."
Reasoning matches actual action taken
Saying "I should fold" and then folding
Arousal
3 dimensions
Physical sensations like gut feelings
"Something feels off about this..."
Elevated energy or excitement
"This is intense!" or "Heart racing..."
Awareness of physical state
"I need to take a breath..."
Meaning
3 dimensions
References to self-concept or role
"As a tight player, I typically..."
Judging own performance
"That was a good/bad call..."
Connecting decisions to a larger story
"This fits my strategy of..."
Social
3 dimensions
Modeling opponent beliefs and reasoning
"They probably think I have..."
Judging or evaluating opponents
"This player seems aggressive..."
Framing as competition or conflict
"I need to beat them here..."
Models Evaluated
Game Configuration
Starting Conditions
- 60,000 chips per model
- No-Limit Texas Hold'em rules
- Full game state provided each decision
- Extended reasoning enabled for all models
- Full reasoning traces captured every decision
Blind Schedule
Before each hand, two players must post mandatory bets called "blinds." These forced bets create action and increase over time to speed up the game.
Higher blinds = bigger pots = more pressure to make decisions
Bluff Classification
We classify each aggressive action (raise) as bluff, semi-bluff, or value based on hand equity and street-specific thresholds. This method aligns with standard poker analytics definitions used by professional players.
Classification Definitions
Street-Based Equity Thresholds
Thresholds vary by street because later streets have more information and different risk profiles.
| Street | Bluff (hand equity) | Semi-Bluff (hand equity) | Value (hand equity) |
|---|---|---|---|
| Pre-flop | <25% | 25-50% | >50% |
| Flop | <20% | 20-50% | >50% |
| Turn | <18% | 18-45% | >45% |
| River | <15% | 15-55% | >55% |
River raises are more polar (bluff or value) because there are no more cards to come. Earlier streets allow semi-bluffs with drawing hands.
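The threshold table translates directly into a lookup. The sketch below is illustrative: the function and dictionary names are hypothetical, equity is assumed to be a fraction between 0 and 1, and boundary values (for example exactly 50% on the flop) are assigned to semi-bluff, following the table's strict `<` and `>` signs.

```python
# (bluff_below, value_above) hand-equity cutoffs per street, from the table.
THRESHOLDS = {
    "preflop": (0.25, 0.50),
    "flop": (0.20, 0.50),
    "turn": (0.18, 0.45),
    "river": (0.15, 0.55),
}

def classify_raise(street, equity):
    """Label a raise as 'bluff', 'semi-bluff', or 'value' from hand equity."""
    bluff_below, value_above = THRESHOLDS[street]
    if equity < bluff_below:
        return "bluff"
    if equity > value_above:
        return "value"
    return "semi-bluff"

print(classify_raise("flop", 0.10))
print(classify_raise("turn", 0.30))
print(classify_raise("river", 0.60))
```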