PsychBench

Methodology

How PsychBench surfaces behavioral differences in AI reasoning through adversarial poker gameplay.

Why Poker?

Poker is uniquely suited to revealing how AI models reason under pressure. Unlike traditional benchmarks, it requires several skills to be used at once, and each of them exposes behavioral differences between models:

Incomplete Information

Making decisions with hidden cards tests reasoning under uncertainty

Multi-Agent Dynamics

Competing against multiple opponents reveals social modeling capabilities

Risk Assessment

Bet sizing and fold decisions expose risk tolerance and confidence

Temporal Reasoning

Strategy must adapt across hands as blinds escalate and stacks change

Study Design

760+ total games
100K+ decisions
12 models

Full Table Mode

6 models compete simultaneously at a single table. Tests multi-way decision making, position awareness, and how models adapt to diverse opponents.

6-player tournaments with escalating blinds

Heads-Up Mode

Round-robin 1v1 between all model pairs. Tests direct confrontation, adaptation to specific opponents, and strategic depth in isolation.

Each model pair plays 10 games
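
As a rough illustration of the round-robin schedule, the sketch below enumerates every unordered model pair and repeats it ten times. The model names come from the Models Evaluated list; the function name `heads_up_schedule` is hypothetical, and the assumption that all 12 models take part in heads-up play is ours for the example, not something the study design states.

```python
from itertools import combinations

# Model names as listed under "Models Evaluated".
MODELS = [
    "Claude Opus 4.5", "Claude Opus 4.6", "Claude Sonnet 4.6",
    "GPT-5.2", "Gemini 3 Pro", "Gemini 3.1 Pro",
    "Grok 4.1", "DeepSeek v3.2", "Qwen3-235B Thinking",
    "Kimi 2.5", "GLM-5", "Qwen3.5-Plus",
]
GAMES_PER_PAIR = 10


def heads_up_schedule(models, games_per_pair=GAMES_PER_PAIR):
    """Return (model_a, model_b, game_index) tuples for a 1v1 round-robin."""
    return [
        (a, b, game)
        for a, b in combinations(models, 2)  # every unordered pair once
        for game in range(1, games_per_pair + 1)
    ]


schedule = heads_up_schedule(MODELS)
pairs = {(a, b) for a, b, _ in schedule}
print(len(pairs))     # 66 distinct pairs
print(len(schedule))  # 660 heads-up games
```

Under that assumption, heads-up play would cover 66 pairs and 660 games.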

How We Built the Classifiers

Iterative human-in-the-loop refinement to establish trustworthy psychological labels

1. Define
2. Test
3. Review
4. Iterate
3-5 iterations per dimension

Multi-Rater Classification Pipeline

Four diverse LLM raters independently classify each decision; a majority vote determines the final label.

Input: 100K+ reasoning traces
Raters: Haiku 4.5, Grok 4.1 Fast, GPT 5 Mini, Gemini 3 Flash (4 independent LLM raters classify each trace)
Consensus: 3 of 4 raters agree
Output: binary labels on 19 dimensions

Why Trust These Labels?

Inter-rater reliability metrics check that the classifications are consistent and reproducible

Diverse Raters

4 LLMs from 4 providers reduce single-model bias.

Anthropic, xAI, OpenAI, Google

Majority Consensus

Labels are applied only when at least 3 of 4 raters agree.

75% minimum agreement
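
The consensus rule can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function name `consensus_label` is hypothetical, and returning `None` for a 2-2 split is our assumption about what happens when the threshold is not met.

```python
from collections import Counter

MIN_AGREEMENT = 3  # at least 3 of the 4 raters must agree


def consensus_label(votes, min_agreement=MIN_AGREEMENT):
    """Return the label backed by enough raters, or None if there is no consensus.

    `votes` holds one binary vote per rater for a single dimension of a trace.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


print(consensus_label([True, True, True, False]))   # True
print(consensus_label([True, True, False, False]))  # None
```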

IRR Metrics

Statistical measures for agreement beyond chance.

Krippendorff's α > 0.67
Gwet's AC1 > 0.80
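
For readers unfamiliar with these statistics, here is a minimal sketch of Gwet's AC1 for binary labels, using the standard multi-rater formula restricted to two categories. The page does not say how the metrics were computed, so the function name `gwet_ac1_binary` and the input layout (a count of positive votes per item, with every rater voting on every item) are assumptions made for illustration.

```python
def gwet_ac1_binary(positive_votes, n_raters=4):
    """Gwet's AC1 for binary labels.

    positive_votes[i] is how many of the n_raters marked item i as present.
    Assumes every rater labelled every item and the list is non-empty.
    """
    n_items = len(positive_votes)
    rater_pairs = n_raters * (n_raters - 1)

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_a = sum(
        (k * (k - 1) + (n_raters - k) * (n_raters - k - 1)) / rater_pairs
        for k in positive_votes
    ) / n_items

    # Chance agreement for two categories: 2 * pi * (1 - pi),
    # where pi is the average share of positive votes.
    pi = sum(k / n_raters for k in positive_votes) / n_items
    p_e = 2 * pi * (1 - pi)

    return (p_a - p_e) / (1 - p_e)


# Three unanimous items and one 3-of-4 split.
print(gwet_ac1_binary([4, 4, 4, 3]) > 0.80)
```

Because chance agreement in the two-category case never exceeds 0.5, the denominator cannot reach zero.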

ECAAMS Framework

19 dimensions across 6 psychological axes

E  Emotion    3 dims
C  Cognition  3 dims
A  Action     4 dims
A  Arousal    3 dims
M  Meaning    3 dims
S  Social     3 dims
Tier Structure: Presence → Direction → Regulation

Limitations & Caveats

  • Classifications are based on expressed language, not internal states
  • Poker domain only; results may not generalize
  • Models differ in verbosity; rates are normalized per model
  • LLM raters may share training biases

What We Measured

19 psychological dimensions that capture how AI models express reasoning.


Models Evaluated

  • Claude Opus 4.5 (Anthropic)
  • Claude Opus 4.6 (Anthropic)
  • Claude Sonnet 4.6 (Anthropic)
  • GPT-5.2 (OpenAI)
  • Gemini 3 Pro (Google)
  • Gemini 3.1 Pro (Google)
  • Grok 4.1 (xAI), actions only (no traces)
  • DeepSeek v3.2 (DeepSeek)
  • Qwen3-235B Thinking (Alibaba)
  • Kimi 2.5 (Moonshot)
  • GLM-5 (Zhipu)
  • Qwen3.5-Plus (Alibaba)

Game Configuration

Starting Conditions

  • 60,000 chips per model
  • No-Limit Texas Hold'em rules
  • Full game state provided each decision
  • Extended reasoning enabled for all models
  • Full reasoning traces captured every decision

Blind Schedule

Before each hand, two players must post mandatory bets called "blinds." These forced bets create action and increase over time to speed up the game.

Hands     Small / Big blind
1-10      300 / 600
11-20     500 / 1,000
21-30     1,000 / 2,000
31+       Doubles every 10 hands
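
The schedule can be written as a small lookup function. This is a sketch under one assumption: from hand 31 onward, the 1,000/2,000 level doubles once per 10-hand block, so hands 31-40 play at 2,000/4,000. The function name `blinds_for_hand` is hypothetical.

```python
STARTING_STACK = 60_000


def blinds_for_hand(hand_number):
    """Return (small_blind, big_blind) for a 1-indexed hand number."""
    if hand_number < 1:
        raise ValueError("hand numbers start at 1")
    if hand_number <= 10:
        return 300, 600
    if hand_number <= 20:
        return 500, 1_000
    if hand_number <= 30:
        return 1_000, 2_000
    # Assumption: one doubling of 1,000/2,000 per 10-hand block from hand 31.
    doublings = (hand_number - 31) // 10 + 1
    small = 1_000 * 2 ** doublings
    return small, 2 * small


for hand in (1, 11, 21, 31, 41):
    small, big = blinds_for_hand(hand)
    # Starting stack expressed in big blinds at this level.
    print(hand, small, big, STARTING_STACK / big)
```

The starting stack of 60,000 chips is 100 big blinds at hand 1 and, under the doubling assumption, 7.5 big blinds by hand 41.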

Higher blinds = bigger pots = more pressure to make decisions

Bluff Classification

We classify each aggressive action (raise) as bluff, semi-bluff, or value based on hand equity and street-specific thresholds. This method aligns with standard poker analytics definitions used by professional players.

Classification Definitions

Bluff: EV when called is negative; needs fold equity to profit
Semi-Bluff: Behind, but with enough equity or future pressure to be +EV
Value: Happy to get called; EV when called is positive
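
To make "EV when called" and "fold equity" concrete, here is a deliberately simplified sketch. It models a single bet into an existing pot against one opponent who either folds or calls the same amount, with no further betting; a raise facing a prior bet is more involved. The function names and this simplified model are assumptions for illustration, not the classifier's actual computation.

```python
def ev_when_called(equity, pot, bet):
    """EV of the bet given a call: win the pot plus the call, or lose the bet."""
    return equity * (pot + bet) - (1 - equity) * bet


def breakeven_fold_frequency(equity, pot, bet):
    """Smallest fold frequency at which the bet breaks even overall.

    Solves f * pot + (1 - f) * ev_when_called = 0 for f.
    Returns 0.0 when the bet is already profitable when called.
    """
    ev_call = ev_when_called(equity, pot, bet)
    if ev_call >= 0:
        return 0.0
    return -ev_call / (pot - ev_call)


# 20% equity, betting 500 into a pot of 1,000: negative EV when called,
# so the bet needs some folds to show a profit.
print(ev_when_called(0.20, 1_000, 500))
print(breakeven_fold_frequency(0.20, 1_000, 500))
```

With zero equity the break-even fold frequency reduces to the familiar bet / (pot + bet).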

Street-Based Equity Thresholds

Thresholds vary by street because later streets have more information and different risk profiles.

Street      Bluff    Semi-Bluff    Value
Pre-flop    <25%     25-50%        >50%
Flop        <20%     20-50%        >50%
Turn        <18%     18-45%        >45%
River       <15%     15-55%        >55%

River raises are more polar (bluff or value) because there are no more cards to come. Earlier streets allow semi-bluffs with drawing hands.
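
The threshold table translates directly into a lookup. In this minimal sketch, equity is taken as an input between 0 and 1 (the page does not specify how it is estimated), and equity exactly on a boundary is treated as a semi-bluff, which is our reading of the ranges. The names `THRESHOLDS` and `classify_raise` are hypothetical.

```python
# (bluff_below, value_above) equity thresholds per street, from the table.
THRESHOLDS = {
    "preflop": (0.25, 0.50),
    "flop": (0.20, 0.50),
    "turn": (0.18, 0.45),
    "river": (0.15, 0.55),
}


def classify_raise(street, equity):
    """Label a raise as 'bluff', 'semi-bluff', or 'value' from hand equity."""
    bluff_below, value_above = THRESHOLDS[street]
    if equity < bluff_below:
        return "bluff"
    if equity > value_above:
        return "value"
    # Assumption: boundary values fall into the semi-bluff band.
    return "semi-bluff"


print(classify_raise("river", 0.10))  # bluff
print(classify_raise("turn", 0.30))   # semi-bluff
print(classify_raise("flop", 0.60))   # value
```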