GPT-5.5 is strong, quiet, and conservative. It is not the new poker king.

GPT-5.5 wins more often than not, but it plays like a patient grinder, not a table crusher. It is the least aggressive winner in the top tier, with a conservative style that intensifies under pressure.

59.4% heads-up win rate (101 wins across 170 games)

22% call rate (highest in the GPT family)

2-8 vs Claude Opus 4.6 (loses to every Opus variant)

TLDR

GPT-5.5 is competitive in our heads-up poker benchmark, but it doesn't dominate. It beats most of the field comfortably and loses to every Claude Opus model. What makes it interesting isn't the win rate. It's the way it wins: low aggression, high call frequency, and an action profile that barely moves regardless of whether it's up or down.

It also continues a trend in the GPT lineage: metacognition rises while decisiveness falls. GPT-5.5 thinks about its own thinking in more of its traces than any other model we've tested, and it's less confident in its actions than either GPT-5.4 or GPT-5.2.

GPT-5.5 ranks 7th of 20 models by Elo, behind GPT-5.4 and GPT-5.2.
It has the lowest aggression factor in the top 7 and the highest call rate in the GPT family.
When losing, it folds nearly twice as often and raises less. The conservative style intensifies under pressure.
62% of its reasoning traces mention the tool interface, spending tokens on tool mechanics rather than opponent modeling.

Solid, not dominant

GPT-5.5 posts a 59.4% heads-up win rate across 170 games, placing it 7th of 20 models with an Elo of 1555.5. It comfortably beats the bottom half of the field. Against Qwen3.5-397B it goes 9-1. Against GLM-5 and Qwen3-235B Thinking, 8-2.

The trouble is at the top. Against Claude Opus 4.6, GPT-5.5 goes 2-8. It loses to all three Claude Opus variants and draws with Claude Sonnet 4.6. Its 59.4% overall rate is respectable, but it doesn't separate from the Grok models statistically. The 95% confidence interval on its Elo (1437-1674) overlaps with everything ranked 4th through 10th.

Elo standings

GPT-5.5 lands below its predecessors

| Model | Rank | Win rate | Elo |
|---|---|---|---|
| Claude Opus 4.6 | 1st overall | 66.1% | 1619.3 |
| GPT-5.4 | 3rd overall | 63.5% | 1596.8 |
| GPT-5.2 | 5th overall | 62.5% | 1587.3 |
| GPT-5.5 | 7th overall | 59.4% | 1555.5 |

Win rates and Elo come from the published heads-up Bradley-Terry (BT) ratings. GPT-5.5 played 170 games across 17 matchups (10 games each). It wasn't included in the full-table tournament.
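To put the rating gaps in perspective, here is a minimal sketch that converts them into expected head-to-head win shares. It assumes the published ratings sit on the conventional logistic Elo scale (400 points per factor of ten in odds); the article does not state the scale, so `expected_score` and its `scale` parameter are illustrative rather than the benchmark's own method.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win share of A against B on a logistic Elo curve.

    Assumption: the conventional 400-point scale applies to these ratings.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Elo values quoted in the standings above.
ratings = {
    "Claude Opus 4.6": 1619.3,
    "GPT-5.4": 1596.8,
    "GPT-5.2": 1587.3,
    "GPT-5.5": 1555.5,
}

# Expected share for GPT-5.5 against each model listed above it.
expected = {
    name: expected_score(ratings["GPT-5.5"], rating)
    for name, rating in ratings.items()
    if name != "GPT-5.5"
}

for name, share in expected.items():
    print(f"GPT-5.5 vs {name}: expected win share {share:.3f}")
```

Under that assumption, the 63.8-point gap to Claude Opus 4.6 implies an expected share of roughly 41% for GPT-5.5, while the observed record is 2-8. Ten games is a small sample, which fits the wide confidence interval reported on the Elo.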

The GPT family got more cautious

GPT-5.5 is the least aggressive model in the GPT family. Its aggression factor of 1.68 is about 24% lower than GPT-5.4's 2.22. The most visible shift is in the call rate: GPT-5.5 calls 22% of the time, compared to 19.5% for GPT-5.4 and 18.2% for GPT-5.2.

In practice, this means GPT-5.5 enters pots at a similar rate to its predecessors (VPIP 58.9%) but applies less pressure once it's in. It sees more flops and goes to more showdowns, and it relies on hand strength rather than fold equity to win.

Action profile across GPT family

More calling, less raising

| Action | GPT-5.5 | GPT-5.4 | GPT-5.2 |
|---|---|---|---|
| Fold | 17% | 12% | 11.5% |
| Check | 24% | 25.2% | 24.6% |
| Call | 22% | 19.5% | 18.2% |
| Raise | 37% | 43.3% | 45.8% |
| Aggression ratio | 1.68 | 2.22 | 2.52 |

Action rates from behavioral_metrics_api.json. GPT-5.5: 10,731 decisions. GPT-5.4: 8,251 decisions. GPT-5.2: 6,488 decisions. All heads-up mode.
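The aggression ratios above are consistent with dividing each model's raise rate by its call rate. The sketch below checks that arithmetic against the published action rates. Treating the ratio as raise rate over call rate is an inference from the numbers, not a definition given in the article, and `aggression_ratio` is a hypothetical helper name.

```python
# Action rates (percent of decisions) as published in the action profile.
rates = {
    "GPT-5.5": {"raise": 37.0, "call": 22.0},
    "GPT-5.4": {"raise": 43.3, "call": 19.5},
    "GPT-5.2": {"raise": 45.8, "call": 18.2},
}


def aggression_ratio(raise_rate: float, call_rate: float) -> float:
    """Raises per call. Assumed definition, inferred from the published figures."""
    return raise_rate / call_rate


ratios = {
    model: round(aggression_ratio(r["raise"], r["call"]), 2)
    for model, r in rates.items()
}

# Relative drop in aggression from GPT-5.4 to GPT-5.5.
relative_drop = 1 - ratios["GPT-5.5"] / ratios["GPT-5.4"]

print(ratios)
print(f"GPT-5.5 vs GPT-5.4: {relative_drop:.1%} lower")
```

Under this definition, a model's ratio falls when it raises less, calls more, or both. GPT-5.5 does both relative to its predecessors.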

When it's losing, it gets quieter

GPT-5.5 does tilt, but not the way most players do. When its chip stack drops below the starting line, its fold rate nearly doubles: from 10.1% on winning stretches to 19.6% on losing stretches. Its raise rate drops from 23.1% to 18.9%. Calls increase. The pattern is clear: the already-conservative model becomes more conservative under pressure.

This is the opposite of the standard human failure mode, which is aggression: chasing losses with bigger bets. GPT-5.5 retreats instead. It doesn't chase. It tightens.

Whether that's a strength or a weakness depends on the opponent. Against aggressive models that apply pressure, retreating means surrendering equity. Against passive opponents, it means surviving to fight in better spots. On our strategy drift metric, GPT-5.5 lands in the middle of the field. Moderate drift, not zero.

Action shift on losing streaks

Quieter under pressure

| Action | Winning stretches | Losing stretches | Shift |
|---|---|---|---|
| Fold | 10.1% | 19.6% | +9.5pp |
| Check | 27.6% | 24.5% | -3.1pp |
| Call | 22.7% | 27.1% | +4.4pp |
| Raise | 23.1% | 18.9% | -4.2pp |

The largest shift is in folds: +9.5 percentage points on losing streaks. Raises drop 4.2pp. The conservative style intensifies under pressure.

Strategy drift computed from the May 4 GPT-5.5 rerun with deduplicated game IDs (4,395 weighted decisions). Winning and losing stretches defined by consecutive chip-stack direction relative to starting stack.
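The shifts quoted above are plain differences in percentage points between the two conditions. A minimal sketch of that arithmetic, using only the published rates (`pp_shift` is a hypothetical helper name):

```python
# Action rates (percent) on winning and losing stretches, as published above.
winning = {"fold": 10.1, "check": 27.6, "call": 22.7, "raise": 23.1}
losing = {"fold": 19.6, "check": 24.5, "call": 27.1, "raise": 18.9}


def pp_shift(before: dict, after: dict) -> dict:
    """Percentage-point change per action, rounded to one decimal place."""
    return {action: round(after[action] - before[action], 1) for action in before}


shift = pp_shift(winning, losing)

# Ratio of fold rates: how much more often the model folds when losing.
fold_ratio = losing["fold"] / winning["fold"]

print(shift)
print(f"fold rate multiplier on losing stretches: {fold_ratio:.2f}x")
```

The fold multiplier comes out just under 2, which is where the "nearly doubles" description comes from.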

More self-aware, less decisive

The GPT family shows a clear generational trajectory in our ECAAMS classification. Each version "thinks about its thinking" more often, and each version is less confident in its final action. GPT-5.2 shows metacognition in 1.6% of traces. GPT-5.4 rises to 6.2%. GPT-5.5 reaches 12.8%, the highest rate of any model in the dataset.

At the same time, action confidence falls: from 98.4% in GPT-5.2, to 93.9% in GPT-5.4, to 81.9% in GPT-5.5. The per-action breakdown sharpens this: when GPT-5.5 raises, only 69.3% of those decisions show high confidence in the trace. For checks and folds, the rate is above 89%. It second-guesses itself most when committing chips aggressively.

GPT family ECAAMS trajectory

Each generation thinks more, decides less

| Metric | Trend | GPT-5.2 | GPT-5.4 | GPT-5.5 |
|---|---|---|---|---|
| Metacognition | rising | 1.6% | 6.2% | 12.8% |
| Action confidence | falling | 98.4% | 93.9% | 81.9% |
| Emotional content | rising | 5.9% | 7.1% | 8.5% |
| Action alignment | falling | 95.9% | 95.5% | 94.7% |

69.3%: GPT-5.5 confidence on raise decisions

90.7%: GPT-5.5 confidence on check decisions

ECAAMS classifications from the May 4 GPT-5.5 trace rerun (4,432 classified traces, 3-rater consensus panel). GPT-5.2 and GPT-5.4 rates from psych_cards_profiles.json (4,248 and 4,514 traces respectively).

The traces read like study notes

Every GPT-5.5 reasoning trace begins with a bold section header like **Analyzing poker strategy**. No other model in the dataset does this. It frames each decision as a named analytical task before it begins reasoning.

The traces reveal genuine poker knowledge. GPT-5.5 references EV in 64% of traces, equity in 19.4%, pot odds in 10.7%, and specialist terms like SPR, blockers, and donk bets. But 62% of traces also mention the tool interface itself, often spending reasoning tokens on whether "raise amount" means the total bet or the increment. This is GPT-5.5's most distinctive cognitive signature: real strategic depth, partially consumed by interface anxiety.

Reasoning traces

Bold headers, real math, interface doubt

vs Claude Opus 4.5 / hand 33

risk math + fold equity

**Analyzing poker strategy**

"I'm thinking about a preflop raise of 32,000 with chips of the same amount, which was likely accepted. With A5o against a limp and 8.5 BB behind, it seems that an all-in shove is often optimal. I wonder if there's an alternative raise to 12,000."

vs Claude Opus 4.5 / hand 14

tool confusion

**Evaluating poker decisions**

"I can calculate 2.6:1 odds, considering the pot size. There's also the equity against the betting range to think about. I need to choose an action and define an amount to call -- but the tool only supports call, fold, or raise."

vs Qwen3-Max Thinking / hand 9

sizing + tool anxiety

**Deciding on betting strategy**

"I'm thinking through my previous bets: 800 into 1200, then 1200 into 3600 for a continuation bet. I wonder if betting 800 (which is 2/3 of the pot) or 600 would be better. I guess I really need a tool to assist with this!"

Terminology frequency in traces

How often each term appears across 4,432 visible reasoning summaries

| Term | Share of traces |
|---|---|
| EV / expected value | 64% |
| equity | 19.4% |
| pot odds | 10.7% |
| tool / schema | 62% |
| c-bet / continuation | 5.3% |
| blocker | 4.6% |

Trace analysis from 4,432 visible reasoning summaries in the May 4 GPT-5.5 rerun. The original May 3 run captured 0 readable GPT-5.5 traces due to a forced tool_choice configuration. The rerun used tool_choice=auto and summary=detailed. Terminology counts are case-insensitive substring matches.
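A case-insensitive substring count of the kind described above can be sketched as follows. The sample traces and search patterns are hypothetical stand-ins, not the rerun data or the patterns the analysis actually used; `trace_share` is an illustrative helper.

```python
from typing import Iterable, Sequence


def trace_share(traces: Sequence[str], patterns: Iterable[str]) -> float:
    """Fraction of traces containing at least one pattern, ignoring case."""
    needles = [p.lower() for p in patterns]
    if not traces:
        return 0.0
    hits = sum(1 for t in traces if any(n in t.lower() for n in needles))
    return hits / len(traces)


# Hypothetical mini-corpus written for this example (not real trace data).
sample = [
    "**Analyzing poker strategy** Calling has positive EV given pot odds.",
    "**Evaluating poker decisions** The tool only supports call, fold, or raise.",
    "**Deciding on betting strategy** My equity against this range is thin.",
    "**Analyzing poker strategy** Does the tool's raise amount mean total or increment?",
]

# Hypothetical pattern lists, one per reported term group.
term_patterns = {
    "EV / expected value": [" ev ", "expected value"],
    "equity": ["equity"],
    "pot odds": ["pot odds"],
    "tool / schema": ["tool", "schema"],
}

shares = {term: trace_share(sample, pats) for term, pats in term_patterns.items()}

for term, share in shares.items():
    print(f"{term}: {share:.0%}")

# A bare "ev" pattern also matches inside longer words such as "Evaluating".
loose_ev_share = trace_share(sample, ["ev"])
```

One design point worth noting: with substring matching, short patterns are sensitive to how they are delimited. In this toy corpus a bare `ev` matches twice as many traces as the space-delimited version, so the exact patterns behind any such count matter.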

Why this matters

GPT-5.5 is a useful model to benchmark precisely because it isn't flashy. It doesn't have the highest win rate, the most aggressive style, or the wildest reasoning traces. What it has is moderate strength and a conservative disposition that deepens under pressure, which makes it a clean baseline for measuring what aggression and adaptability actually buy.

The GPT family trajectory is worth watching. Across three versions, we see metacognition rising and decisiveness falling. Whether that reflects a training priority or an emergent property of scale, it's a measurable shift in how the models engage with uncertainty. For builders who depend on confident tool use, this isn't an abstract concern.

The full matchup data is on the heads-up matrix. The behavioral metrics are on the model profiles page.
