GPT-5.5 wins more often than not, but it plays like a patient grinder, not a table crusher. It is the least aggressive winner in the top tier, with a conservative style that intensifies under pressure.
GPT-5.5 is competitive in our heads-up poker benchmark, but it doesn't dominate. It beats most of the field comfortably and loses to every Claude Opus model. What makes it interesting isn't the win rate. It's the way it wins: low aggression, high call frequency, and an action profile that tightens rather than loosens when it falls behind.
It also continues a trend across the GPT lineage in which metacognition rises and decisiveness falls with each version. GPT-5.5 thinks more about its own thinking than any other model we've tested, and it's less confident in its actions than either GPT-5.4 or GPT-5.2.
GPT-5.5 posts a 59.4% heads-up win rate across 170 games, placing it 7th of 20 models with an Elo of 1555.5. It comfortably beats the bottom half of the field. Against Qwen3.5-397B it goes 9-1. Against GLM-5 and Qwen3-235B Thinking, 8-2.
The trouble is at the top. Against Claude Opus 4.6, GPT-5.5 goes 2-8. It loses to all three Claude Opus variants and draws with Claude Sonnet 4.6. Its 59.4% overall rate is respectable, but it doesn't separate from the Grok models statistically. The 95% confidence interval on its Elo (1437-1674) overlaps with everything ranked 4th through 10th.
Win rates and Elo come from the published heads-up Bradley-Terry (BT) ratings. GPT-5.5 played 170 games across 17 matchups (10 games each). It wasn't included in the full-table tournament.
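The claim that GPT-5.5's rating doesn't separate from its neighbors comes down to whether two confidence intervals overlap. Here is a minimal sketch of that check. Only the GPT-5.5 interval (1437-1674) comes from the text above. The two rival intervals are hypothetical placeholders, not published ratings, and overlap is a rough screen rather than a formal significance test.

```python
from typing import NamedTuple


class RatingInterval(NamedTuple):
    low: float
    high: float


def overlaps(a: RatingInterval, b: RatingInterval) -> bool:
    """Return True when two closed intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


# 95% interval for GPT-5.5, as quoted in the article.
gpt55 = RatingInterval(1437.0, 1674.0)

# Hypothetical intervals for illustration only (not real benchmark values).
rival_close = RatingInterval(1500.0, 1720.0)
rival_far = RatingInterval(1700.0, 1900.0)

print(overlaps(gpt55, rival_close))  # True: cannot be told apart by this check
print(overlaps(gpt55, rival_far))    # False: intervals are disjoint
```

With only 10 games per matchup, wide intervals like this one are expected, which is why ranks 4 through 10 blur together.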
GPT-5.5 is the least aggressive model in the GPT family. Its aggression factor of 1.68 is about 24% lower than GPT-5.4's 2.22. The most visible shift is in the call rate: GPT-5.5 calls 22% of the time, compared to 19.5% for GPT-5.4 and 18.2% for GPT-5.2.
In practice, this means GPT-5.5 enters pots at a similar rate to its predecessors (VPIP 58.9%) but applies less pressure once it's in. It sees more flops and goes to more showdowns, and it relies on hand strength rather than fold equity to win.
Action rates from behavioral_metrics_api.json. GPT-5.5: 10,731 decisions. GPT-5.4: 8,251 decisions. GPT-5.2: 6,488 decisions. All heads-up mode.
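The generational comparisons above are relative changes between published figures. A short sketch of the arithmetic, using only the numbers quoted in this section (the helper name `relative_change` is ours, for illustration):

```python
def relative_change(new: float, old: float) -> float:
    """Fractional change from old to new (negative means a decrease)."""
    return (new - old) / old


# Aggression factor: GPT-5.5 (1.68) versus GPT-5.4 (2.22).
af_change = relative_change(1.68, 2.22)
print(f"Aggression factor vs GPT-5.4: {af_change:.1%}")  # -24.3%

# Call rate in percent: GPT-5.5 (22.0) versus GPT-5.4 (19.5) and GPT-5.2 (18.2).
call_vs_54 = relative_change(22.0, 19.5)
call_vs_52 = relative_change(22.0, 18.2)
print(f"Call rate vs GPT-5.4: {call_vs_54:.1%}")  # 12.8%
print(f"Call rate vs GPT-5.2: {call_vs_52:.1%}")  # 20.9%
```

In relative terms the call rate is up roughly 13% on GPT-5.4 and 21% on GPT-5.2, even though the gaps in percentage points (2.5 and 3.8) look small.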
GPT-5.5 does tilt, but not the way most players do. When its chip stack drops below the starting stack, its fold rate nearly doubles: from 10.1% on winning stretches to 19.6% on losing stretches. Its raise rate drops from 23.1% to 18.9%. Calls increase. The pattern is clear: the already-conservative model becomes more conservative under pressure.
This is the opposite of the standard human failure mode, which is aggression: chasing losses with bigger bets. GPT-5.5 retreats instead. It doesn't chase. It tightens.
Whether that's a strength or a weakness depends on the opponent. Against aggressive models that apply pressure, retreating means surrendering equity. Against passive opponents, it means surviving to fight better spots. On our strategy drift metric, GPT-5.5 lands in the middle of the field. Moderate drift, not zero.
Strategy drift computed from the May 4 GPT-5.5 rerun with deduplicated game IDs (4,395 weighted decisions). Winning and losing stretches defined by consecutive chip-stack direction relative to starting stack.
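A comparison like the one above can be sketched by bucketing decisions by stack position and computing per-bucket action rates. This is a simplified illustration, not the benchmark's pipeline. It classifies each decision by whether the stack is above or below the starting stack and ignores the weighting and consecutive-direction logic mentioned in the note. The starting stack value, the record layout, and the sample decisions are all hypothetical.

```python
from collections import Counter, defaultdict

STARTING_STACK = 10_000  # hypothetical value, not taken from the benchmark


def action_rates_by_stretch(decisions, starting_stack=STARTING_STACK):
    """Map 'winning'/'losing' to {action: share of decisions in that bucket}.

    decisions: iterable of (stack_at_decision, action) pairs.
    """
    counts = defaultdict(Counter)
    for stack, action in decisions:
        if stack == starting_stack:
            continue  # exactly even: neither ahead nor behind
        stretch = "winning" if stack > starting_stack else "losing"
        counts[stretch][action] += 1

    rates = {}
    for stretch, counter in counts.items():
        total = sum(counter.values())
        rates[stretch] = {a: n / total for a, n in counter.items()}
    return rates


# Made-up decisions for illustration only.
sample = [
    (12_000, "raise"), (11_500, "call"), (13_000, "check"), (12_500, "fold"),
    (8_000, "fold"), (7_500, "call"), (9_000, "fold"), (8_500, "check"),
]

rates = action_rates_by_stretch(sample)
print(rates["winning"]["fold"], rates["losing"]["fold"])  # 0.25 0.5
```

A drift score could then be any distance between the two rate profiles. Which distance the benchmark uses is not stated here, so none is assumed.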
The GPT family shows a clear generational trajectory in our ECAAMS classification. Each version "thinks about its thinking" more often, and each version is less confident in its final action. GPT-5.2 shows metacognition in 1.6% of traces. GPT-5.4 rises to 6.2%. GPT-5.5 reaches 12.8%, the highest rate of any model in the dataset.
At the same time, action confidence falls: from 98.4% in GPT-5.2, to 93.9% in GPT-5.4, to 81.9% in GPT-5.5. The per-action breakdown sharpens this: when GPT-5.5 raises, only 69.3% of those decisions show high confidence in the trace. For checks and folds, the rate is above 89%. It second-guesses itself most when committing chips aggressively.
ECAAMS classifications from the May 4 GPT-5.5 trace rerun (4,432 classified traces, 3-rater consensus panel). GPT-5.2 and GPT-5.4 rates from psych_cards_profiles.json (4,248 and 4,514 traces respectively).
Every GPT-5.5 reasoning trace begins with a bold section header like **Analyzing poker strategy**. No other model in the dataset does this. It frames each decision as a named analytical task before it begins reasoning.
The traces reveal genuine poker knowledge. GPT-5.5 references EV in 64% of traces, equity in 19.4%, pot odds in 10.7%, and specialist terms like SPR, blockers, and donk bets. But 62% of traces also mention the tool interface itself, often spending reasoning tokens on whether "raise amount" means the total bet or the increment. This is GPT-5.5's most distinctive cognitive signature: real strategic depth, partially consumed by interface anxiety.
"I'm thinking about a preflop raise of 32,000 with chips of the same amount, which was likely accepted. With A5o against a limp and 8.5 BB behind, it seems that an all-in shove is often optimal. I wonder if there's an alternative raise to 12,000."
"I can calculate 2.6:1 odds, considering the pot size. There's also the equity against the betting range to think about. I need to choose an action and define an amount to call -- but the tool only supports call, fold, or raise."
"I'm thinking through my previous bets: 800 into 1200, then 1200 into 3600 for a continuation bet. I wonder if betting 800 (which is 2/3 of the pot) or 600 would be better. I guess I really need a tool to assist with this!"
How often each term appears across 4,432 visible reasoning summaries
Trace analysis from 4,432 visible reasoning summaries in the May 4 GPT-5.5 rerun. The original May 3 run captured 0 readable GPT-5.5 traces due to a forced tool_choice configuration. The rerun used tool_choice=auto and summary=detailed. Terminology counts are case-insensitive substring matches.
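Case-insensitive substring matching, as described in the note above, can be sketched as follows. The function names and sample traces are hypothetical. The second function is a whole-word variant added for comparison only. It is not the method used for the published counts.

```python
import re


def substring_rates(traces, terms):
    """Share of traces containing each term as a case-insensitive substring."""
    lowered = [t.lower() for t in traces]
    total = len(lowered)
    return {
        term: (sum(term.lower() in t for t in lowered) / total if total else 0.0)
        for term in terms
    }


def whole_word_rates(traces, terms):
    """Comparison variant: the term must appear as a standalone word or phrase."""
    total = len(traces)
    rates = {}
    for term in terms:
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        hits = sum(1 for t in traces if pattern.search(t))
        rates[term] = hits / total if total else 0.0
    return rates


# Made-up traces for illustration only.
traces = [
    "The EV of calling looks positive given pot odds.",
    "My equity against this range is thin.",
    "However, I should fold here.",
    "Checking back keeps the pot small.",
]
terms = ["ev", "equity", "pot odds"]

print(substring_rates(traces, terms))   # "ev" also matches inside "However"
print(whole_word_rates(traces, terms))  # "ev" matches only the standalone token
```

The design trade-off is recall versus precision. Substring matching never misses a variant such as "+EV" or "EVs", but a two-letter term can also match inside ordinary words, while whole-word matching is stricter and can miss inflected forms.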
GPT-5.5 is a useful model to benchmark precisely because it isn't flashy. It doesn't have the highest win rate, the most aggressive style, or the wildest reasoning traces. What it has is moderate strength and a conservative disposition that deepens under pressure, which makes it a clean baseline for measuring what aggression and adaptability actually buy.
The GPT family trajectory is worth watching. Across three versions, we see metacognition rising and decisiveness falling. Whether that reflects a training priority or an emergent property of scale, it's a measurable shift in how the models engage with uncertainty. For builders who depend on confident tool use, this isn't an abstract concern.
The full matchup data is on the heads-up matrix. The behavioral metrics are on the model profiles page.