Claude Fable 5's edge is conviction. So is its risk.

Claude Fable 5 enters our heads-up benchmark as a high-conviction decision model: strong enough for the top cluster, direct enough to act quickly, and vulnerable when commitment concentrates in the wrong hand.

BT snapshot: Elo 1629

Overall record: 68-52 (120 completed games)

vs GPT-5.5: 20-15 (35-game direct run)

Action confidence: 97.1% (7,865 classified traces)

TL;DR

The core finding is not that Fable simply replaces Opus 4.8. It is that Fable enters the same top tier with a different decision shape. In the current Bradley-Terry heads-up table, Fable is first at Elo 1629, while its direct Opus 4.8 result is nearly even.

That ranking comes from the full pairwise graph, not just Fable's direct games. Opus 4.8 is second at Elo 1605, so the table reads this as a tight top cluster rather than a blowout.
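To make "tight top cluster" concrete, the rating gap can be turned into an expected head-to-head win rate. The sketch below assumes the conventional logistic Elo curve with a 400-point scale; the article does not state which scaling PsychBench uses, so treat the function and its `scale` parameter as illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win probability for A over B on a logistic Elo curve.

    The 400-point scale is the common convention and is an assumption here,
    not a documented PsychBench setting.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article.
FABLE_ELO = 1629
OPUS_48_ELO = 1605

expected = elo_expected_score(FABLE_ELO, OPUS_48_ELO)
print(f"Expected Fable win rate vs Opus 4.8: {expected:.1%}")
```

Under that assumption, a 24-point gap implies an expected win rate of roughly 53% for Fable, which is in the same range as the 18-17 direct result.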

The more useful read is behavioral: Fable is decisive. It shows high action confidence, high action alignment, low metacognition, and almost no conflict language in the classified traces. That is useful when the task requires a clear next move. It is riskier when the task requires the model to slow itself down.

Fable enters the current PsychBench heads-up table in the top cluster: 68-52 across 120 completed games.
The result is not a sweep. Fable leads GPT-5.5 20-15 and is nearly even with Opus 4.8 at 18-17.
The classified traces are unusually direct: high action confidence, high action alignment, and almost no cognitive conflict.
The main downside is not confusion. In losses, damage is concentrated in a few expensive pots.

Top tier, not a sweep

Across 120 completed games, Fable went 68-52. It leads GPT-5.5 by five games in the direct run, beats GPT-5.4 8-2 in placement, and stays close to several top Claude anchors. The Opus 4.8 matchup is the one people will ask about most, and it lands at 18-17 across 35 games.

That makes the result more useful than a release ranking. Fable belongs in the top cluster, but the matchups are uneven. The probe added one sharp result: 1-4 into Claude Sonnet 4.6, while going 3-2 into Grok 4.2.

Opponent	Record	Games	Win Rate
GPT-5.5	20-15	35	57.1%
Claude Opus 4.8	18-17	35	51.4%
GPT-5.4	8-2	10	80.0%
Claude Opus 4.6	7-3	10	70.0%
GPT-5.2	6-4	10	60.0%
Claude Opus 4.7	5-5	10	50.0%
Grok 4.2	3-2	5	60.0%
Claude Sonnet 4.6	1-4	5	20.0%

Snapshot uses 120 completed Fable games: 35 each against GPT-5.5 and Opus 4.8; 10-game placement samples against GPT-5.4, GPT-5.2, Opus 4.7, and Opus 4.6; and a 10-game probe split five games each between Sonnet 4.6 and Grok 4.2. Matchup counts are intentionally uneven: PsychBench placement samples opponents to anchor the Elo graph efficiently, then the Bradley-Terry fit pools the full pairwise result set. The 1-4 Sonnet 4.6 row is the main matchup to probe next: five games can flag a possible weak spot, but they cannot explain the mechanism on their own.
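One way to see how little a five-game row can support is to put an interval around each win rate. The sketch below uses a Wilson score interval, which is a standard choice for small samples but is our illustration, not part of the PsychBench method.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 is roughly 95% coverage)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1.0 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half_width = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half_width, center + half_width


# Records taken from the matchup table.
for opponent, wins, games in [("Claude Sonnet 4.6", 1, 5), ("GPT-5.5", 20, 35)]:
    low, high = wilson_interval(wins, games)
    print(f"{opponent}: {wins}/{games} -> {low:.1%} to {high:.1%}")
```

For the 1-4 Sonnet 4.6 row the interval spans roughly 4% to 62%, so the sample is compatible with anything from a severe weak spot to a mildly favorable matchup.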

Fable's trace signature

We classified 7,865 Fable traces with ECAAMS. The pattern is direct: Fable almost always presents a deliberate action and usually follows through on it. It rarely talks about uncertainty, competing interpretations, or its own decision process.

That matters for deployment. If an agent loop needs a model to choose the next step, Fable's default posture is useful. If the workflow needs dissent, uncertainty, or a review before acting, those behaviors should be requested explicitly.

Category	Share of traces	Traces
Deliberate reasoning	99.55%	7,830
Action confidence	97.11%	7,638
Action alignment	96.30%	7,574
Narrative framing	4.27%	336
Metacognition	0.56%	44
Cognitive conflict	2 of 7,865	2

Emotional content in reasoning traces

Model	ECAAMS E1 rate
Claude Fable 5	0.33%
Claude Opus 4.8	0.62%
GPT-5.5	8.46%

ECAAMS classification used four raters and majority consensus, with 2-2 ties assigned to the lower label. The current Fable data covers 7,865 traces from 120 completed games. The game logs contain 8,378 Fable decisions; the counts differ because action metrics use every recorded decision, while ECAAMS only classifies decisions with a usable reasoning trace after preprocessing.
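The consensus rule described above can be sketched in a few lines. This assumes labels are ordinal integers (for example 0 for absent and 1 for present), which the article does not specify. It also assumes that splits with no outright majority, such as 2-1-1, go to the most common label; the article only defines the 2-2 case.

```python
from collections import Counter
from typing import Sequence


def consensus_label(ratings: Sequence[int]) -> int:
    """Return the most common label; ties resolve to the lowest tied label.

    With four raters this gives the majority label for 4-0 and 3-1 splits
    and the lower label for 2-2 splits, matching the rule in the article.
    Plurality handling for other splits is an assumption.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    top_count = max(counts.values())
    return min(label for label, n in counts.items() if n == top_count)


print(consensus_label([1, 1, 1, 0]))  # 3-1 split: majority label
print(consensus_label([1, 1, 0, 0]))  # 2-2 split: lower label
```

Resolving ties downward is the conservative choice: a category only counts as present when most raters agree, so the reported rates are more likely to understate than overstate a behavior.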

Aggressive, not loose

The action data matches the traces. Fable raises 38.2% of the time, folds 16.4%, and posts a 2.04 aggression ratio. That is assertive without looking random: it is not simply raising every marginal spot, and it is not folding its way to survival.

The same pattern shows up in the auxiliary behavioral metrics: bluff attempts appear on 15.6% of Fable decisions, and its tilt index is 0.036 on the PsychBench heads-up tilt scale. That is modestly above the current heads-up median of -0.0008 and well below Claude Sonnet 4.6 at 0.1522. Fable applies pressure, but the pressure is not paired with obvious emotional reactivity in the trace classifications.

This is where the overall record and the trace data line up. Fable wins by making clear decisions often enough to survive strong opponents, not by producing elaborate explanations of every branch.

Action	Share of decisions
Raise	38.2%
Check	26.7%
Call	18.7%
Fold	16.4%

Action rates are computed from 8,378 Fable decisions across the 120 completed games. Aggression ratio is raises divided by calls. Bluff-attempt rate is the share of Fable decisions tagged as bluff attempts by the event parser.
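As a quick check on the definitions, the aggression ratio can be recomputed from the published action rates. The helper names below are ours, and the input format (one action string per decision) is an assumed stand-in for the real game logs.

```python
from collections import Counter
from typing import Dict, Iterable

ACTIONS = ("raise", "check", "call", "fold")


def action_rates(actions: Iterable[str]) -> Dict[str, float]:
    """Share of decisions per action type."""
    counts = Counter(actions)
    total = sum(counts[a] for a in ACTIONS)
    if total == 0:
        raise ValueError("no recognized actions")
    return {a: counts[a] / total for a in ACTIONS}


def aggression_ratio(rates: Dict[str, float]) -> float:
    """Raises divided by calls, as defined in the article."""
    if rates["call"] == 0:
        raise ValueError("aggression ratio is undefined with zero calls")
    return rates["raise"] / rates["call"]


# Published Fable rates, expressed as fractions of all decisions.
published = {"raise": 0.382, "check": 0.267, "call": 0.187, "fold": 0.164}
print(f"Aggression ratio: {aggression_ratio(published):.2f}")
```

Because raises and calls share the same denominator, the ratio is the same whether it is computed from raw counts or from rates.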

Fable's failure mode

Fable's losses are not usually slow bleeds. In its 52 losing games, the largest losing hand averaged 72.7% of the starting stack. The top three losing hands accounted for 70.3% of all chips Fable lost in those games.

Heads-up poker naturally ends in big pots, so this is not a claim that only Fable ever loses large hands. The comparison is still informative: opponents losing to Fable had a lower average concentration in their largest and top-three losing hands. The pattern is consistent with overcommitment in the most expensive spots.

Metric	Definition	Fable	Opponents
Largest losing hand	average share of starting stack	72.7%	67.0%
Top 3 losing hands	share of all chips lost in a game	70.3%	64.8%
80%+ stack hits	share of losses with one very large hit	41.5%	26.9%

Loss concentration compares Fable's 52 losses with the 68 opponent losses inside the same Fable corpus. The 80%+ stack-hit row means the share of losses where one hand cost at least 48,000 chips from a 60,000-chip starting stack.
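For readers who want to reproduce these metrics on their own logs, here is a minimal per-game sketch. It assumes each game is available as a list of net chip results per hand from the loser's perspective, and that "all chips lost" means the gross total over losing hands. Both are our assumptions; the sample values are invented for illustration.

```python
from typing import Dict, Sequence, Union

STARTING_STACK = 60_000  # starting stack quoted in the article


def loss_concentration(
    hand_results: Sequence[int], starting_stack: int = STARTING_STACK
) -> Dict[str, Union[float, bool]]:
    """Concentration metrics for one lost game.

    hand_results holds net chips per hand; negative values are losses.
    """
    losses = sorted((-r for r in hand_results if r < 0), reverse=True)
    if not losses:
        raise ValueError("game has no losing hands")
    largest = losses[0]
    return {
        "largest_share_of_stack": largest / starting_stack,
        "top3_share_of_losses": sum(losses[:3]) / sum(losses),
        # 80% threshold in integer arithmetic: 48,000 of a 60,000 stack.
        "has_80pct_hit": largest * 100 >= 80 * starting_stack,
    }


# Hypothetical game: one dominant pot plus a few small losses.
sample_game = [-2_000, 3_000, -50_000, -1_000, -10_000, 500]
print(loss_concentration(sample_game))
```

Averaging the first two values across all lost games, and taking the share of games where the flag is true, yields the three figures reported above.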

Why this matters

Fable is a strong fit for workflows where the model needs to evaluate a state, choose an action, and continue. Routing, triage, tactical planning, and agentic execution are the natural places to test it first.

The prompt should match the risk. For ordinary execution, Fable's decisiveness is an advantage. For high-stakes decisions, adversarial settings, or irreversible actions, ask it to surface alternatives, define what would change its mind, and run a review pass before committing.
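A minimal sketch of matching the prompt to the risk might look like the following. The wording of the review steps and the `high_stakes` flag are hypothetical; they are one way to encode the three requests above, not a tested prompt.

```python
# Hypothetical review instructions mirroring the three requests in the text.
REVIEW_STEPS = (
    "List at least two alternative actions and why you rejected them.",
    "State what evidence would change your decision.",
    "Review the chosen action once more before committing.",
)


def build_prompt(task: str, high_stakes: bool) -> str:
    """Return the task unchanged for routine work, or with review steps appended."""
    if not high_stakes:
        return task
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(REVIEW_STEPS, start=1))
    return f"{task}\n\nBefore acting:\n{steps}"


print(build_prompt("Choose the next deployment step.", high_stakes=True))
```

Keeping the routine path untouched preserves the model's default decisiveness where it helps, and adds friction only where a mistake would be expensive.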
