Claude Fable 5 enters our heads-up benchmark as a high-conviction decision model: strong enough for the top cluster, direct enough to act quickly, and vulnerable when commitment concentrates in the wrong hand.
The core finding is not that Fable simply replaces Opus 4.8. It is that Fable enters the same top tier with a different decision shape. In the current Bradley-Terry heads-up table, Fable is first at Elo 1629, while its direct Opus 4.8 result is nearly even.
That ranking comes from the full pairwise graph, not just Fable's direct games. Opus 4.8 is second at Elo 1605, so the table reads this as a tight top cluster rather than a blowout.
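To make "tight top cluster" concrete, a rating gap can be converted into an expected head-to-head score. The sketch below assumes the conventional logistic Elo mapping with a 400-point scale; the scale PsychBench actually uses is not stated here, so treat the function name, the default, and the result as illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score for A against B under the conventional logistic Elo mapping.

    Assumption: a 400-point, base-10 scale. The benchmark's own scale may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings from the current heads-up table.
fable_elo = 1629
opus_48_elo = 1605

p_fable = elo_expected_score(fable_elo, opus_48_elo)
expected_wins_in_35 = 35 * p_fable

print(f"expected score: {p_fable:.3f}")
print(f"expected wins over 35 games: {expected_wins_in_35:.1f}")
```

Under that assumption, a 24-point gap implies an expected score only a few points above 50%, which is in line with the near-even 18-17 direct result.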
The more useful read is behavioral: Fable is decisive. It shows high action confidence, high action alignment, low metacognition, and almost no conflict language in the classified traces. That is useful when the task requires a clear next move. It is riskier when the task requires the model to slow itself down.
Across 120 completed games, Fable went 68-52. It leads GPT-5.5 by five games in the direct run (20-15), beats GPT-5.4 8-2 in placement, and holds level or better against the Opus placement anchors (5-5 against Opus 4.7, 7-3 against Opus 4.6). The Opus 4.8 matchup is the one people will ask about most, and it lands at 18-17 across 35 games.
That makes the result more informative than a single leaderboard position. Fable belongs in the top cluster, but the matchups are uneven. The probe added one sharp result: Fable went 1-4 against Claude Sonnet 4.6 while going 3-2 against Grok 4.2.
| Opponent | Record | Games | Win Rate |
|---|---|---|---|
| GPT-5.5 | 20-15 | 35 | 57.1% |
| Claude Opus 4.8 | 18-17 | 35 | 51.4% |
| GPT-5.4 | 8-2 | 10 | 80.0% |
| Claude Opus 4.6 | 7-3 | 10 | 70.0% |
| GPT-5.2 | 6-4 | 10 | 60.0% |
| Claude Opus 4.7 | 5-5 | 10 | 50.0% |
| Grok 4.2 | 3-2 | 5 | 60.0% |
| Claude Sonnet 4.6 | 1-4 | 5 | 20.0% |
This snapshot uses 120 completed Fable games: 35 each against GPT-5.5 and Opus 4.8; 10-game placement samples against each of GPT-5.4, GPT-5.2, Opus 4.7, and Opus 4.6; and a 10-game probe split evenly between Sonnet 4.6 and Grok 4.2. Matchup counts are intentionally uneven: PsychBench placement samples opponents to anchor the Elo graph efficiently, then the Bradley-Terry fit pools the full pairwise result set. The 1-4 Sonnet 4.6 row is the main matchup to probe next: five games can flag a possible weak spot, but they cannot by themselves explain the mechanism.
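One way to see how much weight a five-game row can bear is to put an interval around each win rate. The sketch below uses a Wilson score interval at roughly 95% confidence; this is a standard binomial interval chosen for illustration, not a method the benchmark reports using, and the function name is hypothetical.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 is roughly 95% confidence)."""
    p = wins / games
    z2 = z * z
    denom = 1.0 + z2 / games
    center = (p + z2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z2 / (4 * games * games)) / denom
    return center - half, center + half


# Records taken from the matchup table above.
records = {
    "GPT-5.5": (20, 35),
    "Claude Opus 4.8": (18, 35),
    "Claude Sonnet 4.6": (1, 5),
}

intervals = {name: wilson_interval(w, n) for name, (w, n) in records.items()}
for name, (lo, hi) in intervals.items():
    print(f"{name}: {lo:.1%} to {hi:.1%}")
```

With only five games, the Sonnet 4.6 interval stretches from low single digits to above 60%, so it still includes an even matchup. That is why the row is treated as a flag for follow-up rather than a finding.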
We classified 7,865 Fable traces with ECAAMS. The pattern is direct: Fable almost always presents a deliberate action and usually follows through on it. It rarely talks about uncertainty, competing interpretations, or its own decision process.
That matters for deployment. If an agent loop needs a model to choose the next step, Fable's default posture is useful. If the workflow needs dissent, uncertainty, or a review before acting, those behaviors should be requested explicitly.
ECAAMS classification used four raters and majority consensus, with 2-2 ties assigned to the lower label. The current Fable data covers 7,865 traces from 120 completed games. The game logs contain 8,378 Fable decisions; the counts differ because action metrics use every recorded decision, while ECAAMS only classifies decisions with a usable reasoning trace after preprocessing.
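The consensus rule can be sketched in a few lines. This version assumes labels are ordinal integers (lower number means lower label) and resolves any tie for the most common label downward; the source only specifies the 2-2 case, so the handling of other splits here is an assumption, and `consensus_label` is a hypothetical name.

```python
from collections import Counter


def consensus_label(ratings: list[int]) -> int:
    """Return the most common label; ties for most common go to the lower label.

    Assumption: labels are ordinal integers. Only the 2-2 tie rule comes from
    the methodology note; other tie patterns are resolved the same way here.
    """
    counts = Counter(ratings)
    top_count = max(counts.values())
    return min(label for label, count in counts.items() if count == top_count)


print(consensus_label([3, 3, 3, 1]))  # clear majority
print(consensus_label([2, 2, 1, 1]))  # 2-2 tie resolves to the lower label
```

Resolving ties downward is the conservative choice: a trace is only scored at the higher label when most raters agree it belongs there.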
The action data matches the traces. Fable raises 38.2% of the time, folds 16.4%, and posts a 2.04 aggression ratio. That is assertive without looking random: it is not simply raising every marginal spot, and it is not folding its way to survival.
The same pattern shows up in the auxiliary behavioral metrics: bluff attempts appear on 15.6% of Fable decisions, and its tilt index is 0.036 on the PsychBench heads-up tilt scale. That is modestly above the current heads-up median of -0.0008 and well below Claude Sonnet 4.6 at 0.1522. Fable applies pressure, but the pressure is not paired with obvious emotional reactivity in the trace classifications.
This is where the overall record and the trace data line up. Fable wins by making clear decisions often enough to survive strong opponents, not by producing elaborate explanations of every branch.
Action rates are computed from 8,378 Fable decisions across the 120 completed games. Aggression ratio is raises divided by calls. Bluff-attempt rate is the share of Fable decisions tagged as bluff attempts by the event parser.
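The definitions above translate directly into code. The sketch below runs on a small made-up action list, not benchmark data, and the action names (`"raise"`, `"call"`, `"fold"`, `"check"`) are assumed labels rather than the parser's actual vocabulary.

```python
from collections import Counter


def action_metrics(actions: list[str]) -> dict[str, float]:
    """Raise rate, fold rate, and aggression ratio (raises divided by calls)."""
    counts = Counter(actions)
    total = len(actions)
    calls = counts["call"]
    return {
        "raise_rate": counts["raise"] / total,
        "fold_rate": counts["fold"] / total,
        # Undefined when there are no calls; reported as infinity in this sketch.
        "aggression_ratio": counts["raise"] / calls if calls else float("inf"),
    }


# Toy decision log for illustration only.
toy_actions = ["raise"] * 4 + ["call"] * 2 + ["fold"] * 2 + ["check"] * 2
metrics = action_metrics(toy_actions)
print(metrics)

# Consistency check on the reported figures: a 38.2% raise rate and a 2.04
# aggression ratio together imply a call rate of about 38.2% / 2.04.
implied_call_rate = 0.382 / 2.04
print(f"implied call rate: {implied_call_rate:.1%}")
```

The implied call rate comes out just under 19%, which leaves the remainder of Fable's decisions for actions that are neither raises, calls, nor folds.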
Fable's losses are not usually slow bleeds. In its 52 losing games, the largest losing hand averaged 72.7% of the starting stack. The top three losing hands accounted for 70.3% of all chips Fable lost in those games.
Heads-up poker naturally ends in big pots, so this is not a claim that only Fable ever loses large hands. The comparison is still informative: opponents losing to Fable had a lower average concentration in their largest and top-three losing hands. The pattern is consistent with overcommitment in the most expensive spots.
Loss concentration compares Fable's 52 losses with the 68 opponent losses inside the same Fable corpus. The 80%+ stack-hit row reports the share of losses in which a single hand cost at least 48,000 chips of a 60,000-chip starting stack.
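For a single lost game, the concentration measures can be sketched as follows. The input format (a list of per-hand net chip results, negative for a loss) and the function name are assumptions, the example hands are invented, and how per-game values are aggregated across the 52 losses is not specified here.

```python
STARTING_STACK = 60_000  # starting stack stated in the methodology note


def loss_concentration(hand_nets: list[int], starting_stack: int = STARTING_STACK):
    """Concentration of losses within one game.

    hand_nets: net chips won (+) or lost (-) on each hand (assumed format).
    Returns None if the player lost no hands.
    """
    losses = sorted((-net for net in hand_nets if net < 0), reverse=True)
    if not losses:
        return None
    total_lost = sum(losses)
    return {
        "largest_share_of_stack": losses[0] / starting_stack,
        "top3_share_of_losses": sum(losses[:3]) / total_lost,
        # Integer comparison for "at least 80% of the starting stack".
        "stack_hit_80": 10 * losses[0] >= 8 * starting_stack,
    }


# Invented example game: one 48,000-chip hand dominates the losses.
toy_game = [-48_000, 12_000, -6_000, -3_000, -2_000, -1_000, 5_000]
result = loss_concentration(toy_game)
print(result)
```

In the toy game, the largest hand costs 80% of the starting stack and the top three hands make up 95% of all chips lost, which is the shape of loss the section describes.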
Fable is a strong fit for workflows where the model needs to evaluate a state, choose an action, and continue. Routing, triage, tactical planning, and agentic execution are the natural places to test it first.
The prompt should match the risk. For ordinary execution, Fable's decisiveness is an advantage. For high-stakes decisions, adversarial settings, or irreversible actions, ask it to surface alternatives, define what would change its mind, and run a review pass before committing.