Opus 4.7 wins about as often as 4.6, but uses far less visible reasoning and behaves like a quieter, more conservative model.
Anthropic's newest Opus model doesn't look like a simple "more reasoning, more wins" upgrade. In our heads-up poker data, Opus 4.7 lands near Opus 4.6 on outcomes while spending far fewer expressed reasoning tokens per decision.
For non-poker readers: heads-up means one-on-one. The point isn't poker strategy itself. It's that two models can reach similar scores while using visibly different decision styles.
Opus 4.7 posts a 63.7% heads-up win rate across 190 games. Opus 4.6 posts 66.1% across 180 games. In the direct baseline head-to-head, they split 10 games evenly.
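As a rough sanity check on "about as often," here is a minimal sketch of a pooled two-proportion z-test on the reported win rates. This is an illustration we are adding, not the method behind the scoring table. It assumes games are independent and that both models faced comparable opponents, neither of which is guaranteed here. The function name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Pooled two-proportion z statistic for rate p1 (n1 trials) vs p2 (n2 trials)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_err


# Reported win rates and game counts: Opus 4.7 first, Opus 4.6 second.
# Assumption: each game is an independent trial.
z = two_proportion_z(0.637, 190, 0.661, 180)
print(f"z = {z:.2f}")  # |z| below 1.96 is not significant at the 5% level
```

Under those assumptions the statistic comes out near -0.48, well inside the usual 1.96 threshold, so a 2.4-point gap at these sample sizes is consistent with noise.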
The big change isn't the scoreboard. It's the visible reasoning budget. In the token analytics sample, 4.7 averages 37 expressed reasoning tokens per decision, against 95 for 4.6. Computed from those rounded averages, that's a 61.1% reduction.
Win rates come from the heads-up scoring table. Token averages come from token_analytics.json, using 1,964 Opus 4.7 decisions and 8,159 Opus 4.6 decisions.
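The reduction figure can be reproduced from the two rounded averages. A minimal sketch follows; the variable and function names are ours, and the result would shift slightly if computed from unrounded averages, which we don't show here.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Rounded per-decision averages of expressed reasoning tokens.
avg_tokens_46 = 95
avg_tokens_47 = 37

reduction = percent_reduction(avg_tokens_46, avg_tokens_47)
ratio = avg_tokens_46 / avg_tokens_47

print(f"{reduction:.1f}% fewer expressed reasoning tokens per decision")
print(f"4.6 uses about {ratio:.1f}x as many as 4.7")
```

Put the other way around, 4.6 writes roughly two and a half times as much visible reasoning per decision as 4.7.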
Similar win rates don't mean the same behavior. Opus 4.7 raises 36% of the time, against 41% for Opus 4.6. Opus 4.7 also calls more often: 21% versus 16%.
In plain English: 4.6 applies more pressure. 4.7 matches more and pushes less. That matters because an upgrade can keep the headline score while changing the operating style underneath.
Action rates come from psychological_dimensions.json, using 12,160 Opus 4.7 decisions and 12,494 Opus 4.6 decisions.
ECAAMS classifies the reasoning text we can see. It doesn't read the model's internal state. On that visible text, Opus 4.7 is much less expressive than 4.6.
Emotional content falls from 2.6% of classified traces to 0.8%. Narrative framing falls from 2.5% to 0.7%. Theory of mind, where the model explicitly models what the opponent might believe or do, falls from 1.2% to 0.2%. This is the cost of the efficiency gain: the visible reasoning that made 4.6's decisions inspectable is largely gone.
ECAAMS rates are based on visible reasoning traces, not hidden model internals. This comparison uses 708 Opus 4.5 traces, 7,015 Opus 4.6 traces, and 9,215 Opus 4.7 traces.
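The absolute rates are small, so the relative drops are easier to compare. Here is a minimal sketch that computes them from the rounded percentages quoted above. The dictionary keys and the `relative_drop` helper are our own labels, not ECAAMS field names, and the rounded inputs make the results approximate.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative decline from `before` to `after`, in percent."""
    return (before - after) / before * 100


# (Opus 4.6 rate, Opus 4.7 rate), each as a percentage of classified traces.
rates = {
    "emotional content": (2.6, 0.8),
    "narrative framing": (2.5, 0.7),
    "theory of mind": (1.2, 0.2),
}

drops = {name: relative_drop(b, a) for name, (b, a) in rates.items()}

for name, drop in drops.items():
    print(f"{name}: down {drop:.1f}% relative to 4.6")
```

On these inputs, emotional content falls by about 69%, narrative framing by about 72%, and theory of mind by about 83%, so the category closest to opponent modeling shrinks the most in relative terms.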
Once we saw how different 4.7 looked, we wanted to know whether the changes were controllable through prompting. We suspected adaptive reasoning was behind it all, so we ran a controlled experiment varying system prompts and reasoning effort levels across multiple opponents.
The headline result was counterintuitive. Following Anthropic's suggested prompting approach for adaptive reasoning produced the opposite of what we expected against several opponents. More reasoning didn't reliably mean better play, and the prompts that should have helped sometimes hurt.
Full breakdown in the next post.
Model upgrades aren't always "the same thing, but smarter." Sometimes they're faster, quieter, cheaper, and behaviorally different.
If your workflow only cares about the final answer, Opus 4.7 may look like a clean efficiency win. If your workflow depends on visible reasoning, expressed uncertainty, or explicit opponent modeling, the upgrade changes what you can inspect.
Next post: the full adaptive reasoning experiment. Which opponents got harder when we turned reasoning up, which got easier, and why Anthropic's recommended approach produced the opposite of what we expected.