Anthropic made Claude faster. We measured the tradeoff.

Opus 4.7 wins about as often as 4.6, but uses far less visible reasoning and behaves like a quieter, more conservative model.

63.7%: Opus 4.7 heads-up win rate (vs 66.1% for Opus 4.6)

61.1%: fewer reasoning tokens (37 vs 95 for Opus 4.6)

36%: Opus 4.7 raise rate (vs 41% for Opus 4.6)

TLDR

Anthropic's newest Opus model doesn't look like a simple "more reasoning, more wins" upgrade. In our heads-up poker data, Opus 4.7 lands near Opus 4.6 on outcomes while spending much less expressed reasoning per decision.

For non-poker readers: heads-up means one-on-one. The point isn't poker strategy itself. It's that two models can reach similar scores while using visibly different decision styles.

Opus 4.7 averages 37 expressed reasoning tokens per decision. Opus 4.6 averages 95.
Their overall heads-up win rates are close: 63.7% for 4.7 vs 66.1% for 4.6.
4.7 raises less often: 36% of decisions vs 41% for 4.6.
Its visible reasoning is flatter: emotional content drops from 2.6% to 0.8%.

Same wins, fewer words

Opus 4.7 posts a 63.7% heads-up win rate across 190 games. Opus 4.6 posts 66.1% across 180 games. In the direct baseline head-to-head, they split 10 games evenly.

The big change isn't the scoreboard. It's the visible reasoning budget. In the token analytics sample, 4.7 averages 37 expressed reasoning tokens per decision. 4.6 averages 95. That's a 61.1% reduction from the rounded averages.
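To make the arithmetic behind that 61.1% explicit, here is a minimal sketch that recomputes it from the two rounded averages quoted above. The function and variable names are ours, chosen for illustration; they are not taken from token_analytics.json.

```python
def percent_reduction(old: float, new: float) -> float:
    """Relative drop from `old` to `new`, as a percentage of `old`."""
    return (old - new) / old * 100


# Rounded per-decision averages quoted in the post.
tokens_46 = 95  # Opus 4.6 expressed reasoning tokens
tokens_47 = 37  # Opus 4.7 expressed reasoning tokens

reduction = percent_reduction(tokens_46, tokens_47)
ratio = tokens_46 / tokens_47

print(f"{reduction:.1f}% fewer expressed reasoning tokens")
print(f"Opus 4.6 writes about {ratio:.1f}x as many tokens per decision")
```

Because the inputs are already rounded, the exact figure from the unrounded averages could differ slightly.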

Tokens vs outcomes: same scoreboard, smaller trace

Model      Reasoning per decision   Win rate   Record
Opus 4.6   95 tokens                66.1%      119 wins / 180 games
Opus 4.7   37 tokens                63.7%      121 wins / 190 games

Direct baseline head-to-head: Opus 4.6 and Opus 4.7 split 10 games, 5-5.

Win rates come from the heads-up scoring table. Token averages come from token_analytics.json, using 1,964 Opus 4.7 decisions and 8,159 Opus 4.6 decisions.
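One rough way to check that the two win rates really are "close" is a pooled two-proportion z-test on the win/game counts above. This is a sketch under assumptions the post does not confirm: it treats every game as an independent trial and treats both models as having faced comparable opponents. It is not the scoring method used to build the table.

```python
import math


def two_proportion_z(wins_a: int, games_a: int, wins_b: int, games_b: int):
    """Pooled two-proportion z-test; returns (rate_a, rate_b, z, two-sided p)."""
    rate_a = wins_a / games_a
    rate_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Counts from the heads-up scoring table.
rate_46, rate_47, z, p_value = two_proportion_z(119, 180, 121, 190)

print(f"Opus 4.6: {rate_46:.1%}  Opus 4.7: {rate_47:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

Under those assumptions, the 2.4-point gap falls well within what sampling noise alone could produce at this number of games, which is consistent with reading the two win rates as roughly equal.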

4.7 changed how it plays

Similar win rate doesn't mean same behavior. Opus 4.7 raises 36% of the time. Opus 4.6 raises 41% of the time. 4.7 also calls more often, 21% vs 16%.

In plain English: 4.6 applies more pressure. 4.7 matches more and pushes less. That matters because an upgrade can keep the headline score while changing the operating style underneath.

Action profile: 4.7 calls more and raises less

Action   Opus 4.6   Opus 4.7
Fold     16%        16%
Check    26%        27%
Call     16%        21%
Raise    41%        36%

Action rates come from psychological_dimensions.json, using 12,160 Opus 4.7 decisions and 12,494 Opus 4.6 decisions.
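The shift in style can be summarized with a few lines of arithmetic on the action rates above. The sketch below computes the percentage-point change per action and a raise-to-call ratio. That ratio is our own illustrative shorthand for "pressure", not a metric from psychological_dimensions.json.

```python
# Rounded action rates (percent of decisions) as (Opus 4.6, Opus 4.7).
ACTION_RATES = {
    "fold": (16, 16),
    "check": (26, 27),
    "call": (16, 21),
    "raise": (41, 36),
}


def point_shifts(rates: dict) -> dict:
    """Percentage-point change from Opus 4.6 to Opus 4.7 for each action."""
    return {action: new - old for action, (old, new) in rates.items()}


def raise_to_call(rates: dict, index: int) -> float:
    """Raises per call for one model (index 0 = Opus 4.6, 1 = Opus 4.7)."""
    return rates["raise"][index] / rates["call"][index]


shifts = point_shifts(ACTION_RATES)
for action, delta in shifts.items():
    print(f"{action:<6}{delta:+d} pts")

print(f"raise-to-call ratio, Opus 4.6: {raise_to_call(ACTION_RATES, 0):.1f}")
print(f"raise-to-call ratio, Opus 4.7: {raise_to_call(ACTION_RATES, 1):.1f}")
```

The five points that leave the raise column show up almost entirely in the call column, while fold and check barely move.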

The reasoning got flatter

ECAAMS classifies the reasoning text we can see. It doesn't read the model's internal state. On that visible text, Opus 4.7 is much less expressive than 4.6.

Emotional content falls from 2.6% of classified traces to 0.8%. Narrative framing falls from 2.5% to 0.7%. Theory of mind, where the model explicitly models what the opponent might believe or do, falls from 1.2% to 0.2%. This is the cost of the efficiency gain: the visible reasoning that made 4.6's decisions inspectable is largely gone.

ECAAMS profile: the visible reasoning gets less personified

Dimension            Opus 4.5   Opus 4.6   Opus 4.7
Emotional content    8.8%       2.6%       0.8%
Identity reference   3.4%       1.1%       0.4%
Narrative framing    5.9%       2.5%       0.7%
Theory of mind       4.9%       1.2%       0.2%

ECAAMS rates are based on visible reasoning traces, not hidden model internals. This comparison uses 708 Opus 4.5 traces, 7,015 Opus 4.6 traces, and 9,215 Opus 4.7 traces.
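To put those percentages in context, the sketch below converts each rate into a relative drop from 4.6 to 4.7 and into an approximate number of flagged traces. The trace estimates are back-calculated from rounded percentages, so treat them as ballpark figures rather than counts taken from the ECAAMS output.

```python
# ECAAMS rates (percent of classified traces) for Opus 4.5, 4.6, 4.7.
ECAAMS_RATES = {
    "emotional content": (8.8, 2.6, 0.8),
    "identity reference": (3.4, 1.1, 0.4),
    "narrative framing": (5.9, 2.5, 0.7),
    "theory of mind": (4.9, 1.2, 0.2),
}
# Number of classified traces per model, in the same order.
TRACE_COUNTS = (708, 7015, 9215)


def relative_drop(old: float, new: float) -> float:
    """Relative decrease from `old` to `new`, as a percentage of `old`."""
    return (old - new) / old * 100


def approx_flagged(rate_percent: float, traces: int) -> int:
    """Approximate number of traces flagged at a given rounded rate."""
    return round(rate_percent / 100 * traces)


for dimension, (r45, r46, r47) in ECAAMS_RATES.items():
    drop = relative_drop(r46, r47)
    flagged = approx_flagged(r47, TRACE_COUNTS[2])
    print(f"{dimension:<20}{drop:5.1f}% drop from 4.6, "
          f"roughly {flagged} of {TRACE_COUNTS[2]} Opus 4.7 traces")
```

On these numbers every dimension falls by well over half between 4.6 and 4.7, and theory of mind is down to only a handful of traces in a sample of more than nine thousand.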

We tried to control it. The standard advice backfired.

Once we saw how different 4.7 looked, we wanted to know whether the changes were controllable through prompting. We suspected adaptive reasoning was behind it all, so we ran a controlled experiment varying system prompts and reasoning effort levels across multiple opponents.

The headline result was counterintuitive. Following Anthropic's suggested prompting approach for adaptive reasoning produced the opposite of what we expected against several opponents. More reasoning didn't reliably mean better play, and the prompts that should have helped sometimes hurt.

Full breakdown in the next post.

Why this matters

Model upgrades aren't always "the same thing, but smarter." Sometimes they're faster, quieter, cheaper, and behaviorally different.

If your workflow only cares about the final answer, Opus 4.7 may look like a clean efficiency win. If your workflow depends on visible reasoning, uncertainty, or opponent modeling, the upgrade changes what you can inspect.

Next post: the full adaptive reasoning experiment. Which opponents got harder when we turned reasoning up, which got easier, and why Anthropic's recommended approach produced the opposite of what we expected.

