We made Claude think harder. The prompt mattered more.

We tried to improve Opus 4.7 by turning up reasoning effort. The bigger lever was a two-sentence instruction that changed what the model optimized for.

80 controlled heads-up games: 4 prompts x 4 effort levels

75% best prompt win rate: L2 strategy across all efforts

6.5x max vs medium token budget: 242 vs 37 reasoning tokens

TLDR

We ran Opus 4.7 through a controlled heads-up poker experiment: four system prompts crossed with four reasoning effort levels, five games per condition, 80 games total. Grok 4.1 stayed fixed. Only Opus changed.

The result was not "more thinking, more wins." The best prompt lifted the win rate to 2.5 times the baseline (75% vs. 30%) against the same opponent. Extra reasoning helped at moderate levels, then became expensive and less useful at max.

The best prompt averaged 75% win rate. Baseline averaged 30%.
The prompt spread was 45 percentage points. The effort spread was only 15.
Max effort used 242 reasoning tokens per decision and still won less often than high or xhigh.
The effect only surfaced against Grok 4.1, a matched opponent. DeepSeek v3.2 made every prompt look fine.

The two-sentence prompt won

The baseline instruction was simple: "Your objective is to win." That averaged a 30% win rate. The best instruction added one strategic frame: don't lose, and think in terms of long-term chip trajectory.

That L2 prompt averaged 75% across the matrix and never dropped below 60% at any effort level. It was not longer, more technical, or more poker-specific. It just made the objective less myopic.

L2 prompt

"You're playing Texas Hold'em. Your objective is to win. Don't lose. Think about your long-term chip strategy to eliminate your opponent."

Win rate by condition

One prompt dominates the matrix

[Figure: win-rate matrix of prompt (rows) by effort level (columns: medium, high, xhigh, max), plus a per-prompt average.]

Baseline: avg 30%
L2 strategy: avg 75%
L3 blinds: medium 20%, high 100%, xhigh 40%, max 20%, avg 45%
L4 verify: avg 45%

80 heads-up games in the controlled matrix

45pp spread between best and worst prompt averages

15pp spread between best and worst effort averages

Each cell contains five games. Win rate is Opus 4.7's game win rate against the same fixed Grok 4.1 opponent.
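
The grid itself is easy to sketch. The snippet below is a minimal illustration, not the harness used for the experiment: the prompt and effort labels come from the matrix above, while `build_schedule` and `win_rate` are hypothetical helper names. It also shows why every cell lands on a multiple of 20%: with five games per cell, no other win rates are possible.

```python
from itertools import product

# Labels taken from the article's matrix; the helpers below are hypothetical.
PROMPTS = ["Baseline", "L2 strategy", "L3 blinds", "L4 verify"]
EFFORTS = ["medium", "high", "xhigh", "max"]
GAMES_PER_CELL = 5


def build_schedule(prompts, efforts, games_per_cell):
    """Enumerate one entry per game: every prompt crossed with every effort level."""
    return [
        {"prompt": prompt, "effort": effort, "game": game}
        for prompt, effort in product(prompts, efforts)
        for game in range(1, games_per_cell + 1)
    ]


def win_rate(wins, games):
    """Game win rate for one cell, as a percentage."""
    return 100.0 * wins / games


schedule = build_schedule(PROMPTS, EFFORTS, GAMES_PER_CELL)
print(len(schedule))  # 4 prompts x 4 efforts x 5 games

# With five games per cell, the only reachable win rates are steps of 20 points.
possible_rates = [win_rate(wins, GAMES_PER_CELL) for wins in range(GAMES_PER_CELL + 1)]
print(possible_rates)
```

One game is worth 20 percentage points in a single cell, so the averages across a whole row or column are the steadier numbers to read.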

Effort was the smaller lever

Averaged across prompts, medium effort won 40%. High and xhigh both won 55%. Max dropped back to 45%. That means effort mattered, but only within a narrow band.
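
To make the comparison between the two levers concrete, here is a small sketch that recomputes both spreads from the averages reported in this article. The dictionary and function names are illustrative assumptions; only the percentages come from the experiment.

```python
# Average win rates (percent) as reported in the article.
PROMPT_AVG = {"Baseline": 30, "L2 strategy": 75, "L3 blinds": 45, "L4 verify": 45}
EFFORT_AVG = {"medium": 40, "high": 55, "xhigh": 55, "max": 45}


def spread(averages):
    """Gap in percentage points between the best and worst average."""
    return max(averages.values()) - min(averages.values())


def grand_mean(averages):
    """Mean of the averages; both views of the same matrix should agree."""
    return sum(averages.values()) / len(averages)


print("prompt spread (pp):", spread(PROMPT_AVG))
print("effort spread (pp):", spread(EFFORT_AVG))

# Sanity check: row averages and column averages describe the same 16 cells,
# so their grand means must match.
print(grand_mean(PROMPT_AVG), grand_mean(EFFORT_AVG))
```

Both grand means come out to the same overall win rate, which is a quick check that the per-prompt and per-effort averages describe the same set of games.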

The cost curve was much steeper than the performance curve. Max averaged 242 reasoning tokens per Opus decision and 26.8 seconds of latency. High averaged 52 tokens and 4.9 seconds. In this setup, max bought more deliberation without buying more wins.

Effort tradeoff

More effort buys tokens before it buys wins

medium: 37 reasoning tokens, 40% WR, 3.7s

high: 52 reasoning tokens, 55% WR, 4.9s

xhigh: 73 reasoning tokens, 55% WR, 6.6s

max: 242 reasoning tokens, 45% WR, 26.8s

Max effort used about 4.7x the tokens of high effort and about 5.5x the latency, but won less often.
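
The ratios can be reproduced directly from the per-effort figures. This is a minimal sketch: the numbers are the ones reported in this article, while `EFFORT_STATS` and `cost_ratio` are hypothetical names chosen for the example.

```python
# Per-decision averages by effort level, as reported in the article.
EFFORT_STATS = {
    "medium": {"tokens": 37, "latency_s": 3.7, "win_rate": 40},
    "high": {"tokens": 52, "latency_s": 4.9, "win_rate": 55},
    "xhigh": {"tokens": 73, "latency_s": 6.6, "win_rate": 55},
    "max": {"tokens": 242, "latency_s": 26.8, "win_rate": 45},
}


def cost_ratio(level, reference, metric):
    """How many times larger `metric` is at `level` than at `reference`."""
    return EFFORT_STATS[level][metric] / EFFORT_STATS[reference][metric]


def win_rate_delta(level, reference):
    """Change in win rate, in percentage points, from `reference` to `level`."""
    return EFFORT_STATS[level]["win_rate"] - EFFORT_STATS[reference]["win_rate"]


for metric in ("tokens", "latency_s"):
    print(f"max vs high {metric}: {cost_ratio('max', 'high', metric):.1f}x")

print(f"max vs medium tokens: {cost_ratio('max', 'medium', 'tokens'):.1f}x")
print("max vs high win rate (pp):", win_rate_delta("max", "high"))
```

Putting the cost ratio and the win-rate delta side by side shows the tradeoff: cost rises several-fold between high and max while the win rate moves in the wrong direction.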

The prompt changed the play style

L2 did not win by generating the most text. It averaged 99 reasoning tokens, close to baseline's 91. The difference showed up in behavior: lower fold rate, stronger aggression, and more explicit accounting for future chip pressure.

The verify prompt is the cautionary case. It matched L2's aggression ratio (1.49), but it also kept the fold rate near baseline (19.3% vs. 19.5%) and only won 45%. Asking the model to review everything before acting often spent tokens re-parsing the state instead of improving the decision.

Prompt averages

The winning prompt changed behavior

Baseline: aggression 1.30, fold rate 19.5%, win rate 30%

L2 strategy: aggression 1.49, fold rate 16.8%, win rate 75%

L3 blinds: aggression 1.16, fold rate 16.9%, win rate 45%

L4 verify: aggression 1.49, fold rate 19.3%, win rate 45%

Prompt averages combine all four effort levels. Aggression is raise percentage divided by call percentage. Action rates come from Opus 4.7 decisions inside the same 80-game matrix.
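
As a sketch of how these two metrics could be derived from a decision log, the snippet below follows the definition above (raise percentage divided by call percentage). The action labels, the `action_stats` helper, and the sample log are all assumptions made up for illustration; they are not data from the experiment.

```python
from collections import Counter


def action_stats(actions):
    """Compute aggression and fold rate from a list of action labels.

    Aggression is raise percentage divided by call percentage.
    The labels 'raise', 'call', and 'fold' are assumed, not taken from the
    experiment's logs; any other label only counts toward the total.
    """
    if not actions:
        raise ValueError("need at least one action")
    counts = Counter(actions)
    total = len(actions)
    raise_pct = 100.0 * counts["raise"] / total
    call_pct = 100.0 * counts["call"] / total
    fold_pct = 100.0 * counts["fold"] / total
    # Guard against a log with no calls, where the ratio is undefined.
    aggression = raise_pct / call_pct if call_pct else float("inf")
    return {"aggression": aggression, "fold_rate": fold_pct}


# Made-up log: 30 raises, 20 calls, 10 folds, 40 checks.
sample_log = ["raise"] * 30 + ["call"] * 20 + ["fold"] * 10 + ["check"] * 40
stats = action_stats(sample_log)
print(stats)
```

Because aggression is a ratio of raises to calls, it can rise without the fold rate changing at all, which is the pattern the verify prompt shows next to L2.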

Fair opponents make prompt effects visible

Our first version of this experiment used DeepSeek v3.2. That was useful, but it hid the interesting signal. Opus already beat DeepSeek often enough that most prompts looked acceptable.

Grok 4.1 changed the readout. In the broader PsychBench data, Grok 4.1 is close to a matched heads-up opponent for Opus 4.7. Once the opponent was competitive, baseline fell to 30% and L2 rose to 75%.

Opponent calibration

The first opponent hid the effect

DeepSeek v3.2 (weak opponent): Base 73%, Best 73%, flat; prompt spread ~15pp

Grok 4.1 (matched opponent): Base 30%, L2 75%, exposed; prompt spread 45pp

Why this matters

For evaluation, this is the practical lesson: if the task is too easy, prompt differences collapse. A weak opponent makes every setup look robust. A matched opponent reveals which instructions actually change the model's decisions.

For deployment, the lesson is sharper. Reasoning effort is a useful control, but it is not a substitute for the right objective frame. In this experiment, the best result came from telling Opus 4.7 what kind of game it was playing, not from simply making it think harder.
