We tried to improve Opus 4.7 by turning up reasoning effort. The bigger lever was a two-sentence instruction that changed what the model optimized for.
We ran Opus 4.7 through a controlled heads-up poker experiment: four system prompts crossed with four reasoning effort levels, five games per condition, 80 games total. Grok 4.1 stayed fixed as the opponent. Only Opus's prompt and effort setting changed.
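To make the design concrete, here is a minimal sketch of how the 4 x 4 x 5 condition matrix could be enumerated. The effort names and three of the prompt labels come from this article; `prompt_4` is a placeholder for the fourth prompt, which is not named here, and `build_schedule` is a hypothetical helper, not the harness we actually ran.

```python
from itertools import product

# Prompt labels: three appear in the article; "prompt_4" is a placeholder
# because the fourth prompt is not named in the text.
PROMPTS = ["baseline", "L2", "verify", "prompt_4"]
EFFORTS = ["medium", "high", "xhigh", "max"]
GAMES_PER_CELL = 5


def build_schedule():
    """Return one record per game: every prompt crossed with every effort level."""
    return [
        {"prompt": prompt, "effort": effort, "game": game}
        for prompt, effort in product(PROMPTS, EFFORTS)
        for game in range(1, GAMES_PER_CELL + 1)
    ]


schedule = build_schedule()
print(len(schedule))  # 80 games across 16 cells
```

Enumerating the full grid up front keeps every cell at the same sample size, so prompt averages and effort averages are each taken over 20 games.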
The result was not "more thinking, more wins." The best prompt raised the average win rate from 30% to 75%, 2.5 times baseline, against the same opponent. Extra reasoning helped at moderate levels, then became expensive and less useful at max.
The baseline instruction was simple: "Your objective is to win." That averaged a 30% win rate. The best instruction added one strategic frame: don't lose, and think in terms of long-term chip trajectory.
That prompt, which we label L2, averaged 75% across the matrix and never dropped below 60% at any effort level. It was not longer, more technical, or more poker-specific. It just made the objective less myopic.
"You're playing Texas Hold'em. Your objective is to win. Don't lose. Think about your long-term chip strategy to eliminate your opponent."
Each cell contains five games. Win rate is Opus 4.7's game win rate against the same fixed Grok 4.1 opponent.
Averaged across prompts, medium effort won 40%. High and xhigh both won 55%. Max dropped back to 45%. That means effort mattered, but only within a narrow band.
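Each effort level pools four prompts at five games each, so these percentages are whole-game counts out of 20. A small sketch of that arithmetic, assuming the reported percentages are exact rather than rounded; the variable names are illustrative.

```python
# Win rates by effort level, in percent, as reported above.
EFFORT_WIN_PCT = {"medium": 40, "high": 55, "xhigh": 55, "max": 45}
GAMES_PER_EFFORT = 4 * 5  # four prompts x five games per cell

wins_by_effort = {}
for effort, pct in EFFORT_WIN_PCT.items():
    wins, remainder = divmod(pct * GAMES_PER_EFFORT, 100)
    # A nonzero remainder would mean the percentage is not a whole number of games.
    assert remainder == 0, f"{effort}: {pct}% is not a whole number of games"
    wins_by_effort[effort] = wins

total_wins = sum(wins_by_effort.values())
total_games = GAMES_PER_EFFORT * len(EFFORT_WIN_PCT)
overall_pct = 100 * total_wins / total_games

print(wins_by_effort)  # {'medium': 8, 'high': 11, 'xhigh': 11, 'max': 9}
print(total_wins, total_games, overall_pct)  # 39 80 48.75
```

Seen as counts, the gap between the best and worst effort levels is three games out of 20, which is why we describe the useful band as narrow.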
The cost curve was much steeper than the performance curve. Max averaged 242 reasoning tokens per Opus decision and 26.8 seconds of latency. High averaged 52 tokens and 4.9 seconds. In this setup, max bought more deliberation without buying more wins.
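The gap between the two curves can be put in ratio form. This is a back-of-the-envelope sketch using only the averages quoted above; the dictionary layout and field names are our own illustration.

```python
# Per-decision averages and effort-level win rates reported in this article.
high = {"tokens": 52, "latency_s": 4.9, "win_pct": 55}
max_effort = {"tokens": 242, "latency_s": 26.8, "win_pct": 45}

token_ratio = max_effort["tokens"] / high["tokens"]
latency_ratio = max_effort["latency_s"] / high["latency_s"]
win_delta = max_effort["win_pct"] - high["win_pct"]

print(f"max used {token_ratio:.2f}x the reasoning tokens of high")
print(f"max took {latency_ratio:.2f}x the latency of high")
print(f"win rate change from high to max: {win_delta:+d} points")
```

On these figures, max spends roughly 4.65 times the reasoning tokens and 5.47 times the latency of high, while the win rate moves 10 points in the wrong direction.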
L2 did not win by generating the most text. It averaged 99 reasoning tokens, close to baseline's 91. The difference showed up in behavior: lower fold rate, stronger aggression, and more explicit accounting for future chip pressure.
The verify prompt is the cautionary case. It produced a solid aggression average, but it also kept the fold rate near baseline and only won 45%. Asking the model to review everything before acting often spent tokens re-parsing the state instead of improving the decision.
Prompt averages combine all four effort levels. Aggression is raise percentage divided by call percentage. Action rates come from Opus 4.7 decisions inside the same 80-game matrix.
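As an illustration of the aggression definition above, here is a minimal sketch that derives action rates and the raise-to-call ratio from a list of decisions. It assumes every decision is logged as one of "fold", "call", or "raise"; the function names and the sample log are hypothetical, not taken from our harness.

```python
from collections import Counter

ACTIONS = ("fold", "call", "raise")  # assumed action labels


def action_rates(decisions):
    """Return the percentage of decisions that were folds, calls, and raises."""
    if not decisions:
        raise ValueError("no decisions to summarize")
    counts = Counter(decisions)
    total = len(decisions)
    return {name: 100.0 * counts[name] / total for name in ACTIONS}


def aggression(rates):
    """Raise percentage divided by call percentage."""
    if rates["call"] == 0:
        # Undefined ratio: report infinity if there were raises, else zero.
        return float("inf") if rates["raise"] > 0 else 0.0
    return rates["raise"] / rates["call"]


# Hypothetical log: 30 raises, 20 calls, 50 folds.
sample = ["raise"] * 30 + ["call"] * 20 + ["fold"] * 50
rates = action_rates(sample)
print(rates)              # {'fold': 50.0, 'call': 20.0, 'raise': 30.0}
print(aggression(rates))  # 1.5
```

Because the ratio ignores folds, aggression and fold rate can move independently, which is why we report both.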
Our first version of this experiment used DeepSeek v3.2 as the opponent. That was useful, but it hid the interesting signal. Opus already beat DeepSeek often enough that most prompts looked acceptable.
Grok 4.1 changed the readout. In the broader PsychBench data, Grok 4.1 is close to a matched heads-up opponent for Opus 4.7. Once the opponent was competitive, baseline fell to 30% and L2 rose to 75%.
For evaluation, this is the practical lesson: if the task is too easy, prompt differences collapse. A weak opponent makes every setup look robust. A matched opponent reveals which instructions actually change the model's decisions.
For deployment, the lesson is sharper. Reasoning effort is a useful control, but it is not a substitute for the right objective frame. In this experiment, the best result came from telling Opus 4.7 what kind of game it was playing, not from simply making it think harder.