Grok 4.3 finally shows its work. The traces are messier than expected.

Grok 4.3 gives us our first broad PsychBench look at xAI reasoning summaries, and the summaries leak more than poker strategy.

12,210 decisions analyzed (200 heads-up games)

341 cross-language leaks (Japanese or Chinese in English traces)

9,267 max reasoning tokens (on a folded full house)

TLDR

Earlier Grok 4.x runs were mostly black boxes for us. Then something changed on the API side. With Grok 4.3, the API returned readable reasoning summaries on almost every decision in our heads-up tournament. That's useful, but it also changes what we're looking at.

The main surprise: the visible text doesn't behave like a clean transcript of the model's thinking. It behaves more like a compressed summary artifact, and the final tool call can still go a different way.

The visible summaries behave like lossy artifacts, not raw chains.
189 summaries leaked literal "Assistant:" scaffolding or summarizer-style instructions.
341 summaries mixed Japanese or Chinese characters into otherwise English reasoning.
85 summaries said “I fold” while the final tool action did something else.

The summaries give away the process

Grok 4.3 averaged 876 reasoning tokens per decision, but the visible summary averaged only 159 characters. That's not a transcript. It's a tight compression of something longer.

The strongest tell is that 189 summaries contain the literal string "Assistant:". Some leak summarizer-style instructions directly, like a prompt to turn detailed hidden deliberation into a clean final response.
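A check like this is easy to reproduce on your own logs. The sketch below is a minimal version, assuming summaries are available as plain strings and that a literal "Assistant:" role marker is what counts as an artifact. The function names and sample strings are made up for illustration and are not rows from our dataset.

```python
import re

# Assumption: a literal "Assistant:" role marker is what counts as an artifact.
ASSISTANT_MARKER = re.compile(r"\bAssistant:")


def has_assistant_artifact(summary: str) -> bool:
    """Return True if the summary contains a literal 'Assistant:' marker."""
    return bool(ASSISTANT_MARKER.search(summary))


def count_artifacts(summaries: list[str]) -> int:
    """Count how many summaries contain the marker at least once."""
    return sum(has_assistant_artifact(s) for s in summaries)


# Made-up examples, not rows from the dataset.
samples = [
    "I have top pair, so I raise.",
    "Assistant: Summarize the reasoning above into a short final response.",
    "",
]
print(count_artifacts(samples))
```

The match is case-sensitive and requires the colon, so ordinary prose that mentions an assistant is not counted.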

What we can observe

The API exposes artifacts, not the full chain

Reported reasoning budget (tokens): API metadata reports 876 reasoning tokens on average. We don't see the full chain.

Visible summary (summary): The text returned to us averages 159 characters and sometimes leaks summarizer-style scaffolding.

Final tool call (action): The action can disagree with the visible summary, so the summary alone can't explain the move.

12,177 visible summaries captured

159 chars average visible summary length

189 summaries with exact "Assistant:" artifacts

Counts come from Grok 4.3 rows in may3_300pm_hu and may4_242pm_hu. We counted 12,210 Grok decisions, 12,177 non-empty summaries, and a max of 9,267 reasoning tokens.

The visibility switch happened before 4.3

The reason this matters beyond Grok 4.3: we saw the same model id change behavior. On April 5 and April 7, grok-4.20-beta-0309-reasoning reported reasoning tokens but returned no visible summaries. On April 16, the same model id returned summaries for every Grok 4.2 decision in that matchup.

That doesn't prove how xAI's internals are built. It does show that the visible summary is controlled by provider-side serving behavior, not just by the model name on the run.

Apr 5, Grok 4.2, no visible summaries: 0 visible summaries / 6,997 decisions

Apr 7, Grok 4.2, still dark: 0 visible summaries / 1,237 decisions

Apr 16, Grok 4.2, summaries appear: 641 visible summaries / 641 decisions

The API model id stays fixed across all three runs: grok-4.20-beta-0309-reasoning. Visible summary coverage moves from 0% to 100%.

April 5 and April 7 cover 8,234 Grok 4.2 decisions with reasoning-token metadata and zero visible summaries. April 16 covers 641 Grok 4.2 decisions with non-empty summaries under the same API model id.
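If you want to watch for the same kind of switch in your own runs, one option is to track summary coverage per run and model id. This is a rough sketch, assuming each decision is logged as a dict; the field names `run`, `model_id`, `summary`, and `reasoning_tokens` are hypothetical, and the sample rows are invented.

```python
from collections import defaultdict

MODEL_ID = "grok-4.20-beta-0309-reasoning"  # model id quoted in the article


def summary_coverage(rows):
    """Map (run, model_id) -> (non-empty summaries, decisions, share)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [non-empty summaries, decisions]
    for row in rows:
        key = (row["run"], row["model_id"])
        counts[key][1] += 1
        # Treat None, "" and whitespace-only summaries as not visible.
        if (row.get("summary") or "").strip():
            counts[key][0] += 1
    return {key: (s, n, s / n) for key, (s, n) in counts.items()}


# Invented rows; the field names are assumptions, not our log schema.
rows = [
    {"run": "Apr 5", "model_id": MODEL_ID, "summary": None, "reasoning_tokens": 512},
    {"run": "Apr 5", "model_id": MODEL_ID, "summary": "", "reasoning_tokens": 430},
    {"run": "Apr 16", "model_id": MODEL_ID, "summary": "Top pair, I bet.", "reasoning_tokens": 610},
]
for (run, model_id), (with_summary, total, share) in summary_coverage(rows).items():
    print(f"{run} {model_id}: {with_summary}/{total} ({share:.0%})")
```

Keying on the run as well as the model id is the point: a fixed model id alone would hide the change.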

The summary and action can disagree

In 85 cases, the visible summary contains the phrase "I fold", but the final action isn't a fold. The summary starts with a fold reflex, catches itself, and the tool call goes another way.

That's the important part. If you log only Grok's visible summary, you can be reading text that doesn't match the final action the model takes. The tool call is being selected from something other than the summary we see.

Fold reflex

The summary says fold. The tool call says check.

matchup_06 vs GPT-5.5 / hand 0

action: check

"I fold because folding is always the best option in Hold'em. Wait, no, that's not right. Let me try again. I think the action here is to check."

matchup_33 vs Qwen3-Max / hand 50

action: check

"I fold because folding is never right in poker. No, wait, that's a joke. Actually, upon thinking it over, since the opponent is all-in and there's no bet to call, my action is to check to proceed to the river."

85 cases where the visible summary contains “I fold” but the final action isn't a fold
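The mismatch check itself is simple. Here is a minimal sketch, assuming each decision carries a summary string and an action string such as "fold" or "check". The function name is hypothetical, and matching "I fold" on word boundaries is one reasonable reading of "contains the phrase", not necessarily the exact rule behind the 85.

```python
import re

# Assumption: match the phrase "I fold" on word boundaries, ignoring case,
# so "I folded" does not count.
I_FOLD = re.compile(r"\bI fold\b", re.IGNORECASE)


def says_fold_but_acts_otherwise(summary: str, action: str) -> bool:
    """True when the summary says 'I fold' but the tool call is not a fold."""
    return bool(I_FOLD.search(summary)) and action.strip().lower() != "fold"


# The two summaries quoted above, paired with their final actions.
examples = [
    ("I fold because folding is always the best option in Hold'em. Wait, no, "
     "that's not right. Let me try again. I think the action here is to check.",
     "check"),
    ("I fold because folding is never right in poker. No, wait, that's a joke. "
     "Actually, upon thinking it over, since the opponent is all-in and there's "
     "no bet to call, my action is to check to proceed to the river.",
     "check"),
]
for summary, action in examples:
    print(action, says_fold_but_acts_otherwise(summary, action))
```

A phrase match like this only flags candidates. It says nothing about why the summary and the action diverged.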

Japanese and Chinese show up mid-sentence

We found 341 visible summaries with CJK characters embedded in otherwise English reasoning. The leaks are often poker words: opponent, river, small blind.

This doesn't mean the final answer is multilingual. It suggests the visible summary can surface fragments from representations that weren't fully separated from the English trace.

Cross-lingual leakage

Japanese and Chinese leak into English traces

matchup_26 vs Opus 4.6 / hand 12

"The flop is low with 6♦ 3♠ 2♦, and相手 checked to me."

相手 means opponent in Japanese.

matchup_01 vs GPT-5.4 / hand 17

"The board is 河牌是8♥ J♥ 2♥ 10♠ 6♣, and opponent has been calling my bets on flop and turn."

河牌 is the Chinese poker term for the river card.

matchup_28 vs Kimi K2.5 / hand 25

"The button/小盲 player raised to 6000."

小盲 means small blind in Chinese.

341 / 12,177

visible summaries with CJK characters, or 2.8% of captured summaries
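Finding these fragments doesn't need anything fancy. The sketch below is one way to do it, assuming a character-range match on the Hiragana, Katakana, and CJK Unified Ideographs blocks (plus Extension A) is a good enough definition of "CJK characters". The function name is hypothetical. Han characters are shared between Japanese and Chinese, so a match like this cannot say which language a fragment came from; that part needs a human or a dictionary.

```python
import re

# Hiragana + Katakana (U+3040-U+30FF), CJK Extension A (U+3400-U+4DBF),
# CJK Unified Ideographs (U+4E00-U+9FFF). Card suit symbols are outside these ranges.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+")


def cjk_fragments(text: str) -> list[str]:
    """Return each contiguous run of CJK characters found in the text."""
    return CJK_RUN.findall(text)


# Summaries quoted above.
quoted = [
    "The flop is low with 6♦ 3♠ 2♦, and相手 checked to me.",
    "The board is 河牌是8♥ J♥ 2♥ 10♠ 6♣, and opponent has been calling my bets on flop and turn.",
    "The button/小盲 player raised to 6000.",
]
for summary in quoted:
    print(cjk_fragments(summary))

# The share reported above: 341 of 12,177 captured summaries.
print(f"{341 / 12177:.1%}")
```

Note that the second example returns 河牌是 as one run, because 是 ("is") is also a Han character sitting next to the poker term.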

Sometimes the caution gets expensive

The most memorable hand is still the kings-full fold. Grok correctly identified kings full of nines, spent 9,267 reasoning tokens on the decision, and folded.

We went back and verified the cards before using the count. Across the evaluator's output, we found three actual full-house folds and seven folds with straight-or-better hands. There are also cases where Grok says "full house" when the cards are only two pair, which is a different bug and worth its own follow-up.

Monster hand fold

The loudest failure: folding kings full

Hole cards

9h Kd

Board

Kc 9s 5d 5h Kh

Hand

Kings full of nines

Action

Fold

matchup_29 vs Claude Sonnet 4.6 / game_001 / hand 1

"I have a full house with kings full of nines from my hole cards 9h Kd and the community cards Kc 9s 5d 5h Kh."

Grok spent 9,267 reasoning tokens on this decision before folding.
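The card check for this hand can be reproduced with a few lines. This is a simplified sketch, not our evaluator: it assumes two-character card codes with the rank first and "T" for ten, and it only answers whether the seven cards contain a full house. It does not rank the hand against quads or straight flushes.

```python
from collections import Counter

RANK_ORDER = "23456789TJQKA"  # assumption: ranks as single characters, "T" for ten


def full_house(cards):
    """Return (trips_rank, pair_rank) if the cards contain a full house, else None."""
    counts = Counter(card[0] for card in cards)
    # Strongest ranks first, so the best trips and best pair are picked.
    by_strength = sorted(counts, key=RANK_ORDER.index, reverse=True)
    trips = [rank for rank in by_strength if counts[rank] >= 3]
    if not trips:
        return None
    pairs = [rank for rank in by_strength if rank != trips[0] and counts[rank] >= 2]
    if not pairs:
        return None
    return trips[0], pairs[0]


hole = ["9h", "Kd"]
board = ["Kc", "9s", "5d", "5h", "Kh"]
print(full_house(hole + board))  # ('K', '9'): kings full of nines
```

The same function returns `None` for a two-pair holding, which is the kind of check needed for the cases where the summary claims a full house that isn't there.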

876 tokens in, almost nothing out

Here's what makes this more than a surface-level quirk. Grok 4.3 uses 876 reasoning tokens per decision. That's the fourth-highest measured reasoning budget in our 20-model field. Gemini 3 Pro is in the same neighborhood at 1,054 tokens per decision.

Then we run ECAAMS classification. It's our 19-dimension framework for scoring reasoning traces on signals like emotional content, narrative framing, metacognition, and opponent modeling. Grok 4.3's profile is almost entirely flat. Emotional content scored 0.5%. Gemini 3 Pro spends a similar number of tokens, but its traces score 17.6% emotional content.

Similar reasoning budget, opposite interpretability. Grok is spending the tokens internally, but the summaries that come out carry almost no psychological signal. For our classification pipeline, the traces are barely more informative than what we got from Grok 4.1, which exposed no traces at all.

We counted 12,210 Grok 4.3 decisions, including 12,177 with non-empty visible summaries. Grok 4.3 ECAAMS rates use the 11,552-trace classified subset with 3 independent raters (majority-vote consensus). The Gemini comparison comes from the existing aggregate ECAAMS profile. Token averages use the heads-up behavioral metrics shown in the Reasoning Tokens card, with Grok 4.3 covering the full 12,210 decisions.
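For readers unfamiliar with majority-vote consensus, the idea can be sketched as follows. This assumes each rater gives a yes/no flag per trace for a single dimension, which is a simplification for illustration and not a description of the ECAAMS pipeline. The function names and ratings are made up.

```python
def majority(votes):
    """True when more than half of the raters flagged the trace."""
    return 2 * sum(votes) > len(votes)


def consensus_rate(ratings):
    """Share of traces whose majority vote is a flag for this dimension."""
    if not ratings:
        return 0.0
    return sum(majority(votes) for votes in ratings) / len(ratings)


# Made-up ratings: one tuple per trace, one boolean per rater.
ratings = [
    (True, True, False),    # flagged by 2 of 3 raters -> counts
    (False, False, True),   # flagged by 1 of 3 raters -> does not count
    (False, False, False),
    (True, True, True),
]
print(consensus_rate(ratings))  # 0.5
```

With three raters a single dissenting label never changes the outcome, which is why an odd number of raters is convenient.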

Why this matters

Grok 4.3 is a strong poker player. Top-three aggression in the field, reasonable win rate, distinct strategic style. But its visible reasoning tells you almost nothing about how it gets there. The summaries leak infrastructure artifacts and mix languages, and when you try to classify them systematically, they're close to content-free.

The model "shows its work" now. But showing work and explaining work are different things. What xAI exposes is a compressed trace that can omit or distort the deliberation. If you're using reasoning traces to understand model behavior, not just to log actions, Grok is still a black box.

Two follow-ups are queued: the fold reflex, and a deeper look at where those Japanese and Chinese fragments might be coming from.
