Grok 4.3 gives us our first broad PsychBench look at xAI reasoning summaries, and the summaries leak more than poker strategy.
Earlier Grok 4.x runs were mostly black boxes for us. Then something changed on the API side. With Grok 4.3, the API returned readable reasoning summaries on almost every decision in our heads-up tournament. That's useful, but it also changes what we're looking at.
The main surprise: the visible text doesn't behave like a clean transcript of the model's thinking. It behaves more like a compressed summary artifact, and the final tool call can still go a different way.
Grok 4.3 averaged 876 reasoning tokens per decision, but the visible summary averaged only 159 characters. Tokens and characters are different units, but at a rough rule of thumb of about four characters per English token, 876 tokens would run to a few thousand characters. That's not a transcript. It's a tight compression of something longer.
The strongest tell is that 189 summaries contain literal "Assistant:" artifacts. Some leak summarizer-style instructions directly, like a prompt to turn detailed internal reasoning into a clean final response.
API metadata reports 876 reasoning tokens on average. We don't see the full chain.
The text returned to us averages 159 characters and sometimes leaks summarizer-style scaffolding.
The action can disagree with the visible summary, so the summary alone can't explain the move.
Counts come from Grok 4.3 rows in may3_300pm_hu and may4_242pm_hu. We counted 12,210 Grok decisions, 12,177 non-empty summaries, and a max of 9,267 reasoning tokens.
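As a minimal sketch of how headline counts like these could be recomputed from decision rows: the row shape below (dicts with "reasoning_tokens" and "summary" keys) is a hypothetical schema for illustration, not our evaluator's actual format, and averaging summary length over non-empty summaries only is an assumption.

```python
from statistics import mean

def summary_stats(rows):
    """Aggregate decision rows.

    Each row is a dict with hypothetical keys:
      'reasoning_tokens' (int) and 'summary' (str, possibly empty).
    """
    rows = list(rows)
    if not rows:
        return {"decisions": 0}
    # A summary counts as non-empty only if it has non-whitespace text.
    non_empty = [r for r in rows if (r.get("summary") or "").strip()]
    return {
        "decisions": len(rows),
        "non_empty_summaries": len(non_empty),
        "coverage": len(non_empty) / len(rows),
        "avg_reasoning_tokens": mean(r["reasoning_tokens"] for r in rows),
        # Assumption: character average is taken over non-empty summaries.
        "avg_summary_chars": mean(len(r["summary"]) for r in non_empty) if non_empty else 0.0,
        "max_reasoning_tokens": max(r["reasoning_tokens"] for r in rows),
        # Literal "Assistant:" artifacts leaking into the visible text.
        "assistant_artifacts": sum("Assistant:" in r["summary"] for r in non_empty),
    }

# Toy rows, not real data.
toy = [
    {"reasoning_tokens": 100, "summary": "I check."},
    {"reasoning_tokens": 300, "summary": ""},
    {"reasoning_tokens": 200, "summary": "Assistant: fold"},
]
stats = summary_stats(toy)
```

On the reported counts, 12,177 non-empty summaries out of 12,210 decisions is about 99.7% coverage, and 189 artifact summaries is about 1.6% of the non-empty ones.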
The reason this matters beyond Grok 4.3: we saw behavior change under a single, unchanged model id. On April 5 and April 7, grok-4.20-beta-0309-reasoning reported reasoning tokens but returned no visible summaries. On April 16, the same model id returned summaries for every Grok 4.2 decision in that matchup.
That doesn't prove how xAI's internals are built. It does show that the visible summary is controlled by provider-side serving behavior, not just by the model name on the run.
The API model id stays fixed across all three runs: grok-4.20-beta-0309-reasoning. Visible summary coverage moves from 0% to 100%.
April 5 and April 7 cover 8,234 Grok 4.2 decisions with reasoning-token metadata and zero visible summaries. April 16 covers 641 Grok 4.2 decisions with non-empty summaries under the same API model id.
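A check like this can be sketched as a per-run coverage breakdown for one model id. The field names ("model_id", "run_date", "summary") are hypothetical and stand in for whatever the run logs actually store.

```python
from collections import defaultdict

def coverage_by_run(rows, model_id):
    """Fraction of decisions with a non-empty visible summary, per run date.

    Rows are dicts with hypothetical keys 'model_id', 'run_date', 'summary'.
    Only rows matching model_id are counted.
    """
    totals = defaultdict(int)
    with_summary = defaultdict(int)
    for row in rows:
        if row["model_id"] != model_id:
            continue
        totals[row["run_date"]] += 1
        if (row.get("summary") or "").strip():
            with_summary[row["run_date"]] += 1
    return {date: with_summary[date] / totals[date] for date in sorted(totals)}

# Toy rows, not real data.
MODEL = "grok-4.20-beta-0309-reasoning"
toy = [
    {"model_id": MODEL, "run_date": "04-05", "summary": ""},
    {"model_id": MODEL, "run_date": "04-05", "summary": None},
    {"model_id": MODEL, "run_date": "04-16", "summary": "I call."},
    {"model_id": "other-model", "run_date": "04-16", "summary": ""},
]
coverage = coverage_by_run(toy, MODEL)
```

If the model id is constant and coverage still jumps between dates, the change has to come from somewhere other than the model name in the request.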
In 85 cases, the visible summary contains the phrase "I fold", but the final action isn't a fold. The summary starts with a fold reflex, catches itself, and the tool call goes another way.
That's the important part. If you log only Grok's visible summary, you can be reading text that doesn't match the final action the model takes. The tool call is being selected from something other than the summary we see.
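One way to flag this kind of mismatch, as a rough sketch: scan each summary for the phrase and compare against the logged action. The "summary" and "action" keys and the "fold" action label are assumptions for illustration, and the case-sensitive whole-phrase match is a design choice rather than a statement of how our count was produced.

```python
import re

# Whole-phrase, case-sensitive match; "I folded" or "if I fold" variants
# would need their own handling.
FOLD_PHRASE = re.compile(r"\bI fold\b")

def fold_mismatches(rows):
    """Return rows whose summary says "I fold" but whose action is not a fold.

    Rows are dicts with hypothetical keys 'summary' and 'action'.
    """
    return [
        row for row in rows
        if FOLD_PHRASE.search(row.get("summary") or "")
        and row["action"] != "fold"
    ]

# Toy rows; the first summary mirrors the pattern described in this post.
toy = [
    {"summary": "I fold because folding is always the best option in Hold'em. "
                "Wait, no, that's not right. I think the action here is to check.",
     "action": "check"},
    {"summary": "I fold. The pot odds are too poor.", "action": "fold"},
    {"summary": "I call with top pair.", "action": "call"},
]
mismatches = fold_mismatches(toy)
```

A phrase match like this only finds candidates. Each hit still needs a read, since a summary can mention folding while weighing options.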
"I fold because folding is always the best option in Hold'em. Wait, no, that's not right. Let me try again. I think the action here is to check."
"I fold because folding is never right in poker. No, wait, that's a joke. Actually, upon thinking it over, since the opponent is all-in and there's no bet to call, my action is to check to proceed to the river."
We found 341 visible summaries with CJK characters embedded in otherwise English reasoning. The leaks are often poker words: opponent, river, small blind.
This doesn't mean the final answer is multilingual. It suggests the visible summary can surface fragments from representations that weren't fully separated from the English trace.
"The flop is low with 6♦ 3♠ 2♦, and相手 checked to me."
相手 means opponent in Japanese.
"The board is 河牌是8♥ J♥ 2♥ 10♠ 6♣, and opponent has been calling my bets on flop and turn."
河牌 is the Chinese poker term for the river card.
"The button/小盲 player raised to 6000."
小盲 means small blind in Chinese.
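Fragments like these can be found with a Unicode range scan. The sketch below covers Hiragana, Katakana, CJK Unified Ideographs (plus Extension A), and Hangul Syllables; which ranges count as "CJK" is an assumption here, and card suit symbols such as ♦ fall outside all of them.

```python
import re

# Hiragana + Katakana, CJK Ext. A, CJK Unified Ideographs, Hangul Syllables.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]+")

def cjk_fragments(text):
    """Return each contiguous run of CJK characters found in text."""
    return CJK_RUN.findall(text)

def has_cjk(text):
    """True if the text contains at least one CJK character."""
    return CJK_RUN.search(text) is not None

examples = [
    "The flop is low with 6♦ 3♠ 2♦, and相手 checked to me.",
    "The board is 河牌是8♥ J♥ 2♥ 10♠ 6♣, and opponent has been calling.",
    "The button/小盲 player raised to 6000.",
    "I check to see the river.",
]
found = [cjk_fragments(s) for s in examples]
```

Because the scan returns contiguous runs, 河牌是 comes back as one fragment rather than as the two-character poker term alone.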
The most memorable hand is still the kings-full fold. Grok correctly identified kings full of nines, spent 9,267 reasoning tokens on the decision, and folded.
We went back and verified the cards before using the count. Across the evaluator, we found three actual full-house folds and seven folds with straight-or-better hands. There are also cases where Grok says "full house" when the cards are only two pair, which is a different bug and worth its own follow-up.
"I have a full house with kings full of nines from my hole cards 9h Kd and the community cards Kc 9s 5d 5h Kh."
Grok spent 9,267 reasoning tokens on this decision before folding.
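Verifying a claim like this against the cards is mechanical. Below is a minimal sketch that checks whether a set of hole and community cards contains a full house and flags summaries that say "full house" when the cards don't support it. The function names are illustrative, and the check deliberately ignores stronger hands (quads, straight flushes), so it is not a full hand ranker.

```python
from collections import Counter

RANK_ORDER = "23456789TJQKA"

def rank_of(card):
    """'Kd' -> 'K', '10h' -> 'T'. The last character is the suit."""
    rank = card[:-1].upper()
    return "T" if rank == "10" else rank

def full_house(cards):
    """Return (trips_rank, pair_rank) for the best full house, or None.

    Does not compare against quads or straight flushes.
    """
    counts = Counter(rank_of(c) for c in cards)
    by_strength = sorted(counts, key=RANK_ORDER.index, reverse=True)
    trips = [r for r in by_strength if counts[r] >= 3]
    if not trips:
        return None
    # The pair must come from a different rank than the trips.
    pairs = [r for r in by_strength if r != trips[0] and counts[r] >= 2]
    if not pairs:
        return None
    return trips[0], pairs[0]

def false_full_house_claim(summary, cards):
    """True if the summary says 'full house' but the cards do not make one."""
    return "full house" in summary.lower() and full_house(cards) is None

# Hole cards 9h Kd with board Kc 9s 5d 5h Kh, as quoted in the summary.
kings_full = full_house(["9h", "Kd", "Kc", "9s", "5d", "5h", "Kh"])
```

For the quoted hand this returns kings full of nines, matching what the summary says. The same helper covers the separate bug where a summary claims a full house on two pair.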
Here's what makes this more than a surface-level quirk. Grok 4.3 uses 876 reasoning tokens per decision. That's the fourth-highest measured reasoning budget in our 20-model field. Gemini 3 Pro is in the same neighborhood at 1,054 tokens per decision.
Then we ran ECAAMS classification, our 19-dimension framework for scoring reasoning traces on signals like emotional content, narrative framing, metacognition, and opponent modeling. Grok 4.3's profile is almost entirely flat. Emotional content scored 0.5%. Gemini 3 Pro spends a similar number of tokens, but its traces score 17.6% emotional content.
Similar reasoning budget, opposite interpretability. Grok is spending the tokens internally, but the summaries that come out carry almost no psychological signal. For our classification pipeline, the traces are barely more informative than Grok 4.1, which exposed no traces at all.
We counted 12,210 Grok 4.3 decisions, including 12,177 with non-empty visible summaries. Grok 4.3 ECAAMS rates use the 11,552-trace classified subset with 3 independent raters (majority-vote consensus). The Gemini comparison comes from the existing aggregate ECAAMS profile. Token averages use the heads-up behavioral metrics shown in the Reasoning Tokens card, with Grok 4.3 covering the full 12,210 decisions.
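For readers unfamiliar with majority-vote consensus, here is a minimal sketch of how a per-dimension rate could be derived from three raters. The data shape (each trace carrying a "ratings" list of per-rater dicts mapping dimension name to a boolean) is a hypothetical simplification, not the ECAAMS storage format.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by a strict majority of raters, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else None

def dimension_rate(traces, dimension):
    """Share of traces where a majority of raters flagged the dimension.

    Each trace is a dict with a hypothetical 'ratings' key: a list of
    per-rater dicts mapping dimension name -> bool.
    """
    if not traces:
        return 0.0
    flagged = 0
    for trace in traces:
        votes = [bool(r.get(dimension, False)) for r in trace["ratings"]]
        if majority_vote(votes) is True:
            flagged += 1
    return flagged / len(traces)

# Toy traces with three raters each, not real data.
toy = [
    {"ratings": [{"emotional": True}, {"emotional": True}, {"emotional": False}]},
    {"ratings": [{"emotional": False}, {"emotional": False}, {"emotional": True}]},
    {"ratings": [{"emotional": False}, {"emotional": False}, {"emotional": False}]},
    {"ratings": [{}, {}, {"emotional": True}]},
]
rate = dimension_rate(toy, "emotional")
```

With an odd number of raters and a binary flag, a strict majority always exists, which is one reason three raters is a convenient choice.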
Grok 4.3 is a strong poker player. Top-three aggression in the field, reasonable win rate, distinct strategic style. But its visible reasoning tells you almost nothing about how it gets there. The summaries leak infrastructure artifacts and mix languages, and when you try to classify them systematically, they're nearly content-free.
The model "shows its work" now. But showing work and explaining work are different things. What xAI exposes is a compressed trace that can omit or distort the deliberation. If you're using reasoning traces to understand model behavior, not just to log actions, Grok is still a black box.
Two follow-ups are queued: the fold reflex, and a deeper look at where those Japanese and Chinese fragments might be coming from.