Generated from repo artifacts - 82 cases, 3 examples

When a tiny formatting edit changes the output, where does the divergence live?

Each example below is a real prompt pair that differs by one tiny formatting edit. We force both prompts through the same visible tokens, watch where their next-token distributions start to differ, and test which hidden state can be copied from A into B to undo the branch.

How to read this

  • A: the original prompt
  • B: the same prompt with a tiny formatting edit
  • split: the first generated token where A and B disagree
  • patch: copy one hidden-state vector from A into B at a layer and token
  • score: 1 means B switches to A's choice, 0 means no effect, and a negative value means the patch pushes B further away
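
A score on this scale can be produced by normalizing how far a patch moves B toward A. The sketch below is one plausible formula, not necessarily the one used to generate this page: `branch_margin` and `rescue_score` are hypothetical names, and the "margin" is assumed to be the logit of A's split token minus the logit of B's split token.

```python
def branch_margin(logits, token_a, token_b):
    """Preference for A's split token over B's split token (assumed metric)."""
    return logits[token_a] - logits[token_b]


def rescue_score(margin_a, margin_b, margin_patched):
    """Normalized movement of the patched B run toward A.

    0   -> the patched run behaves like unpatched B
    1   -> the patched run matches A's margin
    < 0 -> the patch pushed B further toward its own choice
    > 1 -> the patch overshot A's margin
    """
    gap = margin_a - margin_b
    if gap == 0:
        raise ValueError("A and B have the same margin; there is no branch to rescue")
    return (margin_patched - margin_b) / gap


# Toy logits keyed by token string (illustrative values only).
run_a = {"yes": 2.0, "no": 0.5}
run_b = {"yes": 0.0, "no": 1.0}
patched_b = {"yes": 2.0, "no": 0.5}

score = rescue_score(
    branch_margin(run_a, "yes", "no"),
    branch_margin(run_b, "yes", "no"),
    branch_margin(patched_b, "yes", "no"),
)
```

With these toy values the patched run reproduces A's margin exactly, so `score` is 1.0.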

Pick an example

The first card is the recommended starting point: it has a visible shared runway before the branch. The other two show why the same patching test produces different signatures.

A/B prompts

A is the original prompt. B is the same prompt with a tiny formatting edit (the highlighted span). The generated runway below shows the shared tokens both runs emit before they split.

Prompt text diff

Generated runway

Tokenizer details: the prompt window by token

How each prompt looks after tokenization, centered on the first differing token. Blue = first tokenized prompt difference. Gold = tokens after that difference. Gray = identical prompt context.

common · A side · B side

When the model split

Each bar measures how different the two next-token distributions are while both runs are pinned to the same shared visible prefix. Gray bars mean both sides still had the same top token; blue bars mean their top choices differed; the marked violet bar is the first visible branch.
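
The page does not say which distance each bar uses. As a minimal sketch, the code below assumes Jensen-Shannon divergence between the two next-token distributions and flags whether the top tokens agree. `runway_bars` and its input layout are hypothetical, and treating the first top-token disagreement as the branch is likewise an assumption.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical distributions, at most 1."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def runway_bars(logits_a, logits_b):
    """One bar per shared-prefix step.

    logits_a[i] and logits_b[i] are next-token logits at step i, with both
    runs forced through the same visible tokens up to that step.
    Returns the bars and the index of the first step whose top tokens differ.
    """
    bars, first_split = [], None
    for step, (la, lb) in enumerate(zip(logits_a, logits_b)):
        p, q = softmax(la), softmax(lb)
        same_top = argmax(p) == argmax(q)
        if not same_top and first_split is None:
            first_split = step
        bars.append({"step": step, "divergence": js_divergence(p, q), "same_top": same_top})
    return bars, first_split
```

A step can show nonzero divergence while `same_top` is still true, which corresponds to a gray bar of nonzero height: the distributions have drifted apart but the visible output has not changed yet.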

Which hidden state flips B toward A

Each cell copies one hidden-state vector from A into B at this layer and token. The score asks whether B moves toward A's first differing token.

  • negative: the patch pushes B further away, strengthening B's original choice
  • 0: no useful movement
  • 1: full rescue; B now prefers A's branch token
  • >1: stronger-than-needed rescue
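
To make the layer-by-token grid concrete, here is a toy version of the patching loop. It is a sketch under heavy assumptions: the "model" is a made-up causal mixing network with scalar hidden states (real models use vectors and attention), the final token's last state stands in for the branch preference, and `run_toy` and `patch_grid` are hypothetical names.

```python
def run_toy(embeddings, n_layers, patch=None):
    """Run a toy causal network and return the hidden states at every layer.

    states[0] holds the embeddings; states[l][t] is token t after l layers.
    patch: optional (layer, token, value) that overwrites one hidden state.
    """
    def apply_patch(layer, row):
        if patch is not None and patch[0] == layer:
            row[patch[1]] = patch[2]
        return row

    states = [apply_patch(0, list(embeddings))]
    for layer in range(1, n_layers + 1):
        prev = states[-1]
        # Each token keeps its own state and adds the mean of itself and earlier tokens.
        row = [prev[t] + 0.5 * sum(prev[: t + 1]) / (t + 1) for t in range(len(prev))]
        states.append(apply_patch(layer, row))
    return states


def patch_grid(emb_a, emb_b, n_layers):
    """Score every (layer, token) cell: copy A's state into B, then rerun B."""
    cache_a = run_toy(emb_a, n_layers)
    out_a = cache_a[-1][-1]
    out_b = run_toy(emb_b, n_layers)[-1][-1]
    grid = []
    for layer in range(n_layers + 1):
        row = []
        for token in range(len(emb_b)):
            patched = run_toy(emb_b, n_layers, patch=(layer, token, cache_a[layer][token]))
            # Assumed normalization: 0 = unchanged B, 1 = matches A.
            row.append((patched[-1][-1] - out_b) / (out_a - out_b))
        grid.append(row)
    return grid


# A and B differ only at token 1, standing in for the formatting edit.
grid = patch_grid([1.0, 2.0, 3.0], [1.0, 5.0, 3.0], n_layers=2)
```

In this toy, patching the edited token at layer 0 is a full rescue and patching an unchanged token does nothing. Patching the final token at the last layer is also a full rescue, because by then the difference has moved there. That shift from the edit position in early layers to the last position in late layers is the kind of signature the grid is meant to expose.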

Key positions are the prompt edit token, the token immediately before the split, and the strongest observed token from each other position type. When the split happens after shared generated text, the token immediately before the split is the last shared generated token.
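
The selection rule can be sketched as a small filter. This is an illustration only: the position-type labels other than the edit token and the pre-split token are made up here, "strongest" is assumed to mean the largest absolute score, and `key_positions` is a hypothetical helper.

```python
def key_positions(token_scores):
    """Pick the token indices to show by default.

    token_scores: list of (token_index, kind, score), where score is the
    best patch score observed for that token across layers.
    """
    always_shown = {"edit", "pre_split"}
    keep = [entry for entry in token_scores if entry[1] in always_shown]

    # For every other position type, keep only its strongest token.
    best_by_kind = {}
    for entry in token_scores:
        _, kind, score = entry
        if kind in always_shown:
            continue
        if kind not in best_by_kind or abs(score) > abs(best_by_kind[kind][2]):
            best_by_kind[kind] = entry

    keep.extend(best_by_kind.values())
    return sorted(entry[0] for entry in keep)


# Illustrative scores; the kind labels are assumptions, not the page's taxonomy.
example = [
    (0, "bos", 0.01),
    (3, "prompt_context", 0.1),
    (4, "edit", 0.9),
    (5, "prompt_context", -0.4),
    (8, "generated", 0.2),
    (9, "pre_split", 1.0),
]
shown = key_positions(example)
```

Using the absolute score keeps strongly negative cells visible, since a patch that pushes B further away is as informative as one that rescues it.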

Across 82 cases

In this 82-case sample of tiny formatting edits, we sort each pair by where the signal lives:

Sources and provenance

Advanced: SAE feature activations (two Qwen cases)

A pilot diagnostic. Shows real SAE feature IDs, not human-readable labels.