Time and space
The hybrid decoder from part 6 has a clean separation of responsibilities. The GRU handles temporal reasoning: what has happened so far on this journey, where the route has been, which lines have been taken. The cross-attention layers handle spatial reasoning: what the graph looks like from where the decoder currently stands, which neighbours lead toward the destination, which interchanges are ahead.
This separation wasn't designed from first principles. It emerged from a sequence of failures. But it maps onto a distinction that the architecture makes legible, and understanding it clarifies both why the hybrid works and where the remaining limitations are.
At each decoding step, the model needs two kinds of information to choose the next station. It needs to know what the graph looks like — which stations are adjacent, which lines serve them, where the destination is relative to the current position. And it needs to know what it has already done — which direction it's travelling, whether it has already changed lines, how far it is into the journey.
These are different questions. The first is spatial: it's about the structure of the network, which doesn't change between steps. The second is temporal: it's about the sequence of decisions made so far, which changes at every step.
The three attention mechanisms available to a sequence decoder correspond to these questions in different ways.
Cross-attention: reading the graph
Cross-attention lets the decoder query a separate representation — in this case, the 272 station embeddings produced by the GATv2 encoder. At each step the decoder's state becomes a query, and the encoder's station embeddings are the keys and values. The attention weights identify which stations in the network are relevant right now; the value projections carry that information back into the decoder's representation via a residual connection.
This is purely spatial reasoning. The decoder is asking: "given my current state, what does the network look like from here?" It can find the destination (its embedding will have distinctive features from the encoder), identify upcoming interchanges, and assess which of the 3–4 adjacent stations leads in the right direction. The encoder output is computed once and never changes — it's a static map that the decoder reads at every step.
The pointer mechanism that the original GRU decoder used was a minimal version of this. A single query–key dot product over all station embeddings, with no value projections and no residual connection. It scored stations but read nothing back. Replacing the pointer with multi-head cross-attention was the change that lifted beam accuracy from 28.2% to 57.4% at dev scale: the decoder could now read the graph, not just score against it.
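The difference can be made concrete in a few lines. This is a minimal sketch, not the project's actual code: the embedding width (64) and head count (4) are assumptions, and the tensors are random stand-ins for the real encoder output and decoder state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_stations = 64, 272                    # dims assumed for the sketch
enc = torch.randn(1, n_stations, d)        # GATv2 station embeddings (static)
state = torch.randn(1, 1, d)               # decoder state at the current step

# Pointer mechanism: one query-key dot product over all stations.
# The scores are the logits; nothing is read back into the decoder.
Wq = nn.Linear(d, d, bias=False)
Wk = nn.Linear(d, d, bias=False)
pointer_logits = (Wq(state) @ Wk(enc).transpose(1, 2)) / d**0.5  # (1, 1, 272)

# Multi-head cross-attention: same scoring idea, but value projections
# carry graph information back into the decoder via a residual connection.
mha = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
read, weights = mha(query=state, key=enc, value=enc)
state = state + read                       # the decoder now reads the graph
```

The pointer produces only scores; the cross-attention additionally updates `state`, which is the difference between scoring against the graph and reading from it.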
Stacking cross-attention layers intensifies this spatial reasoning. With two layers, the first gathers context from the graph and refines the decoder's representation; the second re-attends to the same graph with this enriched representation, potentially extracting information that wasn't accessible in a single pass. This is the pattern the literature calls multi-hop attention, first described in End-to-End Memory Networks (Sukhbaatar et al., 2015): attend, refine, attend again. Each hop reads the same memory but with a progressively better query.
The 2-layer hybrid showed a clear improvement over the 1-layer at dev scale (63.1% vs 57.4% beam), suggesting that single-pass spatial reasoning was genuinely insufficient for some routing decisions. A junction like Camden Town, where the Northern line forks into two branches, requires knowing not just the immediate neighbours but where each branch leads several hops downstream. The GATv2 encoder with 6 layers propagates information across 6 hops, so this is already encoded in the station embeddings — but extracting it may take more than one attention pass.
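The multi-hop pattern is a short loop: each hop attends to the same, unchanged memory, and only the query improves between hops. A sketch under the same assumed dimensions as above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, hops = 64, 2                  # 2 hops = the 2-layer hybrid
enc = torch.randn(1, 272, d)     # station memory, identical at every hop
q = torch.randn(1, 1, d)         # decoder state as the initial query

layers = nn.ModuleList(
    nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    for _ in range(hops)
)
for layer in layers:
    read, _ = layer(q, enc, enc)  # attend to the same memory...
    q = q + read                  # ...refine the query, then attend again
```

The second hop's query already contains whatever the first hop gathered, which is how information that needs two passes to extract (a branch's downstream destination, say) can surface.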
Self-attention: remembering the journey
Self-attention lets the sequence attend to itself. Every position in the decoded route can look at every other position. At step 25 of a 30-station journey, the model can directly attend to step 8 and see, with full fidelity, that it changed to the Central line at Bank.
This is temporal reasoning. The decoder is asking: "what have I already done on this journey?" With self-attention, the answer is precise and lossless. The model at step 25 has the same access to step 1 as it has to step 24. There is no degradation, no compression, no forgetting.
The pure Transformer decoder from part 6 had self-attention, and it achieved 96% per-token accuracy — dramatically better than the GRU's 72%. The temporal reasoning capability was real and powerful. But it interacted badly with the training procedure.
The mechanism is specific and worth being precise about. The training procedure that produces output diversity is scheduled sampling: at each step, with probability 0.5, the model sees its own (possibly wrong) prediction rather than the ground truth. In a GRU, this noise permanently corrupts the hidden state. The model can't ignore it — the corrupted state is the only state — so it must learn to recover from errors, and in doing so it learns to hedge across multiple plausible continuations.
Self-attention neutralises this training signal. If one token in the input sequence is corrupted, later positions can attend more strongly to the uncorrupted tokens and largely ignore the noise. The model never needs to learn recovery because it can always recover clean information from the majority of the sequence. The four failed fixes from part 6 — step-by-step scheduled sampling, token corruption, temperature scaling, min-loss — all demonstrated this: the Transformer routed around every form of noise we introduced.
The result, in every configuration we tested, was distribution collapse. The model concentrated all probability mass on a single route per OD pair, leaving beam search nothing to explore.
But this is a statement about the interaction between full self-attention and scheduled sampling, not a blanket claim about self-attention itself. It's possible that restricted forms of self-attention — windowed attention over only the last few positions, attention with heavy dropout, attention without residual connections — might provide some temporal access without fully neutralising the noise signal. A 5-position window would let the model see "I just passed through Baker Street and Bond Street, so I'm heading south on the Jubilee" without being able to reconstruct the entire route from step 1. Whether that limited context is enough to enable memorization, or whether it keeps the distribution soft enough for beam search, is an open question. None of these variants have been tested.
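The windowed variant is easy to express as an attention mask. The sketch below builds a boolean mask in the convention `nn.MultiheadAttention` uses for boolean `attn_mask` arguments (True means "may not attend"); the 5-position window matches the example above, and nothing here has been validated against scheduled sampling.

```python
import torch

def window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal attention mask restricted to the last `window` positions.

    True marks pairs that may NOT attend (the boolean-mask convention
    of nn.MultiheadAttention). Each query sees itself and at most
    window - 1 earlier positions.
    """
    i = torch.arange(seq_len).unsqueeze(1)   # query index
    j = torch.arange(seq_len).unsqueeze(0)   # key index
    visible = (j <= i) & (j > i - window)    # causal AND within the window
    return ~visible

m = window_mask(seq_len=8, window=5)
# position 7 can see positions 3..7, but not 0..2 and not the future
```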
The GRU: lossy temporal reasoning
The GRU is the third option, and in the context of this architecture, its role is best understood as a deliberately imperfect substitute for self-attention.
It answers the same question — "what have I already done on this journey?" — but through a bottleneck. At step 25, the GRU's hidden state contains some information about step 8, but degraded. The signal has passed through 17 gate updates, each of which blends it with newer information and loses some fidelity. The model at step 25 knows something about where it was at step 8 — probably which general area of the network, possibly which line — but not with the precision that self-attention would provide.
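A toy calculation makes the decay concrete. This is an illustration, not the GRU's actual gate dynamics: the retention factor is an assumption, chosen only to show the exponential shape.

```python
# Toy illustration: if each gate update retains a fraction r of the
# previous hidden state, the contribution of step 8 to step 25 has been
# blended 17 times and scaled by roughly r**17.
r = 0.8                  # assumed per-update retention, for illustration
decay = r ** 17
print(f"{decay:.4f}")    # ~0.0225: most of the step-8 signal is gone
```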
This lossy compression is, counterintuitively, the GRU's main contribution to the architecture. It is what makes scheduled sampling effective.
When the training label switches from route A to route B between epochs, the GRU's hidden state can't fully memorize either. Scheduled sampling adds further noise: at each step, with probability 0.5, the model sees its own (possibly wrong) prediction rather than the teacher's token. The hidden state carries a blurred representation of multiple routes — not any single one with perfect fidelity, but a distribution over several. This blurring is what beam search exploits. The probability mass spread across alternatives is exactly the entropy that beam search needs to explore different hypotheses.
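One decoding step of this training procedure can be sketched as follows. The module sizes are assumptions, and `decode_step` is a hypothetical helper, not the project's API; the point is the coin flip on the next input, and that a wrong prediction permanently enters the hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_stations, p = 64, 272, 0.5
embed = nn.Embedding(n_stations, d)
gru = nn.GRUCell(d, d)
head = nn.Linear(d, n_stations)

def decode_step(prev_token, hidden, teacher_token):
    hidden = gru(embed(prev_token), hidden)
    logits = head(hidden)
    pred = logits.argmax(-1)
    # Scheduled sampling: with probability 0.5 the next input is the
    # model's own (possibly wrong) prediction, not the teacher's token.
    # A wrong prediction corrupts `hidden` on every later step.
    next_token = pred if torch.rand(()) < p else teacher_token
    return logits, hidden, next_token

hidden = torch.zeros(1, d)
tok = torch.tensor([0])
logits, hidden, tok = decode_step(tok, hidden, torch.tensor([1]))
```

Because the corrupted state is the only state, the model's cheapest loss reduction is to spread probability across the continuations it might need to recover into, which is exactly the entropy beam search exploits.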
The GRU is not a better sequence model than self-attention. It is a worse one, in a way that interacts productively with the training procedure. The information bottleneck that makes it worse at per-step accuracy is the same mechanism that keeps scheduled sampling effective, which in turn preserves the output distribution that beam search converts into route predictions.
The architectural lineage
Framing the hybrid decoder this way reveals its place in a well-studied lineage.
The pointer mechanism was Bahdanau-style attention (2014) in its simplest form — the pattern that Vinyals et al. (2015) named Pointer Networks when the attention scores are used directly as output logits. Single-head, no value projections, purely a scoring mechanism.
Replacing the pointer with multi-head cross-attention moved to proper encoder-decoder attention as described in the Transformer paper (Vaswani et al., 2017) — the standard pattern where the decoder reads from the encoder at every step. But only the cross-attention component was taken, not the self-attention.
Stacking cross-attention layers followed the multi-hop attention pattern from Memory Networks (Sukhbaatar et al., 2015). Attend, refine the query, attend again. The same memory (encoder station embeddings) is read at each hop, but the query improves with each pass.
The overall pattern — a recurrent controller that queries structured memory via multi-hop attention — is also recognizable as a descendant of the Neural Turing Machine (Graves et al., 2014) and the Differentiable Neural Computer (Graves et al., 2016). Those architectures had three components: a recurrent controller (LSTM), an external memory bank, and differentiable read/write heads that addressed memory via content-based attention. The read head — content-based attention over memory slots returning a weighted sum — is cross-attention.
The parallel is worth noting because the DNC's original demonstration task was routing on the London Underground. An LSTM controller with external memory, learning to navigate the same graph. A decade later, the architecture has been refined — GATv2 embeddings as the memory, multi-hop cross-attention as the read mechanism, no write head — but the core idea is the same: a recurrent controller that queries structured memory to make sequential routing decisions.
The trajectory from NTM to DNC to Memory Networks to Transformers to this hybrid can be understood through one question: what happened to the write head?
The NTM and DNC needed a write mechanism because the LSTM controller had a bottleneck — it couldn't hold everything, so it offloaded information to external memory. Transformers eliminated the need for writes by making all previous positions directly accessible via self-attention. There is nothing to write because there is nothing to forget — the full history is always available.
The read mechanism, however, survived. Reading from a separate representation (the encoder output, a retrieval database, a knowledge graph) is a genuinely different operation from reading from your own history. Cross-attention in every encoder-decoder model since 2017 is the direct descendant of the NTM read head and the Memory Network hop.
In this hybrid, the memory is not a scratchpad. It is a fixed, structured representation of the physical graph. The model never writes to it. The GATv2 encoder produces it once, and the decoder reads from it at every step. The write head died because self-attention made it unnecessary; the read head survived as cross-attention; and the recurrent controller persists not because it's the best sequence model but because its lossy compression is what makes the diversity-preserving training signal work.
Enriching the temporal signal
The 2-layer cross-attention model is currently training on the full profile. But regardless of how that lands, the stratified breakdown from earlier experiments consistently shows the same pattern: accuracy degrades on longer routes, and the degradation is steeper than cross-attention alone can fix. The errors on 25+ station routes are not primarily spatial — the model can see the destination at every step — they're temporal. The GRU has lost track of which line it committed to fifteen steps ago.
More cross-attention layers won't help with this. They read the static graph, not the route history. A deeper or wider GRU might help marginally — more capacity in the recurrent state means slower information decay — but it doesn't change the fundamental exponential nature of information loss through recurrent updates.
Self-attention would fix the temporal deficit directly. But full self-attention, as tested extensively in part 6, neutralises the training signal that produces diversity. The restricted variants (windowed attention, dropout on attention weights, no residual connection) might thread the needle, but they're untested.
There's one idea that sidesteps the self-attention question entirely, which I want to test next.
Currently the GRU's input at each step is the token embedding of the previous station — a learned vector that says "the last station was Baker Street" but nothing about what the routing decision was or what the cross-attention found. The cross-attention output at each step carries much richer information: not just which station, but the full context of the graph read, the destination signal, the interchange assessment. That representation is projected to logits and then discarded. The GRU never sees it.
If the cross-attended representation were fed back as the GRU input instead of the token embedding:
```text
step t:   GRU(cross_attn_output_{t-1}, hidden_{t-1}) → cross_attend(encoder) → logits
step t+1: GRU(cross_attn_output_t,     hidden_t)     → cross_attend(encoder) → logits
```
then the recurrent state would accumulate spatial reasoning from every previous step. The hidden state at step 25 would carry a compressed version of 25 routing decisions — each informed by the full graph structure — rather than 25 token identities. It would still be lossy, still noisy under scheduled sampling, still subject to the GRU's compression dynamics. But the bandwidth of what's being compressed would be substantially higher.
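A minimal sketch of the proposed loop, with the same assumed dimensions as earlier and random stand-ins for the encoder output (this is the idea to be tested, not working project code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
enc = torch.randn(1, 272, d)               # GATv2 station embeddings
gru = nn.GRUCell(d, d)
xattn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
head = nn.Linear(d, 272)

hidden = torch.zeros(1, d)
feedback = torch.zeros(1, d)   # replaces the previous-token embedding

for step in range(5):
    hidden = gru(feedback, hidden)
    read, _ = xattn(hidden.unsqueeze(1), enc, enc)
    ctx = hidden + read.squeeze(1)         # cross-attended representation
    logits = head(ctx)                     # projected to station logits
    # Feed the rich representation back instead of discarding it.
    # Using ctx.detach() here would cut the longer gradient path into
    # the encoder; leaving it attached is the richer-signal option.
    feedback = ctx
```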
Whether this actually helps is genuinely uncertain. The risk is that richer GRU inputs might reduce the diversity-preserving noise. If the cross-attended representation is more informative than a token embedding, the GRU might memorize routes more precisely, narrowing the distribution. Scheduled sampling would still corrupt the input with probability 0.5, but the question is whether corrupting a rich representation produces more diverse recovery strategies or just better memorization of the clean signal.
There's also a gradient flow question. If the cross-attention output is fed back without detaching, gradients from step t+1 flow back through the GRU, through the cross-attention at step t, and into the encoder. This creates a longer gradient path but also a richer training signal: the encoder learns to produce embeddings that are useful not just for the current step's decision but for the next step's recurrent state.
This is the next experiment to run once the 2-layer results are in.
The spectrum
Viewed together, the mechanisms available to the decoder form a spectrum of temporal precision:
| Mechanism | Temporal access | Spatial access | Diversity impact |
|---|---|---|---|
| Full self-attention | Lossless | — | Collapse (observed) |
| GRU + token embedding | Lossy, exponential decay | — | Preserved (observed) |
| Cross-attention (graph) | — | Full per step | Neutral (observed) |
| GRU + cross-attn feedback | Lossy, richer signal | Full per step | Unknown |
| Windowed self-attention | Lossless (local), none (distant) | — | Unknown |
The first three rows are empirical: we have results for each. The last two rows are hypotheses.
The pattern across the observed results is specific: any mechanism that gives the decoder lossless access to its full history enables memorization of single routes and makes scheduled sampling ineffective. Any mechanism that limits temporal access — through recurrent compression, through the information bottleneck of the hidden state — keeps scheduled sampling effective and preserves the distribution that beam search needs.
The engineering challenge, going forward, is to push temporal reasoning as far as possible without crossing the point where scheduled sampling stops working. The cross-attention feedback loop is the most conservative next step because it enriches the GRU's input without changing the GRU's fundamental compression dynamics. Windowed self-attention is the more aggressive option — direct access to recent history — but with unknown diversity consequences.
Whether either of these moves the numbers on 25+ station routes is the experiment to run.
- Note: The code for the 2-layer cross-attention experiment is at lmmx/tubeulator-models#5.