Constraining the search

Adjacency masking, route equivalence, and telling the model about the graph

The station model's problem

The previous post left the station model in an awkward position. At 200 epochs on the full profile it reached 44% greedy exact match and 59% under beam search — respectable, but with a learning curve that had visibly plateaued. The per-token station accuracy was stuck at 90%, and the gap between greedy and beam was narrow enough to suggest the model wasn't learning meaningful alternatives.

The root cause here was architectural. At every decoding step, the pointer mechanism scored all 272 stations in the network. The model had to learn from raw data that after Euston, the only valid next stations are Warren Street, Mornington Crescent, Camden Town, and King's Cross St. Pancras — not the other 268. It spent most of its capacity rediscovering the adjacency structure of the graph, which is information we already have.

Adjacency masking

The fix is to give the decoder what the encoder already knows. The topology extraction already produces line_adj, a mapping from each station to its neighbors on each line. Flattening this across all lines produces a boolean matrix $A \in \{0, 1\}^{N \times N}$ where $A_{ij} = 1$ if station $j$ is adjacent to station $i$ on any line. For the London Underground this matrix has 966 nonzero entries (including self-loops) across 272 stations — an average of 3.6 neighbors per station.
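
A minimal sketch of the construction, assuming line_adj is a dict mapping station index → {line: [neighbor indices]} (the exact shape in the repo may differ):

```python
import torch

def build_adjacency_mask(line_adj: dict, n_stations: int) -> torch.Tensor:
    """Flatten per-line adjacency into one (N, N) boolean matrix."""
    adj = torch.zeros(n_stations, n_stations, dtype=torch.bool)
    for station, neighbors_by_line in line_adj.items():
        for line, neighbors in neighbors_by_line.items():
            for nbr in neighbors:
                adj[station, nbr] = True
    # Self-loops: the station sequence label starts with the origin itself,
    # so the first prediction must be allowed to stay put.
    adj |= torch.eye(n_stations, dtype=torch.bool)
    return adj
```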

At each decoding step, the pointer computes raw logits $\ell \in \mathbb{R}^N$ over all stations. Before the softmax, we mask:

$$\ell_j \leftarrow \begin{cases} \ell_j & \text{if } A_{\text{current}, j} = 1 \\ -10^4 & \text{otherwise} \end{cases}$$

The softmax then distributes probability over 3–4 candidates instead of 272. The decision at each step changes from "which of 272 stations comes next?" to the categorically easier "which of my neighbors do I walk to?"

The self-loops ($A_{ii} = 1$) exist because the station sequence label starts with the origin itself. Without them, the model's first prediction would mask out the correct token.
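
Applying the mask is then a single masked_fill before the softmax. Roughly, with illustrative names and shapes rather than the repo's actual API:

```python
import torch

def masked_pointer_logits(logits: torch.Tensor,
                          current: torch.Tensor,
                          adj: torch.Tensor,
                          fill: float = -1e4) -> torch.Tensor:
    """Restrict pointer logits to the neighbors of each element's current station.

    logits:  (batch, n_stations) raw pointer scores
    current: (batch,) index of the station the decoder is currently at
    adj:     (n_stations, n_stations) boolean adjacency with self-loops
    """
    allowed = adj[current]                      # (batch, n_stations)
    return logits.masked_fill(~allowed, fill)   # softmax now sees only 3-4 live candidates
```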

Three attempts to make it work

Getting the mask into training was harder than expected, and the sequence of failures is worth recounting because each one revealed a different interaction between the mask and the training pipeline.

Attempt 1: mask everything, always

The obvious implementation was to apply the mask during both training and inference.

This immediately produced NaN losses. The culprit was scheduled sampling: with $p = 0.5$, the model sometimes picks a wrong station, the current tracker jumps there, and the next teacher-forced label isn't adjacent to the wrong station. The correct token gets $-\infty$ and the cross-entropy explodes.

Attempt 2: mask at inference only

Restricting the mask to model.eval() mode avoided the training NaN but threw away most of the benefit. The model still had to learn adjacency from data during training; the mask only helped at decode time. The dev run showed 10.5% exact — marginally better than without the mask.

Attempt 3: mask with teacher-tracked position

The key insight was to decouple the masking position from the feedback token. During training, the mask always follows the teacher's position (the ground truth label), while the GRU feedback may use the model's own prediction (via scheduled sampling). This means the mask never goes off-track: the teacher sequence is always a valid walk on the graph, so the correct token at each step is always among the unmasked candidates.
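
In code the decoupling looks roughly like the following, reusing masked_pointer_logits from the sketch above; pointer, step_decoder, and criterion are hypothetical stand-ins for the real modules:

```python
import torch

def train_decode_steps(pointer, step_decoder, criterion,
                       encoder_out, h, target, adj, p_sample=0.5):
    """One training pass over a target station sequence (illustrative only).

    The adjacency mask is indexed by the teacher's previous station, so the
    correct token is always among the unmasked candidates; the GRU feedback
    may still be the model's own (possibly wrong) prediction under
    scheduled sampling.
    """
    loss = 0.0
    for t in range(1, target.size(1)):
        teacher_prev = target[:, t - 1]          # mask follows the teacher's walk
        logits = masked_pointer_logits(pointer(h, encoder_out), teacher_prev, adj)
        loss = loss + criterion(logits, target[:, t])

        # Scheduled sampling affects only the feedback path, never the mask.
        use_model = torch.rand(target.size(0), device=target.device) < p_sample
        feedback = torch.where(use_model, logits.argmax(-1), target[:, t])
        h = step_decoder(h, feedback)
    return loss
```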

At inference, when there are no labels, current simply follows the model's own predictions, which are now constrained to be adjacent at every step.

Incompatible with label smoothing

The third attempt still produced inflated losses, this time caused by label smoothing.

With smoothing parameter $\alpha = 0.1$, the target distribution assigns $\alpha/(N-1)$ probability to every non-target class. For the ~268 masked-out stations with logits of $-10^4$, the model can never match that target probability, producing a per-token penalty of roughly $268 \times \frac{\alpha}{N-1} \times 10^4$ — hundreds of nats of irrecoverable loss per step.
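
Plugging in the Underground's numbers ($N = 272$, $\alpha = 0.1$) gives a sense of the scale:

$$268 \times \frac{0.1}{271} \times 10^4 \approx 990 \text{ nats per token}$$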

The fix was to set label smoothing to zero for the station model. The adjacency mask is a much stronger regulariser than label smoothing: it constrains the output space to 3–4 options per step, making overconfidence on the correct answer not just acceptable but desirable.

A note on numerical precision

The fill value for masked logits likewise went through a few stages of development. The natural choice of $-\infty$ crashed torch.compile under AMP, where the logits are cast to float16. Switching to $-10^9$ overflowed the float16 range ($\text{float16}_{\max} \approx 65504$). The working value I settled on was $-10^4$: comfortably within float16 range, large enough that $\exp(-10^4) \approx 0$ in any precision, and gradient-safe during backpropagation.
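
A quick check of the ranges involved (a sanity-check sketch, not code from the repo):

```python
import torch

print(torch.finfo(torch.float16).max)                   # 65504.0
print(torch.tensor(-1e9, dtype=torch.float16))          # tensor(-inf, dtype=torch.float16)
print(torch.tensor(-1e4, dtype=torch.float16))          # tensor(-10000., dtype=torch.float16)
print(torch.softmax(torch.tensor([0.0, -1e4]), dim=0))  # tensor([1., 0.])
```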

Route equivalence

A separate change in the same PR addressed the measurement side.

The London Underground has extensive multi-line shared tracks: District, Circle, and Hammersmith & City all run between Aldgate East and Paddington. If a route takes the District line from Whitechapel to Paddington, there exist equivalent routes taking the Circle or H&C over exactly the same stations. Previously these were treated as different routes with different labels, meaning the model could predict the physically correct journey and score zero because the evaluation sampled a different line assignment.

(I'm not sure how you'd want to treat separate services like the DLR, which shares stations with the Underground at places like West Ham, but I'll cross that bridge when we get to it...)

The route enumerator now expands each route into its line-equivalent variants. For each leg, equivalent_lines_for_leg computes the intersection of lines serving every edge in that leg's station sequence:

$$L_{\text{equiv}}(\text{leg}) = \bigcap_{(s_i, s_{i+1}) \in \text{leg}} L(s_i, s_{i+1})$$

where $L(s_i, s_{i+1})$ is the set of lines serving the directed edge from $s_i$ to $s_{i+1}$. The Cartesian product across legs gives all valid line assignments for the route.
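
A sketch of the idea, assuming each leg is a list of station indices and an edge → lines lookup is available; the names and shapes are illustrative, with equivalent_lines_for_leg being the only identifier taken from the repo:

```python
from itertools import product

def equivalent_lines_for_leg(stations: list[int],
                             edge_lines: dict[tuple[int, int], set[str]]) -> set[str]:
    """Lines that serve every directed edge in the leg's station sequence."""
    line_sets = (edge_lines[edge] for edge in zip(stations, stations[1:]))
    return set.intersection(*line_sets)

def expand_route_variants(legs: list[list[int]], edge_lines) -> list[tuple[str, ...]]:
    """Cartesian product of per-leg equivalent lines = every valid line assignment."""
    per_leg = [equivalent_lines_for_leg(leg, edge_lines) for leg in legs]
    return list(product(*per_leg))
```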

The route's signature (the hashable method used for deduplication) is now simply the station sequence, since two routes visiting the same stations in the same order are the same physical journey regardless of which shared-track line they name.

This expansion added roughly 10% more label variants to the dataset, concentrated on the central section of the network where track sharing is densest.

Beam search under the mask

The adjacency mask transforms beam search economics. Without it, each decoding step branches over 272 candidates, and beam width 5 barely scratches the surface. With it, each step branches over 3–4 neighbors. Beam width 20 now explores the full branching factor at most stations, and the computational cost is lower than beam width 5 was before the mask — the topk operation over $20 \times 4 = 80$ candidates per step is trivial compared to $5 \times 272 = 1360$.
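
One masked beam-expansion step looks roughly like this, again reusing masked_pointer_logits; a sketch of the principle, not the project's actual beam search:

```python
import torch

def beam_expand(scores, logits, current, adj, width=20):
    """Expand one decoding step for a masked beam (illustrative shapes).

    scores:  (width,) running log-probabilities of the live hypotheses
    logits:  (width, n_stations) pointer logits for each hypothesis
    current: (width,) current station per hypothesis
    """
    logits = masked_pointer_logits(logits, current, adj)   # non-neighbors get -1e4
    totals = scores.unsqueeze(1) + torch.log_softmax(logits, dim=-1)
    top = torch.topk(totals.flatten(), k=width)            # masked entries effectively never win
    beam_idx = top.indices // logits.size(1)
    station_idx = top.indices % logits.size(1)
    return top.values, beam_idx, station_idx
```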

The full profile sets beam_width = 20 for the station model specifically. The line and change models keep width 5, since their per-step vocabularies (11 lines × 2 directions) are already small.

Checkpointing on the right metric

The logs from the previous full run revealed that the station model's validation loss and exact match accuracy diverged after epoch 88: loss rose while accuracy kept climbing. This meant checkpointing on best loss was saving an inferior model!

For the station model, the training loop now checkpoints on best greedy exact match rather than best validation loss. This is computed every epoch for free (no beam search required) and tracks what we actually care about. The line and change models continue to checkpoint on loss, where the two metrics stay aligned.
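
The change itself is small; a sketch of the pattern, with hypothetical metric names:

```python
import torch

def maybe_checkpoint(model, metrics, best, path, key="greedy_exact_match"):
    """Save whenever the tracked metric improves (higher is better here)."""
    if metrics[key] > best.get(key, float("-inf")):
        best[key] = metrics[key]
        torch.save(model.state_dict(), path)
    return best
```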

Results

The full profile run completed in 57 minutes over 200 epochs, with d_model=512, 6 encoder layers, and beam width 20:

|                    | Before mask | After mask |
|--------------------|-------------|------------|
| Greedy exact       | 44%         | 69.3%      |
| Beam (k=5 → k=20)  | 59%         | 81.0%      |
| Station accuracy   | ~90%        | 71.9%      |
| Valid routes       | 100%        | 100%       |

The greedy–beam gap narrowed slightly, from 15 points to 12, but at a much higher baseline. The model is now finding an exactly correct route 81% of the time within its top 20 hypotheses.

The per-token station accuracy appears to drop (90% → 72%), but these numbers aren't comparable.

The old 90% was accuracy over 272 candidates where the model had learned to eliminate most of them. The new 72% is accuracy over 3–4 candidates where every option is a plausible next station. The task per token is harder in the masked regime — the easy eliminations are done for free by the mask, and what remains is the genuine routing decision.

The dev run confirmed the mask's effect in isolation: at d_model=128 with just 20 epochs, the masked model hit 18.2% greedy and 28.2% beam, versus 10.5% and 10.8% without. The beam gap going from essentially zero to 10 points confirmed that beam search was finally exploring meaningful alternatives rather than wasting hypotheses on topologically impossible paths.

The change model's full run told a parallel story. At the final epoch it reached 56.6% greedy and 66.7% beam — up from 50.2% in part 2's single-route evaluation, and now measured against the corrected multi-route eval. The per-head accuracies barely moved (line 83.5%, direction 91.0%, station 67.3%), confirming the improvement comes from route equivalence expanding what counts as a correct prediction rather than from the model learning anything new.

But the saved checkpoint only captured 37.7% exact match, because the change model exhibits the same loss/accuracy divergence as the station model: validation loss bottoms out early while accuracy keeps climbing. The exact-match checkpointing fix currently only applies to the station model. Generalising it is the obvious next step.

Taking stock

Four posts in, these three models have had unequal amounts of attention, and it's worth stepping back to look at them side by side.

The station model has received the most architectural investment (adjacency masking, the teacher-tracked position fix, label smoothing removal, checkpointing on exact match), and it shows. 69.3% greedy and 81% beam on the full profile is a strong result for a 50-step autoregressive sequence model over a transport graph, but the headline number flatters it slightly: evaluation is against all valid routes per OD pair (2.6 on average), so the model has multiple targets to hit rather than one.

The change model has been comparatively neglected. Its full-profile run hit 56.6% greedy and 66.7% beam, but the saved checkpoint only captured 37.7% because the exact-match checkpointing fix hasn't been applied to it yet. That 66.7% is the number that matters, and it was achieved without any of the architectural tricks that transformed the station model. The change model is predicting 3–4 autoregressive steps where the station model is predicting 30–50. It doesn't need adjacency masking to avoid topological nonsense because the per-step vocabulary is already small. It just needs its checkpoint saved at the right epoch.

The line model hasn't been run on the full profile since part 2's single-route evaluation, so its numbers aren't directly comparable to anything. The dev-profile results (36.1% greedy, 51.3% beam) are consistent with the other models at that scale, but until it gets a full-profile run with the corrected multi-route evaluation there's not much to say.

What strikes me looking at these together is that the change model is doing the strategically important work. Its output of (line, direction, interchange station) per leg is nearly sufficient to reconstruct full station sequences deterministically. If it predicts (Northern, southbound, Embankment) → (District, westbound, Victoria), the stations between origin and Embankment on the Northern line are recoverable by walking line_adj. Between two known stations on a known line, the path is almost always unique.
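
A sketch of that reconstruction, reusing the station → {line: neighbors} shape assumed for line_adj earlier:

```python
from collections import deque

def stations_between(line: str, start: int, end: int, line_adj: dict):
    """Shortest walk from start to end using only edges of a single line (sketch)."""
    queue, prev = deque([start]), {start: None}
    while queue:
        station = queue.popleft()
        if station == end:
            path = []
            while station is not None:        # walk the predecessor chain back to start
                path.append(station)
                station = prev[station]
            return path[::-1]
        for nbr in line_adj.get(station, {}).get(line, []):
            if nbr not in prev:
                prev[nbr] = station
                queue.append(nbr)
    return None                                # not reachable on this line
```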

This means the change model at 3–4 autoregressive steps sidesteps the compounding error that is the station model's fundamental ceiling. A 4-step sequence with 90% per-step accuracy gives 65% exact match. A 30-step sequence with 97% per-step accuracy gives 40%. The maths overwhelmingly favours the shorter sequence, and no amount of scaling the station model changes that arithmetic.
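
Written out, with exact match approximated as per-step accuracy raised to the sequence length:

$$0.90^{4} \approx 0.66 \qquad \text{vs.} \qquad 0.97^{30} \approx 0.40$$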

There's also an obvious masking intervention available for the change model that I haven't tried. The station_head in InterchangeDecoder currently scores all 272 stations, but the valid interchange station for a given leg must be on the predicted line. The topology already has station_lines; building a (n_lines, n_stations) boolean mask and applying it after the line prediction would constrain the station head to ~15–30 candidates instead of 272. This is the same principle as adjacency masking: give the decoder information we already have rather than making it rediscover it from data.
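
A sketch of what that mask could look like, assuming station_lines maps station index → set of line names (illustrative, not implemented):

```python
import torch

def build_line_station_mask(station_lines: dict[int, set[str]],
                            line_index: dict[str, int],
                            n_stations: int) -> torch.Tensor:
    """(n_lines, n_stations) boolean mask: True where the station is served by the line."""
    mask = torch.zeros(len(line_index), n_stations, dtype=torch.bool)
    for station, lines in station_lines.items():
        for line in lines:
            mask[line_index[line], station] = True
    return mask

# After the line head predicts a line index for the leg:
#   station_logits = station_logits.masked_fill(~mask[predicted_line], -1e4)
```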

One other thing I want to revisit is the GRU feedback path. All three decoders currently call self.gru(h, h), passing the hidden state as both input and hidden state, then add the feedback embedding via residual connection after the GRU step. The GRU cell's signature is GRUCell(input, hidden), and by giving it the same tensor for both, the reset and update gates never see what the model just predicted. The feedback is added too late for the gating mechanism to use it. Feeding the previous step's feedback embedding as the GRU input and the recurrent state as hidden would let the gates condition on the last prediction directly. This is a one-line change per decoder and might improve the model's ability to maintain coherent multi-step plans, which is precisely the kind of sequential reasoning that's weakest right now.
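
A minimal sketch of the proposed change, with hypothetical module and argument names:

```python
import torch
from torch import nn

class DecoderStep(nn.Module):
    def __init__(self, d_model: int, n_tokens: int):
        super().__init__()
        self.gru = nn.GRUCell(d_model, d_model)
        self.feedback_embedding = nn.Embedding(n_tokens, d_model)

    def forward(self, h: torch.Tensor, prev_prediction: torch.Tensor) -> torch.Tensor:
        # Current pattern (per the post): self.gru(h, h), with the feedback
        # embedding added afterwards, so the gates never see the prediction.
        # Proposed: feedback embedding as the GRU input, recurrent state as
        # hidden, so the reset/update gates condition on the last prediction.
        return self.gru(self.feedback_embedding(prev_prediction), h)
```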