The diagnosis
The hybrid decoder from part 7 was the best sequence model we tested. Two layers of cross-attention, a GRU backbone that preserved diversity, adjacency masking that constrained the output to valid moves. At full scale it reached 57.4% beam accuracy — meaning beam search found the exact ground-truth route for more than half of all origin–destination pairs.
But the stratified breakdown told a consistent story across every architecture we tried. Short routes (2–5 stations) were largely solved. Medium routes (6–10) were competitive. Long routes (21+ stations) collapsed.
The failure mode was always the same: compounding error in autoregressive decoding. At each step the model predicts a station, feeds that prediction back as input to the next step, and any mistake propagates. On a 30-station route, even 95% per-step accuracy gives only about a 21% chance of a perfect sequence ($0.95^{30} \approx 0.21$). The decoder wasn't failing to understand the graph. It was failing to survive its own mistakes over long sequences.
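The arithmetic behind that claim is easy to check — a quick sketch of per-step survival probability:

```python
# Probability that an autoregressive decoder emits a perfect sequence
# when each step is independently correct with probability p.
def perfect_sequence_prob(p: float, steps: int) -> float:
    return p ** steps

# 95% per-step accuracy over a 30-station route:
print(round(perfect_sequence_prob(0.95, 30), 3))  # 0.215
```

At 40 stations the same per-step accuracy drops the whole-sequence probability to roughly 13%, which is why the long buckets collapse first.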
Two independent analyses converged on the same prescription. First, the iterative encoder: replace the stacked GATv2 layers (each with independent weights) with a single weight-tied layer run K times, giving a K-hop receptive field with a Bellman–Ford flavour. Second, the next-hop policy: stop predicting entire sequences and instead predict one hop at a time.
The first change is mechanical — ten lines of code, no change to the training loop. The second is a fundamental reformulation of what the model is learning.
Reformulating the problem
The autoregressive station decoder learned:

$$P(s_1, \dots, s_T \mid \text{origin}, \text{destination}) = \prod_{t=1}^{T} P\big(s_t \mid s_1, \dots, s_{t-1}, \text{origin}, \text{destination}\big)$$

which requires generating a variable-length sequence where every token depends on every previous token. The output space is enormous — all possible station sequences up to length 40 — and the model must navigate it autoregressively, one token at a time.
The next-hop policy learns:

$$\pi\big(s_{\text{next}} \mid s_{\text{current}}, s_{\text{destination}}\big)$$

which is a single classification over adjacent stations. There is no sequence. There is no autoregression. There is no compounding error. At each station the model independently decides which neighbour to visit next.
This matches the structure of classical routing algorithms. Bellman–Ford, Dijkstra, and A* all make local decisions — "from here, which edge should I take?" — and the global route emerges from chaining those decisions together. The next-hop policy is a learned version of the same idea.
The dataset transformation is straightforward. Each training route $(s_1, s_2, \dots, s_T)$ with destination $s_T$ decomposes into $T - 1$ independent training examples:

| Current | Destination | Target |
|---|---|---|
| $s_1$ | $s_T$ | $s_2$ |
| $s_2$ | $s_T$ | $s_3$ |
| … | … | … |
| $s_{T-1}$ | $s_T$ | $s_T$ |
The 73,712 OD pairs with their enumerated routes expand to approximately 1.5 million next-hop training samples. Each is a flat classification: given the encoder's embedding of the current station and the destination, predict which adjacent station to move to.
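A minimal sketch of the decomposition, assuming routes are stored as lists of station IDs (the function name is illustrative, not the project's actual schema):

```python
def route_to_next_hop_samples(route):
    """Decompose a route [s1, ..., sT] into (current, destination, target) triples."""
    dest = route[-1]
    return [(route[i], dest, route[i + 1]) for i in range(len(route) - 1)]

samples = route_to_next_hop_samples(["A", "B", "C", "D"])
# → [("A", "D", "B"), ("B", "D", "C"), ("C", "D", "D")]
```

Note the final sample targets the destination itself, so the policy also learns to recognise arrival.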
The decoder is trivially simple — a 3-layer MLP that takes the concatenation of $h_{\text{current}}$ and $h_{\text{dest}}$ and produces logits over all stations, masked by the adjacency matrix. No GRU, no attention, no sequence modelling of any kind. All the intelligence lives in the encoder, which must propagate enough information through the graph for the MLP to make good local decisions.
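A sketch of such a decoder, assuming the encoder produces per-station embeddings of size `d` (layer widths are illustrative, not the project's exact configuration):

```python
import torch
import torch.nn as nn

class NextHopDecoder(nn.Module):
    """3-layer MLP: concat(current, dest) embeddings -> adjacency-masked logits."""
    def __init__(self, d: int, n_stations: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, n_stations),
        )

    def forward(self, h_current, h_dest, adj_mask):
        logits = self.mlp(torch.cat([h_current, h_dest], dim=-1))
        # Non-adjacent stations are physically impossible moves: hard mask.
        return logits.masked_fill(~adj_mask, float("-inf"))
```

The adjacency mask guarantees that even an untrained model can only propose valid moves.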
The iterative encoder
The encoder change landed first because it was a prerequisite for both the autoregressive model and the next-hop policy.
The original GATv2 encoder stacked 3 independent layers, each with its own weights. Layer 1 aggregated 1-hop neighbours, layer 2 aggregated 2-hop, layer 3 aggregated 3-hop. On a graph where some routes span 40+ stations, a 3-hop receptive field is inadequate. Stations at opposite ends of the network are invisible to each other.
The iterative encoder replaces the stack with a single GATv2Conv layer run $K$ times with shared weights:
```python
# self.conv is the single shared GATv2Conv layer, reused every iteration
for _ in range(self.n_iters):
    h_res = h
    h = self.conv(h, edge_index, edge_attr=edge_attr)
    h = self.norm(h + h_res)  # residual keeps local information intact
    h = torch.relu(h)
    h = self.dropout(h)
```
With $K = 12$, every station's embedding incorporates information from 12 hops away. Weight tying means this adds almost no parameters — the single shared conv layer is reused, so 12 iterations costs the same parameter budget as 1 layer, with 12× the receptive field. The trade-off is compute: 12 forward passes through the same layer, which is linear in $K$ but cheap per iteration on a 272-node graph.
The Bellman–Ford analogy is precise. In classical shortest-path algorithms, $K$ iterations of relaxation propagate distance information $K$ hops. Here, $K$ iterations of message passing propagate learned representations $K$ hops. The residual connection ensures that local information is preserved as global information accumulates.
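The classical counterpart, for comparison — after $K$ rounds of relaxation, Bellman–Ford has correct distances only for nodes within $K$ hops of the source (a standard textbook sketch, not project code):

```python
def bellman_ford(n_nodes, edges, source, n_iters):
    """edges: list of (u, v, weight). Returns distances after n_iters relaxation rounds."""
    INF = float("inf")
    dist = [INF] * n_nodes
    dist[source] = 0.0
    for _ in range(n_iters):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

# Path graph 0-1-2-3, edges listed in worst-case order so each round
# propagates exactly one hop: after 2 rounds, node 3 is still unreached.
edges = [(2, 3, 1.0), (1, 2, 1.0), (0, 1, 1.0)]
print(bellman_ford(4, edges, source=0, n_iters=2))  # [0.0, 1.0, 2.0, inf]
```

This is the encoder's situation exactly: with too few iterations, distant stations simply have no information about each other.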
First results
The next-hop model trained fast. With the encoder doing all the heavy lifting and the decoder reduced to a 3-layer MLP, the entire model is small and the training signal is dense — every hop in every route is an independent supervised example.
The first dev run (128-dimensional embeddings, 20 epochs, 2 minutes):
| Bucket | Success |
|---|---|
| 2–5 st | 82% |
| 6–10 st | 89% |
| 11–20 st | 72% |
| 21–30 st | 44% |
| 31–50 st | 10% |
68.9% overall success, with a length ratio of 1.09 — routes that reached the destination were only 9% longer than the training data's best. The 6–10 bucket outperforming 2–5 is an artefact of redundancy: short routes have fewer chances to recover from a wrong hop, while medium routes have enough network redundancy that a suboptimal hop can still find an alternative path.
The pattern was immediately legible. Short and medium routes were mostly solved. Long routes failed because the encoder's receptive field (initially misconfigured at 6 iterations, later fixed to 12) couldn't propagate destination information far enough for 30+ hop decisions.
Scaling
The encoder iteration count turned out to be the dominant hyperparameter. Not model width, not learning rate, not decoder depth — the number of message-passing iterations.
| n_enc_layers | Dev success | Dev time |
|---|---|---|
| 5 | 87.2% | 3m12s |
| 8 | 95.3% | 4m42s |
| 10 | 98.9% | 5m42s |
| 12 | 100% | 6m40s |
At 12 iterations, the dev model achieved 100% success across all route-length buckets. This confirmed the diagnosis: the binding constraint had been receptive field, not model capacity or training signal.
The full model (512-dimensional, 80 epochs) reached 98.2% with greedy rollout — slightly lower than the dev model because the larger model's sharper probability distributions gave less room for error recovery. This was a hint of what would become important later.
Beam search
The greedy rollout takes the single highest-probability neighbour at each step and commits to it. If that hop is wrong, the route is stuck on a suboptimal path with no way to backtrack.
Beam search keeps $K$ parallel routes alive simultaneously. At each step, every active route expands to all its valid neighbours, scored by cumulative log-probability. Only the top $K$ candidates survive. The moment any route reaches the destination, it's stored. Among multiple completions, the highest-scored one is returned.
The implementation is batched: all examples and their $K$ beams step forward together in a single tensor operation per step, with per-example visited masks and early termination.
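A simplified, unbatched version of the rollout conveys the idea (the real implementation vectorises this; here the policy is a stand-in function returning a log-probability per move, and revisits are skipped outright rather than soft-penalised):

```python
import math

def beam_search(start, dest, neighbours, log_prob, beam_width=5, max_steps=40):
    """neighbours: dict station -> adjacent stations.
    log_prob(path, nxt, dest): log-probability of moving to nxt."""
    beams = [(0.0, [start])]  # (cumulative log-prob, path)
    completed = []
    for _ in range(max_steps):
        candidates = []
        for score, path in beams:
            for nxt in neighbours[path[-1]]:
                if nxt in path:  # skip revisits in this sketch
                    continue
                new = (score + log_prob(path, nxt, dest), path + [nxt])
                (completed if nxt == dest else candidates).append(new)
        beams = sorted(candidates, reverse=True)[:beam_width]
        if not beams:
            break
    return max(completed)[1] if completed else None

neighbours = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
uniform = lambda path, nxt, dest: math.log(0.5)
route = beam_search("A", "D", neighbours, uniform)
```

Completed routes leave the beam immediately, so the width is spent entirely on unfinished alternatives.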
The effect was dramatic.
| Model | Greedy success | Beam success |
|---|---|---|
| Dev (128d) | 94.2% | 100% |
| Full (512d) | 98.2% | 100% |
The full model, which had been slightly worse than the dev model under greedy rollout due to overconfident distributions, now matched it. Beam search compensated for confidence — the dev model's softer distributions gave beams natural diversity, while the full model needed beam search to explore alternatives its greedy policy wouldn't consider.
With beam width 5, every route-length bucket hit 100% on both model sizes. The length ratio (hop count relative to best training route) dropped to 0.99 — the model was finding routes shorter than any in its training data, because the beam explored paths the constrained Dijkstra enumeration had never generated.
The value head
Success rate was solved. The next question was route quality.
A value head predicts the remaining cost to the destination from any given station. It's a second output of the decoder MLP, trained alongside the policy via MSE loss:
$$\mathcal{L}_{\text{value}} = \big(\hat{v}(s, d) - v^{*}(s, d)\big)^2$$

where $\hat{v}(s, d)$ is the predicted remaining cost from station $s$ to destination $d$, and $v^{*}(s, d)$ is the remaining travel time in minutes along the training route.
During beam search, the value estimate augments the log-probability scoring. A beam at station $s$ with cumulative log-probability $\log p$ and value estimate $\hat{v}(s, d)$ gets an A*-style score:

$$\text{score} = \log p - \alpha\,\hat{v}(s, d)$$

where $\alpha = 0.1$. Lower estimated remaining cost means closer to the destination, which means a higher score. This steers beams toward the destination more efficiently, improving both success rate and route quality.
The value head added 33K parameters to the dev model and 525K to the full model — negligible overhead.
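As a sketch, the value head can share the decoder's trunk and branch into a scalar output (sizes and names are illustrative, not the project's exact architecture):

```python
import torch
import torch.nn as nn

class PolicyValueDecoder(nn.Module):
    """Shared MLP trunk with a policy head (next-hop logits) and a value head
    (estimated remaining travel time in minutes)."""
    def __init__(self, d: int, n_stations: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.policy_head = nn.Linear(d, n_stations)
        self.value_head = nn.Linear(d, 1)

    def forward(self, h_current, h_dest):
        z = self.trunk(torch.cat([h_current, h_dest], dim=-1))
        return self.policy_head(z), self.value_head(z).squeeze(-1)
```

Sharing the trunk is what keeps the parameter overhead negligible: only the final linear projection is new.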
Travel time
Up to this point, the model was optimised for hop count. The training routes came from Dijkstra with travel-time weights, so the routes were time-optimal, but the loss function and the value head both treated every hop equally. A route with 10 slow edges (120 seconds each) scored the same as a route with 10 fast edges (60 seconds each).
Switching the value head's target from remaining hop count to remaining travel time in minutes was straightforward — the route data already contained cumulative travel times from the Dijkstra enumeration. The policy loss remained cross-entropy against the next-hop target, unchanged.
Separately, adding a proper baseline comparison required computing the true unconstrained shortest travel time between all station pairs. Floyd–Warshall on 272 nodes runs in milliseconds and produces the all-pairs shortest time matrix — the ground truth against which the model's route quality can be measured.
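Floyd–Warshall itself is a few lines — a standard sketch over an adjacency matrix of travel times, with `inf` for missing edges (toy numbers, not the real network):

```python
def floyd_warshall(times):
    """All-pairs shortest travel times. times[i][j]: direct edge cost, or inf."""
    n = len(times)
    dist = [row[:] for row in times]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

INF = float("inf")
times = [
    [0, 60, INF],
    [60, 0, 120],
    [INF, 120, 0],
]
print(floyd_warshall(times)[0][2])  # 180
```

The $O(n^3)$ cost is $272^3 \approx 20$ million relaxations — milliseconds, as stated, and it only needs to run once.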
The Dijkstra ratio (model's travel time divided by Floyd–Warshall optimal) replaced the hop-count length ratio as the primary quality metric.
Masking correctness
The beam search rollout exposed a class of bugs that hadn't been visible with greedy rollout.
The adjacency mask and the visited mask must be applied in the correct order. The adjacency mask is a hard constraint — non-adjacent stations are physically impossible moves. The visited mask is a soft preference — revisiting a station is discouraged but allowed as a backtrack.
The original code applied them as:

```python
logits = logits.masked_fill(~adj_mask, float("-inf"))  # hard
logits = logits.masked_fill(visited, -1e4)             # soft
```

If a station is both non-adjacent and visited, the second line overwrites `-inf` with `-1e4`, promoting an impossible move to merely discouraged.
The fix is to apply the hard constraint last:

```python
logits = logits.masked_fill(visited, -1e4)             # soft
logits = logits.masked_fill(~adj_mask, float("-inf"))  # hard
```
This is the kind of bug that only manifests when beam search runs long enough for all adjacent unvisited neighbours to be exhausted — which happens on long routes where beams get stuck. Under greedy rollout, it was invisible because the highest-probability adjacent station was always chosen.
The general principle: hard constraints must always be applied after soft penalties, so they can never be overwritten.
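A minimal reproduction of the ordering bug, with hypothetical masks over four stations:

```python
import torch

logits = torch.zeros(4)
adj_mask = torch.tensor([True, True, False, False])  # stations 2, 3 not adjacent
visited = torch.tensor([False, True, False, True])   # stations 1, 3 already visited

# Buggy order: the soft penalty overwrites the hard -inf on station 3.
buggy = logits.masked_fill(~adj_mask, float("-inf")).masked_fill(visited, -1e4)
# Fixed order: the hard constraint is applied last and can never be overwritten.
fixed = logits.masked_fill(visited, -1e4).masked_fill(~adj_mask, float("-inf"))

print(buggy)  # station 3 is -1e4: an impossible move became merely discouraged
print(fixed)  # station 3 stays -inf
```

Station 3 — both non-adjacent and visited — is exactly the case the greedy rollout never exercised.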
The Dijkstra gap
With all corrections in place, the final numbers:
| | Dev (128d, 15 layers) | Full (512d, 12 layers) |
|---|---|---|
| Success | 100% | 100% |
| Length ratio | 0.99 | 0.99 |
| vs Dijkstra | 1.05 | 1.06 |
| Training time | 7m45s | 13m00s |
| Parameters | 644K | 9.5M |
The model finds the destination every time and takes routes within 5–6% of the unconstrained shortest travel time.
The stratified Dijkstra ratio tells a revealing story:
| Bucket | Dev | Full |
|---|---|---|
| 2–5 st | 1.04 | 1.03 |
| 6–10 st | 1.05 | 1.04 |
| 11–20 st | 1.06 | 1.04 |
| 21–30 st | 1.05 | 1.04 |
| 31–50 st | 1.05 | 1.03 |
The gap is remarkably uniform — approximately 4% across every route length. This rules out planning depth as the cause. If the model struggled with long-range planning, the gap would widen with route length. Instead, the gap is per-hop: the model occasionally picks a slightly slower neighbour.
Step accuracy is 90%, meaning 1 in 10 hops diverges from the Dijkstra-optimal choice. Most of these divergences are minor — a 120-second edge instead of a 60-second edge — and the beam usually still reaches the destination via a competitive alternative. But those small per-hop penalties accumulate to the 4% overhead.
Reducing the transfer penalty and increasing the maximum transfer count in the route enumeration narrowed the gap from 6% to 4%, confirming that part of the overhead came from training on routes that were optimal under constraints the model doesn't face at inference time. The remaining 4% appears to be the imitation learning ceiling — the gap between what the training data demonstrates and what the neural network actually internalises.
What changed
The progression from the autoregressive station decoder to the next-hop policy with beam search and value head was a series of simplifications, not complications.
The sequence decoder had a GRU backbone, cross-attention layers, scheduled sampling, a pointer mechanism, teacher forcing, and beam search over token sequences. The next-hop policy has an MLP and an adjacency mask. The encoder is the same (improved with weight tying). Beam search moved from token-level to rollout-level.
The fundamental change was the question being asked. "What is the complete route?" is a sequence prediction problem with compounding error. "What is the next hop?" is a classification problem with independent predictions. The entire apparatus of autoregressive decoding — hidden states, scheduled sampling, diversity preservation — existed to manage compounding error. Remove compounding error and you remove the need to manage it.
The DNC parallel from part 7 is worth revisiting. The Differentiable Neural Computer used an LSTM controller with external memory to navigate the London Underground — a recurrent model querying structured memory at each step. The next-hop policy uses a GATv2 encoder with an MLP decoder — a graph neural network querying its own node embeddings at each step. The recurrent controller is gone. The write head is gone. The read head (cross-attention) is implicit in the encoder's message passing rather than explicit in the decoder. What remains is the core insight: route the decision through the graph structure, not through a sequential bottleneck.
On the same 272-station graph, in 13 minutes of training, with 100% success and 4% travel time overhead against the mathematically optimal shortest path.