The 4% plateau
The next-hop policy from part 8 solved the routing problem in the sense that mattered most: 100% success rate, every origin–destination pair, every route-length bucket. Routes were within 4–6% of the mathematically optimal shortest travel time as computed by Floyd–Warshall.
The stratified Dijkstra ratio was remarkably uniform — approximately 1.04 whether the route was 5 stations or 40. That uniformity was the diagnostic clue. If the model struggled with long-range planning, the gap would widen with route length. It didn't, which meant the overhead was per-hop: at roughly 1 in 10 junctions, the model picked a neighbour that was slightly slower than the best available.
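The per-hop interpretation can be sanity-checked with back-of-envelope arithmetic. The numbers below are hypothetical, chosen only to reproduce the ~4% figure; the point is that a fixed per-hop error rate yields an overhead that is independent of route length:

```python
# Illustrative check that a per-hop error rate produces a length-independent
# overhead. All numbers are assumed, not measured.
hops = 40                 # route length in hops
avg_edge = 120            # seconds per hop on the optimal route (assumed)
mistake_rate = 0.10       # ~1 in 10 junctions picks a slower neighbour
penalty = 48              # extra seconds per suboptimal pick (assumed)

optimal = hops * avg_edge
actual = optimal + hops * mistake_rate * penalty
print(actual / optimal)   # ratio does not depend on `hops`: both scale linearly
```

Doubling `hops` doubles both the optimal time and the accumulated penalty, so the ratio stays flat across route-length buckets, exactly the observed pattern.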
The question was why.
What cross-entropy actually teaches
The training signal for the next-hop policy was standard supervised classification. Each training example was a triple — current station, destination, target next station — where the target came from a Dijkstra-enumerated shortest path. The loss was cross-entropy against that single target.
Cross-entropy treats every wrong answer equally. At a junction with three neighbours, if A is optimal and B is one second slower while C is five minutes slower, the loss function assigns the same penalty for predicting B as for predicting C. The model learns which neighbour Dijkstra picked. It does not learn how much worse the alternatives are.
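This blindness is easy to demonstrate. In the sketch below (hypothetical probabilities, not the model's), two predictions put their mass on different wrong neighbours, and cross-entropy cannot tell them apart because it only reads the probability assigned to the single target:

```python
import math

# Two hypothetical model outputs at a three-neighbour junction.
# One bets on B (one second slower), the other on C (five minutes slower).
probs_pick_B = {"A": 0.10, "B": 0.80, "C": 0.10}
probs_pick_C = {"A": 0.10, "B": 0.10, "C": 0.80}

target = "A"  # the one neighbour Dijkstra happened to pick

ce_B = -math.log(probs_pick_B[target])
ce_C = -math.log(probs_pick_C[target])

print(ce_B, ce_C)  # identical losses: CE ignores *where* the wrong mass went
```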
On most of the London Underground this doesn't matter. Most junctions have one clearly best neighbour and the cross-entropy signal is clean. But at a substantial minority of junctions — shared-track segments, parallel corridors, branch merges — two or more neighbours lead to routes within seconds of each other. At these junctions, Dijkstra breaks ties arbitrarily (in my implementation, by insertion order into the priority queue, which depends on Python set iteration over line names). The training data records one answer. The model must either memorise that arbitrary choice or hedge, and cross-entropy penalises hedging.
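The arbitrariness of the tiebreak is a property of the priority queue, not of the graph. A minimal sketch with `heapq` (standing in for the Dijkstra frontier; the counter models insertion order):

```python
import heapq

# When two frontier entries have identical cost, the heap falls back to the
# next tuple element -- here an insertion counter, which in the post's
# implementation ultimately depends on Python set iteration over line names.
frontier = []
heapq.heappush(frontier, (600, 0, "A"))  # (cost, insertion order, neighbour)
heapq.heappush(frontier, (600, 1, "B"))  # same cost, pushed later

print(heapq.heappop(frontier)[2])  # "A" wins purely by insertion order
```

Swap the push order and "B" wins instead; either way the training data records exactly one of two equally valid answers.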
The result is a systematic small error at every near-tie junction. Those errors are individually minor — a slightly slower edge here, an equivalent-but-not-identical path there — but they accumulate across a route to produce the 4% overhead.
Q-values: the cost of each choice
The fix comes from an idea borrowed from reinforcement learning, though what follows is not reinforcement learning in any meaningful sense.
In RL, a Q-value represents the total cost of taking a specific action in a specific state. For routing, the state is (current station, destination) and the actions are the adjacent stations. The Q-value of moving to neighbour $n$ from station $s$ with destination $d$ is:

$$Q(s, n, d) = t(s, n) + T^*(n, d)$$

where $t(s, n)$ is the travel time on the edge and $T^*(n, d)$ is the shortest remaining travel time from $n$ to the destination — exactly what Floyd–Warshall computes.
This is the Bellman equation for shortest paths. It says: the cost of a choice is the immediate cost plus the best possible future.
For the junction example:
| Neighbour | Edge time | Remaining | Q |
|---|---|---|---|
| A | 60s | 540s | 600s |
| B | 120s | 480s | 600s |
| C | 60s | 720s | 780s |
A and B are equally good — different edges, same total cost. C is clearly worse. Cross-entropy, seeing only that Dijkstra happened to pick A, would say B and C are equally wrong. The Q-values say B is essentially correct and C is the real mistake.
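The table's arithmetic is just the Bellman equation applied per neighbour. A minimal sketch with the same numbers (in the real pipeline the remaining times come from the Floyd–Warshall matrix rather than a hand-written dict):

```python
# Q-value computation for the junction example above.
edge_time = {"A": 60, "B": 120, "C": 60}    # seconds to reach each neighbour
remaining = {"A": 540, "B": 480, "C": 720}  # shortest time on to the destination

# cost of a choice = immediate edge cost + best possible future
Q = {n: edge_time[n] + remaining[n] for n in edge_time}
print(Q)  # A and B tie at 600s; C is 180s worse
```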
Soft targets from Q-values
Rather than training against a single correct answer, the Q-values can be converted into a target probability distribution over all neighbours:

$$p(n) = \frac{\exp(-Q(s, n, d) / \tau)}{\sum_{n'} \exp(-Q(s, n', d) / \tau)}$$

This is a softmin: low-cost neighbours get high probability, high-cost neighbours get low probability, and the temperature parameter $\tau$ controls how sharply the distribution concentrates on the best options.
For the example above, with appropriate $\tau$:
| Neighbour | Q | Target probability |
|---|---|---|
| A | 600s | 0.46 |
| B | 600s | 0.46 |
| C | 780s | 0.08 |
The model is now taught: A and B are both good choices, C is not. The arbitrary Dijkstra tiebreak has been replaced by the actual cost surface.
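A small sketch of the softmin producing those targets. The temperature here ($\tau \approx 103$ s) is an illustrative value chosen to reproduce the table's numbers; the post's actual temperature setting is not reproduced:

```python
import math

def softmin(q_values, tau):
    """Convert costs to probabilities: lower cost -> higher probability."""
    q_min = min(q_values.values())  # subtract the min for numerical stability
    weights = {n: math.exp(-(q - q_min) / tau) for n, q in q_values.items()}
    z = sum(weights.values())
    return {n: w / z for n, w in weights.items()}

# Q-values from the junction example; tau chosen to match the table above.
targets = softmin({"A": 600, "B": 600, "C": 780}, tau=103)
print({n: round(p, 2) for n, p in targets.items()})  # {'A': 0.46, 'B': 0.46, 'C': 0.08}
```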
The training loss switches from cross-entropy against a one-hot label to KL divergence against this soft distribution. The model no longer learns which neighbour an algorithm happened to pick. It learns how much each neighbour costs.
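The effect of the loss switch can be seen on a single junction. With KL divergence against the soft target, a hard CE-style prediction is penalised while hedging across the tied neighbours is rewarded (model distributions below are hypothetical):

```python
import math

def kl(target, model):
    """KL(target || model): penalises mass the model fails to place."""
    return sum(p * math.log(p / model[n]) for n, p in target.items() if p > 0)

target = {"A": 0.46, "B": 0.46, "C": 0.08}  # Q-soft target from the table
hard   = {"A": 0.90, "B": 0.05, "C": 0.05}  # all-in on Dijkstra's arbitrary pick
hedged = {"A": 0.46, "B": 0.46, "C": 0.08}  # matches the cost surface

print(kl(target, hard), kl(target, hedged))  # hedging now has the lower loss
```

Under one-hot cross-entropy the `hard` distribution would have been the better answer; under the soft target it is the worse one.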
Implementation
Since Floyd–Warshall was already computed for the evaluation metric, the training change was small. For each batch of (current, destination, target) triples:
- Compute Q for all stations: `Q = edge_time[current] + floyd_warshall[:, dest]`
- Mask non-adjacent stations to infinity
- Centre by subtracting the minimum Q per sample (prevents numerical saturation for distant stations where all Q-values are in the thousands of seconds)
- Normalise by the standard deviation across adjacent neighbours (keeps $\tau$ meaningful regardless of whether the junction has 30-second spreads or 300-second spreads)
- Apply softmin with a fixed temperature $\tau$
- Train with KL divergence instead of cross-entropy
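The target-construction steps above can be sketched end to end. This assumes a dense matrix `fw` of all-pairs shortest travel times (Floyd–Warshall) and a boolean adjacency matrix `adj`; all names are illustrative, not the post's code:

```python
import numpy as np

def q_soft_targets(current, dest, edge_time, fw, adj, tau=1.0):
    # Q for every station as a candidate next hop
    q = edge_time[current] + fw[:, dest]
    q = np.where(adj[current], q, np.inf)  # mask non-adjacent stations

    finite = np.isfinite(q)
    q = q - q[finite].min()                # centre: avoids saturation far from dest
    spread = q[finite].std()
    if spread > 0:
        q = q / spread                     # normalise: keeps tau meaningful

    w = np.exp(-q / tau)                   # softmin weights; exp(-inf) = 0
    return w / w.sum()

# Toy 4-station graph with a tie at station 0: both of its neighbours
# reach the destination in the same total time.
fw = np.array([[0., 1., 1., 2.],
               [1., 0., 1., 1.],
               [1., 1., 0., 1.],
               [2., 1., 1., 0.]])
adj = np.isclose(fw, 1.0)                  # adjacent iff one unit of time apart
edge_time = np.where(adj, fw, np.inf)

p = q_soft_targets(current=0, dest=3, edge_time=edge_time, fw=fw, adj=adj)
print(p)  # tied neighbours 1 and 2 each get probability 0.5
```

Training then minimises KL divergence between the model's log-probabilities and these targets (in PyTorch, for instance, via `torch.nn.functional.kl_div`).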
The value head loss weight was increased from 0.1 to 0.5 at the same time, anticipating later experiments where the value head would matter more.
Results
The dev model (128-dimensional, 20 epochs) showed the shift immediately:
| | Before (CE) | After (Q-soft) |
|---|---|---|
| Success | 100% | 98.4% |
| vs Dijkstra | 1.05 | 1.01 |
There is a slight drop in success rate (the dev model is small, and the softer training signal is less aggressive about reaching the destination) but a dramatic improvement in route quality.
The full model (512-dimensional, 80 epochs):
| | Before (CE) | After (Q-soft) |
|---|---|---|
| Success | 100% | 100% |
| vs Dijkstra | 1.04 | 1.00 |
| Training time | 13m | 16m |
The ratio is 1.00 across every route-length bucket:
| Bucket | Before | After |
|---|---|---|
| 2–5 st | 1.03 | 1.00 |
| 6–10 st | 1.04 | 1.00 |
| 11–20 st | 1.04 | 1.00 |
| 21–30 st | 1.04 | 1.00 |
| 31–50 st | 1.03 | 1.00 |
The model is finding routes that match the unconstrained shortest travel time computed by Floyd–Warshall on the raw graph.
Step accuracy went down
An apparent paradox: step accuracy — the fraction of hops where the model's top prediction matches the Dijkstra-enumerated target — dropped from 90% to 76%.
This is not a regression. It is the model correctly learning that many junctions have multiple equally good neighbours. At a junction where A and B both lead to optimal-cost routes, the old model was trained to always say A (because Dijkstra said A). The new model says A 48% of the time and B 48% of the time. Measured against the hard label "A", that's a miss half the time. Measured against actual route quality, it's perfect.
Step accuracy against a one-hot target is not a meaningful metric when the target distribution is legitimately soft. The Dijkstra ratio is the metric that matters, and it went to 1.00.
What the model learned
The shift from cross-entropy to Q-soft targets changed what the network represents.
Under cross-entropy, the model learned a lookup table: given this station and this destination, output this neighbour. The internal representations encoded decision boundaries — "if heading north on the Northern line past Camden Town, take the Edgware branch" — without any sense of how much better or worse the alternatives were.
Under Q-soft targets, the model learned the cost surface. The internal representations now encode relative costs — "the Edgware branch saves 45 seconds compared to the High Barnet branch for this destination" — which is a richer representation that generalises better. The policy that falls out of this representation is better precisely because it's grounded in costs rather than arbitrary choices.
The value head, and what it doesn't do yet
The model also has a value head that predicts remaining travel time to the destination. Before this work, it was trained as a minor auxiliary (0.1× weight) alongside the cross-entropy policy loss. Even after increasing the weight to 0.5, the value head's mean absolute error remained at 7.32 minutes — far too imprecise for practical use.
A diagnostic test confirmed this. Instead of using the policy to choose the next station, a Bellman rollout was tested: at each step, pick the neighbour that minimises edge time plus the value head's predicted remaining time. This is the theoretically optimal decision rule if the value predictions are accurate. In practice, with 7.32-minute MAE against a graph where adjacent edges are 1–3 minutes, the Bellman rollout achieved only 41% success rate and a 1.87 Dijkstra ratio.
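The rollout's decision rule is simple to state in code. A minimal sketch, with `value_fn` standing in for the value head; plugging in exact Floyd–Warshall times (as below) makes the rule provably optimal, which is what makes the 41% result diagnostic of the value head rather than the rule:

```python
import numpy as np

def bellman_rollout(start, dest, edge_time, adj, value_fn, max_steps=100):
    route = [start]
    current = start
    for _ in range(max_steps):
        if current == dest:
            return route
        neighbours = np.flatnonzero(adj[current])
        # cost of each choice = immediate edge time + predicted cost-to-go
        costs = [edge_time[current, n] + value_fn(n, dest) for n in neighbours]
        current = int(neighbours[int(np.argmin(costs))])
        route.append(current)
    return route  # gave up: value estimates too noisy to reach the destination

# Line graph 0-1-2-3: with exact values, the rollout recovers the shortest path.
fw = np.array([[0., 1., 2., 3.],
               [1., 0., 1., 2.],
               [2., 1., 0., 1.],
               [3., 2., 1., 0.]])
adj = np.isclose(fw, 1.0)
route = bellman_rollout(0, 3, fw, adj, value_fn=lambda n, d: fw[n, d])
print(route)  # [0, 1, 2, 3]
```

With a 7-minute MAE on a graph of 1–3 minute edges, the `costs` comparison is dominated by value noise rather than edge times, which is why the rollout wanders.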
The value head failed because the policy loss dominates training. The KL divergence on the Q-soft targets produces large gradients (the distribution is over 272 stations); the MSE on a single scalar produces small gradients. Even at 0.5× weight, the value head is an afterthought.
The interesting next step is to train a value-primary model: drop the policy loss entirely and train the encoder and value head with pure MSE against Floyd–Warshall targets. If the value head can learn an accurate distance field over the graph — MAE under one minute — then the Bellman action rule would give optimal routing without beam search, without a policy network, and with the ability to query cost-to-go for any station pair without rolling out a full route. That would be a strictly more capable system: same routing quality, plus a queryable distance field that supports rerouting, partial-route comparison, and counterfactual analysis ("what if this line is closed?").
Whether the value head can actually reach that accuracy is the experiment to run.