The distance field

Training a value-primary model, and why it can't replace the policy

The idea

The Q-soft policy from part 9 reached 1.00 Dijkstra ratio — mathematically optimal routing. But it reached that optimality through a black box: given a station and a destination, the policy outputs a distribution over neighbours, and the right answer falls out.

You cannot ask the policy "how far is Bank from Morden?" You can only roll out a full route and sum the edges. You cannot compare two partial routes without completing both. You cannot answer "what if the Northern line is closed?" without retraining.

The value head could do all of this — if it worked. At the end of part 9, the value head predicted remaining travel time with 7.32-minute MAE, far too imprecise to be useful. A Bellman rollout using those predictions achieved 41% success rate.

The value head failed because it was an auxiliary loss. The KL divergence on the Q-soft policy produces gradients across 272 stations per sample; the MSE on a single scalar produces one small gradient. Even at 0.5× weight, the value head was starved.

The fix seemed obvious: drop the policy loss entirely and train the encoder and value head with pure MSE against Floyd–Warshall shortest-path distances. If the value head could learn an accurate distance field — MAE under one minute — then the Bellman action rule (at each step, pick the neighbour minimising edge time plus predicted remaining time) would give optimal routing without a policy network, without beam search, and with a queryable distance field as a free bonus.
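The Bellman action rule is simple enough to sketch in a few lines. This is a toy illustration, not the project's API — the graph, function names, and numbers are all invented here:

```python
def bellman_step(station, dest, neighbours, edge_time, predict_remaining):
    """Bellman action rule: pick the neighbour minimising
    edge time plus predicted remaining time to the destination."""
    return min(
        neighbours[station],
        key=lambda n: edge_time[(station, n)] + predict_remaining(n, dest),
    )

# Toy graph: A -> B -> D (2 + 3 min) vs A -> C -> D (1 + 5 min).
neighbours = {"A": ["B", "C"]}
edge_time = {("A", "B"): 2.0, ("A", "C"): 1.0}
true_remaining = {("B", "D"): 3.0, ("C", "D"): 5.0}

step = bellman_step("A", "D", neighbours, edge_time,
                    lambda n, d: true_remaining[(n, d)])
print(step)  # -> B  (2 + 3 = 5 beats 1 + 5 = 6)
```

With exact values this rule is provably optimal; the rest of the post is about what happens when `predict_remaining` is merely accurate on average.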

Training the value-primary model

The training data changed completely. The Q-soft policy trained on per-step triples extracted from routes: 1.5 million (current, destination, next-station) samples. The value-primary model trained on OD pairs directly: 73,712 (origin, destination, shortest-time) samples drawn from the Floyd–Warshall matrix.
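Extracting those samples is straightforward once the all-pairs matrix exists. A minimal sketch on a toy three-station line — the real pipeline runs Floyd–Warshall over 272 stations, but the shape of the data is the same:

```python
INF = float("inf")

def floyd_warshall(n, edges):
    """edges: {(u, v): minutes}. Returns an n x n shortest-time matrix."""
    d = [[0.0 if i == j else INF for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        d[u][v] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Toy line: 0 <-> 1 (2 min), 1 <-> 2 (3 min).
d = floyd_warshall(3, {(0, 1): 2.0, (1, 0): 2.0, (1, 2): 3.0, (2, 1): 3.0})

# (origin, destination, shortest-time) training samples,
# excluding the diagonal -- which is why 272^2 = 73,984 matrix
# entries yield 73,712 OD pairs.
samples = [(o, t, d[o][t]) for o in range(3) for t in range(3)
           if o != t and d[o][t] < INF]
print(len(samples), d[0][2])  # 6 samples; 0 -> 2 takes 5.0 min
```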

The value head was given more capacity — a four-layer MLP with layer normalisation, roughly 3× the parameters of the auxiliary version. The policy MLP was frozen. The encoder was shared and trained end-to-end, so the GATv2 message-passing layers could reshape their representations to encode distance information rather than decision boundaries.
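A value head of that shape might look like the sketch below. The text specifies a four-layer MLP with layer normalisation but not the widths, so the embedding and hidden dimensions here are assumptions, and the pairing of origin and destination embeddings by concatenation is one plausible design, not a description of the actual model:

```python
import torch
import torch.nn as nn

EMB = 128  # per-station encoder embedding size (assumed)

class ValueHead(nn.Module):
    """Predicts remaining travel time in minutes from a pair of
    station embeddings: (origin, destination) -> scalar."""
    def __init__(self, emb=EMB, hidden=256):
        super().__init__()
        layers, d = [], 2 * emb
        for _ in range(3):
            layers += [nn.Linear(d, hidden), nn.LayerNorm(hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, 1))  # fourth layer: scalar output
        self.net = nn.Sequential(*layers)

    def forward(self, h_origin, h_dest):
        return self.net(torch.cat([h_origin, h_dest], dim=-1)).squeeze(-1)

head = ValueHead()
h = torch.randn(4, EMB)  # a batch of 4 station embeddings
print(head(h, h).shape)  # torch.Size([4])
```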

With only 73,712 training samples versus 1.5 million, each epoch was fast — 65 batches at batch size 1,024 — but many more epochs were needed to compensate. At 800 epochs the model saw each OD pair roughly 800 times, compared to the policy model seeing each step-level sample roughly 80 times across its 80 epochs.

The loss was Huber with δ=2 minutes: squared penalty for errors under two minutes (encouraging precision where it matters for Bellman), linear penalty above (preventing large-error outliers from dominating gradients).
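In practice a framework's built-in Huber (smooth-L1) loss would be used; this standalone scalar version just makes the two regimes visible:

```python
def huber(error, delta=2.0):
    """Huber loss in minutes: quadratic below delta, linear above."""
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a                 # squared: sharp gradient near zero
    return delta * (a - 0.5 * delta)       # linear: outliers capped

print(huber(1.0))  # 0.5 -- quadratic regime, pushes for precision
print(huber(5.0))  # 8.0 -- linear regime, not 12.5 as pure MSE would give
```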

The value head got good

MAE dropped fast. By 6% of training, it was under 1 minute. By 12%, under 0.4 minutes. The best checkpoint reached 0.25-minute MAE — 15 seconds average error across all station pairs.

|               | Auxiliary (part 9) | Value-primary |
|---------------|--------------------|---------------|
| MAE           | 7.32 min           | 0.25 min      |
| RMSE          | —                  | 0.41 min      |
| < 30 s error  | —                  | 90.4%         |
| < 1 min error | —                  | 97.5%         |
| < 2 min error | —                  | 99.0%         |

A 29× improvement in MAE. The model learned a compressed representation of the 272×272 shortest-path matrix — 73,984 distance values — inside the encoder embeddings and a four-layer MLP.

As a distance oracle, this works. You can query "how far is Ealing Broadway from Stratford?" and get an answer within 15 seconds of the true shortest-path time, without rolling out a route.

The Bellman rollout did not

|             | Q-soft policy | Value-primary Bellman |
|-------------|---------------|-----------------------|
| Success     | 100%          | 43.2%                 |
| vs Dijkstra | 1.00          | 1.75                  |

Stratified by route length:

| Bucket         | Policy success | Policy vs Dijkstra | Bellman success | Bellman vs Dijkstra |
|----------------|----------------|--------------------|-----------------|---------------------|
| 2–5 stations   | 100%           | 1.00               | 71%             | 1.75                |
| 6–10 stations  | 100%           | 1.00               | 52%             | 1.92                |
| 11–20 stations | 100%           | 1.00               | 41%             | 1.74                |
| 21–30 stations | 100%           | 1.00               | 29%             | 1.37                |
| 31–50 stations | 100%           | 1.00               | 28%             | 1.27                |

0.25-minute MAE. 90% of predictions within 30 seconds. 43% routing success.

Why accurate values produce bad routes

The Bellman action rule is: at station s with destination d, move to the neighbour n that minimises w(s, n) + V(n, d). The edge weights w(s, n) are exact — they come from the GTFS timetable. The value predictions V(n, d) are approximate. The action rule takes an argmin over approximate values.

The problem is that the argmin is fragile. With 3.6 average neighbours, the cost difference between the best and second-best neighbour is often under a minute. A 15-second average error sounds small, but it only needs to flip the ranking at one junction to send the route down a suboptimal branch.

And the errors compound. If each hop has roughly 90% probability of picking the correct neighbour (consistent with the 90.4% of predictions within 30 seconds of truth), then over a 20-hop route: 0.9^20 ≈ 12%. The stratified results confirm this: short routes (2–5 stations) succeed 71% of the time; long routes (21–30 stations) succeed 29%.
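The compounding arithmetic is easy to reproduce, with hop counts chosen to roughly match the route-length buckets:

```python
# Per-hop success of ~0.9, compounded over route length.
p_hop = 0.9
for hops in (3, 8, 15, 25):
    print(hops, round(p_hop ** hops, 2))
# 3 hops: 0.73, 8 hops: 0.43, 15 hops: 0.21, 25 hops: 0.07
```

The curve tracks the measured success rates by bucket reasonably well, which is what you would expect if per-junction ranking errors were roughly independent.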

This is a known failure mode of value-based methods in reinforcement learning. The value function can have low average error while the induced policy is poor, because the policy depends on the argmin — which depends on the relative ordering of values, not their absolute accuracy. Two neighbours predicted as 14.2 and 14.5 minutes might actually be 14.5 and 14.2. The MAE is small in both cases. The action is wrong.
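The 14.2/14.5 example from the paragraph above, in numbers — low MAE, wrong argmin:

```python
true_vals = {"B": 14.5, "C": 14.2}   # C is actually closer
pred_vals = {"B": 14.2, "C": 14.5}   # predictions swap the order

mae = sum(abs(pred_vals[n] - true_vals[n]) for n in true_vals) / len(true_vals)
best_true = min(true_vals, key=true_vals.get)
best_pred = min(pred_vals, key=pred_vals.get)

print(round(mae, 2))         # 0.3 -- small average error
print(best_true, best_pred)  # C B -- but the chosen action is wrong
```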

The Q-soft policy avoids this entirely. It doesn't predict distances and derive actions. It directly learns a distribution over actions, trained against the true cost surface. The classification problem — "which of these 3–4 neighbours is best?" — is easier than the regression problem "what is the exact travel time to every station?", precisely because classification only needs the ranking right, not the magnitudes.

What the value head is still good for

The value-primary model isn't useful for routing. But it learned something the policy model didn't: a queryable distance field.

The policy can answer "what is the next hop from King's Cross to Brixton?" It cannot answer "how long will that take?" without rolling out every hop and summing edge times.

The value head can answer "how long from King's Cross to Brixton?" in a single forward pass — one encoder call plus one MLP evaluation. For the 90% of station pairs where the error is under 30 seconds, that answer is precise enough for display, ranking, and comparison.

Whether that capability justifies the additional model complexity depends on what gets built on top. For now, the routing system uses the Q-soft policy. The value head stays as an auxiliary, available if needed, not pretending to be something it isn't.

The lesson

The Q-soft policy works because it solves the right problem. Routing is a sequence of discrete choices. The policy directly models those choices, informed by costs.

The value-primary model solves a harder problem — learning a continuous distance field over 73,000 station pairs — and solves it well. But the Bellman bridge between "accurate distances" and "optimal actions" is too fragile. Small errors in a continuous prediction, when passed through an argmin, produce discrete mistakes that compound across hops.

The theoretically elegant approach lost to the pragmatic one. That happens.