The problem with equal edges
The initial models from part 2 came out perfectly respectably, scoring 59% exact match on the line model and 50% on the change model, but they were trained on a ground truth that had been simplified past the point of fidelity to the real world. The BFS route enumerator sorted candidates by transfer count and then stop count, effectively treating every edge in the graph as the same length. That meant a journey from Chesham to Chalfont & Latimer (4 pleasant yet long minutes through Buckinghamshire countryside on the Metropolitan) counted the same as Liverpool Street to Bethnal Green (90 seconds on the Central). The "best" route was whichever had the fewest transfers and then the fewest stops, regardless of its actual speed.
Now, minimising transfers can be a valid reason to prefer a particular route (probably most of all if you have mobility issues, but also if you just don't fancy the interruptions). In this model though, I want to score routes primarily for how fast they get from A to B.
The TfL API timetable endpoints had readily provided these inter-station travel times, but my initial routing pipeline had discarded them.
In search of lost travel time
The topology extractor already parsed stop_times.txt in the GTFS file to build the adjacency structure.
Extending it to also extract travel times between consecutive stops on the same trip was straightforward: for each pair of adjacent stations on a trip, take the difference in arrival times.
Where multiple trips give different times for the same edge (peak vs off-peak services), keep the median.
This produced 894 timed edges out of 1,074 adjacencies in the graph — about 83% coverage. The remaining edges are cases where the GTFS data has the same arrival time for consecutive stops (rounding to whole minutes in some timetables) or where a station pair only appears in trips with missing time data. For untimed edges, a fallback of 120 seconds (a reasonable average inter-station time on the Underground) fills the gap.
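To make that concrete, here is a minimal sketch of the timing extraction, assuming a standard GTFS stop_times.txt with the usual trip_id, arrival_time, stop_id and stop_sequence columns; the function and variable names are illustrative rather than the project's actual code.

```python
# Sketch of per-edge travel time extraction from stop_times.txt (illustrative names).
import csv
from collections import defaultdict
from statistics import median

FALLBACK_SECONDS = 120.0  # used later for edges with no usable timing data

def parse_gtfs_time(value: str) -> int:
    """GTFS times can exceed 24:00:00, so parse them by hand into seconds."""
    h, m, s = (int(part) for part in value.split(":"))
    return h * 3600 + m * 60 + s

def edge_travel_times(stop_times_path: str) -> dict[tuple[str, str], float]:
    # Group each trip's rows so they can be walked in stop_sequence order.
    trips: dict[str, list[tuple[int, str, int]]] = defaultdict(list)
    with open(stop_times_path, newline="") as f:
        for row in csv.DictReader(f):
            trips[row["trip_id"]].append(
                (int(row["stop_sequence"]), row["stop_id"], parse_gtfs_time(row["arrival_time"]))
            )

    # Collect every observed time for each (from_stop, to_stop) edge...
    samples: dict[tuple[str, str], list[int]] = defaultdict(list)
    for rows in trips.values():
        rows.sort()
        for (_, a, t_a), (_, b, t_b) in zip(rows, rows[1:]):
            if t_b > t_a:  # skip zero or negative gaps from whole-minute rounding
                samples[(a, b)].append(t_b - t_a)

    # ...and keep the median, so peak and off-peak trips don't skew the edge weight.
    return {edge: float(median(times)) for edge, times in samples.items()}
```

Any edge absent from the returned mapping falls back to the 120-second default when the graph is weighted.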
Transfer penalty
Changing lines isn't free. In the vast majority of platform-to-platform wanders, tubegoers must navigate flights of stairs, escalators, and other liminal spaces before they can catch (or await) their next train. A flat penalty of 240 seconds (4 minutes) per transfer is a crude but defensible approximation. Some interchanges are faster (the direct cross-platform change at Mile End, where services are timed to line up) and some slower (the hike between the District and the Northern at Bank), but as a first pass it shifts the ranking in the right direction.
With weighted edges and a transfer penalty, the route enumerator now sorts by total travel time rather than by transfers and stops. This means a route with one extra transfer but significantly faster running time can rank above a "direct" route — exactly the kind of trade-off you want a journey planner to surface in practice (for instance, switching to the Jubilee at West Ham can shave several minutes off a journey along the District/H&C).
The transfer penalty lives in defaults.toml as transfer_penalty = 240.0, so it can be refined later (most likely as per-station penalties, or even learned from the data) without touching any code.
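As an illustration of the new ranking, here is a hedged sketch of what the cost function might look like, assuming each candidate route is a list of stations plus the line used on each hop. Only transfer_penalty = 240.0 and the 120-second fallback come from the config described above; everything else is named for illustration.

```python
# Sketch of route scoring under weighted edges plus a flat transfer penalty.
def route_cost(
    stations: list[str],
    lines: list[str],                            # lines[i] is the line used to reach stations[i + 1]
    edge_times: dict[tuple[str, str], float],
    transfer_penalty: float = 240.0,             # from defaults.toml
    fallback: float = 120.0,                     # untimed-edge default
) -> float:
    total = 0.0
    for i, (a, b) in enumerate(zip(stations, stations[1:])):
        total += edge_times.get((a, b), fallback)    # timed edge, or the fallback
        if i > 0 and lines[i] != lines[i - 1]:       # changing line costs a flat 4 minutes
            total += transfer_penalty
    return total

# Candidates are then ranked by total travel time instead of (transfers, stops):
# candidates.sort(key=lambda r: route_cost(r.stations, r.lines, edge_times))
```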
Multi-route training
The initial models trained on a single "best" route per OD pair. This was the first thing to change.
The route enumerator now stores all valid routes (up to max_routes_per_od, default 3) for each origin–destination pair.
The dataset went from 73,712 examples with one label each to 73,712 OD pairs carrying 194,224 routes between them — an average of 2.6 routes per pair.
During training, __getitem__ samples one route uniformly at random each time it's called.
Over many epochs the model sees all valid routes for each OD pair and learns the distribution rather than memorising a single arbitrary "best" route.
The architecture doesn't change at all — the decoder still outputs one sequence per forward pass.
The prediction diversity is learned from the training signal.
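The sampling itself is only a few lines. A minimal sketch, assuming a PyTorch-style Dataset where routes_by_od[i] already holds every valid tokenised route for the i-th OD pair (the class and attribute names are illustrative):

```python
import random
from torch.utils.data import Dataset

class MultiRouteDataset(Dataset):
    def __init__(self, od_pairs, routes_by_od):
        self.od_pairs = od_pairs          # encoder inputs, one per OD pair
        self.routes_by_od = routes_by_od  # list of valid label sequences per pair

    def __len__(self):
        return len(self.od_pairs)

    def __getitem__(self, idx):
        # Train-time behaviour: pick one valid route uniformly at random, so
        # over many epochs the model sees the whole route distribution.
        label = random.choice(self.routes_by_od[idx])
        return self.od_pairs[idx], label
```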
When evals go wrong
The first run with multi-route training on the dev profile showed an immediate red flag:
| Model | Exact match | Valid |
|---|---|---|
| line | 15.5% | 100% |
| change | 6.4% | 85.2% |
| station | 6.2% | 12.2% |
- The line model dropped from 59% to 15%.
- The change model went from 100% validity to 85%.
- The station model was generating near-total nonsense at 12% validity.
Two things had gone wrong, and telling them apart was the key to fixing this.
The validity drop was the first clue. The dev profile only trains for 20 epochs.
With random route sampling, each individual route is seen roughly 7–8 times total.
That isn't enough to learn the structure of the network — the model hasn't converged, and it shows as incoherent predictions.
This was a training problem, not an architecture problem.
It was confirmed by the line model still hitting 100% validity: with only two tokens per step (line ID and direction), 20 epochs is enough to at least learn the valid token ranges even if it can't pick the right ones.
The exact match drop was measurement error. This was the subtler problem.
The validation loop was calling __getitem__ on each example, which sampled a random route from the valid set.
The model could predict the fastest route perfectly and score 0% exact match because the evaluation happened to sample a different route from the same OD pair.
The model was being scored against a moving target that changed every epoch.
The fix here was to separate the training and evaluation paths. Training keeps random sampling as intended, which is how the model learns the full route distribution. Evaluation however must compare the model's prediction against all valid routes for each OD pair. A prediction counts as correct if it matches any of them.
This required the dataset to expose a get_all_labels(idx) method alongside the random-sampling __getitem__, and the evaluator to loop over all valid routes when checking for a match.
The per-head accuracy metrics (line accuracy, direction accuracy, station accuracy) are scored against whichever valid route has the most token overlap with the prediction, which gives the fairest read on where the model is partially right.
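A sketch of that evaluation path, assuming get_all_labels(idx) as described above, plus an illustrative predict(idx) that returns the model's decoded token sequence and a token_overlap helper that is also named for illustration:

```python
def token_overlap(pred: list[int], ref: list[int]) -> int:
    # Count positions where the prediction and a reference route agree.
    return sum(p == r for p, r in zip(pred, ref))

def evaluate(dataset, predict) -> float:
    exact = 0
    for idx in range(len(dataset)):
        pred = predict(idx)
        refs = dataset.get_all_labels(idx)
        # Exact match counts if the prediction equals ANY valid route.
        if any(pred == ref for ref in refs):
            exact += 1
        # Per-head accuracies are scored against the closest valid route,
        # i.e. the one sharing the most tokens with the prediction.
        closest = max(refs, key=lambda ref: token_overlap(pred, ref))
        # ... accumulate line / direction / station accuracy against `closest` ...
    return exact / len(dataset)
```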
Beam search
With multi-route training, the model learns that multiple routes exist for each OD pair, but the decoder still outputs a single sequence — the greedy argmax path through its probability distribution. If the model has learned three valid routes for Camden Town → Victoria, greedy decoding picks whichever one has the highest probability at each step, which might be a chimera of two different routes rather than any single valid one.
Beam search solves this by maintaining the top-k hypotheses at each decoding step rather than committing to one. For the line model with beam width 5 and 4 decoding steps, this is cheap. For the station model with 50 steps, it's expensive — which led to an architectural decision: run greedy evaluation during training epochs and beam search only on the final epoch.
This gives two metrics:
- top1: does the single best beam match any valid route? This is the greedy metric, comparable to the old exact match.
- beam: does any beam in the top-k match any valid route? This measures whether the model has learned the route at all, even if it isn't the highest-ranked prediction.
The gap between top1 and beam tells you how much diversity the decoder has learned.
If beam is much higher than top1, the model knows multiple routes but isn't always ranking the right one first — a decoding problem rather than a learning problem.
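For reference, the bookkeeping behind those two metrics is standard beam search. A minimal, framework-agnostic sketch, assuming a step(prefix) callable that returns next-token log-probabilities (the real decoder conditions on the encoder output and runs batched, but the loop is the same):

```python
def beam_search(step, eos_id: int, beam_width: int = 5, max_steps: int = 50):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos_id:        # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            log_probs = step(seq)                # distribution over the next token
            for tok, lp in enumerate(log_probs):
                candidates.append((seq + [tok], score + lp))
        # Keep only the top-k hypotheses by cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq and seq[-1] == eos_id for seq, _ in beams):
            break
    return beams  # top1 uses beams[0]; the beam metric checks every hypothesis
```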
Where we are now
The corrected pipeline on the dev profile (20 epochs, d_model=128) gives us a first honest read.
The final-epoch numbers, where beam search actually runs:
| Model | Top-1 | Beam (k=5) | Beam gap | Valid |
|---|---|---|---|---|
| line | 36.1% | 51.3% | +15.2 | 100% |
| change | 18.6% | 31.5% | +12.9 | 100% |
| station | 9.5% | 10.0% | +0.5 | 100% |
And the per-head breakdown at each model's best checkpoint:
| Model | Exact | Line acc | Dir acc | Station acc |
|---|---|---|---|---|
| line | 32.7% | 61.9% | 82.4% | — |
| change | 11.8% | 69.4% | 83.5% | 48.6% |
| station | 9.5% | — | — | 32.1% |
Validity is 100% across the board, confirming that the earlier validity drops were a convergence problem and not an architectural one.
A few observations.
The beam gap is where multi-route training proves itself. For the line model, greedy decoding matches a valid route 36% of the time, but searching through 5 hypotheses raises that to 51%. The model has learned that multiple routes exist for a given OD pair and assigns probability mass to more than one of them. The change model shows the same pattern. This is exactly what we hoped random-sampling training would produce.
The station model has no meaningful beam gap. At 9.5% top-1 and 10.0% beam, it hasn't learned enough about the network to propose coherent alternatives. This is almost certainly undertrained: 20 epochs at batch size 128 with output sequences up to 50 tokens long is not much. The full profile should tell us whether there's more to extract here.
The "station accuracy" column is misleading across models. The change model's 48.6% station accuracy and the station model's 32.1% are measuring different things. The change model predicts 1–3 interchange stations per route, chosen from a set that's effectively constrained to the 80 interchanges in the network. The station model predicts every intermediate stop in a sequence that might be 10–30 tokens long. The change model's task is categorically easier, and comparing the two numbers directly would be wrong.
Best checkpoint metrics don't include beam. Beam search only runs on the final epoch, but the best checkpoint is saved at the epoch with the lowest validation loss, which is usually earlier. So the "best metrics" lines report beam = exact, because the checkpoint was saved during a greedy-evaluation epoch. The progress bar's final-epoch numbers are the ones to look at for beam coverage.
Next
The full profile run (200 epochs, d_model=256, deeper encoder) is the obvious next step, and should clarify whether the station model's flatness is a capacity issue or a convergence issue.
Looking beyond that, we'll get into:
- Topological validity checking that goes beyond token ranges. The current "valid" metric only checks that predicted IDs are in bounds. It doesn't check whether the predicted sequence of stations is actually traversable on the graph. A prediction could name 10 real stations in an order that doesn't correspond to any connected path and still score as "valid" (a sketch of the stricter check follows this list).
- Per-station transfer penalties to replace the flat 240 seconds, since the walk from the Jubilee to the Northern at London Bridge is nothing like the walk between the District and Northern at Bank.
- Travel time as an encoder feature. The GNN currently sees line identity on each edge but not how fast the line runs between those stations. Adding travel time as an edge feature would let the encoder learn that the Jubilee between Stratford and Canary Wharf is fast, rather than having to infer it indirectly.
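For the first of those items, a traversability check can be as small as the following sketch, assuming adjacency is the set of station-pair edges the topology extractor already builds (the function name is illustrative):

```python
def is_traversable(stations: list[str], adjacency: set[tuple[str, str]]) -> bool:
    # Every consecutive pair of predicted stations must be a real edge in the graph.
    return all(
        (a, b) in adjacency or (b, a) in adjacency   # treat edges as undirected
        for a, b in zip(stations, stations[1:])
    )
```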