The pipeline
With the GTFS data in hand, the training pipeline was set up in four stages:
- Topology extraction: parsing the GTFS zip into a line-aware adjacency structure that records which stations are adjacent on which lines, which stations serve as interchanges, and a canonical ordering of stations per line for determining direction.
- Route enumeration: running a BFS over every origin–destination pair in the network (272 × 271 = 73,712 pairs), finding up to three topologically valid routes per pair with at most two transfers.
- Graph construction: building an enriched PyTorch Geometric graph with four node features (normalised easting, normalised northing, number of lines, interchange flag) and one-hot line identity as edge features (a sketch follows this list).
- Training: running a shared GATv2 encoder with three interchangeable decoder heads.
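For concreteness, here's roughly what the graph-construction stage looks like. This is a minimal sketch: the field names and the helper's signature are my own, not lifted from the repo.

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data

NUM_LINES = 11  # Underground lines, represented as one-hot edge features

def build_graph(stations, edges):
    """stations: list of per-station dicts; edges: list of (src_idx, dst_idx, line_idx)."""
    # Four node features: normalised easting/northing, line count, interchange flag.
    x = torch.tensor(
        [
            [s["easting_norm"], s["northing_norm"], s["n_lines"], float(s["is_interchange"])]
            for s in stations
        ],
        dtype=torch.float,
    )
    # 2 x E tensor of station indices for adjacent stations.
    edge_index = torch.tensor(
        [[src for src, _, _ in edges], [dst for _, dst, _ in edges]], dtype=torch.long
    )
    # One-hot line identity as the single edge feature.
    edge_attr = F.one_hot(
        torch.tensor([line for _, _, line in edges]), num_classes=NUM_LINES
    ).float()
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
```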
The whole thing is driven by a single defaults.toml file controlling every hyperparameter, with a layered merge system (base → model type → profile → CLI overrides) rather than having settings scattered across the codebase. There are two profiles: dev for fast iteration (20 epochs, large batch) and full for production runs (200 epochs, deeper encoder, d_model=256).
Using a TOML file as a centralised 'control plane' like this is a little trick I started recently in the [Havelock](https://havelock.ai) repo (developing BERT/ModernBERT variants). Hierarchical TOML configs are a pattern seen in libraries like pydantic-settings in Python and figment in Rust; the idea of 'dev' and 'release' profiles is stolen from Cargo, Rust's build tool.
In ML projects you typically tune hyperparameters only for optimal model outcomes, but it's also handy to be able to switch quickly to a 'quick and dirty' configuration (higher batch size and learning rate, fewer epochs).
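The merge itself is only a few lines. A minimal sketch, assuming the TOML file exposes [base], [model.*] and [profile.*] tables; the actual layout in the repo may differ:

```python
import tomllib
from copy import deepcopy

def deep_merge(base: dict, override: dict) -> dict:
    # Later layers win; nested tables are merged key by key rather than replaced wholesale.
    out = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

def load_config(model_type: str, profile: str, cli_overrides: dict) -> dict:
    with open("defaults.toml", "rb") as f:
        cfg = tomllib.load(f)
    # Layered merge: base -> model type -> profile -> CLI overrides.
    merged = cfg.get("base", {})
    merged = deep_merge(merged, cfg.get("model", {}).get(model_type, {}))
    merged = deep_merge(merged, cfg.get("profile", {}).get(profile, {}))
    return deep_merge(merged, cli_overrides)
```

The CLI layer is just another small dict merged on top, so a one-off override like a lower epoch count doesn't need its own plumbing.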
The decoders
All three decoders are autoregressive: a GRU cell that at each step emits predictions from one or more classification heads, then feeds its own output back as input to the next step.
The line decoder has two heads per step — a line classifier (11-way) and a direction classifier (binary) — and runs for up to four steps. Its output for a journey from Camden Town to Victoria might be [(northern, southbound), (district, westbound)].
The change decoder adds a third head: a station classifier (272-way) that predicts where to transfer. Same journey: [(northern, southbound, Embankment), (district, westbound, Victoria)].
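A minimal sketch of one decoder step for the line decoder (the change decoder just adds a 272-way station head); the module and dimension names here are assumptions, not the actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LineDecoder(nn.Module):
    def __init__(self, d_model=256, num_lines=11, max_steps=4):
        super().__init__()
        # Input at each step is the previous (line, direction) prediction as one-hots.
        self.cell = nn.GRUCell(num_lines + 2, d_model)
        self.line_head = nn.Linear(d_model, num_lines)  # 11-way line classifier
        self.dir_head = nn.Linear(d_model, 2)           # binary direction classifier
        self.num_lines = num_lines
        self.max_steps = max_steps

    def forward(self, h):
        # h: encoder summary of the (origin, destination) query, shape (batch, d_model)
        prev = torch.zeros(h.size(0), self.num_lines + 2, device=h.device)
        steps = []
        for _ in range(self.max_steps):
            h = self.cell(prev, h)
            line_logits, dir_logits = self.line_head(h), self.dir_head(h)
            steps.append((line_logits, dir_logits))
            # Feed our own prediction back in as the next step's input.
            prev = torch.cat(
                [
                    F.one_hot(line_logits.argmax(-1), self.num_lines).float(),
                    F.one_hot(dir_logits.argmax(-1), 2).float(),
                ],
                dim=-1,
            )
        return steps
```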
The station decoder replaces the structured heads with a pointer mechanism. At each step it computes attention scores over all station embeddings from the encoder and selects the next station in the sequence. It runs for up to 50 steps and must predict every intermediate stop.
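The pointer step itself is just scaled dot-product attention between the decoder state and the encoder's station embeddings. A sketch, with names of my own choosing:

```python
import torch.nn as nn

class PointerStep(nn.Module):
    """Score every station embedding against the decoder state; the argmax is the next station."""

    def __init__(self, d_model=256):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)

    def forward(self, h, station_emb):
        # h: decoder hidden state (batch, d_model); station_emb: encoder output (num_stations, d_model)
        q = self.query(h)                      # (batch, d_model)
        k = self.key(station_emb)              # (num_stations, d_model)
        scores = q @ k.T / k.size(-1) ** 0.5   # (batch, num_stations) attention logits
        return scores                          # argmax at inference selects the next station
```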
Regularise it
The first run on the full profile showed massive overfitting: train loss 0.05 against val loss 0.44, an 8× gap. The model was memorising routes rather than learning the structure of the network.
Two changes narrowed it:
- Label smoothing (0.1) softens the target distribution, preventing the model from becoming overconfident on any single route.
- Scheduled sampling (p=0.5) is the bigger lever. During training, at each decoder step, with probability 0.5 the model is fed its own previous prediction instead of the ground truth. This forces it to learn to recover from its own mistakes, which is exactly the regime it faces at inference.
The train/val gap dropped from 8× to 1.5×.
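Both changes are small in code. Here's a sketch of the training-side decode loop under those settings; the decoder pieces are passed in as hypothetical callables, so this isn't the repo's actual function:

```python
import random
import torch.nn as nn

# Label smoothing is a one-liner on the loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def decode_with_scheduled_sampling(cell, embed, head, h, targets, p_sample=0.5):
    """cell: GRUCell, embed: token id -> vector, head: hidden -> logits.
    targets: (batch, seq_len) ground-truth token ids."""
    prev = targets[:, 0]          # start from the first ground-truth token
    loss = 0.0
    for t in range(1, targets.size(1)):
        h = cell(embed(prev), h)
        logits = head(h)
        loss = loss + criterion(logits, targets[:, t])
        # With probability p_sample, the next input is the model's own prediction
        # rather than the ground truth, so it learns to recover from its own mistakes.
        use_own = random.random() < p_sample
        prev = logits.argmax(-1) if use_own else targets[:, t]
    return loss
```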
Results
All three models trained on 66,341 examples (90/10 split) with 200 epochs on the full profile. The station model was stopped early at 35% of training.
| Model | Params | Exact match | Line acc | Dir acc | Station acc | Valid |
|---|---|---|---|---|---|---|
| line | 1,074,318 | 59.1% | 77.6% | 90.2% | — | 100% |
| change | 1,232,590 | 50.2% | 83.1% | 89.3% | 68.3% | 100% |
| station | 1,270,272 | 41.7%¹ | — | — | — | 100% |

¹ Station model training cut short at 35% of training (70/200 epochs), still climbing.
"Exact match" means every token in the predicted sequence matches the label. For the change model, getting the line and direction right but picking Embankment instead of Monument as the interchange station counts as a total miss.
Direction accuracy of around 90% across both the line and change models suggests the encoder learns a strong sense of orientation on the network: it knows which way is "northbound" on the Northern line with high reliability.
100% topological validity across all models means every predicted token is a real line ID or station ID. The architecture never hallucinates.
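To be explicit about what those two columns measure (the helper names here are mine):

```python
def exact_match(pred_tokens, label_tokens):
    # Every token must match, in order: one wrong interchange station sinks the whole example.
    return list(pred_tokens) == list(label_tokens)

def is_valid(pred_tokens, known_ids):
    # Topological validity as reported here: every predicted token is a real line or station ID.
    return all(tok in known_ids for tok in pred_tokens)
```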
Measurement error
The results above are against a single "best" route per OD pair, selected by the BFS with a naive sort by (fewest transfers first, then fewest stops). This has two problems.
First, many "errors" are valid routes. The London Underground has extensive track sharing: the Circle and Hammersmith & City, for example, run over the same tracks between Paddington and Liverpool Street. If the ground truth says "take the Hammersmith & City" and the model predicts "take the Circle", the evaluation marks it wrong despite the routes being functionally identical. The same applies anywhere multiple routes exist between the same OD pair.
Second, "fewest transfers" is not always "fastest". Inter-station travel times vary enormously. A journey from Upton Park to Borough is quicker if you change to the Jubilee at West Ham (its fast run to London Bridge more than compensates for the transfer penalty) than if you stay on the District or Hammersmith & City to a more direct interchange. The BFS ranking ignores this entirely, treating all edges as equal length.
The GTFS data already contains inter-station travel times from the TfL API, but the routing pipeline throws them away.
Next steps
The architectural comparison has served its purpose. The shared encoder works, the three decoder granularities produce the expected hierarchy of difficulty, and the regularisation approach is sound. What needs to change is what the models are trained on and how they are evaluated.
The next post will cover:
- Weighted routing: replacing the naive BFS with shortest-path routing over actual travel times, with a configurable transfer penalty (a rough sketch follows this list).
- Multi-route training: storing all valid routes per OD pair and sampling uniformly during training, so the model learns the full distribution rather than a single arbitrary "best".
- Topological evaluation: scoring predicted routes by whether they are actually traversable on the graph, rather than whether they match one specific label.
- Travel-time-aware encoding: adding inter-station travel time as an edge feature so the encoder can distinguish fast lines from slow ones.
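As a rough indication of where the weighted routing is headed, here's a sketch using networkx's Dijkstra over (station, line) nodes, with travel-time edge weights and a flat transfer penalty. The data layout and the penalty value are placeholders, not decisions:

```python
import networkx as nx

TRANSFER_PENALTY = 300  # seconds; would be configurable via the TOML profile

def build_weighted_graph(edges, interchanges):
    """edges: iterable of (station_a, station_b, line, travel_time_s);
    interchanges: dict of station -> list of lines serving it."""
    g = nx.Graph()
    for a, b, line, t in edges:
        # One node per (station, line) pair, so changing line has an explicit cost.
        g.add_edge((a, line), (b, line), weight=t)
    for station, lines in interchanges.items():
        for i, l1 in enumerate(lines):
            for l2 in lines[i + 1:]:
                g.add_edge((station, l1), (station, l2), weight=TRANSFER_PENALTY)
    return g

def fastest_route(g, origin_nodes, dest_nodes):
    # Dijkstra over travel time, taking the best (origin-line, destination-line) pairing.
    best = None
    for o in origin_nodes:
        for d in dest_nodes:
            try:
                length, path = nx.single_source_dijkstra(g, o, d, weight="weight")
            except nx.NetworkXNoPath:
                continue
            if best is None or length < best[0]:
                best = (length, path)
    return best
```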