The hypothesis
The previous post ended with a compelling argument on paper.
The station model's ceiling is set by autoregressive error compounding: a 30-step sequence with 97% per-token accuracy yields only ~40% exact match (0.97^30 ≈ 0.40).
The change model makes 3–4 predictions per route instead of 30–50.
If those predictions are right, the stations between each pair of interchanges are deterministic — just walk the line_adj graph.
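The compounding arithmetic behind this hypothesis can be checked directly. A quick sketch (the 97%/30-step figures are from above; the 90% per-leg accuracy is illustrative):

```python
def exact_match(per_step_acc: float, steps: int) -> float:
    """Probability of a fully correct sequence under independent per-step errors."""
    return per_step_acc ** steps

# Station model: ~30 token-level decisions at 97% each.
print(f"30 steps @ 97%: {exact_match(0.97, 30):.0%}")   # ~40%

# Change model: 3-4 leg-level decisions. Even at a lower
# (illustrative) 90% per-leg accuracy, compounding hurts far less.
print(f" 4 steps @ 90%: {exact_match(0.90, 4):.0%}")    # ~66%
```

Fewer decisions means less compounding, provided the per-decision accuracy holds up.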
Three architectural changes went in alongside this experiment: a corrected GRU feedback path across all three decoders, an interchange mask on the change model's station head, and label smoothing zeroed for the change model. The full-profile results with these changes landed at:
| Model | Greedy | Beam |
|---|---|---|
| line | 63.8% | 74.8% |
| change | 55.3% | 66.4% |
| station | 70.1% | 79.1% |
The station model barely moved from the previous post (69.3% → 70.1% greedy, 81.0% → 79.1% beam — within noise). The GRU fix didn't help it, confirming the ceiling is structural rather than architectural. Its stratified breakdown tells the story plainly: 88–93% on short routes (≤10 stations), 74% on medium (11–20), 48% on long (21–30), 30% on very long (31–50).
The change model at 66.4% beam gave the hierarchical approach its shot. Two-thirds of routes have a correct leg-level prediction somewhere in the beam. If the graph walk can convert those into full station sequences, hierarchical decoding could match or beat the station model's 79.1%.
The GRU feedback fix
A quick aside on the architectural change, since it affected all three models and the line model showed the clearest benefit.
The original decoders called self.gru(h, h) — passing the hidden state as both input and hidden state to the GRU cell — then added the feedback embedding via residual connection afterwards.
The GRU cell's signature is GRUCell(input, hidden), and by giving it the same tensor for both, the reset and update gates never see what the model just predicted.
The feedback arrives too late for the gating mechanism to condition on it.
The fix introduces a learned start_input parameter for the first step, then feeds each step's feedback embedding as the GRU input with the recurrent state as hidden.
The gates can now directly condition on the previous prediction.
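The rewiring can be illustrated with a minimal hand-rolled GRU cell (the real decoders use a framework `GRUCell`; the weights, sizes, and variable names here are illustrative):

```python
import numpy as np

def gru_cell(x, h, W, U):
    """One GRU step. The gates are functions of BOTH the input x and hidden h."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ W["z"] + h @ U["z"])            # update gate sees x
    r = sigmoid(x @ W["r"] + h @ U["r"])            # reset gate sees x
    h_tilde = np.tanh(x @ W["h"] + (r * h) @ U["h"])
    return (1 - z) * h + z * h_tilde

d = 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d, d)) for k in "zrh"}
U = {k: rng.normal(size=(d, d)) for k in "zrh"}

h = np.zeros((1, d))
feedback = rng.normal(size=(1, d))   # embedding of the previous prediction

# Before: gru_cell(h, h) means the gates only ever see the hidden state;
# the feedback embedding was bolted on afterwards as a residual.
h_buggy = gru_cell(h, h, W, U) + feedback

# After: the feedback embedding is the input, so the reset/update
# gates condition directly on what the model just predicted.
h_fixed = gru_cell(feedback, h, W, U)
```

In the buggy call, `z` and `r` are computed before the feedback ever enters the cell, so no amount of training can make the gating depend on the previous prediction.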
The line model showed the cleanest signal: it reached comparable accuracy in half the epochs on the dev profile. The station model was unaffected — its bottleneck is sequence length, not per-step decision quality.
The interchange mask
The change model's station_head previously scored all 272 stations at each step.
But the valid interchange for a given leg must be on the predicted line, and we already know which stations serve which lines.
A (n_lines, n_stations) boolean mask constrains the station head to ~35 candidates per line instead of 272.
During training, the mask follows the teacher's line prediction (same principle as adjacency masking in the station model — the teacher sequence is always valid, so the correct station is always unmasked). At inference, it follows the model's own line prediction.
Label smoothing was zeroed for the same reason as the station model: smoothing assigns target probability to masked-out stations with logits of −∞, producing an infinite, irrecoverable loss. The mask is the regulariser now.
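The masking mechanics can be sketched in a few lines (the membership data here is a toy; the real mask is built from the network's line/station relations, and the line count is illustrative):

```python
import numpy as np

n_lines, n_stations = 11, 272

# Toy membership matrix: line_serves[l, s] is True if line l serves station s.
line_serves = np.zeros((n_lines, n_stations), dtype=bool)
line_serves[3, [10, 42, 99]] = True    # pretend line 3 serves three stations

def mask_station_logits(logits, line_id):
    """Constrain the station head to stations on the given line.

    During training line_id comes from the teacher sequence; at
    inference it follows the model's own line prediction.
    """
    return np.where(line_serves[line_id], logits, -np.inf)

logits = np.random.default_rng(0).normal(size=n_stations)
masked = mask_station_logits(logits, line_id=3)
# With label smoothing > 0, target mass would land on the -inf logits
# and make the loss infinite, which is why smoothing is zeroed.
```

Softmax over the masked logits puts all probability mass on the line's own stations, shrinking the effective candidate set from 272 to roughly the per-line count.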
Building the pipeline
The hierarchical decode pipeline has no trainable components. It takes the change model's beam output and attempts to expand each hypothesis into a full station sequence:
- Parse the beam tokens into (line, direction, interchange) legs.
- Stop parsing once the predicted interchange matches the destination.
- For each leg, BFS along line_adj from the current station to the interchange on the predicted line.
- Concatenate the leg segments, deduplicating at interchange stations.
- Take the first beam hypothesis that produces a valid walk.
An additional repair step checks whether the predicted line actually serves the current station. If not, it looks for a line that serves both the current station and the predicted interchange, and substitutes it. This catches cases where the change model names the wrong line but picks the right interchange — a common failure mode on shared-track segments.
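The expansion and repair steps can be sketched as follows (the graph here is a tiny toy; the real pipeline works on the full line_adj structure, and the data-structure shapes and function names are assumptions):

```python
from collections import deque

# Toy data: line_adj[line][station] -> adjacent stations on that line,
# station_lines[station] -> lines serving it.
line_adj = {
    "piccadilly": {
        "wood_green": ["turnpike_lane"],
        "turnpike_lane": ["wood_green", "manor_house"],
        "manor_house": ["turnpike_lane", "finsbury_park"],
        "finsbury_park": ["manor_house"],
    },
}
station_lines = {s: ["piccadilly"] for s in line_adj["piccadilly"]}

def walk_leg(line, start, goal):
    """BFS along a single line's adjacency from start to goal."""
    if line not in station_lines.get(start, []):
        return None                       # line doesn't serve the start
    prev, frontier = {start: None}, deque([start])
    while frontier:
        s = frontier.popleft()
        if s == goal:                     # reconstruct the path backwards
            path = []
            while s is not None:
                path.append(s)
                s = prev[s]
            return path[::-1]
        for nxt in line_adj[line].get(s, []):
            if nxt not in prev:
                prev[nxt] = s
                frontier.append(nxt)
    return None                           # unreachable on this line

def expand(legs, origin):
    """Concatenate BFS walks for each (line, interchange) leg."""
    route, current = [origin], origin
    for line, interchange in legs:
        # Repair: if the predicted line doesn't serve the current station,
        # substitute a line serving both endpoints (simplified version of
        # the repair step described above).
        if line not in station_lines.get(current, []):
            shared = [l for l in station_lines.get(current, [])
                      if l in station_lines.get(interchange, [])]
            if not shared:
                return None
            line = shared[0]
        seg = walk_leg(line, current, interchange)
        if seg is None:
            return None
        route += seg[1:]                  # dedupe at the interchange
        current = interchange
    return route
```

A wrong-line leg like ("victoria", "finsbury_park") from wood_green gets repaired to the Piccadilly here, because that is the only line serving both endpoints; when both the line and the interchange are wrong, expansion returns None.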
The results
A diagnostic run on 200 val examples, comparing hierarchical decode against the station model on the same sample:
- Hierarchical: 115/200 (57.5%)
  - expand failed: 52 (26.0%)
  - wrong route: 33 (16.5%)
- Station model: 159/200 (79.5%)
The station model wins by 22 points.
What went wrong
The failures break into two clean categories, and examining them explains why the hypothesis failed.
Expansion failures (26%) are cases where no beam hypothesis could be walked on the graph. These are the change model getting the legs genuinely wrong — predicting "northern" from Hainault (which is on the Central line), or routing Wood Green to Cockfosters via three transfers when it's four stops on the Piccadilly. The line repair logic catches some of these, but when the model picks both the wrong line and the wrong interchange, there's nothing to repair. This is the change model's 34% beam miss rate manifesting as route-level failures.
Wrong route (16.5%) splits into two subcategories. Some are branch ambiguity: the Northern line forks north of Camden Town, and BFS returns whichever branch it finds first. If the ground truth goes via Belsize Park and BFS walks via Kentish Town, the route is valid but doesn't match. Others are valid-but-unlabelled routes: the model predicts a real journey that isn't among the 3 stored label variants per OD pair. Liverpool Street → Clapham South via the Waterloo & City line and Northern line is a real route — it just wasn't enumerated.
Why the station model wins
The arithmetic that motivated hierarchical decoding was correct but incomplete.
Yes, a 4-step sequence with 90% per-step accuracy gives ~66% exact match, versus ~4% for a 30-step sequence. But the change model doesn't have 90% per-step accuracy: it has 66.4% beam coverage, meaning 34% of routes have no correct hypothesis in the beam at all. And when a leg is wrong, the entire route fails. There's no partial credit, no graceful degradation.
The station model's errors are different in kind. It compounds small per-token errors over many steps, but adjacency masking means each error only sends the route one station off course. A wrong turn at step 15 doesn't invalidate steps 1–14. Beam search can recover by maintaining hypotheses that took the right turn. The station model's errors are local; the change model's errors are global.
The breakeven point would require the change model to reach roughly 85%+ beam accuracy on the leg-level predictions. At 66%, the leg-level error rate (34%) dominates the station model's compounding error rate.
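A back-of-envelope way to see the breakeven, using the diagnostic-run numbers above and assuming expansion keeps converting covered routes at its observed rate (a crude model, since conversion likely varies with route length):

```python
# Figures from the 200-sample diagnostic run.
beam_coverage = 0.664      # routes with a correct leg hypothesis in the beam
hier_accuracy = 0.575      # hierarchical exact match on the sample
station_accuracy = 0.795   # station model on the same sample

# Assume expansion converts covered routes at its observed rate.
conversion = hier_accuracy / beam_coverage       # ~87% of covered routes walk
needed_coverage = station_accuracy / conversion  # ~92% coverage to break even

print(f"conversion rate: {conversion:.0%}")
print(f"needed coverage: {needed_coverage:.0%}")
```

Under this simple model the required beam coverage lands in the high 80s to low 90s, consistent with the 85%+ figure above.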
What this means
The hierarchical approach was the right experiment to run and the wrong bet to make. The change model is too inaccurate at the leg level for deterministic expansion to work. It would need to roughly halve its beam miss rate, from 34% down to 15% or less (i.e. from 66% to 85%+ beam coverage), before hierarchical decoding becomes competitive.
The station model at 70.1% greedy / 79.1% beam is the best route predictor in this system. Its ceiling is real — long routes above 30 stations are essentially unsolved — but on the 90% of routes that are under 20 stations, it's performing well.
The change model remains useful for what it was designed for: telling you which lines to take and where to change. At 55.3% greedy it gets the abstract itinerary right more than half the time, and at 66.4% beam the correct answer is usually in the top 5. That's a perfectly reasonable journey planner output — you just shouldn't try to reconstruct the full station sequence from it.
The code is at lmmx/tubeulator-models#4.