Setting out the problem

From API timetables to graph objects, and three ways to predict a route

Background
From GTFS to graphs
Not another web app
Three models
Deterministic by design

Background

Tubeulator is a library I wrote that code-generates up-to-date Python interfaces from TfL's OpenAPI schemas and handles authentication for you. That side of things, the data engineering, is not particularly interesting in itself. What makes it worth writing about is what I wanted to do with the data.

TfL used to publish timetable files as zips in an S3 bucket. These stopped being updated during the pandemic, one of its many casualties. In 2022, Kurt Raschke wrote about the withdrawal, confirming via FOI that the dataset was gone for good. Using Tubeulator, I was able to reconstruct a GTFS feed directly from the API's timetable endpoints, clocking in at 270 stops, 44,212 trips, and 1,349,594 stop_times across the eleven Underground lines on a single day.

  bakerloo...
  central...
  circle...
  district...
  hammersmith-city...
  jubilee...
  metropolitan...
  northern...
  piccadilly...
  victoria...
  waterloo-city...
Writing GTFS zip...
Done: /home/louis/dev/tubeulator-models/data/tfl_station_data_gtfs.zip
  270 stops, 44,212 trips, 1,349,594 stop_times

From GTFS to graphs

City2Graph can load a GTFS zip and produce a travel summary graph as a pair of GeoDataFrames (nodes with coordinates, edges with travel time and frequency), and from there convert into NetworkX multigraphs, rustworkx graphs, or PyTorch Geometric Data objects. It can also filter a graph to a named geographic boundary via OSMnx geocoding, which is useful for clipping out intercity and international services that happen to originate at London stations.
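To make "travel summary graph" concrete, here is a minimal standard-library sketch of the aggregation idea: consecutive stops on each trip become an edge, and each edge carries a mean travel time and a trip count. This is not City2Graph's implementation or API. The function names, the row layout, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def to_seconds(hhmmss: str) -> int:
    """Parse a GTFS time such as '25:10:00' (hours may exceed 23)."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


def travel_summary(stop_times):
    """Aggregate consecutive stop pairs into edges.

    stop_times: iterable of (trip_id, stop_sequence, stop_id, arrival, departure),
    a simplified, hypothetical view of the rows in GTFS stop_times.txt.
    """
    by_trip = defaultdict(list)
    for trip_id, seq, stop_id, arrival, departure in stop_times:
        by_trip[trip_id].append((seq, stop_id, to_seconds(arrival), to_seconds(departure)))

    durations = defaultdict(list)
    for calls in by_trip.values():
        calls.sort()  # order each trip's calls by stop_sequence
        for (_, a, _, dep), (_, b, arr, _) in zip(calls, calls[1:]):
            durations[(a, b)].append(arr - dep)

    return {
        edge: {"mean_travel_time_s": mean(times), "frequency": len(times)}
        for edge, times in durations.items()
    }


# Made-up rows for illustration only, not real TfL data.
rows = [
    ("t1", 1, "A", "08:00:00", "08:00:30"),
    ("t1", 2, "B", "08:02:30", "08:03:00"),
    ("t1", 3, "C", "08:05:00", "08:05:30"),
    ("t2", 1, "A", "08:10:00", "08:10:30"),
    ("t2", 2, "B", "08:12:50", "08:13:20"),
]
edges = travel_summary(rows)
```

Here the A to B edge is served by two trips and B to C by one, so frequency falls out of the same pass that computes travel time.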

The pipeline is exposed as CLI entry points, which is how I like to structure pipelines in Python:

tm-build-gtfs fetches timetables and writes the GTFS zip,
tm-gtfs2graph produces GeoParquet of nodes and edges projected to British National Grid,
tm-graph2pyg converts those into a PyG Data object, and
tm-gtfs2pyg runs the lot end-to-end.

[Figure: From Transit System to Learning Problem]

Not another web app

My original goal was a journey planning web app, which was fun to build in React until it inevitably ran into the missing central problem: routing. Starting from a read-only visualisation of the trains on the network assumes that journey planning will be a trivial addition. I initially tried to sketch a simple planner in JavaScript at render time, but common sense told me this was a backend concern.

This time I'm inverting my approach: build the routing mechanism and the interface will come.

Three models

Rather than prejudice the outcome by deciding upfront how best to model the Tube, I'm developing three graph attention models that predict routes, each at a different granularity.

Line-sequence model

This will be the coarsest, outputting a short sequence of (line, direction) pairs: equivalent to "take the Jubilee westbound, then the Northern northbound", with nothing about where to change or which intermediate stations you pass through. This mirrors how most Londoners think about their journeys in practice.

Interchange-station model

This adds one piece of information per leg: where to transfer. Its output is a sequence of (line, direction, interchange station) tuples. An instruction like "change from District to Northern" is underspecified when Monument and Embankment are both options and the choice affects journey length; naming the interchange resolves that ambiguity.

Full station-sequence model

This predicts every station from origin to destination. It is the most expressive of the three and has the longest output sequences, so I'd expect it to be both the most useful and the one with the most ways to be wrong.

All three can be expressed via a graph neural network encoder trained over the station graph, and they differ only in the decoder head, which keeps the comparison controlled.
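One way to see how the three granularities relate is that the coarser outputs are projections of the finest one. The sketch below is a hedged illustration of the target formats only, not of the models or their decoders. The route is a hand-written example, and the helper names and the tuple layout are assumptions. In particular, I assume the last tuple of the interchange sequence names the destination, since there is no further change to make.

```python
from itertools import groupby

# Illustrative route as (station, line, direction) steps. The line and direction
# describe the train taken to reach that station (None at the origin).
full_route = [
    ("Canary Wharf", None, None),
    ("Canada Water", "jubilee", "westbound"),
    ("Bermondsey", "jubilee", "westbound"),
    ("London Bridge", "jubilee", "westbound"),
    ("Bank", "northern", "northbound"),
    ("Moorgate", "northern", "northbound"),
]


def station_sequence(route):
    """Finest granularity: every station from origin to destination."""
    return [station for station, _, _ in route]


def interchange_sequence(route):
    """Middle granularity: (line, direction, station where that leg ends)."""
    legs = []
    # Group consecutive steps that share a line and direction into one leg.
    for (line, direction), steps in groupby(route[1:], key=lambda s: (s[1], s[2])):
        last_station = list(steps)[-1][0]
        legs.append((line, direction, last_station))
    return legs


def line_sequence(route):
    """Coarsest granularity: (line, direction) pairs only."""
    return [(line, direction) for line, direction, _ in interchange_sequence(route)]
```

For this route, line_sequence gives the Jubilee westbound followed by the Northern northbound, and interchange_sequence adds that the change happens at London Bridge.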

[Figure: GATv2 Encoder, Three Decoder Granularities]

Deterministic by design

There are two ways to look at a transit network: a fixed, deterministic view based on connectivity, and an ever-changing, contingent view based on a live timetable.

These models are topology-based, not timetable-based: their input graph is defined solely by stations and their line connectivity. There is no representation of departure times or delays, and edge features are purely structural (line identity, transfer flags, mode type).
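As a rough sketch of what "purely structural" edge features could look like, the snippet below encodes line identity, a transfer flag, and mode type as a flat vector. The vocabularies, the feature layout, and the choice to give an interchange edge no line are all assumptions for illustration; the actual encoding may differ.

```python
# Assumed vocabularies for illustration; the real models may use different ones.
LINES = ["district", "jubilee", "northern"]
MODES = ["tube", "walk"]


def one_hot(value, vocabulary):
    """Return a one-hot list; an unknown or missing value gives all zeros."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def edge_features(line, is_transfer, mode):
    """Concatenate line identity, transfer flag, and mode type."""
    return one_hot(line, LINES) + [float(is_transfer)] + one_hot(mode, MODES)


# Two example edges: a running edge on the District line and an interchange edge.
structural_edges = [
    ("Westminster", "Embankment", edge_features("district", False, "tube")),
    ("Monument", "Bank", edge_features(None, True, "walk")),
]
```

Nothing in the vector depends on the clock, which is the point: the same graph is fed to the model whatever the time of day.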

[Figure: Deterministic by Design: Structural vs Live Timetable]

Models that give the same answer on a Tuesday morning as on a Sunday evening are less fragile, and I'd suggest this is often what people want. I personally find the experience of travel apps changing their routes ten minutes before departure vs. five minutes later particularly frustrating as it often leaves me racking my brains for what the now-hidden route is as I head out the door. It's a little paternalistic, as of course the algorithm must know best, or perhaps it is rerouting you For The Greater Good.

That kind of instability is something I expressly do not want. The cost is that these models cannot say which route is fastest right now, only what routes exist, which is knowledge that only changes when TfL changes the physical network.

The timetable-based view is what production journey planners use (RAPTOR, CSA, and the rest of the classical algorithms), and it can be layered on top. But the structural model comes first, as it's what you can reason about stably.
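For contrast, here is a minimal sketch of the basic Connection Scan Algorithm (CSA) for earliest arrival, without transfer times or footpaths. It is a textbook-style simplification, not what any particular journey planner ships. The connection tuples and times (minutes after midnight) are made up.

```python
def earliest_arrival(connections, origin, destination, depart_at):
    """Basic Connection Scan for earliest arrival.

    connections: iterable of (dep_stop, arr_stop, dep_time, arr_time).
    Returns the earliest arrival time at destination, or None if unreachable.
    """
    inf = float("inf")
    earliest = {origin: depart_at}
    # Scan connections once in order of departure time.
    for dep_stop, arr_stop, dep_time, arr_time in sorted(connections, key=lambda c: c[2]):
        reachable = earliest.get(dep_stop, inf) <= dep_time
        if reachable and arr_time < earliest.get(arr_stop, inf):
            earliest[arr_stop] = arr_time
    return earliest.get(destination)


# Made-up timetable: one route from A to C via B, a later one via D.
timetable = [
    ("A", "B", 480, 485),
    ("B", "C", 486, 495),
    ("A", "D", 500, 505),
    ("D", "C", 506, 520),
]
leave_at_0800 = earliest_arrival(timetable, "A", "C", 480)
leave_at_0810 = earliest_arrival(timetable, "A", "C", 490)
```

Leaving ten minutes later misses the connection via B and the answer switches to the route via D. That time dependence is exactly what the structural models leave out.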