Thinking in Distros

What it means to treat a pipeline as a distribution

The previous post argued that shipping a suite of models reduces to shipping a software distribution. This post specifies what that means for how we think about pipelines.

The pipeline as assemblage

A pipeline is a concept. It can be informal, but it is most efficient to formalise it: then we can standardise and operationalise via automation, or run it "hands off". Whether tightly organised or unstructured, a data pipeline has recognisable traits: it consumes inputs, it produces outputs (as data and I/O side effects), and it must adapt to change.

So a pipeline must compose inputs and outputs, as data and I/O side effects, and it must adapt to change. That sounds daunting if you don't yet have a mental model of what the pipeline is (which is natural before you formalise it).

The granularity at which we lock in each of these traits determines the degree to which we can treat our model pipeline as a deterministic artifact: an assemblage of data and code.
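
To make that concrete, here is a minimal sketch of a pipeline step that declares its inputs, outputs, and side effects up front; the names and structure are mine, purely illustrative. The more precisely these are pinned, the closer the pipeline is to a deterministic assemblage.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Step:
    name: str
    inputs: tuple[str, ...]        # data the step reads (paths, table names, ...)
    outputs: tuple[str, ...]       # data the step writes
    side_effects: tuple[str, ...]  # other I/O it performs (logs, caches, uploads)
    run: Callable[[], None]

def composes(upstream: Step, downstream: Step) -> bool:
    # Two steps compose when the upstream's outputs cover the downstream's inputs.
    return set(downstream.inputs) <= set(upstream.outputs)
```

Composition then becomes a property you can check, rather than an accident of whatever the code happens to read and write.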

Determinism and adaptability

To take a mundane example: if a step comes to take a second data source file where previously it took one, we need to maintain both schema compatibility (e.g. both sources carry the index column we join on) and semantic compatibility (e.g. that index must mean the same thing in both datasets).
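
Here is a hedged sketch of the two checks, using pandas and hypothetical file and column names. Schema compatibility is mechanical; semantic compatibility can only be approximated, here by requiring the key sets to overlap.

```python
import pandas as pd

def check_compatible(existing: pd.DataFrame, incoming: pd.DataFrame, index_col: str) -> None:
    # Schema compatibility: the join column must exist in both, with matching dtypes.
    if index_col not in existing.columns or index_col not in incoming.columns:
        raise ValueError(f"both sources must carry the index column {index_col!r}")
    if existing[index_col].dtype != incoming[index_col].dtype:
        raise ValueError("index column dtypes differ; the join will misbehave")

    # Semantic compatibility (proxy check): if the keys mean the same thing,
    # we expect the two key sets to overlap substantially.
    if not set(existing[index_col]) & set(incoming[index_col]):
        raise ValueError("no shared index values; the keys may not mean the same thing")

# Usage, with hypothetical files:
# check_compatible(pd.read_csv("orders.csv"), pd.read_csv("orders_extra.csv"), "order_id")
```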

Naive or careless interventions are what make pipelines adapt poorly to change.

The manifest is the unit of input versioning

As with software, we store the version in the manifest (the version may be computed dynamically from the code artifact, but we still count it as living in the manifest). In package managers, a package cannot ship without a versioned manifest.

Likewise, for the model-data assemblage as software, the singular corpus is what I am calling a monolithic data source. Of course, real-world data sources are never a single file, and as discussed above we do not wish to be limited even to a fixed number of sources. Yet we can always enumerate the entirety of these sources in a single manifest.
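
For illustration, such a manifest could look like the following. The format and field names are my own assumptions rather than any standard; the point is only that every source is pinned by location and content hash, so the whole input set is versioned as one document.

```python
import tomllib  # standard library from Python 3.11

MANIFEST = """
[dataset]
name = "churn-corpus"
version = "2024.06.0"

[[dataset.sources]]
id = "transactions"
uri = "s3://example-bucket/transactions/2024-06.parquet"
sha256 = "<content hash>"

[[dataset.sources]]
id = "accounts"
uri = "s3://example-bucket/accounts/2024-06.parquet"
sha256 = "<content hash>"
"""

manifest = tomllib.loads(MANIFEST)
for source in manifest["dataset"]["sources"]:
    print(source["id"], "pinned at", source["uri"])
```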

This definitional resolution implies that a manifest ought also to double as an interface, though I would say this is rare in practice. In many ways that would be ideal: unifying the enumeration of inputs with the means to access their content. This might look like a "data access layer". In reality, most people write ad-hoc interfaces.
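
A sketch of what that unification might look like, assuming the sources are local files enumerated by a manifest like the one above; the class and method names are hypothetical, not an established API.

```python
from pathlib import Path

import pandas as pd

class DataAccessLayer:
    """Unifies the manifest's enumeration of inputs with the means to load them."""

    def __init__(self, sources: dict[str, str]):
        # Maps a logical source id to its pinned location, as listed in the manifest.
        self._sources = sources

    def list_sources(self) -> list[str]:
        # Enumeration: the manifest half of the interface.
        return sorted(self._sources)

    def load(self, source_id: str) -> pd.DataFrame:
        # Access: the interface half of the manifest.
        path = Path(self._sources[source_id])
        if path.suffix == ".parquet":
            return pd.read_parquet(path)
        return pd.read_csv(path)

# Usage, with hypothetical paths:
# dal = DataAccessLayer({"transactions": "data/transactions.parquet"})
# df = dal.load("transactions")
```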

One notable exception is the pyproject.toml manifest, which has made Python software substantially more reproducible. For data interfaces, though, there is no comparable standardisation.

Flow architectures

In James Urquhart's Flow Architectures, the conundrum of how to capture real-time data as a static asset is solved quite matter-of-factly through queues. Queueing commits a stream to an immutable log, which can be versioned. The flow itself remains Heraclitean while its record does not.

My framing applies to data already at rest, or data that can be made so. The manifest I describe assumes you can enumerate your inputs and pin them. If your inputs arrive as unbounded streams, you are in the flow architect's territory; the moment you need to train on them, though, you freeze them, and then you are back in mine.
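
As a sketch of that freeze step, the following drains whatever a queue has delivered so far into an immutable, content-addressed snapshot that a manifest can then pin. The in-memory queue is a stand-in for a real log such as Kafka read up to a recorded offset.

```python
import hashlib
import json
from pathlib import Path
from queue import Empty, Queue

def freeze(stream: Queue, out_dir: Path) -> Path:
    # Drain the records the queue holds right now; assumes they are JSON-serialisable.
    records = []
    while True:
        try:
            records.append(stream.get_nowait())
        except Empty:
            break
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode()
    digest = hashlib.sha256(payload).hexdigest()[:16]
    out_dir.mkdir(parents=True, exist_ok=True)
    snapshot = out_dir / f"snapshot-{digest}.jsonl"  # file name carries the content hash
    snapshot.write_bytes(payload)
    return snapshot  # pin this path (and its hash) in the manifest
```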

The virtue of thinking in distributions is that distribution is a solved problem. We know how to ship distributions: we version packages, pin dependencies, cut releases, and manage compatibility. The conceptual work is recognising that model pipelines are distributions. Once you see that, you stop treating reproducibility as a quixotic goal and start treating it as operations.

Questions that seemed hard dissolve. "What data did this train on?" is in the manifest. "Can I reproduce this run?" depends on whether you committed the lockfile. "Which model is in production?" is which version you deployed. These are not ML questions. They are packaging questions, and packaging has straightforward answers.
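
To make the reproducibility question mechanical, a committed lockfile mapping each input to its content hash is enough. The format below is hypothetical, but the check is just hashing and comparing.

```python
import hashlib
import json
from pathlib import Path

def verify_lockfile(lockfile: Path) -> bool:
    # Hypothetical format: {"data/train.parquet": "<sha256 hex digest>", ...}
    locked = json.loads(lockfile.read_text())
    for path, expected in locked.items():
        actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if actual != expected:
            return False  # inputs have drifted; the run is not reproducible as-is
    return True
```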