The previous post argued that shipping a suite of models reduces to shipping a software distribution. This post specifies what that means for how we think about pipelines.
The pipeline as assemblage
A pipeline is a concept. It can stay informal, but it is most efficient to formalise it: then we can standardise and operationalise via automation, or run it "hands off". Whether tightly organised or unstructured, a data pipeline has recognisable traits:
- Ingesting some inputs at each step: either pre-formed (a file, which may itself have come from an earlier pipeline step) or sourced on demand (from a web source or a database, say)
- Emitting some outputs at each step: either as side effects on disk or at a network resource such as cloud storage, or as program state held in a variable
- Inputs mutate: not only may a given source change (a dataset gaining new entries, its schema evolving), but the number of sources may change too. Expect a step to go from one source to several, or to consolidate many into one.
So a pipeline must compose inputs and outputs as data and I/O side effects, and adapt to change. That sounds daunting if you don't have a mental model of what the pipeline is (which is natural before you formalise it).
The granularity at which we lock each of these in is the degree to which we can call our model pipeline a deterministic artifact: an assemblage of data and code.
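As a minimal sketch of what locking those in can look like, here is a step whose inputs and outputs are declared up front. The `Step` shape, pandas, and the parquet-on-disk convention are assumptions of mine for illustration, not a prescribed design.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

import pandas as pd  # assumed tabular data, purely for the example


@dataclass(frozen=True)
class Step:
    """One pipeline step: declared inputs, declared outputs, one transform."""
    name: str
    inputs: tuple[Path, ...]    # pre-formed files (or pinned handles to remote sources)
    outputs: tuple[Path, ...]   # the side effects we commit to disk
    transform: Callable[[tuple[pd.DataFrame, ...]], tuple[pd.DataFrame, ...]]


def execute(step: Step) -> None:
    """Read the declared inputs, run the transform, write the declared outputs."""
    frames = tuple(pd.read_parquet(path) for path in step.inputs)
    results = step.transform(frames)
    for frame, path in zip(results, step.outputs):
        frame.to_parquet(path)
```

Nothing the step reads or writes is hidden in ambient state: its I/O is part of its definition, which is what lets us pin it later.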
Determinism and adaptability
- A well-defined data model of the inputs and outputs of each step is a prerequisite for any meaningful mechanistic description of the model.
- Code must be appropriately adaptive to its inputs, not brittle. At first this seems to contradict the previous point (a precisely defined data model leaves no freedom to adapt). But I mean adaptable to change in the sense of how easily we can modify the source without overlooking, and so violating, the assumptions of its previous iterations.
To take a mundane example: if a step is to add a new data source file where previously it took one, we need to maintain both schema compatibility (e.g. the index column to join on) and semantic compatibility (e.g. the index must mean the same thing in both datasets).
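Here is a sketch of the kind of guard this implies, assuming pandas and a join key called `record_id` that exists only for this example:

```python
import pandas as pd

INDEX = "record_id"  # hypothetical join key; the real one lives in your data model


def add_source(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Join a newly added source onto the existing one, guarding both kinds of compatibility."""
    # Schema compatibility: the join key must exist and have a comparable dtype.
    if INDEX not in new.columns:
        raise ValueError(f"new source lacks the join column {INDEX!r}")
    if new[INDEX].dtype != existing[INDEX].dtype:
        raise ValueError("join column dtypes differ; castable is not the same as compatible")
    # Semantic compatibility cannot be checked mechanically, but key overlap is a cheap
    # smoke test: if the keys barely intersect, the "same" index probably means something else.
    overlap = new[INDEX].isin(existing[INDEX]).mean()
    if overlap < 0.5:  # arbitrary threshold, for illustration only
        raise ValueError(f"only {overlap:.0%} of new keys match existing ones; check semantics")
    return existing.merge(new, on=INDEX, how="left")
```

The schema check is mechanical; the semantic check is at best a heuristic, which is exactly why it is the one that gets skipped.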
Naive or careless interventions are what make pipelines adapt poorly to change.
The manifest is the unit of input versioning
As with software, we store the version in the manifest (the manifest may defer to a dynamic version computed in the code artifact, but we still count that as living in the manifest). In package managers, a package cannot ship without a versioned manifest.
Likewise, treating the model-data assemblage as software, the singular corpus is what I am calling a monolithic data source. Of course, real-world data sources are never a single file, and as discussed above we do not wish to be limited even to a fixed number of sources. Yet we can always list the entirety of these sources out as a single manifest.
This definitional resolution implies that a manifest ought also to double as an interface (though I would say this is rare to see). In many ways that would be ideal: unifying the enumeration of inputs with the means to access their content. It might look like a "data access layer". In reality most people write ad-hoc interfaces.
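To make that concrete, here is a minimal sketch of a manifest that doubles as the access layer; the TOML layout, field names, placeholder hashes, and `load` helper are all hypothetical:

```python
import hashlib
import tomllib  # stdlib TOML parser, Python 3.11+
from pathlib import Path

import pandas as pd

# Hypothetical manifest: every source the pipeline ingests, enumerated and pinned.
MANIFEST = """
[sources.customers]
path = "data/customers.parquet"
version = "2024-06-01"
sha256 = "0000000000000000000000000000000000000000000000000000000000000000"

[sources.orders]
path = "data/orders.parquet"
version = "2024-06-01"
sha256 = "0000000000000000000000000000000000000000000000000000000000000000"
"""

SOURCES = tomllib.loads(MANIFEST)["sources"]


def load(name: str) -> pd.DataFrame:
    """The manifest doubles as the data access layer: one entry, one way in."""
    entry = SOURCES[name]
    raw = Path(entry["path"]).read_bytes()
    if hashlib.sha256(raw).hexdigest() != entry["sha256"]:
        raise ValueError(f"{name}: bytes on disk do not match the pinned hash")
    return pd.read_parquet(entry["path"])
```

Calling `load("orders")` is then both the enumeration and the access: there is no second, unversioned path to the data.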
One notable exception has been the pyproject.toml manifest, which has made Python software substantially more reproducible. Yet for data interfaces there is no comparable standardisation.
Flow architectures
In James Urquhart's Flow Architectures, the conundrum of how to capture real-time data as a static asset is solved quite matter-of-factly through queues. Queueing commits a stream to an immutable log which can be versioned. The flow itself remains Heraclitean while its record does not.
My framing applies to data already at rest, or that can be made so. The manifest I describe assumes you can enumerate your inputs and pin them. If your inputs arrive as unbounded streams, you are in the flow architect's territory—but the moment you need to train on them, you freeze them, and then you are back in mine.
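A sketch of that freeze, assuming an iterable of timestamped events and a caller-chosen cutoff (the snapshot naming and JSON format are illustrative only):

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable


def freeze(events: Iterable[dict], cutoff: str, out_dir: Path) -> dict:
    """Commit the stream-so-far to an immutable, content-addressed snapshot."""
    # Materialise everything up to the cutoff (ISO timestamps assumed, so string
    # comparison works). The flow keeps flowing; the record does not.
    frozen = [event for event in events if event["timestamp"] <= cutoff]
    payload = json.dumps(frozen, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    path = out_dir / f"snapshot-{cutoff}-{digest[:12]}.json"
    path.write_bytes(payload)
    return {"path": str(path), "cutoff": cutoff, "sha256": digest}
```

The returned record is precisely the kind of entry the manifest above can enumerate and pin.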
The virtue of thinking in distributions is that they are a solved problem. We know how to ship distributions: we version packages, pin dependencies, cut releases, manage compatibility. The conceptual work is recognising that model pipelines are distributions; once you see that, you stop treating reproducibility as a quixotic goal and start treating it as operations.
Questions that seemed hard dissolve. "What data did this train on?" is in the manifest. "Can I reproduce this run?" depends on whether you committed the lockfile. "Which model is in production?" is which version you deployed. These are not ML questions. They are packaging questions, and packaging has straightforward answers.