Data Versioneering

Frozen data as code is the logical endpoint of versioning it


No man ever steps in the same river twice.

— Heraclitus

The fluid, real-time communication of system state is much like the flow of water in a river, or the flow of traffic in a highway system; it is a flow of activity.

— James Urquhart, Flow Architectures (O'Reilly, 2021)


If the river were to freeze, would Heraclitus be able to step in the same river twice? Obviously.


Data sources should be immutable in model training pipeline code, not configurable at runtime.

This may strike you as either trivial or totally off-piste (the latter was my original view of this idea).

I believe that hardcoding corpus version specifiers in an immutable data registry is good, and a logical endpoint of the frequently invoked wisdom to "version your data".

Data as code implies data should not be configurable

This conclusion falls out of using the proper vocabulary for what it is we do when we write data pipelines.

A data store stops being merely a "source directory" in the mundane sense once that directory stops being overwritten.

For example:
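A minimal sketch in Python, with every name here (the modules, the registry entries, the corpus identifiers and storage paths) hypothetical: the pipeline reads a frozen mapping baked into the codebase rather than a path supplied at runtime.

```python
# data_registry.py -- hypothetical immutable registry, frozen into the codebase.
# Entries name fixed corpus snapshots; new versions are appended, existing ones never overwritten.
CORPORA = {
    "reviews-v1": "s3://corpora/reviews/2024-01-15/",
    "reviews-v2": "s3://corpora/reviews/2024-06-01/",
}

# train.py -- the training code hardcodes the version it consumes.
CORPUS_VERSION = "reviews-v2"  # changing this is a code change, reviewed and versioned like any other


def corpus_location() -> str:
    """Resolve the pinned corpus; there is no runtime flag to point the pipeline elsewhere."""
    return CORPORA[CORPUS_VERSION]
```

The contrast with a runtime-configurable source is that nothing outside the repository can change what the model trains on.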

A similar distinction is often drawn as "cattle, not pets", but to me this hand-waves past the differentiator. Cattle are considered interchangeable, while each pet is a unique, named individual. What makes the difference?

The Heraclitean river provides a criterion: what distinguishes versionable data from the undifferentiated stream is whether we can name a fixed state. The river cannot be fixed, so the most we can say about it is the state of affairs at the moment of observation—fundamentally interchangeable with any other moment. Freeze the river and it becomes singular, nameable, retrievable. Likewise the Ship of Theseus: the puzzle dissolves if you version the ship. Theseus-v1 and Theseus-v2 are distinct artifacts that happen to share a lineage.

For any scenario where we care about handling data deliberately, we are dealing with immutably versioned registries. "Version your data" is not a descriptive measure (like "today's shipment of soup" on the supermarket shelf), but a naming convention created to constrain the model training program such that it is deterministic and reproducible.

To reiterate: versioning data is not merely descriptive, it is operational.

The reluctant data versioneer

I'm wary of speaking about versioning data, since it implies a connection between the model training code version and the data registry version (a data version increment constituting a "breaking change"). At first glance, reality seems to ridicule this connection.

If data versions and code versions are tied, then how would anyone ever ship a model without a prohibitively involved versioning system?

This was my misgiving. Having given it further thought, I can see how it would be possible, and whether you tie the two is mainly down to how rigorous you wish to be.

Tying the model version to the data version is defensible if (and only if) you have a singular data source. In other words, one corpus version begets one model version, which implies a one-to-one relation of model training package code to corpus.

This might be by value (a directory named model-a/v1, or a materialised registry on disk), or by reference (a git tag model-a-v1 with an associated commit hash).
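A sketch of the by-reference form, again in Python and with every name (the pin module, the tag, the hash) hypothetical: the training package for model-a carries exactly one corpus pin, so bumping the corpus means changing, and therefore re-versioning, the package.

```python
# model_a/pins.py -- hypothetical pin module inside the model-a training package.
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusPin:
    """One corpus version per model version: the one-to-one relation, held by reference."""
    corpus: str
    git_tag: str
    commit: str


# model-a v1 trains on exactly this corpus snapshot and nothing else.
# The by-value equivalent would be a materialised directory such as model-a/v1.
PIN = CorpusPin(
    corpus="reviews",
    git_tag="model-a-v1",
    commit="3f2c9b1",  # illustrative short hash
)
```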

Having multiple models therefore implies multiple packages, such that a pipeline ought to be considered a distribution of packages with an associated distribution version.
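Continuing the sketch, with the names still hypothetical, the distribution is then just a manifest whose own version moves whenever any member package, and hence any pinned corpus, moves:

```python
# pipeline/distribution.py -- hypothetical manifest for the pipeline as a whole.
DISTRIBUTION = {
    "name": "sentiment-pipeline",
    "version": "2.3.0",          # bumped whenever any member package changes
    "packages": {
        "model-a": "1.4.0",      # which in turn pins its corpus, as above
        "model-b": "0.9.1",
    },
}
```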

This consequence resolved my hesitance. You could phrase it as: to speak of "a model" as a coherent entity, you must be able to refer to its lineage—its developmental history. A model artifact without traceable lineage isn't knowable, its provenance a matter you can only gesture to.

Shipping a suite or ensemble of models thus reduces to shipping a software distribution—a problem that is well understood.

In brief