Lately, I've been thinking about structured generation, and in particular an idea I call second-order data modelling. I've found it difficult to verbalise exactly what I'm trying to get at due to its abstract nature. The problem with abstract ideas is that attempting to articulate them can overwhelm your working memory, or rather fill it with specific instances and precise details, which ironically causes you to lose sight of the big (abstract) picture.
While I do want to touch on the finer points, I first want to articulate the concept in a more abstract sense, and I promise to follow that up with concrete details later on. I don't typically have this issue, but I think the fact that this is a meta-concept makes it especially prone to this curious failure mode.
## Understanding Second-Order Data Models
A second-order data model is essentially a schema for schemas. It operates at a meta-level and is recursive in nature, but the recursion is applied only once, which is what makes it second-order.
To better understand this, I want to clarify what a first-order data model is.
- A first-order data model is simply a schema, which is a formal definition of an object.
- This definition may contain entities, and each entity may include typed fields.
There are many examples of first-order data models.
- The primary one I'll be referring to here is the Pydantic data model, which can also be exported as a JSON schema, another common format; a small sketch follows this list.
- Other examples include protobuf and various other schema languages.
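To make the first-order case concrete, here is a minimal sketch assuming Pydantic v2 (the `Book` model and its fields are placeholders of my own): the class definition is the schema, and `model_json_schema()` exports it in JSON Schema form.

```python
from pydantic import BaseModel

# A first-order data model: one entity ("Book") with typed fields.
class Book(BaseModel):
    title: str
    pages: int

# Pydantic v2 can export the same definition as a JSON-schema dict.
print(Book.model_json_schema())
# {'properties': {'title': {..., 'type': 'string'},
#                 'pages': {..., 'type': 'integer'}}, ...}
```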
If we were to 'go up a level', we would encounter 'schemas of schemas', or second-order models. However, it's important to distinguish these from concepts like grammars.
A grammar is more like an alphabet: a set of rules that defines the space of all possible schemas, rather than the specific kind of schema I'm discussing here. Grammars are a broader category, encompassing every schema that could possibly exist, which is different from the structured models we're focusing on; for the same reason, I'm not addressing languages here either.
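To gesture at what 'going up a level' looks like in code, here is a minimal sketch, again assuming Pydantic v2; the names (`FieldSpec`, `EntitySpec`, `realise`, `TYPE_MAP`) are hypothetical illustrations, not an established API. The point is that instances of the second-order model are themselves schemas, which `pydantic.create_model` can collapse into ordinary first-order models.

```python
from pydantic import BaseModel, create_model

# Hypothetical mapping from type names (as they appear in a spec)
# to the Python types Pydantic understands.
TYPE_MAP = {"str": str, "int": int, "float": float, "bool": bool}

class FieldSpec(BaseModel):
    name: str
    type: str  # must be a key of TYPE_MAP

class EntitySpec(BaseModel):
    """Second-order model: each instance is itself a first-order schema."""
    name: str
    fields: list[FieldSpec]

def realise(spec: EntitySpec) -> type[BaseModel]:
    # Collapse a second-order instance into a first-order Pydantic model.
    return create_model(
        spec.name,
        **{f.name: (TYPE_MAP[f.type], ...) for f in spec.fields},
    )

# A second-order instance: ordinary data that describes a schema.
spec = EntitySpec(
    name="Book",
    fields=[
        FieldSpec(name="title", type="str"),
        FieldSpec(name="pages", type="int"),
    ],
)

# Realised as a first-order model, it validates plain data as usual.
Book = realise(spec)
print(Book(title="Dune", pages=412))  # title='Dune' pages=412
```

Note that the recursion happens exactly once: `EntitySpec` describes schemas, but the models it realises describe plain data, not further schemas; hence second-order.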