Semantic Types

Paths would ideally represent necessity and possibility

When I program with Python paths, I frequently find myself reaching for types that represent path semantics, which I've only encountered in Pydantic. I don't think this really belongs in that library (as much as I appreciate it providing them): it's rather more fundamental, shouldn't be tied up with runtime validation, and isn't something Pydantic should have to maintain. Being something of a marginal concern there, I wouldn't be surprised if it were removed entirely in a future release.

File paths, directory paths, paths that must/must not exist, fragments — I would ideally be able to represent all of these as specific types rather than the single simple Path type.

Pathlib was a big upgrade over plain strings all the way back in Python 3.4 (released March 2014); the best part of a decade on, it's something we're all well accustomed to. How might we go beyond it?

Existing Extended Path Types

Pydantic offers FilePath, DirectoryPath, and NewPath, which validate at runtime that a path exists and is the correct kind, or in the case of NewPath, that it does not yet exist. These capture programmer intent within the type: you want to refer to something that exists, or you want something to be created fresh.
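
For instance, a minimal sketch using Pydantic v2, where validation fires when the model is instantiated:

from pathlib import Path

from pydantic import BaseModel, DirectoryPath, FilePath, NewPath

class PipelineConfig(BaseModel):
    input_file: FilePath      # must exist and be a file
    work_dir: DirectoryPath   # must exist and be a directory
    output_file: NewPath      # must not exist yet

cfg = PipelineConfig(
    input_file=Path("data.csv"),
    work_dir=Path("."),
    output_file=Path("out.csv"),
)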

Effect Systems

This ventures into the nefarious realm of effect systems. In programming language theory, an effect is a part of the program's semantics, the observable behaviours of a program beyond producing a return value. Effect systems model and track these behaviours inside the language.

Besides printing to the console, file system I/O is probably the most common form of effect. Aspects of the file system completely outside of the program's scope can produce errors that are "within the semantics of the program" (that is: we must always expect them) but not necessarily representable in the program's variables (that is: we could never predict them).

For example: writing to a file "foo" might work, while a file "bar" might be a symlink to a read-only path and raise a permissions error at runtime (ditto for web fetches, and so on). The effect is conditional on program data (the choice of "foo" or "bar") but not reducible to pure data transformations (there is no function you could write that would infallibly tell you whether the path will be OK to use).
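
Concretely, a sketch; whether the except branch ever runs depends on filesystem state, not on anything visible in the program:

from pathlib import Path

for name in ("foo", "bar"):
    try:
        Path(name).write_text("data")
    except OSError as err:  # e.g. PermissionError via a read-only symlink target
        print(f"writing {name} failed: {err}")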

What Else Is Possible?

In Haskell, path libraries like pathtype and StrongPath use 'phantom types' to track properties such as absoluteness, file vs. directory, and normalisation at compile time. This makes certain invalid operations unrepresentable: you cannot append to a file path, combine two absolute paths, or construct a path that is known to be invalid.

The crux of these libraries is a distinction between describing a path and realising it as a concrete value.

In Python, path composition always produces a concrete Path value. Once that happens, any semantic context—what kind of path it was intended to be, where it was relative to, or how it should be reused—is erased.

For example, if I want to create a file of the same name in multiple directories, I have to repeatedly decompose and recompose the path. This frequently happens due to parameter passing, and has the unpleasant side effect that once the first path is baked in, everything downstream is already realised.

x = Path("foo1") / "bar"
y = x.parent / "bar"
z = y.parent / "bar"

It's like having to write x = 10 + 1, y = 10 + 1 - 1 + 2, z = 10 + 1 - 1 + 2 - 2 + 3 rather than vectorising as x, y, z = 10 + [1, 2, 3].

Oh, you can only see the variable z in your function and you want to recall how it was derived? Well, you have to trace back all the operations through the code, and there's no way to guarantee you've got it right.

Oh, you want to change "foo1" to "foo2" based on runtime config? Well, y and z have been statically set from x's default already so now you have to rewrite your code to set those dynamically.

What if we could write

leaf = Path.fragment("bar")
for parent in [Path("foo1"), Path("foo2")]:
    full = leaf.under(parent)

or

leaf = Path.fragment("bar")
a, b, c = map(leaf.under, [Path("foo1"), Path("foo2"), Path("foo3")])

In Python's pathlib you phrase this the other way around, parent.joinpath(leaf), but the difference here is less the syntax than the idea of deferring. If we could represent the path more like a query plan (with semantics but no location) then we would be storing the intent as well as the eventual concrete value.

Anchoring a fragment to a concrete filesystem is an effect: it commits the path to a particular environment and moment in time. Once that happens, we lose the ability to reuse or reason about the path independently of that context.
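
A minimal sketch of what such a fragment could look like (Fragment and under are hypothetical names, not pathlib API):

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Fragment:
    parts: tuple[str, ...]

    def under(self, parent: Path) -> Path:
        # Anchoring is the effectful step; the fragment itself stays reusable
        return parent.joinpath(*self.parts)

leaf = Fragment(("bar",))
a, b = (leaf.under(p) for p in (Path("foo1"), Path("foo2")))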

Bidirectional Paths and Lenses

What else might this imply: in what ways might a path change? The limitation of Path is that it forces path semantics to collapse immediately into concrete values, leaving no room to represent intent independently of realisation. But intent is a vague term; a temporal framing is perhaps better: we are formalising a notion of a type's future (what we plan for it).

This naturally leads to the question: could we also consider its past? This is often called "data lineage", but we could give more structure to this than a simple sequence of concrete values, as well as add structured metadata to record things like timestamps:

raw_data = Path("data/raw/input.csv").tag(source="upload", timestamp=upload_time)
processed = transform(raw_data)
# processed.provenance -> {derived_from: raw_data, operation: "transform", ...}

# Later, you can ask: where did this file come from?
output_model.trace_lineage()
# -> output_model <- training_data <- processed_features <- raw_data

This would naturally lend itself to both debugging and cache invalidation (identifying the originating files to delete alongside a downstream artifact known to be corrupt or stale).

Functional programming has the concept of lenses—bidirectional accessors that let you both get and set nested data. Paths could have similar structure. A "path lens" might describe not just "where is the file" but "how to transform between two locations".

# Any file in src/ has a corresponding file in build/
src_to_build = PathLens(
    forward=lambda p: Path("build") / p.relative_to("src"),
    backward=lambda p: Path("src") / p.relative_to("build"),
)

# Now you can map in either direction
source = Path("src/module/foo.py")
output = src_to_build.forward(source)  # build/module/foo.py
recovered = src_to_build.backward(output)  # src/module/foo.py

# Or apply to whole trees
build_files = src_to_build.forward_all(src_dir.rglob("*.py"))
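
A minimal stdlib sketch of such a PathLens (hypothetical; forward_all is just a mapped comprehension, and a real library would also want composition and round-trip checks):

from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable

@dataclass(frozen=True)
class PathLens:
    forward: Callable[[Path], Path]
    backward: Callable[[Path], Path]

    def forward_all(self, paths: Iterable[Path]) -> list[Path]:
        return [self.forward(p) for p in paths]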

To me this fits very naturally into the idea of a data pipeline (consider also: build systems, backup systems, migration scripts, testing).

Many operations inherently carry the notion of a path passing from an input source to an output destination. In fact, a lot of my usage of I/O-linted Pydantic path types is just to add pre- and post-conditions onto pipeline steps (ensure the input exists and the output doesn't, or similar).
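
Something along these lines is already possible (a sketch using Pydantic v2's validate_call, which checks the argument types when the function is called):

from pydantic import FilePath, NewPath, validate_call

@validate_call
def convert(src: FilePath, dst: NewPath) -> None:
    # Precondition: src exists and is a file; dst must not exist yet
    dst.write_text(src.read_text().upper())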

Even aside from the existence of the file though, where things come from and lead to is of central importance. What if we treated this the way DataFrame libraries like Polars treat query plans?

Path Capabilities

We discussed effects; a related term is capabilities: the things one can do with a path (read/write/create/delete/enumerate). Currently these are checked at runtime when you attempt the operation, but we could track them in types:

def ingest_data(source: Path[Readable]) -> DataFrame: ...
def write_report(dest: Path[Writable, MayCreate]) -> None: ...
def archive_logs(log_dir: Path[Listable, Readable], archive: Path[Writable]) -> None: ...

The type system can then verify that you don't pass a read-only path to a function that writes, or a file where a directory is expected, and that you've explicitly handled the "may not exist" case before reading.

Note that this introduces a sort of nullable type to the capability (some things are always going to be created, some only maybe), which might in turn be something a checker enforces.

In all these cases, being more precise about this can catch both systemic disastrous mistakes and rare one-off flukes that would cause data loss. It also takes such concerns out of the business logic and into the type system, making the error unrepresentable rather than implying a need for defensive programming.
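
One approximation available today is NewType with a smart constructor (a sketch: a type checker will reject a plain Path where ReadablePath is demanded, though nothing stops the filesystem changing after the check):

import os
from pathlib import Path
from typing import NewType

ReadablePath = NewType("ReadablePath", Path)

def check_readable(p: Path) -> ReadablePath:
    # The only blessed way to obtain a ReadablePath, so functions that
    # demand one in their signature know the check has happened
    if not (p.is_file() and os.access(p, os.R_OK)):
        raise PermissionError(f"{p} is not a readable file")
    return ReadablePath(p)

def ingest_data(source: ReadablePath) -> None: ...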

Rust's typed_path does something along these lines, with push_checked() returning errors for traversal attacks. Python could do something similar with types:

# Path that's statically known to be under a trusted root
def serve_static(asset: Path[Under[STATIC_ROOT], Readable]) -> Response: ...

# Trying to pass an arbitrary path is a type error
serve_static(Path("/etc/passwd"))  # Error: not Under[STATIC_ROOT]

(Path traversal is a common form of security vulnerability, and preventing it is one of the main motivations for capability-oriented libraries like Rust's cap-std.)
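
The runtime equivalent today looks something like this (a sketch; resolve() normalises symlinks and ".." before the containment test):

from pathlib import Path

STATIC_ROOT = Path("/srv/static").resolve()

def serve_path(candidate: str) -> Path:
    # Runtime analogue of Under[STATIC_ROOT]: normalise, then check containment
    path = (STATIC_ROOT / candidate).resolve()
    if not path.is_relative_to(STATIC_ROOT):
        raise ValueError(f"{path} escapes {STATIC_ROOT}")
    return path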

Content-Addressed Paths

Nix takes a radical approach: paths are derived from the content (and build instructions) of what's at them. /nix/store/ab3f2... is computed from the hash of the package's inputs, making them reproducible, cacheable, parallelisable, and deterministic. Again these seem like useful properties for working with data artifacts in pipelines.
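
In miniature, content addressing is easy to sketch (a hypothetical store layout; Nix additionally hashes build instructions and dependencies, not just content):

import hashlib
from pathlib import Path

def store_path(content: bytes, store: Path = Path("/tmp/store")) -> Path:
    # Identical inputs always map to the same path: reproducible and cacheable
    digest = hashlib.sha256(content).hexdigest()
    return store / digest[:16]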

Path Schemas and Contracts

Projects have conventional structures, and typically we end up encoding these in incredibly rudimentary ways: globals, or at best a dataclass or other model that we reluctantly lug around. What if the type system held all this for us instead?

@directory_schema
class MLProject:
    class data(Dir):          # nested directories declared as a nested class
        raw: Dir
        processed: Dir
    models: Dir
    config: File["config.yaml"]

# Bind to an actual directory
project = MLProject.bind("/home/user/my-project")

# Now you have typed accessors
project.data.raw  # Path to data/raw, statically known to be a directory
project.config    # Path to config.yaml, statically known to be a file

# Validation
MLProject.validate("/home/user/my-project")  # Checks structure exists
MLProject.scaffold("/home/user/new-project")  # Creates the structure

What if we could also run type inference on a project and get this automatically?

My library polars-genson (atop genson-core) does schema inference on JSON; what if it did something similar for file systems? It'd be a kind of 'structural schema' inference over the file system, from observations of filesystem paths instead of JSON documents.

For JSON, genson-rs infers object keys, data types, optional vs. required, and arrays vs. scalars; genson-core extends it by 'unifying' Record types (like [{"lang":{"en": "hello"}}, {"lang":{"fr": "bonjour"}}]) into Map types like lang: Map<str,str>, rather than records heavy with non-required keys like lang: Record{en: Optional<str>, fr: Optional<str>}.

From a corpus of directory trees you would infer their node kind (directory, file), name fixity or else parameterisation (numeric index like file-001-of-010, wildcard like *.json, regex-like: model-{epoch}.pt), cardinality (0/1 optional, exactly 1, 0+, 1+), and so on.

The schema would look something like this:

from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Literal

@dataclass
class FSNodeSchema:
    kind: Literal["dir", "file"]
    name: NameSchema
    children: dict[str, FSNodeSchema] | None
    cardinality: Cardinality

Where NameSchema might be:

@dataclass
class NameSchema:
    kind: Literal["literal", "glob", "regex", "parameterized"]
    pattern: str

and maybe Cardinality as an enum:

class Cardinality(Enum):
    ONE = "one"
    OPTIONAL = "optional"
    MANY = "many"

This mirrors JSON schema concepts closely enough to borrow some genson-core ideas. For each root directory you observe:

def walk(root: Path) -> TreeObservation:
    ...

you'd emit something like:

{
  "data": {
    "raw": {},
    "processed": {}
  },
  "models": {
    "model-001.pt": None,
    "model-002.pt": None,
  },
  "config.yaml": None
}

with directories as schema dicts and files as leaves.
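
A minimal sketch of that observation step (here TreeObservation is just a nested dict, with directories mapping to dicts and files to None):

from pathlib import Path

TreeObservation = dict

def walk(root: Path) -> TreeObservation:
    return {
        child.name: walk(child) if child.is_dir() else None
        for child in sorted(root.iterdir())
    }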

The name patterning would parallel genson-core's Map type inference: instead of treating a directory as one big Record type, we'd try to infer a Map that captures as much structure as we can (such as an incrementing index, or indexes, in the filename).

For example:

models/
  model-001.pt
  model-002.pt
  model-003.pt

would be inferred as:

models:
  model-{index:03}.pt: File[int]

allowing you to ascend from f-string-based primitives to a more mature interface over the file system, at the level of the artifact dataset.
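
The detection itself can start very simply (a sketch; infer_name_pattern is hypothetical and only handles zero-padded numeric runs):

import re

def infer_name_pattern(names: list[str]) -> str:
    # Collapse digit runs to width-preserving placeholders; if every name
    # yields the same template, the directory is Map-like rather than Record-like
    templates = {
        re.sub(r"\d+", lambda m: f"{{index:0{len(m.group())}}}", name)
        for name in names
    }
    return templates.pop() if len(templates) == 1 else "*"

infer_name_pattern(["model-001.pt", "model-002.pt"])  # 'model-{index:03}.pt'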

Path Algebra

Sometimes you want to describe sets of paths and operate on them algebraically, and with a deferred type we can do this (much like query planners in libraries like Polars).

Let's imagine here that PathQ is a new deferred Path 'query' type:

# Union: try these locations in order
config = UserConfig | SystemConfig | DefaultConfig

# Intersection: paths that match multiple criteria
logs = PathQ.under("var/log") & PathQ.glob("*.log") & PathQ.modified_after(yesterday)

# Difference: exclude certain paths
sources = PathQ.under("src") - PathQ.glob("*_test.py")

# Collection happens lazily
for path in sources.collect():
    process(path)

These are still expressions, not concrete lists.
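
A minimal sketch of such a deferred query, built on predicates (hypothetical; modified_after and ordered unions are omitted for brevity):

from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch
from pathlib import Path
from typing import Callable, Iterator

@dataclass(frozen=True)
class PathQ:
    predicate: Callable[[Path], bool]

    @staticmethod
    def under(prefix: str) -> PathQ:
        return PathQ(lambda p: p.is_relative_to(prefix))

    @staticmethod
    def glob(pattern: str) -> PathQ:
        return PathQ(lambda p: fnmatch(p.name, pattern))

    def __and__(self, other: PathQ) -> PathQ:
        return PathQ(lambda p: self.predicate(p) and other.predicate(p))

    def __sub__(self, other: PathQ) -> PathQ:
        return PathQ(lambda p: self.predicate(p) and not other.predicate(p))

    def collect(self, root: Path = Path(".")) -> Iterator[Path]:
        # Nothing touches the filesystem until collect(), like a query plan
        return (p for p in root.rglob("*") if self.predicate(p))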

Symbolic/Late-Bound References

Sometimes you want to refer to "the path that will be determined later". Function parameters do this, but you lose the ability to compose them.

Consider the current situation here, where you have to pass root through every layer:

def process(root: Path):
    config = root / "config.yaml"
    data = root / "data"
    output = root / "output"

What if instead:

ROOT = PathVar("root")
CONFIG = ROOT / "config.yaml"
DATA = ROOT / "data"
OUTPUT = ROOT / "output"

and only later do you bind the variable

paths = {ROOT: Path("/actual/root")}
actual_config = CONFIG.collect(paths)

This is late binding for paths: like string formatting, but with path semantics preserved. Pydantic allows something similar for imports (ImportString), which performs the import at runtime upon model validation (typically when an instance of the model is instantiated).
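
A minimal sketch of such a PathVar (hypothetical; a fuller version would support several variables in one expression):

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class PathVar:
    name: str
    parts: tuple[str, ...] = ()

    def __truediv__(self, segment: str) -> PathVar:
        return PathVar(self.name, self.parts + (segment,))

    def collect(self, bindings: dict[PathVar, Path]) -> Path:
        # Late binding: look up the root variable, then realise the path
        return bindings[PathVar(self.name)].joinpath(*self.parts)

ROOT = PathVar("root")
CONFIG = ROOT / "config.yaml"
CONFIG.collect({ROOT: Path("/actual/root")})  # Path('/actual/root/config.yaml')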

Virtual Filesystems

Local files can also be referred to with the file:// protocol, but this is neglected in Python's pathlib. Likewise we can use fsspec's path abstraction, albeit to a limited degree, to get a pathlib-like interface onto remote blob stores in commercial cloud services, as well as in-memory file systems.
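
For instance, a sketch using fsspec's in-memory backend (the fuller pathlib-like layer comes from companion libraries such as universal_pathlib):

import fsspec

fs = fsspec.filesystem("memory")
with fs.open("bucket/data.txt", "w") as f:
    f.write("hello")
print(fs.ls("bucket"))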

While I haven't developed with this library in a while, I'd imagine we could learn from its example in using the type system for filesystem tracking more generally.

Suggested Syntax

So far, so much pie in the sky. What might this look like in practice?

For fragments, we'd want a dedicated type or a constructor method on Path:

fragment = Path.fragment("cache", "models", "v1")

which we turn into a realised path by placing it with fragment.under(Path.home() / ".myapp").

We might also have readymade anchors:

from pathlib import XDG
fragment.under(XDG.CACHE_HOME / "myapp")

A simple way to implement these types would be to use typing.Annotated:

from typing import Annotated
from pathlib import Existing, File, Dir, Under

ConfigFile = Annotated[Path, Existing, File]
OutputDir = Annotated[Path, Dir, Under[PROJECT_ROOT]]

def load_config(path: ConfigFile) -> Config: ...
def save_output(dir: OutputDir, name: str) -> None: ...

Note in this case the intent of having the project root be hardcoded is explicit, but if we wanted to make something variable we could use the late binding approach.

We could have path builders:

class ProjectPaths:
    def __init__(self, root: Path):
        self._root = root

    @path_property
    def config(self) -> Path[File]:
        return self._root / "config.yaml"

    @path_property  
    def data(self) -> Path[Dir]:
        return self._root / "data"

    @path_property
    def cache(self) -> Path[Dir, MayNotExist]:
        return self._root / ".cache"

though this is quite verbose and exactly the style I'd prefer to avoid (a concise dataclass or model would be preferable); either way, once built it would be used like so:

proj = ProjectPaths(Path.cwd())
proj.config  # Typed as file
proj.data    # Typed as directory

And last but not least we would get the aforementioned security protections against path traversal attacks:

class PathEscapeError(ValueError):
    """Raised when a path escapes its trusted root."""

class SecurePath(Path):
    """Path that's validated to be under a trusted root (subclassing Path needs Python 3.12+)."""

    def __init__(self, *args, root: Path):
        super().__init__(*args)
        # Resolve symlinks and ".." before the containment check
        if not self.resolve().is_relative_to(root.resolve()):
            raise PathEscapeError(f"{self} is not under {root}")

# Or as a type
TrustedPath = Path.constrained(under=ALLOWED_ROOT, normalized=True)