Paths have an inherent relational structure, but pathlib only gives us operations that produce concrete values, not relations you can hold onto as values themselves. In part 1 I discussed some elements of a way out (bidirectional paths and late binding). Here I want to focus on what it would mean to treat connectivity as first-class.
The Problem
After this line:
config = project_root / "config.yaml"
config is just /home/me/myproject/config.yaml. The fact that it was constructed
relative to project_root is gone. You can recover it with config.relative_to(project_root),
but that's a runtime computation requiring project_root to still be in scope, not a preserved
fact about config.
Internally, Path stores segments in _raw_paths, but this doesn't help:
>>> (Path("a") / "b" / "c")._raw_paths
['a', 'b', 'c']
>>> (Path("a/b") / "c")._raw_paths
['a/b', 'c']
>>> (Path("a/b/c"))._raw_paths
['a/b/c']
In these cases you do retain an internal record of how the path was composed, but as you can see below it's quickly lost:
>>> (Path("a/b/c").parent)._raw_paths
['a/b']
>>> (Path("a/b/c").parent / "c")._raw_paths
['a/b', 'c']
The 'raw path' segments only reflect how the path was most recently constructed, not its compositional history.
Operations like .parent lop off the structure entirely, so you can't rely on it for any kind of
meaningful provenance.
Relations As Variables
Instead of storing just the result, suppose we capture the relation itself:
config = project_root.child("config.yaml")
# config retains its structure:
config.base # -> project_root
config.base_rel # -> Path("config.yaml")
Now config isn't just a string that happens to start with project_root, it stores project_root as the base it was defined relative to.
You could inspect this, extract it to pass along separately, or, more importantly, rebase it to create
a new path in the same relation:
test_config = config.rebase(test_root)
test_config is "config.yaml" relative to test_root, preserving the relation with a new base.
This seems neater than using a general-purpose Path.under(new_parent) (which would be little more
than the existing pathlib Path.joinpath read in the opposite direction). That is,
having to extract and then chain adds an extra operation, rather than operating on the implied base:
test_config = config.base_rel.under(test_root)
This connects directly to the late binding idea from part 1. A PathVar is an unresolved anchor;
a connected path is a relation waiting for its anchor to be supplied or substituted.
Sibling and Cousin Relations
The same principle applies to lateral relations:
Root = PathVar("Root") # Named, like a TypeVar
csv_in = Root / "data" / "input.csv"
schema = csv_in.sibling("schema.json") # csv_in.parent / "schema.json", but relationally
Here schema is not Root / "data" / "schema.json": it's defined in terms of csv_in,
not Root directly. When you bind Root, both can be collected from it:
path_config = {Root: Path("/tmp/proj")}
csv_in.collect(path_config) # -> /tmp/proj/data/input.csv
schema.collect(path_config) # -> /tmp/proj/data/schema.json
The paths dictionary here operates like a dataclass or Pydantic model providing an
instantiation of the PathVar (which operates like a TypeVar). It doesn't look great, but the
general idea is there: annotations are used at runtime to concretise the symbolic, late-bound
path fragment.
Put simply, you can express relative paths without collapsing everything down to primitive strings under the hood.
For instance, you could let a user optionally rebase csv_in to move a set of files to another directory,
and schema would move with it. The relation is the thing you're holding and providing, not the realised path.
You could extend this to cousins:
# output.csv lives in ../output/ relative to input.csv
output_file = input_file.cousin("../output", "output.csv")
Though at some point you're just rebuilding a graph structure. The question is where the useful abstraction boundary lies.
Type-Level Connectivity
The hardest version of this is making connectivity checkable at type-check time:
def process(root: Path, config: Path[Under[root]]) -> None:
    ...
This is difficult in Python because root is a runtime value, not a type. Dependent types would
let you express "config: Path is under root: Path" directly, but Python doesn't have them.
You could approximate it with generics:
T = TypeVar("T", bound=Path)

class BasedPath(Generic[T]):
    base: T
    base_rel: Path

def process(root: T, config: BasedPath[T]) -> None:
    # config's type is parametric on root's identity
    ...
But this is awkward. The base becomes part of the type, which means two paths with different bases have different types even if they resolve to the same location. That might be what you want (for safety) or might be annoying (for interoperability).
A more practical approach might be runtime-checked but type-annotated:
ConfigPath = Annotated[Path, RelativeTo["project_root"]]

def load_config(config: ConfigPath) -> Config:
    # At runtime, validate that config.is_relative_to(project_root)
    # At type-check time, document the intent
    ...
This is what Pydantic's validators do, just with connectivity constraints instead of existence constraints.
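Here's a sketch of that runtime-checked approach, with a hypothetical RelativeTo marker read back out of the Annotated metadata via get_type_hints (instantiated with a concrete base here, rather than the string key shown above; call_checked stands in for the validation layer a framework like Pydantic would provide):

```python
from pathlib import Path
from typing import Annotated, get_args, get_origin, get_type_hints


class RelativeTo:
    """Hypothetical marker: the annotated path must live under `base`."""

    def __init__(self, base: Path) -> None:
        self.base = base


project_root = Path("/home/me/myproject")
ConfigPath = Annotated[Path, RelativeTo(project_root)]


def load_config(config: ConfigPath) -> str:
    # Stand-in body: a real loader would open and parse the file.
    return config.name


def call_checked(func, **kwargs):
    # Runtime enforcement in the spirit of a Pydantic validator:
    # inspect each parameter's Annotated metadata before calling.
    hints = get_type_hints(func, include_extras=True)
    for name, value in kwargs.items():
        hint = hints.get(name)
        if get_origin(hint) is Annotated:
            for meta in get_args(hint)[1:]:
                if isinstance(meta, RelativeTo) and not value.is_relative_to(meta.base):
                    raise ValueError(f"{name}={value} is not under {meta.base}")
    return func(**kwargs)


print(call_checked(load_config, config=project_root / "config.yaml"))  # config.yaml
```

A type checker sees only Path for ConfigPath, so the connectivity constraint documents intent statically and is enforced dynamically.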
Composition of Connected Paths
If a has base root, and b has base a, what's b's base?
root = Path("/project")
data = root.child("data") # base is root
raw = data.child("raw") # base is data, transitively root
You could flatten the chain (raw.base_rel is "data/raw" relative to root) or preserve it
(raw knows about data, data knows about root). The latter is more expressive but more
complex.
This is where the path schema idea from part 1 becomes relevant. A schema like:
class Project:
    data: Dir
    data.raw: Dir
    data.processed: Dir
implicitly encodes connectivity: data.raw is under data is under the project root. The
schema is a way of declaring the connectivity structure up front rather than discovering it
through composition.
Requirements/Recap
So to make connectivity first-class, I'd say you'd need:
- A relation type: holding "X relative to Y" as a value, not just the result of computing X from Y.
- Base tracking: so paths remember what they were defined relative to, or at least can be queried for it.
- Rebasing operations: given a connected path, substitute a different base.
- Composition rules: how do connected paths combine? What's the base of a.child("x") where a itself has a base?
- Type integration: at minimum, annotations for documentation. Ideally, something a type checker could validate.
None of this would be hard to come up with; it's just not what pathlib was designed for, and I doubt I'd try to monkeypatch it on. I'd like to see this fleshed out some more (and maybe a few proofs of concept to dissect and critique), and in particular the performance cost would need to be considered.
Pathlib gives you a nicer syntax for string manipulation on paths; what I'm describing here is closer to a path algebra, where the relations themselves are the objects you manipulate.
It may be something that comes out of the latest crop of type checker development; one thing I didn't explore here was union and intersection types in ty. I don't really see the role for these, as type checkers and runtime use are separate concerns, but if I were developing a new type system for paths I'd probably be exploring what bonus static analysis becomes possible (for instance, could we report on all the file system effects of a given program?).
As for how you implement that, there are a few options floating around now:
- Polars-like query planner (instead of filters and selects, it's traversal and fragment composition)
- Pydantic-like schema models
- Generics
- typing.Annotated metadata
- (+ Type checker-specific approaches as a bonus)