Paths have an inherent relational structure, but pathlib only gives us operations that produce concrete values, not relations you can hold onto as values themselves. In part 1 I discussed some elements of a way out (bidirectional paths and late binding). Here I want to focus on what it would mean to treat connectivity as first-class.
The Problem
After this line:
config = project_root / "config.yaml"
config is just /home/me/myproject/config.yaml. The fact that it was constructed
relative to project_root is gone. You can recover it with config.relative_to(project_root),
but that's a runtime computation requiring project_root to still be in scope, not a preserved
fact about config.
Internally, Path stores segments in _raw_paths, but this doesn't help:
>>> (Path("a") / "b" / "c")._raw_paths
['a', 'b', 'c']
>>> (Path("a/b") / "c")._raw_paths
['a/b', 'c']
>>> (Path("a/b/c"))._raw_paths
['a/b/c']
In these cases you do retain an internal record of how the path was composed, but as you can see below it's quickly lost:
>>> (Path("a/b/c").parent)._raw_paths
['a/b']
>>> (Path("a/b/c").parent / "c")._raw_paths
['a/b', 'c']
The 'raw path' segments only reflect how the path was most recently constructed, not its compositional history.
Operations like .parent lop off the structure entirely, so you can't rely on it for any kind of
meaningful provenance.
Relations As Variables
Instead of storing just the result, suppose we capture the relation itself:
config = project_root.child("config.yaml")
# config retains its structure:
config.base # -> project_root
config.base_rel # -> Path("config.yaml")
Now config isn't just a string that happens to start with project_root, it stores project_root as the base it was defined relative to.
You could inspect this, extract it to pass along separately, or, more importantly, rebase it to create
a new path in the same relation:
test_config = config.rebase(test_root)
test_config is "config.yaml" relative to test_root, preserving the relation with a new base.
This seems neater than using a general-purpose Path.under(new_parent) (which would be little more
than the existing pathlib Path.joinpath read in the opposite direction). That is,
having to extract and then chain adds an extra operation, rather than operating on the implied base:
test_config = config.base_rel.under(test_root)
This connects directly to the late binding idea from part 1. A PathVar is an unresolved anchor;
a connected path is a relation waiting for its anchor to be supplied or substituted.
Sibling and Cousin Relations
The same principle applies to lateral relations:
Root = PathVar("Root") # Named, like a TypeVar
csv_in = Root / "data" / "input.csv"
schema = csv_in.sibling("schema.json") # csv_in.parent / "schema.json", but relationally
Here schema is not Root / "data" / "schema.json": it's defined in terms of csv_in,
not Root directly. When you bind Root, both can be collected from it:
path_config = {Root: Path("/tmp/proj")}
csv_in.collect(path_config) # -> /tmp/proj/data/input.csv
schema.collect(path_config) # -> /tmp/proj/data/schema.json
The paths dictionary here operates like a dataclass or Pydantic model providing an
instantiation of the PathVar (which operates like a TypeVar). It doesn't look great, but the
general idea is there: annotations are used at runtime to concretise the symbolic, late-bound
path fragment.
Put simply, you can express relative paths without collapsing everything down to primitive strings under the hood.
For instance, you could let a user optionally rebase csv_in to move a set of files to another directory,
and schema would move with it. The relation is the thing you're holding and providing, not the realised path.
You could extend this to cousins:
# output.csv lives in ../output/ relative to input.csv
output_file = input_file.cousin("../output", "output.csv")
Though at some point you're just rebuilding a graph structure. The question is where the useful abstraction boundary lies.
Type-Level Connectivity
The hardest version of this is making connectivity checkable at type-check time:
def process(root: Path, config: Path[Under[root]]) -> None:
    ...
This is difficult in Python because root is a runtime value, not a type. Dependent types would
let you express "config: Path is under root: Path" directly, but Python doesn't have them.
You could approximate it with generics:
T = TypeVar("T", bound=Path)

class BasedPath(Generic[T]):
    base: T
    base_rel: Path

def process(root: T, config: BasedPath[T]) -> None:
    # config's type is parametric on root's identity
    ...
But this is awkward. The base becomes part of the type, which means two paths with different bases have different types even if they resolve to the same location. That might be what you want (for safety) or might be annoying (for interoperability).
A more practical approach might be runtime-checked but type-annotated:
ConfigPath = Annotated[Path, RelativeTo["project_root"]]

def load_config(config: ConfigPath) -> Config:
    # At runtime, validate that config.is_relative_to(project_root)
    # At type-check time, document the intent
    ...
This is what Pydantic's validators do, just with connectivity constraints instead of existence constraints.
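Here's a sketch of that runtime-checked approach, with a hypothetical RelativeTo marker read back out of the Annotated metadata via get_type_hints (instantiated with a concrete base here, rather than the string key shown above; call_checked stands in for the validation layer a framework like Pydantic would provide):

```python
from pathlib import Path
from typing import Annotated, get_args, get_origin, get_type_hints


class RelativeTo:
    """Hypothetical marker: the annotated path must live under `base`."""

    def __init__(self, base: Path) -> None:
        self.base = base


project_root = Path("/home/me/myproject")
ConfigPath = Annotated[Path, RelativeTo(project_root)]


def load_config(config: ConfigPath) -> str:
    # Stand-in body: a real loader would open and parse the file.
    return config.name


def call_checked(func, **kwargs):
    # Runtime enforcement in the spirit of a Pydantic validator:
    # inspect each parameter's Annotated metadata before calling.
    hints = get_type_hints(func, include_extras=True)
    for name, value in kwargs.items():
        hint = hints.get(name)
        if get_origin(hint) is Annotated:
            for meta in get_args(hint)[1:]:
                if isinstance(meta, RelativeTo) and not value.is_relative_to(meta.base):
                    raise ValueError(f"{name}={value} is not under {meta.base}")
    return func(**kwargs)


print(call_checked(load_config, config=project_root / "config.yaml"))  # config.yaml
```

A type checker sees only Path for ConfigPath, so the connectivity constraint documents intent statically and is enforced dynamically.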
Composition of Connected Paths
If a has base root, and b has base a, what's b's base?
root = Path("/project")
data = root.child("data") # base is root
raw = data.child("raw") # base is data, transitively root
You could flatten the chain (raw.base_rel is "data/raw" relative to root) or preserve it
(raw knows about data, data knows about root). The latter is more expressive but more
complex.
This is where the path schema idea from part 1 becomes relevant. A schema like:
class Project:
    data: Dir
    data.raw: Dir
    data.processed: Dir
implicitly encodes connectivity: data.raw is under data is under the project root. The
schema is a way of declaring the connectivity structure up front rather than discovering it
through composition.
Requirements/Recap
So to make connectivity first-class, I'd say you'd need:
- A relation type: holding "X relative to Y" as a value, not just the result of computing X from Y.
- Base tracking: so paths remember what they were defined relative to, or at least can be queried for it.
- Rebasing operations: given a connected path, substitute a different base.
- Composition rules: how do connected paths combine? What's the base of a.child("x") where a itself has a base?
- Type integration: at minimum, annotations for documentation. Ideally, something a type checker could validate.
None of this would be hard to come up with; it's just not what pathlib was designed for, and I doubt I'd try to monkeypatch it on. I'd like to see this fleshed out some more (and maybe a few proofs of concept to dissect and critique), and in particular the performance cost would need to be considered.
Pathlib gives you a nicer syntax for string manipulation on paths; what I'm describing here is closer to a path algebra, where the relations themselves are the objects you manipulate.
It may be something that comes out of the latest crop of type checker development; one thing I didn't explore here was union and intersection types in ty. I don't really see the role for these, as type checkers and runtime use are separate concerns, but if I were developing a new type system for paths I'd probably be exploring what bonus static analysis becomes possible (for instance, could we report on all the file system effects of a given program?).
As for how you implement that, there are a few options floating around now:
- Polars-like query planner (instead of filters and selects, it's traversal and fragment composition)
- Pydantic-like schema models
- Generics
- typing.Annotated metadata
- (+ Type checker-specific approaches as a bonus)