After touching some real code in part 3, where template strings demoed a working symbolic Path (still eager), it's time to consider how to make a genuinely deferred, late-binding path: one that avoids allocating intermediates unnecessarily when combining and modifying paths.
Before diving in, it's worth understanding the theoretical machinery at hand. When you build an expression tree and transform it into something executable, you're doing compilation (specifically, we plan on doing some term rewriting). Picking the right conceptual framing can lift us above operational thinking (the steps a machine takes) into the abstract, mathematical realm; the wrong choices can easily paint us into a corner.
This post splits into three broad parts: first we'll look at term rewriting (which you can think of as query optimisation), then execution models (stack machines vs register machines), and lastly how much machinery we actually need for our problem.
Term Rewriting Systems
At its core, an optimiser is a term rewriting system. You have terms (nodes in your expression tree) and rules that transform them. A rule has a pattern (left-hand side) and a replacement (right-hand side). For paths, a rule set might look something like this sketch (the exact rules are up to us):
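join(p, lit("."))     →  p                   ; a "." segment is a no-op
parent(join(p, s))    →  p                   ; parent cancels a join (for plain segments s)
join(lit(a), lit(b))  →  lit(a + "/" + b)    ; constant folding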
You apply rules until no more apply. What remains is the normal form, the fully reduced expression.
This framework comes from mathematical logic and has been studied extensively. The key properties we care about are:
1. Termination: whether the rewriting process always halts
If your rules can create terms that match other rules in cycles, you loop forever. For path expressions, our rules only shrink or simplify terms, so termination is straightforward.
2. Confluence: whether rule order matters
A system is confluent (or has the Church-Rosser property) if different reduction orders yield the same normal form. Our path rules are confluent. Folding constants before or after cancelling parent-joins gives the same result.
3. Completeness: whether the normal form represents the "best" version
This is where it gets interesting. Consider two equivalent forms of the same path, for instance:
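join(join(root, lit("data")), lit("2024"))   ; a chain of joins, small string pieces
join(root, lit("data/2024"))                 ; folded into one literal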
Both are valid, both might be normal forms depending on your rules. Which is "better" depends on whether you're optimising for instruction count, string allocations, or something else.
How Polars does it
Polars implements rewriting through pattern matching on enum variants. From their optimiser code (simplified from polars-plan/src/plans/optimizer/), you can see the structure:
pub trait OptimizationRule {
    fn optimize_plan(&self, plan: &IR, expr_arena: &mut Arena<AExpr>) -> Option<IR>;
}

struct PredicatePushdown;

impl OptimizationRule for PredicatePushdown {
    fn optimize_plan(&self, plan: &IR, expr_arena: &mut Arena<AExpr>) -> Option<IR> {
        match plan {
            IR::Filter { input, predicate } => {
                // Try to push the filter down into the input,
                // returning Some(rewritten_plan) if anything moved
                // ...
                None
            }
            _ => None,
        }
    }
}
The optimiser loops until it reaches a fixed point:
// simplified version of `optimizer_loop`
fn optimize(
    plan: IR,
    rules: &[Box<dyn OptimizationRule>],
    expr_arena: &mut Arena<AExpr>,
) -> IR {
    let mut current = plan;
    let mut changed = true;
    // keep sweeping until a full pass applies no rule
    while changed {
        changed = false;
        for rule in rules {
            if let Some(new_plan) = rule.optimize_plan(&current, expr_arena) {
                current = new_plan;
                changed = true;
            }
        }
    }
    current
}
This is classic iterative rewriting. Its simplicity makes it easy to reason about and debug: each rule is self-contained, and to add a new optimisation you add a new rule.
The E-Graph Alternative
There's a more sophisticated approach: equality saturation with e-graphs. Instead of committing to one rewrite at a time, you represent all equivalent forms simultaneously.
An e-graph (equivalence graph) groups terms into equivalence classes. When you apply a rewrite rule, you don't replace — you add the result to the same equivalence class. The graph grows to represent all terms reachable by any sequence of rewrites.
After saturation (no new equivalents can be derived), you extract the best term according to a cost function, like the smallest, fastest, or some other metric.
The egg library in Rust pioneered accessible e-graph implementations. It's used in projects like:
- Cranelift: mid-end optimisation in a compiler backend
- SPORES: linear algebra optimisation
- Herbie: rewriting floating-point expressions for accuracy
For path expressions, e-graphs are probably overkill. Our term language is small, our rules are few, and the "obvious" rewrites are unambiguous. But if you find yourself with conflicting optimisation goals (minimise allocations vs minimise instructions vs enable parallelism), e-graphs let you defer the choice until you know the cost model.
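For flavour, here's roughly what this looks like in egg; the toy encoding of paths into egg's generic SymbolLang, and the single rule, are this sketch's assumptions:

use egg::{rewrite as rw, AstSize, Extractor, RecExpr, Rewrite, Runner, SymbolLang};

fn main() {
    // one hypothetical rule: a parent cancels a preceding join
    // (only safe when ?s is a plain segment, not "..")
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![
        rw!("parent-cancels"; "(parent (join ?p ?s))" => "?p"),
    ];

    let expr: RecExpr<SymbolLang> = "(parent (join root data))".parse().unwrap();

    // grow the e-graph until saturation (or a resource limit)
    let runner = Runner::default().with_expr(&expr).run(&rules);

    // extract the cheapest equivalent term; AstSize means "smallest"
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (_cost, best) = extractor.find_best(runner.roots[0]);
    println!("{}", best); // root
}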
Datalog for Declarative Rules
Another approach, common in database query optimisers, is encoding rules in Datalog, a logic programming language for expressing recursive queries over relations.
% A node is constant if all its children are constant
constant(N) :- literal(N).
constant(N) :- join(N, L, R), constant(L), constant(R).
constant(N) :- parent(N, C), constant(C).
% Foldable nodes can be evaluated at compile time
foldable(N) :- constant(N).
You express what constitutes an optimisation opportunity declaratively; the Datalog engine figures out how to find all matches. This scales better than hand-written pattern matching when you have many rules with complex interactions.
Systems like Soufflé compile Datalog to C++. CockroachDB's optimiser uses a declarative rule DSL called Optgen. Again, for our scale this is heavy machinery, but it's where you'd go if the rule set grew complex.
Do we need a parser?
I tried out pest last year and found it very accessible. So where does parsing fit into our path work here?
Parser generators (pest, lalrpop, tree-sitter) and parser combinator libraries (nom) transform text into structured data. You need them when your input is a string in some grammar.
We don't have this problem. Our "DSL" is embedded in Python:
root / "data" / template
The Python interpreter parses this. The / operator calls __truediv__, which builds our AST directly. There's no text-to-tree step for us to implement.
Polars works the same way. When you write:
df.filter(pl.col("x") > 5).select(pl.col("y"))
You're calling Python methods that construct a logical plan.
Polars doesn't parse a query string (well, they have pl.sql() now, but that's separate).
So when would we need a parser?
If you wanted a text-based path pattern language:
pattern = parse_path_pattern("/data/{dataset}/chunk_{idx:04d}.parquet")
Then yes, you'd need a grammar and a parser. pest, an ergonomic PEG parser generator, would be a reasonable choice, but it'd be adding a new user-facing syntax, not changing the core architecture.
For now, Python is our parser and the expression tree falls out of normal method chaining.
Execution Models: Stack vs Register Machines
Once you have an optimised IR, you need to execute it. Two classic approaches: stack machines and register machines.
Stack Machines
A stack machine has no named registers. Operations pop operands from a stack and push results back.
; Compute: (a / b).parent
PUSH a ; stack: [a]
PUSH b ; stack: [a, b]
JOIN ; pop b, pop a, push join(a,b) → stack: [a/b]
PARENT ; pop a/b, push parent(a/b) → stack: [a]
Advantages:
- Simple compilation: just emit operations in order
- Compact bytecode: no operand addressing
- Easy to implement
The JVM and CPython bytecode are stack machines. So is WebAssembly (mostly).
For expression evaluation, stack machines are natural. Your expression is a tree; post-order traversal directly yields stack code.
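A minimal sketch of that compiler, with made-up Expr and Op types:

enum Expr {
    Lit(String),
    Join(Box<Expr>, Box<Expr>),
    Parent(Box<Expr>),
}

enum Op {
    Push(String),
    Join,
    Parent,
}

fn emit(e: &Expr, out: &mut Vec<Op>) {
    match e {
        Expr::Lit(s) => out.push(Op::Push(s.clone())),
        Expr::Join(l, r) => {
            emit(l, out);       // left child first
            emit(r, out);       // then right child
            out.push(Op::Join); // operator last: post-order
        }
        Expr::Parent(inner) => {
            emit(inner, out);
            out.push(Op::Parent);
        }
    }
}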
Register Machines
A register machine has named slots (registers). Operations specify which slots to read and write.
; Compute: (a / b).parent
LOAD r0, a ; r0 = a
LOAD r1, b ; r1 = b
JOIN r2, r0, r1 ; r2 = join(r0, r1)
PARENT r3, r2 ; r3 = parent(r2)
; result in r3
Advantages:
- Fewer instructions (no push/pop overhead)
- Explicit data flow enables better optimisation
- Can reuse registers (liveness analysis)
Lua 5.0's shift from stack to register machine improved performance significantly. Modern JavaScript engines (V8) use register-based bytecode internally.
Why a Register Machine for Paths?
Several reasons:
Common subexpression elimination (CSE) falls out naturally. If two expressions share a subpath, you compute it once into a slot and reference that slot twice:
; file1 = root / "data" / "2024" / "a.csv"
; file2 = root / "data" / "2024" / "b.csv"
LOAD r0, root
LOAD r1, "data"
JOIN r2, r0, r1 ; r2 = root/data
LOAD r3, "2024"
JOIN r4, r2, r3 ; r4 = root/data/2024 ← computed once
LOAD r5, "a.csv"
JOIN r6, r4, r5 ; file1
LOAD r7, "b.csv"
JOIN r8, r4, r7 ; file2 ← reuses r4
In a stack machine, you'd either recompute or manually manage duplication.
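Here's the same pair sketched as stack code; without extra stack-manipulation instructions, the shared prefix gets recomputed:

; file1 = root / "data" / "2024" / "a.csv"
PUSH root
PUSH "data"
JOIN
PUSH "2024"
JOIN           ; root/data/2024
PUSH "a.csv"
JOIN           ; file1
; file2 = root / "data" / "2024" / "b.csv"
PUSH root
PUSH "data"
JOIN
PUSH "2024"
JOIN           ; root/data/2024 again ← recomputed
PUSH "b.csv"
JOIN           ; file2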
Slot count is known at compile time. We can pre-allocate the exact vector size:
let mut slots: Vec<PathBuf> = vec![PathBuf::new(); compiled.num_slots as usize];
No dynamic resizing during execution.
The instruction stream is linear. No branching, no control flow — just straight-line code. This is the simplest possible "machine" to execute.
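A sketch of that executor; the instruction shapes, and the convention that the result ends up in the last slot, are assumptions here:

use std::path::PathBuf;

enum Instruction {
    Load { dst: usize, value: PathBuf },
    Join { dst: usize, left: usize, right: usize },
    Parent { dst: usize, src: usize },
}

fn execute(instrs: &[Instruction], num_slots: usize) -> PathBuf {
    let mut slots: Vec<PathBuf> = vec![PathBuf::new(); num_slots];
    for ins in instrs {
        match ins {
            Instruction::Load { dst, value } => slots[*dst] = value.clone(),
            Instruction::Join { dst, left, right } => {
                slots[*dst] = slots[*left].join(&slots[*right])
            }
            Instruction::Parent { dst, src } => {
                // Path::parent() returns None at the root; keep the path as-is there
                slots[*dst] = slots[*src]
                    .parent()
                    .map(|p| p.to_path_buf())
                    .unwrap_or_else(|| slots[*src].clone())
            }
        }
    }
    // assumes the compiler wrote the result into the last slot
    slots.pop().expect("non-empty program")
}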
Static Single Assignment
Our slot allocation has an interesting property: each slot is written exactly once. This is Static Single Assignment (SSA) form, a standard compiler IR.
SSA makes dataflow explicit. If you see r4 used somewhere, you know exactly where it was defined — there's only one definition. This simplifies analysis and optimisation.
Real compilers convert into SSA (introducing φ-functions at control-flow merges), optimise, then convert back out (eliminating the φ-functions and running register allocation to reduce the slot count). We don't need the conversion back: our IR stays in SSA because path expressions have no control flow.
Slot allocation
Compiling an expression tree to register code uses a recursive algorithm:
fn compile_expr(&mut self, node: ExprId) -> SlotId {
    // Memoisation: if we've compiled this node, return its slot
    if let Some(slot) = self.cache.get(&node) {
        return *slot;
    }
    let slot = match self.arena.get(node) {
        Expr::Literal(v) => {
            let s = self.alloc_slot();
            self.emit(Instruction::Load { dst: s, value: v });
            s
        }
        Expr::BinaryOp { left, right, op } => {
            let l = self.compile_expr(left); // compile children first
            let r = self.compile_expr(right);
            let s = self.alloc_slot();
            self.emit(Instruction::BinOp { dst: s, left: l, right: r, op });
            s
        }
        // ...
    };
    self.cache.insert(node, slot);
    slot
}
The cache is the CSE: if a node was already compiled, we just return its slot. Otherwise, compile children (getting their slots), allocate a new slot for this node, emit an instruction, cache the result.
This is essentially post-order traversal with memoisation.
Could we reduce slot count?
Currently we allocate a fresh slot for every expression node. In practice, many slots are dead after their last use. In the join example above, r0 and r1 aren't needed after r2 is computed.
Liveness analysis determines when each slot's value is last used. Register allocation then maps the unbounded SSA slots to a smaller set of physical registers, reusing slots whose lifetimes don't overlap.
For paths, this optimisation probably isn't critical as we're not register-starved like a CPU, but if memory pressure mattered (for huge batch resolutions), you could apply linear scan allocation to compress the slot vector.
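If we ever wanted to, the liveness half is only a few lines; a sketch, reusing the Instruction shape from the executor sketch above:

use std::collections::HashMap;

// record the index of the last instruction that reads each slot;
// past that point, the slot could be recycled by a register allocator
fn last_uses(instrs: &[Instruction]) -> HashMap<usize, usize> {
    let mut last: HashMap<usize, usize> = HashMap::new();
    for (i, ins) in instrs.iter().enumerate() {
        match ins {
            Instruction::Join { left, right, .. } => {
                last.insert(*left, i);
                last.insert(*right, i);
            }
            Instruction::Parent { src, .. } => {
                last.insert(*src, i);
            }
            Instruction::Load { .. } => {}
        }
    }
    last
}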
Putting it all together
Here's how these pieces compose for path expressions: a Python expression builds an AST in an arena, term rewriting reduces it to a normal form, compilation emits register code in SSA form, and a slot-based interpreter executes it.
Each stage involves a different core concept:
- ASTs (tree data structures), which we store in arenas for cache locality
- Term rewriting, a framework for provably-correct transformations
- Register machines, an execution model with good performance characteristics
- SSA, a clean dataflow model for analysis
Avoiding surplus requirements
It's also worth noting the machinery we're not using:
- Parser generators: our DSL is Python method chaining, with no grammar to parse.
- E-graphs: our rule set is small and confluent, so iterative rewriting suffices.
- Datalog: declarative rules suit hundreds of interacting optimisations; we have only three.
- Control flow: paths are pure expressions, with no branches, loops, or φ-functions.
- Garbage collection: arenas free everything at once.
- JIT compilation: our instruction interpreter is fast enough.
Good engineering is as much about knowing when not to reach for the power tools. That said, we might eventually want heavier machinery:
- If we add a text-based pattern syntax, though that would defeat the point of the current design (where Python t-strings handle the parsing)
- If rules proliferate and interact, which is where you'd use e-graphs or Datalog to manage the combinatorial explosion.
- If we want cost-based optimisation, again e-graphs would let you extract based on custom cost models. You might imagine a scenario where different backends (local fs, S3, ...) have different performance characteristics.
- If we batch-resolve millions of paths, vectorised execution (SIMD string operations) might justify more complex code generation, but that's not really the goal here.
- If we integrate with filesystem operations, pursuing the earlier ideas around effect systems, capability tracking, and static analysis of IO patterns.
The architecture outlined above supports extension and gives us a clear theoretical vocabulary to work with. Which equipment to reach for beyond that comes down to our goals and their constraints.
In part 5, I walk through the actual implementation of arena allocation in Rust, the pattern-matching optimiser, the slot-based executor, and the pyo3 bridge to Python.