Compilation Theory

Term rewriting, register machines, and optimisation strategies

In part 3 we touched some real code, using template strings to demo a working symbolic Path (still eager, though). Now it's time to consider how to build a genuinely deferred, late-binding path, one that avoids allocating intermediates unnecessarily when combining and modifying paths.

Before diving in, it's worth understanding the theoretical machinery at hand. When you build an expression tree and transform it into something executable, you're doing compilation (specifically, we plan on doing some term rewriting). The right conceptual framing lifts us above operational thinking (the steps a machine takes) into the abstract, mathematical realm; the wrong choices can easily paint us into a corner.

This post splits into three broad parts: first we'll look at term rewriting (which you can think of as query optimisation), then execution models (stack machines vs register machines), and lastly we'll consider how much machinery we actually need for our problem.

Term Rewriting Systems

At its core, an optimiser is a term rewriting system. You have terms (nodes in your expression tree) and rules that transform them. A rule has a pattern (left-hand side) and a replacement (right-hand side):

Parent(Join(a, b)) \rightarrow a

Join(Literal(x), Literal(y)) \rightarrow Literal(x.join(y))

WithSuffix(WithSuffix(a, \_), s) \rightarrow WithSuffix(a, s)

You apply rules until no more apply. What remains is the normal form, the fully reduced expression.
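To make that concrete, here's a minimal sketch of those three rules as pattern matching in Rust. The Expr enum and the "/"-joining of literals are invented for illustration; part 5 covers the real implementation:

// A hypothetical Expr enum; boxed children keep the sketch self-contained
#[derive(Clone)]
enum Expr {
    Literal(String),
    Join(Box<Expr>, Box<Expr>),
    Parent(Box<Expr>),
    WithSuffix(Box<Expr>, String),
}

// Try each rule once; return Some(rewritten) if any pattern matched
fn rewrite_step(e: &Expr) -> Option<Expr> {
    match e {
        // Parent(Join(a, b)) -> a
        Expr::Parent(inner) => match &**inner {
            Expr::Join(a, _b) => Some((**a).clone()),
            _ => None,
        },
        // Join(Literal(x), Literal(y)) -> Literal(x.join(y))
        Expr::Join(l, r) => match (&**l, &**r) {
            (Expr::Literal(x), Expr::Literal(y)) => {
                Some(Expr::Literal(format!("{x}/{y}")))
            }
            _ => None,
        },
        // WithSuffix(WithSuffix(a, _), s) -> WithSuffix(a, s)
        Expr::WithSuffix(inner, s) => match &**inner {
            Expr::WithSuffix(a, _) => Some(Expr::WithSuffix(a.clone(), s.clone())),
            _ => None,
        },
        _ => None,
    }
}

A driver would call rewrite_step repeatedly (on the root and recursively on children) until it returns None everywhere; what's left is the normal form.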

This framework comes from mathematical logic and has been studied extensively. The key properties we care about are:

1. Termination: whether the rewriting process always halts

If your rules can create terms that match other rules in cycles, you loop forever. For path expressions, our rules only shrink or simplify terms, so termination is straightforward.

2. Confluence: whether rule order matters

A system is confluent (or has the Church-Rosser property) if different reduction orders yield the same normal form. Our path rules are confluent. Folding constants before or after cancelling parent-joins gives the same result.
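For example, take Parent(Join(Literal("x"), Literal("y"))) and assume a third rule that folds Parent over a literal path. Cancelling first or folding first converges on the same normal form:

Parent(Join(Literal(\text{"x"}), Literal(\text{"y"}))) \rightarrow Literal(\text{"x"})

Parent(Join(Literal(\text{"x"}), Literal(\text{"y"}))) \rightarrow Parent(Literal(\text{"x/y"})) \rightarrow Literal(\text{"x"})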

3. Completeness: whether the normal form represents the "best" version

This is where it gets interesting. Consider:

a/"x"/"y" a / \text{"x"} / \text{"y"}

vs \text{vs}

a/"x/y" a / \text{"x/y"}

Both are valid, both might be normal forms depending on your rules. Which is "better" depends on whether you're optimising for instruction count, string allocations, or something else.

How Polars does it

Polars implements rewriting through pattern matching on enum variants. From their optimiser code (simplified from polars-plan/src/plans/optimizer/), you can see the structure:

pub trait OptimizationRule {
    /// Return Some(new_plan) if the rule applied, None otherwise
    fn optimize_plan(&self, plan: &IR, expr_arena: &mut Arena<AExpr>) -> Option<IR>;
}

struct PredicatePushdown;
impl OptimizationRule for PredicatePushdown {
    fn optimize_plan(&self, plan: &IR, expr_arena: &mut Arena<AExpr>) -> Option<IR> {
        match plan {
            IR::Filter { input, predicate } => {
                // Try to push the filter down into the input;
                // return Some(rewritten) on success
                // ...
                None
            }
            _ => None,
        }
    }
}

The optimiser loops until it reaches a fixed point:

// simplified version of `optimizer_loop`
fn optimize(
    plan: IR,
    rules: &[Box<dyn OptimizationRule>],
    expr_arena: &mut Arena<AExpr>,
) -> IR {
    let mut current = plan;
    let mut changed = true;
    while changed {
        changed = false;
        for rule in rules {
            if let Some(new_plan) = rule.optimize_plan(&current, expr_arena) {
                current = new_plan;
                changed = true;
            }
        }
    }
    current
}

This is classic iterative rewriting. Its simplicity makes it easy to reason about and debug: each rule is self-contained, and to add new optimisations you add new rules.

The E-Graph Alternative

There's a more sophisticated approach: equality saturation with e-graphs. Instead of committing to one rewrite at a time, you represent all equivalent forms simultaneously.

An e-graph (equivalence graph) groups terms into equivalence classes. When you apply a rewrite rule, you don't replace — you add the result to the same equivalence class. The graph grows to represent all terms reachable by any sequence of rewrites.

a/"x"/"y"a/"x/y" a / \text{"x"} / \text{"y"} \equiv a / \text{"x/y"}

After saturation (no new equivalents can be derived), you extract the best term according to a cost function, like the smallest, fastest, or some other metric.

The egg library in Rust pioneered accessible e-graph implementations; it's used in projects like the Herbie floating-point expression rewriter.
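As a flavour of the API, here's roughly what our parent-of-join rule could look like in egg. This is a sketch based on egg's documented macros; the PathLang language and the s-expression terms are invented for illustration:

use egg::{define_language, rewrite, AstSize, Extractor, Id, RecExpr, Rewrite, Runner, Symbol};

define_language! {
    enum PathLang {
        "join" = Join([Id; 2]),
        "parent" = Parent(Id),
        Symbol(Symbol),
    }
}

fn main() {
    // Parent(Join(a, b)) -> a, as an egg rewrite
    let rules: Vec<Rewrite<PathLang, ()>> =
        vec![rewrite!("parent-of-join"; "(parent (join ?a ?b))" => "?a")];

    let expr: RecExpr<PathLang> = "(parent (join root data))".parse().unwrap();

    // Saturate the e-graph, then extract the smallest equivalent term
    let runner = Runner::default().with_expr(&expr).run(&rules);
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("{best} (cost {cost})"); // root (cost 1)
}

Note the shape: rewrites only ever add to the e-graph, and the cost function (AstSize here) is applied once, at extraction time.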

For path expressions, e-graphs are probably overkill. Our term language is small, our rules are few, and the "obvious" rewrites are unambiguous. But if you find yourself with conflicting optimisation goals (minimise allocations vs minimise instructions vs enable parallelism), e-graphs let you defer the choice until you know the cost model.

Datalog for Declarative Rules

Another approach, common in database query optimisers, is encoding rules in Datalog, a logic programming language for expressing recursive queries over relations.

% A node is constant if all its children are constant
constant(N) :- literal(N).
constant(N) :- join(N, L, R), constant(L), constant(R).
constant(N) :- parent(N, C), constant(C).

% Foldable nodes can be evaluated at compile time
foldable(N) :- constant(N).

You express what constitutes an optimisation opportunity declaratively; the Datalog engine figures out how to find all matches. This scales better than hand-written pattern matching when you have many rules with complex interactions.

Systems like Soufflé compile Datalog to C++. CockroachDB's optimiser uses a Datalog-like DSL called Optgen. Again, for our scale this is heavy machinery, but it's where you'd go if the rule set grew complex.

Do we need a parser?

I tried out pest last year and found it very accessible. So where does parsing fit into our path-juggling work here?

Parser generators (pest, nom, lalrpop, tree-sitter) transform text into structured data. You need them when your input is a string in some grammar:

"/home/user/data/*.csv"AST \text{"/home/{user}/data/*.csv"} \rightarrow AST

We don't have this problem. Our "DSL" is embedded in Python:

root / "data" / template

The Python interpreter parses this. The / operator calls __truediv__, which builds our AST directly. There's no text-to-tree step for us to implement.
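The same trick works in any language with operator overloading. A sketch of the idea in Rust, with the types invented for illustration:

use std::ops::Div;

#[derive(Debug, Clone)]
enum Expr {
    Literal(String),
    Join(Box<Expr>, Box<Expr>),
}

// Overloading `/` builds a tree node instead of touching the filesystem,
// mirroring what `__truediv__` does on the Python side
impl Div<&str> for Expr {
    type Output = Expr;
    fn div(self, rhs: &str) -> Expr {
        Expr::Join(Box::new(self), Box::new(Expr::Literal(rhs.to_string())))
    }
}

fn main() {
    let root = Expr::Literal("/home".into());
    let tree = root / "data" / "2024";
    // Join(Join(Literal("/home"), Literal("data")), Literal("2024"))
    println!("{tree:?}");
}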

Polars works the same way. When you write:

df.filter(pl.col("x") > 5).select(pl.col("y"))

You're calling Python methods that construct a logical plan. Polars doesn't parse a query string (well, they have pl.sql() now, but that's separate).

So when would we need a parser?

If you wanted a text-based path pattern language:

pattern = parse_path_pattern("/data/{dataset}/chunk_{idx:04d}.parquet")

Then yes, you'd need a grammar and a parser. Pest (a nice, ergonomic PEG parser generator) would be reasonable for this, but it'd be adding a new user-facing syntax, not changing the core architecture.

For now, Python is our parser and the expression tree falls out of normal method chaining.

Execution Models: Stack vs Register Machines

Once you have an optimised IR, you need to execute it. Two classic approaches: stack machines and register machines.

Stack Machines

A stack machine has no named registers. Operations pop operands from a stack and push results back.

; Compute: (a / b).parent

PUSH a        ; stack: [a]
PUSH b        ; stack: [a, b]
JOIN          ; pop b, pop a, push join(a,b)  stack: [a/b]
PARENT        ; pop a/b, push parent(a/b)  stack: [a]

Advantages:

- Compact code: operands are implicit, so instructions carry no slot indices
- A trivial interpreter: one stack, one dispatch loop
- No register allocation to implement

The JVM and CPython bytecode are stack machines. So is WebAssembly (mostly).

For expression evaluation, stack machines are natural. Your expression is a tree; post-order traversal directly yields stack code.
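Executing that stack code could look something like this sketch (opcode names invented; a real interpreter would handle errors rather than unwrap):

use std::path::PathBuf;

enum StackOp {
    Push(String),
    Join,   // pop b, pop a, push a.join(b)
    Parent, // pop a, push a's parent
}

fn run(ops: &[StackOp]) -> PathBuf {
    let mut stack: Vec<PathBuf> = Vec::new();
    for op in ops {
        match op {
            StackOp::Push(s) => stack.push(PathBuf::from(s)),
            StackOp::Join => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a.join(b));
            }
            StackOp::Parent => {
                let a = stack.pop().unwrap();
                stack.push(a.parent().map(|p| p.to_path_buf()).unwrap_or_default());
            }
        }
    }
    stack.pop().unwrap() // the fully evaluated path
}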

Register Machines

A register machine has named slots (registers). Operations specify which slots to read and write.

; Compute: (a / b).parent

LOAD r0, a          ; r0 = a
LOAD r1, b          ; r1 = b
JOIN r2, r0, r1     ; r2 = join(r0, r1)
PARENT r3, r2       ; r3 = parent(r2)
; result in r3

Advantages:

- Fewer instructions: no push/pop shuffling to move operands into place
- Values stay addressable after use, so shared subexpressions can be referenced instead of recomputed
- A closer match to how real hardware works

Lua 5.0's shift from stack to register machine improved performance significantly. Modern JavaScript engines (V8) use register-based bytecode internally.

Why a Register Machine for Paths?

Several reasons:

Common subexpression elimination (CSE) falls out naturally. If two expressions share a subpath, you compute it once into a slot and reference that slot twice:

; file1 = root / "data" / "2024" / "a.csv"
; file2 = root / "data" / "2024" / "b.csv"

LOAD   r0, root
LOAD   r1, "data"
JOIN   r2, r0, r1        ; r2 = root/data
LOAD   r3, "2024"
JOIN   r4, r2, r3        ; r4 = root/data/2024   computed once
LOAD   r5, "a.csv"
JOIN   r6, r4, r5        ; file1
LOAD   r7, "b.csv"
JOIN   r8, r4, r7        ; file2  reuses r4

In a stack machine, you'd either recompute or manually manage duplication.

Slot count is known at compile time. We can pre-allocate the exact vector size:

let mut slots: Vec<PathBuf> = vec![PathBuf::new(); compiled.num_slots as usize];

No dynamic resizing during execution.

The instruction stream is linear. No branching, no control flow — just straight-line code. This is the simplest possible "machine" to execute.
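Concretely, the whole executor can be a single loop over a pre-sized slot vector. A sketch, assuming the invented instruction set from the listings above:

use std::path::PathBuf;

enum Instruction {
    Load { dst: usize, value: String },
    Join { dst: usize, left: usize, right: usize },
    Parent { dst: usize, src: usize },
}

fn execute(instrs: &[Instruction], num_slots: usize) -> PathBuf {
    // Pre-allocated, fixed-size slot vector: no resizing during execution
    let mut slots = vec![PathBuf::new(); num_slots];
    let mut last = 0;
    for ins in instrs {
        match ins {
            Instruction::Load { dst, value } => {
                slots[*dst] = PathBuf::from(value);
                last = *dst;
            }
            Instruction::Join { dst, left, right } => {
                slots[*dst] = slots[*left].join(&slots[*right]);
                last = *dst;
            }
            Instruction::Parent { dst, src } => {
                slots[*dst] = slots[*src].parent().map(|p| p.to_path_buf()).unwrap_or_default();
                last = *dst;
            }
        }
    }
    slots.swap_remove(last) // the result is the last slot written
}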

Static Single Assignment

Our slot allocation has an interesting property: each slot is written exactly once. This is Static Single Assignment (SSA) form, a standard compiler IR.

SSA makes dataflow explicit. If you see r4 used somewhere, you know exactly where it was defined — there's only one definition. This simplifies analysis and optimisation.

Real compilers convert to SSA (inserting φ-functions at control-flow merges), optimise, then convert back out of it, with register allocation reducing the slot count. We don't need the conversion back: our IR stays in SSA because path expressions have no control flow.

Slot allocation

Compiling an expression tree to register code uses a recursive algorithm:

fn compile_expr(&mut self, node: ExprId) -> SlotId {
    // Memoisation: if we've already compiled this node, return its slot
    if let Some(slot) = self.cache.get(&node) {
        return *slot;
    }

    // Clone the node out of the arena so we can borrow `self` mutably below
    let slot = match self.arena.get(node).clone() {
        Expr::Literal(v) => {
            let s = self.alloc_slot();
            self.emit(Instruction::Load { dst: s, value: v });
            s
        }
        Expr::BinaryOp { left, right, op } => {
            let l = self.compile_expr(left);   // compile children first
            let r = self.compile_expr(right);
            let s = self.alloc_slot();
            self.emit(Instruction::BinOp { dst: s, left: l, right: r, op });
            s
        }
        // ...
    };

    self.cache.insert(node, slot);
    slot
}

The cache is the CSE: if a node was already compiled, we just return its slot. Otherwise, compile children (getting their slots), allocate a new slot for this node, emit an instruction, cache the result.

This is essentially post-order traversal with memoisation.

Could we reduce slot count?

Currently we allocate a fresh slot for every expression node. In practice, many slots are dead after their last use: in the join example above, r0 and r1 aren't needed once r2 is computed.

Liveness analysis determines when each slot's value is last used. Register allocation then maps the unbounded SSA slots to a smaller set of physical registers, reusing slots whose lifetimes don't overlap.

For paths, this optimisation probably isn't critical as we're not register-starved like a CPU, but if memory pressure mattered (for huge batch resolutions), you could apply linear scan allocation to compress the slot vector.
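If we did want it, a compacting pass could look something like this simplified linear scan. It works over a stripped-down version of the invented instruction set (operands elided) and assumes SSA order, i.e. every slot is written before any use:

use std::collections::HashMap;

enum Instr {
    Load { dst: usize },                             // source value elided
    Join { dst: usize, left: usize, right: usize },
}

/// Remap SSA slots onto a smaller set of physical slots, recycling slots
/// whose values are dead. Returns the number of physical slots needed.
fn compress(instrs: &mut [Instr]) -> usize {
    // 1. Liveness: the last instruction index that reads each slot
    let mut last_use: HashMap<usize, usize> = HashMap::new();
    for (i, ins) in instrs.iter().enumerate() {
        if let Instr::Join { left, right, .. } = ins {
            last_use.insert(*left, i);
            last_use.insert(*right, i);
        }
    }

    // 2. Linear scan: walk forward, freeing slots at their last use
    let mut map: HashMap<usize, usize> = HashMap::new(); // SSA slot -> physical slot
    let mut free: Vec<usize> = Vec::new();
    let mut next = 0;
    for i in 0..instrs.len() {
        // Remap source operands first; a slot freed here can be reused as
        // this instruction's destination (reads happen before the write)
        if let Instr::Join { left, right, .. } = &mut instrs[i] {
            for src in [left, right] {
                let phys = map[&*src];
                if last_use[&*src] == i && !free.contains(&phys) {
                    free.push(phys);
                }
                *src = phys;
            }
        }
        // Allocate (or recycle) a physical slot for the destination
        let phys = free.pop().unwrap_or_else(|| {
            next += 1;
            next - 1
        });
        match &mut instrs[i] {
            Instr::Load { dst } | Instr::Join { dst, .. } => {
                map.insert(*dst, phys);
                *dst = phys;
            }
        }
    }
    next
}

On the two-file example above, this would bring nine slots down to a handful, since r0, r1, and the filename slots all die immediately after their joins.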

Putting it all together

Here's how these pieces compose for path expressions:

1. Python API → expression tree: root / "data" / tmpl builds a tree of Join, Param, and Lit nodes.

2. Arena allocation: the tree is stored contiguously (0: Param "root", 1: Lit "data", 2: Join L:0 R:1, 3: Param "tmpl", 4: Join L:2 R:3).

3. Term rewriting: apply pattern-based rules, iterating to a fixed point (e.g. Parent(Join(a, b)) rewrites to a).

4. Instruction selection: turn the tree into register machine code (LOAD r0, root; LOAD r1, "data"; JOIN r2, r0, r1; LOAD r3, tmpl; JOIN r4, r2, r3).

5. Execution engine: a linear scan through the instructions.

Each stage involves a different core concept: operator overloading builds the tree, arena allocation gives data-oriented storage, term rewriting normalises it, SSA slot allocation selects instructions, and a register machine executes them.

Avoiding surplus requirements

It's also worth noting the machinery we're not using: no parser (Python's operator overloading builds the tree for us), no e-graphs (our rules are few and confluent), no Datalog engine (hand-written pattern matching suffices), and no register allocation (fresh SSA slots are cheap at this scale).

Good engineering is as much about knowing when not to reach for the power tools. That said, we might eventually want heavier machinery: e-graphs if we develop conflicting optimisation goals and need a cost model, a Datalog-style rule engine if the rule set grows complex, a real parser if we ever add a text-based pattern language, and register allocation if batch resolutions make slot memory matter.

The architecture outlined above supports those extensions, and we now have a clear theoretical vocabulary to work with. Which equipment to reach for beyond that comes down to our goals and their constraints.

In part 5, I'll walk through the actual implementation: arena allocation in Rust, the pattern-matching optimiser, the slot-based executor, and the pyo3 bridge to Python.