Lazy Execution

From expression trees to compiled path plans

In parts 1 and 2 I sketched what richer path types might look like: semantic annotations, late binding, connectivity as a first-class concern. The Python proof-of-concept I put together showed that you can build expression trees for paths and defer their resolution. But there's a gap between "deferred" and "efficient".

The proof-of-concept walks the tree at resolution time. Every resolve() call traverses the full structure, allocates intermediate Path objects, and throws them away. For a one-off path, that's fine. For resolving thousands of paths in a data pipeline — the exact use case where late binding shines — it's leaving performance on the table.

This got me thinking about Polars.

The Polars Model

Polars doesn't execute DataFrame operations immediately. When you write df.filter(...).select(...).group_by(...), you're building a logical plan — a tree of operations that describes what you want. Only when you call .collect() does Polars compile that plan into an optimised physical execution.

The architecture has four stages:

  1. DSL parsing: User code builds Expr objects (the logical plan)
  2. IR conversion: The DSL gets lowered into arena-allocated AExpr nodes
  3. Optimisation: Passes like predicate pushdown and constant folding transform the IR
  4. Physical execution: The optimised IR compiles to executable operations

Separating what users write from what executes is what makes the model genuinely lazy. The intermediate representation exists precisely so it can be transformed.

Path expressions have a similar structure. root / "data" / template is a tree of operations. Why not apply the same treatment?

Arena Allocation

The first thing Polars does differently from naive tree-building is arena allocation. Instead of each node being a heap-allocated Box<Expr> that points to other Box<Expr> nodes scattered across memory, Polars stores all nodes in a contiguous Vec and references them by index.

use std::num::NonZeroU32;

#[derive(Clone, Copy)]
pub struct PathExprId(NonZeroU32);

impl PathExprId {
    // Ids are 1-based so the niche optimisation applies;
    // subtract one to recover the Vec index.
    fn index(self) -> usize {
        self.0.get() as usize - 1
    }
}

pub struct PathExprNode {
    expr: PathExpr,
    is_pure: bool,
}

pub struct PathExprArena {
    nodes: Vec<PathExprNode>,
}

impl PathExprArena {
    pub fn alloc(&mut self, expr: PathExpr) -> PathExprId {
        let idx = self.nodes.len();
        self.nodes.push(PathExprNode { expr, is_pure: false });
        PathExprId(NonZeroU32::new((idx + 1) as u32).unwrap())
    }

    pub fn get(&self, id: PathExprId) -> &PathExpr {
        &self.nodes[id.index()].expr
    }
}

A PathExprId is 4 bytes. An Arc<PathExpr> is 8 bytes of pointer plus reference count overhead plus indirection. More importantly, arena nodes live contiguously in memory: when the optimiser walks the tree, it's walking a cache-friendly array, not chasing pointers across the heap.
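The 4-byte claim relies on Rust's niche optimisation: because the index is stored as a NonZeroU32, the compiler can use zero to encode None, so even Option<PathExprId> costs nothing extra. A minimal check (a standalone copy of the ID type, nothing else assumed):

```rust
use std::num::NonZeroU32;

// A copy of the ID type from above, stripped to its representation.
#[derive(Clone, Copy)]
pub struct PathExprId(NonZeroU32);

fn main() {
    // NonZeroU32 can never be 0, so the compiler uses 0 as the
    // niche encoding for None: the Option adds no bytes.
    assert_eq!(std::mem::size_of::<PathExprId>(), 4);
    assert_eq!(std::mem::size_of::<Option<PathExprId>>(), 4);
}
```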

This matters less for building one path than for building thousands. Late-bound path templates are specifically designed for the "build once, resolve many" pattern.

The Expression IR

The PathExpr enum mirrors the Python proof-of-concept's class hierarchy, but flattened into a single discriminated union:

pub enum PathExpr {
    Literal(PathBuf),
    Param(SmolStr),
    Join { left: PathExprId, right: PathExprId },
    Parent(PathExprId),
    WithSuffix { base: PathExprId, suffix: SmolStr },
    WithName { base: PathExprId, name: SmolStr },
    Stem(PathExprId),
    Template { format: SmolStr, params: Vec<SmolStr> },
}

Notice the recursive cases (Join, Parent, etc.) store PathExprId indices, not boxed children. The arena owns the nodes; the expression tree is just a graph of indices into that arena.
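To see what "a graph of indices" means concretely, here is a self-contained sketch with the enum stripped down to three variants (Literal holds a String rather than a PathBuf, purely to keep the example small):

```rust
use std::num::NonZeroU32;

#[derive(Clone, Copy, Debug)]
pub struct PathExprId(NonZeroU32);

impl PathExprId {
    fn index(self) -> usize {
        self.0.get() as usize - 1 // ids are 1-based
    }
}

#[derive(Debug)]
pub enum PathExpr {
    Literal(String),
    Param(String),
    Join { left: PathExprId, right: PathExprId },
}

#[derive(Default)]
pub struct PathExprArena {
    nodes: Vec<PathExpr>,
}

impl PathExprArena {
    fn alloc(&mut self, expr: PathExpr) -> PathExprId {
        self.nodes.push(expr);
        PathExprId(NonZeroU32::new(self.nodes.len() as u32).unwrap())
    }
    fn get(&self, id: PathExprId) -> &PathExpr {
        &self.nodes[id.index()]
    }
}

fn main() {
    let mut arena = PathExprArena::default();
    // root / "data": three nodes, all in one contiguous Vec
    let root = arena.alloc(PathExpr::Param("root".into()));
    let data = arena.alloc(PathExpr::Literal("data".into()));
    let join = arena.alloc(PathExpr::Join { left: root, right: data });

    assert_eq!(arena.nodes.len(), 3);
    // The "tree" is just indices: the Join's children sit at slots 0 and 1.
    match arena.get(join) {
        PathExpr::Join { left, right } => {
            assert_eq!(left.index(), 0);
            assert_eq!(right.index(), 1);
        }
        _ => unreachable!(),
    }
}
```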

Each variant maps to one of the Python proof-of-concept's operations: / builds a Join, .parent wraps in Parent, .with_suffix() in WithSuffix, and so on.

The Python side builds these via pyo3 bindings. When you write root / "data", you're allocating nodes in a Rust arena behind an Arc<RwLock<PathExprArena>>.

Optimisation Passes

With an arena-based IR, optimisation becomes a matter of tree surgery. You can replace nodes, collapse subtrees, and rewrite patterns, all by index manipulation rather than pointer juggling.

The optimiser runs rules until reaching a fixed point:

pub fn optimize(arena: &mut PathExprArena, root: PathExprId) -> PathExprId {
    let rules: Vec<Box<dyn OptimizationRule>> = vec![
        Box::new(ConstantFolding),
        Box::new(ParentJoinCancellation),
        Box::new(SuffixChainCollapse),
    ];

    let mut changed = true;
    while changed {
        changed = false;
        for idx in 0..arena.nodes.len() {
            let id = PathExprId::new(idx); // new() applies the 1-based offset
            for rule in &rules {
                if let Some(new_expr) = rule.optimize(arena, id) {
                    arena.replace(id, new_expr);
                    changed = true;
                }
            }
        }
    }
    root
}

Constant Folding

If both sides of a Join are Literal, collapse them:

// Before: Join(Literal("/home"), Literal("user"))
// After:  Literal("/home/user")

PathExpr::Join { left, right } => {
    if let (PathExpr::Literal(l), PathExpr::Literal(r)) =
        (arena.get(*left), arena.get(*right))
    {
        return Some(PathExpr::Literal(l.join(r)));
    }
    None
}

This is the path equivalent of constant propagation in a compiler. Any static segments get pre-joined at compile time. Only the parameterised parts remain as runtime work.
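To make the effect concrete, here is the same fold written as a standalone recursive function over a boxed expression type (a stripped-down sketch; the real rule operates on arena nodes by index):

```rust
use std::path::PathBuf;

// A stripped-down expression type: just enough to show the fold.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(PathBuf),
    Param(String),
    Join(Box<Expr>, Box<Expr>),
}

// If both sides of a Join are literals, pre-join them now so no
// work remains at resolve time.
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Join(l, r) => {
            let (l, r) = (fold(*l), fold(*r));
            if let (Expr::Literal(a), Expr::Literal(b)) = (&l, &r) {
                Expr::Literal(a.join(b))
            } else {
                Expr::Join(Box::new(l), Box::new(r))
            }
        }
        other => other,
    }
}

fn main() {
    // Join(Literal("/home"), Literal("user")) → Literal("/home/user")
    let e = Expr::Join(
        Box::new(Expr::Literal(PathBuf::from("/home"))),
        Box::new(Expr::Literal(PathBuf::from("user"))),
    );
    assert_eq!(fold(e), Expr::Literal(PathBuf::from("/home/user")));

    // A parameterised side blocks the fold; it stays as runtime work.
    let p = Expr::Join(
        Box::new(Expr::Param("root".into())),
        Box::new(Expr::Literal(PathBuf::from("data"))),
    );
    assert!(matches!(fold(p), Expr::Join(_, _)));
}
```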

Parent-Join Cancellation

Parent(Join(a, b)) is algebraically equivalent to a (assuming b is a single path component). This pattern shows up when you take .parent after building a path:

// expr = root / "data" / "file.txt"
// expr.parent → should be root / "data", not a runtime computation

PathExpr::Parent(child) => {
    if let PathExpr::Join { left, .. } = arena.get(*child) {
        return Some(arena.get(*left).clone());
    }
    None
}

The optimiser recognises the structure and eliminates the runtime .parent() call entirely.

Suffix Chain Collapse

Multiple .with_suffix() calls collapse to just the last one:

// path.with_suffix(".tmp").with_suffix(".json") → path.with_suffix(".json")

PathExpr::WithSuffix { base, suffix } => {
    if let PathExpr::WithSuffix { base: inner_base, .. } = arena.get(*base) {
        return Some(PathExpr::WithSuffix {
            base: *inner_base,
            suffix: suffix.clone(),
        });
    }
    None
}

This catches a common pattern in pipeline code where you might derive intermediate filenames before settling on a final extension.
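The rewrite is sound because with_suffix replaces the extension rather than appending to it, so only the last call in a chain can matter. The same semantics can be checked against std's PathBuf::set_extension:

```rust
use std::path::PathBuf;

fn main() {
    // set_extension replaces the extension, so chained calls
    // collapse to the last one...
    let mut chained = PathBuf::from("chunk.parquet");
    chained.set_extension("tmp");
    chained.set_extension("json");

    // ...and give the same result as setting it once.
    let mut direct = PathBuf::from("chunk.parquet");
    direct.set_extension("json");

    assert_eq!(chained, direct);
    assert_eq!(chained, PathBuf::from("chunk.json"));
}
```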

Physical Execution

After optimisation, the expression tree compiles to a linear sequence of instructions operating on a fixed-size slot array:

pub enum PathInstruction {
    LoadLiteral { slot: SlotId, path: PathBuf },
    LoadParam { slot: SlotId, param: SmolStr },
    Join { dst: SlotId, left: SlotId, right: SlotId },
    Parent { dst: SlotId, src: SlotId },
    WithSuffix { dst: SlotId, src: SlotId, suffix: SmolStr },
    // ...
}

pub struct CompiledPath {
    instructions: Vec<PathInstruction>,
    result_slot: SlotId,
    num_slots: u16,
    required_params: Vec<SmolStr>,
}

This is a register machine for paths. The compiler walks the (optimised) expression tree once, emitting instructions and allocating slots:

fn compile_expr(&mut self, arena: &PathExprArena, id: PathExprId) -> SlotId {
    // Check cache first (common subexpression elimination)
    if let Some(&slot) = self.node_slots.get(&id.index()) {
        return slot;
    }

    let slot = match arena.get(id).clone() {
        PathExpr::Literal(path) => {
            let slot = self.alloc_slot();
            self.instructions.push(PathInstruction::LoadLiteral { slot, path });
            slot
        }
        PathExpr::Join { left, right } => {
            let left_slot = self.compile_expr(arena, left);
            let right_slot = self.compile_expr(arena, right);
            let dst = self.alloc_slot();
            self.instructions.push(PathInstruction::Join {
                dst, left: left_slot, right: right_slot
            });
            dst
        }
        // ...
    };

    self.node_slots.insert(id.index(), slot);
    slot
}

The cache lookup (node_slots) provides common subexpression elimination for free. If two branches of your expression reference the same subpath, it's computed once and reused.

Execution is then a simple loop:

pub fn execute(&self, bindings: &HashMap<SmolStr, PathBuf>) -> Result<PathBuf, PathForgeError> {
    let mut slots: Vec<PathBuf> = vec![PathBuf::new(); self.num_slots as usize];

    for instr in &self.instructions {
        match instr {
            PathInstruction::LoadLiteral { slot, path } => {
                slots[slot.0 as usize] = path.clone();
            }
            PathInstruction::Join { dst, left, right } => {
                slots[dst.0 as usize] = slots[left.0 as usize].join(&slots[right.0 as usize]);
            }
            // ...
        }
    }

    Ok(slots[self.result_slot.0 as usize].clone())
}

No recursion, no tree traversal: execution is a linear scan through the instructions, reading and writing slots. The number of slots and instructions is known at compile time.
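The whole physical layer can be sketched end to end in a few dozen lines. This is a simplified version of the structures above (SlotId reduced to a bare usize, only three opcodes, errors elided), executing a hand-compiled program for root / "datasets" / "train":

```rust
use std::collections::HashMap;
use std::path::PathBuf;

// SlotId simplified to a bare usize for the sketch.
enum Instr {
    LoadLiteral { slot: usize, path: PathBuf },
    LoadParam { slot: usize, param: String },
    Join { dst: usize, left: usize, right: usize },
}

// One flat loop: reads and writes against a fixed-size slot array.
fn execute(
    instructions: &[Instr],
    num_slots: usize,
    result_slot: usize,
    bindings: &HashMap<String, PathBuf>,
) -> PathBuf {
    let mut slots = vec![PathBuf::new(); num_slots];
    for instr in instructions {
        match instr {
            Instr::LoadLiteral { slot, path } => slots[*slot] = path.clone(),
            Instr::LoadParam { slot, param } => slots[*slot] = bindings[param].clone(),
            Instr::Join { dst, left, right } => {
                slots[*dst] = slots[*left].join(&slots[*right]);
            }
        }
    }
    slots[result_slot].clone()
}

fn main() {
    // Hand-compiled form of: root / "datasets" / "train"
    let program = [
        Instr::LoadParam { slot: 0, param: "root".into() },
        Instr::LoadLiteral { slot: 1, path: PathBuf::from("datasets") },
        Instr::Join { dst: 2, left: 0, right: 1 },
        Instr::LoadLiteral { slot: 3, path: PathBuf::from("train") },
        Instr::Join { dst: 4, left: 2, right: 3 },
    ];
    let bindings = HashMap::from([("root".to_string(), PathBuf::from("/data"))]);
    assert_eq!(
        execute(&program, 5, 4, &bindings),
        PathBuf::from("/data/datasets/train")
    );
}
```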

The Python Interface

All of this lives behind a pyo3 boundary. The Python API looks almost identical to the proof-of-concept:

from through import Param, P, T, compile_path

root = Param("root")
dataset = Param("dataset")
template = T("chunk_{idx}.parquet", ["idx"])

expr = root / "datasets" / dataset / template
compiled = compile_path(expr)

# Resolve many times
for i in range(10000):
    path = compiled.resolve({
        "root": "/data",
        "dataset": "train",
        "idx": f"{i:04d}"
    })

The difference is what happens under the hood. compile_path() crosses into Rust, runs the optimisation passes, and returns a CompiledPath. Each resolve() call executes the compiled instruction sequence without tree walking, and without Python object allocation for intermediate paths.

You can inspect what got compiled:

>>> compiled.required_params()
['root', 'dataset', 'idx']
>>> compiled.num_instructions()
5
>>> compiled.num_slots()
5

Where This Leads

The implementation handles the core path operations, but the architecture supports extension. Template expressions currently use simple {param} substitution; they could support format specs like the Python proof-of-concept's t-string integration.

The optimiser could add more passes:

  - common subpath elimination across multiple expressions
  - filesystem-aware canonicalisation
  - batch resolution for SIMD string operations

More interesting is how this connects back to the ideas from parts 1 and 2. Path schemas could compile to expression graphs. Connected paths could track their relations through the arena. Capability annotations could flow through the IR as metadata, enabling static analysis of what filesystem effects a path expression could have.

The lazy execution model isn't just about performance (as nice as it is to achieve), but about creating a representation rich enough to reason over. Once paths are IR, you can analyse them, transform them, and compile them — much like Polars does for queries, and the same way compilers do for code.

For the implementation details, see the through repository (TBA) and the design report mapping Polars architecture to path expressions.