While working with a Polars pipeline over Wikidata claims, I observed a nasty combinatorial blowup in the time taken to filter the results by language.
With some help from Claude I landed on a different way that produced identical results. The difference came down to the shape of the computation: the original version unintentionally created a Cartesian expansion, while the optimised version avoids it by restructuring how nested data is accessed. Take a look at the files in the GitHub Gist here, or the PR (#10) in the wikidata-pq repo.
Triples × triples
For a bit of background on what I am doing here: Wikidata holds many facts about each entity in its grand database of Wiki-verse things, in the form "X-relation-Y", known as 'linked data triples', such as "[Paris] [is the capital of] [France]". Internally, though, each part of that simple statement is stored in many languages, and the translations available for one part may not line up with those for another: X="Paris" might have 100 languages, R="is the capital of" 120, and Y="France" 130, with imperfect overlap between them. For practical purposes you typically want all three parts in the same language, so to make a nice tidy dataset we match them up per language and simply discard anything we can't express entirely. Ideally we would have every language for every thing, but some entities in Wikidata won't be translated into all languages, so if the goal is language-specific subsets, the simplest approach is straightforward matching. Personally, when I work with Wikidata I just want the English values, and carrying 100 other variants of every piece of text is a burden I'd have to sort out before I could do anything useful. That's a light oversimplification, but it gives the basic idea.
# 0_view_base_df.py
shape: (1, 7)
┌──────────┬─────────────────────────────────┬───────────────┬─────────────────────────────────┬────────┬─────────────────────────────────┬─────────────────┐
│ property ┆ datavalue ┆ datatype ┆ property-labels ┆ rank ┆ references ┆ qualifiers │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ struct[20] ┆ str ┆ list[struct[2]] ┆ str ┆ list[list[struct[2]]] ┆ list[struct[2]] │
╞══════════╪═════════════════════════════════╪═══════════════╪═════════════════════════════════╪════════╪═════════════════════════════════╪═════════════════╡
│ P31 ┆ {"Q532",[{"pl","wieś"}, {"en",… ┆ wikibase-item ┆ [{"en","instance of"}, {"fr","… ┆ normal ┆ [[{"P143",[{"P143",{"Q48952",[… ┆ null │
└──────────┴─────────────────────────────────┴───────────────┴─────────────────────────────────┴────────┴─────────────────────────────────┴─────────────────┘
Table 1: Base DataFrame: a single Wikidata claim after structural normalisation by polars-genson, containing both property labels and datavalue labels stored as multilingual lists.
The Cartesian transform with two list .explode calls
In the original approach (transform_cartesian), the transformation explodes both the property labels and the datavalue labels before filtering them to matching languages:
.explode("property-labels")
...
.explode("dv_labels")
...
.filter(pl.col("prop_lang") == pl.col("dv_lang"))
# 1_view_cartesian_intermediate.py
shape: (44_485, 4)
┌───────────┬─────────┬────────────────┬─────────────────┐
│ prop_lang ┆ dv_lang ┆ property_label ┆ datavalue_label │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═══════════╪═════════╪════════════════╪═════════════════╡
│ en ┆ pl ┆ instance of ┆ wieś │
│ en ┆ en ┆ instance of ┆ village │
│ en ┆ nl ┆ instance of ┆ dorp │
│ en ┆ it ┆ instance of ┆ villaggio │
│ en ┆ fr ┆ instance of ┆ village │
│ … ┆ … ┆ … ┆ … │
│ tly ┆ bbc ┆ anlayışın ┆ Huta │
│ tly ┆ ext ┆ anlayışın ┆ puebru │
│ tly ┆ mag ┆ anlayışın ┆ गाँओ │
│ tly ┆ sr-ec ┆ anlayışın ┆ Село │
│ tly ┆ btm ┆ anlayışın ┆ Huta │
└───────────┴─────────┴────────────────┴─────────────────┘
Table 2: Cartesian intermediate produced by exploding both property-labels and datavalue.labels. Every (prop_lang, dv_lang) pair is materialised, producing 44 thousand rows for a single claim before filtering.
Needless to say this is a bad situation: every possible pair of datavalue and property-label languages exists, and most rows differ only in their language keys.
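To make the shape of the problem concrete, here's a minimal self-contained sketch of the same dual-explode pattern on a toy one-claim frame (the column names follow the tables above; it's an illustration of the pattern, not the code from the PR):
import polars as pl

# Toy one-claim frame mimicking the nested shape above: 2 property-label
# languages, 3 datavalue-label languages.
base = pl.DataFrame({
    "property": ["P31"],
    "property-labels": [
        [{"key": "en", "value": "instance of"},
         {"key": "fr", "value": "nature de l'élément"}],
    ],
    "datavalue": [
        {"labels": [{"key": "en", "value": "village"},
                    {"key": "pl", "value": "wieś"},
                    {"key": "fr", "value": "village"}]},
    ],
})

cartesian = (
    base
    .explode("property-labels")        # one row per property-label language
    .with_columns(
        pl.col("property-labels").struct.field("key").alias("prop_lang"),
        pl.col("property-labels").struct.field("value").alias("property_label"),
        pl.col("datavalue").struct.field("labels").alias("dv_labels"),
    )
    .explode("dv_labels")              # × one row per datavalue-label language
    .with_columns(
        pl.col("dv_labels").struct.field("key").alias("dv_lang"),
        pl.col("dv_labels").struct.field("value").alias("datavalue_label"),
    )
    # 2 × 3 = 6 intermediate rows exist at this point; only 2 survive the filter
    .filter(pl.col("prop_lang") == pl.col("dv_lang"))
    .select("prop_lang", "dv_lang", "property_label", "datavalue_label")
)
print(cartesian)  # shape: (2, 4)
Scale the toy label counts up to the real ones and you get the 44 thousand intermediate rows of Table 2 from a single claim.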
As an aside, this came from some code I wrote originally and recently deleted, which I then rescued from my git history. Ironically, that original code had performed so poorly that I assumed my entire approach was to blame, when the culprit was probably just this avoidable double explode. My assumption instead was that the problem lay in the schema juggling I was doing to work with the un-normalised data from Wikidata. That was the primary motivation that kicked off the development of polars-genson, so in hindsight probably a misunderstanding! Oh well: things are much improved on both counts, and thanks to the schema inference and normalisation I have more confidence that I'm not discarding data that doesn't fit some expected schema. (Unbeknownst to me, Polars would soon afterwards introduce a requirement for the schema to be known when working with the .str.json_decode function, from version 1.33.0 in September 2025, so I would have needed it sooner or later anyway.)
So this double explode is a literal interpretation of the task, but operationally it materialises all combinations of property-label languages and datavalue-label languages for each claim.
If a claim has P property labels and D datavalue labels, this produces P × D intermediate rows, even though only rows where the language keys match can survive the final filter.
In realistic Wikidata data, both P and D are often large enough that this expansion dominates runtime.
The efficient transform with list.eval
The optimised version (transform_efficient) starts from the observation that only property labels whose language exists in datavalue.labels can ever produce output.
It first extracts the available datavalue languages into a list column without exploding anything:
.with_columns(
pl.col("datavalue")
.struct.field("labels")
.list.eval(pl.element().struct.field("key"))
.alias("_dv_langs")
)
# 2_view_dv_langs.py
shape: (3, 2)
┌──────────┬───────────────────────┐
│ property ┆ _dv_langs │
│ --- ┆ --- │
│ str ┆ list[str] │
╞══════════╪═══════════════════════╡
│ P31 ┆ ["pl", "en", … "btm"] │
│ P17 ┆ ["tg", "sk", … "btm"] │
│ P131 ┆ ["fa", "en", … "arz"] │
└──────────┴───────────────────────┘
Table 3: Result of list.eval extracting language keys from datavalue.labels. Each row now carries a compact list of valid languages, without any row expansion.
It then explodes only property-labels and immediately filters to languages that are known to be present:
.explode("property-labels")
...
.filter(pl.col("language").is_in(pl.col("_dv_langs")))
# 3_view_divergence_point.py
shape: (158, 2)
┌──────────┬─────────────────────┐
│ language ┆ property_label │
│ --- ┆ --- │
│ str ┆ str │
╞══════════╪═════════════════════╡
│ en ┆ instance of │
│ fr ┆ nature de l’élément │
│ de ┆ ist ein(e) │
│ it ┆ istanza di │
│ pt-br ┆ instância de │
│ … ┆ … │
│ mdf ┆ эзда фкясь │
│ os ┆ у │
│ en-ca ┆ instance of │
│ en-gb ┆ instance of │
│ en-us ┆ instance of │
└──────────┴─────────────────────┘
Table 4: Rows remaining after exploding only property-labels and filtering by _dv_langs. At this point the row count is already close to the final result size, and no Cartesian intermediate has been created.
Instead of exploding datavalue.labels, the matching label is retrieved directly with a per-row lookup:
.with_columns(
    pl.struct(["datavalue", "language"])
    .map_elements(
        # Safe to use a bare next() here: the preceding is_in filter
        # guarantees a label in the row's language exists.
        lambda row: next(
            item["value"]
            for item in (row["datavalue"]["labels"] or [])
            if item["key"] == row["language"]
        ),
        return_dtype=pl.String,
    )
    .alias("datavalue_label")
)
This replaces a second explosion and filter with a bounded scan over the small list of labels attached to each row. The resulting complexity is linear in the number of property labels, plus a small per-row lookup, rather than proportional to the product of property and datavalue labels.
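Putting the three steps together, here is a minimal end-to-end sketch of the efficient shape on the same kind of toy one-claim frame (assumed column names again, not the exact wikidata-pq code):
import polars as pl

# Toy one-claim frame: 3 property-label languages, 2 datavalue-label languages.
base = pl.DataFrame({
    "property": ["P31"],
    "property-labels": [
        [{"key": "en", "value": "instance of"},
         {"key": "fr", "value": "nature de l'élément"},
         {"key": "tly", "value": "anlayışın"}],
    ],
    "datavalue": [
        {"labels": [{"key": "en", "value": "village"},
                    {"key": "pl", "value": "wieś"}]},
    ],
})

efficient = (
    base
    # 1. collect the datavalue languages into a plain list column, no explode
    .with_columns(
        pl.col("datavalue")
        .struct.field("labels")
        .list.eval(pl.element().struct.field("key"))
        .alias("_dv_langs")
    )
    # 2. explode only the property labels, then keep languages we can match
    .explode("property-labels")
    .with_columns(
        pl.col("property-labels").struct.field("key").alias("language"),
        pl.col("property-labels").struct.field("value").alias("property_label"),
    )
    .filter(pl.col("language").is_in(pl.col("_dv_langs")))
    # 3. bounded per-row lookup of the matching datavalue label, no second explode
    .with_columns(
        pl.struct(["datavalue", "language"])
        .map_elements(
            lambda row: next(
                item["value"]
                for item in (row["datavalue"]["labels"] or [])
                if item["key"] == row["language"]
            ),
            return_dtype=pl.String,
        )
        .alias("datavalue_label")
    )
    .select("property", "language", "property_label", "datavalue_label")
)
print(efficient)  # only the "en" row survives; "fr" and "tly" have no datavalue label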
This change in structure explains the large performance difference observed in benchmarks:
# debug_claims.py
============================================================
Testing with 10 base claims
============================================================
(4 are wikibase-item/property)
Cartesian: 3.361s, 395 rows
Efficient: 0.070s, 395 rows
============================================================
Testing with 50 base claims
============================================================
(28 are wikibase-item/property)
Cartesian: 10.071s, 1993 rows
Efficient: 0.258s, 1993 rows
============================================================
Testing with 100 base claims
============================================================
(47 are wikibase-item/property)
Cartesian: 16.785s, 3717 rows
Efficient: 0.334s, 3717 rows
============================================================
Testing with 500 base claims
============================================================
(159 are wikibase-item/property)
Cartesian: 41.403s, 14579 rows
Efficient: 1.602s, 14579 rows
Table 5: Comparison across different numbers of base claims (10, 50, 100, 500), in which the efficient approach gives a 25-50x speedup by eliminating the redundant expansion before the filter. The matching row counts confirm that both forms produce the same results.
The Cartesian version spends most of its time creating and discarding intermediate rows that were destined to be filtered out, while the optimised version keeps intermediate data close to the size of the final output throughout the pipeline.
The broader point is that when working with nested lists in Polars, exploding multiple columns and filtering afterward can easily introduce unintended Cartesian products. If one side of the relationship can be treated as a lookup rather than another dimension, restructuring the transformation to explode only once can yield substantial improvements in both runtime and memory use without changing the result.
Why map_elements when you can join
The efficient version above still has a bottleneck: the map_elements call with a Python lambda is two antipatterns in one (there's a dedicated warning
in the docs called "do not kill parallelization"). While it avoids the Cartesian explosion, it still forces Polars to drop into single-threaded Python execution for every row.
As you scale, this becomes your new limiting factor.
.with_columns(
pl.struct(["datavalue", "language"])
.map_elements(
lambda row: next(
item["value"]
for item in (row["datavalue"]["labels"] or [])
if item["key"] == row["language"]
),
return_dtype=pl.String,
)
.alias("datavalue_label")
)
My initial idea for further improvement was to use Expr.list operations (which typically run on Rust 'kernels' in the polars-compute crate and so take faster paths), but no suitable one was available. Instead I found this feature request, "Find indexes of item in list". In a reply to that request, another contributor suggested using Polars' with_row_index, which gave me an idea.
Using the row index, this lookup can also be expressed as a join to keep it Polars-native. We want to match each
(row_id, language) pair in the main table to the corresponding label value in the exploded labels:
# Add row index to correlate after exploding
indexed = base.with_row_index("_row_id")
# Lookup table: explode labels into (row_id, lang, label_value)
labels_lookup = (
indexed
.select(
"_row_id",
pl.col("datavalue").struct.field("labels").alias("_dv_labels"),
)
.explode("_dv_labels")
.with_columns(
pl.col("_dv_labels").struct.field("key").alias("_lang"),
pl.col("_dv_labels").struct.field("value").alias("datavalue_label"),
)
.select("_row_id", "_lang", "datavalue_label")
)
# Main table: explode property-labels
main = (
indexed
.explode("property-labels")
.with_columns(
pl.col("property-labels").struct.rename_fields(["language", "property_label"])
)
.unnest("property-labels")
)
# Inner join naturally filters to matching languages only
result = (
main
.join(labels_lookup, left_on=["_row_id", "language"], right_on=["_row_id", "_lang"], how="inner")
.drop("_row_id")
)
The key move is adding a _row_id column before any explosion to let us explode the labels into a separate lookup table,
then join back on (row_id, language). The inner join simultaneously performs the language matching and the value extraction,
in native Polars operations that share the work across threads.
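As a quick sanity check that the join-based lookup and the map_elements lookup really do agree, here is a self-contained sketch that runs both on a toy two-claim frame and asserts the outputs match (assumed column names; the map_elements variant here takes a None default plus drop_nulls rather than relying on a prior language filter):
import polars as pl
from polars.testing import assert_frame_equal

base = pl.DataFrame({
    "property": ["P31", "P17"],
    "property-labels": [
        [{"key": "en", "value": "instance of"}, {"key": "pl", "value": "jest to"}],
        [{"key": "en", "value": "country"}, {"key": "fr", "value": "pays"}],
    ],
    "datavalue": [
        {"labels": [{"key": "en", "value": "village"}, {"key": "pl", "value": "wieś"}]},
        {"labels": [{"key": "en", "value": "Poland"}, {"key": "de", "value": "Polen"}]},
    ],
})

# Shared prep: row index, then explode the property labels once.
indexed = base.with_row_index("_row_id")
main = (
    indexed
    .explode("property-labels")
    .with_columns(
        pl.col("property-labels").struct.rename_fields(["language", "property_label"])
    )
    .unnest("property-labels")
)

# Join-based lookup, as above.
labels_lookup = (
    indexed
    .select("_row_id", pl.col("datavalue").struct.field("labels").alias("_dv_labels"))
    .explode("_dv_labels")
    .with_columns(
        pl.col("_dv_labels").struct.field("key").alias("_lang"),
        pl.col("_dv_labels").struct.field("value").alias("datavalue_label"),
    )
    .select("_row_id", "_lang", "datavalue_label")
)
via_join = (
    main
    .join(labels_lookup, left_on=["_row_id", "language"], right_on=["_row_id", "_lang"], how="inner")
    .select("property", "language", "property_label", "datavalue_label")
)

# map_elements lookup over the same exploded frame, for comparison.
via_map = (
    main
    .with_columns(
        pl.struct(["datavalue", "language"])
        .map_elements(
            lambda row: next(
                (item["value"]
                 for item in (row["datavalue"]["labels"] or [])
                 if item["key"] == row["language"]),
                None,  # unmatched languages become null
            ),
            return_dtype=pl.String,
        )
        .alias("datavalue_label")
    )
    .drop_nulls("datavalue_label")
    .select("property", "language", "property_label", "datavalue_label")
)

# P17's "fr" label has no matching datavalue label, so both variants drop it;
# the surviving rows should be identical.
assert_frame_equal(
    via_join.sort(["property", "language"]),
    via_map.sort(["property", "language"]),
)
On the real data, the benchmark comparison between the two looks like this: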
# debug_claims_map_elems.py
============================================================
Testing with 1 claims
============================================================
Wikibase claims: 1
map_elements: 1.162s, 158 rows
join_lookup: 1.078s, 158 rows
Speedup: 1.1x
============================================================
Testing with 5 claims
============================================================
Wikibase claims: 5
map_elements: 1.177s, 533 rows
join_lookup: 1.143s, 533 rows
Speedup: 1.0x
============================================================
Testing with 10 claims
============================================================
Wikibase claims: 10
map_elements: 1.222s, 741 rows
join_lookup: 1.130s, 741 rows
Speedup: 1.1x
============================================================
Testing with 50 claims
============================================================
Wikibase claims: 50
map_elements: 1.452s, 3839 rows
join_lookup: 1.223s, 3839 rows
Speedup: 1.2x
============================================================
Testing with 500 claims
============================================================
Wikibase claims: 500
map_elements: 4.636s, 41417 rows
join_lookup: 2.155s, 41417 rows
Speedup: 2.2x
============================================================
Testing with None claims
============================================================
Wikibase claims: 6180
map_elements: 44.759s, 558850 rows
join_lookup: 10.567s, 558850 rows
Speedup: 4.2x
Table 6: Comparison of map_elements vs the join-based lookup. The speedup grows with data size, reaching 4.2x on the full file. Combined with the earlier 25-50x improvement from avoiding the Cartesian product, the total speedup from the naive approach is over 100x.
At small sizes the overhead is similar, but as the data grows the join approach pulls ahead significantly.
The map_elements version scales poorly because Python's GIL kills any parallelisation opportunities,
while the join version can use all available cores.
So the lesson from this second round is that when you find yourself reaching for map_elements to do a lookup against nested data, consider whether it can be restructured as a join.
Explode the nested structure into a separate lookup table with a synthetic key, then join back. It's more verbose, but it keeps the entire computation in Polars' native execution engine where it can be optimised.