Putting web IO in the DataFrame

P-p-pick up a Polars plugin

Integrating I/O into data models and DataFrames

I'm working on a Polars DataFrame plugin library called httpolars.

I was a big fan of how Patito combined Pydantic with Polars to "put the data model in the DataFrame". If you're not familiar, it validates DataFrames both row-wise and column-wise for considerations like dtype and uniqueness.

I saw Patito as continuing an existing theme of Pydantic features that aimed to "put the IO in the data model", as I called it. Types like FilePath and NewPath validate that a file does or does not already exist, respectively.
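As a minimal sketch of that idea, assuming Pydantic v2 (where both FilePath and NewPath are available), a model can refuse to be constructed unless the file system looks the way it expects. The CopyJob model and is_valid_job helper are hypothetical names invented for illustration.

```python
import tempfile
from pathlib import Path

from pydantic import BaseModel, FilePath, NewPath, ValidationError


class CopyJob(BaseModel):
    """Hypothetical model: the source must exist, the destination must not."""

    source: FilePath  # validation fails unless this is an existing file
    destination: NewPath  # validation fails if this path already exists


def is_valid_job(source: Path, destination: Path) -> bool:
    """Return True when both paths satisfy the model's file system constraints."""
    try:
        CopyJob(source=source, destination=destination)
    except ValidationError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "input.csv"
    existing.write_text("a,b\n1,2\n")
    missing = Path(tmp) / "output.csv"

    ok = is_valid_job(existing, missing)  # source exists, destination is new
    swapped = is_valid_job(missing, existing)  # both constraints violated
```

The point is that "does this file exist?" stops being an if-statement scattered through the program and becomes part of the model's declared types.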

At the conceptual level, this elevates file system concerns from program side effects to part of the type system (in the context of a Pydantic data model).

Combining the ideas above (IO → data model → DataFrame) you're “putting the IO in the DataFrame”.

But before we get too comfy: this wasn't the end of the matter! The true data source is often a web API.

Put The Web IO In The DataFrame

"IO" typically indicates we're ingesting a ready-made file (whether local/online). Often in reality an API serves our dataset (proxies it), and we need to ask for it piece by piece, constantly double-checking our auth is still valid, and that we didn't get rejected for asking too fast (rate-limited retries), etc.

In this case, merely putting file IO in the DataFrame stops short: there's still a "download, then write to disk" step. We should really aim to start at the API: put the web IO in the DataFrame!
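To illustrate what "asking piece by piece" tends to look like in plain Python, here is a rough sketch of a pagination loop that checks each status code as it goes. Everything in it is an assumption made for illustration: the fake_api function stands in for a real endpoint, and the convention that an empty page marks the end of the dataset varies between APIs.

```python
from typing import Callable, Iterator

# A page fetcher takes a page number and returns (status_code, rows).
PageFetcher = Callable[[int], tuple[int, list[dict]]]


def iter_rows(fetch_page: PageFetcher, max_pages: int = 100) -> Iterator[dict]:
    """Yield rows page by page until the API returns an empty page."""
    for page in range(1, max_pages + 1):
        status, rows = fetch_page(page)
        if status == 401:
            raise PermissionError(f"auth rejected on page {page}")
        if status == 429:
            raise RuntimeError(f"rate limited on page {page}")
        if status != 200:
            raise RuntimeError(f"unexpected status {status} on page {page}")
        if not rows:
            return  # an empty page signals the end of the dataset
        yield from rows


def fake_api(page: int) -> tuple[int, list[dict]]:
    """Stand-in for a real endpoint: two pages of data, then an empty page."""
    data = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return 200, data.get(page, [])


rows = list(iter_rows(fake_api))
```

This is exactly the kind of bookkeeping that sits outside the DataFrame when we only think in terms of file IO.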

You're going to want request caching for any sufficiently large web datasets

Or else restarting the process will be slow and painful. We'll come back to this in part 3.

This led me to the question of how to do that, given that Polars readers expect a ready-made file.

It's well-known that you can extend Polars in Rust with PyO3 plugins. If I could do that, what if I put the Web IO in the DataFrame?

Polars plugins are a mix of Rust/Python that compile to a binary Python can import

I wrote PyO3 Python extensions for the first time in late 2023. Marco Gorelli's recent tutorial is excellent and my top recommendation to get started. My completed version of it is here.

I've seen this covered in multiple talks, the upshot of which is that calling Rust is faster than calling Python (that is, faster than mapping a Python function row-wise with Expr.map_elements).

Wrapping reqwest in a Polars plugin: the birth of httpolars

httpolars is a Polars plugin wrapping HTTP call capabilities of the Rust library reqwest. reqwest has a familiar usage style also seen in Python's httpx (or requests before it): you configure a client, pass parameters/headers/etc, get back a response, check its status code, read its response text/content.

httpolars plumbs reqwest into the "Polars native expressions API" (in other words, talking to its Rust internals), enabling direct API calls from DataFrame values.

Polars is the way it is for a reason

The performance focus of Polars reflects its pursuit of real-time (or at least time-sensitive) data processing applications and scaling to large datasets, so httpolars likewise adopts this stance.

Its Rust internals also exhibit strong type safety, producing a user experience where I find myself more confident about "what exactly I'm working with" (compare this to the prevalence of the ambiguous object dtype in Pandas).

It further exhibits clarity in its operations, aiming for code readability and ultimately maintainability. Both Ben Feifke and Francis Wolinski (🇫🇷) recently picked up on the fact that Polars methods aren't overloaded the way the [] operator is in Pandas.

Put the HTTP status codes in the DataFrame

In an ideal world (and with toy problems) we get the new data in the DataFrame without a hitch. With real data, an HTTP library is bound to hit errors, and should handle them distinctly.

The goal of httpolars isn't to go to heroic lengths for all possible edge cases, but to present a reliable and robust approach for the common use case. Even a simple approach can smooth over a lot of web API IO, making us faster and our programs clearer.

Traditional DataFrame operations would not account for HTTP-specific issues like API rate limiting or failed requests due to server errors. These fall more under the umbrella of "side effects" (non-deterministic aspects of the system not directly due to program behaviour).

httpolars handles these, so we can distinguish null values caused by dropped requests from genuine null responses from APIs. Conflating the two would subtly degrade our analyses.

Additionally, features like retry logic for specific status codes (e.g., 429: Too Many Requests) are considered to keep the fetch operations robust despite the realities of flakey servers.
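The general shape of such retry logic can be sketched in a few lines of standard-library Python. This is not how httpolars implements it: the set of retryable statuses, the attempt limit and the exponential backoff schedule are all assumptions picked for illustration, and the fetch callable stands in for a real HTTP request.

```python
import time
from typing import Callable

RETRY_STATUSES = {429, 503}  # assumed set of statuses worth retrying


def fetch_with_retry(
    fetch: Callable[[], tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Call `fetch` until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRY_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2**attempt)  # exponential backoff: 0.5s, 1s, 2s, ...
    return status, body  # still failing: hand the last status back to the caller


# Simulated flaky server: rate-limits twice, then succeeds.
replies = iter([(429, ""), (429, ""), (200, "payload")])
delays: list[float] = []  # record the waits instead of actually sleeping
result = fetch_with_retry(lambda: next(replies), sleep=delays.append)
```

Returning the final status instead of raising keeps the failure visible to the caller, which fits the goal of putting status codes in the DataFrame.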