Worked examples

Show me the (web) data

ET Phone Home

For these examples we of course need an API, so I made one with rate limiting (I'm curious how httpolars compares to a traditional, optimised Python thread/process pool routine for fetching from an API).

I spun up 2 GET endpoints on localhost [i.e. my PC] with FastAPI:

  1. /noop, which returns its input unmodified (a "no-op"),

    {"value": "x"} → {"value": "x"}

  2. /factorial, which gives the factorial of the input.

    {"number": 3} → {"number": 3, "factorial": 6}

from math import factorial

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


@app.get("/noop")
@limiter.limit("4/2 seconds")
async def read_noop(request: Request, value: str | None = None):
    return {"value": value}


@app.get("/factorial")
@limiter.limit("50/minute")
async def read_factorial(request: Request, number: int):
    # `number` is required: factorial(None) would raise a TypeError
    return {"number": number, "factorial": factorial(number)}


def run_app():
    import uvicorn

    uvicorn.run(app, host="127.0.0.1", port=8000)
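
For reference, the "traditional" thread pool routine I have in mind as a baseline could look something like the sketch below. It uses only the standard library; the names build_url, fetch_json and fetch_all are my own placeholders (not part of httpolars), and there is no retry or rate-limit handling, so a 429 from the limiter would simply raise.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(endpoint: str, **params) -> str:
    """Append query parameters to an endpoint, e.g. /noop?value=x."""
    return f"{endpoint}?{urlencode(params)}"


def fetch_json(url: str) -> dict:
    """GET a URL and parse the JSON body (no retry or 429 handling)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def fetch_all(urls, fetch=fetch_json, max_workers=4) -> list:
    """Fetch every URL on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# With the FastAPI app above running on port 8000, usage would be:
#   base = "http://127.0.0.1:8000/noop"
#   fetch_all([build_url(base, value=v) for v in "xyz"])
```

The fetch argument is injectable so the pooling logic can be exercised without a live server.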

Set up: extracting a JSONPath to a Polars Series

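Polars has functionality to work with JSON held in string columns:

  • str.json_decode, which parses each JSON string into a Polars dtype,
  • str.json_path_match, which extracts the first match for a JSONPath from each JSON string.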

We're interested in the latter: we'll pass a JSONPath to specify a field/sub-field of the response to put in a Polars Series, to make a new column in the DataFrame.

Per its documentation, str.json_path_match throws errors if invalid JSON strings are encountered, and all return values are cast to String regardless of the original value.

The following helper function, jsonpath, does this:

import polars as pl


def jsonpath(response: str | pl.Expr, path: str):
    """Accept either the response `Expr` or reference by its column name."""
    response = pl.col(response) if isinstance(response, str) else response
    return response.str.json_path_match(f"$.{path}")

Demo 1: doing nothing

Let's call the /noop endpoint which will respond with our input, and let's give the letters x, y, z as the input, in 3 separate calls.

url = "http://localhost:8000/noop"
df = pl.DataFrame({"value": ["x", "y", "z"]})

Now let's make a Polars expression (an Expr), just like the one we get when we call pl.col() on a column name.

import httpolars as httpl

response = httpl.api_call("value", endpoint=url)

Nothing got sent over the internet yet

Calling httpl.api_call() executes no requests: it only constructs a Polars expression on the input column pl.col("value"). If we print its repr we see that:

<Expr ['col("value")./home/louis/dev/h…'] at 0x7F9D56B716D0>

This is a statement of intent:

  • to pass a value column (not yet defined, only named)...
  • ...to the endpoint url (defined; more specifically, it is stored in the Rust extension as the endpoint field of the ApiCallKwargs struct).

That response variable denotes the column in which we'll get back the response body, as a Polars string column (of JSON strings). Next we put it through our helper function jsonpath:

value = jsonpath(response, "value")

So now we've got an expression for a string dtype column of scalar values, still named value (the name is the same because Polars just sees this as a transform on the column named value).

Let's look at the data briefly:

>>> df
shape: (3, 1)
┌───────┐
│ value │
│ ---   │
│ str   │
╞═══════╡
│ x     │
│ y     │
│ z     │
└───────┘
>>> df.with_columns(response)
shape: (3, 1)
┌───────────────┐
│ value         │
│ ---           │
│ str           │
╞═══════════════╡
│ {"value":"x"} │
│ {"value":"y"} │
│ {"value":"z"} │
└───────────────┘
>>> df.with_columns(jsonpath(response, "value"))
shape: (3, 1)
┌───────┐
│ value │
│ ---   │
│ str   │
╞═══════╡
│ x     │
│ y     │
│ z     │
└───────┘

Neat! We've successfully done absolutely nothing. Give yourself a pat on the back.

When we stick the resulting Series back on the DataFrame, nothing changes (we simply overwrite the DataFrame column with identical data).

Demo 2: counting permutations

Moving on to our endpoint that computes the factorial of its input, let's say we're interested in knowing how many arrangements there are of a set of "number" items (the number of permutations of N things is N!, "N factorial"):

url = "http://localhost:8000/factorial"
df = pl.DataFrame({"number": [1, 2, 3]})

This time let's take the response (the JSON string) and 'store it' by renaming it to "response". This way with_columns will put it alongside rather than overwriting the "number" column.

response = httpl.api_call("number", endpoint=url).alias("response")

Likewise let's also extract the "number" field in the response and rename its column to "supplied", so we can check we got back the same integer we put in.

in_ = jsonpath("response", "number").str.to_integer().alias("supplied")

And to keep organised, let's rename the value in the "factorial" key of the response to "permutations" (so our code is clear on what the value is needed for):

out = jsonpath("response", "factorial").str.to_integer().alias("permutations")

Then let's run the HTTP calls by adding that response expression to the DataFrame, add the two extracted columns, and then drop the response column (which we refer to here simply by its name as a string):

result = df.with_columns(response).with_columns([in_, out]).drop("response")

The result is correct:

┌────────┬──────────┬──────────────┐
│ number ┆ supplied ┆ permutations │
│ ---    ┆ ---      ┆ ---          │
│ i64    ┆ i64      ┆ i64          │
╞════════╪══════════╪══════════════╡
│ 1      ┆ 1        ┆ 1            │
│ 2      ┆ 2        ┆ 2            │
│ 3      ┆ 3        ┆ 6            │
└────────┴──────────┴──────────────┘

and our FastAPI server recorded just a single GET request for each data point, as expected:

INFO:     127.0.0.1:59410 - "GET /factorial?number=1 HTTP/1.1" 200 OK
INFO:     127.0.0.1:59418 - "GET /factorial?number=2 HTTP/1.1" 200 OK
INFO:     127.0.0.1:59430 - "GET /factorial?number=3 HTTP/1.1" 200 OK
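
As a server-independent sanity check on the permutations column, we can count the arrangements directly with the standard library and compare against N!. This is just a brute-force sketch for the three inputs used above.

```python
from itertools import permutations
from math import factorial

counts = {}
for n in (1, 2, 3):
    # Enumerate every ordering of n distinct items and count them.
    arrangements = list(permutations(range(n)))
    counts[n] = len(arrangements)
    # The brute-force count should agree with n factorial.
    assert counts[n] == factorial(n)

print(counts)
```

The counts for 1, 2 and 3 items match the permutations column in the result table.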