ET Phone Home
For these examples we of course need an API, so I made one with rate-limiting (I'm curious how httpolars compares to a traditional optimised Python thread/process pool API fetching routine).
I spun up 2 GET endpoints on localhost [i.e. my PC] with FastAPI:
- /noop, which returns its input unmodified (a "no-op"): {"value": "x"} → {"value": "x"}
- /factorial, which gives the factorial of the input: {"number": 3} → {"number": 3, "factorial": 6}
from math import factorial
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/noop")
@limiter.limit("4/2 seconds")
async def read_noop(request: Request, value: str | None = None):
    return {"value": value}


@app.get("/factorial")
@limiter.limit("50/minute")
async def read_factorial(request: Request, number: int | None = None):
    return {"number": number, "factorial": factorial(number)}


def run_app():
    import uvicorn

    uvicorn.run(app, host="127.0.0.1", port=8000)
Set up: extracting a JSONPath to a Polars Series
Polars has functionality to work with JSON held in string columns:
- json_decode(): Parse string values as JSON (takes a dtype or infers it)
- json_path_match(): Extract the first match of JSON string with the provided JSONPath expression.
We're interested in the latter: we'll pass a JSONPath to specify a field/sub-field of the response to put in a Polars Series, to make a new column in the DataFrame.
Note that json_path_match() throws errors if invalid JSON strings are encountered, and all return values will be cast to String regardless of the original value.
The following helper function jsonpath:
- takes response, the name of the Polars column we've put our HTTP response body (JSON string) in, and wraps it in pl.col() if it's not already a Polars expression.
- reads the string as JSON and accesses the path JSONPath (will always give a string value)
import polars as pl


def jsonpath(response: str | pl.Expr, path: str):
    """Accept either the response `Expr` or reference by its column name."""
    response = pl.col(response) if isinstance(response, str) else response
    return response.str.json_path_match(f"$.{path}")
Demo 1: doing nothing
Let's call the /noop endpoint, which will respond with our input, and let's give the letters x, y, z as the input, in 3 separate calls.
url = "http://localhost:8000/noop"
df = pl.DataFrame({"value": ["x", "y", "z"]})
Now let's make a Polars expression (an Expr), just like when we call pl.col() on a column name.
import httpolars as httpl
response = httpl.api_call("value", endpoint=url)
Nothing got sent over the internet yet
No requests are executed by calling httpl.api_call(); it only constructs the Polars expression on the input column pl.col("value"). If we print its repr we see that:
<Expr ['col("value")./home/louis/dev/h…'] at 0x7F9D56B716D0>
This is a statement of intent:
- to pass a value column (not yet defined, only named)...
- ...to the endpoint url (defined, more specifically in the Rust extension as the ApiCallKwargs struct's endpoint field).
That response variable denotes the column where we'll get back the response body as a string type Polars column (a JSON string). Next we put it through our helper function jsonpath:
value = jsonpath(response, "value")
So now we've got a string dtype column of scalar values, still named value (the name is the same because Polars just sees this as a transform on the column named value).
Let's look at the data briefly:
>>> df
shape: (3, 1)
┌───────┐
│ value │
│ ---   │
│ str   │
╞═══════╡
│ x     │
│ y     │
│ z     │
└───────┘
>>> df.with_columns(response)
shape: (3, 1)
┌───────────────┐
│ value         │
│ ---           │
│ str           │
╞═══════════════╡
│ {"value":"x"} │
│ {"value":"y"} │
│ {"value":"z"} │
└───────────────┘
>>> df.with_columns(jsonpath(response, "value"))
shape: (3, 1)
┌───────┐
│ value │
│ ---   │
│ str   │
╞═══════╡
│ x     │
│ y     │
│ z     │
└───────┘
Neat! We've successfully done absolutely nothing. Give yourself a pat on the back.
When we stick the resulting Series back on the DataFrame, nothing changes (we simply overwrite the DataFrame column with identical data).
Demo 2: counting permutations
Moving on to our endpoint that computes the factorial of its input, let's say we're interested in knowing how many arrangements there are of a set of number items (the number of permutations of N things is N!, "N factorial"):
url = "http://localhost:8000/factorial"
df = pl.DataFrame({"number": [1, 2, 3]})
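As a quick sanity check on the N! claim, this standard-library sketch enumerates the arrangements for the same three inputs and compares the count with math.factorial. It runs locally and does not touch the API.

```python
from itertools import permutations
from math import factorial

counts = {}
for n in (1, 2, 3):
    # Every ordering of n distinct items, listed explicitly.
    arrangements = list(permutations(range(n)))
    counts[n] = len(arrangements)
    print(n, len(arrangements), factorial(n))

# prints:
# 1 1 1
# 2 2 2
# 3 6 6
```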
This time let's take the response (the JSON string) and 'store it' by renaming it to "response". This way with_columns will put it alongside rather than overwriting the "number" column.
response = httpl.api_call("number", endpoint=url).alias("response")
Likewise let's also extract the "number" field in the response and rename its column to "supplied", so we can check we got back the same integer we put in.
in_ = jsonpath("response", "number").str.to_integer().alias("supplied")
And to keep organised, let's rename the value in the "factorial" key of the response to "permutations" (so our code is clear on what the value is needed for):
out = jsonpath("response", "factorial").str.to_integer().alias("permutations")
Then let's run the HTTP calls by using that response variable, selecting the columns, and then dropping the response column (which we refer to here simply by its name as a string):
result = df.with_columns(response).with_columns([in_, out]).drop("response")
The result is correct:
┌────────┬──────────┬──────────────┐
│ number ┆ supplied ┆ permutations │
│ ---    ┆ ---      ┆ ---          │
│ i64    ┆ i64      ┆ i64          │
╞════════╪══════════╪══════════════╡
│ 1      ┆ 1        ┆ 1            │
│ 2      ┆ 2        ┆ 2            │
│ 3      ┆ 3        ┆ 6            │
└────────┴──────────┴──────────────┘
and our FastAPI server recorded just a single GET request for each data point, as expected:
INFO: 127.0.0.1:59410 - "GET /factorial?number=1 HTTP/1.1" 200 OK
INFO: 127.0.0.1:59418 - "GET /factorial?number=2 HTTP/1.1" 200 OK
INFO: 127.0.0.1:59430 - "GET /factorial?number=3 HTTP/1.1" 200 OK
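Reading the log, each row's value shows up as a query parameter named after the column. The sketch below reproduces those request paths with the standard library; `build_url` is a hypothetical helper of mine, and the column-name-as-parameter rule is inferred from the log lines above rather than taken from the httpolars source.

```python
from urllib.parse import urlencode


def build_url(endpoint: str, column: str, value) -> str:
    """Hypothetical helper: one GET URL per row, column name as the parameter."""
    return f"{endpoint}?{urlencode({column: value})}"


endpoint = "http://localhost:8000/factorial"
urls = [build_url(endpoint, "number", n) for n in (1, 2, 3)]
for u in urls:
    print(u)

# prints:
# http://localhost:8000/factorial?number=1
# http://localhost:8000/factorial?number=2
# http://localhost:8000/factorial?number=3
```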