Speculative features and future directions
As httpolars develops, several potential features are under consideration:
- Error Handling: Introducing a new column to record HTTP status codes or errors alongside the data could provide clarity, and may even be actionable in its own right during data analysis (e.g. it may indicate that some inputs are invalid and should be corrected 'upstream').
- Caching Mechanisms: Implementing in-memory or disk-based caching strategies could obviate repeated requests, reducing latency and load on both the server and the client. JSON caching by input ID is a nice pattern, and it would be good not to have to implement it manually. Maybe someone has already made an interface for this that I can just hook into.
- Concurrency: Exploring asynchronous or parallel request handling could align httpolars with Polars' performance standards, ensuring that data ingestion does not become a bottleneck.
- Rate Limiting: Developing mechanisms to dynamically adjust request rates could ease API usage and optimise data retrieval efficiency. This can be benchmarked against Process/Thread pools in Python.
- Batch API calls: Some APIs have support for batching inputs, e.g. as comma-separated lists. Supporting this would require that the operations span multiple rows in a Series, but I think this can be done.
- Pydantic Data Model Integration: Either through Patito or through Pydantic's internal Rust API (???), maybe the JSON deserialisation could be handled in data models? Or maybe not requiring data models, but still getting their benefit somehow.
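To make a few of these ideas concrete, here is a minimal plain-Python sketch of three of them: recording the status beside the data, JSON caching by input ID, and comma-separated batching. It is not httpolars code, and none of these names exist in the library: `fetch_with_cache`, `batched`, the `Fetcher` type and the `fake_fetch` stand-in are all hypothetical, and the sketch assumes input IDs are safe to use as file names.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical fetcher type: input ID -> (HTTP status code, parsed JSON or None).
Fetcher = Callable[[str], Tuple[int, Optional[dict]]]


def fetch_with_cache(ids: List[str], fetch: Fetcher, cache_dir: Path) -> List[Dict]:
    """Return one row per input ID, with the status recorded beside the data."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for input_id in ids:
        # Assumption: the ID is a valid file name on this file system.
        cache_file = cache_dir / f"{input_id}.json"
        if cache_file.exists():
            # Cache hit: reuse the stored row instead of repeating the request.
            row = json.loads(cache_file.read_text())
        else:
            status, body = fetch(input_id)
            row = {"id": input_id, "status": status, "body": body}
            if status == 200:
                # Only successes are cached, so failed inputs are retried later.
                cache_file.write_text(json.dumps(row))
        rows.append(row)
    return rows


def batched(ids: List[str], size: int) -> List[str]:
    """Join IDs into comma-separated batches for APIs that accept batched inputs."""
    return [",".join(ids[i : i + size]) for i in range(0, len(ids), size)]


# Stand-in for a real HTTP call, so the sketch runs offline.
calls: List[str] = []


def fake_fetch(input_id: str) -> Tuple[int, Optional[dict]]:
    calls.append(input_id)  # track which IDs actually triggered a "request"
    if input_id == "bad":
        return 404, None
    return 200, {"name": input_id.upper()}


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_cache(["a", "bad"], fake_fetch, Path(tmp))
    second = fetch_with_cache(["a", "bad"], fake_fetch, Path(tmp))

print([row["status"] for row in second])  # the status sits beside each row's data
print(calls)  # shows which IDs were requested across both passes
```

Each returned row maps naturally onto DataFrame columns (`id`, `status`, `body`), which is the shape the error-handling idea describes. Caching only successful responses is a design choice of this sketch: it keeps invalid inputs visible on every run rather than freezing an error into the cache.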
The development of httpolars should also be accompanied by thorough documentation (not yet begun!), example use cases (part of this series), and a comprehensive API reference (also not yet prioritised) for both its Rust and PyO3 code (not yet considered how to do both in one docs site).
The vision: seamless DataFrame expansion from APIs
Elevating web API fetching to a first-class DataFrame op is a strategic shift, aiming to set a new standard for how APIs are integrated into data workflows, eliminating the diversion and ambiguity in fetching data from a web API source.
I use the phrase "putting the web IO in the DataFrame" repeatedly in this series to emphasise that the mission here is to make web IO a first-class citizen, and thus one handled in a standardised way, as opposed to one handled with whatever level of effort the time constraints of the current analysis allow.
If developed to high quality, this approach would fundamentally simplify data retrieval, making it seamless with data manipulation.