I churned out 30 Python packages this year and, somewhat unexpectedly, an equal number of Rust crates (my first ever being in March).
Looking back, I seem to have a thing for inverse problems.
- Dewarping page images to recover flattened geometry from photographed books
- Schema inference on JSON to recover the shape and types from observed values (allowing it to fit neatly into a DataFrame's column schema)
- ICA on neural text embeddings to recover meaningful topics.
Put another way, inverting a camera projection, undoing JSON serialisation, and unmixing vectors in an embedding space.
In the case of image dewarping, there is a correct answer, and you try to pull it out through optimisation (or as close as you can get). Models are only as good as their assumptions, and the assumption here is of a particular type of curve, which we'll get into in a bit.
Schema inference is more algebraic (and your result is verifiably correct: a sufficiently strict engine like a DataFrame will simply error if the schema doesn't fit the inputs). If you dig up enough edge cases and iron them out you can make quite a mature inference engine, though the edges get trickier the more mature your tool becomes.
The text embedding decomposition problem has a more probabilistic flavour: there isn't a clear 'right answer', which has always made topic models feel suspect to me.
I: Unwarping unwrapped
| page-dewarp |
|---|
I'm not a computer vision researcher so much as a long-time fanboy who took it upon himself to preserve some Python 2 code in 2021. I'd been mainly tinkering around the edges after the initial refactoring of page-dewarp, until this year when I finally reworked the reprojection optimisation at its core, without losing the spirit of the original.
To recap, the program uses a cubic sheet model (meaning it fits a cubic curve), specifically a "Hermite" one (whose curve height is set to 0 at both ends, as for a page resting on a flat surface). Matt Zucker's original blog post treats a photographed page as a smoothly curved surface rather than an open-ended pixel shuffling problem. You fit these splines to text contours you detect with OpenCV, then solve for the coefficients that best explain where the text ended up when projected through your camera model. It's physically grounded and a nice use of computational geometry (homography matrices or rotations in 3D, solved by SVD).
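To make the model concrete, here is a rough sketch of such a Hermite cubic (my paraphrase, not the page-dewarp source): the sheet height is pinned to zero at both page edges, leaving the two endpoint slopes as the free shape parameters.

```python
# Rough sketch (my paraphrase, not the page-dewarp source) of a Hermite cubic
# "sheet" height profile: zero at both page edges, free slopes at each end.
import numpy as np

def hermite_sheet_height(t: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Cubic f on [0, 1] with f(0) = f(1) = 0, f'(0) = alpha, f'(1) = beta."""
    h10 = t**3 - 2 * t**2 + t   # Hermite basis term for the start slope
    h11 = t**3 - t**2           # Hermite basis term for the end slope
    return alpha * h10 + beta * h11

t = np.linspace(0.0, 1.0, 5)
print(hermite_sheet_height(t, alpha=0.3, beta=-0.2))  # rises from 0 and returns to 0
```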
Solving the non-linear least squares problem
Zucker left a hint in his blog post that this was a non-linear least squares optimisation that hadn't been solved as such, and doing so could speed it up. The original code used SciPy's Powell method, which is 'derivative-free' (it explores the objective by evaluating it at different points without needing gradient information). This was a pragmatic choice as computing the gradient manually through the full projection pipeline (cubic splines → 3D geometry → Rodrigues rotation → camera projection → error) would require more complex Jacobian calculus.
The reprojection error objective we're minimising is textbook smooth least squares: squared differences between projected and detected keypoints, composed entirely of matrix operations, polynomial evaluation, and camera projection. That makes it a natural fit for gradient-based methods like L-BFGS and Levenberg-Marquardt, provided the gradients are accurate. Finite-difference approximations introduce enough noise to mislead the optimiser: poorly conditioned steps, more iterations to converge, and a slower run overall.
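As a rough sketch of the setup (a toy 2D warp standing in for the full cubic-sheet-plus-camera pipeline, not the actual page-dewarp objective), you can hand JAX's exact gradient to SciPy's L-BFGS-B:

```python
# Hedged sketch: exact gradients from JAX fed to SciPy's L-BFGS-B. The toy
# project() stands in for the real pipeline (cubic sheet -> rotation -> camera).
import jax
import jax.numpy as jnp
import numpy as np
from scipy.optimize import minimize

def project(params, pts):
    a, b, tx, ty = params            # toy 2D similarity transform
    R = jnp.array([[a, -b], [b, a]])
    return pts @ R.T + jnp.array([tx, ty])

def reprojection_error(params, pts, detected):
    # Smooth least squares: squared distance of projected vs. detected keypoints
    return jnp.sum((project(params, pts) - detected) ** 2)

pts = jnp.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
detected = project(jnp.array([0.9, 0.1, 0.05, -0.02]), pts)  # synthetic ground truth

objective = jax.jit(reprojection_error)
gradient = jax.jit(jax.grad(reprojection_error))

res = minimize(
    lambda p: float(objective(jnp.asarray(p), pts, detected)),
    x0=np.array([1.0, 0.0, 0.0, 0.0]),
    jac=lambda p: np.asarray(gradient(jnp.asarray(p), pts, detected), dtype=np.float64),
    method="L-BFGS-B",
)
print(res.x, res.nit)  # recovers [0.9, 0.1, 0.05, -0.02] in a handful of iterations
```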
I'd tried to pursue Levenberg-Marquardt (which Gilbert Strang covers at the end of Linear Algebra & Learning From Data) with GPUFit a couple of years ago, but not being a C++ dev this wasn't much fun, and it ultimately didn't land.
I had been excited by what looked like a new Rust port of SciPy, but it quickly became clear the developer was using LLMs in a, shall we say, hands-off way. I extracted what I needed and tried to contribute some fixes for the egregiously reward-hacked tests, but there was really nothing of value there. If an optimiser is not implemented correctly it is simply not usable, and this was pretty much a total waste of my time. At best it was a Teachable Moment™: always check the tests for reward hacking if it sounds too good to be true.
After putting this discouragement behind me, I circled back to the idea of Levenberg-Marquardt, only to find it another dead end. SciPy's L-BFGS worked but converged poorly: finite-difference gradient estimates give the optimiser too noisy a signal to follow. When I swapped the optimiser backend from SciPy to JAX with autodiff, I finally had exact gradients, and a serious contender by quite a margin:
| Image | SciPy Powell | JAX L-BFGS-B | Speedup | Eval Reduction |
|---|---|---|---|---|
| boston_cooking_a | 12.18s | 2.01s | 6.1× | ↓101× |
| boston_cooking_b | 9.69s | 1.21s | 8.0× | ↓155× |
| finnish_cooking_a | 12.54s | 1.31s | 9.6× | ↓198× |
| linguistics_thesis_a | 3.15s | 0.86s | 3.7× | ↓130× |
| linguistics_thesis_b | OOM | 0.31s | ∞ | ↓289× |
A JAXmas miracle
The speedup came from not misleading the optimiser with noisy gradients. It's less "JAX is fast" and more "the old code was flying blind". (For the record, the first clue was the manually derived gradient.)
RE: the OOM → 0.31s case
The linguistics_thesis_b image previously crashed with an Out Of Memory error: numerical instability in the optimisation produced a wildly wrong prediction for the size of one of the arrays (a bug in itself, but one that should not happen with a properly conditioned optimiser). The image still exhibits a failure mode, but it no longer results in a crash!
Batch processing
Once single-image perf was sorted, the obvious next stop was batch processing: the typical user has a stack of scans, not a single page at a time.
JAX's JIT has a warmup cost, so it's not worth it for single images, but batches can see an extra 3-5× speed boost. This is switched on automatically for more than one input image:
| Device | Sequential | Batched (40 images) | Speedup |
|---|---|---|---|
| CPU ★ | 36s | 8.7s | 4.1× |
| GPU | 53s | 11.2s | 4.7× |
Interestingly, on my hardware (single GPU, many CPU cores) I see CPU batches beating the GPU in JAX's batch mode by some 30%. Don't underestimate CPU SIMD!
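In rough JAX terms (a sketch of the idea, not the page-dewarp internals), you pay the JIT compile once and vmap the per-image work over the whole stack:

```python
# Sketch of the batching idea (not the actual page-dewarp code): compile the
# per-image objective once, then vmap it over a whole stack of images so the
# JIT warmup cost is amortised across the batch.
import jax
import jax.numpy as jnp

def per_image_error(params, keypoints, targets):
    # Placeholder objective with the same shape as a reprojection error
    return jnp.sum((keypoints * params[0] + params[1] - targets) ** 2)

batched_error = jax.jit(jax.vmap(per_image_error, in_axes=(0, 0, 0)))
batched_grad = jax.jit(jax.vmap(jax.grad(per_image_error), in_axes=(0, 0, 0)))

n_images, n_points = 40, 100
params = jnp.ones((n_images, 2))
keypoints = jnp.zeros((n_images, n_points))
targets = jnp.zeros((n_images, n_points))

print(batched_error(params, keypoints, targets).shape)  # (40,): one loss per image
print(batched_grad(params, keypoints, targets).shape)   # (40, 2): one gradient each
```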
Physically grounded methods
From around 2017, dewarping research began shifting to pixel-wise regression methods trained on crumpled-paper datasets. It started off fairly reasonably, with "four-way folds" (Das et al. at ACM DocEng 2017 used a CNN in a single optimisation step), then DocUNet (CVPR 2018) came along. Since then its descendants have framed dewarping as pixel displacement prediction.
This dataset is 130 photos of paper folded so heavily it surpasses origami and borders on crumpled (I've included examples below in case it sounds like I'm exaggerating). None show pages in books. I'd suggest it's a benchmark for a problem nobody has, but one which favours pixel-wise methods like convolution over spline models (which strikes me as picking winners).
| File | 1_2.jpg | 54_2.jpg | 52_2.jpg | 52_1.jpg | 51_2.jpg |
|---|---|---|---|---|---|
| Photo | |||||
| Comment | Origami | Scrunched | Diagonal fold | Propped up on 2 sides | Modern art |
| File | 50_2.jpg | 38_2.jpg | 32_1.jpg | 29_2.jpg | 27_1.jpg |
|---|---|---|---|---|---|
| Photo | |||||
| Comment | Origami | Crinkled | Multi-diagonal | Origami | Oblique fold |
| File | 24_2.jpg | 23_2.jpg | 20_2.jpg | 18_2.jpg | 17_2.jpg | 16_2.jpg |
|---|---|---|---|---|---|---|
| Photo | ||||||
| Comment | Oblique 4-way fold | Crumpled | Vertical accordion folds | Crumpled | 6-way fold with a crease | Crumpled/origami |
The paper claims to have been "the first learning-based method" for dewarping, by which they presumably mean not counting methods that learn the parameters of anything other than a deep neural network:
it is often desirable to digitally flatten a document image when the physical document sheet is folded or curved. In this paper, we develop the first learning-based method to achieve this goal.
The dataset was made up of individual pages, which means they could be draped over chairs; in one, a page is propped up by a keyboard at one corner and a sellotape dispenser at the other. It's kind of like the opposite of a spherical cow. I can see its value for data augmentation in universal image models; I just find it regrettable that they chose to use the same name as the existing spline-fitting dewarping methods.
A page is a surface: it bends according to material properties, and in books with a spine this curvature is constrained. Maybe it's not a Hermite cubic; maybe it's a quartic. Image recognition nets could do model selection, or solve the dewarping task directly without the optimisation.
We are now in the odd situation where papers lament how methods don't take advantage of the constraints of a 3D surface physical model
such geometric constraints are largely ignored in existing advanced solutions, which limits the rectification performance... 3D shape and textlines
— Excerpt from DocGeoNet, 2022
So it goes!
I read the ByteDance Seed1.5-VL paper this summer, whose authors mention doing document image dewarping as an OCR pretext task, though they don't call it that:
real world distortions, such as perspective shifts, bends, and wrinkles
This just becomes data augmentation and moves away from directly regressing pixels. Interestingly, they used Donut's SynthDoG (ECCV 2022).
Text detectors and other ideas for the future
Now that the program runs fast, I'd like to explore text detection models to automate parameter selection, since most failures I've seen come from not picking up enough contours and hence misjudging the overall page orientation, like this one (via):
| tk = 10 (default) | tk = 30 |
|---|---|
| Paragraphs 2 & 4 missing spans | Even coverage down the page |
My hunch is that text detector models won't give as precise contour shape info as the pipeline already has, but they could guide parameter sweeps for things like maximum text thickness/contour length, aiming to maximise overlap with detected text regions (see the sketch after the table below).
| Type | Span Contours | Text Det |
|---|---|---|
| thresh | | |
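Here's the sort of thing I have in mind, sketched with stand-in inputs (none of this exists in page-dewarp yet): rasterise the span contours found under each candidate parameter value and keep whichever value best overlaps the text detector's mask.

```python
# Hedged sketch of detector-guided parameter selection (my idea, not a
# page-dewarp feature): sweep a contour parameter and keep the value whose
# rasterised span contours best overlap the text detector's mask.
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def pick_parameter(candidate_masks: dict[int, np.ndarray],
                   detector_mask: np.ndarray) -> int:
    """candidate_masks maps a parameter value (e.g. max text thickness)
    to the span-contour mask produced with that value."""
    return max(candidate_masks, key=lambda tk: iou(candidate_masks[tk], detector_mask))

# Toy demonstration with random masks standing in for real contour output
rng = np.random.default_rng(0)
detector = rng.random((64, 64)) > 0.5
candidates = {tk: rng.random((64, 64)) > 0.5 for tk in (10, 20, 30)}
print(pick_parameter(candidates, detector))
```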
II: Putting the Data Model on the Map
| polars-genson | genson-core | avrotize-rs |
|---|---|---|
My life was much improved this year by learning the term "Map type".
I'd been struggling to articulate how flattening JSON ought to be done when working with DataFrames, which must have a consistent schema that's known upfront.
In February I developed a Polars plugin polars-schema-index for "flattening nested columns with stable numerical indexing", which was an unsatisfying approach (a depth-first walk appending numbers to the column names), but was in the right general area.
Around August I had been exploring whether I could process Wikidata (a ~1.6TB extraction effort that could shrink to a few GB done right: wikidata-pq), because I wanted to mine a simple relational dataset from it and didn't care for SPARQL. This idea got shelved for the more compelling work of solving the general case of the problem I hit when trying to ingest Wikidata.
An idea to mine Wikidata for a synthetic document generation corpus
For the record, the idea I had for Wikidata here was to generate realistic synthetic documents, since the outputs from SynthDoG (used by TikTok parent company ByteDance in their vision-language foundation models) are semi-nonsensical.
I still want to write a data augmentation tool to make synthetic documents to train models for image tasks, ultimately to find a way to extract index pages with nested structures crossing over the boundary of multiple pages in a sequence, but that's a story for a new year.
Every entity ("Wikidatum"?) has labels in a few out of hundreds of possible languages. If you flatten that naively you get 200 mostly-null columns: you are treating them all as non-required fields of a JSON object or Polars struct column. This is fundamentally the wrong semantics for the labels field: the language codes are incidental not essential, i.e. they should be considered tied to the row rather than a property of the dataset.
I was already all in on data models and their use at runtime to give programs stronger 'contracts' (a.k.a. {pre/post}conditions), yet a gap I kept hitting, and for a long time struggled to verbalise, was having data without a schema and needing to make one before I could do anything useful. The fix is to infer it: type inference on the values, resolve the unions, and so on. None of my previous approaches (educated guesses, trial and error, vibecoding) really scaled.
In all I made five intertwined packages, the main ones being the core Rust crate and the Polars extension package:
- genson-core handles this Map type detection, as fast as possible in Rust,
- genson-cli just wraps genson-core as a CLI, and is handy for testing,
- avrotize-rs interconverts JSONSchema and Avro schema, in Rust,
- polars-genson wraps the JSON schema inference in a DataFrame operation, as a Python package,
- polars-jsonschema-bridge interconverts Polars Schema types and JSON Schema, in Rust.
Schema inference as constraint discovery
genson-core infers schemas from JSON, which I think of as constraint discovery. A schema isn't really describing what your data is so much as what it's allowed to be, its "contract" on the data. The inference task is to observe enough examples to figure out the boundaries. In an ideal world you'd stop as soon as you'd seen enough to make some decision, but that kind of algorithmic performance optimisation is easier said than done.
Since the goal is to read enough of the data to identify all the possible values to constrain to (in situations where you want to be completely sure, this means reading all the data), the inference has to be fast, so naturally this ended up in Rust again.
Map inference for sparse data
A Map type is actually an array in disguise: it looks like an object (sometimes called a "mapping", in Python a "dictionary") but the keys are row-level data, not a schema-level structure.
The Map type was the crux of this work. I needed more than genson-rs, which infers data types fine but treats all JSON objects as object types.
So how do you tell Map from object? Since the only distinction is dataset semantics—incidental keys vs essential ones—you have to check the full dataset and decide.
"en": {
"language": "en",
"value": "Hello"
},
"fr": {
"language": "fr",
"value": "Bonjour"
}
}
"type": "map",
"values": {
"type": "record",
"fields": [
{"name": "language"},
{"name": "value"}
]
}
}
{
"key": "en",
"value": {
"language": "en",
"value": "Hello"
}
},
{
"key": "fr",
"value": {
"language": "fr",
"value": "Bonjour"
}
}
]
Naive flattening (one nullable column per language per field):

```
en_language: string | null
en_value: string | null
fr_language: string | null
fr_value: string | null
de_language: null
de_value: null
es_language: null
es_value: null
... 200+ more nullable columns
```

With a Map type instead:

```
labels: Map<string, Record>
// Where Record is:
{
  language: string
  value: string
}
```
Any number of languages, same schema
Since a Map is just an array in disguise, it can be of any length, as long as its values (the key-value pairs) are always the same data types (typically string-string). When the data types are heterogeneous, my solution is to "promote" scalars into both lists and objects. It also handles unions of map types (I call this "map unification"), and you can configure it all for fine control.
The program tries to do the right thing automatically. If users have to know a feature exists to benefit from it, most won't. Without schema evolution people just drop data they can't fit, which to me is a matter of correctness as much as UX convenience.
"en": "Hello",
"fr": "Bonjour",
"de": "Hallo"
}
"type": "object",
"additionalProperties": {
"type": "string"
}
}
| Symbol | Meaning | Example |
|---|---|---|
| \|UK\| | Total unique keys observed across all rows | Row 1 has {a,b}, Row 2 has {a,c} → \|UK\| = 3 |
| \|RK\| | Count of keys required (present in every row) | Only "a" is in both rows → \|RK\| = 1 |
"a": {
"index": 0,
"vowel": 0
}
}
"b": {
"index": 1,
"consonant": 0
}
}
"type": "map",
"values": {
"type": "record",
"fields": [
{"name": "index", "type": "int"}, // required
{"name": "vowel", "type": ["null", "int"]}, // nullable
{"name": "consonant", "type": ["null", "int"]} // nullable
]
}
}
When a scalar value collides with a record under the same field, wrap_scalars promotes the scalar into an object under a synthesised field named fieldname__string. This allows unification to succeed instead of failing.
"datavalue": {
"id": "Q42",
"type": "item"
}
}
"datavalue": "some-string"
}
{
"type": "object",
"properties": {
"id": {"type": ["null", "string"]},
"type": {"type": ["null", "string"]},
"datavalue__string": {"type": ["null", "string"]}
}
}
{"en": "Hello", "fr": "Bonjour"}
"en": "Hello",
"fr": "Bonjour"
}
{"en": "Hello"},
{"fr": "Bonjour"}
]
{"key":"en", "value":"Hello"},
{"key":"fr", "value":"Bonjour"}
]
| Parameter | Default | Effect |
|---|---|---|
| map_threshold | 20 | Objects with ≥N distinct keys become Map candidates |
| map_max_required_keys | None | If set, blocks Map inference when required keys exceed this limit |
| unify_maps | false | Enables merging of compatible heterogeneous record schemas |
| wrap_scalars | true | Promotes scalars to objects when they collide with record values |
| no_root_map | true | Prevents document root from becoming a Map type |
| force_field_types | {} | Override inference: {"labels": "map"} forces Map |
| wrap_root | None | Wraps entire schema under a single field name |
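To make a couple of those defaults concrete, here's a simplified Python sketch of the Map-vs-record decision as I understand it from the parameters above (not the genson-core source): count the distinct keys across all rows (|UK|) and the keys present in every row (|RK|), then apply map_threshold and map_max_required_keys.

```python
# Simplified sketch of the Map-vs-record decision (my reading of the
# parameters above, not the genson-core implementation).
from typing import Any

def looks_like_map(rows: list[dict[str, Any]],
                   map_threshold: int = 20,
                   map_max_required_keys: int | None = None) -> bool:
    all_keys: set[str] = set()        # |UK|: keys seen in any row
    required: set[str] | None = None  # |RK|: keys present in every row
    for obj in rows:
        keys = set(obj)
        all_keys |= keys
        required = keys if required is None else required & keys
    required = required or set()
    if len(all_keys) < map_threshold:
        return False  # few distinct keys: looks like a record (object)
    if map_max_required_keys is not None and len(required) > map_max_required_keys:
        return False  # too many always-present keys: structural, not incidental
    return True       # many incidental keys: treat as a Map

# e.g. language-keyed labels: many distinct keys, none required in every row
rows = [{"en": "Hello", "fr": "Bonjour"}, {"de": "Hallo", "en": "Hi"}]
print(looks_like_map(rows, map_threshold=3))  # True
```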
From JSON Schema to Avro
avrotize-rs handles schema translation from JSONSchema to Avro schema. I ported it from Clemens Vasters' Python original avrotize (which covers several other format interconversions) because I needed it fast and in Rust. JSON Schema is fine as an interchange format but Avro is what data engineers actually use, and crucially it has native Map support. (JSONSchema can represent Maps too but it's much uglier. I understand JSONSchema a lot better now, for what that's worth.)
The three together form a pipeline: raw JSON → inferred JSON Schema → Avro schema → typed processing.
Battle testing
Battle tested on all of Wikidata, which surfaced plenty of edge cases (see the snapshot tests if you're curious, which were made by 'reducing' reproducible examples). polars-genson wraps it as a Polars plugin so you can do schema inference directly on string columns in a DataFrame. There was an issue with very large datasets seemingly not being deallocated from the Python side; if you're working with large data you may want to consider going entirely Rust-side (I think I mostly didn't try this sooner because I was still acclimatising to Rust).
III: Unmixing axes of meaning
| polars-fastembed | picard-ica |
|---|---|
I always felt like I should use embeddings more than I did (what with everyone doing RAG), but there was an unsolved engineering problem: the plumbing you have to deal with before you can just use these models in your data pipelines.
The idea behind polars-fastembed was embeddings as a DataFrame operation: column of strings in, column of vectors out, locally, quickly. For retrieval (lookup) on said vector column, we could then use the existing polars-distance plugin.
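The shape of the operation, sketched with plain Polars and a stand-in embed() function (polars-fastembed does this natively and in Rust, so treat this as the general pattern rather than the plugin's API):

```python
# Sketch of "column of strings in, column of vectors out" using plain Polars
# and a stand-in embed() -- the real plugin does this natively in Rust, so this
# shows the shape of the operation, not polars-fastembed's API.
import polars as pl

def embed(text: str) -> list[float]:
    # Stand-in for a real sentence-embedding model
    return [float(len(text)), float(text.count(" "))]

df = pl.DataFrame({"text": ["Hello world", "Bonjour tout le monde"]})
df = df.with_columns(
    pl.col("text")
      .map_elements(embed, return_dtype=pl.List(pl.Float64))
      .alias("embedding")
)
print(df)
```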
Benchmarks
polars-fastembed wraps fastembed-rs, a Rust port of the Python fastembed package from qdrant. My main contribution to fastembed-rs this year was adding support for Snowflake's Arctic Embed models, including the quantised variants. The initial response when I opened an issue was "Yeah, this package doesn't intend to keep feature parity with the Python version", but this was less a rejection than a "please send PRs, not issues", so I did.
My benchmark: embedding all 708 Python PEPs (realistic corpus size, varying document length).
| Model | Device | Time | Throughput |
|---|---|---|---|
| MiniLM | CPU | 60s | 8ms/kT |
| Arctic Embed XS | GPU (CUDA) | 5s | 5ms/kT |
| Luxical One ★ | CPU | 1.8s | 0.5ms/kT |
Decomposition with ICA
Once you have embeddings you either retrieve (a.k.a. semantic search) or decompose them somehow; polars-fastembed has both, the latter being the less straightforward.
I tried some decomposition with FastICA based on a literature review of papers like FASTopic that led me to S3, "Semantic Signal Separation" (ACL 2025), and found it useful enough that I developed picard-ica (PICARD is faster than FastICA and I correctly assumed I could make it go even faster in Rust).
I'd written off topic models as data pseudoscience, and ICA has restored some faith—the topics actually look like topics, not stopword junk. It's still not an exact science and I can still see more to do in controlling them.
In theory, plain linear algebraic decomposition gets the goods here: the topics are nice semantic primitives trained into the embeddings. I put together a web app, arxiv_explorer, to try them out at reasonable scale and thought the keywords looked passable-to-good.
| 1-5 | 6-10 |
|---|---|
| 11-15 | 16-20 |
|---|---|
Embeddings tell you where something is in semantic space (its "meaning"), and their decomposition separates out what the dimensions of that space are. The paper gives this a geometric interpretation, and it seemed reasonable enough that I went ahead and implemented it. You can treat the decomposition as an optimisation task: separating out the components of meaning as you would separate the colours of mixed paint or individual vocals in a multi-speaker recording (the cocktail party problem). Its technical name is BSS (Blind Signal Separation), the same line of research that's had great success in music source separation (vocal stem isolation and the like).
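A minimal sketch of the idea, with scikit-learn's FastICA standing in for PICARD and random arrays standing in for real document and vocabulary embeddings: unmix the embedding space into independent components, then score vocabulary terms against each component to read off topic keywords.

```python
# Rough sketch of Semantic Signal Separation-style topics, with scikit-learn's
# FastICA standing in for PICARD (picard-ica plays this role in the real stack)
# and random arrays standing in for real embeddings.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(500, 384))           # stand-in document embeddings
vocab = np.array(["rust", "schema", "gradient", "camera", "topic", "dataframe"])
word_embeddings = rng.normal(size=(len(vocab), 384))   # stand-in embedded vocab terms

# Unmix the embedding space into independent "axes of meaning"
ica = FastICA(n_components=10, whiten="unit-variance", random_state=0)
doc_topic = ica.fit_transform(doc_embeddings)          # (n_docs, n_topics)

# Score vocabulary terms against each component and read off top keywords
word_topic = ica.transform(word_embeddings)            # (n_words, n_topics)
for k in range(3):
    top = vocab[np.argsort(word_topic[:, k])[::-1][:3]]
    print(f"topic {k}: {', '.join(top)}")
```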
PICARD implementation
PICARD stands for Preconditioned ICA for Real Data. The preconditioning refers to whitening (making the covariance spherical, i.e. isotropic baseline variance) and using an approximated Hessian to speed up convergence. It was made to handle real world data better than alternatives (FastICA and InfoMax) hence the name.
Correctness is required for speed here (sort of the perf parallel to "if it compiles, it works"): if you screw up the implementation it simply cannot converge fast. Since the algorithm ultimately calls into LAPACK, the raw number crunching is already taken care of, and correctness of the update is the bottleneck on perf. As a fan of correctness anyway, I found it a nice way to work.
Hitting 200 iterations instead of 20 means no amount of low-level optimisation can save you (albeit I did see maybe a ~10% perf advantage of Rust over Python at the same iteration count). I did add some well-intentioned 'warm-up' operations and optimiser guidance, but anything that makes each iteration slower is likely not worth the lag. The convergence step counts don't match between picard-ica and the original Python PICARD because of different RNG engines, but they're close enough for me to be confident in them, and overall sufficiently fast to be useful.
ICA can be cast as likelihood maximisation for an unmixing matrix W (separating out the sources, here the topics) by approximating the 2nd order curvature of the objective function. The idea is similar to audio source separation (stem isolation, the cocktail party problem) and my sense is this whole space has a classic vs. end-to-end DL split. The linear algebra approach feels almost quaint next to the models people throw at audio these days, but for text embeddings it works and the results are interpretable.
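In symbols (the standard maximum-likelihood ICA objective, written from memory rather than quoted from the PICARD paper), for whitened embedding vectors $x_t$ and unmixing rows $w_i$ of $W$:

$$
\mathcal{L}(W) \;=\; \sum_{t=1}^{T} \left( \log \lvert \det W \rvert \;+\; \sum_{i=1}^{N} \log p_i\!\left( w_i^{\top} x_t \right) \right)
$$

where the $p_i$ are the assumed source densities; as I understand it, PICARD's contribution is taking quasi-Newton steps on this objective using a cheap approximation to its Hessian, which is what keeps the iteration count low when the implementation is right.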
I tried a few Rust linear algebra backends along the way with picard-ica and the Rust FastICA that preceded it (now an optional warm-up routine, along with JADE, another method using fourth-order cumulants to model kurtosis). Specifically, linfa/ndarray performed well; faer was nice in theory (being pure Rust), but I saw worse perf when I rewrote in it and didn't want to take the hit, so I unfortunately abandoned ship. It looks like faer's development is paused, but the developer is also active in facet-rs, so I'll be watching that. In particular the integration with Polars piqued my interest (though comments in the repo's issue tracker made me suspect it's no longer functional).
Since picard-ica is a Rust crate, I just dropped it into polars-fastembed, where it now underlies the topic modelling: source separation directly on text embedding vectors, giving topics that look good. Before I integrated it properly, ICA was a bottleneck in arxiv-explorer (a little FastAPI web app for UMAP viz of embeddings of all arXiv article titles + abstracts in a subfield/time range of the user's choice, loaded on demand from a dataset I preprocessed for this purpose).
Demo time: topic modelling this post
As a fun exercise I thought it'd be cool to embed this blog post, which covers three projects, and see if the topics recover them.
Topic models have always felt like astrology to me, relying on a charitable reading. The FastICA approach at least produces less incoherent junk keywords, which I'll take.
My interpretation of these is of 2 subject axes whose poles are distinct topics (Rust engineering and type theory on one, schema engineering and gradients/ML/geometry on another), and a style axis separating tales of the grind from the tone of project overviews (source code here).
New and next for the embedding space
Over Christmas I began implementing MUVERA (which came out of Google this summer) for multi-vector embeddings, benchmarking with BEIR; next, I've got image and multimodal embeddings in my sights for 2026.
I have some ideas, set aside a while back, for little apps to make with embeddings that might be viable now this groundwork is down: more creative experimentation rather than just mechanically getting from A to B.
I also took a shot this month at exporting some of the (quite small) Tarka 150M models to ONNX (their makers just announced even tinier 10M ones in "preview"). I'm excited to see if 2026 brings new attention to this small/fast/local embedding paradigm, now that polars-fastembed is raring to go.
Happy new year! I'm looking for my next role in 2026, if you're working on hard data problems and can tolerate someone who will inevitably want to rewrite it in Rust, I'd like to hear from you. Email or DM.