I churned out 30 Python packages this year and, somewhat unexpectedly, an equal number of Rust crates (my first ever being in March).
Looking back, I seem to have a thing for inverse problems.
- Dewarping page images to recover flattened geometry from photographed books
- Schema inference on JSON to recover the shape and types from observed values (allowing it to fit neatly into a DataFrame's column schema)
- ICA on neural text embeddings to recover meaningful topics.
Put another way, inverting a camera projection, undoing JSON serialisation, and unmixing vectors in an embedding space.
In the case of image dewarping, there is a correct answer, and you try to pull it out through optimisation (or as close as you can get). Models are only as good as their assumptions, and the assumption here is of a particular type of curve, which we'll get into in a bit.
Schema inference is more algebraic (and your result is verifiably correct: a sufficiently strict engine like a DataFrame will simply error if the schema doesn't fit the inputs). If you dig up enough edge cases and iron them out you can make quite a mature inference engine, though the edges get trickier the more mature your tool becomes.
The text embedding decomposition problem has a more probabilistic flavour: there isn't a clear 'right answer', which has always made topic models feel suspect to me.
I: Unwarping unwrapped
| page-dewarp |
|---|
I'm not a computer vision researcher so much as a long-time fanboy who took it upon himself to preserve some Python 2 code in 2021. I'd been mainly tinkering around the edges after the initial refactoring of page-dewarp, until this year when I finally reworked the reprojection optimisation at its core, without losing the spirit of the original.
To recap, the program uses a cubic sheet model (meaning it fits a cubic curve), specifically a "Hermite" one (whose curve height is set to 0 at both ends, as for a page resting on a flat surface). Matt Zucker's original blog post treats a photographed page as a smoothly curved surface rather than an open-ended pixel shuffling problem. You fit these splines to text contours you detect with OpenCV, then solve for the coefficients that best explain where the text ended up when projected through your camera model. It's physically grounded and a nice use of computational geometry (homography matrices or rotations in 3D, solved by SVD).
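To make the model concrete, here is a rough sketch of such a Hermite cubic (my paraphrase, not the page-dewarp source): the sheet height is pinned to zero at both page edges, leaving the two endpoint slopes as the free shape parameters.

```python
# Rough sketch (my paraphrase, not the page-dewarp source) of a Hermite cubic
# "sheet" height profile: zero at both page edges, free slopes at each end.
import numpy as np

def hermite_sheet_height(t: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Cubic f on [0, 1] with f(0) = f(1) = 0, f'(0) = alpha, f'(1) = beta."""
    h10 = t**3 - 2 * t**2 + t   # Hermite basis term for the start slope
    h11 = t**3 - t**2           # Hermite basis term for the end slope
    return alpha * h10 + beta * h11

t = np.linspace(0.0, 1.0, 5)
print(hermite_sheet_height(t, alpha=0.3, beta=-0.2))  # rises from 0 and returns to 0
```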
Solving the non-linear least squares problem
Zucker left a hint in his blog post that this was a non-linear least squares optimisation that hadn't been solved as such, and doing so could speed it up. The original code used SciPy's Powell method, which is 'derivative-free' (it explores the objective by evaluating it at different points without needing gradient information). This was a pragmatic choice as computing the gradient manually through the full projection pipeline (cubic splines → 3D geometry → Rodrigues rotation → camera projection → error) would require more complex Jacobian calculus.
The reprojection error objective we're minimising is textbook smooth least squares: squared differences between projected and detected keypoints, composed entirely of matrix operations, polynomial evaluation, and camera projection. That makes it a natural fit for gradient-based methods like L-BFGS and Levenberg-Marquardt, provided the gradients are accurate. Finite-difference approximations introduce enough noise to mislead the optimiser: poorly conditioned steps, more iterations to converge, and a slower run overall.
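As a rough sketch of the setup (a toy 2D warp standing in for the full cubic-sheet-plus-camera pipeline, not the actual page-dewarp objective), you can hand JAX's exact gradient to SciPy's L-BFGS-B:

```python
# Hedged sketch: exact gradients from JAX fed to SciPy's L-BFGS-B. The toy
# project() stands in for the real pipeline (cubic sheet -> rotation -> camera).
import jax
import jax.numpy as jnp
import numpy as np
from scipy.optimize import minimize

def project(params, pts):
    a, b, tx, ty = params            # toy 2D similarity transform
    R = jnp.array([[a, -b], [b, a]])
    return pts @ R.T + jnp.array([tx, ty])

def reprojection_error(params, pts, detected):
    # Smooth least squares: squared distance of projected vs. detected keypoints
    return jnp.sum((project(params, pts) - detected) ** 2)

pts = jnp.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
detected = project(jnp.array([0.9, 0.1, 0.05, -0.02]), pts)  # synthetic ground truth

objective = jax.jit(reprojection_error)
gradient = jax.jit(jax.grad(reprojection_error))

res = minimize(
    lambda p: float(objective(jnp.asarray(p), pts, detected)),
    x0=np.array([1.0, 0.0, 0.0, 0.0]),
    jac=lambda p: np.asarray(gradient(jnp.asarray(p), pts, detected), dtype=np.float64),
    method="L-BFGS-B",
)
print(res.x, res.nit)  # recovers [0.9, 0.1, 0.05, -0.02] in a handful of iterations
```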
I'd tried to pursue Levenberg-Marquardt (which Gilbert Strang covers at the end of Linear Algebra & Learning From Data) with GPUFit a couple of years ago, but not being a C++ dev this wasn't much fun, and it ultimately didn't land.
I had been excited by what looked like a new Rust port of SciPy, but it quickly became clear the developer was using LLMs in a, shall we say, hands-off way. I extracted what I needed and tried to contribute some fixes for the egregiously reward-hacked tests, but there was really nothing of value there. If an optimiser is not implemented correctly it is simply not usable, and this was pretty much a total waste of my time. At best it was a Teachable Moment™: always check the tests for reward hacking if it sounds too good to be true.
After putting this discouragement behind me, I circled back to the idea of Levenberg-Marquardt, only to find it another dead end. SciPy's L-BFGS worked but converged poorly: finite-difference gradient estimates give the optimiser too noisy a signal to follow. When I swapped the optimiser backend from SciPy to JAX with autodiff, I finally had exact gradients, and a serious contender by quite a margin:
| Image | SciPy Powell | JAX L-BFGS-B | Speedup | Eval Reduction |
|---|---|---|---|---|
| boston_cooking_a | 12.18s | 2.01s | 6.1× | ↓101× |
| boston_cooking_b | 9.69s | 1.21s | 8.0× | ↓155× |
| finnish_cooking_a | 12.54s | 1.31s | 9.6× | ↓198× |
| linguistics_thesis_a | 3.15s | 0.86s | 3.7× | ↓130× |
| linguistics_thesis_b | OOM | 0.31s | ∞ | ↓289× |
A JAXmas miracle
The speedup came from not misleading the optimiser with noisy gradients. It's less "JAX is fast" and more "the old code was flying blind". (For the record, the first clue was the manually derived gradient.)
RE: the OOM → 0.31s case
The linguistics_thesis_b image previously crashed with an Out Of Memory error: numerical instability in the optimisation produced a wildly wrong prediction for the size of one of the arrays (a bug in itself, but one that should not happen with a properly conditioned optimiser). The image still exhibits a failure mode, but it no longer results in a crash!
Batch processing
Once single-image perf was sorted, the obvious next stop was batch processing: the typical user has a stack of scans, not a single page at a time.
JAX's JIT has a warmup cost, so it's not worth it for single images, but batches can see an extra 3-5× speed boost. This is switched on automatically for more than one input image:
| Device | Sequential | Batched (40 images) | Speedup |
|---|---|---|---|
| CPU ★ | 36s | 8.7s | 4.1× |
| GPU | 53s | 11.2s | 4.7× |
Interestingly, on my hardware (single GPU, many CPU cores) I see CPU batches beating the GPU in JAX's batch mode by some 30%. Don't underestimate CPU SIMD!
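In rough JAX terms (a sketch of the idea, not the page-dewarp internals), you pay the JIT compile once and vmap the per-image work over the whole stack:

```python
# Sketch of the batching idea (not the actual page-dewarp code): compile the
# per-image objective once, then vmap it over a whole stack of images so the
# JIT warmup cost is amortised across the batch.
import jax
import jax.numpy as jnp

def per_image_error(params, keypoints, targets):
    # Placeholder objective with the same shape as a reprojection error
    return jnp.sum((keypoints * params[0] + params[1] - targets) ** 2)

batched_error = jax.jit(jax.vmap(per_image_error, in_axes=(0, 0, 0)))
batched_grad = jax.jit(jax.vmap(jax.grad(per_image_error), in_axes=(0, 0, 0)))

n_images, n_points = 40, 100
params = jnp.ones((n_images, 2))
keypoints = jnp.zeros((n_images, n_points))
targets = jnp.zeros((n_images, n_points))

print(batched_error(params, keypoints, targets).shape)  # (40,): one loss per image
print(batched_grad(params, keypoints, targets).shape)   # (40, 2): one gradient each
```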
Physically grounded methods
From around 2017, dewarping research began shifting to pixel-wise regression methods trained on crumpled-paper datasets. It started off fairly reasonably, with "four-way folds" (Das et al. at ACM DocEng 2017 used a CNN in a single optimisation step), then DocUNet (CVPR 2018) came along. Since then its descendants have framed dewarping as pixel displacement prediction.
This dataset is 130 photos of paper folded so heavily it surpasses origami and borders on crumpled (I've included examples below in case it sounds like I'm exaggerating). None show pages in books. I'd suggest it's a benchmark for a problem nobody has, but one which favours pixel-wise methods like convolution over spline models (which strikes me as picking winners).
| File | 1_2.jpg | 54_2.jpg | 52_2.jpg | 52_1.jpg | 51_2.jpg |
|---|---|---|---|---|---|
| Photo | |||||
| Comment | Origami | Scrunched | Diagonal fold | Propped up on 2 sides | Modern art |
| File | 50_2.jpg | 38_2.jpg | 32_1.jpg | 29_2.jpg | 27_1.jpg |
|---|---|---|---|---|---|
| Photo | |||||
| Comment | Origami | Crinkled | Multi-diagonal | Origami | Oblique fold |
| File | 24_2.jpg | 23_2.jpg | 20_2.jpg | 18_2.jpg | 17_2.jpg | 16_2.jpg |
|---|---|---|---|---|---|---|
| Photo | ||||||
| Comment | Oblique 4-way fold | Crumpled | Vertical accordion folds | Crumpled | 6-way fold with a crease | Crumpled/origami |
The paper claims to have been "the first learning-based method" for dewarping, by which they presumably mean not counting methods that learn the parameters of anything other than a deep neural network:
it is often desirable to digitally flatten a document image when the physical document sheet is folded or curved. In this paper, we develop the first learning-based method to achieve this goal.
The dataset was made up of individual pages, which means they could be draped over chairs; in one, a page is propped up by a keyboard at one corner and a sellotape dispenser at the other. It's kind of like the opposite of a spherical cow. I can see its value for data augmentation in universal image models; I just find it regrettable that they chose to use the same name as the existing spline-fitting dewarping methods.
A page is a surface: it bends according to material properties, and in books with a spine this curvature is constrained. Maybe it's not a Hermite cubic; maybe it's a quartic. Image recognition nets could do model selection, or solve the dewarping task directly without the optimisation.
We are now in the odd situation where papers lament how methods don't take advantage of the constraints of a 3D surface physical model
such geometric constraints are largely ignored in existing advanced solutions, which limits the rectification performance... 3D shape and textlines
— Excerpt from DocGeoNet, 2022
So it goes!
I read the ByteDance Seed1.5-VL paper this summer, whose authors mention doing document image dewarping as an OCR pretext task, though they don't call it that:
real world distortions, such as perspective shifts, bends, and wrinkles
This just becomes data augmentation and moves away from directly regressing pixels. Interestingly, they used Donut's SynthDoG (ECCV 2022).
Text detectors and other ideas for the future
Now that the program runs fast, I'd like to explore text detection models to automate parameter selection, since most failures I've seen come from not picking up enough contours and hence misjudging the overall page orientation, like this one (via):
| tk = 10 (default) | tk = 30 |
|---|---|
| Paragraphs 2 & 4 missing spans | Even coverage down the page |
My hunch is that text detector models won't give as precise contour shape info as the pipeline already has, but they could guide parameter sweeps for things like maximum text thickness/contour length, aiming to maximise overlap with detected text regions (see the sketch after the table below).
| Type | Span Contours | Text Det |
|---|---|---|
| thresh | | |
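Here's the sort of thing I have in mind, sketched with stand-in inputs (none of this exists in page-dewarp yet): rasterise the span contours found under each candidate parameter value and keep whichever value best overlaps the text detector's mask.

```python
# Hedged sketch of detector-guided parameter selection (my idea, not a
# page-dewarp feature): sweep a contour parameter and keep the value whose
# rasterised span contours best overlap the text detector's mask.
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def pick_parameter(candidate_masks: dict[int, np.ndarray],
                   detector_mask: np.ndarray) -> int:
    """candidate_masks maps a parameter value (e.g. max text thickness)
    to the span-contour mask produced with that value."""
    return max(candidate_masks, key=lambda tk: iou(candidate_masks[tk], detector_mask))

# Toy demonstration with random masks standing in for real contour output
rng = np.random.default_rng(0)
detector = rng.random((64, 64)) > 0.5
candidates = {tk: rng.random((64, 64)) > 0.5 for tk in (10, 20, 30)}
print(pick_parameter(candidates, detector))
```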
II: Putting the Data Model on the Map
| polars-genson | genson-core | avrotize-rs |
|---|---|---|
My life was much improved this year by learning the term "Map type".
I'd been struggling to articulate how flattening JSON ought to be done when working with DataFrames, which must have a consistent schema that's known upfront.
In February I developed a Polars plugin polars-schema-index for "flattening nested columns with stable numerical indexing", which was an unsatisfying approach (a depth-first walk appending numbers to the column names), but was in the right general area.
Around August I had been exploring whether I could process Wikidata (a ~1.6TB extraction effort that could shrink to a few GB done right: wikidata-pq), because I wanted to mine a simple relational dataset from it and didn't care for SPARQL. This idea got shelved for the more compelling work of solving the general case of the problem I hit when trying to ingest Wikidata.
An idea to mine Wikidata for a synthetic document generation corpus
For the record, the idea I had for Wikidata here was to generate realistic synthetic documents, since the outputs from SynthDoG (used by TikTok parent company ByteDance in their vision-language foundation models) are semi-nonsensical.
I still want to write a data augmentation tool to make synthetic documents to train models for image tasks, ultimately to find a way to extract index pages with nested structures crossing over the boundary of multiple pages in a sequence, but that's a story for a new year.
Every entity ("Wikidatum"?) has labels in a few out of hundreds of possible languages. If you flatten that naively you get 200 mostly-null columns: you are treating them all as non-required fields of a JSON object or Polars struct column. This is fundamentally the wrong semantics for the labels field: the language codes are incidental not essential, i.e. they should be considered tied to the row rather than a property of the dataset.
I was already all in on data models and their use at runtime to give programs stronger 'contracts' (a.k.a. {pre/post}conditions), yet a gap I kept hitting, and for a long time struggled to verbalise, was having data without a schema and needing to make one before I could do anything useful. The fix is to infer it: type inference on the values, resolve the unions, and so on. None of my previous approaches (educated guesses, trial and error, vibecoding) really scaled.
In all I made five intertwined packages, the main ones being the core Rust crate and the Polars extension package:
- genson-core handles this Map type detection, as fast as possible in Rust,
- genson-cli just wraps genson-core as a CLI, and is handy for testing,
- avrotize-rs interconverts JSONSchema and Avro schema, in Rust,
- polars-genson wraps the JSON schema inference in a DataFrame operation, as a Python package,
- polars-jsonschema-bridge interconverts Polars Schema types and JSON Schema, in Rust.
Schema inference as constraint discovery
genson-core infers schemas from JSON, which I think of as constraint discovery. A schema isn't really describing what your data is so much as what it's allowed to be, its "contract" on the data. The inference task is to observe enough examples to figure out the boundaries. In an ideal world you'd stop as soon as you'd seen enough to make some decision, but that kind of algorithmic performance optimisation is easier said than done.
Since the goal is to read enough of the data to identify all the possible values to constrain to (in situations where you want to be completely sure, this means reading all the data), the inference has to be fast, so naturally this ended up in Rust again.
Map inference for sparse data
A Map type is actually an array in disguise: it looks like an object (sometimes called a "mapping", in Python a "dictionary") but the keys are row-level data, not a schema-level structure.
The Map type was the crux of this work. I needed more than genson-rs, which infers data types fine but treats all JSON objects as object types.
So how do you tell Map from object? Since the only distinction is dataset semantics—incidental keys vs essential ones—you have to check the full dataset and decide.
"en": {
"language": "en",
"value": "Hello"
},
"fr": {
"language": "fr",
"value": "Bonjour"
}
}
"type": "map",
"values": {
"type": "record",
"fields": [
{"name": "language"},
{"name": "value"}
]
}
}
{
"key": "en",
"value": {
"language": "en",
"value": "Hello"
}
},
{
"key": "fr",
"value": {
"language": "fr",
"value": "Bonjour"
}
}
]
Naive flattening (one nullable column per language per field):

```
en_language: string | null
en_value: string | null
fr_language: string | null
fr_value: string | null
de_language: null
de_value: null
es_language: null
es_value: null
... 200+ more nullable columns
```

With a Map type instead:

```
labels: Map<string, Record>
// Where Record is:
{
  language: string
  value: string
}
```
Any number of languages, same schema
Since a Map is just an array in disguise, it can be of any length, as long as its values (the key-value pairs) are always the same data types (typically string-string). When the data types are heterogeneous, my solution is to "promote" scalars into both lists and objects. It also handles unions of map types (I call this "map unification"), and you can configure it all for fine control.
The program tries to do the right thing automatically. If users have to know a feature exists to benefit from it, most won't. Without schema evolution people just drop data they can't fit, which to me is a matter of correctness as much as UX convenience.
"en": "Hello",
"fr": "Bonjour",
"de": "Hallo"
}
"type": "object",
"additionalProperties": {
"type": "string"
}
}
| Symbol | Meaning | Example |
|---|---|---|
| \|UK\| | Total unique keys observed across all rows | Row 1 has {a,b}, Row 2 has {a,c} → \|UK\| = 3 |
| \|RK\| | Count of keys required (present in every row) | Only "a" is in both rows → \|RK\| = 1 |
"a": {
"index": 0,
"vowel": 0
}
}
"b": {
"index": 1,
"consonant": 0
}
}
"type": "map",
"values": {
"type": "record",
"fields": [
{"name": "index", "type": "int"}, // required
{"name": "vowel", "type": ["null", "int"]}, // nullable
{"name": "consonant", "type": ["null", "int"]} // nullable
]
}
}
When a scalar value collides with a record under the same field, wrap_scalars promotes the scalar into an object under a synthesised field named fieldname__string. This allows unification to succeed instead of failing.
"datavalue": {
"id": "Q42",
"type": "item"
}
}
"datavalue": "some-string"
}
{
"type": "object",
"properties": {
"id": {"type": ["null", "string"]},
"type": {"type": ["null", "string"]},
"datavalue__string": {"type": ["null", "string"]}
}
}
{"en": "Hello", "fr": "Bonjour"}
"en": "Hello",
"fr": "Bonjour"
}
{"en": "Hello"},
{"fr": "Bonjour"}
]
{"key":"en", "value":"Hello"},
{"key":"fr", "value":"Bonjour"}
]
| Parameter | Default | Effect |
|---|---|---|
| map_threshold | 20 | Objects with ≥N distinct keys become Map candidates |
| map_max_required_keys | None | If set, blocks Map inference when required keys exceed this limit |
| unify_maps | false | Enables merging of compatible heterogeneous record schemas |
| wrap_scalars | true | Promotes scalars to objects when they collide with record values |
| no_root_map | true | Prevents document root from becoming a Map type |
| force_field_types | {} | Override inference: {"labels": "map"} forces Map |
| wrap_root | None | Wraps entire schema under a single field name |
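To make a couple of those defaults concrete, here's a simplified Python sketch of the Map-vs-record decision as I understand it from the parameters above (not the genson-core source): count the distinct keys across all rows (|UK|) and the keys present in every row (|RK|), then apply map_threshold and map_max_required_keys.

```python
# Simplified sketch of the Map-vs-record decision (my reading of the
# parameters above, not the genson-core implementation).
from typing import Any

def looks_like_map(rows: list[dict[str, Any]],
                   map_threshold: int = 20,
                   map_max_required_keys: int | None = None) -> bool:
    all_keys: set[str] = set()        # |UK|: keys seen in any row
    required: set[str] | None = None  # |RK|: keys present in every row
    for obj in rows:
        keys = set(obj)
        all_keys |= keys
        required = keys if required is None else required & keys
    required = required or set()
    if len(all_keys) < map_threshold:
        return False  # few distinct keys: looks like a record (object)
    if map_max_required_keys is not None and len(required) > map_max_required_keys:
        return False  # too many always-present keys: structural, not incidental
    return True       # many incidental keys: treat as a Map

# e.g. language-keyed labels: many distinct keys, none required in every row
rows = [{"en": "Hello", "fr": "Bonjour"}, {"de": "Hallo", "en": "Hi"}]
print(looks_like_map(rows, map_threshold=3))  # True
```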
From JSON Schema to Avro
avrotize-rs handles schema translation from JSONSchema to Avro schema. I ported it from Clemens Vasters' Python original avrotize (which covers several other format interconversions) because I needed it fast and in Rust. JSON Schema is fine as an interchange format but Avro is what data engineers actually use, and crucially it has native Map support. (JSONSchema can represent Maps too but it's much uglier. I understand JSONSchema a lot better now, for what that's worth.)
The three together form a pipeline: raw JSON → inferred JSON Schema → Avro schema → typed processing.
Battle testing
Battle tested on all of Wikidata, which surfaced plenty of edge cases (see the snapshot tests if you're curious, which were made by 'reducing' reproducible examples). polars-genson wraps it as a Polars plugin so you can do schema inference directly on string columns in a DataFrame. There was an issue with very large datasets seemingly not being deallocated from the Python side; if you're working with large data you may want to consider going entirely Rust-side (I think I mostly didn't try this sooner because I was still acclimatising to Rust).
III: Unmixing axes of meaning
| polars-fastembed | picard-ica |
|---|---|
I always felt like I should use embeddings more than I did (what with everyone doing RAG), but there was an unsolved engineering problem: the plumbing you have to deal with before you can just use these models in your data pipelines.
The idea behind polars-fastembed was embeddings as a DataFrame operation: column of strings in, column of vectors out, locally, quickly. For retrieval (lookup) on said vector column, we could then use the existing polars-distance plugin.
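The shape of the operation, sketched with plain Polars and a stand-in embed() function (polars-fastembed does this natively and in Rust, so treat this as the general pattern rather than the plugin's API):

```python
# Sketch of "column of strings in, column of vectors out" using plain Polars
# and a stand-in embed() -- the real plugin does this natively in Rust, so this
# shows the shape of the operation, not polars-fastembed's API.
import polars as pl

def embed(text: str) -> list[float]:
    # Stand-in for a real sentence-embedding model
    return [float(len(text)), float(text.count(" "))]

df = pl.DataFrame({"text": ["Hello world", "Bonjour tout le monde"]})
df = df.with_columns(
    pl.col("text")
      .map_elements(embed, return_dtype=pl.List(pl.Float64))
      .alias("embedding")
)
print(df)
```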
Benchmarks
polars-fastembed wraps fastembed-rs, a Rust port of the Python fastembed package from qdrant. My main contribution to fastembed-rs this year was adding support for Snowflake's Arctic Embed models, including the quantised variants. The initial response when I opened an issue was "Yeah, this package doesn't intend to keep feature parity with the Python version", but this was less a rejection than a "please send PRs, not issues", so I did.
My benchmark: embedding all 708 Python PEPs (realistic corpus size, varying document length).
| Model | Device | Time | Throughput |
|---|---|---|---|
| MiniLM | CPU | 60s | 8ms/kT |
| Arctic Embed XS | GPU (CUDA) | 5s | 5ms/kT |
| Luxical One ★ | CPU | 1.8s | 0.5ms/kT |
Decomposition with ICA
Once you have embeddings you either retrieve (a.k.a. semantic search) or decompose them somehow; polars-fastembed has both, the latter being the less straightforward.
I tried some decomposition with FastICA based on a literature review of papers like FASTopic that led me to S3, "Semantic Signal Separation" (ACL 2025), and found it useful enough that I developed picard-ica (PICARD is faster than FastICA and I correctly assumed I could make it go even faster in Rust).
I'd written off topic models as data pseudoscience, and ICA has restored some faith—the topics actually look like topics, not stopword junk. It's still not an exact science and I can still see more to do in controlling them.
In theory, plain linear algebraic decomposition gets the goods here: the topics are nice semantic primitives trained into the embeddings. I put together a web app, arxiv_explorer, to try them out at reasonable scale and thought the keywords looked passable-to-good.
| 1-5 | 6-10 |
|---|---|
| 11-15 | 16-20 |
|---|---|
Embeddings tell you where something is in semantic space (its "meaning"), and their decomposition separates out what the dimensions of that space are. The paper gives this a geometric interpretation, and it seemed reasonable enough that I went ahead and implemented it. You can treat the decomposition as an optimisation task: separating out the components of meaning as you would separate the colours of mixed paint or individual vocals in a multi-speaker recording (the cocktail party problem). Its technical name is BSS (Blind Signal Separation), the same line of research that's had great success in music source separation (vocal stem isolation and the like).
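A minimal sketch of the idea, with scikit-learn's FastICA standing in for PICARD and random arrays standing in for real document and vocabulary embeddings: unmix the embedding space into independent components, then score vocabulary terms against each component to read off topic keywords.

```python
# Rough sketch of Semantic Signal Separation-style topics, with scikit-learn's
# FastICA standing in for PICARD (picard-ica plays this role in the real stack)
# and random arrays standing in for real embeddings.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(500, 384))           # stand-in document embeddings
vocab = np.array(["rust", "schema", "gradient", "camera", "topic", "dataframe"])
word_embeddings = rng.normal(size=(len(vocab), 384))   # stand-in embedded vocab terms

# Unmix the embedding space into independent "axes of meaning"
ica = FastICA(n_components=10, whiten="unit-variance", random_state=0)
doc_topic = ica.fit_transform(doc_embeddings)          # (n_docs, n_topics)

# Score vocabulary terms against each component and read off top keywords
word_topic = ica.transform(word_embeddings)            # (n_words, n_topics)
for k in range(3):
    top = vocab[np.argsort(word_topic[:, k])[::-1][:3]]
    print(f"topic {k}: {', '.join(top)}")
```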
PICARD implementation
PICARD stands for Preconditioned ICA for Real Data. The preconditioning refers to whitening (making the covariance spherical, i.e. isotropic baseline variance) and using an approximated Hessian to speed up convergence. It was made to handle real world data better than alternatives (FastICA and InfoMax) hence the name.
Correctness is required for speed here (sort of the perf parallel to "if it compiles, it works"): if you screw up the implementation it simply cannot converge fast. Since the algorithm ultimately calls into LAPACK, the raw number crunching is already taken care of, and correctness of the update is the bottleneck on perf. As a fan of correctness anyway, I found it a nice way to work.
Hitting 200 iterations instead of 20 means no amount of low-level optimisation can save you (albeit I did see maybe a ~10% perf advantage of Rust over Python at the same iteration count). I did add some well-intentioned 'warm-up' operations and optimiser guidance, but anything that makes each iteration slower is likely not worth the lag. The convergence step counts don't match between picard-ica and the original Python PICARD because of different RNG engines, but they're close enough for me to be confident in them, and overall sufficiently fast to be useful.
ICA can be cast as likelihood maximisation for an unmixing matrix W (separating out the sources, here the topics) by approximating the 2nd order curvature of the objective function. The idea is similar to audio source separation (stem isolation, the cocktail party problem) and my sense is this whole space has a classic vs. end-to-end DL split. The linear algebra approach feels almost quaint next to the models people throw at audio these days, but for text embeddings it works and the results are interpretable.
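In symbols (the standard maximum-likelihood ICA objective, written from memory rather than quoted from the PICARD paper), for whitened embedding vectors $x_t$ and unmixing rows $w_i$ of $W$:

$$
\mathcal{L}(W) \;=\; \sum_{t=1}^{T} \left( \log \lvert \det W \rvert \;+\; \sum_{i=1}^{N} \log p_i\!\left( w_i^{\top} x_t \right) \right)
$$

where the $p_i$ are the assumed source densities; as I understand it, PICARD's contribution is taking quasi-Newton steps on this objective using a cheap approximation to its Hessian, which is what keeps the iteration count low when the implementation is right.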
I tried a few Rust linear algebra backends along the way with picard-ica and the Rust FastICA that preceded it (now an optional warm-up routine, along with JADE, another method using fourth-order cumulants to model kurtosis). Specifically, linfa/ndarray performed well; faer was nice in theory (being pure Rust), but I saw worse perf when I rewrote in it and didn't want to take the hit, so I unfortunately abandoned ship. It looks like faer's development is paused, but the developer is also active in facet-rs, so I'll be watching that. In particular the integration with Polars piqued my interest (though comments in the repo's issue tracker made me suspect it's no longer functional).
Since picard-ica is a Rust crate, I just dropped it into polars-fastembed, where it now underlies the topic modelling: source separation directly on text embedding vectors, giving topics that look good. Before I integrated it properly, ICA was a bottleneck in arxiv-explorer (a little FastAPI web app for UMAP viz of embeddings of all arXiv article titles + abstracts in a subfield/time range of the user's choice, loaded on demand from a dataset I preprocessed for this purpose).
Demo time: topic modelling this post
As a fun exercise I thought it'd be cool to embed this blog post, which covers three projects, and see if the topics recover them.
Topic models have always felt like astrology to me, relying on a charitable reading. The FastICA approach at least produces less incoherent junk keywords, which I'll take.
My interpretation of these is of 2 subject axes whose poles are distinct topics (Rust engineering and type theory on one, schema engineering and gradients/ML/geometry on another), and a style axis separating tales of the grind from the tone of project overviews (source code here).
New and next for the embedding space
Over Christmas I began implementing MUVERA (which came out of Google this summer) for multi-vector embeddings, benchmarking with BEIR; next, I've got image and multimodal embeddings in my sights for 2026.
I have some ideas, set aside a while back, for little apps to make with embeddings that might be viable now this groundwork is down: more creative experimentation rather than just mechanically getting from A to B.
I also took a shot this month at exporting some of the (quite small) Tarka 150M models to ONNX (their makers just announced even tinier 10M ones in "preview"). I'm excited to see if 2026 brings new attention to this small/fast/local embedding paradigm, now that polars-fastembed is raring to go.
Happy new year! I'm looking for my next role in 2026, if you're working on hard data problems and can tolerate someone who will inevitably want to rewrite it in Rust, I'd like to hear from you. Email or DM.