Decimating Big Data
Unlike web scraped datasets, you're entirely within your rights to do as you wish with Wikidata.
I wasn't too happy when the BBC recently sent me a copyright strike through HuggingFace, after I'd proudly reprocessed and distributed a subset of HuggingFace's own 'web scale' FineWeb dataset, dedicated to the news articles that could be sifted out of it. The grounds for the DMCA notice were that it contained material from BBC websites. There was no right to reply, but I did wonder: how can a subset of a dataset be in violation of copyright when the full set is not? Answer unclear.
Wikidata, on the other hand, is firmly in the open data category, and so we should really be trying to do all we can with it.
I have of course queried Wikidata through SPARQL/RDF linked data web services, but there's something inherently inferior about such servitised, partial approaches.
Once you move past the JSON munging side, which I wrote about recently, you find that what appears on disk to be 1.5TB can be split by language to leave you with an English subset of just ~100 GB. The repo I developed this in is on GitHub, but I want to talk about the data (the knowledge it represents) more than techniques here.
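For concreteness, here is a minimal sketch of that language split, assuming the dump has already been converted to newline-delimited JSON (one entity per line); the file names are invented and the repo's actual implementation may well differ:

```python
import json

def english_only(entity: dict) -> dict:
    """Keep only the English parts of a Wikidata entity's multilingual fields."""
    slim = {"id": entity["id"], "type": entity["type"], "claims": entity.get("claims", {})}
    for key in ("labels", "descriptions", "aliases"):
        if "en" in entity.get(key, {}):
            slim[key] = {"en": entity[key]["en"]}
    if "enwiki" in entity.get("sitelinks", {}):
        slim["sitelinks"] = {"enwiki": entity["sitelinks"]["enwiki"]}
    return slim

# Stream the full dump through the filter without ever holding it in memory
with open("wikidata.ndjson") as src, open("wikidata-en.ndjson", "w") as dst:
    for line in src:
        dst.write(json.dumps(english_only(json.loads(line))) + "\n")
```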
This is worth pausing to appreciate: a shrinkage of over 90% represents a qualitative change.
A 'bulk' dataset in the terabytes exists to be kept on a dedicated disk and nibbled away at: never loaded into memory in its entirety, and effectively inaccessible without a server-class machine. While the 'data engineering' work is intellectually stimulating in its own right, far more so is the prospect of having all the facts in the known English-language Wikiverse simply... to hand.
Schema sniped
I spent a lot of time working out a way to repeatedly point a JSON schema inference tool at the Wikidata claims dataset (500+ chunks, in batch files of ~1818 rows each).
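The tool itself isn't the point here, so purely as an illustration, this is roughly how that repeated inference could look with genson (one JSON schema inference library among several); the chunk directory and output path are invented:

```python
import json
from pathlib import Path

from genson import SchemaBuilder  # pip install genson

# Fold every chunk's rows into one growing schema, one batch file at a time
builder = SchemaBuilder()
for chunk in sorted(Path("claims-chunks").glob("*.ndjson")):
    with chunk.open() as f:
        for line in f:
            builder.add_object(json.loads(line))

Path("claims.schema.json").write_text(json.dumps(builder.to_schema(), indent=2))
```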
Known schema hardening
Duck typing "work on the basis of an implied schema, else crash downstream" (unknown/unknown) vs. runtime validation "assume an explicit schema, crash if incorrect" (known/unknown) are still just the first 2 parts of a grid whose quadrants index expected data model/actual data type.
In the known/known case you have an explicit schema expectation, an actual one inferred from data, so you never have to actually hardcode your expectation [runtime validation], you only use it if expectation ≠reality to give a diff—superior to merely enumerating mismatches as it's already condensed.
- So 'duck typing' might simply try to access a field and crash downstream, hence the need to pre-validate
- Runtime validation might explicitly state a desired schema and crash if it fails to parse (like in Pydantic; see the sketch after this list)
- The approach I'm using here is to keep an expected schema to validate against, but also to infer the actual schema from the data, so that a mismatch can be reported as a schema diff (letting us distinguish 'allowed' diffs from 'bad' ones) rather than as a heap of individual field errors
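For contrast, here is roughly what the runtime-validation option from the second bullet looks like in Pydantic; the Record model and its single foo field are invented for illustration:

```python
from pydantic import BaseModel, ValidationError

class Record(BaseModel):
    foo: str  # the explicit, hardcoded expectation

Record(foo="a")  # conforms: fine

try:
    Record(foo={"bar": 1})  # wrong shape: crashes at the validation boundary
except ValidationError as err:
    print(err)  # one record's error, with no view of the overall schema drift
```

Pydantic's real error messages are worded differently from the illustrative dtype errors below, but the shape of the problem is the same: you learn that one record failed, not what the data's schema actually is.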
So for example let's imagine we have a schema with one string field "foo" and we get:
foo="a"-> OKfoo="b"-> OKfoo=1-> Error: field foo dtype [str] got dtypeintfoo={bar=1}-> Error: field foo dtype [str] got dtypedict[str,int]foo={baz=null}-> Error: field foo dtype [str] got dtypedict[str,null]
You now have 3 distinct errors to deal with, none of which actually tells you what the right schema typing for foo was (in this simple case we can see that foo is a union dtype of str, int, and a record type with a non-required int field bar and a non-required, nullable field baz).
Obviously it's better to just infer the schema from the data you actually get and show a diff, and progressively "harden" the schema (or: make it more comprehensively correct).
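As a sketch of what that hardening loop could look like (again using genson for the inference step; the file names and the flat type diff are illustrative, not the repo's actual code):

```python
import json

from genson import SchemaBuilder  # pip install genson

def infer(records: list[dict]) -> dict:
    """Infer a JSON schema from a batch of records."""
    builder = SchemaBuilder()
    for record in records:
        builder.add_object(record)
    return builder.to_schema()

def diff_types(expected: dict, actual: dict, path: str = "$") -> list[str]:
    """Recursively compare the 'type' and 'properties' of two JSON schemas."""
    out = []
    if expected.get("type") != actual.get("type"):
        out.append(f"{path}: expected {expected.get('type')}, got {actual.get('type')}")
    for name, sub in actual.get("properties", {}).items():
        if name not in expected.get("properties", {}):
            out.append(f"{path}.{name}: not in expected schema")
        else:
            out.extend(diff_types(expected["properties"][name], sub, f"{path}.{name}"))
    return out

with open("claims.schema.json") as f:   # the hardened schema so far
    expected = json.load(f)
with open("batch-0001.ndjson") as f:    # one incoming chunk
    batch = [json.loads(line) for line in f]

for mismatch in diff_types(expected, infer(batch)):
    print(mismatch)
# An 'allowed' widening (e.g. str -> [str, int]) gets folded back into
# claims.schema.json; a 'bad' one is a genuine data problem. That is the
# hardening step.
```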