Introduction to snapshot testing

Write tests once, let pytest detect and store the values

I wrote my first 'snapshot test' last week and became an immediate fan.

I came across this concept last month when Pydantic (a notoriously high-velocity team) announced they were sponsoring inline-snapshot, a Python library providing a peculiar function: snapshot().

Here I'll demystify the jargon and walk through basic usage.

A 'snapshot' is just a recorded value

The term "snapshot testing" was popularised by Jest, a JavaScript UI testing framework.

The name ‘snapshot’ for a captured value in that context conveys both that:

  • it's a recorded value
  • it's something you'd otherwise check visually (since manual UI testing is done by sight)

To be clear though, this 'snapshot' was text, not an image: the 'rendered' HTML produced by a React app.

The visual metaphor doesn't carry over to backend unit tests (in a backend system the components we're working with are typically unseen), but the essential idea remains: automatically recording an intermediate value for program components. You might also call it 'checkpointing'.

The main benefit is developer experience: at best, writing tests this way becomes a one-off task, eliminating manual rework.

To begin with, you don't need to fiddle around obtaining expected values to insert into your tests.

snapshot() is an unusual function: ordinarily it just gives back the value it was given. However, when you run it in a pytest session with the --inline-snapshot=create flag, it springs into action. Your source code is edited at each 'empty' snapshot() call site, inserting the value it was compared against as the function argument, turning the likes of assert 3 == snapshot() into assert 3 == snapshot(3). Since this filled-in snapshot() call is a no-op, it simply evaluates as assert 3 == 3.

In this way the intermediate value becomes 'snapshotted': captured from the program data flow and recorded in perpetuity in the source code.
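To make this concrete, here's a minimal before/after sketch (the test itself is my own trivial example). Before:

from inline_snapshot import snapshot

def test_addition():
    assert 1 + 2 == snapshot()

After running pytest --inline-snapshot=create, the source file itself has been rewritten:

from inline_snapshot import snapshot

def test_addition():
    assert 1 + 2 == snapshot(3)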

Inline means in the same file as the source code

Note that other frameworks like Jest use "snapshot files", external to the test code. As the name suggests, inline-snapshot keeps the recorded values inline, in the source code (the values themselves rather than a reference to an external source).

This doesn't mean you need to colocate the definitions of these values at their point of usage, potentially cluttering your code: snapshots can be used anywhere you'd use a literal value, which means they work with 'code tidying' helpers like @mark.parametrize, as in the sketch below.
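For instance, a hypothetical example of my own (with the snapshot values already filled in by a create run):

import pytest
from inline_snapshot import snapshot

@pytest.mark.parametrize(
    ("n", "expected"),
    [
        (2, snapshot(4)),
        (3, snapshot(9)),
    ],
)
def test_square(n, expected):
    assert n * n == expected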

The second option for the --inline-snapshot flag is fix, which modifies an inline snapshot that was already created.
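For example, reusing the earlier test as a sketch: if the code under test legitimately changes, the previously recorded value goes stale and the test fails.

# The code changed (1 + 2 became 1 + 3), so the recorded value is now stale
assert 1 + 3 == snapshot(3)

Running pytest --inline-snapshot=fix updates the recorded value in place:

assert 1 + 3 == snapshot(4)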

When things go wrong, the teardown message logged in the pytest output gives you a hint about the syntax to fix broken tests, meaning you don't have to revisit the docs: again, smooth.

When you can use snapshot testing

You can use snapshot testing when the source of your value is known and static.

For example, let's say my supermarket has an API which tells me which fruits are in stock, returning something like:
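{
  "fruit_count": 2,
  "fruit_names": ["Apple", "Banana"]
}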

I can snapshot these values so I can make sure I always get enough fruit in my weekly shop:

response["fruit_count"] = snapshot()
response["fruit_names"] = snapshot()

When you run pytest --inline-snapshot=create, the values get filled in:

assert response["fruit_count"] == snapshot(2)
assert response["fruit_names"] == snapshot(["Apple", "Banana"])

When you can’t use snapshot testing

Dynamic schema

You can’t snapshot something when the source of your value is not static.

For instance let’s say the supermarket has another API to tell me the aisle to find the fruits in:

{
  "aisle_1": {
    "bottom_shelf": "apple",
    "top_shelf": "banana"
  },
  "aisle_2": {
    "top_shelf": "cherry"
  }
}

I snapshot the banana’s location so I can ensure an ongoing supply of potassium:

banana_aisle = response["aisle_1"]
banana_location = banana_aisle["top_shelf"]
assert banana_location == snapshot("banana")

One day, the shopkeeper swaps the bananas and the cherries, breaking my test:

{
  "aisle_1": {
    "bottom_shelf": "apple",
    "top_shelf": "cherry"
  },
  "aisle_2": {
    "top_shelf": "banana"
  }
}

I need to change my test to:

banana_aisle = response["aisle_2"]
banana_location = banana_aisle["top_shelf"]
assert banana_location == snapshot("banana")

You can't snapshot the values here because you can't say in advance where banana_location will be.

In other words, a snapshot can lock in the value (data) at a location, but only given a fixed location (metadata).

If both the data (the value: "banana") and the metadata (the location of the value: aisle_2 -> top_shelf) can change, then we can't "pin down" the value with a snapshot.

Consider other scenarios where both the value and its location could change:

Arbitrarily formatted unstructured or semi-structured data

To extract values reliably from a source with varying formats (for instance, if your supermarket switches to free text for product locations), you might have to create your own structured data to then snapshot, say by parsing the text with a regex, or by normalising it with a model that emits a fixed schema (both discussed below).

These are all ways around not having static intermediate data: we limit the comparison to the part of the data (the value) that we're more confident will stay the same.

Some problems are not amenable to such approaches; a couple that come to mind are natural language (from an arbitrary speaker) and HTML (under arbitrary web design rearrangement).

Consider how we would need to process text with item locations if the locations were either implied, or unreliably expressed in many variations of the English language. A regex would have a hard time!
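For instance, a first attempt might look like the following (a hypothetical sketch; the pattern and sample phrases are my own), which already fails as soon as the location is merely implied:

import re

# Matches "bananas: aisle 2, top shelf" style phrasing only
pattern = re.compile(r"bananas?:\s*aisle\s*(\d+),\s*(\w+)\s+shelf", re.IGNORECASE)

assert pattern.search("Bananas: aisle 2, top shelf")  # matches
assert not pattern.search("You'll find bananas up top, two aisles in")  # implied location: no match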

Also consider how you might target a particular value within a web page (perhaps web scraping a public transit dashboard posting travel updates in non-standard formats during the COVID pandemic), when a web designer rearranges the page's HTML.

In both cases an LLM following a structured generation data model (i.e. obeying a contract of what information to extract from the unstructured inputs) would standardise the data into a form suitable to snapshot. That is, it would render the data static, removing the variation which we cannot pin down in a test.
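As a sketch of what that extraction contract might look like (the model and the extract_location helper are my own invention; any structured-output LLM client could populate the model):

from pydantic import BaseModel

class ProductLocation(BaseModel):
    """The fixed shape we require, however the input text is phrased."""
    product: str
    aisle: int
    shelf: str

# However much the source text varies, the extracted value is now stable enough to snapshot:
# assert extract_location(free_text) == snapshot(
#     ProductLocation(product="banana", aisle=2, shelf="top_shelf")
# )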

There is a speculative basis here for what might constitute a 'self-healing snapshot' workflow, where the location metadata is repaired by an LLM capable of interpreting the variation in the input.

In other words its role is to repair the location (metadata), rather than the value at that location (data).

Defensive metadata testing

Consider again the simpler form of the problem above:

banana_aisle = response["aisle_1"]
banana_location = banana_aisle["top_shelf"]
assert banana_location == snapshot("banana")

It would be awkward, but not out of the question, to reformulate this code to 'defensively' question its assumptions and guard against the change of location. Whether this is desirable is another question.

This might be more realistic if the data we're capturing is something like:

{"product_type": "banana", "brand": "Del Monte"}

So in this case we might be snapshotting the brand, and it's reasonable that we know we're after bananas in the first place.

We might then make our test code more 'defensive' by checking the assumption that the bananas are in the aisle, super crudely by stringifying the aisle values to find which one has the bananas in it:

# Filter the API response to get the aisle(s) with bananas in
banana_aisles = {aisle: stock for aisle, stock in response.items() if "banana" in str(stock)}
# Ensure at least one aisle has bananas in it
assert banana_aisles, "There are no aisles with bananas"

# Take the first aisle we can find bananas in as the banana aisle
banana_aisle_no = next(iter(banana_aisles))
banana_aisle = response[banana_aisle_no]

# Do the same for the shelf
banana_shelves = {shelf: product for shelf, product in banana_aisle.items() if "banana" in str(product)}
assert banana_shelves, "The selected aisle doesn't contain shelves with bananas"
banana_shelf = next(iter(banana_shelves))

banana_location = banana_aisle[banana_shelf]
assert banana_location == snapshot({"product_type": "banana", "brand": "Del Monte"})

Now our snapshot is robust to changes in aisle and shelf, but not of banana brand.

One obvious issue here is that if the shop stocks a second brand of banana in another aisle, our test snapshot could break, because the next() would take the first one it finds.

We could sort the filtered banana_aisles result to make the choice among known elements deterministic, but this doesn't help if the stock itself may still change.
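For example, as a sketch of that idea:

# Deterministically pick the alphabetically-first aisle containing bananas
banana_aisle_no = sorted(banana_aisles)[0]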

One thing we might do is a kind of 'defensive snapshotting', which would operate to "shift left" the test failure by highlighting a change in the source of the value as early as possible.

I see this much like how Pydantic models can draw attention to changes in data schema at the earliest point possible, validating early on rather than at the point of use (which then typically leads to a hunt for the root cause).

In this case we made a few assumptions that might be covered by snapshot tests:

  • only one aisle has bananas in it (so next() picks the right one)
  • only one shelf in that aisle has bananas on it
  • ...and so on.

For instance, we might make the assumption that the number of aisles and shelves we find bananas in stays the same:

assert len(banana_aisles) == snapshot()
assert len(banana_shelves) == snapshot()

...but depending on the context this might actually be an overly restrictive thing to pin down.

Snapshot testing IRL

For a less toy example, I'm considering how the aforementioned techniques might be used to cover the TfL tube network datasets accessible in my Python interface tubeulator.

One thing it can do is list platforms by tube line, and vice versa. The Overground line was recently split apart and renamed for Autumn 2024, which is exactly the sort of metadata/schema change whose implications you'd want to be fully aware of when building on top of it.

Likewise, I have been able to convert a significant amount of code relying on extracting data from semi-structured online documents to use snapshot testing, but could immediately see that these tests were vulnerable to the same values being moved around. I'm still mulling over how to avoid playing cat and mouse with that aspect, ideally without introducing LLM randomisation into otherwise deterministic programs.

There seems to be a lot on the table here and I'm intrigued to see if more tricks become apparent.