Executing codemods

How to author codemods safely

So what exactly is a codemod? In practical terms, how do we author one, and what does it mean to execute it?

So far we have developed a concept of a "template" (a shared portion of a file with parameterised parts we call "holes", each taking two or more possible values). An important part of the concept is that there are distinct "cohorts" (sets of files that have the same value in a particular hole).

For example earlier we saw this template for dependabot configs:

    updates:
      - cooldown?: ⟨?0⟩
        directory: "/"
        package-ecosystem: ⟨?1⟩
        schedule:
          interval: ⟨?2⟩
    version: 2

which has three holes (numbered 0, 1, 2). We can split the files into cohorts by the value in any one hole, for example two cohorts by interval (6× weekly, 3× monthly).
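As a minimal sketch of what "splitting into cohorts" means, assuming we already have the value each file carries in each hole: the file names and bindings below are invented for illustration, and `cohorts_by_hole` is a hypothetical helper, not part of any existing tool.

```python
from collections import defaultdict

# Hypothetical hole bindings per file: hole index -> value found in that file.
# File names and values are invented for illustration.
bindings = {
    "repo-a/dependabot.yml": {1: "pip", 2: "weekly"},
    "repo-b/dependabot.yml": {1: "cargo", 2: "weekly"},
    "repo-c/dependabot.yml": {1: "pip", 2: "monthly"},
}

def cohorts_by_hole(bindings, hole):
    """Group file names by the value they carry in one hole."""
    groups = defaultdict(list)
    for path, holes in sorted(bindings.items()):
        groups[holes.get(hole)].append(path)
    return dict(groups)

# Hole 2 is the schedule interval in the template above.
interval_cohorts = cohorts_by_hole(bindings, hole=2)
for value, files in interval_cohorts.items():
    print(f"{len(files)}x {value}: {files}")
```

A cohort is then just a key in that mapping, which is what a bulk edit would be aimed at.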

At a glance our type system requirements look straightforward: swap a string for a string.

However even in such a simple example the reality is that there is a stronger schema, which we can get from SchemaStore (to take just one property):

{
  "definitions": {
    "schedule": {
      "properties": {
        "interval": {
          "$ref": "#/definitions/schedule-interval"
        }
      }
    }
  }
}

which points to an enum of its possible values:

"schedule-interval": {
  "type": "string",
  "enum": [
    "daily",
    "weekly",
    "monthly",
    "quarterly",
    "semiannually",
    "yearly",
    "cron"
  ]
},

If we validate a substituted config file against this schema, we can catch an invalid substitution before it is applied (a mistake that an operator making bulk edits might otherwise miss).
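A minimal sketch of that check, using only the two schema fragments quoted above and a hand-rolled enum test. A real pipeline would more likely hand the whole file to a full JSON Schema validator; `resolve_ref` and `check_interval` are hypothetical helper names.

```python
# Trimmed copy of the schema fragments quoted above (enum values as listed).
SCHEMA = {
    "definitions": {
        "schedule": {
            "properties": {
                "interval": {"$ref": "#/definitions/schedule-interval"}
            }
        },
        "schedule-interval": {
            "type": "string",
            "enum": ["daily", "weekly", "monthly", "quarterly",
                     "semiannually", "yearly", "cron"],
        },
    }
}

def resolve_ref(schema, ref):
    """Follow a local '#/a/b' reference inside the same schema document."""
    node = schema
    for part in ref.lstrip("#/").split("/"):
        node = node[part]
    return node

def check_interval(value):
    """Return None if the value is allowed, else a human-readable error."""
    prop = SCHEMA["definitions"]["schedule"]["properties"]["interval"]
    rule = resolve_ref(SCHEMA, prop["$ref"])
    if not isinstance(value, str):
        return f"expected a string, got {type(value).__name__}"
    if value not in rule["enum"]:
        return f"{value!r} is not one of {rule['enum']}"
    return None

print(check_interval("monthly"))
print(check_interval("fortnightly"))
```

The point is only that a proposed hole value can be rejected before any file is touched.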

The same goes for other config files:

- pyproject.toml has a SchemaStore schema here
- Cargo.toml does not have a SchemaStore entry (discussion thread here)
- GitHub Actions workflows have a SchemaStore entry here

For GitHub Actions we can go further (which is particularly important, as this is the one config that is validated remotely): the with block of a step can be validated against the inputs declared in the action.yml at the referenced repo and revision.

For example, take the beginning of the PyO3/maturin-action action.yml:

name: 'maturin-action'
description: 'GitHub Action to install and run a custom maturin command'
author: messense

inputs:
  token:
    description: Used to pull maturin distributions using GitHub API. Since there's a default, this is typically not supplied by the user.
    required: false
    default: ${{ github.token }}
  command:
    description: maturin command to run. Defaults to "build".
    required: false
    default: 'build'
  args:
    description: Arguments for the maturin command
    required: false
  maturin-version:
    description: Version of maturin to install like "v0.12.0".
    required: false
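A minimal sketch of checking a step's with block against those declared inputs. The inputs are transcribed by hand from the excerpt above, which is only the beginning of the file, so a real check would need the full action.yml parsed with a YAML library. `check_with_block` is a hypothetical helper.

```python
# Inputs transcribed from the action.yml excerpt above (descriptions dropped).
# The excerpt is partial, so this table is incomplete by construction.
ACTION_INPUTS = {
    "token": {"required": False, "default": "${{ github.token }}"},
    "command": {"required": False, "default": "build"},
    "args": {"required": False},
    "maturin-version": {"required": False},
}

def check_with_block(inputs, with_block):
    """Return a list of problems found in a step's `with` mapping."""
    problems = []
    # Keys the action does not declare are the most likely bulk-edit mistake.
    for key in with_block:
        if key not in inputs:
            problems.append(f"unknown input: {key}")
    # Required inputs without a default must be supplied by the step.
    for name, spec in inputs.items():
        if spec.get("required") and "default" not in spec and name not in with_block:
            problems.append(f"missing required input: {name}")
    return problems

print(check_with_block(ACTION_INPUTS, {"command": "upload", "args": "wheels-*/*"}))
print(check_with_block(ACTION_INPUTS, {"packages-dir": "dist/"}))
```

The second call shows the failure mode that matters here: a key that belongs to a different action being left under the old uses line.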

What a good codemod should look like

I have never gotten into tools for codemods, because they require first learning their AST DSL pattern syntax, variable capture rules and transform grammar before I can have a sufficiently clear mental model to debug the rules. This is a barrier in particular because syntax should be the last thing on your mind when you're doing a task like a codemod, whose main difficulty is semantic: is this even the right change to make?

I want something ergonomic, above all else, or it is no good.

An element of that is not trying to do too much.

I am talking about replacing a block like (in the simplest case) this:

      - name: Publish to PyPI
        if: ${{ startsWith(github.ref, 'refs/tags/') }}
        uses: PyO3/maturin-action@v1
        with:
          command: upload
          args: --non-interactive --skip-existing wheels-*/*

with one like this

      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          packages-dir: dist/

The problem here is that half of the trouble of writing your command lies in establishing what the change should be (something a well-designed tool will guide you in thinking through).

If we try to approach this from the tool's perspective, we should therefore try to afford the user the ability to break their problem down into smaller pieces (rather than a "big bang" which they can't feel confident in).

Once you do that it is just a matter of individual steps:

- change the uses: line from PyO3/maturin-action@v1 to pypa/gh-action-pypi-publish@release/v1
- remove the args field
- set packages-dir as wheels-*/* (??)

...and on that last point you see the kind of semantic trouble the user will face.

Straight away I am unsure whether this is the right thing to do. That is all the worse because, lacking a predictive mental model, I would only find out at runtime, where it is make or break for package publishing. Publishing has a lot of other ceremony that can make such mistakes a hassle: git tags have to be deleted, and those are now becoming immutable, so getting this wrong could cause a ton of issues. We want to derisk this as much as possible.

My point is that when you add the difficulty of wielding some abstruse DSL pattern language on top of a problem with semantic concerns, and then try to do that at fleet scale, you are in for a bad time.

At a structural level what I'm talking about doing here is fairly simple, but it is not a simple matter of string replacement: there is a non-obvious schema to it (some of which we can validate), and getting some parts correct will rely on a correct mental model of the underlying tools the config is being written for.

If I were to try to do this on the command line using some imaginary tool, I'd probably come up with something as usable as CSS selectors (another common and widely used way to address a tree).

A hypothetical tool could look something like:

schematic edit workflow.yml \
  --select ".jobs[*].steps[*]" \
  --where "uses == 'PyO3/maturin-action@v1'" \
  --set uses="pypa/gh-action-pypi-publish@release/v1" \
  --unset with.args \
  --set with.packages-dir="dist/" \
  --dry-run

The --select could also be written as --select ".jobs.*.steps.*" rather than with the more jq-style square-bracketed stars.

It also dawned on me we would express the uses match best as

--where "uses ^= 'PyO3/maturin-action@'"

like the way CSS has ^= (prefix), $= (suffix), *= (any substring) and ~= (whitespace-separated word).
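A minimal sketch of how such --where terms could be evaluated against a step, assuming a `field OP 'value'` grammar with single-quoted values and dotted field paths. The grammar, `parse_where`, `lookup` and `matches` are all hypothetical; a list of terms is treated as a conjunction.

```python
import re

# Operator table: equality plus the CSS-like comparison forms.
OPS = {
    "==": lambda have, want: have == want,
    "^=": lambda have, want: have.startswith(want),
    "$=": lambda have, want: have.endswith(want),
    "*=": lambda have, want: want in have,
    "~=": lambda have, want: want in have.split(),
}
WHERE = re.compile(r"^\s*([\w.\-]+)\s*(==|\^=|\$=|\*=|~=)\s*'([^']*)'\s*$")

def lookup(node, dotted):
    """Walk a dotted path such as 'with.args'; None if any part is missing."""
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def parse_where(expr):
    """Turn "field OP 'value'" into a predicate over a step mapping."""
    m = WHERE.match(expr)
    if m is None:
        raise ValueError(f"cannot parse --where term: {expr!r}")
    field, op, want = m.groups()
    def predicate(step):
        have = lookup(step, field)
        # Only string fields can match; a missing field never matches.
        return isinstance(have, str) and OPS[op](have, want)
    return predicate

def matches(step, where_terms):
    """Every term must hold (AND)."""
    return all(parse_where(term)(step) for term in where_terms)

step = {
    "name": "Publish to PyPI",
    "uses": "PyO3/maturin-action@v1",
    "with": {"command": "upload", "args": "--non-interactive --skip-existing wheels-*/*"},
}
print(matches(step, ["uses ^= 'PyO3/maturin-action@'", "name == 'Publish to PyPI'"]))
```

Keeping the grammar this small is deliberate: there is nothing to learn beyond a field, an operator and a literal.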

We would 'stack' multiple --where terms to require that all of them hold (AND):

--where "uses == 'PyO3/maturin-action@v1'" \
        "name == 'Publish to PyPI'"

We can leave the template building as an analysis tool for now; perhaps it isn't even right for a rewriting tool.

I think the rewriting should most of all be simple, boring, and explicit: set a field and unset another at the given --where.
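A minimal sketch of that "set a field, unset another" core, assuming the workflow has already been parsed into nested dicts and lists. `set_path`, `unset_path` and `edit_steps` are hypothetical names, and the where argument is any predicate over a step.

```python
import copy

def set_path(node, dotted, value):
    """Set a dotted path, creating intermediate mappings as needed."""
    *parents, leaf = dotted.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value

def unset_path(node, dotted):
    """Remove a dotted path if present; do nothing otherwise."""
    *parents, leaf = dotted.split(".")
    for part in parents:
        node = node.get(part)
        if not isinstance(node, dict):
            return
    node.pop(leaf, None)

def edit_steps(workflow, where, sets=(), unsets=()):
    """Apply set/unset to every step of every job that satisfies `where`.

    Returns an edited deep copy plus the number of steps touched,
    leaving the input untouched.
    """
    edited = copy.deepcopy(workflow)
    touched = 0
    for job in edited.get("jobs", {}).values():
        for step in job.get("steps", []):
            if where(step):
                for path, value in sets:
                    set_path(step, path, value)
                for path in unsets:
                    unset_path(step, path)
                touched += 1
    return edited, touched

workflow = {"jobs": {"release": {"steps": [
    {"name": "Checkout", "uses": "actions/checkout@v4"},
    {"name": "Publish to PyPI", "uses": "PyO3/maturin-action@v1",
     "with": {"command": "upload",
              "args": "--non-interactive --skip-existing wheels-*/*"}},
]}}}

edited, touched = edit_steps(
    workflow,
    where=lambda s: s.get("uses") == "PyO3/maturin-action@v1",
    sets=[("uses", "pypa/gh-action-pypi-publish@release/v1"),
          ("with.packages-dir", "dist/")],
    unsets=["with.args"],
)
print(touched, edited["jobs"]["release"]["steps"][1])
```

Note that with.command survives in the edited step, as it would under the command line shown earlier, because nothing unsets it. That is exactly the kind of leftover a validation or explain pass should surface.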

We can then add more such operators like rename:

--rename with.packages-dir=with.package_dir

move:

--move 'with.args -> with.packages-dir'

replace:

--replace '
- name: Publish to PyPI
  uses: pypa/gh-action-pypi-publish@release/v1
  with:
    packages-dir: dist/
'

and maybe we want to do a templated replace

--replace-template '
- name: Publish to PyPI
  uses: pypa/gh-action-pypi-publish@release/v1
  with:
    packages-dir: {{ old.with.args }}
'

where old somehow binds to the block being replaced, giving controlled reuse of extracted values (which sits more neatly with the templating idea).
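A minimal sketch of that binding, assuming placeholders of the form `{{ old.<dotted path> }}` are filled from the step being replaced and that a missing path is an error rather than an empty string. `render` and `lookup` are hypothetical helpers.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*old\.([\w.\-]+)\s*\}\}")

def lookup(node, dotted):
    """Walk a dotted path such as 'with.args'; None if any part is missing."""
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def render(template, old_step):
    """Fill each {{ old.<path> }} placeholder from the step being replaced."""
    def fill(match):
        value = lookup(old_step, match.group(1))
        if value is None:
            # Failing loudly beats silently writing an empty field.
            raise KeyError(f"old step has no value at {match.group(1)!r}")
        return str(value)
    return PLACEHOLDER.sub(fill, template)

old_step = {
    "uses": "PyO3/maturin-action@v1",
    "with": {"command": "upload",
             "args": "--non-interactive --skip-existing wheels-*/*"},
}
template = """- name: Publish to PyPI
  uses: pypa/gh-action-pypi-publish@release/v1
  with:
    packages-dir: {{ old.with.args }}
"""
rendered = render(template, old_step)
print(rendered)
```

Run on the example step, this copies the whole args string, flags included, into packages-dir. So a verbatim binding is not enough on its own here, and some way of extracting only part of the old value would be needed.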

We already have the template/hole machinery but I think that is something to add after getting basics nailed down.

On which note, dry run must be a first-class feature too, to build trust (either --dry-run or --diff, depending on whether we are showing a unified diff or just an indication of whether it would succeed).
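A minimal sketch of the diff flavour, assuming the before and after documents are already in memory. JSON is used for serialisation only because the standard library has no YAML writer; a real tool would diff the YAML text. `dry_run_diff` is a hypothetical helper.

```python
import difflib
import json

def dry_run_diff(before, after, name="workflow.yml"):
    """Unified diff between two in-memory documents; nothing is written to disk."""
    a = json.dumps(before, indent=2).splitlines()
    b = json.dumps(after, indent=2).splitlines()
    return "\n".join(
        difflib.unified_diff(a, b, f"a/{name}", f"b/{name}", lineterm="")
    )

before = {"name": "Publish to PyPI", "uses": "PyO3/maturin-action@v1"}
after = {"name": "Publish to PyPI", "uses": "pypa/gh-action-pypi-publish@release/v1"}

diff = dry_run_diff(before, after)
print(diff if diff else "no changes")
```

An empty diff doubles as the "would anything change?" indication, so the two flags could share one code path.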

We could even add an --explain flag since the core risk here is semantic uncertainty. I'd suppose it would warn of anything you might not expect, like if the replacements were not uniform in some way, along the lines of:

Matched 12 workflow steps
- 12 occurrences of PyO3/maturin-action@v1
- All have 'args' field present
- None have 'packages-dir'

For instance, some of the templates may have had optional fields. Variation is to be expected, but if it falls within the range of lines we modify then, while it ought not to halt the operation, we might want to tell the user. An --explain flag is a clear opt-in for interactive users who want more certainty.
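A minimal sketch of how such a report could be computed, assuming we have the list of matched steps and the dotted paths the edit will touch. The wording of the lines and the `explain` function are hypothetical, and the sample steps are invented.

```python
from collections import Counter

def lookup(node, dotted):
    """Walk a dotted path such as 'with.args'; None if any part is missing."""
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def explain(steps, paths):
    """Summarise matched steps and how uniformly each edited path is present."""
    total = len(steps)
    lines = [f"Matched {total} workflow steps"]
    for uses, count in Counter(s.get("uses") for s in steps).most_common():
        lines.append(f"- {count} occurrences of {uses}")
    for path in paths:
        present = sum(lookup(s, path) is not None for s in steps)
        if present == total:
            lines.append(f"- All have '{path}'")
        elif present == 0:
            lines.append(f"- None have '{path}'")
        else:
            # Non-uniform presence is reported, not treated as fatal.
            lines.append(f"- WARNING: only {present} of {total} have '{path}'")
    return lines

# Invented sample: three matched steps, one of which lacks with.args.
steps = [
    {"uses": "PyO3/maturin-action@v1", "with": {"command": "upload", "args": "wheels-*/*"}},
    {"uses": "PyO3/maturin-action@v1", "with": {"command": "upload", "args": "dist/*"}},
    {"uses": "PyO3/maturin-action@v1", "with": {"command": "upload"}},
]
report = explain(steps, ["with.args", "with.packages-dir"])
print("\n".join(report))
```

The mixed case is the one worth the flag: the edit still applies, but the operator sees that the cohort is not as uniform as assumed.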

You might then (after getting the CLI right) provide a declarative interface to the same, along the lines of:

migration:
  name: pypi-modernisation
  match:
    uses: PyO3/maturin-action@v1
  rewrite:
    uses: pypa/gh-action-pypi-publish@release/v1
    with:
      packages-dir: dist/
  warnings:
    - if: "with.args exists"
      message: "args discarded; verify wheel output location"

This is the more traditional way to do a codemod, but to me that should be secondary, because the moment you force everything to go through text patches I think you lose some of the will to make an expressive interface.
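For comparison, a minimal sketch of how the declarative form above might be interpreted, with the migration written as a Python dict rather than parsed from YAML. Two semantics are assumed here and are not settled by the text: rewrite replaces top-level keys wholesale (so the old with block is dropped), and the only supported warning condition is `<path> exists`. `apply_migration` is a hypothetical name.

```python
def lookup(node, dotted):
    """Walk a dotted path such as 'with.args'; None if any part is missing."""
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

# The migration from the YAML above, transcribed as a dict.
MIGRATION = {
    "name": "pypi-modernisation",
    "match": {"uses": "PyO3/maturin-action@v1"},
    "rewrite": {
        "uses": "pypa/gh-action-pypi-publish@release/v1",
        "with": {"packages-dir": "dist/"},
    },
    "warnings": [
        {"if": "with.args exists",
         "message": "args discarded; verify wheel output location"},
    ],
}

def apply_migration(migration, step):
    """Return (new_step, warnings); new_step is None if the step does not match."""
    if any(step.get(key) != want for key, want in migration["match"].items()):
        return None, []
    warnings = []
    for rule in migration.get("warnings", []):
        path, _, keyword = rule["if"].rpartition(" ")
        # Warnings are evaluated against the step as it was before rewriting.
        if keyword == "exists" and lookup(step, path) is not None:
            warnings.append(rule["message"])
    # Assumed semantics: top-level keys in `rewrite` replace the old ones.
    new_step = {**step, **migration["rewrite"]}
    return new_step, warnings

old = {"name": "Publish to PyPI", "uses": "PyO3/maturin-action@v1",
       "with": {"command": "upload", "args": "--skip-existing wheels-*/*"}}
new_step, warnings = apply_migration(MIGRATION, old)
print(new_step)
print(warnings)
```

Even this small interpreter has to pick merge semantics that the CLI flags state explicitly, which is part of why the flags seem the better place to start.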