+++
date = "2022-10-18"
draft = true
path = "/blog/speedy-ifd"
tags = ["haskell", "nix"]
title = "Speedy import-from-derivation in Nix?"
+++

Nix has a feature called "import from derivation", which is sometimes called "such a nice foot gun" (grahamc, 2022). I can't argue with its usefulness; it lets Nix do amazing things that can't be accomplished any other way, and avoid pointlessly checking build products into git. However, it has a dirty secret: it can atrociously slow down your builds.

The essence of this feature is that Nix can perform operations such as building derivations whose results are then used during the evaluation stage.

## Nix build staging?

Nix, in its current implementation (there are efforts such as tvix to change this), can do one of two things at a given time.

- Evaluate: run Nix expressions to produce some derivations to build. This stage outputs `.drv` files, which can then be realised (built). Nix evaluation happens in serial (single-threaded), and in a lazy fashion.
- Build: given some `.drv` files, fetch the results from a binary cache or build them from scratch.

## How does import-from-derivation fit in?

Import-from-derivation (IFD for short) lets you do magical things: since Nix derivations can do arbitrary computation in any language, Nix expressions or other data can be generated by external programs that need to do pesky things such as parsing cursed file formats like .cabal files.

N.B. I've heard that someone wrote a PureScript compiler targeting the Nix language, which was then targeted at parsing a Cabal file to do cabal2nix's job entirely within Nix. Nothing is sacred.

In order to achieve this, however, the evaluation stage can demand builds be run. In fact, such builds need to be run before proceeding with evaluation! So IFD serializes builds.
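
Here's a minimal sketch of what that looks like; the derivation and attribute names are made up for illustration:

```nix
# Evaluation cannot proceed past the `import` until `generated` has been
# built, because its *output* is read back into the evaluator.
let
  pkgs = import <nixpkgs> { };

  # A derivation whose output is itself a Nix expression.
  generated = pkgs.runCommand "generated-expr" { } ''
    echo '{ answer = 6 * 7; }' > $out
  '';
in
  # Importing the build result forces Nix to stop evaluating,
  # build `generated`, and only then continue.
  (import generated).answer
```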

## What constitutes IFD?

The following is a nonexhaustive list of things constituting IFD:

- `builtins.readFile someDerivation`
- `import someDerivation`
- Any use of builtin fetchers:
  - `builtins.fetchGit`
  - `builtins.fetchTree`
  - `builtins.fetchTarball`
  - `builtins.fetchurl`
  - etc.

## Builtin fetchers

Use of builtin fetchers is a surprisingly common IFD problem. Sometimes it is done by mistake, but other times it is done for good reason, with unfortunate tradeoffs. I think it's reasonable to use IFD to import libraries such as nixpkgs, since fundamentally the thing needs to be fetched for evaluation to proceed, but other cases are more dubious.

One reason one might use the builtin fetchers is that there is no way (excepting calculated use of impure builders) to use a derivation to download a URL without knowing the hash ahead of time.

An example I've seen of this being done on purpose is wanting to avoid requiring contributors to have Nix installed to update the hash of some tarball, since Nix has its own bespoke algorithm for hashing tarball contents that nobody has yet implemented outside Nix. So the maintainers used an impure network fetch (only feasible with a builtin) and instituted a curse on the build times of Nix users.

The reason that impure fetching needs to be a builtin is that Nix has an important purity rule for derivations: either the inputs are fixed and network access is disallowed, or the output is fixed and network access is allowed. In the Nix model as designed (content-addressed store aside), derivations are identified only by what goes into them, not by their output.

Let's see why that is. Assume that network access is allowed in normal builders. If the URL but no hash goes in and network access is available, anything can come out without changing the store path. Such an impurity would completely break the fantastic property that Nix has no such thing as a "clean build", since builds don't get dirtied to begin with.

Thus, if one is doing an impure network fetch, the resulting store path has to depend on the fetched content, which is not known ahead of time. The fetch therefore has to serialize all evaluation after it, since it affects the store paths of everything downstream of it during evaluation.
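
To make the contrast concrete, here is a hedged sketch of the two styles; the URL and hash are placeholders rather than real values:

```nix
let
  pkgs = import <nixpkgs> { };
in
{
  # Builtin fetcher: no hash given, so the download has to happen at
  # evaluation time, blocking everything evaluated after it.
  impure = builtins.fetchTarball "https://example.org/foo.tar.gz";

  # Fixed-output derivation: the hash is declared up front, so this is
  # an ordinary build step that can be scheduled and substituted from a
  # binary cache like any other derivation.
  fixed = pkgs.fetchurl {
    url = "https://example.org/foo.tar.gz";
    sha256 = pkgs.lib.fakeSha256; # placeholder; a real hash goes here
  };
}
```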

That said, it is, in my opinion, a significant design flaw in the Nix evaluator that it stops and builds each IFD derivation as it is encountered, rather than queueing up all the reachable derivations and building them together.

## Stories

I work at a Haskell shop which makes extensive use of Nix. We had a bug where Nix would go and serially build "all-cabal-hashes-component-*". For several minutes.

This was what I would call a "very frustrating and expensive developer UX bug". I fixed it in a couple of afternoons by refactoring the use of import-from-derivation to result in fewer switches between building and evaluating, which I will expand on in a bit.

### Background on nixpkgs Haskell

The way that the nixpkgs Haskell infrastructure works is that it provides a package set based on some Stackage long-term support (LTS) release, comprising package versions that are all known to work together. The set is generated by a program called hackage2nix, which runs cabal2nix against the entirety of Hackage.

cabal2nix is a program that generates a Nix expression containing the package metadata and input hashes, and hooks up the dependencies declared in the .cabal file to Nix build inputs.
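
For illustration, the generated expression looks roughly like this; the package name, version, hash, and dependencies are placeholders of mine:

```nix
# Approximate shape of a cabal2nix-generated expression (abridged).
{ mkDerivation, base, lib, text }:
mkDerivation {
  pname = "my-pkg";
  version = "0.1.0.0";
  sha256 = lib.fakeSha256; # placeholder; cabal2nix emits the real hash
  libraryHaskellDepends = [ base text ];
  license = lib.licenses.bsd3;
}
```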

This set can then be overridden by overlays which can apply patches, override sources, introduce new packages, and do basically any other arbitrary modification.

At build time, the builder will be provisioned with a GHC package database with everything in the build inputs of the package, and it will build and test the package.

In this way, each dependency is turned into a Nix derivation so caching of dependencies for development shells, parallelism across package builds, and other useful properties simply fall out for free.

### Where's the IFD?

nixpkgs Haskell provides a wonderfully useful function called callCabal2nix, which executes cabal2nix to generate the Nix expression for some Haskell source code at Nix evaluation time. Uh oh.

It also provides another wonderfully useful function called callHackage. This is a very sweet function: it will grab a package of the specified version off of Hackage, and call cabal2nix on it.
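
As a sketch, using both helpers looks something like this; the package names and versions here are placeholder choices of mine, not anything from the actual codebase:

```nix
{ pkgs ? import <nixpkgs> { } }:
let
  hp = pkgs.haskellPackages;
in
{
  # Runs cabal2nix over local source at evaluation time: IFD.
  my-app = hp.callCabal2nix "my-app" ./. { };

  # Looks the package up via all-cabal-hashes and runs cabal2nix on the
  # fetched tarball: also IFD.
  lens = hp.callHackage "lens" "5.2" { };
}
```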

Wait, how does that work, since you can't just download stuff for fun without knowing its hash? Well, there's your problem.

"Figuring out hashes of stuff on Hackage" was solved by someone publishing a comically large GitHub repo called all-cabal-hashes with hashes of all of the tarballs on Hackage and CI to keep it up to date. Using this repo, you only have to deal with keeping one hash up to date: the hash of the version of all-cabal-hashes you're using, and the rest are just fetched from there.

### Oh no

This repository has an obscene number of files in it, such that it takes dozens of seconds to unpack it. So it's simply not unpacked. Fetching a file out of it involves invoking tar to selectively extract the relevant file from the giant tarball of this repo.

That, in turn, takes around 7 seconds on the fastest MacBook available, for each and every package, in serial. Also, Nix checks the binary caches for each and every one, further compounding the fail.

I optimized it to take about 7 seconds, total. Although I am a witch, I think that there is some generally applicable intuition derived from this that can be used to make IFD go fast.

## Making IFD go fast

Nix is great at building big graphs of dependencies in parallel and caching them. So what if we ask Nix to do that?

How can this be achieved?

What if you only demand one big derivation be built with IFD, then reuse it across all the usage sites?

### Details of some-cabal2nix

My observation was that hot-cache builds with a bunch of IFD are fine; it's refilling it that's horribly painful since Nix spends a lot of time doing pointless things in serial. What if we warmed up the cache by asking it to build all that stuff in one shot? Then, the rest of the IFD would hit a hot cache.

The major innovation in the fix, which I called some-cabal-hashes, is that it builds one derivation with IFD that contains everything that will be needed for further evaluation, then all the following imports will hit that already-built derivation.

Specifically, my build dependency graph now looks like:

```
                /- cabal2nix-pkg1 -\
some-cabal2nix -+- cabal2nix-pkg2 -+- some-cabal-hashes -> all-cabal-hashes
                \- cabal2nix-pkg3 -/
```

There are two notable things about this graph:

  1. It is (approximately) the natural graph of the dependencies of the build assuming that the Nix evaluator could keep going when it encounters IFD.

  2. It allows Nix to naturally parallelize all the cabal2nix-* derivations.

Then, all of the usage sites are `import "${some-cabal2nix}/pkg1"` or similar. In this way, one derivation is built, letting Nix do what it's good at. I also did something clever: I made some-cabal2nix have no runtime dependencies by copying all the resulting cabal files into the output directory. Thus, the whole thing can be fetched from a cache server and not built at all.
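
Here is a rough sketch of that aggregation trick. `cabal2nixOutputs` is a hypothetical attribute set standing in for the per-package cabal2nix derivations from the graph above; the real code is more involved:

```nix
{ pkgs, cabal2nixOutputs }:
let
  # One derivation that *copies* every generated expression into its
  # output, so it has no runtime dependencies and can be substituted
  # from a binary cache in one go.
  some-cabal2nix = pkgs.runCommand "some-cabal2nix" { } ''
    mkdir -p $out
    ${pkgs.lib.concatStringsSep "\n" (pkgs.lib.mapAttrsToList
        (name: drv: "cp -r ${drv} $out/${name}")
        cabal2nixOutputs)}
  '';
in
{
  # Every usage site imports out of the single aggregate derivation,
  # so only one build/evaluation round-trip is paid for the whole set.
  pkg1 = import "${some-cabal2nix}/pkg1";
}
```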

Acquiring the data needed to know what will be demanded by any IFD is the other piece of the puzzle, of course. I extracted it from the overlays by calling them with stubs first (to avoid a cyclic dependency), then evaluating for real with a callHackage function backed by the some-cabal2nix built from that information.
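
A very rough sketch of that two-pass structure, where `myOverlay`, `collectRequests`, and `buildSomeCabal2nix` are hypothetical names standing in for the real plumbing:

```nix
# Hypothetical plumbing: `myOverlay` is parameterised on callHackage,
# `collectRequests` gathers the (name, version) pairs the stub saw, and
# `buildSomeCabal2nix` builds the aggregate derivation.
{ pkgs, myOverlay, collectRequests, buildSomeCabal2nix }:
let
  hp = pkgs.haskellPackages;

  # Pass 1: a stub callHackage that only records what is asked for,
  # avoiding the cyclic dependency.
  stubCallHackage = name: version: _args: { inherit name version; };
  requests = collectRequests (myOverlay { callHackage = stubCallHackage; });

  # One aggregate derivation covering everything the overlay asked for.
  some-cabal2nix = buildSomeCabal2nix requests;

  # Pass 2: a real callHackage that imports out of that one derivation.
  realCallHackage = name: version: args:
    hp.callPackage "${some-cabal2nix}/${name}-${version}" args;
in
  myOverlay { callHackage = realCallHackage; }
```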

The last and very important optimization I did was to fix the tar invocation. tar files have a linear structure that is perfect for making tape archives (hence the name of the tool) which can be streamed to a tape: one file after the other, without any index. Thus, finding a file in the tarball takes O(n) time, where n is the number of files in the archive.

If you call tar m times, once for each file you need, you do O(n*m) work. However, if you call tar once with the set of files you want, it can do a single pass through the archive with an O(1) membership check against that set, making the overall time complexity O(n). I assume that is what tar actually does, since the entire extraction takes basically the same time with a long file list as with a single file.

Enough making myself sound like I am in an ivory tower with big O notation; regardless, extracting with a file list yielded a major performance win.

I also found that if you use the --wildcards option, tar is extremely slow and it seems worse with more files to extract. Use exact file paths instead.
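
Putting the two tar lessons together, the extraction step looks roughly like this; the archive-internal paths and the `wanted` list are illustrative only, and the real archive layout may include a top-level prefix:

```nix
{ pkgs, all-cabal-hashes
, wanted ? [ "lens/5.2/lens.cabal" "aeson/2.1.2.1/aeson.cabal" ] }:
pkgs.runCommand "some-cabal-hashes" { } ''
  mkdir -p $out
  # One pass over the archive, exact paths, no --wildcards.
  tar -xzf ${all-cabal-hashes} -C $out \
    ${pkgs.lib.concatMapStringsSep " " pkgs.lib.escapeShellArg wanted}
''
```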

FIXME: get permission to release the sources of that file