+++
date = "2022-10-18"
draft = true
path = "/blog/speedy-ifd"
tags = ["haskell", "nix"]
title = "Speedy import-from-derivation in Nix?"
+++
Nix has a feature called "import from derivation", which is sometimes called
"such a nice foot gun" (grahamc, 2022). I can't argue with its
usefulness; it lets Nix do amazing things that can't be accomplished any other
way, and avoid pointlessly checking build products into git. However, it has a
dirty secret: it can *atrociously* slow down your builds.

The essence of the feature is that Nix can perform operations, such as building
derivations, whose results are then used during the *evaluation stage*.
## Nix build staging?
Nix, in its current implementation (there are efforts [such as tvix][tvix] to
change this), can do one of two things at a given time.
* Evaluate: run Nix expressions to create some derivations to build. This stage
outputs `.drv` files, which can then be realised (built). Nix evaluation happens
in serial (single threaded), and in a lazy fashion.
* Build: given some `.drv` files, fetch the result from a binary cache or build
from scratch.

[tvix]: https://code.tvl.fyi/about/tvix
### How does import-from-derivation fit in?
Import-from-derivation (IFD for short) lets you do magical things: since Nix
derivations can do arbitrary computation in any language, Nix expressions or
other data can be generated by external programs, which can handle pesky jobs
like parsing cursed file formats such as cabal files.

N.B. I've heard that someone wrote a PureScript compiler targeting the Nix
language, which was then used for [parsing a Cabal file to do cabal2nix's job][evil-cabal2nix]
entirely within Nix. Nothing is sacred.

[evil-cabal2nix]: https://github.com/cdepillabout/cabal2nixWithoutIFD

In order to achieve this, however, the evaluation stage can demand builds be
run. In fact, such builds need to be run before proceeding with evaluation! So
IFD serializes builds.
### What constitutes IFD?
The following is a nonexhaustive list of things constituting IFD (a minimal
sketch follows the list):
* `builtins.readFile someDerivation`
* `import someDerivation`
* *Any use* of builtin fetchers:
  * `builtins.fetchGit`
  * `builtins.fetchTree`
  * `builtins.fetchTarball`
  * `builtins.fetchurl`
  * etc.
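
For instance, here is a minimal, contrived sketch of the `import someDerivation`
case (not from any real codebase): evaluation cannot get past the `import`
until the derivation has been built.

```nix
# Contrived IFD example: `generated` must be *built* before evaluation can
# continue, because `import` needs the file that the build produces.
let
  pkgs = import <nixpkgs> { };
  generated = pkgs.runCommand "generated.nix" { } ''
    echo '{ answer = 42; }' > $out
  '';
in
(import generated).answer  # evaluation blocks here until the build finishes
```
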
#### Builtin fetchers
Use of builtin fetchers is a surprisingly common IFD problem. Sometimes it is
done by mistake, but other times it is done for good reason, with unfortunate
tradeoffs. I think it's reasonable to use IFD to import libraries such as
nixpkgs, since fundamentally the thing needs to be fetched for evaluation to
proceed, but other cases are more dubious.
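
As a concrete (illustrative) example, pinning nixpkgs itself with a builtin
fetcher looks something like this; with or without a hash, the download has to
happen during evaluation:

```nix
# Importing nixpkgs via a builtin fetcher. Evaluation stops and waits for the
# download; leaving out the sha256 additionally makes the fetch impure.
import (builtins.fetchTarball {
  url = "https://github.com/NixOS/nixpkgs/archive/nixos-unstable.tar.gz";
}) { }
```
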
One reason to use the builtin fetchers is that there is no way (excepting
calculated use of impure builders) for a derivation to download a URL without
knowing the hash ahead of time.

An example I've seen of this being done on purpose is wanting to avoid
requiring contributors to have Nix installed to update the hash of some
tarball, since Nix has its own bespoke algorithm for hashing tarball contents
that nobody has yet implemented outside Nix. So the maintainers used an impure
network fetch (only feasible with a builtin) and instituted a curse on the
build times of Nix users.

The reason that impure fetching needs to be a builtin is that Nix has an
important purity rule for derivations: either the input is fixed and network
access is disallowed, or the output is fixed and network access is allowed. In
the Nix model as designed ([content-addressed store] aside), derivations are
identified only by what goes into them, not by their output.

Let's see why that is. Assume that network access is allowed in normal
builders. If the URL but no hash goes in *and* network access is available,
anything can come out without changing the store path. Such an impurity would
completely break the fantastic property that Nix has no need for the notion of
a "clean build", since builds never get dirtied to begin with.

Thus, if one is doing an impure network fetch, the resulting store path has to
depend on the content without knowing the hash ahead of time. Therefore the
fetch *has* to serialize all evaluation after it, since it affects the store
paths of anything downstream of it during evaluation.
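
For contrast, the pure side of that rule is a *fixed-output derivation*: the
hash of the output is declared up front, and only then does the builder get
network access. A sketch (URL and hash are placeholders):

```nix
# Fixed-output fetch: the output hash is part of the derivation, so the store
# path is known before the download happens and purity is preserved.
{ pkgs ? import <nixpkgs> { } }:
pkgs.fetchurl {
  url = "https://example.org/some-source.tar.gz";
  sha256 = pkgs.lib.fakeSha256;  # placeholder; fill in the real hash
}
```
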
That said, it is, in my opinion, a significant design flaw in the Nix evaluator
that it cannot queue all the derivations that are reachable, rather than
stopping and building each in order.

[content-addressed store]: https://github.com/NixOS/rfcs/blob/master/rfcs/0062-content-addressed-paths.md
## Stories
I work at a Haskell shop which makes extensive use of Nix. We had a bug where
Nix would go and serially build "`all-cabal-hashes-component-*`". For several
minutes.
This was what I would call a "very frustrating and expensive developer UX bug".
I fixed it in a couple of afternoons by refactoring the use of
import-from-derivation to result in fewer switches between building and
evaluating, which I will expand on in a bit.
## Background on nixpkgs Haskell
The way that the nixpkgs Haskell infrastructure works is that it has a
[Stackage]-based package set based on some Stackage Long-Term Support release,
comprising package versions that are all known to work together. The set is
generated via a program called [`hackage2nix`][hackage2nix], which runs
`cabal2nix` against the entirety of Hackage.

`cabal2nix` is a program that generates package metadata and input hashes, and
hooks up the dependencies declared in the `.cabal` file to Nix build inputs.
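
The output is an ordinary Nix expression; for a small package it looks roughly
like this (package name, dependencies, and hash are made up):

```nix
# Roughly the shape of what cabal2nix emits for a small package (abbreviated).
{ mkDerivation, aeson, base, lib }:
mkDerivation {
  pname = "example-package";
  version = "0.1.0.0";
  sha256 = "...";  # hash of the source tarball on Hackage (placeholder)
  libraryHaskellDepends = [ aeson base ];
  license = lib.licenses.bsd3;
}
```
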
This set can then be overridden by [overlays] which can apply patches, override
sources, introduce new packages, and do basically any other arbitrary
modification.
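
Such an overlay might look roughly like this (package names, the patch, and the
source pin are made up):

```nix
# An overlay tweaking the generated Haskell package set (names illustrative).
final: prev: {
  haskellPackages = prev.haskellPackages.override {
    overrides = hfinal: hprev: {
      # Apply a local patch to a package from the generated set.
      some-package = prev.haskell.lib.appendPatch hprev.some-package ./some-fix.patch;
      # Swap in a different source for another package.
      other-package = prev.haskell.lib.overrideSrc hprev.other-package {
        src = prev.fetchFromGitHub {
          owner = "example";
          repo = "other-package";
          rev = "0123456789abcdef0123456789abcdef01234567";  # placeholder
          sha256 = prev.lib.fakeSha256;  # placeholder
        };
      };
    };
  };
}
```
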
At build time, the builder will be provisioned with a GHC package database with
everything in the build inputs of the package, and it will build and test the
package.

In this way, each dependency is turned into its own Nix derivation, so caching
of dependencies for development shells, parallelism across package builds, and
other useful properties simply fall out for free.

[Stackage]: https://www.stackage.org/
[hackage2nix]: https://github.com/NixOS/cabal2nix/tree/master/cabal2nix/hackage2nix
[overlays]: https://nixos.org/manual/nixpkgs/stable/#chap-overlays
## Where's the IFD?
nixpkgs Haskell provides a wonderfully useful function called `callCabal2nix`,
which executes `cabal2nix` to generate the Nix expression for some Haskell
source code at Nix evaluation time. Uh oh.
It also provides another wonderfully useful function called `callHackage`. This
is a very sweet function: it will grab a package of the specified version off
of Hackage, and call `cabal2nix` on it.
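
Typical usage of both looks something like this (package names and versions are
illustrative):

```nix
# Both of these run cabal2nix at evaluation time, i.e. they are IFD.
{ haskellPackages }:
{
  # Local source tree: cabal2nix runs against ./. during evaluation.
  my-app = haskellPackages.callCabal2nix "my-app" ./. { };

  # A specific version from Hackage: the tarball has to be fetched (somehow!)
  # so that cabal2nix can be run on it.
  lens = haskellPackages.callHackage "lens" "5.2" { };
}
```
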
Wait, how does that work, since you can't just download stuff for fun without
knowing its hash? Well, there's your problem.
"Figuring out hashes of stuff on Hackage" was solved by someone publishing a
comically large GitHub repo called `all-cabal-hashes`, containing hashes of all
the tarballs on Hackage, with CI to keep it up to date. Using this repo, you
only have to keep one hash up to date yourself: the hash of the version of
`all-cabal-hashes` you're using; the rest are just fetched from there.
### Oh no
This repository has an obscene number of files in it, such that it takes dozens
of seconds to unpack it. So it's simply not unpacked. Fetching a file out of it
involves invoking tar to selectively extract the relevant file from the giant
tarball of this repo.
That, in turn, takes around 7 seconds on the fastest MacBook available, for
each and every package, in serial. Also, Nix checks the binary caches for each
and every one, further compounding the fail.
I optimized it to take about 7 seconds, *total*. Although I *am* a witch, I
think that there is some generally applicable intuition derived from this that
can be used to make IFD go fast.
## Making IFD go fast
Nix is great at building big graphs of dependencies in parallel and caching
them. So what if we ask Nix to do that?
How can this be achieved?
What if you only demand that one big derivation be built with IFD, then reuse
it across all the usage sites?
### Details of `some-cabal2nix`
My observation was that hot-cache builds with a bunch of IFD are fine; it's
refilling the cache that's horribly painful, since Nix spends a lot of time
doing pointless things in serial. What if we warmed up the cache by asking Nix
to build all that stuff in one shot? Then the rest of the IFD would hit a hot
cache.

*The* major innovation in the fix, which I called `some-cabal-hashes`, is that
it builds *one* derivation with IFD that contains everything needed for further
evaluation; all the following imports then hit that already-built derivation.

Specifically, my build dependency graph now looks like:
```
/- cabal2nix-pkg1 -\
some-cabal2nix -+- cabal2nix-pkg2 -+- some-cabal-hashes -> all-cabal-hashes
\- cabal2nix-pkg3 -/
```
There are two notable things about this graph:
1. It is (approximately) the natural graph of the dependencies of the build
assuming that the Nix evaluator could keep going when it encounters IFD.
2. It allows Nix to naturally parallelize all the `cabal2nix-*` derivations.

Then, all of the usage sites are `import "${some-cabal2nix}/pkg1"` or similar.

In this way, one derivation is built, letting Nix do what it's good at. I did
something clever also: I made `some-cabal2nix` have no runtime dependencies by
*copying* all the resulting cabal files into the output directory. Thus, the
whole thing can be fetched from a cache server and not built at all.
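
Roughly, the shape of the construction is something like the following sketch;
every name here is illustrative, and this is not the actual implementation:

```nix
# Sketch of the aggregation: one cabal2nix derivation per wanted package, all
# copied (not symlinked) into a single output, so a single IFD build (or a
# single binary cache hit) serves every later `import`.
{ pkgs, someCabalHashes, wanted }:
# `someCabalHashes` extracts the wanted .cabal files from all-cabal-hashes in
# a single pass of tar; `wanted` is a list of { name, version } attrsets
# collected from the overlays ahead of time.
let
  cabal2nixFor = { name, version }:
    pkgs.runCommand "cabal2nix-${name}-${version}"
      { nativeBuildInputs = [ pkgs.cabal2nix ]; } ''
        mkdir -p $out
        cabal2nix ${someCabalHashes}/${name}/${version} > $out/default.nix
      '';
in
pkgs.runCommand "some-cabal2nix" { } ''
  mkdir -p $out
  ${pkgs.lib.concatMapStringsSep "\n"
      (p: "cp ${cabal2nixFor p}/default.nix $out/${p.name}.nix")
      wanted}
''
```

Copying rather than symlinking is what strips the runtime references to the
per-package derivations, which is what lets the whole thing be substituted from
a cache.
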
Acquiring the data to know what will be demanded by any IFD is the other piece
of the puzzle, of course. I extracted that data from the overlays by calling
the overlays with stubs first (to avoid a cyclic dependency), then evaluating
for real with a `callHackage` function backed by the `some-cabal2nix` built
from that information.

The last and very important optimization I did was to fix the `tar` invocation.
`tar` files have a linear structure that is perfect for making `t`ape
`ar`chives (hence the name of the tool) which can be streamed to a tape: one
file after the other, without any index. Thus, finding a single file in the
tarball takes `O(n)` time, where `n` is the number of files in the archive.

If you call `tar` `m` times, once per file you need, you do `O(n*m)` work.
However, if you call `tar` *once* with the whole set of files you want, so that
it can make a single pass through the archive with an `O(1)` membership check
against that set, the overall time complexity is `O(n)`. I assume that is what
`tar` actually does, since it takes basically the same time to extract a long
file list as a single file.

Enough making myself sound like I am in an ivory tower with big O notation;
regardless, extracting with a file list yielded a major performance win.
I also found that if you use the `--wildcards` option, `tar` is extremely slow
and it seems worse with more files to extract. Use exact file paths instead.
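
Concretely, the extraction step can be a single derivation along these lines
(a sketch; the member paths inside the tarball are illustrative):

```nix
# One tar pass over the all-cabal-hashes tarball: exact member names, no
# --wildcards, read from a file list.
{ pkgs, allCabalHashes, wanted }:
# `allCabalHashes` is the fetched tarball; `wanted` is a list of member paths
# like "all-cabal-hashes-<rev>/lens/5.2/lens.cabal".
pkgs.runCommand "some-cabal-hashes" { } ''
  mkdir -p $out
  printf '%s\n' ${pkgs.lib.escapeShellArgs wanted} > wanted.txt
  tar -xzf ${allCabalHashes} -C $out --files-from=wanted.txt
''
```
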
FIXME: get permission to release the sources of that file