diff --git a/content/posts/extensions-to-the-nix-store.md b/content/posts/extensions-to-the-nix-store.md
new file mode 100644
index 0000000..7f2dc8f
--- /dev/null
+++ b/content/posts/extensions-to-the-nix-store.md
@@ -0,0 +1,19 @@
++++
+date = "2022-10-30"
+draft = true
+path = "/blog/extensions-to-the-nix-store"
+tags = []
+title = "Extensions to the Nix store"
++++
+
+This post is a summary/index of the proposed new features in the Nix store,
+since I keep struggling to find the documents for them.
+
+# Content addressability
+
+## "intensional store"
+
+This was introduced in section 6 of [Eelco's PhD thesis][phd-thesis], as the
+opposite approach to the "extensional store".
+
+[phd-thesis]: https://edolstra.github.io/pubs/phd-thesis.pdf
diff --git a/content/posts/fixing-up-pdfs.md b/content/posts/fixing-up-pdfs.md
new file mode 100644
index 0000000..0891dee
--- /dev/null
+++ b/content/posts/fixing-up-pdfs.md
@@ -0,0 +1,194 @@
++++
+date = "2022-10-30"
+draft = false
+path = "/blog/workflow-pdfs"
+tags = ["pdf"]
+title = "My workflow: Managing and munging PDFs"
++++
+
+Dealing with PDFs is something I do every day as someone working in software,
+especially given that I tend toward both research and lower-level work where
+papers and datasheets rule.
+
+I think that the humble PDF is one of my favourite file formats besides text:
+- You can give someone one and it will work
+- Vectors work great in it
+- Old files also just work
+- Anywhere on the continuum from "digital-native output" to "a scan" can be
+  represented and worked with nicely
+- Search is typically pretty great once you have the right document: PDFs
+  tend to be *large*, so CTRL-F can go very far
+
+That said, "not being a text file" does sometimes make some tasks difficult,
+metadata is often dubious, and I am drowning in a mountain of PDFs at all
+times.
+
+Most of the stuff described in this post can probably be done with Adobe
+Acrobat, but it is not available for my computer. All of the tools described
+below are packaged in the AUR or the main repos on Arch and are not hard to
+run on other operating systems.
+
+# Fixing PDFs
+
+There are several tools I regularly use for fixing up PDFs off the internet,
+since it's unfortunately common that they come in with bad metadata, or in
+other problematic forms.
+
+## Page numbering
+
+PDF supports switching page numbering midway through the document, for
+instance, if the front-matter is numbered in Roman numerals and the main
+content is in Arabic numerals. Too often, large PDFs that run across my desk
+don't have this set up properly, so the page numbers are annoyingly offset.
+
+You can fix this with the "page numbering" feature of [jPDF Tweak][jpdf-tweak];
+see the [jPDF Tweak manual](https://jpdftweak.sourceforge.net/manual/index.html)
+for details.
+
+## Document outline
+
+PDF has a great feature called "document outline" or "bookmarks", which lets
+you include the table of contents in a searchable form that will show up in
+the sidebar of good PDF viewers.
+
+Unfortunately, many PDFs don't have these set up, which makes big documents a
+hassle to work with as you have to jump back and forth between the table of
+contents page and the rest of the document to find things. Fortunately, these
+can be fixed.
+
+There are three main tools that are useful for bookmarks hacking:
+- [jPDF Tweak][jpdf-tweak], a multi-tool for doing various metadata hacking.
+- [JPdfBookmarks], a powerful bookmarks-specific editor.
+- [HandyOutliner], a small tool mostly useful for turning textual tables of
+  contents into bookmarks.
+
+[jpdf-tweak]: https://jpdftweak.sourceforge.net/
+[HandyOutliner]: https://handyoutlinerfo.sourceforge.net/
+[JPdfBookmarks]: https://sourceforge.net/projects/jpdfbookmarks/
+
+### Hyperlinked table of contents
+
+This is the most convenient case: the author put in a hyperlinked table of
+contents, but somehow the tooling didn't create a document outline. If this
+happens, you can get a perfect outline with almost no work.
+
+Use the "Extract links from current page and add them as bookmarks" button in
+[JPdfBookmarks] to deal with this. It will do as it says: just grab all the
+hyperlinks and turn them directly into a document outline.
+
+This is great since generally the hyperlinks will have correct page positions,
+so the outline will go to the right spot on the page in addition to going to
+the right page.
+
+### Textual table of contents
+
+If you can cleanly get or create a table of contents such as the following:
+
+```text
+I. Introduction 1
+1. Introduction 3
+1.1. Software deployment 3
+1.2. The state of the art 6
+1.3. Motivation 13
+1.4. The Nix deployment system 14
+1.5. Contributions 14
+1.6. Outline of this thesis 16
+1.7. Notational conventions 17
+```
+
+then the best bet is probably to use [HandyOutliner] to ingest that table of
+contents as text and create bookmarks.
+
+Often copy support in PDF tables of contents is pretty awful (and I can only
+imagine it does horrors to screen readers), so it may need a serious amount of
+cleanup in a text editor, as was the case for me while making an outline for
+Eelco Dolstra's PhD thesis on Nix.
+
+Another way to do this is with the "Bookmarks" tab in [jPDF
+Tweak][jpdf-tweak], importing a CSV you write yourself.
+
+Such a CSV looks like this:
+
+```text
+1;O;Acknowledgements;3
+1;O;Contents;5
+1;O;I. Introduction;9
+2;O;1. Introduction;11
+3;O;1.1. Software deployment;11
+3;O;1.2. The state of the art;14
+```
+
+The columns are:
+
+1. Depth
+2. Open ("O" if the level in the tree should start opened, else "")
+3. Data (the bookmark title)
+4. Page number. You can also put coordinates at the end if truly motivated.
+
+## Encrypted PDFs
+
+These are annoying. You can strip the encryption with `qpdf`:
+
+```text
+qpdf --decrypt input.pdf output.pdf
+```
+
+## Pages are in the wrong order/PDFs need merging
+
+Imagine that you have been fighting a scanner to scan some document and the
+software for it is bad and doesn't show previews large enough to make out the
+page numbers. Exasperated, you just save the PDF knowing the pages are in the
+wrong order and spread over multiple files.
+
+For this, use [pdfarranger], which makes it easy to reorder pages as desired.
+
+[pdfarranger]: https://github.com/pdfarranger/pdfarranger
+
+# Having too many PDFs in my life
+
+## Directory full of PDFs to search
+
+Relatable problem! Use [pdfgrep]:
+
+```text
+pdfgrep -nri 'somequery' .
+```
+
+[pdfgrep]: https://pdfgrep.org/
+
+## Too many bloody PDFs; overflowing disorganized directories
+
+Academics have this problem and equally have solutions: use [Zotero] or
+similar research document management software to categorize and tag documents.
+
+[Zotero]: https://www.zotero.org/
+
+## Getting more of them
+
+As I have student credentials, I can use the University library to get
+documents.
+However, getting authenticated to publisher sites is annoying: I often don't
+use the University library's search system since it can have poor results, but
+the login pages for individual publisher sites are confusing as well.
+
+UBC uses OpenAthens for their access control on publisher sites. They have a
+rather nice uniform redirector service that can log in and redirect back to
+sites:
+
+
+I made a little bookmarklet to authenticate to publisher sites:
+
+```javascript
+javascript:void(location.href='https://go.openathens.net/redirector/ubc.ca?url='+location.href)
+```
+
+It's also possible to use a well-known Web site to "acquire" papers, which is
+often more convenient than fighting the silly barriers that publishers use to
+extract profit from keeping publicly-funded knowledge unfree (paper authors
+are paid *nil* by journals), even with legitimate access. If one were to use
+such a hypothetical Web site, the easiest way is to put the DOI of the paper
+into it.
+
+Also, paper authors probably have copies of their papers, and are typically
+happy to send them to you for free if you email them.
+
diff --git a/content/posts/speedy-ifd.md b/content/posts/speedy-ifd.md
new file mode 100644
index 0000000..65d4bde
--- /dev/null
+++ b/content/posts/speedy-ifd.md
@@ -0,0 +1,245 @@
++++
+date = "2022-10-18"
+draft = true
+path = "/blog/speedy-ifd"
+tags = ["haskell", "nix"]
+title = "Speedy import-from-derivation in Nix?"
++++
+
+Nix has a feature called "import from derivation", which is sometimes called
+"such a nice foot gun" (grahamc, 2022). I can't argue with its usefulness; it
+lets Nix do amazing things that can't be accomplished any other way, and avoid
+pointlessly checking build products into git. However, it has a dirty secret:
+it can *atrociously* slow down your builds.
+
+The essence of this feature is that Nix can perform operations such as
+building derivations whose results are used in the *evaluation stage*.
+
+## Nix build staging?
+
+Nix, in its current implementation (there are efforts [such as tvix][tvix] to
+change this), can do one of two things at any given time.
+
+* Evaluate: run Nix expressions to create some derivations to build. This
+  stage outputs `.drv` files, which can then be built (realised). Nix
+  evaluation happens in serial (single threaded), and in a lazy fashion.
+* Build: given some `.drv` files, fetch the result from a binary cache or
+  build from scratch.
+
+[tvix]: https://code.tvl.fyi/about/tvix
+
+### How does import-from-derivation fit in?
+
+Import-from-derivation (IFD for short) lets you do magical things: since Nix
+derivations can do arbitrary computation in any language, Nix expressions or
+other data can be generated by external programs that need to do pesky things
+like parsing cursed file formats such as cabal files.
+
+N.B. I've heard that someone wrote a PureScript compiler targeting the Nix
+language, which was then used for [parsing a Cabal file to do cabal2nix's
+job][evil-cabal2nix] entirely within Nix. Nothing is sacred.
+
+[evil-cabal2nix]: https://github.com/cdepillabout/cabal2nixWithoutIFD
+
+In order to achieve this, however, the evaluation stage can demand that builds
+be run. In fact, such builds need to finish before evaluation can proceed! So
+IFD serializes builds.
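+
+As a tiny, self-contained illustration (a hand-written sketch, not code from
+any real project), this is the kind of expression that forces IFD: the
+derivation below has to be built before evaluation can get past the `import`:
+
+```nix
+let
+  pkgs = import <nixpkgs> { };
+  # A derivation whose output is itself a Nix expression.
+  generated = pkgs.runCommand "generated-expr" { } ''
+    echo '{ greeting = "hello"; }' > $out
+  '';
+in
+  # Importing a build output makes evaluation stop and wait for the build.
+  (import generated).greeting
+```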
+
+### What constitutes IFD?
+
+The following is a nonexhaustive list of things constituting IFD:
+* `builtins.readFile someDerivation`
+* `import someDerivation`
+* *Any use* of builtin fetchers:
+  * `builtins.fetchGit`
+  * `builtins.fetchTree`
+  * `builtins.fetchTarball`
+  * `builtins.fetchurl`
+  * etc.
+
+#### Builtin fetchers
+
+Use of builtin fetchers is a surprisingly common IFD problem. Sometimes it is
+done by mistake, but other times it is done for good reason, with unfortunate
+tradeoffs. I think it's reasonable to use IFD to import libraries such as
+nixpkgs, since fundamentally the thing needs to be fetched for evaluation to
+proceed, but other cases are more dubious.
+
+One reason one might use the builtin fetchers is that there is no way
+(excepting calculated use of impure builders) to use a derivation to download
+a URL without knowing the hash ahead of time.
+
+An example I've seen of this being done on purpose is wanting to avoid
+requiring contributors to have Nix installed to update the hash of some
+tarball, since Nix has its own bespoke algorithm for hashing tarball contents
+that nobody has yet implemented outside Nix. So the maintainers used an impure
+network fetch (only feasible with a builtin) and instituted a curse on the
+build times of Nix users.
+
+The reason that impure fetching needs to be a builtin is that Nix has an
+important purity rule for derivations: either the inputs are fixed and network
+access is disallowed, or the output is fixed and network access is allowed. In
+the Nix model as designed ([content-addressed store] aside), derivations are
+identified only by what goes into them, not by their output.
+
+Let's see why that is. Assume that network access is allowed in normal
+builders. If the URL but no hash goes in *and* network access is available,
+anything can come out without changing the store path. Such an impurity would
+completely break the fantastic property that Nix has no such thing as a clean
+build, since builds don't get dirtied to begin with.
+
+Thus, if one is doing an impure network fetch, the resulting store path has to
+depend on the fetched content, which is not known ahead of time. Therefore the
+fetch *has* to serialize all evaluation after it, since it affects the store
+paths of anything downstream of it during evaluation.
+
+That said, it is, in my opinion, a significant design flaw in the Nix
+evaluator that it cannot queue up all the reachable derivations, rather than
+stopping and building each one in order.
+
+[content-addressed store]: https://github.com/NixOS/rfcs/blob/master/rfcs/0062-content-addressed-paths.md
+
+## Stories
+
+I work at a Haskell shop that makes extensive use of Nix. We had a bug where
+Nix would go and serially build "`all-cabal-hashes-component-*`". For several
+minutes.
+
+This was what I would call a "very frustrating and expensive developer UX
+bug". I fixed it in a couple of afternoons by refactoring the use of
+import-from-derivation to result in fewer switches between building and
+evaluating, which I will expand on in a bit.
+
+## Background on nixpkgs Haskell
+
+The nixpkgs Haskell infrastructure has a package set based on some [Stackage]
+Long-Term Support release, comprising package versions that are all known to
+work together. The set is generated via a program called
+[`hackage2nix`][hackage2nix], which runs `cabal2nix` against the entirety of
+Hackage.
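+
+For a rough idea of what comes out of that process, a cabal2nix-generated
+expression looks something like the following (a hand-written sketch; the
+package name, hash, and fields are illustrative rather than real output):
+
+```nix
+{ mkDerivation, aeson, base, lib }:
+mkDerivation {
+  pname = "some-package";   # hypothetical package
+  version = "1.2.3";
+  sha256 = "0000000000000000000000000000000000000000000000000000";
+  # Haskell dependencies from the .cabal file, mapped to Nix build inputs.
+  libraryHaskellDepends = [ aeson base ];
+  description = "An example package";
+  license = lib.licenses.bsd3;
+}
+```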
+
+`cabal2nix` is a program that generates metadata and input hashes, and hooks
+up dependencies declared in the `.cabal` files to Nix build inputs.
+
+This set can then be overridden by [overlays], which can apply patches,
+override sources, introduce new packages, and do basically any other arbitrary
+modification.
+
+At build time, the builder will be provisioned with a GHC package database
+containing everything in the package's build inputs, and it will build and
+test the package.
+
+In this way, each dependency is turned into a Nix derivation, so caching of
+dependencies for development shells, parallelism across package builds, and
+other useful properties simply fall out for free.
+
+[Stackage]: https://www.stackage.org/
+[hackage2nix]: https://github.com/NixOS/cabal2nix/tree/master/cabal2nix/hackage2nix
+[overlays]: https://nixos.org/manual/nixpkgs/stable/#chap-overlays
+
+## Where's the IFD?
+
+nixpkgs Haskell provides a wonderfully useful function called `callCabal2nix`,
+which executes `cabal2nix` to generate the Nix expression for some Haskell
+source code at Nix evaluation time. Uh oh.
+
+It also provides another wonderfully useful function called `callHackage`.
+This is a very sweet function: it will grab a package of the specified version
+off of Hackage, and call `cabal2nix` on it.
+
+Wait, how does that work, since you can't just download stuff for fun without
+knowing its hash? Well, there's your problem.
+
+"Figuring out hashes of stuff on Hackage" was solved by someone publishing a
+comically large GitHub repo called `all-cabal-hashes` with hashes of all of
+the tarballs on Hackage, plus CI to keep it up to date. Using this repo, you
+only have to deal with keeping one hash up to date: the hash of the version of
+`all-cabal-hashes` you're using; the rest are just fetched from there.
+
+### Oh no
+
+This repository has an obscene number of files in it, such that it takes
+dozens of seconds to unpack it. So it's simply not unpacked. Fetching a file
+out of it involves invoking `tar` to selectively extract the relevant file
+from the giant tarball of this repo.
+
+That, in turn, takes around 7 seconds on the fastest MacBook available, for
+each and every package, in serial. Also, Nix checks the binary caches for each
+and every one, further compounding the fail.
+
+I optimized it to take about 7 seconds, *total*. Although I *am* a witch, I
+think that there is some generally applicable intuition to be derived from
+this that can be used to make IFD go fast.
+
+# Making IFD go fast
+
+Nix is great at building big graphs of dependencies in parallel and caching
+them. So what if we ask Nix to do that?
+
+How can this be achieved?
+
+What if you only demand that one big derivation be built with IFD, then reuse
+it across all the usage sites?
+
+## Details of `some-cabal2nix`
+
+My observation was that hot-cache builds with a bunch of IFD are fine; it's
+refilling the cache that's horribly painful, since Nix spends a lot of time
+doing pointless things in serial. What if we warmed up the cache by asking it
+to build all that stuff in one shot? Then, the rest of the IFD would hit a hot
+cache.
+
+*The* major innovation in the fix, which I called `some-cabal-hashes`, is that
+it builds *one* derivation with IFD that contains everything that will be
+needed for further evaluation, so all the following imports hit that
+already-built derivation.
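+
+To make the shape of that concrete, here is a minimal sketch of the
+aggregation idea (hypothetical names, not the actual implementation, which
+isn't public): every per-package `cabal2nix` result gets copied into a single
+derivation, so one IFD build warms the cache for all later imports.
+
+```nix
+{ lib, runCommand }:
+
+# cabal2nixExprs: an attrset mapping package names to derivations that each
+# produce a cabal2nix-generated expression.
+cabal2nixExprs:
+
+runCommand "some-cabal2nix" { } ''
+  mkdir -p $out
+  ${lib.concatStringsSep "\n" (lib.mapAttrsToList
+      (name: drv: "cp -r ${drv} $out/${name}") cabal2nixExprs)}
+''
+```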
+
+Specifically, my build dependency graph now looks like:
+
+```
+                /- cabal2nix-pkg1 -\
+some-cabal2nix -+- cabal2nix-pkg2 -+- some-cabal-hashes -> all-cabal-hashes
+                \- cabal2nix-pkg3 -/
+```
+
+There are two notable things about this graph:
+
+1. It is (approximately) the natural graph of the dependencies of the build,
+   assuming that the Nix evaluator could keep going when it encounters IFD.
+
+2. It allows Nix to naturally parallelize all the `cabal2nix-*` derivations.
+
+Then, all of the usage sites are `import "${some-cabal2nix}/pkg1"` or similar.
+In this way, one derivation is built, letting Nix do what it's good at. I also
+did something clever: I made `some-cabal2nix` have no runtime dependencies by
+*copying* all the resulting cabal files into the output directory. Thus, the
+whole thing can be fetched from a cache server and not built at all.
+
+Acquiring the data to know what will be demanded by any IFD is the other piece
+of the puzzle, of course. I extracted the data from the overlays by calling
+the overlays with stubs first (to avoid a cyclic dependency), then evaluating
+for real with a `callHackage` function backed by the `some-cabal2nix` built
+from that information.
+
+The last and very important optimization I did was to fix the `tar`
+invocation. `tar` files have a linear structure that is perfect for making
+`t`ape `ar`chives (hence the name of the tool), which can be streamed to a
+tape: one file after the other, without any index. Thus, finding a file in the
+tarball takes `O(n)` time, where `n` is the number of files in the archive.
+
+If you call `tar` once per file for the `m` files you need, then you do
+`O(n*m)` work. However, if you call `tar` *once* with the set of files you
+want, so that it can do a single pass through the archive with an `O(1)`
+membership check against that set, then the overall time complexity is `O(n)`.
+I can only assume that is what `tar` actually does, since it takes basically
+the same time to extract a long file list as a single file.
+
+Enough making myself sound like I am in an ivory tower with big O notation;
+regardless, extracting with a file list yielded a major performance win.
+
+I also found that if you use the `--wildcards` option, `tar` is extremely slow,
+and it seems worse with more files to extract. Use exact file paths instead.
+
+FIXME: get permission to release the sources of that file
+