+++
date = "2022-10-30"
draft = true
path = "/blog/extensions-to-the-nix-store"
tags = []
title = "Extensions to the Nix store"
+++
This post is a summary/index of the proposed new features in the Nix store,
since I keep repeatedly struggling to find the documents for them.
# Content addressability
## "intensional store"
This was introduced in section 6 of [Eelco's PhD thesis][phd-thesis] as the
opposite approach to the "extensional store".
[phd-thesis]: https://edolstra.github.io/pubs/phd-thesis.pdf

+++
date = "2022-10-30"
draft = false
path = "/blog/workflow-pdfs"
tags = ["pdf"]
title = "My workflow: Managing and munging PDFs"
+++
Dealing with PDFs is something I do every day as someone working in software,
especially given that I tend toward both research and lower-level work where
papers and datasheets rule.
I think that the humble PDF is one of my favourite file formats besides text:
- You can give someone one and it will work
- Vectors work great in it
- Old files also just work
- Anything on the continuum from "digital-native output" to "a scan" can be
  represented and worked with nicely
- Search is typically pretty great once you have the right document: PDFs tend
  to be *large*, so CTRL-F can go very far
That said, "not being a text file" does sometimes make some tasks difficult,
metadata is often dubious, and I am usually drowning in a mountain of PDFs at
all times.
Most of the stuff described in this post can probably be done with Adobe
Acrobat, but it is not available for my computer. All of the tools described
below are packaged in the AUR or the main Arch repositories, and they are not
hard to run on other operating systems.
# Fixing PDFs
There are several tools I regularly use for fixing up PDFs off the internet,
since it's unfortunately common for them to arrive with bad metadata or in
other problematic forms.
## Page numbering
PDF supports switching page numbering midway through the document, for
instance, if the front-matter is numbered in Roman numerals and the main
content is in Arabic numerals. Too often, large PDFs that run across my desk
don't have this set up properly, so the page numbers are annoyingly offset.
You can fix this with the "page numbering" feature of [jPDF Tweak][jpdf-tweak].
See the [jPDF Tweak manual](https://jpdftweak.sourceforge.net/manual/index.html) for details.
## Document outline
PDF has a great feature called "document outline" or "bookmarks", which lets
you include the table of contents in a searchable form that shows up in the
sidebar of good PDF viewers.
Unfortunately, many PDFs don't have these set up, which makes big documents a
hassle to work with as you have to jump back and forth between the table of
contents page and the rest of the document to find things. Fortunately, these
can be fixed.
There are three main tools that are useful for bookmarks hacking:
- [jPDF Tweak][jpdf-tweak], a multi-tool for doing various metadata hacking.
- [JPdfBookmarks], a powerful bookmarks-specific editor.
- [HandyOutliner], a small tool mostly useful to turn textual
tables of contents into bookmarks.
[jpdf-tweak]: https://jpdftweak.sourceforge.net/
[HandyOutliner]: https://handyoutlinerfo.sourceforge.net/
[JPdfBookmarks]: https://sourceforge.net/projects/jpdfbookmarks/
### Hyperlinked table of contents
This is the most convenient case: the author put in a hyperlinked table of
contents, but somehow the tooling didn't create a document outline. If this
happens, you can get a perfect outline with almost no work.
Use the "Extract links from current page and add them as bookmarks" button in
[JPdfBookmarks] to deal with this. It will do as it says: just grab all the
hyperlinks and turn them directly into a document outline.
This is great since generally the hyperlinks will have correct page positions
and so the outline will go to the right spot on the page in addition to going
to the right page.
### Textual table of contents
If you can cleanly get or create a table of contents such as the following:
```text
I. Introduction 1
1. Introduction 3
1.1. Software deployment 3
1.2. The state of the art 6
1.3. Motivation 13
1.4. The Nix deployment system 14
1.5. Contributions 14
1.6. Outline of this thesis 16
1.7. Notational conventions 17
```
Then the best bet is probably to use [HandyOutliner] to ingest that table of
contents as text and create bookmarks.
Often copy support in PDF tables of contents is pretty awful (and I can only
imagine the horrors it inflicts on screen readers), so the text may need a
serious amount of cleanup in a text editor, as was the case for me while making
an outline for Eelco Dolstra's PhD thesis on Nix.
Another way this can be done is with the "Bookmarks" tab in [jPDF
Tweak][jpdf-tweak] and importing a CSV you make.
Such a CSV looks like so:
```
1;O;Acknowledgements;3
1;O;Contents;5
1;O;I. Introduction;9
2;O;1. Introduction;11
3;O;1.1. Software deployment;11
3;O;1.2. The state of the art;14
```
The columns are:
1. Depth
2. Open ("O" if the level in the tree should start opened, else "")
3. Data
4. Page number. You can also put coordinates at the end if truly motivated.
## Encrypted PDFs
These are annoying. You can strip the encryption with `qpdf`:
```text
qpdf --decrypt input.pdf output.pdf
```
## Pages are in the wrong order/PDFs need merging
Imagine that you have been fighting a scanner to scan some document and the
software for it is bad and doesn't show previews large enough to make out the
page numbers. Exasperated, you just save the PDF knowing the pages are in the
wrong order and spread over multiple files.
For this, use [pdfarranger], which makes it easy to reorder pages as desired.
[pdfarranger]: https://github.com/pdfarranger/pdfarranger
# Having too many PDFs in my life
## Directory full of PDFs to search
Relatable problem! Use [pdfgrep]:
```text
pdfgrep -nri 'somequery' .
```
[pdfgrep]: https://pdfgrep.org/
## Too many bloody PDFs; overflowing disorganized directories
Academics have this problem and have solutions to match: use [Zotero] or
similar research document management software to categorize and tag documents.
[Zotero]: https://www.zotero.org/
## Getting more of them
As I have student credentials, I can use the University library to get
documents. However, getting authenticated to publisher sites is annoying: I
often don't use the University library's search system since it can have poor
results, but the login pages for individual publisher sites are confusing as
well.
UBC uses OpenAthens for their access control on publisher sites. They have a
rather nice uniform redirector service that can log in and redirect back to
sites:
<https://docs.openathens.net/libraries/redirector-link-generator>
I made a little bookmarklet to authenticate to publisher sites:
```javascript
javascript:void(location.href='https://go.openathens.net/redirector/ubc.ca?url='+encodeURIComponent(location.href))
```
It's also possible to use a well-known Web site to "acquire" papers, which is
often more convenient than the silly barriers publishers use to extract profit
from keeping publicly-funded knowledge unfree (paper authors are paid *nil* by
journals), even when you have legitimate access. If one were to use such a
hypothetical Web site, the easiest way is to paste in the paper's DOI.
Also, paper authors probably have copies of their papers, and are typically
happy to send them to you for free if you email them.

+++
date = "2022-10-18"
draft = true
path = "/blog/speedy-ifd"
tags = ["haskell", "nix"]
title = "Speedy import-from-derivation in Nix?"
+++
Nix has a feature called "import from derivation", which is sometimes called
"such a nice foot gun" (grahamc, 2022). I can't argue with its
usefulness; it lets Nix do amazing things that can't be accomplished any other
way, and avoid pointlessly checking build products into git. However, it has a
dirty secret: it can *atrociously* slow down your builds.
The essence of this feature is that Nix can perform operations such as building
derivations whose results are used during the *evaluation stage*.
## Nix build staging?
Nix, in its current implementation (there are efforts [such as tvix][tvix] to
change this), can do one of two things at a given time.
* Evaluate: run Nix expressions to produce some derivations to build. This stage
  outputs `.drv` files, which can then be realised (built). Nix evaluation happens
  in serial (single threaded), and in a lazy fashion.
* Build: given some `.drv` files, fetch the result from a binary cache or build
from scratch.
[tvix]: https://code.tvl.fyi/about/tvix
### How does import-from-derivation fit in?
Import-from-derivation (IFD for short) lets you do magical things: since Nix
derivations can do arbitrary computation in any language, Nix expressions or
other data can be generated by external programs, which is handy for pesky jobs
such as parsing cursed file formats like Cabal files.
N.B. I've heard that someone wrote a PureScript compiler targeting the Nix
language, which was then used for [parsing a Cabal file to do cabal2nix's job][evil-cabal2nix]
entirely within Nix. Nothing is sacred.
[evil-cabal2nix]: https://github.com/cdepillabout/cabal2nixWithoutIFD
In order to achieve this, however, the evaluation stage can demand that builds
be run. In fact, such builds need to finish before evaluation can proceed! So
IFD serializes builds against evaluation.
### What constitutes IFD?
The following is a non-exhaustive list of things that constitute IFD (a minimal
example follows the list):
* `builtins.readFile someDerivation`
* `import someDerivation`
* *Any use* of builtin fetchers:
* `builtins.fetchGit`
* `builtins.fetchTree`
* `builtins.fetchTarball`
* `builtins.fetchurl`
* etc
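For concreteness, here is a minimal sketch of IFD (the derivation and names are
made up for illustration): evaluation cannot proceed past the `import` until
the derivation has actually been built.
```nix
let
  pkgs = import <nixpkgs> { };
  # A tiny derivation whose output is a Nix expression.
  generated = pkgs.runCommand "generated-expression" { } ''
    echo '{ answer = 42; }' > $out
  '';
in
  # Importing the build result forces a build in the middle of evaluation.
  (import generated).answer
```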
#### Builtin fetchers
Use of builtin fetchers is a surprisingly common IFD problem. Sometimes it is
done by mistake, but other times it is done for good reason, with unfortunate
tradeoffs. I think it's reasonable to use IFD to import libraries such as
nixpkgs, since fundamentally the thing needs to be fetched for evaluation to
proceed, but other cases are more dubious.
One reason one might use the builtin fetchers is that there is no way
(excepting calculated use of impure builders) to use a derivation to download a
URL without knowing the hash ahead of time.
An example I've seen of this being done on purpose is wanting to avoid
requiring contributors to have Nix installed to update the hash of some
tarball, since Nix has its own bespoke algorithm for hashing tarball contents
that nobody has yet implemented outside Nix. So the maintainers used an impure
network fetch (only feasible with a builtin) and instituted a curse on the
build times of Nix users.
The reason that impure fetching needs to be a builtin is that Nix has an
important purity rule for derivations: either the input is fixed and network
access is disallowed, or the output is fixed and network access is allowed. In
the Nix model as designed ([content-addressed store] aside), derivations are
identified only by what goes into them, not by their output.
Let's see why that is. Assume that network access is allowed in normal
builders. If a URL, but no hash, goes in *and* network access is available,
anything can come out without changing the store path. Such an impurity would
completely break the fantastic property that Nix has no notion of a "clean
build", because builds never get dirtied to begin with.
Thus, if one is doing an impure network fetch, the resulting store path has to
depend on the downloaded content, which cannot be known ahead of time.
Therefore the fetch *has* to serialize all evaluation after it, since it
affects the store paths of everything downstream of it during evaluation.
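As a sketch of the contrast (the URL and hash below are placeholders):
```nix
let
  pkgs = import <nixpkgs> { };
in
{
  # Fixed-output derivation: network access is allowed in the builder because
  # the result is pinned by its hash up front, so the store path is known.
  pinned = pkgs.fetchurl {
    url = "https://example.org/release.tar.gz";
    sha256 = pkgs.lib.fakeSha256;  # placeholder; a real hash goes here
  };

  # Impure builtin fetch: no hash is given, so the store path is only known
  # after the download finishes, stalling everything downstream of it.
  impure = builtins.fetchTarball "https://example.org/release.tar.gz";
}
```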
That said, it is, in my opinion, a significant design flaw in the Nix evaluator
that it cannot queue all the derivations that are reachable, rather than
stopping and building each in order.
[content-addressed store]: https://github.com/NixOS/rfcs/blob/master/rfcs/0062-content-addressed-paths.md
## Stories
I work at a Haskell shop which makes extensive use of Nix. We had a bug where
Nix would go and serially build "`all-cabal-hashes-component-*`". For several
minutes.
This was what I would call a "very frustrating and expensive developer UX bug".
I fixed it in a couple of afternoons by refactoring the use of
import-from-derivation to result in fewer switches between building and
evaluating, which I will expand on in a bit.
## Background on nixpkgs Haskell
The way the nixpkgs Haskell infrastructure works is that it has a
[Stackage]-based package set built from some Stackage Long Term Support (LTS)
release, comprising package versions that are all known to work together. The
set is generated by a program called [`hackage2nix`][hackage2nix], which runs
`cabal2nix` against the entirety of Hackage.
`cabal2nix` is a program that generates metadata and input hashes, and hooks up
the dependencies declared in `.cabal` files to Nix build inputs.
This set can then be overridden by [overlays] which can apply patches, override
sources, introduce new packages, and do basically any other arbitrary
modification.
At build time, the builder is provisioned with a GHC package database
containing everything in the package's build inputs, and it builds and tests
the package.
In this way, each dependency is turned into a Nix derivation so caching
of dependencies for development shells, parallelism across package builds, and
other useful properties simply fall out for free.
[Stackage]: https://www.stackage.org/
[hackage2nix]: https://github.com/NixOS/cabal2nix/tree/master/cabal2nix/hackage2nix
[overlays]: https://nixos.org/manual/nixpkgs/stable/#chap-overlays
## Where's the IFD?
nixpkgs Haskell provides a wonderfully useful function called `callCabal2nix`,
which executes `cabal2nix` to generate the Nix expression for some Haskell
source code at Nix evaluation time. Uh oh.
It also provides another wonderfully useful function called `callHackage`. This
is a very sweet function: it will grab a package of the specified version off
of Hackage, and call `cabal2nix` on it.
Wait, how does that work, since you can't just download stuff for fun without
knowing its hash? Well, there's your problem.
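For concreteness, typical usage looks roughly like the following sketch (the
package names, version, and paths are made up):
```nix
let
  pkgs = import <nixpkgs> { };
in
pkgs.haskellPackages.override {
  overrides = self: super: {
    # Generate the Nix expression for a local source tree at evaluation time (IFD).
    my-project = self.callCabal2nix "my-project" ./. { };
    # Fetch a package of a given version off Hackage and run cabal2nix on it (IFD).
    some-dep = self.callHackage "some-dep" "1.2.3" { };
  };
}
```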
"Figuring out hashes of stuff on Hackage" was solved by someone publishing a
comically large GitHub repo called `all-cabal-hashes`, containing hashes of all
the tarballs on Hackage, with CI to keep it up to date. Using this repo, you
only have to keep one hash up to date: the hash of the version of
`all-cabal-hashes` you're using; the rest are just fetched from there.
### Oh no
This repository has an obscene number of files in it, such that it takes dozens
of seconds to unpack it. So it's simply not unpacked. Fetching a file out of it
involves invoking tar to selectively extract the relevant file from the giant
tarball of this repo.
That, in turn, takes around 7 seconds on the fastest MacBook available, for
each and every package, in serial. Also, Nix checks the binary caches for each
and every one, further compounding the fail.
I optimized it to take about 7 seconds, *total*. Although I *am* a witch, I
think that there is some generally applicable intuition derived from this that
can be used to make IFD go fast.
# Making IFD go fast
Nix is great at building big graphs of dependencies in parallel and caching
them. So what if we ask Nix to do that?
How can this be achieved?
What if you demand only one big derivation be built with IFD, then reuse it
across all the usage sites?
## Details of `some-cabal2nix`
My observation was that hot-cache builds with a bunch of IFD are fine; it's
refilling the cache that's horribly painful, since Nix spends a lot of time
doing pointless things in serial. What if we warmed up the cache by asking Nix
to build all that stuff in one shot? Then the rest of the IFD would hit a hot
cache.
*The* major innovation in the fix, which I called `some-cabal-hashes`, is that
it builds *one* derivation with IFD that contains everything needed for further
evaluation, so all the following imports hit that already-built derivation.
Specifically, my build dependency graph now looks like:
```
/- cabal2nix-pkg1 -\
some-cabal2nix -+- cabal2nix-pkg2 -+- some-cabal-hashes -> all-cabal-hashes
\- cabal2nix-pkg3 -/
```
There are two notable things about this graph:
1. It is (approximately) the natural graph of the dependencies of the build
assuming that the Nix evaluator could keep going when it encounters IFD.
2. It allows Nix to naturally parallelize all the `cabal2nix-*` derivations.
Then, all of the usage sites are `import "${some-cabal2nix}/pkg1"` or similar.
In this way, only one derivation is demanded at evaluation time, letting Nix do
what it's good at. I also did something clever: I made `some-cabal2nix` have no
runtime dependencies by *copying* all the resulting cabal files into the output
directory. Thus, the whole thing can be fetched from a cache server and not
built at all.
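The shape of the pattern looks roughly like the following sketch (this is not
the code from work; all names and the exact plumbing are made up, and
`cabalDirs` is assumed to map package names to directories that each contain
that package's `.cabal` file):
```nix
{ pkgs, cabalDirs }:
let
  # One small derivation per package; Nix can build these in parallel.
  perPackage = pkgs.lib.mapAttrs
    (name: dir:
      pkgs.runCommand "cabal2nix-${name}" { } ''
        ${pkgs.cabal2nix}/bin/cabal2nix ${dir} > $out
      '')
    cabalDirs;

  # The single derivation demanded by IFD. It *copies* the generated
  # expressions, so it has no runtime dependencies and can be substituted
  # from a binary cache in one fetch.
  some-cabal2nix = pkgs.runCommand "some-cabal2nix" { } ''
    mkdir -p $out
    ${pkgs.lib.concatStrings (pkgs.lib.mapAttrsToList
      (name: drv: "cp ${drv} $out/${name}.nix\n")
      perPackage)}
  '';
in
  # Every usage site is an import out of the one already-built derivation.
  pkgs.lib.mapAttrs (name: _: import "${some-cabal2nix}/${name}.nix") cabalDirs
```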
Acquiring the data to know what will be demanded by any IFD is the other piece
of the puzzle, of course. I extracted that data from the overlays by first
calling the overlays with stubs (to avoid a cyclic dependency), then evaluating
for real with a `callHackage` function backed by the `some-cabal2nix` built
from that information.
The last and very important optimization I did was to fix the `tar` invocation.
`tar` files have a linear structure that is perfect for making `t`ape
`ar`chives (hence the name of the tool) which can be streamed to a tape: one
file after the other, without any index. Thus, finding a file in the tarball
takes `O(n)` time, where `n` is the number of files in the archive.
If you call `tar` once per file for the `m` files you need, you do `O(n*m)`
work. However, if you call `tar` *once* with the set of files you want, it can
do a single pass through the archive with an `O(1)` membership check against
that set, so the overall time complexity is `O(n)`. I assume that is what `tar`
actually does, since it finishes the entire extraction in basically the same
time with a long file list as with a single file.
Enough making myself sound like I am in an ivory tower with big O notation;
regardless, extracting with a file list yielded a major performance win.
I also found that if you use the `--wildcards` option, `tar` is extremely slow,
and it seems to get worse with more files to extract. Use exact file paths
instead.
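As a sketch of what that single invocation can look like inside a derivation
(`all-cabal-hashes` is assumed to be the fetched repository tarball, and the
member paths are made up for illustration):
```nix
{ pkgs, all-cabal-hashes }:
pkgs.runCommand "some-cabal-hashes" { } ''
  mkdir -p $out
  # One pass over the archive, with exact member paths and no --wildcards.
  # (For a very long list, --files-from=<file> does the same thing.)
  tar -xzf ${all-cabal-hashes} -C $out \
    some-prefix/aeson/2.1.1.0/aeson.cabal \
    some-prefix/lens/5.2.2/lens.cabal
''
```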
FIXME: get permission to release the sources of that file