build systems Content

2024-01-27 19:42:48 -08:00 · 2024-01-27 19:42:48 -08:00 · 0c81e45ce4
commit 0c81e45ce4
parent 7d10d64233
1 changed files with 249 additions and 0 deletions
--- a/content/posts/build-systems-ca-tracing.md
+++ b/content/posts/build-systems-ca-tracing.md
@@ -0,0 +1,249 @@
 +++
 date = "2024-01-27"
 draft = false
 path = "/blog/build-systems-ca-tracing"
 tags = ["build-systems", "nix"]
 title = "Build systems: content addressed tracing"
 +++
 An idea I have lying around is something I am going to call "ca-tracing" for
 the purposes of this post. The concept is to instrument builds, observe what
 they actually did, and record that for future iterations. Excess dependencies
 can then be ignored: *even if inputs changed*, a rebuild can be skipped when
 the instructions are the same and the files actually observed by the build
 are the same.
 # Implementation
 ## Assumptions
 This idea assumes a hermetic build system: we need to know if anything
 might have differed from build to build, so we need a complete accounting of
 the inputs to the build. Such a hermetic build system need not be Nix-like;
 however, the idea is easiest to describe on top of a Nix-like system: first
 one with build identity, then one that lacks build identity, like Nix.
 This also assumes a content-addressed build system with early cut-off like Nix
 with [ca-derivations]. In Nix's case, input-addressed builds are executed, then
 renamed to a content-addressed path: if a build with different inputs is
 executed once more with the same output, it is recorded as resolving to that
 output, and further builds are cut off.
 [ca-derivations]: https://www.tweag.io/blog/2021-12-02-nix-cas-4/
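To make early cut-off concrete, here is a minimal sketch in Python. Everything in it (`ToyStore`, the `resolved` and `contents` tables, the toy builders) is a hypothetical stand-in rather than Nix's actual data model; it only shows the shape of "different inputs, same output, downstream build skipped".

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:16]


class ToyStore:
    """Hypothetical toy store; not Nix's actual data model."""

    def __init__(self):
        self.resolved = {}  # input-addressed key -> content hash of the output
        self.contents = {}  # content hash -> output bytes

    def build(self, instructions, inputs, builder):
        """Return (content hash of output, whether the builder actually ran)."""
        key = digest(repr((instructions, sorted(inputs.items()))).encode())
        if key in self.resolved:
            return self.resolved[key], False
        output = builder(inputs)
        out_hash = digest(output)
        self.contents[out_hash] = output
        self.resolved[key] = out_hash
        return out_hash, True


def strip_comments(inputs):
    lines = inputs["main.c"].decode().splitlines()
    return "\n".join(l for l in lines if not l.startswith("//")).encode()


def link(inputs):
    return b"linked:" + inputs["obj"]


store = ToyStore()
# Two builds whose inputs differ only in a comment: both run, same output.
obj1, ran1 = store.build("strip", {"main.c": b"// v1\nint main(){}"}, strip_comments)
obj2, ran2 = store.build("strip", {"main.c": b"// v2\nint main(){}"}, strip_comments)
# Downstream builds refer to their input by its content hash, so the
# second one has an identical key and is cut off.
exe1, ran3 = store.build("link", {"obj": obj1.encode()}, link)
exe2, ran4 = store.build("link", {"obj": obj2.encode()}, link)
```

Note that the second `strip` build still had to run in order to discover that its output was unchanged; ca-tracing is about skipping that run as well.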
 <aside>
 Build identity is a term I invented referring to the idea that a build can know
 about previous builds. Systems without build identity include those, such as
 Nix, which identify builds entirely by hashes and in which names are
 meaningless. Build identity is an assumption that causes problems for
 multitenancy in build systems, since there may be several versions of a
 package being built at any given time, each based on a different version.
 I've [used the term
 in a previous post][postmodern-build-sys].
 [postmodern-build-sys]: https://jade.fyi/blog/the-postmodern-build-system/
 There may be a recognized term for this property that I have not found; please
 [email me](https://jade.fyi/about) or poke me on Mastodon if you know it.
 </aside>
 ## Conceptual implementation
 Conceptually, a build is a function:
 > (*inputs*, *instructions*) -> *outputs*
 We wish to narrow *inputs* to *inputs<sub>actual</sub>*, and save this
 information alongside *outputs*. In a following build, we can then verify if
 *instructions'* matches a previous build (*instructions*) and if so, extract
 the values of the same dynamically observed *inputs'<sub>actual</sub>*, but
 relative to *inputs'* and compare them to the values of
 *inputs<sub>actual</sub>* from the previous build.
 Since our build system is hermetic, if this hits cache, it can be assumed to have
 identical results, modulo any nondeterminism (which we assume to be
 unfortunate but unproblematic, and which is there regardless of this technique).
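The following is a rough sketch of that check, assuming a toy model where *inputs* is a dict from path to contents and the builder reads files only through a `read` callback. `Trace`, `traced_build` and `try_reuse` are names made up for this post.

```python
import hashlib
from dataclasses import dataclass


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Trace:
    instructions: str
    inputs_actual: dict  # path -> hash of the contents the build observed
    outputs: bytes


def traced_build(instructions, inputs, builder):
    """Run the builder and record which inputs it actually read."""
    observed = {}

    def read(path):
        data = inputs[path]
        observed[path] = digest(data)
        return data

    return Trace(instructions, observed, builder(read))


def try_reuse(trace, new_instructions, new_inputs):
    """Return the old outputs if inputs'_actual equals inputs_actual."""
    if new_instructions != trace.instructions:
        return None
    for path, expected in trace.inputs_actual.items():
        if path not in new_inputs or digest(new_inputs[path]) != expected:
            return None
    return trace.outputs


def compile_main(read):
    # Toy "compiler": only ever looks at main.c and a.h.
    return b"obj(" + read("main.c") + b"|" + read("a.h") + b")"


inputs = {"main.c": b"#include <a.h>", "a.h": b"int a;", "b.h": b"int b;"}
trace = traced_build("cc main.c", inputs, compile_main)

changed_unused = {**inputs, "b.h": b"int b2;"}  # b.h was never read
changed_used = {**inputs, "a.h": b"long a;"}    # a.h was read
hit = try_reuse(trace, "cc main.c", changed_unused)
miss = try_reuse(trace, "cc main.c", changed_used)
```

Here `hit` is the previous output, because the only changed file was never observed, and `miss` is `None`.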
 ## Making it concrete
 A build ("derivation" in Nix) in a Nix-like system is a specification of:
 * Inputs (files, other derivations)
 * Environment variables
 * Command to execute
 The point of ca-tracing is to remove excess inputs, so let's contemplate how to
 do that.
 ### File names
 The inputs are files named based on `hash(contents)` in Nix, but we don't
 know which contents we will actually access. This is a problem: the file
 paths of *inputs* need to remain constant across multiple executions of the
 build (the paths for *inputs* must equal the paths for *inputs'*), because the
 part of *inputs* that changed may be irrelevant to this build.
 In a system that doesn't look like Nix, the input file paths might be the same
 across two builds on account of not containing hashes, so this would not be a
 problem.
 We can solve the file names problem by replacing the hash parts in the input
 filenames with random values per-run. These hashes should never appear, even in
 part, in the output, if the builder is not doing things with them that would
 render the build non-deterministic.
 Unfortunately the file names may appear in the output indirectly, for instance
 through the ordering of deterministic hash tables, which could be a problem;
 this exists in practice in ELF hash tables. Realistically we would need
 file-type-specific rewriters to fix up execution output to a deterministic
 result following multiple runs.
 We would also have to rewrite those hashes within blocks of data read from
 within the builder, but that's *possibly* just a few FUSE crimes away to be
 able to do live, on-demand.
 Following the build, the temporary hashes of the inputs can be replaced with
 their concrete values pointing to the larger inputs †.
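As a sketch of the randomize-then-restore step, assuming store paths of the form `/nix/store/<32-char hash>-<name>` and an output that contains the paths literally (which is exactly the naive assumption criticized later in this post): `randomize` and `restore` are hypothetical helpers, and the store path is made up.

```python
import secrets

# Nix's base32 alphabet (no e, o, t, u); the hash part is 32 such characters.
ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"
PREFIX = "/nix/store/"


def random_hash():
    return "".join(secrets.choice(ALPHABET) for _ in range(32))


def randomize(input_paths):
    """Map each real input path to a per-run path with a random hash part."""
    mapping = {}
    for path in input_paths:
        _, name = path[len(PREFIX):].split("-", 1)
        mapping[path] = f"{PREFIX}{random_hash()}-{name}"
    return mapping


def restore(output, mapping):
    """After the build, swap the per-run paths back to the real ones."""
    for real, fake in mapping.items():
        output = output.replace(fake.encode(), real.encode())
    return output


# A made-up store path; the hash part is just the alphabet itself.
zlib_path = f"{PREFIX}{ALPHABET}-zlib-1.3"
mapping = randomize([zlib_path])
fake_path = mapping[zlib_path]

# Stand-in for a build that embeds the path it was given.
raw_output = b"RPATH=" + fake_path.encode() + b"/lib"
final_output = restore(raw_output, mapping)
```

The random hash has the same length as the real one, so offsets inside the output do not shift when the real path is substituted back.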
 <aside>
 † This creates a similar content-addressing equivalence problem as
 [ca-derivations] themselves could introduce if they were differently designed,
 where two paths might mean the same thing. The solution adopted by
 ca-derivations is to hash the output with placeholders in place of its own hash
 and then substitute the hash of the path within all files in it.
 Specifically, consider a derivation Dep that depends on a derivation A.
 Derivation A changes some file not looked at by Dep, producing derivation B,
 and Dep has its rebuild skipped. Should the resulting path for Dep point to A
 or B?
 Perhaps the solution here is to use a content-addressed store or filesystem
 with block cloning (zfs, btrfs, xfs) for which shoving duplicates in it is
 ~free, and actually *realize* the value of *inputs<sub>actual</sub>* to disk.
 This would sadly not eliminate the need for randomizing and rewriting input
 paths due to causality, since we simply do not know what paths are referenced
 yet.
 </aside>
 ### Tracing, filesystem
 To trace a build, one would have to record its filesystem activity. This is
 possible with some BPF tracing constrained to some cgroup on Linux, so that is
 not the hard part.
 The data that would have to be known is:
 * Observed directory listings with hashes
 * Read file names matching *inputs*, with associated hashes
 * Extremely annoyingly: `fstat(2)` results for all queried files in inputs
  (this is extremely annoying because everything calls `fstat` all the time
  pointlessly or to check for files being present, and it includes things like
  the length of a file, which could *in principle* cause unsoundness if not
  recorded).
 This would then all be compared to the equivalent paths in *inputs'* and if the
 hashes match, the previous build could be immediately used.
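A sketch of what such a record and its comparison could look like. There is no BPF here: the hypothetical `Tracer` is told about each access by hand, and the stat result is reduced to file type and size, which is an assumption about which fields matter.

```python
import hashlib
import os
import stat
import tempfile


def digest(data):
    return hashlib.sha256(data).hexdigest()


def observe(root, kind, rel):
    """Re-derive one observation of `rel` under `root`."""
    path = os.path.join(root, rel)
    if kind == "list":
        return digest("\0".join(sorted(os.listdir(path))).encode())
    if kind == "read":
        with open(path, "rb") as f:
            return digest(f.read())
    if kind == "stat":
        try:
            st = os.stat(path)
        except FileNotFoundError:
            return None  # a failed lookup is an observation too
        return (stat.S_IFMT(st.st_mode), st.st_size)
    raise ValueError(kind)


class Tracer:
    """Stands in for real tracing: accesses are reported to it manually."""

    def __init__(self, root):
        self.root = root
        self.observations = {}  # (kind, relative path) -> observed value

    def access(self, kind, rel):
        self.observations[(kind, rel)] = observe(self.root, kind, rel)


def still_valid(observations, new_root):
    """Replay every observation against the new inputs and compare."""
    try:
        return all(
            observe(new_root, kind, rel) == value
            for (kind, rel), value in observations.items()
        )
    except OSError:
        return False  # something that was read or listed is now missing


def write(root, rel, data):
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    old, new_ok, new_bad = (os.path.join(tmp, d) for d in ("old", "ok", "bad"))
    for root in (old, new_ok, new_bad):
        write(root, "include/a.h", b"int a;")
        write(root, "include/b.h", b"int b;")
    write(new_ok, "include/b.h", b"int b2;")  # changed, but never observed
    write(new_bad, "include/config.h", b"")   # newly present, was probed

    tracer = Tracer(old)
    tracer.access("stat", "include/config.h")  # probe for an optional file
    tracer.access("read", "include/a.h")

    hit = still_valid(tracer.observations, new_ok)
    miss = still_valid(tracer.observations, new_bad)
```

The `miss` case is why negative results have to be recorded: the build only ever learned that `config.h` was absent, and that is no longer true.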
 ## Avoiding build identity; how would this work in Nix?
 Nix is built on top of an on-disk key-value store (namely, the directory
 `/nix/store`), which is a mapping:
 > Hash -> Value
 Thus, we just need to construct a hash in such a way that both Build and Build'
 get the same hash value.
 We could achieve this by modifying the derivation in a deterministic manner
 such that two modified-derivations share a hash if they could plausibly have
 ca-tracing applied. Specifically, rewrite the input hashes to something like
 the following:
 > hash("ca-tracing" + name + position-in-inputs) + "-" + name
 When a build is invoked, modify the derivation, hash it, and check for the
 presence of a record of a modified-derivation of the same hash, and then check
 if the actually-used filesystem objects when applied to *inputs'* remain the
 same.
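A minimal sketch of that keying scheme, assuming a derivation is a plain dict serialized as JSON (real Nix derivations are not stored this way) and that input hashes only occur literally inside it. `modified_derivation_key` and `make_drv` are hypothetical names, and the hashes are made up.

```python
import hashlib
import json

PREFIX = "/nix/store/"


def h(text):
    return hashlib.sha256(text.encode()).hexdigest()[:32]


def modified_derivation_key(drv):
    """Hash the derivation with every input hash replaced by a placeholder
    that depends only on the input's name and position."""
    text = json.dumps(drv, sort_keys=True)
    for position, path in enumerate(drv["inputs"]):
        hash_part, name = path[len(PREFIX):].split("-", 1)
        placeholder = h(f"ca-tracing{name}{position}")
        # Also catches mentions in env vars and the command line.
        text = text.replace(hash_part, placeholder)
    return h(text)


def make_drv(zlib_hash, command):
    zlib = f"{PREFIX}{zlib_hash}-zlib-1.3"
    return {"inputs": [zlib], "env": {"ZLIB": zlib}, "command": command}


build = make_drv("a" * 32, "cc main.c")
build_prime = make_drv("b" * 32, "cc main.c")  # only an input hash changed
other = make_drv("a" * 32, "cc -O2 main.c")    # the instructions changed
```

`build` and `build_prime` land on the same key, so the trace recorded for one can be checked against the inputs of the other; `other` gets a different key and is never a candidate.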
 # Use cases
 This idea is almost certainly best suited for builds using the smallest
 possible unit of work, both in terms of usefulness and likelihood of bugs in
 the rewriting. To use the terminology from [Build Systems à la Carte][bsalc],
 it is likely most useful for systems that are closer to constructive traces
 than deep constructive traces.
 [bsalc]: https://www.microsoft.com/en-us/research/uploads/prod/2018/03/build-systems.pdf
 For example, if this is applied to individual compiler jobs in a C++ project,
 it can eliminate rebuilds from imprecise build system dependency tracking,
 whereas if the derivation/unit of work is larger, the rebuild might be
 necessary anyway.
 # Problems
 * There could exist multiple instances of a modified-derivation with different
  filesystem activity, due to, say, a bunch of rebuilds against very
  differently patched inputs. This system would have to be able to either
  represent that or just discard old ones.
 * Real programs abuse `fstat(2)` way too much, and it's quite possible that this
  whole thing would not actually get any cache hits in practice if `fstat`
  calls are considered. Without visibility into processes we cannot know if
  `fstat` calls' results are actually used for anything more than checking if a
  file exists.
  This might benefit from some limited dynamic tracing inside processes to
  determine whether the `fstat` result is actually read.
 * The whole enterprise is predicated on generalized sound rewriting, which is
  likely very hard; see below.
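For the first problem, one possible shape is to keep a small bounded list of traces per modified-derivation hash and try each in turn. This is only a sketch; `TraceCache`, its eviction policy and the string "hashes" are all assumptions for illustration.

```python
from collections import deque


class TraceCache:
    def __init__(self, max_traces_per_key=4):
        self.max_traces_per_key = max_traces_per_key
        self.traces = {}  # modified-derivation hash -> deque of traces

    def record(self, key, inputs_actual, output):
        bucket = self.traces.setdefault(
            key, deque(maxlen=self.max_traces_per_key)
        )
        # Newest first; once full, the oldest trace falls off the other end.
        bucket.appendleft((inputs_actual, output))

    def lookup(self, key, current_hash_of):
        """Return the output of the first trace whose observed files all
        still have the same hashes in the current inputs."""
        for inputs_actual, output in self.traces.get(key, ()):
            if all(
                current_hash_of(path) == expected
                for path, expected in inputs_actual.items()
            ):
                return output
        return None


cache = TraceCache(max_traces_per_key=2)
cache.record("key", {"a.h": "h1"}, "out-plain")
cache.record("key", {"a.h": "h2", "patch.h": "h3"}, "out-patched")

# Two differently patched input sets, as path -> current hash lookups.
plain = {"a.h": "h1"}.get
patched = {"a.h": "h2", "patch.h": "h3"}.get
hit_plain = cache.lookup("key", plain)
hit_patched = cache.lookup("key", patched)

cache.record("key", {"a.h": "h4"}, "out-newest")  # evicts the oldest trace
evicted = cache.lookup("key", plain)
```

Discarding old traces only costs cache hits, never correctness, so the bound can be as crude as this.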
 ## Naive rewriting is a bad idea
 The implementation of ca-derivations itself, where it just rewrites hashes
 appearing in random binaries with the moral equivalent of `sed`, is extremely
 unsound with respect to compression, ordered structures (even NAR files would
 fall victim to this), and any other kind of non-literal storage of store paths.
 This approach just adds yet more naive rewriting that is likely to explode
 spectacularly at runtime.
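Two of those failure modes are easy to reproduce in a few lines. This sketch uses `bytes.replace` as the stand-in for `sed`, zlib for "compression", and a sorted list with binary search for "ordered structures"; the store paths are made up.

```python
import bisect
import zlib

# Made-up store paths; NEW sorts before OTHER, OLD sorts after it.
OLD = b"/nix/store/" + b"z" * 32 + b"-dep"
NEW = b"/nix/store/" + b"a" * 32 + b"-dep"
OTHER = b"/nix/store/" + b"m" * 32 + b"-other"


def naive_rewrite(blob, old, new):
    return blob.replace(old, new)


# 1. Compression: the path is not stored literally, so nothing is rewritten
#    and the stale reference is still there after decompression.
compressed = zlib.compress(b"RPATH=" + OLD + b"/lib\n")
rewritten = naive_rewrite(compressed, OLD, NEW)
stale_reference = OLD in zlib.decompress(rewritten)


# 2. Ordered structures: a table sorted by path, queried by binary search.
def contains(sorted_table, key):
    i = bisect.bisect_left(sorted_table, key)
    return i < len(sorted_table) and sorted_table[i] == key


table = sorted([OLD, OTHER])
rewritten_table = [naive_rewrite(entry, OLD, NEW) for entry in table]
found_before = contains(table, OLD)
found_after = contains(rewritten_table, NEW)
```

In the second case the rewrite "succeeds", but the table is no longer sorted, so the lookup for the new path fails even though the entry is present.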
 Naively rewriting store paths is an extension of the original idea of Nix doing
 runtime dependencies by naively scanning for reference paths. However,
 crucially, the latter does not *modify* random binaries without any knowledge
 of their contents, and the worst case scenario for that reference scanning is a
 runtime error when someone downloads a binary package.
 Realistically, this would have to be done with a "[diffoscope] of rewriters",
 which can parse any format and rewrite references in it. We can check soundness of a
 build under rewriting by simply running it more times. The rewriter need
 not be a trusted component, since its impact is only as far as breaking your
 binaries (reproducibly so), which Nix is great at already!
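A sketch of that check, with string-returning toy builders and a trivial rewriter that maps each per-run path back to a canonical placeholder. All names are hypothetical, and sorting by path stands in for any hash-dependent ordering such as a hash table.

```python
import secrets


def run_with_random_paths(builder, input_names):
    """Run the builder once with fresh random hashes, then canonicalize."""
    fake = {
        name: f"/nix/store/{secrets.token_hex(16)}-{name}"
        for name in input_names
    }
    output = builder(list(fake.values()))
    for name, path in fake.items():
        output = output.replace(path, f"/nix/store/<{name}>")
    return output


def appears_sound(builder, input_names, runs=8):
    """True if every run gives the same output after rewriting."""
    outputs = {run_with_random_paths(builder, input_names) for _ in range(runs)}
    return len(outputs) == 1


def sound_builder(paths):
    # Emits inputs in the order they were given.
    return "\n".join(f"-L{p}/lib" for p in paths)


def unsound_builder(paths):
    # Ordering depends on the hash part of each path.
    return "\n".join(f"-L{p}/lib" for p in sorted(paths))


names = ["zlib", "openssl", "curl", "libpng"]
sound_ok = appears_sound(sound_builder, names)
unsound_ok = appears_sound(unsound_builder, names)
```

This can only ever find unsoundness, not prove its absence; an unlucky sequence of random hashes could in principle make the unsound builder agree with itself on every run.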
 In an actual implementation, I would even go so far as saying the rewriter
 *must not* be part of Nix since it is generally useful, and it is fundamentally
 something that would have to move pretty fast and perhaps even have per-project
 modifications such that it cannot possibly be in a Nix stability guarantee.
 [diffoscope]: https://diffoscope.org/
 # Related work
 This is essentially the idea of edef's incomplete project [Ripple], an
 arbitrary-program memoizer, among other work, but significantly scaled down to
 be less general and possibly more feasible. Compared to her project, this idea
 doesn't look into processes at all, and simply involves tracing filesystem
 accesses to read-only resources in an already-hermetic build system.
 Thanks to edef for significant feedback and discussion about this post. You can
 [sponsor her on GitHub here][edef-gh] if you want to support her work on making
 computers more sound such as the Nix content addressed cache project, tvix, and
 also her giving these ideas to Arch Linux developers.
 [edef-gh]: https://github.com/sponsors/edef1c
 [Ripple]: https://nlnet.nl/project/Ripple/