+++
date = "2023-02-05"
draft = true
path = "/blog/the-nix-pitch"
tags = ["nix"]
title = "The Nix pitch"
+++

I have probably acquired a reputation as a completely unrepentant Nix
shill. This blog post is pure Nix shilling, but it intends to explain why Nix
is so transformational that people shill it so much.

This post is partially inspired by a line of thinking raised by [Houyhnhnm
Computing] ("hyou-nam"), a blog series about an alternate universe of
computing, framed as what would happen if the sentient horses from Gulliver's
Travels saw human computers. If you want more examples of the absurdities of
how our computers work, and of how much better they could be, I highly
recommend reading that site.

[Houyhnhnm Computing]: https://ngnghm.github.io/

There are a few properties that make the Nix ecosystem extremely interesting:

* Reproducibility (I promise you care about this!)
* Cross-language build system integration
* Trustworthy incremental builds, thanks to sandboxing
* Drift-free configuration management

[A friend said][mgattozzi-nix] that "unfortunately all my problems are now nix
problems if I don't understand it". This is essentially true: Nix is a machine
for converting otherwise-unsolved packaging and deployment problems into Nix
problems.

[mgattozzi-nix]: https://twitter.com/mgattozzi/status/1617604038517817344

The Nix ecosystem consists of the following components (which do not
necessarily need to be used at the same time):

* Nix, a sandboxed build system with binary caching.

  Nix knows how to build "derivations": build descriptions that amount to
  shell scripts. If the result of a derivation already exists on a binary
  cache, it is fetched instead of built locally.
* Nix language, a functional domain-specific language for writing things to be
  built by Nix.

  If we consider the Nix language to be "Haskell" and derivations to be
  "Bash", Nix is a compiler from Haskell to Bash. (A minimal example follows
  this list.)
* nixpkgs, the package collection used by NixOS (also usable without NixOS,
  and on macOS). It is the largest and most up-to-date distro repository on
  the planet.

  It allows composing multiple language ecosystems, with extensive programming
  language support.

  It has some of the best Haskell packaging available anywhere, and the only
  distro packaging of Haskell worth using for development.
* NixOS, a configuration management system masquerading as a Linux
  distribution. It uses a domain-specific language embedded in the Nix
  language to define system images that are then built with Nix.

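To make the "compiler from Haskell to Bash" framing concrete, here is a minimal
sketch of a derivation (the package name and the shell commands are made up for
illustration):

```nix
# default.nix: a tiny derivation built with nixpkgs' stdenv.
{ pkgs ? import <nixpkgs> {} }:

pkgs.stdenv.mkDerivation {
  name = "hello-greeting";
  # No source tarball: build entirely from the shell snippets below.
  dontUnpack = true;
  buildPhase = ''
    echo "hello from a sandboxed build" > greeting.txt
  '';
  installPhase = ''
    mkdir -p $out
    cp greeting.txt $out/
  '';
}
```

Running `nix-build` on this file compiles the Nix expression down to a
derivation (the "Bash"), builds it in the sandbox, and leaves a `./result`
symlink pointing into `/nix/store`.
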
# Case studies

Let's go through some case studies of frustrating problems you may have had
with computers that don't happen in the Nix universe.

## Case study: Docker image builds

Traditionally, Docker images are built by running some shell scripts inside a
containerized system, with full network access. These are *impossible* to
reproduce: the very instant you run `apt update && apt upgrade`, your image is
no longer reproducible. Let's tell a story.

You're working on your software one day, and unbeknownst to you, `libfoo` has a
minor upgrade in your distribution of choice. You rebuild your images and
production starts experiencing random segmentation faults. So you revert it and
go investigate why the new image is broken. This has happened a few times
before, and you never found out why. It seems to happen at random when you
upgrade the base image, so you stick with the same base image from a year ago,
from before the last upgrade.

Today, you have received a feature request: generate screenshots of the website
to embed in Discord previews. Sweet, just add headless Chromium to the
Dockerfile, and... oops, it's been deleted from the mirror because the version
is too old, and updating the package database with `apt update` would require
fixing `libfoo` as well as `libbar` (since that also broke in the meantime).
Damn it!

Also, your image is 700 MB, because it includes several toolchains, an ssh
client, git, and other things necessary to build the software. You could copy
just the built product out, but doing so would require writing an integration
test for the whole thing to ensure that nothing of importance was removed.

---

What went wrong? Dockerfiles don't specify their dependencies fully: they fetch
arbitrary content off the internet, which may change without warning, and are
thus impossible to reproduce. There is no way to tell whether software actually
requires some path to exist at runtime. It is impractical to use multiple
versions of the package set concurrently while working through
incompatibilities in other parts of the software: an upgrade is all or nothing.

What if you could declaratively specify system packages in the build
configuration, avoid pulling build dependencies into the runtime image, and
have everything come from a lockfile so it doesn't change unexpectedly? What if
you could pull only Chromium from the latest distro repositories while working
on migrating the rest?

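This is roughly the workflow nixpkgs' `dockerTools` offers. A sketch (the image
name and entrypoint are made up; `chromium` and `cacert` are real nixpkgs
packages):

```nix
# image.nix: build an OCI/Docker image without a Dockerfile.
{ pkgs ? import <nixpkgs> {} }:

pkgs.dockerTools.buildLayeredImage {
  name = "screenshot-service";
  tag = "latest";
  # Only the runtime closure of these packages ends up in the image:
  # no apt, no compilers, no ssh client.
  contents = [ pkgs.chromium pkgs.cacert ];
  config.Cmd = [ "${pkgs.chromium}/bin/chromium" "--headless" ];
}
```

Because everything that goes into the image is pinned by the Nix expression and
its lockfile, rebuilding it next year gives you the same image, not whatever
the mirror happens to contain that day.
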
Is this a broader failure of Linux distributions due to choosing global
coherence with respect to [the golden rule of software
distributions][golden-rule]? Can we have nice things?

[golden-rule]: https://www.haskellforall.com/2022/05/the-golden-rule-of-software.html

## Case study: configuration drift with Ansible

Ansible is a popular configuration management system that brings systems into
an expected state by ssh'ing into them and running Python scripts. I used it
seven years ago to build a lab environment, and I hope it has improved since,
but my beefs with it are at the design level. Story time!

A year ago, you added a log shipper, filebeat, to send logs to an Elasticsearch
cluster that aggregates all the logs. Recently, you changed the application to
send all logs to the systemd journal, to introduce structured logging. You
changed the service to use journalbeat and deleted the old filebeat service
configuration, but for some reason you're getting duplicate log entries. What?

You build a new machine and it does not exhibit the same behaviour.

You look at one of the machines and realize it is running filebeat and
journalbeat concurrently. Whoops. You forgot to set the state of the old
filebeat service to `stopped`, and instead simply deleted the rule. Because
Ansible doesn't know about things it does not manage, the system *contains
configuration that diverges from what is checked in* to the git repository
holding the configuration.

---

What went wrong? Ansible doesn't own the system; it merely manages its own area
of the system. "You should have used [HashiCorp Packer]" rings through your
head. Building new system images and deleting the old machines is a great
solution to this issue, but it suffers exactly the same problem as Docker
during the image build process. If that is acceptable, it is honestly a great
improvement over mutable configuration-management systems.

[HashiCorp Packer]: https://www.packer.io/

Imagine if you could change the configuration and know that none of the old one
was still around. Imagine being able to revert the entire system to an older
version, even on bare metal, without needing as big a hammer as snapshots,
which are also easy to forget to take for pure configuration changes.

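For a taste of what that looks like, here is a fragment of a NixOS
configuration (the services shown are just examples; any NixOS option behaves
the same way):

```nix
# configuration.nix fragment. Deleting a line here means the corresponding
# state is gone from the next system generation: there is nothing left
# behind to drift, and the previous generation remains available for rollback.
{ config, pkgs, ... }:
{
  services.openssh.enable = true;
  services.nginx.enable = true;
  # The old filebeat stanza was simply deleted, not set to "stopped";
  # the unit disappears from the machine on the next switch.
}
```
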
## Case study: zfs via dkms

On most distributions, if you want to use a kernel module that's not available
in the mainline kernel, you have to use `dkms`, which is essentially some
scripts for invoking `make` to build the kernel modules in question. This is
then generally hooked into the package manager so that the modules are rebuilt
every time the Linux kernel is updated. `dkms` needs to be separate from the
system package manager because the system package manager does not know how to
build patched packages, source-based packages, and similar things. Story time!

Several months ago, a new Linux kernel update broke the compilation of
zfs-on-linux. This is fine; it happens sometimes. I use zfs on the root
filesystem of my desktop machine, and I currently run Arch Linux on it. Arch,
like most distros, uses `dkms` to build these out-of-tree kernel modules.

I ran `pacman -Syu` and waited a few minutes. I then thoughtlessly closed the
terminal and restarted my computer, since there was no error visible at the
bottom of the logs. Whoops, it can't mount the root filesystem. That seems
rather important!

I then had to get out an Arch ISO to `chroot` into the system, install an older
Linux kernel, and rerun the `dkms` build.

---

What went wrong? The system package manager only knows how to handle binary
packages, which means that anything source-based is second class and is
handled via hacks such as a hook that builds out-of-tree modules at the end of
the installation process. If that hook fails, the package manager can't revert
the upgrade it just finished. By design, most binary distros' package managers
can have partial upgrade failures, and when the driver for the root filesystem
is part of such an upgrade, they render your system unbootable.

Since the distro is not "you", it may have diverging priorities or concerns:
perhaps its maintainers don't feel comfortable shipping the zfs module as a
binary, so you have to build it from source on your computer. You can't do
anything about these decisions: do binary distributions actively enable
software freedom, or ownership over your computing experience?

Imagine if system upgrades were atomic and would harmlessly fail if some
source-based dependencies could not build. Imagine if you could seamlessly
patch arbitrary packages without needing to change distributions or manually
keep track of when you have to rebuild. Imagine if there weren't a distinction
between distro packages and packages you have patched or written yourself.

Imagine if you could check in compiler patches to your work repository and
everyone would get the new binaries the next time they pull, without building
anything on their machines.

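On NixOS, the zfs situation from this story reduces to a couple of declarative
lines; the module is built as part of the system derivation, so a kernel/zfs
mismatch fails the build instead of the boot. A sketch (the `hostId` value is
an example):

```nix
# configuration.nix fragment: zfs as part of the declaratively built system.
# If the module fails to compile against a new kernel, the new generation
# simply never gets built, and the old one keeps working.
{ config, pkgs, ... }:
{
  boot.supportedFilesystems = [ "zfs" ];
  networking.hostId = "8425e349";  # required by the zfs module; example value
}
```
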
# The Nix pitch

Here is what leveraging the Nix ecosystem gets you:

* Consistent development environments for everyone working on your project,
  with arbitrary dependencies in any language: you can ship your Rust SQL
  migration tool to your TypeScript developers. The distro war is over:
  everyone gets the exact same versions of the development tools with minimal
  effort. People can run whatever distro they want, including macOS, and
  distro issues basically disappear. (A sketch of such an environment follows
  this list.)

  This also means that, for personal projects, upgrading the system compiler
  does not break the build. Upgrades are done on your terms, by updating a
  lockfile in the project itself. You can have as many versions of a program
  as you'd like on your system, and they don't interfere with each other.

  It's possible to pull some tools from a newer version of nixpkgs than is
  used for the rest of the system, and this has no negative effects besides
  disk use.
* Fast, reproducible, and small image builds for Docker, Amazon, and anything
  else with the nixpkgs infrastructure or Determinate Systems [ephemera]. You
  know an image reproduces because everything going into it is locked to a
  version.
* System configuration is no longer something to be avoided: when you work on
  your NixOS system configuration, you get the results of your work on all
  your machines, and you get them forever, since you check the configuration
  into Git.
* Patching software is easy, and you can ship arbitrary patches to the package
  set for projects anywhere you use Nix. (A sketch follows this list.)

  There is no distinction besides binary caching between packages in the
  official repositories and what you create yourself. You can run your own
  binary cache for your project and build your patched dependencies in CI. I
  didn't care about software freedom until I actually *had* it.
* You can simply roll back to previous NixOS system images if an upgrade goes
  sideways. The entire system is one derivation with dependencies on
  everything it needs, and switching configurations is a matter of running a
  script that more or less switches a symlink and fixes up any systemd
  services. System upgrades cause extremely short downtime.

  Workload configuration/version changes behave exactly the same as OS
  updates.
* You don't have to think about the disparate configuration formats various
  programs use on NixOS. You just write your nginx config in Nix and it's no
  big deal.
* Software is composable in Nix: you can build a Haskell program that depends
  on a Rust library without tearing your hair out, since Cabal can just look
  in pkg-config and does not have to know how to build any of it.

  Machine learning Python libraries require funny system packages? Nix just
  makes the Python libraries depend on the system packages.
* If you've used Arch, you may like the Arch User Repository. It is
  unnecessary under Nix: nixpkgs is liberal in what it accepts as packages,
  and it is both the largest and most up-to-date distro repository out there.

  Since Nix is a source-based build system, you can just package what you need
  and put it in your configuration, to upstream later or never.

  You can even get proprietary software: you can literally install the huge
  Intel Quartus toolchain for FPGA development from nixpkgs.

  Need patched software? Patch it: it's a few lines of Nix code to create a
  modified package based on one defined in nixpkgs, which will naturally be
  rebuilt if it changes upstream.

  The critical insight into why nixpkgs is so large is that maintainers aren't
  special. I maintain packaging in nixpkgs for packages which I also develop.
  Another reason for its success is that packages can depend on, for example,
  an older llvm: global coherence is not required, and multiple versions of
  libraries can and do exist.

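Here is a sketch of such a shared development environment, as a `shell.nix`
(the tool choices are just examples; all of these attributes exist in
nixpkgs):

```nix
# shell.nix: one environment for everyone on the project.
{ pkgs ? import <nixpkgs> {} }:

pkgs.mkShell {
  packages = [
    pkgs.nodejs        # the TypeScript side
    pkgs.typescript
    pkgs.sqlx-cli      # a Rust SQL migration tool, shipped to everyone
    pkgs.postgresql
  ];
}
```

Anyone who runs `nix-shell` in the repository gets exactly these tools, on any
distro or on macOS, pinned to whatever revision of nixpkgs the project locks.
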
[ephemera]: https://twitter.com/grhmc/status/1575518762358513665

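And a sketch of what "patch it" looks like, as a nixpkgs overlay (the patch
file name is made up; `curl` stands in for whatever you need to change):

```nix
# overlays/patched-curl.nix: swap in a locally patched curl.
# Everything in the package set that depends on curl is rebuilt against it.
final: prev: {
  curl = prev.curl.overrideAttrs (old: {
    patches = (old.patches or [ ]) ++ [ ./add-debug-logging.patch ];
  });
}
```
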
## It's not all rosy

Nix has a lot of ways it needs to grow, especially in governance.

* Documentation is poor. Often the best choice is to read the nixpkgs source
  code, an activity [for which I have a guide][finding-functions-in-nixpkgs].

  There has been much work to make this better, but it is a somewhat
  fragmented effort, hampered by both Flakes and the new CLI being in limbo
  for a long time.
* Tooling isn't the greatest.

  The UX design of the `nix` CLI is not very good, with unfortunate design
  decisions such as the command to update everything being:

  > `nix flake update`

  However, to update one input:

  > `nix flake lock --update-input nixpkgs`

  This is [filed upstream](https://github.com/NixOS/nix/issues/5110) and is
  thankfully showing slow movement in a good direction.

  The older `nix-build`/`nix-shell`/`nix-instantiate`/`nix-store` CLI design
  is more troublesome, since it crystallized over many years rather than being
  designed up front.

  There are some language servers for the Nix language, namely `rnix-lsp` and
  `nil`, and both are OK, but their job is made much harder by Nix being a
  dynamic language and by some of the patterns commonly used in Nix code being
  implemented in libraries, rendering their analysis challenging at best.

  For example, package definitions in nixpkgs are written as functions taking
  their dependencies as arguments. Static analysis of this is nearly hopeless
  without seeing the call site: you don't know anything about these values. (A
  sketch of this shape follows at the end of this list.)

  The NixOS domain-specific language is evaluated entirely in the Nix
  language, which slows it down and makes diagnostics challenging.
* Currently there are significant governance issues.

  There are conflicts of interest with the major corporate sponsor of Nix,
  Determinate Systems, which employs many people in the Nix community. For
  example, the sudden introduction of [Zero to Nix][ztn] alienated [some of
  the official docs team][ztn-docs].

  This conflict of interest is especially relevant with respect to Flakes, the
  "experimental" built-in lockfile/project-structure system, which was
  developed as consulting work (by people now working at Determinate Systems)
  for Target *first*, then brought [to RFC][flake-rfc] in experimental form;
  the RFC was ultimately closed. The great flakification was done amidst the
  `nix` CLI redesign (also experimental), which has now been strongly tied to
  flakes with non-flakes usage as an afterthought, in spite of the
  composability issues with flakes, such as the inability to have dependencies
  between flakes in the same Git repository, and thus incompatibility with
  monorepos.

  Currently, the state of flakes is that a lot of people use them in spite of
  their experimental status. The people who don't want flakes to be the only
  way of doing things are understandably very frustrated, some of them even
  going so far as to [rewrite Nix][tvix].

  The maintenance of the C++ Nix implementation is not very healthy: it has a
  large PR backlog while, at the same time, the BDFL, Eelco Dolstra, commits
  directly to master. This situation is disappointing.

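As promised above, here is the general shape of a nixpkgs package definition:
a function from its dependencies to a derivation, with the actual values
supplied elsewhere by `callPackage` (the package itself is made up):

```nix
# pkgs/example-tool/default.nix: nothing here says where stdenv, fetchurl,
# or openssl come from; that is decided at the call site, which is what
# makes static analysis so hard.
{ lib, stdenv, fetchurl, openssl }:

stdenv.mkDerivation rec {
  pname = "example-tool";
  version = "1.0";
  src = fetchurl {
    url = "https://example.org/example-tool-${version}.tar.gz";
    hash = lib.fakeHash;  # placeholder; a real package pins the real hash
  };
  buildInputs = [ openssl ];
}
```
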
[ztn]: https://zero-to-nix.com/
[ztn-docs]: https://discourse.nixos.org/t/parting-from-the-documentation-team/24900
[finding-functions-in-nixpkgs]: https://jade.fyi/blog/finding-functions-in-nixpkgs/
[flake-rfc]: https://github.com/NixOS/rfcs/pull/49
[tvix]: https://tvl.fyi/blog/rewriting-nix

## How did they do it?

Every large company has rebuilt something Nix-like at some level, since at some
point everyone needs development environments that are the same for everyone
and the same as production. Nix provides that tooling in a much more accessible
form.

Here are some of the things they did to achieve it (see also [Eelco Dolstra's
PhD thesis for extensive details][eelco-thesis]):

[eelco-thesis]: https://edolstra.github.io/pubs/phd-thesis.pdf

### Every store path is a unique version and dependency closure

One of the *key* insights of the Nix implementation is that every path in the
Nix store (typically at `/nix/store`) has a hash in its name, derived from
either its build plan or its content.

Nix derivations are either fixed-output, input-addressed, or content-addressed.
Fixed-output derivations can access the network, but their output must match
an expected hash. Input-addressed derivations are the bread and butter of Nix
today: the hash in the name of the output depends on the content of the build
plan. [Content-addressed derivations][ca-derivations] are an experimental
feature, potentially promising to save a lot of compute time spent on pointless
rebuilds by allowing multiple build plans to map to the same output path if the
output is identical (consider, for example, a change to a shell comment in a
build script).

All references to items in the Nix store (for example, in shebangs, in
derivations' dependency lists, in shared object paths, and so on) are by full
path, effectively creating a [Merkle tree] when the items are themselves
hashed: the hashes of dependencies are included in the hash of the build plan.

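You can see this reference-by-full-path property any time one derivation uses
another; here is a tiny sketch (the script name is made up; `hello` and
`writeShellScriptBin` are real nixpkgs attributes):

```nix
# greet.nix: interpolating pkgs.hello embeds its full /nix/store path,
# hash and all, so the dependency edge is explicit in the build plan.
{ pkgs ? import <nixpkgs> {} }:

pkgs.writeShellScriptBin "greet" ''
  exec ${pkgs.hello}/bin/hello --greeting "hi from an exact store path"
''
```
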
The upshot of this is that any number of versions of a package can coexist,
allowing programs from older distribution versions, development versions, and
any other weird versions to run on the same system without trampling over each
other's libraries or requiring sandboxing.

This feature is necessary for the implementation of something like `nix-shell`,
which brings packages into scope for the duration of a shell session, after
which they may be garbage collected.

<aside>
One corollary of this insight is that it is impossible to provide the Nix-like
behaviour of allowing multiple versions of a package to coexist on a single
system without either assigning hash-based names and making all references use
that globally unique name, or doing what nixpkgs'
<code>buildFHSUserEnv</code> and Docker do: using Linux namespaces to run each
application in its own filesystem with the dependencies appearing at the
expected locations.

Assume that a system was built using package version numbers to associate
dependencies: openssl 3.0 depends on glibc 2.36 by path, and so on. What does
one do if openssl 3.0 is rebuilt against some hypothetical incompatible glibc
2.37? Does openssl get a new version number? The only fully general solution is
to ensure that changing any bit in a dependency will change its dependents'
versions.
</aside>

### Builds must be hermetic for trustworthy incremental builds

Nix builds are either sandboxed or forced to match an expected output hash,
leaving very little room for the typical incremental-build problems everyone
has run into, since it is known exactly what went into each build. If it built
today, it's highly unlikely to have a different result tomorrow.

### Archive encoding leaves room for creativity

Nix hashes archives *consistently*: when `.tar.gz` and other archive files are
unpacked, they are repacked into an archive format (NAR, "Nix ARchive") that
has exactly one encoding per directory structure, and that is what gets hashed.

Recently, GitHub upgraded their Git version, changing the `tar` file encoder
and thereby changing the hashes of all of their source archives. These archives
were never guaranteed to be bit-for-bit stable; just their contents. However,
they had been stable in practice for years. Because of this, build systems that
pin source archives from GitHub should hash contents instead of archives.

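This is what nixpkgs' `fetchFromGitHub` does; a sketch (the owner, repo, and
rev are placeholders):

```nix
# fetchFromGitHub unpacks the tarball and hashes the resulting tree (as a
# NAR), so GitHub re-compressing its archives does not invalidate the hash.
{ pkgs ? import <nixpkgs> {} }:

pkgs.fetchFromGitHub {
  owner = "some-org";
  repo = "some-project";
  rev = "v1.2.3";
  hash = pkgs.lib.fakeHash;  # replace with the hash reported by the first build
}
```
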
### Immutability and from-scratch system images make configuration drift impossible

NixOS uses the most reliable paradigm for configuration management: full
ownership over the system, effectively generating the system from scratch every
time, modulo reusing the bits that didn't change.

It keeps the configuration immutable once built. To change it, you rebuild the
configuration and then switch to it relatively atomically.

This contrasts with the way other configuration management systems (besides
Packer and other image-building tools) work: they attempt to mutate the system
into the desired state, potentially allowing unmanaged pieces to creep in and
enable drift.

[ca-derivations]: https://www.tweag.io/blog/2021-12-02-nix-cas-4/
[Merkle tree]: https://en.wikipedia.org/wiki/Merkle_tree