blog/content/posts/tracing-dx-ideas.md
2023-07-20 00:47:38 -07:00

+++
date = "2022-12-26"
draft = true
path = "/blog/tracing-dx-ideas"
tags = ["haskell", "opentelemetry", "developer-experience"]
title = "Make tracing easy easily! Developer experience ideas"
+++

I interned at Mercury for several months and built a lot of developer experience improvements. Many of these came from having a good sense of whether something would be feasible in an afternoon, and from knowing that I could get away with spending an afternoon programming something nobody had asked for yet.

OpenTelemetry/Tracing

Many of the highest impact ideas I had were related to OpenTelemetry tracing: my goal was to make tracing the first choice to investigate any kind of problem from development to production. This blog post catalogues the ideas that I implemented, how much work they were, and whether I think they're worth it.

I made the back-end emit a response header, trace-link, which contains a link to the Honeycomb trace for the request.

How easy was it?

One afternoon of work (plus a couple of days of work later, once we had to start hitting the API because of the new data model with environments). Most of this work is open source and reusable for Haskell apps.

What did it accomplish?

This was probably the best tracing adoption improvement I made, because it lets devs look directly at misbehaving requests in browser dev tools and then open the trace in one click. It single-handedly got a handful of people to start using tracing.

It doesn't really give you any capability you don't already have: if you're using the W3C trace propagator, you can copy the trace ID out of the second component of the traceparent header you're already sending. However, doing that is arduous and manual.
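To make "the second component of the traceparent header" concrete, here is a minimal sketch of pulling the trace ID out of a header value. It assumes the W3C Trace Context layout of version, trace ID, parent ID, and flags separated by hyphens. The function names are hypothetical and are not part of hs-opentelemetry.

```haskell
-- | Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, []) -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- | True for the lowercase hex digits that traceparent uses.
isLowerHex :: Char -> Bool
isLowerHex c = c `elem` "0123456789abcdef"

-- | Extract the trace ID (the second hyphen-separated component) from a
-- traceparent header value. Returns Nothing if the value has too few
-- components, or if the trace ID is not 32 lowercase hex characters or
-- is all zeros. The version field is not validated in this sketch.
traceIdFromTraceparent :: String -> Maybe String
traceIdFromTraceparent header =
  case splitOn '-' header of
    (_version : traceId : _parentId : _flags : _)
      | length traceId == 32
      , all isLowerHex traceId
      , any (/= '0') traceId ->
          Just traceId
    _ -> Nothing
```

For example, `traceIdFromTraceparent "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"` evaluates to `Just "4bf92f3577b34da6a3ce929d0e0e4736"`.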

If you have trace ID generation code, you can also start emitting trace IDs in other places, such as logs, exception reporting systems, and anywhere else you might want to follow requests through.

How to do it

If you're using the hs-opentelemetry ecosystem for Haskell, the relevant code is here, in the package hs-opentelemetry-vendor-honeycomb:

https://github.com/iand675/hs-opentelemetry/tree/main/vendors/honeycomb

What this package does is:

  1. Find where data is going using the Honeycomb Auth API: you need to know the dataset, tenancy name, and environment that the API key is going into.

    In my design, this data is acquired at startup time so trace link generation is just string concatenation thereafter.

  2. Create Direct Trace Links using the trace ID then put them in a header.
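The second step can be sketched roughly as follows. This is not the API of hs-opentelemetry-vendor-honeycomb: the record, field, and function names are hypothetical, and the URL shape is an assumption based on my understanding of Honeycomb's Direct Trace Links, so check it against their documentation before relying on it.

```haskell
-- | Where an API key's data ends up. In the design described above this
-- is looked up once at startup (hypothetical record, for illustration).
data HoneycombTarget = HoneycombTarget
  { targetTeam :: String
  , targetEnvironment :: String
  , targetDataset :: String
  }

-- | Build a trace link by plain string concatenation.
-- Assumption: the team, environment, and dataset slugs are already
-- URL-safe, and the path layout matches Honeycomb's direct trace links.
traceLink :: HoneycombTarget -> String -> String
traceLink target traceId =
  concat
    [ "https://ui.honeycomb.io/"
    , targetTeam target
    , "/environments/"
    , targetEnvironment target
    , "/datasets/"
    , targetDataset target
    , "/trace?trace_id="
    , traceId
    ]

-- | The header name and value to attach to the response.
traceLinkHeader :: HoneycombTarget -> String -> (String, String)
traceLinkHeader target traceId = ("trace-link", traceLink target traceId)
```

Because the target is resolved ahead of time, generating the header for each request costs nothing beyond the concatenation.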

Instrument the test suite

How easy was it?

Implementing the hspec integration originally took about half a week, since it involved reading substantial amounts of hspec internals and poking around in a debugger. I'd expect a similar amount of time to add instrumentation to any other test framework or language for the first time, adjusted for how well documented it is (hspec loses some points for confusing documentation).

However, once the integration for your test framework of choice exists, it takes a few minutes to add it to a new codebase.

What did it accomplish?

I was initially surprised that this had as big an impact as it did, but Honeycomb wound up being the easiest and cleanest way to view test suite runs and get database logs, exceptions, and other useful debugging info to fix broken tests. This was a very worthwhile project and saved a handful of people probably a couple of hours each debugging thorny test failures.

How to do it

I wrote a Haskell library that starts spans for each test case in hspec: hs-opentelemetry-instrumentation-hspec. Plug this in per the example in the sources, and then you're done.

Bonus points if you print out a trace link at the end, since you can just reuse the trace link infrastructure from above for this.

You may also need to change other parts of your test setup so that they are instrumented, for example the way you do database interactions in tests.

Instrument scheduled tasks

How easy was it?

20 minutes: the app already had tracing, so it only needed to be initialized in the scheduled tasks system as well.

What did it accomplish?

This one achieved ridiculously good results basically immediately: it's significantly easier to debug scheduled task misbehaviour and performance.

How to do it

Initialize tracing in your scheduled task runner, then create a context/root span for the task execution. Bonus points if you propagate the trace ID context from whatever invoked the scheduled task so you can correlate it with the initiating request in your tracing system.
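The shape of this is sketched below, using an in-memory list in place of a real tracer so that it runs with only base. All the names here are hypothetical. With hs-opentelemetry you would use its own span and context functions instead, and the fresh trace ID would come from the tracer rather than a parameter.

```haskell
import Control.Exception (SomeException, throwIO, try)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Data.Maybe (fromMaybe)

-- | A queued task, plus the trace ID of whatever scheduled it, if known.
data ScheduledTask = ScheduledTask
  { taskName :: String
  , taskParentTraceId :: Maybe String
  }

-- | Stand-in for a finished span; a real tracer would export this.
data FinishedSpan = FinishedSpan
  { spanName :: String
  , spanTraceId :: String
  , spanSucceeded :: Bool
  }
  deriving (Eq, Show)

-- | Run a task inside a root span. The span joins the scheduling
-- request's trace when a parent trace ID was propagated, and otherwise
-- starts a new trace with the supplied fresh ID. Failures are recorded
-- on the span and then rethrown unchanged.
runTraced :: IORef [FinishedSpan] -> String -> ScheduledTask -> IO a -> IO a
runTraced sink freshTraceId task action = do
  let traceId = fromMaybe freshTraceId (taskParentTraceId task)
      finish ok =
        modifyIORef' sink (FinishedSpan (taskName task) traceId ok :)
  result <- try action
  case result of
    Left err -> finish False >> throwIO (err :: SomeException)
    Right value -> finish True >> pure value

-- | Usage: a task scheduled by a traced request keeps that trace ID.
demo :: IO [FinishedSpan]
demo = do
  sink <- newIORef []
  let task = ScheduledTask "nightly-report" (Just "4bf92f3577b34da6a3ce929d0e0e4736")
  runTraced sink "fresh-trace-id" task (pure ())
  readIORef sink
```

Storing the parent trace ID alongside the task when it is enqueued is what lets the runner attach the execution to the initiating request's trace.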

Database

While I was working at Mercury we were using Postgres, but these ideas are fairly generic.

Speedy test startup

After introducing instrumentation to the test suite, I debugged an issue in which migrations ran for 15 seconds or so on every test startup, because the migration system was running hundreds of migrations each time. I solved this by restoring a snapshot of a pre-migrated database with pg_restore, which saved about 10 seconds without changing anything semantically (by comparison, a persistent test database carries more risk of divergence).

The fastest way I know of to create a new Postgres database in a desired state is the template feature: createdb with the -T option, or CREATE DATABASE yourname TEMPLATE yourtemplate. This is a filesystem-level copy, which makes it extremely fast (less than a second on a highly complex schema, compared with approximately 5 seconds to load the equivalent SQL).

This can be used to create a database for each concurrent test. Those test databases can in turn be wiped after each test case with some kind of function that uses TRUNCATE (again, very low level; doesn't look at the data) to wipe the tables in preparation for the next case.
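As a rough sketch of the statements involved, the helpers below build the SQL as strings. The function names and the numbering scheme for per-worker databases are hypothetical, and how you execute the statements depends on your database library.

```haskell
import Data.List (intercalate)

-- | Quote a Postgres identifier, doubling any embedded double quotes.
quoteIdent :: String -> String
quoteIdent name = "\"" ++ concatMap escape name ++ "\""
  where
    escape '"' = "\"\""
    escape c = [c]

-- | Statement that clones a pre-migrated template database.
-- Note: Postgres refuses to copy a template while other sessions are
-- connected to it.
createFromTemplate :: String -> String -> String
createFromTemplate template newDb =
  "CREATE DATABASE " ++ quoteIdent newDb ++ " TEMPLATE " ++ quoteIdent template

-- | One database name per concurrent test worker (hypothetical scheme).
workerDatabases :: String -> Int -> [String]
workerDatabases prefix n = [prefix ++ "_" ++ show i | i <- [1 .. n]]

-- | Statement that wipes the given tables between test cases.
-- Returns Nothing when there is nothing to truncate.
truncateTables :: [String] -> Maybe String
truncateTables [] = Nothing
truncateTables tables =
  Just
    ( "TRUNCATE TABLE "
        ++ intercalate ", " (map quoteIdent tables)
        ++ " RESTART IDENTITY CASCADE"
    )
```

For example, `createFromTemplate "app_template" "app_test_1"` produces `CREATE DATABASE "app_test_1" TEMPLATE "app_template"`. Truncating all tables in a single statement with CASCADE avoids having to order them by foreign key dependencies.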

This leads to:

Make testing migrations easy: ban down migrations

I wrote a script for testing database migrations. The idea was born out of frustration with wiped development databases while working on migrations (which, to be clear, were easy to create, but that still took 30 seconds or so): what if you just snapshot the development database, then repeatedly run a migration?