+++
date = "2022-12-26"
draft = true
path = "/blog/tracing-dx-ideas"
tags = ["haskell", "opentelemetry", "developer-experience"]
title = "Make tracing easy easily! Developer experience ideas"
+++
I interned at Mercury for several months and built out a lot of developer
experience improvements. Many of these were driven by having a good sense of
whether something would be feasible in an afternoon, and by knowing that I
could get away with spending an afternoon programming something nobody had
asked for yet.
# OpenTelemetry/Tracing
Many of the highest impact ideas I had were related to OpenTelemetry tracing:
my goal was to make tracing the first choice to investigate any kind of problem
from development to production. This blog post catalogues the ideas that I
implemented, how much work they were, and whether I think they're worth it.
## Put a link to traces in a header
I made the back-end emit a header `trace-link`, which contains a link to the
Honeycomb trace for the request.
#### How easy was it?
One afternoon of work (plus a couple of days' work later, once we had to
start hitting the API due to the new data model with environments). Most of
this work is open source and reusable for Haskell apps.
#### What did it accomplish?
This was probably the best tracing adoption improvement I made because it lets
devs directly look at misbehaving requests in browser dev tools and then open
the trace in one click. It singlehandedly got a handful of people to start
using tracing.
It doesn't really give any capability that isn't available by copying the
trace ID out of the second component of the `traceparent` header you're
already sending if you're using the [w3c trace propagator]. However, doing
that is very arduous and manual.
If you have trace ID generation code, you can also start emitting trace IDs in
other places, such as logs, exception reporting systems, and anywhere else you
might want to follow requests through.
#### How to do it
If you're using the hs-opentelemetry ecosystem for Haskell, the relevant code
is here, in the package `hs-opentelemetry-vendor-honeycomb`:
https://github.com/iand675/hs-opentelemetry/tree/main/vendors/honeycomb
What this package does is:
1. Find where data is going using the [Honeycomb Auth API]: you need to know
the dataset, tenancy name, and environment that the API key is going into.
In my design, this data is acquired at startup time so trace link generation
is just string concatenation thereafter.
2. Create [Direct Trace Links] using the trace ID then put them in a header.
[Honeycomb Auth API]: https://docs.honeycomb.io/api/auth/
[Direct Trace Links]: https://docs.honeycomb.io/api/direct-trace-links/
[w3c trace propagator]: https://www.w3.org/TR/trace-context/
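To make the startup-resolution idea concrete, here is a minimal sketch of
step 2 using hs-opentelemetry-api accessors. The `HoneycombTarget` record is
an assumption standing in for the data resolved from the Auth API, and the
URL shape should be checked against the [Direct Trace Links] docs:

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RecordWildCards #-}

import Data.ByteArray.Encoding (Base (Base16))
import Data.Text (Text)
import OpenTelemetry.Context (lookupSpan)
import OpenTelemetry.Context.ThreadLocal (getContext)
import OpenTelemetry.Trace.Core (getSpanContext, traceId)
import OpenTelemetry.Trace.Id (TraceId, traceIdBaseEncodedText)

-- Assumed record, filled in once at startup from the Honeycomb Auth API.
data HoneycombTarget = HoneycombTarget
  { hcTeam :: Text
  , hcEnvironment :: Text
  , hcDataset :: Text
  }

-- After startup, building a link is just string concatenation.
directTraceLink :: HoneycombTarget -> TraceId -> Text
directTraceLink HoneycombTarget {..} tid =
  "https://ui.honeycomb.io/" <> hcTeam
    <> "/environments/" <> hcEnvironment
    <> "/datasets/" <> hcDataset
    <> "/trace?trace_id=" <> traceIdBaseEncodedText Base16 tid

-- Fetch the active span's trace ID from the thread-local context, if any;
-- the same helper works for logs and exception reports.
currentTraceId :: IO (Maybe TraceId)
currentTraceId = do
  ctx <- getContext
  traverse (fmap traceId . getSpanContext) (lookupSpan ctx)
```

With this in hand, setting the `trace-link` response header is just a matter
of calling `directTraceLink` in whatever middleware wraps your handlers.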
## Instrument the test suite
#### How easy was it?
Implementing the hspec integration originally took about half a week, since
it involved reading substantial amounts of hspec internals and poking around
in a debugger. I would expect a similar time investment for initially adding
instrumentation to any other test framework or language, with some adjustment
for how well documented its internals are (deduct some points from hspec for
confusing documentation).
However, once the integration for your test framework of choice exists, it
takes a few minutes to add it to a new codebase.
#### What did it accomplish?
I was initially surprised that this had as big an impact as it did, but
Honeycomb wound up being the easiest and cleanest way to view test suite runs
and get database logs, exceptions, and other useful debugging info to fix
broken tests. This was a very worthwhile project and probably saved a handful
of people a couple of hours each debugging thorny test failures.
#### How to do it
I wrote a Haskell library that starts spans for each test case in hspec:
[hs-opentelemetry-instrumentation-hspec]. Plug this in per the example in the
sources, and then you're done.
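If you want a feel for the shape of the integration, here is a hand-rolled
approximation (not the library's actual API, which hooks into hspec's
internals so spans get named after the individual test items): wrap each test
case in a span with an `around_` hook, given a `Tracer` from your existing
tracing setup.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import OpenTelemetry.Trace (Tracer, defaultSpanArguments, inSpan)
import Test.Hspec (Spec, around_, describe, it, shouldBe)

-- Every test case in this subtree runs inside its own span, so database
-- logs and exceptions emitted during the test attach to the trace.
tracedSpec :: Tracer -> Spec
tracedSpec tracer =
  around_ (inSpan tracer "test case" defaultSpanArguments) $
    describe "addition" $
      it "adds small numbers" $
        (1 + 1 :: Int) `shouldBe` 2
```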
Bonus points if you print out a trace link at the end, since you can just
reuse the trace link infrastructure from above for this.
You may also need to modify parts of your test setup to use instrumentation,
for example the way you do database interactions in tests.
[hs-opentelemetry-instrumentation-hspec]: https://github.com/iand675/hs-opentelemetry/tree/main/instrumentation/hspec
## Instrument scheduled tasks
#### How easy was it?
20 minutes: tracing initialization already existed for the app, and just
needed to also be run in the scheduled tasks system.
#### What did it accomplish?
This one achieved ridiculously good results basically immediately: it made
scheduled task misbehaviour and performance significantly easier to debug.
#### How to do it
Initialize tracing in your scheduled task runner, then create a context/root
span for the task execution. Bonus points if you propagate the trace ID context
from whatever invoked the scheduled task so you can correlate it with the
initiating request in your tracing system.
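A minimal sketch of the root-span part, assuming a `Tracer` from your app's
existing tracing setup and a hypothetical `Task` type (substitute your
scheduler's own representation):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import OpenTelemetry.Trace (Tracer, defaultSpanArguments, inSpan)

-- Hypothetical task type; your scheduler will have its own.
data Task = Task
  { taskName :: Text
  , taskBody :: IO ()
  }

-- Each execution gets its own root span. With context propagation from the
-- enqueuing request, it would instead show up linked to that request's trace.
runTask :: Tracer -> Task -> IO ()
runTask tracer task =
  inSpan tracer ("scheduled-task " <> taskName task) defaultSpanArguments
    (taskBody task)
```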
# Database
While I was working at Mercury we were using Postgres, but these ideas are
fairly generic.
## Speedy test startup
After introducing instrumentation to the test suite, I debugged an issue in
which migrations would run for 15 seconds or so on every test startup,
because the migration system was replaying hundreds of migrations from
scratch each time. I solved this by restoring a snapshot of a pre-migrated
database with `pg_restore`, saving about 10 seconds without changing anything
semantically (by comparison, a persistent test database has more risk of
divergence).
The fastest way that I know of to create a new Postgres database in a desired
state is to use the template feature: `createdb` with the `-T` option, or
`CREATE DATABASE yourname TEMPLATE yourtemplate`. This is a filesystem-level
copy, which makes it extremely fast (less than a second for a highly complex
schema, compared to approximately 5 seconds to load the equivalent SQL).
This can be used to create a database for each concurrent test. Those test
databases can in turn be wiped after each test case with some kind of function
that uses `TRUNCATE` (again, very low level; doesn't look at the data) to wipe
the tables in preparation for the next case.
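To make the mechanics concrete, here is a sketch using postgresql-simple. The
database names and table list are illustrative, and the admin connection must
point at a database other than the template:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Int (Int64)
import Data.String (fromString)
import Database.PostgreSQL.Simple (Connection, execute_)

-- Filesystem-level copy of the pre-migrated template; no SQL replay.
-- (CREATE DATABASE cannot run inside a transaction block.)
createTestDb :: Connection -> String -> IO Int64
createTestDb admin name =
  execute_ admin $
    fromString ("CREATE DATABASE " <> name <> " TEMPLATE test_template")

-- Between test cases: wipe data without scanning it. Table names are
-- illustrative; in practice you would enumerate them from the catalog.
wipeTables :: Connection -> IO Int64
wipeTables conn =
  execute_ conn "TRUNCATE TABLE users, transactions RESTART IDENTITY CASCADE"
```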
This leads to:
## Make testing migrations easy: ban down migrations
I wrote a script for testing database migrations. The idea was born out of
frustration with wiped development databases while working on migrations
(which, to be clear, were easy to recreate, but still took 30 seconds or
something): what if you just snapshot the development database, then
repeatedly run a migration against a restored copy? A sketch of that loop
follows.
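Here is the rough shape of that loop using the standard process package; the
database names and the migration command are placeholders:

```haskell
import System.Process (callProcess)

-- One-time: snapshot the current development database as a template.
snapshotDevDb :: IO ()
snapshotDevDb = callProcess "createdb" ["-T", "dev_db", "dev_db_snapshot"]

-- Each iteration: restore the snapshot, then run the migration under test.
-- (createdb -T requires no active connections to either database.)
testMigrationOnce :: IO ()
testMigrationOnce = do
  callProcess "dropdb" ["--if-exists", "dev_db"]
  callProcess "createdb" ["-T", "dev_db_snapshot", "dev_db"]
  callProcess "your-migration-command" []  -- placeholder
```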