Hazel Weakly

Scaling Mastodon: The Compendium

2022-11-27T00:00:00Z

This blog post will be kept up to date as I find out more information and publish my findings. It’s currently organized in no particular order, as a collection of several fragmented thoughts.

Nginx

Nginx config for object storage

The nginx config used to proxy to an object storage with a cache

you will have to tune nginx by increasing its worker_rlimit_nofile and worker_connections values.

– scaling a mastodon server

ok kewl, good to remember I suppose.

You may also need to remediate https://github.com/mastodon/mastodon/pull/21840 via setting your response timeout to 300s in nginx instead of 30 or even 60s.

Edit: That should hopefully no longer be the case, sweet.

Postgres

The Sobbing SysAdmin’s Guide to Postgres Tuning

IF THE MASTODON INSTANCE BANS ME FOR FUCKING UP THE HARD DRIVE I WILL FACE GOD AND WALK BACKWARDS INTO HELL

– postgres

There are a few rules of tuning postgres: The first is that you have to do it. The second is that nobody knows how to do it.

Now that we know the rules, let’s go forth and explain how to deal with max_connections specifically. This section is gonna be written as if I know what I am talking about but please be assured that I most certainly do not.

A rule of thumb for rails and postgres: Thou shalt not ever fuck up and manage to get more DB connections going than we have in max_connections for postgres. However, thou shalt also keep max_connections as low as fucking possible because absolutely everything in postgres falls over and shits the bed if you start getting hard contention due to trying to have more connections than is allowed.

In this case postgres won’t literally shit the bed, but your sidekiq queues will be unable to connect to postgres until you’re below max_connections again. “Oh that’s fine” says the clueless person. “I will just set max_connections to above 9000” says the fool.

New rule of thumb: If you have to set postgres max_connections to above 512, don’t.

Why? Well, why do you need that many? You probably don’t and adding more will cause latent system instability later on. To the best of my understanding, there are a few things going on here.

Here’s what I think we keep running into:

a mastodon sysadmin says “oh wow the sidekiq queues are slow, we need to add more workers”
this adds more connections to postgres, which degrades performance slightly
postgres starts “doing more IO”
performance counterintuitively goes down because queries start taking longer
GOTO 10

At some point you’re going to run out of max_connections. If you raise it to an absurd number like above 1024, the next issue you’re probably going to run into is that your storage system probably can’t actually handle the IO demands you’re theoretically placing on it.

Here’s what the above sequence looks like from the system’s point of view:

just having connections will slowly cause more and more slowdown over time
Which means more of those connections will slowly become active as things take longer and longer
More active connections hammers the IO way harder
Which slows things down
*the server sobbing* “please please im already dying”

So what number do you actually want to set it to? Luckily, this postgres tuning guide has a “helpful” formula that explains how to find an ideal limit:

max_connections < max(num_cores, parallel_io_limit) /
                  (session_busy_ratio * avg_parallelism)

So clearly, don’t set your postgres max_connections to anything more than *insert magic numbers*.
OBVIOUSLY.
EASY.

Ever tried to figure out the performance characteristics and “average parallelism” of a rails application?

AN ERRAND FOR FOOLS WHO DRINK THE MILK OF INNOCENCE.
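For what it’s worth, the formula itself is trivial to evaluate once you’ve guessed the inputs; the guessing is the hard part. Here’s a minimal sketch in Python. The function name and every number in the example are made up for illustration, not measurements from any real instance.

```python
def max_connections_upper_bound(num_cores, parallel_io_limit,
                                session_busy_ratio, avg_parallelism):
    """Evaluate the tuning guide's upper bound for max_connections.

    All four inputs are estimates you have to come up with yourself.
    """
    if session_busy_ratio <= 0 or avg_parallelism <= 0:
        raise ValueError("session_busy_ratio and avg_parallelism must be positive")
    return max(num_cores, parallel_io_limit) / (session_busy_ratio * avg_parallelism)


# Hypothetical box: 16 cores, storage that copes with about 32 parallel IOs,
# sessions busy 10% of the time, and no parallel query (avg_parallelism = 1).
bound = max_connections_upper_bound(16, 32, 0.10, 1.0)
print(f"keep max_connections below roughly {bound:.0f}")
```

Note how much the answer swings with session_busy_ratio: if sessions are busy half the time instead of a tenth of the time, the bound drops by a factor of five.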

If you use a db pool like pgbouncer you get to conveniently avoid this most of the time by naturally not really needing to set postgresql connections beyond 500-ish. However, why you need to do so is never really explained. So here’s the explanation: because any value of max_connections over 999 will cause your children to be devoured by Australian evil spirits.

(But seriously, you can get as much as a 46% drop in queries per second in some cases)

`DB_POOL` notes from Nora’s blog

The DB_POOL variable controls how many database connections a Ruby on Rails process will use. (MAX_THREADS controls this for Puma, the server used in web.) In addition, the web service takes a variable called WEB_CONCURRENCY to control how many processes it runs. Similarly, streaming has STREAMING_CLUSTER_NUM to control the number of processes.

The sum of MAX_THREADS times WEB_CONCURRENCY in web, STREAMING_CLUSTER_NUM times DB_POOL in streaming, and all the sidekiq DB_POOL variables, must be less than max_connections in your Postgres config. If it’s more, you’ll experience database contention.

In the example above, assuming the rest of the configuration is default and you have 200 database connections available, I’d set the following:

web: MAX_THREADS = 10, WEB_CONCURRENCY=3 for 30 connections

streaming: STREAMING_CLUSTER_NUM = 3, DB_POOL = 15 for 45 connections

sidekiq-default-push-pull: DB_POOL = 25, -c 25 for 25 connections

sidekiq-default-pull-push: DB_POOL = 25, -c 25 for 25 connections

sidekiq-pull-default-push: DB_POOL = 25, -c 25 for 25 connections

sidekiq-push-default-pull: DB_POOL = 25, -c 25 for 25 connections

sidekiq-push-scheduler: DB_POOL = 5, -c 5 for 5 connections

sidekiq-push-mailers: DB_POOL = 5, -c 5 for 5 connections

For a sum of 185 connections. This means there will be 15 loose database connections for things like migrations and manually connecting to the database to do queries and maintenance.

– Scaling Mastodon in the face of an exodus

When to pgbouncer

If you start running out of available Postgres connections (the default is 100) then you may find PgBouncer to be a good solution.

– why pgbouncer

note: this implies that nobody actually tries to run past 100 connections without pgbouncer. There’s probably a reason for this, annoying as it is. (Ruby + activerecord in particular seems to be quite prone to doing blocking IO inside a database transaction cause why not).

When you reach the point where it makes sense to move Postgres to its own physical machine, I recommend maintaining pgBouncer on each machine that wants to connect to it, rather than putting pgBouncer on the same machine as Postgres.

– scaling a mastodon server

note: read replicas are suggested to be unneeded even at 128k active users.

Idle Hands are the Devil’s Workshop

A happy postgres is one where the amount of idle transactions is low (but not constantly zero). Think of the 80/20 rule as a nice rule of thumb; if more than 20% of your connections are idle… That’s not great.

If you want to look at this, stack overflow has an example of a useful SQL query you can run in postgres.

select  * from
    (select state, count(*) from pg_stat_activity  where pid <> pg_backend_pid() group by 1 order by 1) q1,
    (select setting::int res_for_super from pg_settings where name=$$superuser_reserved_connections$$) q2,
    (select setting::int max_conn from pg_settings where name=$$max_connections$$) q3;

This will return a table looking something like this:

        state        | count | res_for_super | max_conn
---------------------+-------+---------------+----------
 active              |    12 |             3 |     300
 idle                |   127 |             3 |     300
 idle in transaction |     6 |             3 |     300
                     |     6 |             3 |     300
(4 rows)

The state column lists each connection state (active, idle, idle in transaction) next to its count. res_for_super is the number of connections reserved for superuser access, and max_conn is the max connections you’ve specified in your settings. Those last two values repeat on every row since they’re their own columns, but it’s fine; I’m sure there’s a prettier query that can give you this information but this one works.

If you’ll notice, there’s quite a few idle connections here. That’s because the server is in a state of low database usage. This is why you want to use something like pgbouncer: it keeps the number of idle connections as low as possible, so you don’t end up overprovisioning your max_connections.
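If you’d rather not do the percentages in your head, here’s a small Python sketch that turns those per-state counts into the two ratios worth watching. The function name and the dictionary layout are my own assumptions; the numbers are copied from the example table above, with the blank state keyed as an empty string.

```python
def connection_summary(state_counts, max_conn, res_for_super):
    """Summarise pg_stat_activity state counts.

    state_counts maps a state name to its count, as returned by the query.
    """
    total = sum(state_counts.values())
    idle = state_counts.get("idle", 0)
    usable = max_conn - res_for_super  # what non-superusers can actually use
    return {
        "total": total,
        "idle_fraction": idle / total if total else 0.0,
        "used_fraction": total / usable if usable > 0 else 0.0,
    }


example = {"active": 12, "idle": 127, "idle in transaction": 6, "": 6}
summary = connection_summary(example, max_conn=300, res_for_super=3)
print(summary)
if summary["idle_fraction"] > 0.20:
    print("more than 20% of connections are idle")
```

The 20% threshold is just the rule of thumb from this section, not something postgres enforces.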

Postgres Calculator Math

Here’s the calculator math that I’ve ~~stolen from nora~~ come up with.

Let’s assume the following systemd services (annotated with every setting that causes a connection to postgres). The @Nx here denotes a systemd unit template file where N is the number of units you’ve started that correspond to this sidekiq queue. There are also several DB_POOL variables. Since they are all different yet called the same environment variable, I am changing them to be unique here so that it makes sense in a calculation formula.

So, here’s the list of services:

mastodon-web
- WEB_CONCURRENCY
- MAX_THREADS
mastodon-streaming
- STREAMING_CLUSTER_NUM
- DB_POOL_streaming
mastodon-sidekiq-push@N1
- DB_POOL_push
mastodon-sidekiq-pull@N2
- DB_POOL_pull
mastodon-sidekiq-scheduler@N3
- DB_POOL_scheduler
- note: you should never have more than one scheduler running, however you may set DB_POOL and concurrency to whatever you want it to be.
mastodon-sidekiq-mailing@N4
- DB_POOL_mailing
mastodon-sidekiq-default@N5
- DB_POOL_default
mastodon-sidekiq-ingress@N6
- DB_POOL_ingress

And, of course, you’re running postgres somewhere. Postgres has a max_connections set in its configuration somewhere.

The formula for total connections is:

total_mastodon_connections =
  (WEB_CONCURRENCY * MAX_THREADS) +
  (STREAMING_CLUSTER_NUM * DB_POOL_streaming) +
  (N1 * DB_POOL_push) +
  (N2 * DB_POOL_pull) +
  (N3 * DB_POOL_scheduler) +
  (N4 * DB_POOL_mailing) +
  (N5 * DB_POOL_default) +
  (N6 * DB_POOL_ingress)

Now, if this number is over max_connections in your postgres configuration, you lost. In fact, if this number is more than 90% of max_connections, you’re probably much closer to IMPENDING DOOM than you would ever feel comfortable admitting in public.
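Here’s the same calculation as a Python sketch, so you can plug in your own numbers. The function names, the dictionary layout, and the example values are all hypothetical; only the formula and the 90% threshold come from the text above.

```python
def total_mastodon_connections(web_concurrency, max_threads,
                               streaming_cluster_num, db_pool_streaming,
                               sidekiq_units):
    """sidekiq_units maps a queue name to (number of units N, DB_POOL per unit)."""
    total = web_concurrency * max_threads
    total += streaming_cluster_num * db_pool_streaming
    total += sum(n * db_pool for n, db_pool in sidekiq_units.values())
    return total


def headroom_verdict(total, max_connections, danger_ratio=0.90):
    """Compare the total against max_connections using the 90% rule of thumb."""
    if total > max_connections:
        return "you lost"
    if total > danger_ratio * max_connections:
        return "impending doom"
    return "ok"


# Hypothetical deployment, not a recommendation.
units = {
    "push": (2, 10),
    "pull": (2, 10),
    "scheduler": (1, 5),   # never more than one scheduler
    "mailing": (1, 1),
    "default": (3, 10),
    "ingress": (2, 5),
}
total = total_mastodon_connections(3, 10, 3, 15, units)
print(total, headroom_verdict(total, max_connections=200))
```

If you sit pgbouncer in between, this total is what the clients ask pgbouncer for, which is not the same as what pgbouncer opens against postgres.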

Last miscellaneous note: you want postgresql behind a proxy even if it’s on a single node. It’s just too liable to be painful otherwise, since you have to stop all clients to get the database back online.

Bouncey Bouncy Bounce Bounce Bounce

PgBouncer is a single-threaded process which means it only uses a single CPU. […]

In general, a single PgBouncer can process up to 10,000 connections. 1,000 or so can be active at one time. […]

Adjusting connection counts may also require you to adjust some system limits to allow PgBouncer to utilize the number of sockets required […]

– postgres at scale

When do you need more than one pgbouncer?

PgBouncer’s CPU usage is 100%.

Application queries through PgBouncer wait times increase while Postgres itself is not similarly loaded.

– postgres at scale

Likely causes are that pgbouncer can’t keep up with the number of connections to the database, or that the result sets being returned are too large.

How to test: Run this SQL query on the postgres database.

select state, count(*)
from pg_stat_activity
where backend_type = 'client backend'
group by 1;

If your idle connection count is zero (or very close to zero), pgbouncer is bottlenecked.

SELECT ‘bottle’ FROM ‘neck’ WHERE id = unknown

If you want to find some bottlenecks in your database, according to @AndresFreundTec@mastodon.social, you can run the below query and analyze its output as a starting point.

SELECT backend_type, state, wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()
      AND wait_event_type IS DISTINCT FROM 'Activity'
  GROUP BY 1, 2, 3, 4
  ORDER BY count(*) DESC;

Here’s an example of what that would look like:

  backend_type  |        state        | wait_event_type |      wait_event      | count
----------------+---------------------+-----------------+----------------------+-------
 client backend | idle                | Client          | ClientRead           |    52
 client backend | active              | Lock            | relation             |    13
 client backend | idle in transaction | Client          | ClientRead           |    11
 client backend | active              | Client          | ClientRead           |     2
 checkpointer   |                     | Timeout         | CheckpointWriteDelay |     1
 client backend | active              |                 |                      |     1
(6 rows)

If you’re write latency bound, the query will show a lot of WALWrite wait events.

Setting synchronous_commit = off can alleviate that (although understand roughly what it’s doing first). Here’s a nice explainer

One particular warning, also from @AndresFreundTec, is that setting synchronous_commit = off means your transactions aren’t immediately guaranteed to be durable. That… Should be fine for Mastodon… I think
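To make the bottleneck query’s output easier to eyeball, here’s a Python sketch that totals the rows per wait event and works out what share of backends are stuck on WALWrite. The helper name is made up, the rows are copied from the example table above, and how big a share counts as “a lot” is left to you.

```python
from collections import Counter


def wait_event_totals(rows):
    """rows: (backend_type, state, wait_event_type, wait_event, count) tuples."""
    totals = Counter()
    for _backend_type, _state, _wait_type, wait_event, count in rows:
        # Backends that aren't waiting on anything have no wait_event.
        totals[wait_event or "(not waiting)"] += count
    return totals


rows = [
    ("client backend", "idle", "Client", "ClientRead", 52),
    ("client backend", "active", "Lock", "relation", 13),
    ("client backend", "idle in transaction", "Client", "ClientRead", 11),
    ("client backend", "active", "Client", "ClientRead", 2),
    ("checkpointer", None, "Timeout", "CheckpointWriteDelay", 1),
    ("client backend", "active", None, None, 1),
]
totals = wait_event_totals(rows)
for event, count in totals.most_common():
    print(f"{event:22} {count}")

wal_share = totals["WALWrite"] / sum(totals.values())
print(f"share of backends waiting on WALWrite: {wal_share:.0%}")
```

In this example nothing is waiting on WALWrite; the 13 backends waiting on a relation lock are the more interesting row.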

Memory X Memory the not-anime

Database tip: if you need to increase max_connections on PostgreSQL, make sure to check what work_mem is set to. If max_connections X work_mem is more than double the RAM you have on the server, maybe lower work_mem.

– @fuzzychef@m6n.io
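That tip is easy to turn into a sanity check. A minimal Python sketch, assuming work_mem is given in MB and RAM in GB; the function name and the example numbers are hypothetical.

```python
def work_mem_is_risky(max_connections, work_mem_mb, ram_gb):
    """True if max_connections x work_mem is more than double the RAM."""
    worst_case_mb = max_connections * work_mem_mb
    return worst_case_mb > 2 * ram_gb * 1024


# Hypothetical 8 GB server with max_connections = 300.
print(work_mem_is_risky(max_connections=300, work_mem_mb=64, ram_gb=8))
print(work_mem_is_risky(max_connections=300, work_mem_mb=4, ram_gb=8))
```

With 64 MB of work_mem the product is 19200 MB against a 16384 MB limit, so the first call flags it; at 4 MB it doesn’t.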

Redis

Lies, damned lies, and redis

Mastodon uses Redis. [It supports], SIDEKIQ_REDIS_URL, CACHE_REDIS_URL and just REDIS_URL. (Actually, Mastodon supports REDIS_HOST, REDIS_PORT etc variants separately for all three).

– mastodon scaling docs for redis

note: redis is used for BOTH volatile cache and persistent data. sidekiq, list feeds, home feeds, and the streaming API all need to live in a persistent redis whose data shouldn’t be lost.

note: This is written as if using a separate redis for cache and persistent data is optional. It is not.

read and weep.

[…] it’s important that Sidekiq be run against a Redis instance that is not configured as a cache but as a persistent store. […] I recommend using two separate Redis instances, each configured appropriately, if you wish to use Redis for caching and Sidekiq. Redis namespaces do not allow for this configuration and come with many other problems, so using discrete Redis instances is always preferred.

Speaking of which, the blog post that was linked from the redis wiki is very nice. You should read it: storing data with redis.

This one is also potentially useful. performance tuning for redis

How to redis correctly

So, from the storing data with redis post:

There are several questions to answer when determining how to use Redis for different datasets:

Can I flush the dataset without affecting other datasets?

Can I tune the persistence strategy per dataset? For transactional data, you want real-time persistence with AOF. For cache, you want infrequent RDB snapshots or no persistence at all.

Can I scale Redis per dataset? Redis is single-threaded […] Datasets in the same Redis instance will share that budget. What happens when your traffic spikes and the cache data uses the entire budget? Now your job queue slows to a crawl.

The conclusions are that, for mastodon’s two needs (cache + storage), you must use two separate redis instances or you’re not going to be able to actually change your persistence strategy. Everything else is practically irrelevant; if you can’t change the persistence strategy, there’s no point in using redis for both usecases.

Redis itself can be scaled using either Redis Sentinel or Redis Cluster.

For Sidekiq, only the Sentinel option is viable, as Sidekiq uses a small number of frequently updated keys. With Sentinel, we get fail-over, but we won’t increase the server’s throughput.

For the home feed caches, we might use Redis Cluster, which will distribute the many cache keys across available nodes.

the architecture of mastodon

Storage

Object Storage

At this point it’s worth mentioning that if you want to go further [beyond ~20k users], you’ll need to be using object storage (S3 or similar) for user file uploads, or else manually figure out a shared filesystem between all of the machines in your cluster (very likely possible, but probably not worth it compared to even just self-hosting Minio)

– scaling a mastodon server

note: we have learned that a shared file system does not actually work unless it is a local shared file system. Sidekiq is too latency sensitive otherwise.

note: the parenthetical gives it away. Nobody running mastodon at scale has ever tried to do it without object storage, and we have unwittingly run into edge cases where scaling advice leads us astray here. Remediation is to move to an object storage solution sooner rather than later.

NFS: No FUCKING Scale

That’s it, that’s the ~~tweet~~ toot.

Don’t use NFS for anything; the mastodon documentation claims you can use it. You cannot. Don’t even think about it. Run an object storage locally if you have to; projects like seaweedfs make that simpler now, and it’s a very good idea.

Sidekiq and Ruby

Sidekiq scaling indications

These jobs are split into as many tiny jobs as we can manage, because that’s how you can make parallelize them best and thus make the most optimal use of hardware and horizontal scaling. But if you’ve got 10 threads and 22,000 followers, do not be surprised that there are delays. In fact, that is how the need for scaling Sidekiq shows itself: the dreaded backlog.

– scaling a mastodon server

note: this is a sidekiq scaling indicator. However, we can’t scale sidekiq beyond what the database and filesystem allow.

Actually, there are more reasons that the backlog can grow, such as if there’s a technical issue causing individual jobs to take longer than they normally would, or getting stuck indefinitely reducing the effective number of threads available for processing

– scaling a mastodon server

note: this is very buried but very important. Indicators of sidekiq backlog growing can also be jobs getting stuck. We encountered this with NFS.

note: Hypothesis. We ended up wanting to scale workers up because we were getting a lot of stuck workers due to file system issues. Then when things resolved, we actually had too many workers hitting the database all at once, then we got too much database contention which locked up those workers, leading us to reduce workers, causing a vicious cycle depending on which was misbehaving more, postgres or NFS.

Sidekiq queues and how they hate you

One thing to remember is that there should only be one scheduler in your entire cluster, and it doesn’t need many threads (5 is fine). […] It’s just that default is the most important one, with push and ingress being close second. mailers is also important but even just 25 threads will get you very far because the rate of sending e-mails isn’t that high.

– scaling a mastodon server

note: the ellipsis here is frustrating. There is an entire paragraph that sums up to “you can setup your queues a bunch of ways. Nobody’s ever done performance measurements on them lol good luck bro”

note: my personal hypothesis is as follows. Given the math calculation from Nora’s blog post, each thread in each process has its own separate database connection. As such, threads * processes is always the math we need to use for everything. With all of that in mind, we should experience an irrelevant amount of overhead from sidekiq -q single-queue (xN) vs -q q1,q2,q3,q4 (xN). The difference washes out and database connections are not necessarily used more efficiently unless we can somehow use fewer sidekiq processes.

I fleshed this math out more in the postgres math section

note: tl;dr, single queue for each service. use systemd service templates. ramp them up as needed. rely entirely on pgbouncer to not cause database contention even though it’s fucking ridiculous that we would need to do that.

Sidekiq memory fragmentation

One of the most important things we’ve learned over the years about Sidekiq is that a bad interaction between the C-Ruby runtime and the malloc memory allocator included in Linux’s glibc can cause extremely high memory usage. I’ll talk about what causes this bad interaction in a later email, but for now, let’s just concentrate on the effects.

Sidekiq with high concurrency settings, when running on Linux, can have what looks like a “memory leak”. A single Sidekiq process can slowly grow from 256MB of memory usage to 1GB in less than 24 hours. However, rather than a leak, this is actually memory fragmentation.

– wisdom of the ancients from a mailing list

Locks That Bind and Binds that Lock

Rails effectively does this

Start db transaction
Upload image to media storage
INSERT or UPDATE statement
Commit the transaction

Let’s walk through the implications of this briefly. But first, go ahead and scream into the void; it’ll be helpful.

Now wasn’t that refreshing?

Ok, so the implications here:

If your media storage is the file system, you can hit the file system with the database and the media storage at the same time.
If the database and the media live on the same file system, you can cause slowdowns on that disk from two separate directions that are now coupled to each other.
If that file system is NFS, now the network is involved inside that database transaction
Oh also this is in sidekiq so it’s all parallel and concurrent

The lesson here is use an object storage from day one if you can. Preferably one that doesn’t live on the same disk as postgres. NFS in particular is going to be a very poor choice here. It’s bad enough that, honestly, mastodon documentation should warn against using it rather than presenting it as an option.

mastodon-web and mastodon-streaming

At some point you will definitely want Puma to be on a separate machine from Sidekiq, and then have more machines with Puma, and more machines with Sidekiq. […] Just don’t forget that once your Puma isn’t on the same machine as your nginx, you will need to specify TRUSTED_PROXY_IP with the internal IP of the load balancer so that Puma can correctly parse users’ IP addresses for stuff like rate limiting. […] use an upstream block in your nginx configuration to list these Pumas and nginx will do the load balancing between them.

note: separate machines with just puma and just sidekiq are what we need to start moving towards.

The streaming API will get you pretty far on default configuration, but at some point it too will not be able to answer all of the connections. […] The moment when this becomes necessary can be difficult to detect, because for people who’ve already connected, the streaming API will continue to work, it’s new connections that will be rejected.

– scaling a mastodon server

You can read me but you’ll never clock me

Note, however, that the Sidekiq jobs will need to perform both reads & writes from the main node, hence the scaled-up [read only replicas] are only for the other clients (web, mobile, streaming).

the architecture of mastodon

Sidekiq cool triq

Your sidekiq command in a systemd service should look a little different than most guidelines actually show.

The most important things are using systemd templates and structuring the sidekiq start command to only have one queue. Here’s a truncated example of a pull queue sidekiq:

# mastodon-sidekiq-pull@.service
[Unit]
Description=mastodon-sidekiq-pull
After=network.target

[Service]
Type=simple
User=mastodon
# ... snip
Environment="DB_POOL=10"
ExecStart=/usr/bin/bundle exec sidekiq -c $DB_POOL -q pull
# ... snip

A few numbers:

DB_POOL and the -c NN number need to match up. They don’t have to, but… they should.
ONLY ONE QUEUE IN THE SYSTEMD SERVICE

Here’s why.

This starts one process and creates 10 connections to the database. The overhead between this vs 2 systemd units with 10 threads is basically zero. There are reasons to do it (concurrency vs parallelism and ruby has a GIL which limits parallelism capabilities), but DB connection number is the same. So, realistically, there’s almost zero downside to having more systemd services.

BUT. You get the ability to log out various sidekiq queues and quickly narrow down which one is erroring. You also get the ability to scale up an individual queue better on demand. Keep the c number smaller (no more than 25) and make more as needed, it’s fine. That’s why these are systemd templates.

If you want to look at a very nice approach in more detail, see this blog post by Justin Warren. It does this quite well but lets you reuse the same template and modify parameters via environment files rather than by editing the systemd templates themselves. (I would suggest following the other advice I’ve given here: one queue per systemd service, don’t do weighted queues, etc; keep sidekiq as simple as possible).

Last bits of advice for sidekiq systemd services. Here are the magical numbers for the queues, which have already been tested for you.

For the default, push, and pull sidekiq queues: Set DB_POOL to 10 and set -c to the value of $DB_POOL.
For the ingress and scheduler queue: Set DB_POOL to 5 and set -c to the value of $DB_POOL.
For the mailer queue: Set DB_POOL to 1 and set -c to the value of $DB_POOL.

Ingress is very very CPU bound and will suck up a whole CPU core with very few threads, so be prepared to spin up multiple processes for the ingress queue. I used to say set the DB_POOL to 20, but I have changed this to 10 to reflect real world usage; shit just runs nicer at 10.

Ruby Knobs

WEB_CONCURRENCY controls the number of worker processes
MAX_THREADS controls the number of threads per process

Those above environment variables apply to mastodon-web and only mastodon-web. The sidekiq queue has two knobs: processes and threads. Each mastodon-sidekiq-${queue}@N creates 1 new process. Each process can allocate X threads according to the -c X setting in the ExecStart of the systemd service.

As a further annoyance, DB_POOL is the third hidden and extremely fucked up knob you have for the sidekiq services. DB_POOL can be different from the concurrency but it should always be DB_POOL >= X (where X is the concurrency in sidekiq -c X -q ...). As a simplification, I don’t really ever see anyone ever set DB_POOL to anything other than exactly X.

So, DB_POOL is local to sidekiq and also applies only to sidekiq (and also mastodon-streaming because surprise! Chuckles, that’s why). However, you have two notions of pool here. One that’s local to that particular sidekiq queue, and one that’s relevant to postgres. Postgres has a setting max_connections that is the global max_connections.
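Since DB_POOL and -c live in different places, it’s easy for them to drift apart. Here’s a tiny Python sketch that flags sidekiq units whose pool is smaller than their concurrency. The function name, the dictionary layout, and the unit names in the example are made up for illustration.

```python
def undersized_pools(units):
    """units maps a unit name to (DB_POOL, sidekiq -c value).

    Returns the names of units where DB_POOL < concurrency, i.e. where
    threads would have to wait on each other for a database connection.
    """
    return [name for name, (db_pool, concurrency) in units.items()
            if db_pool < concurrency]


units = {
    "mastodon-sidekiq-pull@1": (10, 10),     # matches, fine
    "mastodon-sidekiq-default@1": (5, 25),   # pool smaller than -c
}
print(undersized_pools(units))
```

Anything this prints is worth fixing before you start blaming postgres.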

Crawling Up the Elephant’s Trunk

What’s going on in there? Where do duplicate queues come from? How does that get resolved? Some helpful people on discord have given me some clues :)

Mastodon does a series of operations synchronously, but none of them are atomic. one of those operations is setting the deduplication key, another one is creating a local db entry, another one is scheduling a background task to blast out the post to all the instances you have followers on

@untitaker@hachyderm.io

In theory, this whole sequence should be atomic, but none of it is, so every single step has the ability to cause duplications and errors.

The order of operations here is (thanks @unlambda@hachyderm.io):

Check deduplication key
Write status to db
Process tags and update tag tables (in separate transaction)
Process mentions and update mention stats (in a separate transaction)
Schedule background jobs
Do some updates to tables which indicate “potential friendships” (mastodon will suggest people that you might want to follow based on who you have replied to)
Set deduplication key in redis

This will cause issues in several scenarios, but the following scenario is the one you’ll notice if your DB is overloaded:

You have a large backlog of DB transactions and it takes a while
Everything works, but nginx gives up after 60 seconds
However rails does not cancel the request and continues to process it
The app/webserver/etc retries after getting a timeout indication
- repeat until the app stops retrying

The summary here is that there’s a hidden component to scaling here:

Your proxy’s connection timeout should not be shorter than the time your server API takes to respond
If it is, clients will retry requests that are still being processed, and you’ll add a bunch of duplicate work for yourself and exacerbate the issue
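Here’s a toy model of that retry loop in Python. It’s a sketch and not how Mastodon or nginx actually implement anything. It assumes every attempt takes the same time to process, that the client retries the instant the proxy times out, and that the deduplication key only becomes visible once the first attempt finishes (setting it is the last step in the order of operations above). The function name and parameters are made up for illustration.

```python
def count_status_writes(processing_seconds, proxy_timeout_seconds, max_retries):
    """Count how many times one post gets written to the database."""
    dedup_key_set_at = None  # time at which the first attempt sets the key
    writes = 0
    for attempt in range(max_retries + 1):
        start = attempt * proxy_timeout_seconds
        # First step of each attempt: check the deduplication key.
        if dedup_key_set_at is not None and start >= dedup_key_set_at:
            break  # the server finally recognises this as a retry
        writes += 1  # key not visible yet, so the status is written again
        if dedup_key_set_at is None:
            dedup_key_set_at = start + processing_seconds
        if processing_seconds <= proxy_timeout_seconds:
            break  # the client got its response in time, so no retry
    return writes


# Overloaded DB: each request takes 150s, proxy gives up after 60s.
print(count_status_writes(150, 60, max_retries=5))
# Same load, but with the proxy timeout raised to 300s.
print(count_status_writes(150, 300, max_retries=5))
```

Under these assumptions the 60 second timeout produces three writes of the same post, and raising the timeout above the processing time brings it back down to one.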

Eating RCE and porridge for breakfast

Are you seeing a lot of Mastodon::RaceConditionErrors in your logs for sidekiq?

According to this issue that might be totally expected.

This particular issue has been fixed:

I am constantly getting this as well, and from what I can tell it’s because the retry timeout for RedisLock is set to a default of 10 seconds while the default expiration time for RedisLock is 15 minutes, per #16291

When called from ActivityPub::Activity::Announce this causes Sidekiq to retry until its timeout and then throw Mastodon::RaceConditionError which causes another Sidekiq retry.

But only by making the timeout and expiration the same, 15 minutes. The issue itself still remains and you will run into it if your sidekiq queue times ever go above 15 minutes.


Mother of All Outages

2023-04-19T00:00:00Z

Y’all ready for a story about one of the wildest ~~fuckups~~ production outages I ever took part in? Buckle up; we’re going for a ride far, far away from any security cameras.

Setting the Scene

At a previous job we had some fairly intense mismanagement. No tech debt was ever allowed to be handled. No good deed was ever unpunished. No non-white-male person was paid a market salary.

Y’know, the usual.

We had all of our infrastructure set up by one lonely SRE person for years. Then I came on, and two engineers from other teams joined the SRE team.

Our tech stack for the backend servers? VMs with Nomad, AWS, and sparkles. Amazingly cost effective, quite honestly.

Because business, the company had recently gone through a massive round of layoffs; they were contrite, they were distraught, they were thorough in their assurances to everyone that there wouldn’t be any more layoffs. Naturally, I knew they were lying; I knew it before they did, but I saw it plain as day.

Due to all of *gestures* this, the engineering department scored MASSIVELY badly in happiness. They were looking at staggeringly terrible end-of-year attrition rates.

I’m sure this had absolutely nothing to do whatsoever with the encrypted anonymous spreadsheet that “Someone Who Isn’t Me” started and spread around the entire engineering org to bring some salary transparency to light.

The fact that fem presenting people were massively underpaid and that quite a few people living in lower cost of living areas got extremely bad salaries also had nothing to do with this, I’m sure.

Naturally the solution to impending attrition woes was to do nothing. Haha. Business. BIZZ. NIZZ.

Drain in the Membrane

About a month before PAIN day, the person who set up all the infrastructure tech… Left.

I totally get it; greener pastures, better pay, less ~~illegal corporate exploitation~~ drama. Excellent choice, really; who could blame him? Looking back, I kinda wish I had made the same choice at the time. But now, behold! I was now the expert in the stack (kind of).

I mean, I wasn’t the expert they wanted or needed, but I was “The Person Who Is Currently Here” which is kind of the same thing except for where it’s not.

That said, everything continued to work flawlessly for a very very long time until one fateful day (about one month after The Expert left).

One small side tangent: our nomad servers looked like this

3 controller nodes + N worker nodes.

The controllers also ran consul and vault.

All of the observability infrastructure, integration with AWS, cron jobs, timers, event processing, etc, ran on the nomad worker nodes

The Fateful Day of PAIN

I get a message from a coworker that our observability stuff is just dead. Completely gone. Can’t get it working.

So I looked at the clusters … Turns out everything was down. Consul just shat the bed, and nothing could reach each other.

“How do we fix it?” they asked.

“Fuck if I know. None of us set it up” I replied, helpfully, like a broken Clippy from a pirated Word install.

However, I did get a debriefing on maintenance, and got to learn some of the quirks of the system.

One was you had to be very careful when restarting the nomad servers, but it was generally fine if you did an expand + cycle + shrink.

So, I made the decision to try that. And here’s where I fucked up:

I learned later that the expand + cycle + shrink mentioned by the former coworker was for the worker nodes only. (Obvious miscommunication in retrospect)
For controllers, going from 3 to 4 causes split brain.

The second point was also obvious in retrospect. I was working in a broken system that nobody understood in a toxic company under pressure from people who never once prioritized doing the right thing or addressing tech debt or, god forbid, preventing issues. Of course my choices were bad.
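For the curious, the arithmetic behind why an even number of controllers is a bad time looks roughly like this. It's the standard majority-quorum math that Raft-style systems (which Consul and Nomad servers are) rely on, not anything recovered from the actual cluster config; the function names are made up for illustration.

```typescript
// Majority quorum for a Raft-style cluster with `servers` voting members.
function quorumSize(servers: number): number {
  return Math.floor(servers / 2) + 1;
}

// How many voting members can be lost while a majority still exists.
function failureTolerance(servers: number): number {
  return servers - quorumSize(servers);
}

// True if a partition that can reach `reachable` of `servers` members
// still holds a majority and can therefore elect a leader.
function hasQuorum(reachable: number, servers: number): boolean {
  return reachable >= quorumSize(servers);
}
```

Three servers need 2 for quorum and four need 3, so both sizes tolerate exactly one loss. The fourth controller buys no extra safety, and a clean 2/2 partition of four leaves neither half with a majority.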

Long story short: I split brained the cluster and then cycled it. WOO!

This caused a very important thing to happen.

You remember me saying “oh Consul and Vault are also on those nodes”?

When you split the brain, the new nodes don’t join a quorum. Thus, state isn’t transferred.

… oh fuck.

We lost 3 years of secrets, credentials, configurations, etc.
Some of which didn’t exist anywhere else.
State replication of Consul had never been set up.
State replication of Vault had never been set up.
We had no backups of anything and no way to get them back.

✨ Gone ✨

Not only that. But now everything was on fire because the controller nodes were completely broken (they were already 90% broken but now they were 100% broken).

Luckily we had infrastructure as code! We can fix this! Right??

No.
We needed to bootstrap.
Nothing can help now.

The Strap and the Boot

I spent the next week, the next weekend, and through into the week after that rebuilding everything and reverse engineering stuff. I pored over chat logs, buried secrets, git histories, and hidden AWS configs. We got 90% of it back, but the other stuff was gone forever.

14 days after the incident and nothing had been fixed yet, despite the clusters now having been rebuilt and made fully operational. Why?

BOOTSTRAPPING.

We had health checks, crash loops, healthz, all that shit. None of it is ever calibrated from a cold start. You can’t.

We had dependency loops, cycles in services, we had missing stuff that wasn’t in the code, we had code that had never been run, we had code that was for future use and code that was retroactively added to guess how things were set up.
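To make the dependency-loop problem concrete: a cold start needs an order in which every service's dependencies come up before it does, and a cycle means no such order exists. Here's a minimal sketch using Kahn's algorithm. The graph shape and the function name are assumptions for illustration, and the real graph was far uglier than anything this tidy.

```typescript
// service name -> names of services that must be healthy before it can start
type DependencyGraph = { [service: string]: string[] };

interface ColdStartPlan {
  order: string[]; // a valid start order for everything that can start
  stuck: string[]; // services in a dependency cycle, or waiting on one
}

function planColdStart(graph: DependencyGraph): ColdStartPlan {
  // unmet[s] counts dependencies of s that have not been started yet.
  const unmet: { [service: string]: number } = {};
  // dependents[s] lists the services that are waiting on s.
  const dependents: { [service: string]: string[] } = {};

  const register = (service: string): void => {
    if (!Object.prototype.hasOwnProperty.call(unmet, service)) {
      unmet[service] = 0;
      dependents[service] = [];
    }
  };

  for (const service of Object.keys(graph)) {
    register(service);
    for (const dep of graph[service]) {
      register(dep);
      unmet[service] += 1;
      dependents[dep].push(service);
    }
  }

  // Kahn's algorithm: repeatedly start anything with no unmet dependencies.
  const ready = Object.keys(unmet).filter((s) => unmet[s] === 0);
  const order: string[] = [];
  while (ready.length > 0) {
    const service = ready.shift() as string;
    order.push(service);
    for (const dependent of dependents[service]) {
      unmet[dependent] -= 1;
      if (unmet[dependent] === 0) ready.push(dependent);
    }
  }

  // Anything still waiting can never start from cold.
  const stuck = Object.keys(unmet).filter((s) => unmet[s] > 0);
  return { order, stuck };
}
```

Feed it a hypothetical graph where two services each wait on the other and both land in `stuck`, along with everything downstream of them. That is the part no health check will ever tell you about.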

We got about 25% of everything back up, kind of, sorta; if you squinted, you could see where things were supposed to go, vaguely, that is.

Pissing on Faces and Pretending it’s Rain

Some of this restoration actually took place during an Engineering on-site. Myself and the one other SRE person left on the team worked during the entire on-site when we were supposed to be having fun; we dug into logs, pored over services, and attempted to baby things back to life one by one.

Then it was Friday. What happened that Friday? AWS released Serverless RDS and somehow, our RDS cluster got completely corrupted. No health checks failed, no alarms went off; pure, silent, deadly corruption. I had two options: try to fix the database, or restore it from a snapshot. Normally, this doesn’t matter; however, this database was years old, it was ancient, it was one of the first things ever set up.

And restoring from a snapshot means changing the ARN. But that ARN? It was a string that was hardcoded into almost every piece of infrastructure. Changing that ARN would take days of pleading with the gods of chaos. So naturally I tried really fuckin hard to not need to; unfortunately, I lost that battle.

Monday, 8am PST, I restored the database from a snapshot.
Monday, 9am PST, people start disappearing from slack.
Monday, 9:30am PST, calendar appointments with HR, the CTO, and CEO, start popping up on calendars of engineers.
Monday, 10:00am PST, people are posting titanic gifs in chats, frantically sending each other email addresses and phone numbers.
Monday, 10:30am PST, we figure out that about 90-95% of the engineering org is going to be laid off.
Monday, 11am PST, I have a meeting with HR and cheerfully explain that they’re now spending $10k a month in idle CI machines, and $6k a month for an RDS instance that isn’t connected to anything. I helpfully offer that they can call me or The Expert if they need help with repairing the current dumpster fire in the future, and say that we would be more than happy to give our consulting rates if asked.
Monday, 12:00pm PST, my work laptops are wiped remotely.

The Wailing of Cassandra

Now, fuck-ups happen, incidents happen, that’s fine. Why is this one the Mother of All Outages, for me? I don’t even know; I suppose it’s because the whole thing felt so fucking pointless. The whole damned thing, from beginning to end. Pointless. And to lay off everyone with the system broken beyond repair? To assume that you can keep on limping, wasting slowly away while clinging to the dying embers of a celestial god for eternity? I, to this day, have never understood the decisions that were made; there were many, but this was by far the least understandable.

It just kills me because I saw this coming. I actually made bets with The Expert about how long it would take for this to happen after he left. We guessed 2 weeks to 2 months (we were right). We both underestimated the severity, though. By a lot.

But we knew it was coming.

The other thing is that the entire thing could’ve been avoided had we been given time to move our vault over to the managed enterprise vault license.

That.
We.
Already.
Fucking.
Bought.
To.
Prevent.
This.

Epilogue

The company severely downsized the engineering org later. And, from back-channel news, I discovered that almost every engineer that made the cut left soon after.

The observability stack? Still not working.

But then, neither is anything else.

A month or two after that, they re-hired The Expert to bring the system back up; the consultant fees he charged were nearly the same as his original salary, but for 10% of the hours. Once he did that and documented it, the company apparently had planned to migrate the system to Kubernetes so that external consultants could maintain it. As far as I know, this was never completed.

To this day, The Expert occasionally consults for them here and there, reaping the ghosts of horrors past.

No one really seemed to do the math and realize that this cost the company more than what they saved by laying off 90% of their engineers. Having never once learned their lessons, they weren’t about to start now, I suppose.

Values of Convenience: Why Do We Not Make Life Better For Others?

2023-05-16T00:00:00Z

I was asked recently for my thoughts on a wonderful article about software correctness, human convenience, and flossing, and I ended up dumping out an entire blog post worth of thoughts. So, this blog post serves as both a reminder to myself to write more, and also a sincere apology to my wonderfully patient friend, Kelly, who graciously puts up with me dumping absolutely unholy amounts of text into their phone at all hours of the day.

I really liked the blog post, by the way. Hillel is an excellent writer, and I find myself agreeing with just about everything he’s ever written. He’s got some fascinating takes, and I find them so grounded in reality and experience. One thing that can be pretty difficult, especially with Formal Methods or other “Big Math” computer science topics is that it can become so easy to get so deeply inside your realm that you become wholly divorced from the concept of anyone ever having to actually learn it, much less apply it. Not all things need to be applied, of course, or even learned; but there’s this extraordinary clarity that comes from having polished an idea to a fine shine on the frustrated tears of students or inexperienced engineers that is very difficult to replicate in any other manner. Consequently, his work really resonates with me. “Proving systems right” is so inherently human; after all, what is a proof other than a miserable pile of arguments, and what is “correctness” other than a human ideal laced with emotion, not yet sullied by the ravages of reality.

One takeaway that I have from the post is that there’s an idea that I don’t see explored a lot, and it’s one of what exactly “smoothness” looks like. What does it mean for something to be convenient to use? And, more importantly, why the fuck does it matter at all. If it’s good for you, why don’t you floss? If it’s healthy for you, why don’t you eat salad more often?

I want to take a moment here and think about this from the other direction. Rather than thinking about what smoothness is, what does inconvenience look like? What does friction feel like? I think we, as humans, really want to experience friction like a hill; we really want to feel like there’s a smoothly rising slope and you can sort of calculate how much friction you’re willing to endure in order to get a certain trade-off. “Oh, that’s 2 frictions? But 4 goody-goody-yum-yum points? Sure, I’ll take that; it’s within my friction budget for the week.”

Pffh.

In reality, I think friction is a lot more like a thousand cliffs of varying size. But, not only are the cliffs of varying size, you are cursed with a few inconvenient truths.

Every friction cliff will be a different size to each individual
The second you scale your cliff, you will immediately forget how high it was
Every single cliff, no matter how tiny, can completely derail your ability to progress
Every time you attempt to estimate the size of the cliff you scaled, you will underestimate it
You will eventually forget that people are not on the same journey of cliffs as you

So, really, you’re fucked; completely and utterly fucked. Forget the curse of knowledge, or expert blindness; you’re doomed to eventually be the cliff that someone else must scale.

Joy! And now that we’ve thought about that cheerful note, let’s go to the next part of the article that really stood out to me, which is helpfully depressing to me for entirely separate reasons.

Similarly, in academia, UI/UX is low prestige work. […] The incentive structures are all messed up.

I think there’s potentially some very interesting implications here, and I want to unpack those. For the record, I agree with Hillel here, but this immediately brought to mind for me a single scorching thought

“That’s because it’s woman’s work.”

We’re currently trapped in a ~~dystopian hellscape~~ patriarchal society where the undercurrent of the internet and technological cultures is one of egocentric bias and rugged individualism.

On the egocentric side, the incentives around removing barriers seem to be non-existent or even dis-incentivized entirely. Why make life easier for the next person if it “devalues” your own resume of achievements? For people to stand on your shoulders in an egocentric zero-sum society, it would imply that you had not achieved greatness yourself. Far from the “standing on the shoulders of giants” ideals that we love to pretend we believe in, we seem to tend towards viewing that sort of progress as being at the expense of those who came before, as if the very act of forging ahead diminishes the path itself and those who laid it.

On the patriarchal side of things, I notice a similar pattern around work that’s “feminine” (and thus codified as inferior in nature) being work that focuses on community, empathy, helping others, and enriching culture. You can see this in how we value the work of nurses and teachers; when their work shifted from being one of dominance and control to that of nurturing and care, the careers became associated with women and with that came a lowering of respect and pay.

What does it mean for something to be convenient, and what does it mean for something to have friction? If we think of convenience as making life better for others, as working to build that which breathes wholeness into the soul of a community, then friction is merely the absence of that life. The cliffs of friction are the same as the cliffs of neglect; malevolence isn’t required, disinterest alone can build cliffs that no one could ever hope to scale.

What is this convenience? This enriching of the other? I want to tug on that a bit, and not in the least because I’ve been thinking a lot lately about Christopher Alexander and adrienne maree brown. (Warning: In the interest of brevity, I am about to condense hundreds of pages of nuanced literature into a few sentences and I humbly beg forgiveness for doing so, and as an aside: I will be eternally grateful to Erin Kissane for writing this amazing blog post, among others, that really exposed me for the first time to the duality of these two writers).

One thing that’s currently fascinating about them, to me, is that they seem to approach the same problem from opposite sides. They both want to build a healthier world that gives life to humans: Alexander through the question of how to build community-producing structure, and adrienne through the question of how to build structure-producing community. But the longer I think about it, the less coincidental I find it that Alexander–a white man–focuses on the systemic structure, approaching the problem from the top down; while adrienne–a mixed-race black queer woman–focuses on the community, seeing it as an essential and necessary prerequisite to the very idea of being able to build a structure in the first place, and consequently approaches the problem from the bottom up.

What is convenience?
What is convenience but a miserable pile of humanity?
What is convenience but a mirror that reflects purely the ability of one to reach another and thereby forge raw human connection from the aethers of desire?

What is convenience but the idea that in order to build a tower to reach the heavens, you must first reach into the heart of humanity itself.

Why is Browser Observability Hard

2023-07-10T00:00:00Z

So the big thing that makes everything so difficult for browsers is that opentelemetry has a concept of a lifecycle for telemetry that doesn’t map very well to how you ergonomically propagate context and correlate traces together. Opentelemetry works super super well in cases where you have a very linear callstack that’s fully synchronous in design. Something like request -> function A(a1, a2, a3...) -> function B(b1, b2, b3...) -> ... -> function N(n1,n2,n3...) -> response where the total lifetime of that is “reasonably short.” That is, to put it mildly, not the case in front-end systems. Front-End systems are event based inherently and work based off asynchronous callbacks and event loops, which is one of the architecture styles that fits most poorly into the “tree-like” structure that otel wants you to give it. Technically opentelemetry can work with and express anything that’s a directed acyclic graph (by way of using both links and parent/child relationships carefully), but using links is really annoying in most SDKs and it’s universally unclear how to most clearly initiate “child” spans if you don’t have visibility into the lifespan of the callee vs that of the caller.

On top of that, there’s React; simultaneously the best and worst thing to happen to frontend development. In addition to the browser being async and event-loop driven, React is a runtime on top of this which specifically is designed to:

give you no control over the lifetime of any root span
encourage you to make the lifetimes of every node as long as possible (for efficiency reasons)
not give you lifecycle hooks granular enough to synchronize your span lifecycle to that of a component

Even if 3 was solved by introducing the concept of “on component creation, on component render, on component removal, on component re-render” and whatever else was required for creating autoinstrumentation, that wouldn’t really work meaningfully. For one thing, you would have to build that into react.js itself and not anything on top of it. For another, root spans that can last indefinite amounts of time don’t work well in opentelemetry. Some people don’t refresh their browser tabs for weeks or months! It’s the last issue that makes front end stuff so difficult for opentelemetry. It’s just really not designed to make it ergonomic to go “page load happened, , oh look button press”. So how the fuck do you actually meaningfully instrument that? You can, of course, but you need to make almost everything a root span and correlate them together casually via attributes and, hopefully, also some links. Which won’t be ideal from a querying perspective, but is more honest than other approaches.
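Here's roughly what "make almost everything a root span and correlate via attributes and links" means, as a data-model sketch. This deliberately does not use the real OpenTelemetry SDK; the types, the `session.id` attribute name, and the toy ID generator are all stand-ins invented for illustration.

```typescript
interface SpanLinkSketch {
  traceId: string;
  spanId: string;
}

interface RootSpanSketch {
  traceId: string; // every interaction gets its own trace
  spanId: string;
  parentSpanId: undefined; // always a root: no weeks-long page span to hang off
  name: string;
  attributes: { [key: string]: string };
  links: SpanLinkSketch[]; // points back at the previous interaction, if any
}

// Toy ID generator so the sketch stays deterministic; real IDs are random hex.
let nextSketchId = 0;
function newSketchId(prefix: string): string {
  nextSketchId += 1;
  return prefix + "-" + nextSketchId;
}

class InteractionTracer {
  private sessionId: string;
  private previous: RootSpanSketch | undefined;

  constructor(sessionId: string) {
    this.sessionId = sessionId;
    this.previous = undefined;
  }

  // Record one user-visible event (page load, click, ...) as its own short root span.
  record(name: string): RootSpanSketch {
    const span: RootSpanSketch = {
      traceId: newSketchId("trace"),
      spanId: newSketchId("span"),
      parentSpanId: undefined,
      name: name,
      // The shared attribute is what lets you query "everything in this session".
      attributes: { "session.id": this.sessionId },
      // The link records ordering without pretending the click is a child of the page load.
      links: this.previous
        ? [{ traceId: this.previous.traceId, spanId: this.previous.spanId }]
        : [],
    };
    this.previous = span;
    return span;
  }
}
```

Recording a page load and then a button press yields two separate traces that share a session attribute, with the second linking back to the first. Querying that is clunkier than walking one big tree, which is exactly the trade-off described above.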

Lastly, the browser doesn’t support grpc, data loss is more common, compression is vital, and the weight of instrumentation size is extremely important because blowing up someone’s data plan is inconsiderate. So this is one of the areas in which the user starts to pay really heavily for high cardinality, and data volume needs to be very judiciously monitored. You also don’t really have the option of running an in-browser version of a telemetry collector, but that’s exactly what you need a lot of the time to do the most effective curtailing of bandwidth. Even if that existed as a thing, it would bloat the page with even more javascript, cost the user even more battery life to run on their 5 year old phone, and make the user experience even worse.

There’s also api authentication issues with browsers needing to be able to send telemetry to an endpoint without being authenticated. Honeycomb solves that pretty well, but you need to think decently hard about that if you build an /api/telemetry endpoint (which you probably should). Which is a lot more work than “just yeet this straight to honeycomb for a proof of concept and then we can figure out collectors and refinery and whatever later.”

Baggage is how you’re “supposed” to build context that can be shared between services so that you can correlate a backend trace with a frontend trace. It’s probably one of the most confusing, least-out-of-the-box experiences you will ever encounter, and there’s no useful way to set that up nicely without really understanding what you’re doing in both the frontend and the backend. Which is another super difficult thing about the frontend. Rolling your own way to tie together every service instead of having that be the “normal” thing that self discovers the connections is, imo, a sign of immaturity in the space.
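For a sense of what's actually on the wire: the W3C Baggage format is a single `baggage` header holding comma-separated `key=value` members, with values percent-encoded. Below is a hand-rolled sketch of encoding and decoding that header value. The entry names are hypothetical, keys are assumed to already be valid header tokens, and in practice an SDK propagator is supposed to handle this for you.

```typescript
type BaggageEntries = { [key: string]: string };

// Serialize entries into the value of a `baggage` header: k1=v1,k2=v2
// Assumes keys are already valid tokens; only values are percent-encoded.
function encodeBaggage(entries: BaggageEntries): string {
  return Object.keys(entries)
    .map((key) => key + "=" + encodeURIComponent(entries[key]))
    .join(",");
}

// Parse a `baggage` header value back into entries.
// Per-member properties (anything after ";") are ignored in this sketch.
function decodeBaggage(header: string): BaggageEntries {
  const entries: BaggageEntries = {};
  for (const member of header.split(",")) {
    const keyValue = member.split(";")[0];
    const separator = keyValue.indexOf("=");
    if (separator <= 0) continue; // skip empty or malformed members
    const key = keyValue.slice(0, separator).trim();
    const rawValue = keyValue.slice(separator + 1).trim();
    if (key.length === 0) continue;
    try {
      entries[key] = decodeURIComponent(rawValue);
    } catch (e) {
      continue; // skip members with broken percent-encoding
    }
  }
  return entries;
}
```

The frontend would attach the encoded string to outgoing requests and the backend would decode it and copy the values onto its own spans. The format is the easy part; agreeing on which keys exist and making every service actually do this is where the pain lives.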

So You Want to Hire for Developer Tooling

2023-07-14T00:00:00Z

I see you want to hire a developer to work on internal developer tooling, developer experience, and the generally intangible but admirable goal of “making life better for devs”. That’s awesome; you’ve got one hell of a challenge ahead of you. This role is extremely difficult to hire for. In my opinion, and in my experience, it’s been the most difficult role in the company outside of senior leadership, and the most likely to fail; if there ever was a role that burns people out, it’s this one. Tread carefully, and good luck. You’ll need it.

You probably have some questions, such as:

What do they even do? (If you’re really confident you can already answer this question, I urge you to throw that confidence away and light it on fire. It will not help you here.)
How do I interview this person?
What should I look for?

I’m going to go over all of these, but first I want to provide some background into why I’m talking about this.

I have been this person multiple times in various companies. It has been a mixed bag, to put it mildly.

In one company, I was hired as “the first devops person”. They fundamentally misunderstood what they needed and were institutionally incapable of handling or addressing cross cutting concerns. Once I realized that what they wanted was purely hourly labor of cheaper toil, I built them a “what they needed but not what they asked for” platform by scraping together enough time from various teams and then left once it was operational.
In another company, I was hired as a Staff Security-oriented SRE but they actually needed tooling expertise more so I built that for them. It went well, but they didn’t go out of their way to actually hire for that.
I have been hired for a role (stability / infrastructure / resilience) and had people hire me with the generic “backend software engineer” interview loops. The loop itself went alright, but that was more me being abnormally good at both backend and this rather than any indicator of their skill in placing me. That company underleveled me significantly and I left shortly after when it became obvious that they were incapable of seeing the value that the role was intended to capture.

There’s a trend here, really, and I think it’s a common one. If a company hires “smart people who do things,” they seem to be very prone to fucking this up. I’m not sure why this correlation seems to exist (although I have my suspicions), but I have noticed it repeatedly.

To wit, I would personally not see this role as a dev tools role; I would also not see it as operationally oriented. What you’re looking for, I think, is someone who can take “developer experience” and push that forward holistically by whatever way is necessary. The hardest things they will have to do are gaining the trust of the entire engineering organization, getting buy-in for their approach, and delivering perceived value and improvements.

I’m reminded of the concept that there are several inflection points at which organizations change and their needs evolve; importantly, the nature of how work becomes visible and how coordination happens fundamentally shifts. Anecdotally, I’ve found these numbers to be true–you may recognize them as being related to Dunbar’s number(s): 5, 15, 50, 150, 500, 1500.

Here’s how I personally apply them to the general bucket of “not product engineering”, which includes but isn’t limited to: infrastructure, operations, and developer experience.

5 Engineers: The number of engineers you can have and work without docs. The true bliss of “yolo driven development.”
15 Engineers: You now need documentation, but still don’t need “real” infrastructure (or pretty much anything else).
50 Engineers:
- The threshold by which it makes sense to have one person specialized on infrastructure, ops stuff, developer environments, CI/CD, etc.
- Start building what will become the internal platform; but don’t build the platform yet, it’s still too early.
150 Engineers:
- The threshold where it makes sense to transition from people-driven coordination to process-driven coordination.
- You should have something that resembles an internal platform, but it’s not a full platform yet.
- If you don’t have anyone who truly understands Progressive Delivery and quality assurance, you need one.
- Knowledge management as an institutional capability is no longer optional and is likely sorely overdue.
500 Engineers:
- The threshold by which developer environments, cost optimization, infrastructure, security, all pay for themselves as fully separate and independent teams of expertise in addition to people closer to the teams who work to improve these functions.
- You should have an internal platform that is fully fleshed out.
- Enabling experimentation, progressive delivery, and effective testing as an expertise is no longer optional.
1500 Engineers:
- Developer experience, infrastructure, cost visibility, security, etc., should be embedded into the culture, exist as teams within organizations, and also as a separate organization.
- The idea that you can have basically any engineering function without hiring industry experts in that function should seem both insulting and laughable; even if you hire them as consultants, you should understand deeply that success means leveraging others and you now have the funding to fully do so.

Figure out where in here you are and how much catching up you have to do. This role probably doesn’t make sense until you’re at 50 engineers, but it’s not a bad idea to start thinking about it at 15.
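If it helps to see the thresholds as a lookup, here's a tiny sketch. The summaries are condensed from the list above, and treating each number as a lower bound (so 80 engineers counts as the 50 stage) is an assumption of this sketch, not a hard rule.

```typescript
interface OrgStageSketch {
  threshold: number; // engineer count at which this stage kicks in
  summary: string;
}

const ORG_STAGE_SKETCH: OrgStageSketch[] = [
  { threshold: 5, summary: "no docs needed; yolo driven development" },
  { threshold: 15, summary: "documentation needed; no real infrastructure yet" },
  { threshold: 50, summary: "one infrastructure/ops/dev-environment specialist; start toward a platform" },
  { threshold: 150, summary: "process-driven coordination; proto-platform; progressive delivery; knowledge management" },
  { threshold: 500, summary: "independent expert teams; fully fleshed out internal platform" },
  { threshold: 1500, summary: "embedded in culture, in teams, and as a separate organization" },
];

// Returns the highest stage whose threshold has been reached.
// Organizations below the first threshold are treated as the first stage.
function stageFor(engineers: number): OrgStageSketch {
  let current = ORG_STAGE_SKETCH[0];
  for (const stage of ORG_STAGE_SKETCH) {
    if (engineers >= stage.threshold) current = stage;
  }
  return current;
}
```

The gap between the stage you're in by headcount and the stage your practices actually match is the catching up.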

How To Fuck Up Before You Even Start Hiring

Not having an answer for “How does this role demonstrate value.”
Not having significant buy-in across the entire CTO org for recognizing the need of this role and the benefits it will deliver.
“We just need someone to implement X, Y, and Z and migrate us from a few tools.” No.
Literally any desire for this role involving the word “kubernetes”. It’s a fantastic tool; that is not this role.
Not having a good picture for how consensus happens, a good process around moving from decision to action to execution, or a willingness to implement one top-down from senior leadership.
Doing things way before you’re ready:
- For example, self service catalogues are great. Implementing Spotify’s Backstage before you’re at 500-1,500 engineers is a mistake.
- “Let’s have the dev tools person implement observability” is going to end badly.
Culture, then process, then tooling, then process, then culture.
- “It could’ve been an email” applies to overengineering your CI pipelines just as much as it does to useless meetings.
Not having a way to get visibility into your actual needs
- If I were to be in this role again, I would be fine doing interviewing tours amongst all the EMs and tech leads every month. However, companies like getdx exist now to automate the vast majority of that toil; use them to set this role up for success.

What Should Developer Tooling People Work On

In reality, this list should be informed by actual answers from engineers, where the “dev tools person” interviews everyone and figures out:

immediate pain points
medium term plans
long term goals
shared frustrations
things teams aren’t aware of but cause friction

That list should then be categorized, prioritized, and an appropriate allocation of time should be spent on it. In my experience, it has always been that immediate pain points need 80%+ time allocation for the first quarter, because nobody ever hires for this role before it’s too late. Eventually, a 30/30/30 split of immediate pain, medium term plans, and knowledge sharing is a great place to be. You’ll notice I didn’t allocate any time to items 3, 4 and 5; that was intentional.

Being the only hire in this role means they won’t get to work on the long term goals because there’s absolutely no way to make meaningful progress on them quickly enough for it to matter. Long term goals should be turned into medium term goals, and frustration and friction points are things where leading without authority starts to come into play; progress there is made by sharing knowledge, writing process, showing, demonstrating, and teaching, not by plowing ahead on massively scoped projects. When leadership without authority happens successfully, along with delivery of value in the short and medium term, the ROI for more people doing this will become apparent, and demand for headcount will organically happen.

As a leader, you’ll know this role is being executed successfully when cross-team and cross-functional collaboration starts to happen more; another strong indicator will be when other managers and leaders start to ask for more headcount in the developer tooling and infrastructure functions.

All of that said, the below list of projects is pretty much guaranteed to be positive ROI; I haven’t gone wrong by picking something off of this list and rolling with it if I didn’t have a more compelling first option:

Fully automated developer onboarding and local developer environments
Comprehensive documentation strategies, testing strategies
Building out Progressive Delivery as a capability - ability to rollback deploys, deploy feature flags, and drive feature flag driven development
Build system performance improvements and reliability improvements
Roll out a comprehensive philosophy and approach to observability, including (but not limited to): cost consciousness, performance, distributed tracing in production and CI
Finding one cross-functional collaboration point, automate aspects of it, and reduce friction there
- Nothing says “I know how to improve things where it actually hurts” like bringing more visibility into tickets and making it easier to open and close them
Find a new project a team is about to do, sit in on planning, and take notes. Look for opportunities to notice when multiple teams are trying to solve the same problem, and bridge that communication gap.

Crucially here, the takeaway is that I would expect this person to succeed if, and likely only if, there is some visibility into showing what the actual needs of the company are, and they have the ability to globally prioritize needs as well as locally drive improvements.

How Do I Screen For This Role

Here’s the gist: This role requires leading without authority. It is not about programming. It is not about technical skills. It is not about architecture.

If you screen for those, you will probably fail to hire someone who will succeed in this role. If you utilize a whiteboard algorithms interview, you will actively screen out everyone who is qualified to do this role; they will be capable of doing the interview just fine, they will just tell you to fuck off. They will be right.

If you do not reach the offer stage with at least 50% of the pipeline being women and at least 40% of the pipeline being other underrepresented minorities, you fucked up.

Frankly, who the fuck do you think is most qualified to lead without authority and work within systems to drive change than those who have been systematically oppressed, denied leadership roles and opportunities, and have had to succeed despite that? If you are screening out the experts in sociotechnical systems, you are doing it wrong; put this article down and fix your pipeline.

If you want to hire someone who knows how to pull off a developer experience transformation and building all of that out, the things that would highlight that strength are:

Skip the coding interview. You’ll probably need some technical aptitude, but this is best measured with a coding review, or even better yet, an architecture review.
Lean in on questions that ask them how they drive organizational change. You’re asking for someone to be an expert in leading without authority and doing so is incredibly challenging even with leadership buy-in. If this doesn’t go well, it will probably be the reason they quit, and hiring their backfill will be 10x harder than it should be.
My favorite architecture/technical question here is asking people to walk through how they build a paperclip maximizer. I personally call it an addition function. Here’s the question: “let’s say I want to add two numbers and return a result, how do I scale that taking into account people, coordinating teams, software architecture, and infrastructure?” You’re going to be looking for people who can walk the evolution of a company and point out how the nature of coordination, tooling requirements, architecture needs, etc., fundamentally change both as the software scales as well as the organization.
People don’t do well in this role if they don’t recognize the sociotechnical nature of the work; they will also not do well in this role if you don’t recognize the sociotechnical nature of the work. Empowering the social humanity with technology and humanizing the technical systems is key to this role and most people don’t seem to understand how to do that. Look for indicators of this thinking throughout answers.
Ask about times they have done something intentionally that is not a best practice. Example: One of my favorite stories to tell is when I turned off all of the on-call for the entire company. Leadership refused to prioritize stability, the alerts were not actionable, and fatigue was burning teams out; so I turned it off rather than fight against leadership priorities. That’s the kind of thinking that will be required to succeed here; working with the dysfunctions of an organization to improve the health of the engineers is really the value here, not migrating a CI system.
Look for the types of questions they ask when interviewing you. High quality questions speak volumes. Some great questions would be:
- how does the company think about value, ROI, and what incentivizes work
- what does high impact mean at the company
- what does leading without authority look like at the company
- what does success in this role look like. When I ask this, I always poke and prod at the answer; I want to know why it looks like that and not like a different way. Look for someone who can ask this question and follow it up with the “why not X instead” so that they understand the outcome behind success rather than the simple outputs
- what are the pain points people currently have, and how would one measure addressing those?
- how does the company build consensus, how do decisions get made, and how do decisions turn into action
- what are the dysfunctions of the company and quirks of its communication gaps? (I have never had a company answer this effectively or accurately, but discovering the delta between the honest effort to answer and reality is very illuminating during my first 90 days)

If you don’t have good answers to those questions, by the way, this role will not be successful, and you have more fundamental problems in engineering to address first.

Fundamentally, this role interviews best when the interviewers know what high quality expertise looks like and allow the candidate to just talk. Like can identify like very rapidly in most cases. Which means, quite honestly, that if you don’t have a good artistic sense and aesthetic for what high quality engineering truly looks like, you will be unable to hire for this role effectively, regardless of your process. If you fail to hire for this role, consider that a strong indicator, and take the opportunity to reflect on the implications of that.

Candidly, you should worry less than you think you need to about having an “objective” interview process. This person will have to lead without authority, institute company wide change, and is going to be hired into the most difficult role to succeed in outside of senior leadership. “The Vibes feel good; I would trust this person to tell me very uncomfortable things about stuff I am personally proud of” is absolutely something to aim for over most everything else. However, the full implications of this are not always obvious. For example, you will need to be very painfully aware that if you don’t have diversity in senior leadership, hiring someone who is not a white male will likely not turn out well. The exceptions to this, in my opinion, prove the rule; I have personally been both the exception and the rule, here. Many companies will be uncomfortable with this; which is one reason why this role is so prone to failure.

This role is truly a Sociotechnical Engineer, in every sense of the term; they will expose the weaknesses of your company in ways you are not prepared for, and they will challenge the status quo in ways that are painful. Embrace it. Be prepared to grow as much as, if not more than, they do.

The Power of Being New: A Proven Recipe for High Impact

2023-07-17T00:00:00Z

When starting a new job as a software engineer, it’s natural to feel the pressure of delivering immediate value and meeting the expectations of your role. However, there’s a unique opportunity during this initial period that often goes unnoticed: nobody expects you to actually do useful work right away. So not only can you feel free to identify and solve problems that others might have grown accustomed to or overlooked, you’ll have a fresh set of eyes that have not yet grown accustomed to the pains of the job.

While you’ll lack in-depth knowledge of the existing systems or workflows, this is actually a good thing here! You’re going to run into every single problem possible during the onboarding process, like a cartoon character running straight into a rake over and over. Embrace the pain, it builds character (well, not really, but it provides really good opportunities).

I’ve been able to take this approach and do some pretty cool things with it in my career.

At a previous company I onboarded, improved onboarding documentation, synthesized cross-org inefficiencies, wrote a technical doc on developer productivity and how it fit into the company, got buy-in, implemented it, shepherded it, and onboarded the entire org, all in one month–the month that I joined.

Some things that made this possible:

Before joining, I knew they were interested in this so I was already looking for it.
I have years of experience in internalizing other people’s workflows and improving them without wrecking them. I am very good at it.
Most importantly? It didn’t require a lot of deep context, and I knew how to implement changes in a way that was opt-in without breaking individual workflows.

The other time I delivered strong results immediately after being hired was when I came into a company, onboarded myself, broke major communication silos, internalized a very poorly communicated product, repaired trust between multiple teams, and broke a 3 month roadblock. In my first 2 weeks. I became the tech lead of the infrastructure team in those first two weeks as well. I had to reshape some mental models, coach and mentor some people, and start improving some practices while planning for 2 weeks, 3 months, and 6 months down the road concurrently.

Again, a lot of that didn’t require deep context in order to start the changing process and know where to go; the important part was to utilize my empathy, listen to others, understand their viewpoints, and solve their problems. That’s the magic, by the way.

I’m going to break this down into a series of steps. Here’s the formula:

Step 1: Take Notes and Absorb

During the onboarding process, make it a habit to take detailed notes. If something is weird, take notes. If something is confusing, write it down. If something goes wrong, make a note! Not only will this help you understand the organization’s systems and processes, but it will also allow you to identify potential areas for improvement. How an organization addresses its shortcomings is often more valuable than how well it gets things right.

Ask “why” a lot, and ask people for their opinions as well. Then write those down, all of them. Very rarely do people document the whys, and even when they do, the whys are very rarely consistent across different viewpoints. Those tidbits of information can be crucial in helping you later.

Take notes about the people too. Your notes on first impressions will be extremely important for identifying biases. If someone says something about another person, write that down too; reverse engineering how people think about each other can bring up a lot of interesting points and subtleties. For example, if someone says something is bad and terrible, is it really bad and terrible, or did they have a really negative experience with another coworker and now they’re biased? That’s totally possible! And we want to hold space for that, because it’s completely valid to have had negative experiences and have your current ones be filtered through that; but, as the new person, you’ll want to be aware of this so that you don’t perpetuate biases or grudges. It’ll open you up to being able to be a healer in a space as well, should you need to be.

Take notes and absorb. Talk less, write more.

Step 2: Identify Opportunities for Improvement

(that don’t involve “Big Changes”)

Questions to ask yourself during the onboarding process:

Are there repetitive tasks that could be automated?
Are there manual processes that could benefit from streamlining?
Are there missing steps in the documentation?
Are there people that have to coordinate efforts when those efforts could be centralized?
Is information scattered everywhere, out of date, wrong, or all of the above?

Here’s the important magic that makes this work:

None of these changes touch existing code
None of these changes affect an existing developer’s workflow

You don’t have the context and understanding required yet to change someone’s workflow and not piss them off, so… Don’t do that. Easy, right? Knowing what you can change is half the battle; naturally, reading everything I write cause it’s awesome and hilarious is the other half of the battle. The battle ain’t the war though, so, y’know; measure twice, break prod (only) once, and all that.

Step 3: Ask People What Sucked

Okay, here’s the deal. You’re going to start asking people questions, but this is very easy to fuck up, and if you fuck it up, you’re going to set a bad impression that will be very difficult to undo later.

Luckily, there’s a simple process for success here. Here’s the rule:

Ask them about what hurts
Listen. Listen harder. Keep listening. Listen until your ears bleed. SHUT THE FUCK UP. Listen.
Take notes. So many notes. Get things in their wording, then repeat it back rephrased to make sure you understand what they’re saying; ask them to validate that.
Use every active listening strategy you know. This will be very draining, and that’s fine. You’re here to listen and that takes real emotional energy.
VALIDATE THEIR FEELINGS. EMPATHIZE WITH THEM. DO NOT FIX IT.

I know you want to. Stop it. Bad. No fixy.

DO. NOT. FIX. THE. PROBLEM.

This step is about understanding your coworkers and how they observe their work environment, how they think, and what causes them pain. Soon, you’ll be able to think about maybe fixing it, but right now is the time for connecting with them as humans, holding space for their frustrations, and letting them be heard.

You will be absolutely shocked and heartbroken when you find out how many of these engineers will feel heard for the first time ever in their entire tenure at this company. No matter how perfect and amazing the company culture is, I guarantee this will be true.

Just… Be there for them, okay? You can fix the problems later; but right now, they need to know you can hear them as they are and understand what they have to say.

Step 4: Fix Local Development Environments

Here’s where the fixing happens, and here’s how to do it:

Find every channel you can in slack related to your team, and the teams immediately interacting with your team
Join all of them
Scroll through the last 2-6 months of scrollback in every channel
Write down or note every single problem people talked about relating to CI, the test suite, local developer environments, “hey this is broken locally but works in CI”, etc.
???
Win a Nobel prize for fixing the world’s most complicated problem
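
If you want a head start on the note-taking step, here’s a minimal sketch of tallying complaints by theme. It assumes you’ve already pasted or exported the scrollback into a plain list of message strings; the `THEMES` keywords and the `tally_pain_points` helper are made up for illustration, so tune them to whatever your team actually gripes about.

```python
from collections import Counter

# Hypothetical themes and trigger phrases; adjust to your team's vocabulary.
THEMES = {
    "ci": ["pipeline", "build failed", "ci is"],
    "tests": ["flaky", "test suite"],
    "local env": ["works in ci", "broken locally", "node_modules"],
}


def tally_pain_points(messages: list[str]) -> Counter:
    """Count how many messages mention each theme (at most once per message)."""
    counts: Counter = Counter()
    for message in messages:
        text = message.lower()
        for theme, phrases in THEMES.items():
            # Crude substring match: good enough to point you at what to read.
            if any(phrase in text for phrase in phrases):
                counts[theme] += 1
    return counts


# Made-up sample messages, standing in for real scrollback.
sample = [
    "hey this is broken locally but works in CI",
    "the test suite is flaky again",
    "pipeline has been red all morning",
    "anyone else have to rm -rf node_modules every day?",
]

for theme, count in tally_pain_points(sample).most_common():
    print(f"{theme}: {count}")
```

Substring matching like this is crude on purpose. It’s a prompt for where to read more closely, not a replacement for actually reading the scrollback.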

One often overlooked area where you can make an immediate impact is in local development environments. This is because people often know exactly what’s wrong, and usually they even know pretty much exactly how to fix it. So why does nobody do it, despite it having obvious immediate impact and even calculable efficiency payoffs?

Humans are really really bad at doing certain types of math, that's why.
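
To make that math concrete, here’s a rough sketch of the payoff calculation. Every number in it is invented for illustration (20 engineers, 15 minutes lost per day, a 40 hour fix), and the function names are hypothetical; plug in your own guesses.

```python
def weekly_hours_lost(engineers: int, minutes_per_day: float, workdays: int = 5) -> float:
    """Total engineer-hours per week burned on a recurring annoyance."""
    return engineers * minutes_per_day * workdays / 60


def weeks_to_break_even(fix_hours: float, engineers: int, minutes_per_day: float) -> float:
    """How many weeks until the time spent on the fix has paid for itself."""
    return fix_hours / weekly_hours_lost(engineers, minutes_per_day)


# Assumed numbers: 20 engineers each lose 15 minutes a day to a broken setup.
lost = weekly_hours_lost(engineers=20, minutes_per_day=15)
# Assumed cost: one person spends a full 40 hour week fixing it.
payback = weeks_to_break_even(fix_hours=40, engineers=20, minutes_per_day=15)

print(f"{lost} hours lost per week, fix pays for itself in {payback} weeks")
```

Fifteen minutes a day feels like nothing to any one person, which is exactly why nobody prioritizes it; multiplied across the team, it’s more than half of a full-time engineer’s week, every week.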

There are two types of developer environments: those that are sorta broken, and those that are super broken for the Junior Engineers but never get fixed because the Senior Engineers have workarounds that let them be productive. You can get a feel for which situation your company is in pretty quickly just from looking at slack; it’s a fun superpower once you get good at it.

I do this with every company I join and people are flabbergasted at how quickly I understand what’s weird about a company and how to navigate the quirks, or how I know who to go to for what. It’s like showing up at a bar with your friend and their friends and you somehow know about all of the relationship drama, all of the weird nonsense, all the inside jokes, and everything else. It’s great; I highly recommend.

Once you’ve identified the 3-5 most annoying or repetitive problems: ask in chat casually “hey has anyone noticed this as a problem?” You really want to avoid jumping straight into “hey imma solve this.” Don’t do that, nobody wants that; even if you’re right, people will legitimately be offended if someone comes in and fixes their shit without asking. It’s like if you invite a friend over to be a roommate and the first thing they do is organize your sock drawer; like, okay, maybe it needed a little bit of organizing, but are you for real? Meanwhile the attic is molding, the sink is flooding, the laundry machine is haunted, and the basement has a cryptid in it.

Anyways. Socks are not the point here. The point is that IF AND ONLY IF everyone chimes in with haven’t you people ever heard of rm -rf node_modules, bro, it’s much better to do than try and fix all of these constant ills and agonies OHHHHH.

Wait, where was I? Anyways, if everyone chimes in with “yeah that sucks,” then offer to come up with a solution for it, and if people like the solution, offer to fix it.

You’ll be the hero, angels will weep, the heavens will open, rainbows will glisten, fairies will frolic, etc etc. Here’s the key part though.
And I mean it.

ONLY FIX IT IF PEOPLE SAY IT SUCKS.

Remember, you’re new here. You still have zero prestige among the team, and zero trust; you need to meet them where they’re at and address things they care about. Now, once you deliver this, that’ll change. People will think you’re amazing, cause you are; they’ll think you’re brilliant too, cause you followed my advice, and I’m a smart cookie. It’s a win win, really.

The Takeaway: Empathy, Empathy, Empathy, Empathy, Empathy

Embrace your role as the new person and leverage that. Take notes. So many fuckin notes. Seriously. I usually take about 15 pages of notes in my first month of working somewhere and it has never not paid off. Do some of the shit work that nobody wants to do, and nobody can ever prioritize, but you can! Because you’re currently suffering from it! Awesomesauce! Radical. poggers.

Remember, being new is not a disadvantage–it’s an opportunity to make a difference by being vulnerable and open-minded. It’s also important to remember that the difference you can make now is in the relationships you form, the people you listen to, and the things you can do for others. Embrace it, and be there for them. If you can do that, they’ll remember you forever.

Observations of Leadership (Part One)

2024-03-01T00:00:00Z

I read this post from John Cutler and Tom Kerwin recently on how leaders navigate uncertainty and ambiguity and it intrigued me. I decided to take a shot at answering these as a writing exercise and as an opportunity for self reflection. The past few quarters have seen a lot of change for me, and I haven’t taken the time I need to reflect as much as I would otherwise wish; this seems like as good of an opportunity as any. For each of these, I’m going to copy in the interview question and then answer it very similarly to how I would answer it during an interview (but without any of the time or brevity constraints). I’m actually quite curious to see what other people have to say about my answers, and what answers others have of their own.

As a brief bit of background, I’m going to be referring to my current job quite a bit, but how I’m doing so is probably going to be a bit confusing because it’s been a very unusual journey. Here’s the very shortened timeline:

I came into the company as an IC.
Shortly after doing so, our Head of Infrastructure and Engineering Manager (same person) left; I stepped up to assume the role in the interim while we looked for a new hire.
After one quarter (and some change), we hired our new Head of Infrastructure, and I stayed on as “just” the Engineering Manager of the team for another quarter.
At the start of the year, we made the decision to transfer a Director from elsewhere in the company into my role, as the role had expanded.
In doing so, I stepped into my current role as Principal Architect of the Platform Organization (which is what I was essentially hired to do in the beginning).

I do plan to write about this in the future in more detail, because I think there were a lot of things to unpack and a lot of things to learn; frankly, we don’t write enough about the “interim” roles and how to set them up for success. So much of the writing on leadership out there assumes a 2-3+ year timescale; it’s not wrong for doing so, but there were quite a few things I didn’t do effectively because I didn’t have experience in being an interim leader (or, well, any sort of leadership, to be honest). But, this is not that article; this is the article where I go way too in depth on all of these questions.

It’s going to be quite long, sorry-not-sorry. This is also going to have to be a multi part series because I started writing this a week ago and only made it through five of the questions before realizing how long it had already gotten.

Accept We Are Part of the Problem

Can you share a specific instance when you recognized your contribution to a problem? What led to this realization, and how did it influence your actions in the future?

– https://cutlefish.substack.com/i/142017363/accept-we-are-part-of-the-problem

Firstly, I love this question, what a banger to start things out with. It’s not about failure, it’s about learning and growth, but in a different perspective than I see most leadership questions tackling.

Here’s an instance for you, looking back into my time most recently as an Interim Director of Infrastructure. To put it kindly, I stepped into this role because there was an urgent need at the company and I was able to address it; in no way was I particularly qualified for it, and I most certainly was not experienced. I’m going to lay out the situation briefly, break it down into external factors, internal factors, and then address the part where I realized later (with the help of my SVP) what I could’ve done differently; in full transparency, I’m still working on the “how did it influence your actions” part myself.

Getting to the situation in question, as I perceived it: We had a critical under-investment in infrastructure, resulting in a team that was extremely underwater, had far too high work in progress, and was unable to even communicate the problem in a way that external stakeholders could understand. When I came in, one of the first things I did was to address this by attempting to increase visibility here. By all accounts, I was wildly successful: During my tenure so far, we’ve gone from 5 ICs and one manager to multiple teams, including a dedicated Data Infrastructure team, a dedicated Developer Experience team, a platform team, and an infrastructure team. We have an amazing SVP now (note: titles are a bit fuzzy here still, my usage of titling here reflects scope more than reality), and we’ve been able to hire what is the most diverse and welcoming organization in the company. I can’t stress this enough: I am enormously proud of this organization.

Now, let’s get to the part where I fucked up: to put it directly, I did an okay job at showcasing the severity of the situation, and I could’ve done much better. One of the things that’s so difficult about leadership is that you can really only start to realize this type of thing by the nature of the conversations you have months down the road after it’s a bit too late to directly address them. If I were to break down an ideal scenario for what I could’ve done, it would be:

Recognize and position myself as an interim director with the sole focus of preparing for the next change in leadership
In doing so, one of the highest impact things I could’ve done was: documenting, describing, and quantifying the scope of the problem. I did the describing part really well, I’m proud of that; but I did precious little documentation of it, which led to repeated conversations and some uncomfortable moments for my SVP as she came in and had very little ability to immediately present clear and quantifiable cases to the rest of leadership for the problem that she and I were both able to articulate.

The ability to articulate the problem at all is something I helped develop, but the impact of that was greatly diminished by not quantifying and documenting the problem. I paid for that mistake a lot over the next two quarters and I’m still paying it down now. The repeated conversations, the lack of ability to transfer understanding over, and the difficulty in presenting information in a way that our CTO could push up and use for effective global resource management across the company were a big miss here; while we had a significant number of contributing factors there, my inexperience played a huge part as well.

That said, I really enjoyed the opportunity to learn in a very short amount of time exactly what type of information people need in order to express these types of thorny issues; I’m very good at identifying and describing them, and I’ve been unusually good at convincing people and aligning them around solutions, but you have to go several steps further than that in leadership. It’s not enough to get everyone to go “yeah that’s great, let’s solve this problem”; that’s just the beginning.

You have to be able to present the information and package it up in a way that it can be measured and balanced against the needs of the entire company, all the way up to the board if necessary. That’s hard! Most people take years to learn that this type of packaging is even necessary or what it even looks like! I’m beyond fortunate to have had a crash course in this while still being able to have the right outcome that we needed at the time.

Encourage New Interaction Patterns

Describe a situation where you facilitated new ways for people to interact or share information. Or a situation where you exposed people to new kinds of information or experiences. What prompted you to make the change, and what was the outcome?

– https://cutlefish.substack.com/i/142017363/encourage-new-interaction-patterns

This is a fun one! I really love thinking about interaction patterns; they’re so influential in determining how people think about problems, how they tackle them, and how you can attempt to influence various emergent properties of a group of people. Consequently, I really like to think of this in terms of “what is the collaboration outcome that we need and how do we address those deficiencies while emphasizing and leaning into what we’re already good at?”

Here’s one that I really liked: in my team, when I was the interim Director and Engineering Manager, we had this problem of siloed information; because everyone had been so underwater for so long, the vast majority of work was interrupt driven. What I mean by interrupt driven work here is work that is primarily driven by asks from others and external demand rather than being planned or orchestrated; while that might be considered a normal flow of work for some infrastructure teams, it’s not optimal for teams that do more than “call desk” style support, and so we needed to find a way to address that. Consequently, people ended up specializing in the interruptions they could solve the quickest, and so we had “the person who knows how to do X”, and “the person who knows how to do Y” and so on. It became really risky to make most changes in infrastructure when that person wasn’t available.

That wasn’t a situation we could particularly afford, especially as I was trying desperately to prevent people from burning out, heal those who already had burned out, and grow the bus factor of the team while also trying to set up the future organization for success. I made a few changes to attempt to improve things, but they weren’t ultimately particularly successful:

I set up a support slack channel, so that other teams could reach out to us for any issues, and wired it up into Jira. This was fantastic and worked really well
- Previously, they had just DM’d various engineers on my team directly and so it was impossible to quantify the work being done, or share knowledge about what was going on, and we didn’t even have an effective way of announcing outages or planned maintenance.
I attempted to encourage pairing on problems together
I had the entire team lean into the interrupt driven work rather than try to do planned work and then splinter off into hero work as things inevitably required immediate attention
- While interrupt driven work isn’t necessarily ideal, since we were so underwater, focusing entirely on it was more effective than attempting to work like a team that had triple the bandwidth of ours.

These were all great steps, but the ones that I missed were:

Doubling down harder: We still had instances of people doing project style work rather than having everyone doing interrupt driven work. If you’re going to lean into something, you need to really lean in.
Leaning into interrupt driven work was an attempt to minimize work in progress to a manageable level. While it worked, it would’ve worked vastly better by turning the entire team into a mob programming team. We did this after our new Director joined and the change was incredible; it wasn’t enough to have everyone working in the same area, they needed to work on the same thing at the same time, together. Not only did this speed up the entire team, but they grew closer together, collaborated better, and huge chunks of siloing disappeared overnight. Did I mention that we’re fully remote? We are. We still did mob programming, and it was amazing. I highly recommend it as a way of accelerating a team in the storming and norming phases.

Going back to the things that I did… Despite not quite doing the optimal thing here, what happened was really effective, if only for a particularly interesting and non-obvious reason: It was an extraordinarily compelling and straightforward thing to showcase to leadership. Nothing sells “we’re under-resourced here” quite like saying “I switched the team from 20% support work to 100% support work and we’ve barely moved the needle.”

Having a paper trail of an ever growing queue of work for the first time also helped tremendously here; it put a semi quantifiable number on the complaints and grumblings that people had. It also turned out that because our team was so quiet and bogged down, it wasn’t noticeable; it had been under-resourced because people weren’t even able to understand how under-resourced it was. Changing that and making those new types of information available to leadership and the rest of the company fundamentally changed how they viewed us, and we learned quite a few interesting things.

Here are some examples of discoveries I didn’t expect:

The engineering leadership team was under the impression that all the teams did their own infrastructure work end to end. The reality was that my team helped a lot with support
Each vertical of the company was frustrated with my team not responding to them: they each thought we spent all of our time on the other verticals and ignored them
It turned out that 80% of our time was spent on support for product success, security, and compliance; we had so much toil that we didn’t even have time to automate it or reduce it
There was an incredible amount of rework and redundancy going on: Because communication was so ad-hoc and boundaries weren’t clear, people would ask IT for problems we solved and vice versa, they’d get bounced around between different channels, and we’d have the same conversations about the same issues over and over
Every DNS change in Route53 took about 20 people-hours of meetings to communicate between product success, engineering, IT, and infrastructure; triple that if it was “cross concern” between two parts of the company that didn’t typically interact
“Percentage of work complete and accurate” was abysmal; very rarely would something get fixed without having to get re-fixed, or addressed later; misunderstandings happened constantly, and it meant that our ticket queue never really went down even if items got completed

Turns out “done” vs “done done” vs “done done for realsies actually done” vs “done so done that you don’t have to do it ever again” are all vastly different concepts. However, if you notice, none of the results here are really in the area that I wanted them to be: all of the benefit here was external facing rather than for the team. So it should come as zero surprise to any experienced leaders reading this that my team struggled with confidence that the company was happy with their work or liked them or appreciated them.

It should also come as no surprise that despite the visibility upward to leadership about the problem happening very quickly and that resulting in rapid change… The team didn’t feel this for another quarter or two. They stuck around because they trusted me, and I’m eternally grateful for that, and I love them dearly; but I could’ve done so much more to help the team itself with the outcomes of the interaction pattern changes. Luckily, I have great people I can learn from now, and my org is in such a wonderful spot now that it’s phenomenal to be able to take the opportunity to reflect, learn, and grow.

Patient Divergence

Tell me about a time when you guided a team through a complex issue without rushing toward a solution. How did you manage this process, and what led to finally deciding on a path forward?

– https://cutlefish.substack.com/i/142017363/patient-divergence

The situation I want to talk about here is about how we decided as an organization to invest more heavily in developer experience, the process around that, and how we were able to do that quite quickly. The initial situation was one that I’m sure quite a few leaders are familiar with, but I want to take this opportunity to lay things out a bit more explicitly for people who might not be familiar with how things generally work in terms of feedback loops at the organization level.

How this generally works in tech (and likely most companies, but I can’t speak to those) is you have roughly three personas of “experiencing” the company: Executives, Management, and ICs. Each has its own goals, strategies, and tools available to help steer the company in the right direction (which I’m going to call a “lever”); broadly speaking, Executives have levers of alignment, Management have levers of communication, and ICs have levers of execution. In addition (and I am grossly oversimplifying here), each company, particularly a startup, is going to find itself in one of roughly three phases: Exploration, Expansion, or Extraction. One of the difficulties that comes here in the Executive and Management level is that any advice, tool, strategy, goal, or whatever else that you receive or attempt to implement is only going to work if it’s a match for the particular stage of a company; consequently, you essentially have to throw out your understanding of “how to run a company” every time you switch stages.

When I came in, I would categorize our company as one that had just gone through the Exploration stage and was now entering into the most awkward phase, Expansion. It turns out this isn’t quite right; we have two to three main business markets and each one is in a different stage. In addition, each company that we’ve merged with or acquired also was in a different stage, so what you really had from the perspective of the Executive layer was a very fuzzy matrix of approaches, strategies, and tools, and everyone came in with a slightly different toolset and rationale for said toolset. This isn’t a worst case scenario for the business (it’s quite normal), but it’s close to a worst case scenario when it comes to understanding how to build and utilize effective communication streams and feedback loops so that information travels bidirectionally in a way that people feel valued and heard.

Coming back to the initial situation that I found myself in when I joined the company: it should now come as zero surprise that we were having a particularly difficult time getting good feedback from ICs, acting on said feedback, and doing so in a way that they felt heard and valued. As a consequence, many of the issues that actually mattered to ICs weren’t acted on or even identified; most of which would be issues that we could broadly categorize as “developer experience.”

I was actually deeply fortunate here: I came in as an IC, and then stepped into an interim hybrid Engineering Manager and Director of Infrastructure within a month of joining, so I got to see all three perspectives almost simultaneously, and I attribute a very large amount of my ability to effectively and rapidly zero in on “the real issues” at our company to this unique start. As such, I was able to categorize a lot of things in ways that helped ICs feel heard and understood, but then translate those issues into something that the management and executive layers could actually see.

One of the first things I did here was to quantify and outline the issue in a way that presented enough evidence to the executive layer that investing in a developer experience platform would be cost effective and a force multiplier for helping figure out what to do next. In our case, we utilized DX because I was familiar with the tool and research behind it and made a strong case that the qualitative feedback mechanism of a survey would offer a much more rapid and tangible ROI. In a cheeky sense: “Hey, a lot of our devs complain that our infrastructure and tooling is really broken, we can’t use quantitative reasoning to measure any of this because the tools don’t work” is a surprisingly effective argument and it’s essentially the one I used.

While we did settle on the developer experience platform of choice somewhat quickly because I pushed hard for it, I was very careful to lay out that we had a 1-2 quarter plan for procuring the platform, using it, and actually understanding what we needed to do with it. One of the additional critical things that I used to help make the choice easier was to use this as an opportunity to communicate from the top-down that the leadership team is investing in figuring out how to understand ICs better. That worked so well that we had a noticeable bump in developer trust in leadership within a quarter, before we had even been able to use the platform to make any real changes; I really can’t overstate enough the importance of making sure your organization, at every level, feels heard and respected.

Lastly, the path forward here was “building a path to build a path”, in a sense, and that was actually also very important; we had recently gone through a lot of turmoil in the organization due to people feeling like the infrastructure team wasn’t communicating, and being able to set up large multi-quarter initiatives in a way that let us start communicating about them immediately was crucial. Communicating early and often was so important to the success of this, and if anything, my only regrets are that I could’ve communicated earlier and more often; I slowed down a bit after things started “working” and change started happening, but doubling down on the communication would’ve likely helped some.

However, there’s a danger there in over communicating to the point where people don’t see change happening at the rate that you’re communicating and then it sounds like you’re all talk with no action (ironically, this was a frequent piece of feedback for me in the last two months; you can’t really win here). The balance and nuance in what it means to be an effective communicator and a transparent one gets even fuzzier and more complicated when you’re in leadership because there are extremely valid psychological safety concerns in being “too” transparent. In addition, one can find themselves communicating about the wrong things, or with the wrong ratio of frequency to message importance, and so on; one of the hardest lessons of leadership I had to learn was truly understanding what it means to communicate less and be less transparent in being more effective as a leader.

As someone who really values transparency (and can handle “too much” transparency), it honestly particularly irked me to discover that there are, in fact, extremely legitimate reasons behind most leaders erring on the side of less transparency. I don’t have any easy answers there, of course; it’s one of the hardest skills to develop in leadership and I’m continually working on it myself.

Identify Plausible Contributors / Multiple “Causes”

Discuss a complex problem you’ve encountered with numerous contributing factors. How did you tackle this complexity, and what was your method for deciding what to do next?

– https://cutlefish.substack.com/i/142017363/identify-plausible-contributors-multiple-causes

I’m going to talk about the same team and a very similar time-frame here again. We had a problem with our infrastructure: it was really really fragile. Nobody liked it, and quite a few people agreed that we really would probably be better off rewriting it all from scratch.

So naturally, we didn’t do that.

With a team that was so underwater, rewriting something from scratch and then migrating the entire company from one set of kubernetes clusters and AWS accounts to another one was a recipe for unmitigated disaster. Absolutely in no way would I ever commit a team to a death sentence like that. Well, not without understanding the problem really well. The contributing factors were things that were tricky to pin down, but easy to intuit if you have a “gut” for infrastructure:

A very young company had hired generalists that were smart and built things they understood and that fit the use case
Certain technological bets were made prematurely that ended up being ones that only pan out well with a certain amount of investment
The business context changed and the technological choices weren’t re-evaluated
The required amount of infrastructure expertise wasn’t invested in
Everyone who had set up the system had left
The blend between “infrastructure” and “application concerns” had started fuzzy and gotten fuzzier
Non-local initiatives had been made that had compounding effects on each other: in particular, pursuing certain markets, certain regulatory statuses, certain application level architectural decisions, and certain GTM strategies all came together in a very exponentially complex way that nobody could’ve foreseen at the time

In short: it wasn’t anyone’s “fault,” but we sure ended up in quite the predicament, and much of the complexity was embedded in human interaction and how the intersection between locally smart choices resulted in disproportionately complex consequences. More importantly: most of the “real” fixes weren’t isolated to infrastructure, but touched upon ways of working and architectural assumptions in our compliance, regulation, security, infrastructure, product design, roadmaps, and more.

What I decided to attempt to do was to try and quantify the business continuity risks that our infrastructure posed, and then outline the various changes needed to address certain categories of business continuity risk. We had other types of risk as well: burnout, knowledge siloing, scale issues, scaling people issues, and so on, but the business continuity one is the one that moves the needle the most on resource allocation during budget planning, which is what we needed the most at the time.

To get more information here, we took the approach of trying to document whenever we ran into an issue with our infrastructure in a way that disrupted other teams, and phrasing things in terms of a two part “here’s the hack fix” and “here’s the requirements for the real fix” layout; that was combined with me attempting to cross correlate that with understanding what scenarios we might run into that would immediately shift our risk/reward ratio.
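A minimal sketch of how that two-part layout might be written down as a record, in case it helps make the idea concrete. Every name here (`InfraIssue`, its fields, the example values) is hypothetical and purely illustrative; it's one possible shape, not a description of any real tooling.

```python
from dataclasses import dataclass, field


@dataclass
class InfraIssue:
    """One infrastructure issue that disrupted another team (hypothetical shape)."""

    summary: str
    disrupted_teams: list[str]
    hack_fix: str  # what was done to unblock people right now
    # What a proper fix would need before it could even be attempted.
    real_fix_requirements: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the issue in the two-part layout: hack fix, then real-fix requirements."""
        lines = [
            f"Issue: {self.summary}",
            f"Disrupted: {', '.join(self.disrupted_teams)}",
            f"Here's the hack fix: {self.hack_fix}",
            "Here's the requirements for the real fix:",
        ]
        lines += [f"  - {req}" for req in self.real_fix_requirements]
        return "\n".join(lines)


# Made-up example values, only to show the layout.
issue = InfraIssue(
    summary="example: deploys blocked by a cluster quirk",
    disrupted_teams=["example-team"],
    hack_fix="manual workaround applied",
    real_fix_requirements=["budget for an upgrade", "time to migrate workloads"],
)
print(issue.render())
```

Keeping the requirements as a list makes it easier to spot when the same requirement keeps showing up across unrelated issues, which is the sort of cross correlation described above.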

Eventually, we came to understand that the biggest things that would shift our risk vs reward of doing a rewrite were:

Can we incrementally do it (ie: stop at any time)
Can we actually fucking finish it
Can we do it without creating a “now we have twice the operational burden” problem
Is it worth the cost of everything we drop in order to do it

Ultimately, the answer ended up being yes, but several things had to happen for that to be true:

We got more headcount and doubled the size of the team; letting us actually have enough bandwidth to tackle the issue for the first time while still doing Keep The Lights On work
Changes in compliance and regulatory requirements meant that many of the affordances we relied on previously wouldn’t work going forward; this changed the risk vs reward substantially
We were able to figure out how to narrow down the scope of the rewrite enough that doing it became more feasible

Did I manage this process well? Eh…, I did okay; my inexperience really showed up here. This was a problem that I unintentionally solved mostly in my head and, while I took the team with me, I should’ve externalized the process and externalized the information in a better way.

It would’ve been amazing to have something like an ADR (architectural decision record) but for hypothetical needs. The “hack vs doing it right” set of trade-offs that we had was all informal discussion; while it was amazingly helpful, it would’ve been much more impactful to lay it out in a way that external stakeholders could see it and reason about it, and while we were able to have conversations with others for the first time of “hey this thing you want isn’t possible because X, Y, Z” it would’ve been a huge benefit for them to be able to take a document and share it with their leaders so that people could connect the dots for themselves and tie their business goals to ours in a way that would help everyone involved plan for success.

We did an alright job there, but it relied on me being very charismatic, good at communicating with others, and having lots of meetings; while I’m happy to own that I’m good at communication, I’m disappointed that I put the team in a place where their success depended on me being able to have the right conversations at the right time with the right people. That simply doesn’t scale, and our success ended up feeling more like luck than anything else.

With all that said, there was one pivot point that changed the entire conversation for the rewrite: We had just gotten budget for 3 new headcount, and I had just learned about a new regulation requirement that would force us to upgrade our kubernetes clusters within a few months. In addition, we had already previously decided that they were not upgradeable at this point and that any forced upgrades would require a rewrite; so, when we got to the point where we realized an upgrade was mandatory, the conversation switched from an “if” to a “how.”

What I am proud of, in that moment, is that we had done the work required for that to be an instant and clear conclusion for the entire infrastructure team; everyone understood the trade-offs and alignment was unanimous. While that could’ve been communicated externally better, it’s so difficult to have that type of hard decision be straightforward, and I can take a bit of pride in having helped set up the conditions for it to become a straightforward decision.

Power of the Present

Often as leaders we struggle with the tension between two extremes. At one extreme, we push for a big leap towards our opinionated vision about where we want to get to. At the other, we start where we are right now, figure out what’s working, and take small steps to change the present situation. Can you describe a situation where you needed to explore this tension?

– https://cutlefish.substack.com/i/142017363/power-of-the-present

This is something that I’m struggling with right now, actually!

On one hand, I have this fantastic and grand vision in my head for how we might build out the software engineering experience in a way that lets our Product and GTM functions be really effective in the type of market that we find ourselves in. There’s some very challenging things we have to do, and some core competencies that we have to build; if we pull off what I hope to, we’re going to end up building some extraordinarily innovative approaches to tackling this type of market, which would result in us having a world class ability to handle extremely diverse and nuanced industries that resist standard approaches towards digitization.

On the other hand, our CI pipelines are kinda janky and a lot of developers don’t feel like our test-suites are adequate or that we have sufficient monitoring in place to even detect when a service they work on is down, much less functional. Sooooo, y’know, there’s a ways to go before we get to build the innovative vision.

The big tension here, for me, comes from trying to determine how one is going to iterate; iteration is key in evolving and improving the situation, but it can be extremely difficult to iterate certain things. Feature flags help a lot, but you don’t really get those for infrastructure in the same way, and if your infrastructure team is so underwater that they can barely handle what they have now, gradually and incrementally building out “the new thing” while struggling under the burden of what you have to do now is simply not going to work. One thing I did to explore the tension was to break down things that were causing this tension into a few categories: fixable, unfixable, workable, and unworkable. Fixable is fairly self explanatory and it’s a property of whether or not you can remediate the issue in some way that actually solves it; workable is a little fuzzier: I’m using it to mean the spectrum of how much of a concern this is to the business itself, whether it be from the perspective of legal, compliance, risk, or anything else. I should note that I didn’t actually have these categories laid out so cleanly when I did this, and I’m more going back and looking at what I did and making sense of it after the fact.

That said, if we build these out, we have a “fixable/unfixable” and “workable/unworkable” split, so we can pull out one of my favorite tools, which is a 2x2 matrix (as an aside, seriously, I’m addicted to those, they’re so helpful for my brain for some reason). Laying them out, you get four categories:

Fixable and Unworkable: highest priority to address
Fixable and Workable: lowest priority, but quickest wins
Unfixable and Unworkable: identify and escalate
Unfixable and Workable: label, quantify, and move on

Fixable and unworkable

These were the highest priority things to address: they were actively breaking the team or the organization, and we could fix them. The hard part here is really about finding these and appropriately labeling them: a lot of people want to label things they dislike as “unworkable” but doing so is a surefire way to lose trust in leadership.

Fixable and workable

These were workable, so they’re automatically lower priority, but if they’re fixable, then they’re great things to stick in and sprinkle in with your higher priority stuff. Often because something is workable, it’s de-prioritized but it can be a source of morale drain or impedance; giving the team permission to work on those things can build a lot of trust with them, and they’re often things that can be completed much quicker too.

While you can run the risk of appearing like you’re only working on “workable” stuff, when done right, it’s incredibly effective in being able to deliver a constant stream of improvements without necessarily meaningfully slowing down the high importance work.

Unfixable and unworkable

This is something to escalate, and this category of problem is the one that keeps me up at night. Not only can we not fix this with the current capabilities that we have, it’s actively breaking something essential that we need to function as an organization. Identifying these should be your second highest priority after identifying just enough work for the team to have things to do because the consequences of not knowing what these are and being unable to quantify the risk is absolutely massive.

Unfixable and workable

Label this and move on; the things that are great to label it with are:

a name that signifies it’s not tech debt
a sufficiently low priority to signal that you don’t care right now
the conditions required for this to move into a different category

That last bit is important enough that I’m going to repeat it; things that are unfixable and workable are very dangerous, because they can be ignored, but if it flips to any other state, it could turn out quite negatively: either people will see you as being inconsistent with what you choose to work on, or it’ll silently flip into “unworkable” and you won’t notice and that’s going to cause a lot of damage to the business.
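The 2x2 described above can be sketched as a tiny lookup table. This is a simplification under stated assumptions: the `Problem` type, the `triage` function, and the example problems are hypothetical names invented for illustration, and treating "workable" as a plain boolean flattens what is really a spectrum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    name: str
    fixable: bool   # can we remediate it in a way that actually solves it?
    workable: bool  # can the business live with it for now? (really a spectrum)


# The four quadrants, keyed by (fixable, workable), with the action for each.
ACTIONS = {
    (True, False): "highest priority to address",
    (True, True): "lowest priority, but quickest wins",
    (False, False): "identify and escalate",
    (False, True): "label, quantify, and move on",
}


def triage(problem: Problem) -> str:
    """Return the action for the quadrant this problem falls into."""
    return ACTIONS[(problem.fixable, problem.workable)]


# Hypothetical examples, only to exercise each quadrant.
backlog = [
    Problem("flaky deploy step", fixable=True, workable=False),
    Problem("confusing dashboard names", fixable=True, workable=True),
    Problem("not enough headcount", fixable=False, workable=False),
    Problem("legacy cluster layout", fixable=False, workable=True),
]
for problem in backlog:
    print(f"{problem.name}: {triage(problem)}")
```

What the lookup can't capture is the part that matters most for the unfixable and workable quadrant: the conditions under which a problem moves to a different quadrant. Those still have to be written down next to the label and revisited.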

Now that I’ve rambled on a bit about the theory of all of this, the situation that we had was one where all four categories each had more work than my team could actually accomplish, so it didn’t matter what we did, because before we could finish any of the highest urgency work, more work would escalate into being in that very high urgency state. As it was, the only thing I could really do was minimize the amount of work in progress, free up as much bandwidth for the team, and buy them as much time as I could while I addressed the unfixable and unworkable issues with leadership directly. The first being headcount, and justifying said headcount in a way that aligned the success of the org with the success of the company; this was a bit difficult because the company was very much in expansion mode, and so everyone urgently needed headcount. Us getting that headcount meant that we took it from the rest of the org, which is exactly what happened, but the argument for doing so had to essentially become “this will accelerate everyone else more than hiring more headcount for them will” and one of the key pieces of information there was showing that we had become the blocker to essentially all progress in the organization.

As most leaders would likely immediately tell you, stepping into a leadership role and then immediately identifying the vast majority of all blockers for the entire CTO org as being under you as a direct responsibility is… beyond risky. I’m not saying that you shouldn’t own up to the reality of a situation, to be clear. This type of thing is far more about the implications of what happens after you do something like that, politically; what’s happening is you, in essence, become the responsible “root” cause for every missed goal, target, milestone, etc., in the entire company until you fix the blocking problem. You become the highest priority for headcount and resourcing, but you now have to manage the expectations of the entire organization, which is going to see a massive three-quarter whiplash between “taking the blame and promising to fix it” and anything actually improving.

If you want to appear ineffective as a leader, this is a stellar strategy, because it gives you no opportunity to actually prove out your worth before people start judging you based on things that happened before you came on. However, being an interim leader, I knew I had essentially zero shot at actually becoming the long term director, and so I wasn’t particularly fussed about maximizing the short view, instead favoring success in the long view; it absolutely tanked me, but it set up my successor for a ton more success than they would’ve had otherwise, and it helped break the cycle of rotating management that had plagued infrastructure for three years. Totes worth, 10/10, would fuck up my reputation again in a heartbeat.

Conclusion

I don’t even know how to conclude this, honestly; writing this has been a lot of fun, and I’m only a third of the way through. I suppose if I had to attempt to summarize a lot of the key themes here when it comes to dealing with uncertainty and ambiguity, there are a few things that emerge for me: understanding the problem, empathy, communication, and that I did better than I thought I did.

Understanding the Space of the Problem

I’ve come to terms with the fact that my brain works in a very unusual way; it’s one of the biggest gifts I have and a core differentiator in my ability to do work. It also does mean, however, that I’m cautious when giving advice to other people. Just because it works for me doesn’t mean it’s going to work for you, and honestly, it’s less likely to work for you if it works for me.

All of that said, something that works for me very well is having a sort of spatial relationship between things; I do symbolic manipulation and mental spatial navigation very very well, and I abuse that fact as much as possible in all of the reasoning that I do. If I can find a mental model that lets me do that, it helps me learn more concepts better, and if I have trouble recognizing how to solve a problem, I try to break it down to something that lets me spatially reason about it or symbolically manipulate it. It doesn’t work for everything, but it works for most everything, and it helps me build a ton of bridges of understanding; it turns out that if you build a symbolic representation of something, since most other people haven’t, it gives you a second modality of information with which to check your understanding. Being able to check understanding with others in multiple modalities or multiple analogies is a form of cross referencing that I find incredibly useful.

It also does something very very useful for me: It helps me get a navigational structure in place. So much stuff out there is only ever explained in a way that’s not actionable or constructable; if you can’t construct a solution out of the description of the problem, you either don’t understand the problem, or nobody else knows how to get the solution either. Building that path to a solution is going to get action and alignment actually happening, but it can only start once you have the space of a problem sort of laid out.

Empathy Goes an Incredibly Long Way

This is something I’ve talked about a lot on my blog before. I love empathy, and it’s my secret superpower to getting things done at an organizational level (although the fact that it’s a “secret superpower” rather than a basic tool of communication is… Anyway). However, there is something that was very difficult for me to learn and it was a bit of a bombshell for me when it really started to finally click. It turns out that empathy and effectiveness, at the leadership level, are pretty close to the same thing.

There are four parts of empathy: tuning into your feelings, expressing your feelings, tuning into their feelings, and responding to their feelings with understanding. Doing that well requires that you’re able to connect with something inside of you that has been where they’ve been. The need for us to have something in common with that situation is why we have such a hard time empathizing with people who are on sufficiently different walks of life than us. At the individual level, we’re talking about individual experiences and individual emotions, but at the organizational level, we’re talking about organizational experiences and organizational emotions.

It should go without saying that organizations express emotions extremely differently than individuals, but don’t worry, organizations absolutely do have emotions and do express them, we just call it values and we express those values through culture. But, how does one connect with the values in a company in a way that they map their individual emotions to the values of a company? How does that happen in a way that you can map your expression in a way that actually results in the company itself being able to feel that empathy? You understand the values, you participate in the culture, and the organization “feels” that through your participation in it… Which is basically your effectiveness as a leader, is it not?

Of course, to any one individual, it’s going to look completely indecipherable; you’re either going to come off as cold and disjointed, or completely unhinged; to the organization, you’re going to either be entirely invisible or insignificant or unaligned with the needs of the company. You really can’t win, and the more you manage to do well at balancing the needs and perception differences of the individual vs the group vs the organization, the more you’re going to build up a skill-set that looks suspiciously like narcissism and sociopathic behavior. It’s to the point that any individual leader that tells you they can reliably tell the difference between effectiveness due to empathy and effectiveness due to behaving like a sociopath is lying; if they can tell, it’s because of some other reason (but don’t worry, there are almost always tells elsewhere).

Anyway, it was a weird trip for me to realize that empathy goes hand in hand with behavior that is, externally, occasionally unsettling; no wonder leadership is often described as being extraordinarily lonely. How do you even begin to get good at a skill-set like that? Especially when getting good at that skill-set will make most of the people you love in your life less able to relate to you? The cognitive dissonance required to be an effective leader in a capitalistic system is wild and unbelievably damaging to most.

Communication is the Whole Job

Well… Communication is the whole job, except for the parts where it’s not. Communication isn’t the same as execution, which isn’t the same as strategy, or mission, or values, or objectives, or any of the other myriad of things that we use to build up an organization of effective people and point them in the same general direction and have them build something great together. However, it really is sort of the same thing at the same time?

Just because communication isn’t strategy doesn’t mean that strategy isn’t communication, because it absolutely is, and it communicates a great deal of information when you learn how to read into it and interpret it and use that second layer of information to communicate more things when building out your own strategy to complement someone else’s strategy. That goes for all of the others as well; they’re all a stream of communication and communicate information in their own way, and you have to learn how to utilize them as the “thing” as well as the tool of communication that they are, while not forgetting that communication itself is also directly required, especially for people who haven’t learned how to read into all of the other streams of communication that are embedded into all of the other things out there.

In particular, something that isn’t built into the rest of the things we communicate about is expectations. They’re kinda there? Sorta? Kinda sorta, but not really. Anyone who says that strategy or objectives accurately help set expectations is absolutely full of it; even if you explicitly call out expectations when writing those two things, they are absolutely not going to be the expectations that anyone actually has or sets. Likewise, anyone who has expectations and communicates them out, but doesn’t actually participate in making sure that those expectations were understood appropriately as well as communicated out to everyone else who needs them, is not going to be an effective leader.

This was probably something that I was the weakest at. I made a lot of mistakes in finding that balance between communicating, over communicating, setting the right expectations, not managing them, not updating them soon enough, and so on. I also made mistakes when it came to communicating appropriate expectations and sentiment with how other people were being perceived at the job, and that ended up causing pain in a lot of places that didn’t need to be there.

Communication is fucking hard, and it’s one of the most painful things to mess up, even though it sounds so non-damaging because of how intangible it is. That said, if you are willing to be humble and learn from your mistakes, leveling up your communication skills in every aspect is going to be one of the quickest and highest leverage things you can do to accelerate your own growth and effectiveness as a leader.

I Did Okay, Really

I had a very unusual situation and I made the best of it; while I wasn’t as effective as I could’ve been, I learned a tremendous amount, and was able to set up my director for success and play a part in getting us to where we are today. In the end, I’m really proud of what I was able to accomplish and I’m deeply looking forward to being able to help continue to make this a wonderful place to work.

I feel very fortunate to work somewhere that I can actually look forward to the positive changes that this might actually make to society, and I get to heal trauma and celebrate queerness and grow a diverse workforce? I helped build the culture in the platform organization where all of that is possible? It legitimately makes me tear up thinking about that sometimes.

Fuck yeah, I did okay.

Engineering Language as a Vehicle of Innovation

2024-03-08T00:00:00Z

Something that I find missing in almost every software company is this thing that I’m not sure I’ve seen explicitly called out anywhere, but I’m going to call it an Engineering Language. This Engineering Language is something that I’m going to attempt to describe, motivate, outline, and then illustrate with an example.

Engineering Language

The Engineering Language is something that I would consider to be a living embodiment of how engineers speak, think about, describe, and express what they think in their problem domain. It’s not a programming language, or a DSL; it’s similar to a Design Language, but for software engineering and architecture more directly. The Engineering Language is the tool that you use to build foundations of thought and mental models and concepts themselves, so that one can coordinate the intangible nothingness of abstraction itself.

I think this Engineering Language is comprised of three things: An abstraction language, a protocol language, and an interface language. Together, those three things make up something that is greater than the sum of its parts.

Motivating the Engineering Language

“If there is no language, are there thoughts we can think?” It’s an interesting question, but I find it unsatisfactory; here’s a different question that keeps me up at night. “How can I share the thought that I think, if even language is insufficient and inadequate for this task?”

Let me utter a word into the air, let me breathe this thought into your mind’s eye: Echo.

What does that word convey to you? I had an extremely specific mental image in my head when I wrote that word, and I know that I could spend hundreds of words, or even dozens of pages, explaining that mental image and still you would not share it with me perfectly. Do you have any idea how discouraging it is to spend your entire life’s work building mental abstractions and making them concrete in a technical sense and a human sense, yet be completely unable to even convey so much as a single thought? It’s ridiculous; we can build towers, we can see countless infinities, we can push boundaries unimaginable, we can kiss the stars, but we can’t even share a thought with each other? How many tens of thousands of years have we spent building this ability to speak and express oneself? And for what? Absolutely nothing?

For all the weaknesses and inadequacies that language has, however, nothing else comes close to enabling the same fidelity of communicating thought. There’s a reason that the pen is mightier than the sword, after all. It seems to me, then, that if one is to scale the act of creating complex thought and nuanced abstractions and building the scaffolding upon which we construct towers of ideology that we call understanding… That one needs language.

There’s a more concrete motivation here as well. One of the most beautiful aspects of education and knowledge is that we’ve managed to figure out how to take a messy, non-linear, ball of mud that is “knowledge” and turn it into something that is fascinatingly incremental. Somehow, we’ve managed to figure out a path through which you can start from counting and the alphabet and end up with mathematics, philosophy, linguistics, and more.

While that in and of itself is fascinating, there’s something in there that I think is even more amazing: it feels linear. How the absolute fuck did we manage to build a vehicle of transmitting knowledge that is mostly somehow linear in feeling even though the world is messy, information explosion is combinatorial, cardinality is uncountable, and certainty is unknowable? How? How did we do that? We don’t celebrate this miracle of knowledge nearly enough, in my opinion; of all of our achievements among humanity, this should rank as one of the greatest.

I’m going to switch gears for a second and talk about a theoretical business. Imagine this business, which is going to solve a problem, with a product or a service or whatnot, and tackle a certain market. In order to do so, one might start writing some software and doing some market research, validating things, learning about the domain, and so on. Something curious will eventually happen: No matter how carefully one writes the software, or how adaptable one tries to remain, the company will eventually reach two critical points of solidity:

Some evolution in the software will disproportionately become exponentially difficult relative to its “actual” complexity
Some evolution in market strategy, positioning, or product development, will disproportionately become exponentially difficult relative to its “actual” complexity

But somehow, this isn’t the case with language? It isn’t the case with the things we learn? How? How is it so different?

If we are to achieve this sort of linearity of growth as knowledge for a business domain develops, and if we are to do so in a way that lets us express this knowledge and make it concrete through computation, then surely we need a language of some sort. An Engineering Language.

Outlining the Engineering Language

As I said earlier, I think the Engineering Language has three parts to it:

An abstraction language
A protocol language
An interface language

Let’s break that down a bit.

Abstraction Language

When we talk about abstraction, what makes a good abstraction? I think a good abstraction is one that is both opaque and transparent. A good abstraction is opaque in the sense that it is not necessary to ever reason about something underneath the layer of the abstraction; it should not leak, it should not break, it should deliver on what it promises, it should behave as an abstraction rather than a leaky shortcut. A good abstraction is transparent in the sense that it is not necessary to know the abstraction in order to reason about something below it; at no point is the interaction of the abstraction “magical”, and at no point does the abstraction require you to have 100% knowledge of the abstraction and 100% knowledge of the thing it abstracts and 100% knowledge of how those two mesh together. Lastly, a good abstraction is derivable in that if you see a new instance of it, it behaves logically in a way that you can reason about the implementation accurately.

Abstractions become useful precisely when they are able to be depended on and ignored, when they are able to be mixed and integrated, built on top of and built around. Abstractions should exist to coexist.

Which means, of course, that only some things can be a good abstraction; fundamentally, how you design the lower layers of your infrastructure and your software and your sociotechnical system will dictate quite literally the constraints of what can and cannot be expressible as an abstraction at all. No amount of papering over something will let you break the laws of physics, no amount of fudging the numbers will make time run backwards, no amount of magical bullshit sprinkles will solve fundamental limitations of distributed systems, and no technical solution will ever solve a people problem.

But if you only have some things that can be a good abstraction, surely you need a language to express and help enumerate the possible abstractions one can build. Not only that, but the language should help you express why those are good abstractions, why certain others aren’t, and help other people build combinations of abstractions and towers of them in a way that preserves the coherence and alignment at scale. That is something I don’t really see anyone doing, but it’s sorely sorely needed.

Protocol Language

If an abstraction is a mental construct turned into a tangible building block of conceptual thought, then protocols are the cement through which you build the towers of your imagination. Any system needs communication, coordination, coherence, adaptive capacity, failure handling, modularity, and more. All of those things have one thing in common: You build the facilities which enable those by building a protocol.

But again; the shape of your system determines the shape of what can be a good protocol, which means you need a language for defining and conceptualizing what it even means for something to be a protocol and to interface with other protocols.

Interface Language

This one is tricky. We have abstractions, and we have protocols, so what makes an interface different from those? To stretch the construction analogy a bit more: if abstractions are bricks, and protocols are cement, then interfaces are the blueprints that let everything flow through the building correctly.

Abstractions enable growth by allowing one to compose ideas, protocols enable growth by allowing one to compose systems, and interfaces enable growth by allowing one to compose interactions.

Naturally, I love interfaces, and have a horrible time explaining what I mean here; I’ll give it a shot. I don’t really mean interfaces in the abstract interface List sense; that’s useful, but also far too low level. As a slightly better example, one could think of kubernetes as a protocol, as an abstraction, as an interface, or as any combination of those; when building a platform for others, I prefer to think of it internally as an interface and externally as a protocol. Internally, I use it as an interface and build things with it and compose all the possible interaction points people might have with the distributed system and glue them together in a coherent way; but I don’t expose the interface really, I expose the protocol so that people know how to communicate with the system. It’s a subtle difference, and I’m not sure I’m explaining it well.

An Illustrated Example

This example is either going to make a ton of sense, or absolutely zero sense. Let’s look at something that has managed to do this quite well: The web browser.

Browser Abstraction Language

What are the building blocks of a browser? What makes good ones, bad ones, weird ones, or even just possible ones? I think, honestly, that there’s only two main ones.

HTML + Accessibility Object Model + CSS
URLs

HTML, CSS, and the Accessibility Object Model are the main languages that let you even conceive and describe what it means to “be in” the browser at all. They help define the capabilities of it, the limitations of it, and shape what it means to be the web in a tactile sense.

But URLs? They are the web. URLs are the most defining aspect of the web and are so key that they are simultaneously an abstraction language, a protocol language, and an interface language.

Javascript doesn’t count here; it’s not an abstraction, it’s an interface. It doesn’t create new abstractions, it surfaces ways you can interact with them; the fact that only some things are exposed via Javascript is a perpetual wart and flaw in the design of modern browsers and it continues to be a glaring omission in their design.

Browser Protocol Language

When we think of protocols, we likely start thinking of: TCP, UDP, service workers, https, http/2, http/3, and websockets; we might get into an argument about whether those last ones count, or whether http/2 and http/3 are different protocols, but we certainly use all of these like protocols.

They’re not a protocol language, though; they’re manifestations of that language.

The protocol language of the browser is simple: it’s the URL. protocol://domain/sub/resource?key=value&metadata Look at that thing. It’s glorious, it’s gorgeous; contained within that language are the empires of thousands of libraries, millions of lines of code, dozens of protocols, and more.
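To make the shape of that URL pattern a little more tangible, here is a minimal sketch that pulls one apart with Python's standard library. The URL itself is hypothetical: "https" and "example.com" are stand-ins for the protocol and domain, and the dictionary keys are names I picked to match the pattern, not anything a browser defines.

```python
from urllib.parse import urlsplit, parse_qs


def describe_url(url: str) -> dict:
    """Split a URL into the pieces named in the pattern above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "domain": parts.netloc,
        "resource": parts.path,
        # keep_blank_values=True keeps bare flags like "metadata" that have no "=value"
        "params": parse_qs(parts.query, keep_blank_values=True),
    }


# Hypothetical URL; "https" and "example.com" are stand-ins, not from the original post.
info = describe_url("https://example.com/sub/resource?key=value&metadata")
print(info)
```

Every piece of the pattern comes back as its own addressable thing, which is roughly what it means for the URL to be a language rather than just a string.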

The language of the URL helps shape what it even means to be able to think about building a protocol for the web, and it’s why we can instinctively feel like REST is a “web native” RPC, but most others, such as gRPC, are not.

Fuckin love URLs

Browser Interface Language

There are two interfaces I want to talk about here. I’m going to intentionally avoid the accessibility interfaces (and lack of them) here because I’ll blow a gasket and rant for a few thousand words if I get started on that.

ANYWAYS

The two interfaces that I want to talk about are URLs and Javascript. What makes a URL an interface here? Well, simply that how people interact with the web browser or initiate the web browser and do all of that is… Through URLs. Want to open the browser? Most people now actually just click on a URL anywhere on the computer, any time, anywhere, and expect a browser to open up spontaneously.

That’s honestly remarkable. It’s absurd how pervasive that idea is; can you imagine literally anything else in computing where, regardless of whether it’s an iPhone or Android or a desktop or a laptop or any OS in the last 20 years, everything works the same way? Click link, see site, never think about whether or not you need to start the browser first. Truly magical. Now that’s an interface language.

It’s a language in the sense that it lets you know the limitations and lets you conceive of new possibilities. Did anyone imagine deep linking was going to be a thing in mobile apps back in 2004? Of course not; we didn’t even have iPhones yet. (Yes yes I hear you shouting in the background there Plan 9, shh, it’s ok)

Javascript is, well, Javascript; of all the interfaces with the browser, very few are as raw and deeply embedded as the programming engine through which we decided to shove the entirety of all and everything through.

Where Am I Going With This?

If you’ve made it this far, congratulations, you got to read my ramblings for a bit on an Engineering Language, with apologies for being a bit tired while writing this and not proofreading it in the slightest before yeeting it onto the internet.

But really, what’s the point here? The point for me, simply, is that I don’t think tech companies are thinking enough about what it means to build a language for engineering. How does one go about building something in a way that you can distribute tools of thought that are so deeply embedded in people’s workflows that they learn to conceptualize thought intuitively in a way that’s aligned with the direction that you need to go? How do you make software architecture where coherence with the company vision is an emergent property? Is that even a thing people think is possible? I think it is, I just also think we suck at doing that.

I see it in a really bad way where you get software debt built up in such a way that you can’t meaningfully explain to anyone why one idea takes two weeks and another takes two years to implement. What a waste of talent and time all around. I’d love to see a world where instead of pontificating about tech debt or agile practices or wanking about the OKR-go-round, we figured out how to actually build cross-functional communication in a meaningful sense. What if a product manager actually had the language to have a meaningful conversation with a software architect and a solution engineer and marketing and UX research? What if we were able to build software in a way that we could proactively identify opportunities for alignment and that such opportunities for synergetic product and feature development happened naturally and organically?

Anyone who thinks they have cracked the formula for doing so is lying; there’s no way we’ve figured this out as an industry, and I’m doubtful we ever will figure out an actual methodology and pedagogy for teaching this type of thing. That said, I think it’s possible to do so for a company and a set of circumstances.

Whoever figures it out for their company and their circumstances is going to massively increase their chances of success. That is, if they can get everyone speaking the same language.

Which is, of course, an entirely separate problem with its own massive difficulties.

Redefining Observability

2024-03-15T00:00:00Z

Observability is a bit of a hot topic, and while it’s increasingly been playing a larger role in engineering strategy, I think the way it’s presented can often cause a lot of leaders to miss the value or to over-index on the wrong things. I’m going to present the current definitions of observability that are widely used in engineering and other disciplines, and then introduce my definition; I’ll also be going over what motivated me to develop my definition, and the deficiencies I encounter in the other definitions, especially when it comes to the failure modes of understanding.

For leaders who are pressed for time, I’m going to try something new with this blog post: I’m going to have pulled out sections labeled “leadership insight” so that you can skim this and pull out the key points. Let me know if that’s useful for you!

Definitions of Observability

“Observability”, or o11y as it’s often called by aficionados, has two main definitions that people tend to use when talking about it. The first comes from control theory and the second comes from cognitive systems engineering.

Observability: Control Theory

Here’s the first definition:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

– Rudolf E. Kálmán

This was a definition that came out of studying linear dynamical systems and rose to prominence in software engineering largely through the efforts of thought leaders in the space bringing the concept over and applying it in a new domain; in particular, Charity Majors is often credited as being one of the major (hah) voices in bringing this definition into the mainstream attention of software engineering.

Whenever an engineer talks about observability, the odds are very high that this is the definition they have in mind.

Observability: Cognitive Systems Engineering

Here’s the second definition:

Observability is feedback that provides insight into a process and refers to the work needed to extract meaning from available data.

– David D. Woods’ and Eric Hollnagel’s Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, (Taylor & Francis, 2006), p. 121.

This definition is one that was brought to my attention by the lovely Fred Hebert. If you’re talking with someone who’s in the cognitive systems engineering space, resilience engineering space, or system safety engineering space, this is the definition they most likely have in mind.

Observability: Hazel’s Definition

Now, here’s my definition:

Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.

– Hazel Weakly

Naturally, I am not biased in the slightest; it’s merely a natural consequence of me being awesome that this is the best definition out there (just kidding). That said, you might be sitting here and wondering what exactly makes these particular definitions different. Let’s go over that.

Why Do We Need a New Definition of Observability?

To me, the point of having a good definition of a concept is that when you have one, that definition should be usable both as a way to center understanding of a concept and as a way to influence the direction in which you explore said concept and guide you towards grasping all of the implications of said exploration. As an example, one of the problems I have with the control theory definition of observability is that it gives you absolutely zero idea of where to start, where you are, or how to get there. If your system is fully observable, and you know that it’s observable… Cool, awesome, that’s neat. The rest of us have no idea what the fuck is going on and would like a map of how to get there.

Another problem I have with the control theory definition of observability is that it completely removes the people from the equation; it doesn’t literally remove them, but you probably aren’t going to think about humans at all when you read that definition. Be real, did you read that definition and go “ah yes this sounds like a people problem”? Probably not, and that’s an issue.

Leadership Insight: Most implementations of “observability” fail because it’s treated as a tooling problem rather than a strategic capability. Investment in observability is much more similar to Business Intelligence and Market Research than it is to Infrastructure and IT.

The fact that observability is often sold as a tool to infrastructure teams is throwing out the entire point of the idea by burying it in the implementation. Nobody buys PowerBI because they need to invest in “super fancy ass spreadsheet generation capabilities” or some shit like that, and likewise you shouldn’t be buying an observability vendor because you need a way to store system diagnostic information, it literally doesn’t make sense–observability is not a data problem.

So, the control theory definition makes it really hard to think about the people, and it doesn’t give you a starting point, ending point, or a strategy of how to get there. Well, that’s not great, so how about the cognitive systems engineering one?

Honestly, I like that one a lot more, and I wish we had popularized that one over the control theory one–while the control theory one helps guide the idea of the implementation of what an effective component of observability looks like, it doesn’t actually help the practitioner understand what’s going on. That doesn’t mean it’s perfect though: one really glaring thing that is missing from it (and the control theory definition) is the point behind why you care about this in the first place. You have “provide insight into a process” and “the work needed to extract meaning from that insight” and, honestly, why do you care? In addition, there’s still the problem of not really knowing where you are, where you need to go, and how to know that you got there.

Leadership Insight: A glaring deficiency in existing definitions of observability, to me, is the inability to know how many resources to invest in developing observability as a capability as well as how to invest those resources effectively.

Which leads me to why I like my definition the most:

I like definitions of concepts that capture the motivation in addition to the essence
Motivating definitions, to me, also contain an implicit sense of direction
If we’re defining a capability, it should be defined as an infinite and incremental process
Learning, without action, isn’t learning, and a definition about evolution that doesn’t include the action step isn’t complete

Observability Gone Wrong

This is probably my biggest gripe with the current direction of observability. Engineering has always been a bit of a silo from the rest of the business; it’s understandable, of course, you have a very specialized field filled to the brim with a very rapidly evolving internally focused set of concerns–no wonder it’s going to look completely alien to others. Much of the medical field is the same way, and so is the legal field, to give two other examples. However, Engineering had the golden chance of a century: Here we are with complex sociotechnical systems encompassing essentially “every fucking thing a business does to business business” and we have this awesome concept of “we need to understand what we’re doing” and what did we do?

We completely and utterly fucked it up by defining observability to mean “gigachad-scale JSON logs parser with a fancy search engine.” Really? Really? That’s the “we solve Real Serious Business Problems™” strategy we went with?

It just feels so tragic; what a waste of potential for building avenues of cross-functional understanding and communication.

Meaningful Questions

So okay, fuck it, let’s throw away the current concept of observability and think seriously for a moment: What does it mean to ask meaningful questions?

Here’s what that means to me. A meaningful question requires a few different components:

Anyone in the company should be able to ask a question
That question should be meaningful to them
“Meaningful” is not a concept that has any restraints or limitations or domains: if it’s meaningful, you should be able to ask it

I’m going to expand on that “meaningful” part because I think it’s particularly necessary and that most people have far too limited of an idea of what should be possible here. Imagine you have a group of people collaborating together on understanding a problem; you’re going to have a context of understanding that spans more than one person, and you can roughly understand that context to be a composite of multiple parts. Let’s break up components of “meaning” into things you can combine together to get a composite scope for your question:

The “vertical” context, in the sense of stream aligned teams
The “horizontal” context, in the sense of functional areas
The size of the subgroup in question: the individual, the team, the vertical, the organization, the enterprise, the market, and so on.
The time period in question: past, present, future, in six months, monthly, “every time we have a board meeting”, “if/when our competitor has an IPO”, etc
The audience in question: a service, a team, an organization, a customer segment, an industry, a group of services, a cluster, a computer, …
There’s a lot more you could add, depending on what you care about, but you get the idea

Let’s take the question “are we healthy” and blend that with various composite scopes in order to get a few examples of meaningful questions to illustrate this more concretely.

I am an Engineer on Team A that is working on service A1. Is service A1’s /health endpoint returning a successful response 99.9% of the time over a 5 minute interval?
I am an Engineering Manager of Team A that works on services A1, A2, and A3; is our team within our stated SLAs with our customers for the quarter?
We are the Senior Engineering Manager and Senior Product Manager overseeing teams A, B, and C. Are we communicating effectively with each other, are we understanding each other, and are we building things that are in alignment with both our vertical’s OKRs as well as the rest of the organization?
I am an Engineering Director of Org ABC, are we making the right trade-offs between feature work and reliability work so that we can maximize value delivery while not compromising on engineering health, employee attrition, customer satisfaction, and fiscal concerns?
I am a Product Manager, of these 50 features, which ones have the most synergy with what our GTM research is indicating we need to develop, and which ones can be designed in a way that our engineers have room to bake in reliability work into the product implementation so we can maximize roadmap velocity?
I am a Director of Customer Success that oversees customer support for the services of Org ABC, are we building the right internal tools to maximally enable our CSE function while also gaining the ability to understand what classes of customer support to automate or proactively mitigate?
I am the VP of Engineering, are we designing our engineering culture and engineering process in a way that maximizes productivity and ensures alignment of development work with the company north star?
I am the CTO, are we preparing our architecture to strategically position ourselves against the market today as well as ensuring that we build capabilities that allow us to rapidly innovate five years in the future?
I am the CISO, what is our business continuity profile, how does our risk profile look, and are we working effectively with other functions to ensure that appropriate trade-offs are being made to keep us in the clear in a cost-effective manner?
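Only the first of these questions is narrow enough to check mechanically, which is sort of the point. As a rough sketch of what answering it could look like, assuming you already collect health check results somewhere: `HealthCheck`, `success_ratio`, and `meets_target` are hypothetical names I made up for illustration, and the 99.9% target and 5 minute window come straight from the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    timestamp: float  # seconds since some epoch
    ok: bool          # True if /health returned a successful response


def success_ratio(checks, now, window_seconds=300.0):
    """Fraction of successful checks in the last window, or None if there were none."""
    recent = [c for c in checks if now - window_seconds <= c.timestamp <= now]
    if not recent:
        return None
    return sum(c.ok for c in recent) / len(recent)


def meets_target(checks, now, target=0.999, window_seconds=300.0):
    """True only when there is data in the window and it meets the target ratio."""
    ratio = success_ratio(checks, now, window_seconds)
    return ratio is not None and ratio >= target


# Hypothetical data: 1000 successful checks spread over the last ~250 seconds.
now = 1_000.0
checks = [HealthCheck(timestamp=now - i * 0.25, ok=True) for i in range(1000)]
print(meets_target(checks, now))
```

Notice that the sketch has to decide what "no data in the window" means (here it counts as not meeting the target). Even the simplest question on the list carries a judgment call, and the questions further down carry a lot more.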

I could write hundreds of these, but the point is more that “are we healthy” is meaningful in so many ways that it’s going to be a different question, not only for every person who asks it, but every time a person asks that question. Asking the same question twice is not something that should be happening, because you won’t be the same company that you were when you asked the question last. Even if you asked the question yesterday, or an hour ago, you’re a different company now, with different context, different aims, different information, different everything.

Leadership Insight: You will never ask the same question twice. That’s why observability is a process of capability development.

Useful Answers

If we have a better understanding of what a meaningful question is, that’s cool, but that isn’t super useful for the business if we don’t have an idea of what a useful answer is.

For me, useful answers also have a few different components:

The answer should be useful by way of concretely moving the person asking closer to achieving stated or unstated business goals. Answers that are theoretically useful or maybe useful or “huh that’s neat” or “I might use that someday I guess” don’t count.
The answer’s utility should not require the answer to be “correct” or “factual” in any way.
While questions only need to be meaningful to someone, answers should try to be useful to everyone.

That’s… A lot harder than it looks. But luckily we have a saving grace: throw away your desire to have truthful, factual, or correct answers to meaningful questions.

Seriously, I mean it. I don’t mean it in a “we live in a post truth world” bullshit way, I mean it in the understanding of reality that comes when you realize that because everyone’s context and understanding and interpretation of the world is different, there is no way to ever arrive at a definition of “correctness” or “truth” or “fact” that is also useful for a situation that is not absolute and objective. This might terrify you, but lean into it and let it liberate you. Answers are useful if they let you move forward with concrete action: that’s it.

Leadership Insight: If you’re asking a meaningful question, it’s not going to have an objective answer; it’s subjective by definition because the meaning itself is subjective.

You know that phrase that everyone loves to quote? “Disagree and commit”? I hate it. I think it’s a phrase that causes a lot more harm than good because it’s quoted so often out of context and used frequently as a cudgel by leadership to force top down consensus when it was originally intended to be a reminder to leaders to trust the people you hired.

That said, if you take the concept of trusting those you work with, and you throw away the oppositional and aggressive framing it’s buried in, you get something really cool: trust the questions people ask and utilize the answers they learn.

Get rid of “disagree and commit” and lean into “ask meaningful questions, get useful answers, and act on what you learn.” As a leader, it’s your job to help enable as many answers as possible to be meaningful to the business.

Process of Development

I want to tackle the other part of my definition now, which is that we have this process and it’s a process through which one develops an ability. What does that mean? It means you start out being fucking terrible at it and that is a Feature, Not a Bug™.

Think back to the first time you tried to do anything in engineering, or marketing, or sales, or any other part of your professional career. Not only was it natural for you to be bad at something, it was actually a good thing; getting things wrong is a necessary and integral part of the learning process itself. It’s through correction, evolution, enhancement, and iteration that you develop so many vital skills and hone your intuition. If you didn’t have that, and you just made the right choices, you’re not smart, you’re just lucky. Leaders don’t like being lucky for a reason: it doesn’t scale, and it’s terrible luck to be lucky.

What that means to me for observability is that at the beginning, you’re going to be severely limited in the breadth, depth, scope, and nuance of your questions. But that’s okay! The simple questions are still meaningful questions to ask. This is something I see people trip up on a lot, so I want to hammer it home here.

In an ongoing process of iterative development, the progress itself is the output. You can’t ask a sophisticated question without having first asked a simple one; that’s just not how it works. Imagine going into a fiscal planning meeting and asking “hey what’s the Discounted Cash Flow analysis broken out for our various business units” while everyone’s still busy clarifying what each business unit needs to declare as CapEx vs OpEx. Not only are you talking completely past everyone and derailing the entire meeting, but you are going to get the wrong answer and you will set yourself up for failure in the future by trying to ask a question like that before you have the basics down.

Leadership Insight: Asking the basics is not a sign of incompetence, it’s a sign of trusting the process and developing your observability “muscle.”

For computer systems, your basics are probably going to look something like this (in order of increasing sophistication):

1. “Is our service reachable internally”
2. “Is our service reachable externally”
3. Ok, cool cool cool, uptime is a lie, whatever: what is our uptime anyway?
4. Is our service reasonably performant?
5. Is our service reasonably cost effective?
   - This is where “traditional” monitoring usually stops
6. Repeat all of the above but for each sub-service
7. Repeat all of the above but for each endpoint
   - This is where “modern observability” starts to really differentiate itself
8. Repeat all of the above, but from the perspective of an individual end user
   - This is where SLOs start to really become necessary as a tool for asking questions
9. From the perspective of an individual end user, what’s the performance of an end-to-end request, segmented by every point in the chain?
   - This requires distributed tracing
10. Which of these various tuning options has the best performance characteristic?
   - A/B testing and other variation functionality becomes invaluable here
11. How does our system behave in various situations that we might not have accounted for?
   - This is where chaos testing, fault injection, and other experimentation strategies start
12. Where are the most effective points in the system to leverage humans for adaptive capacity
   - (your next $1 billion startup goes here)
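To illustrate the jump from the service-wide questions to the "repeat all of the above but for each endpoint" ones, here is a small sketch. Everything in it is hypothetical: the request records, the endpoint names, and the choice of success ratio plus median latency as the two numbers worth looking at.

```python
from collections import defaultdict
from statistics import median

# Hypothetical request records: (endpoint, succeeded, latency in milliseconds)
requests = [
    ("/health", True, 4.0),
    ("/health", True, 6.0),
    ("/checkout", True, 120.0),
    ("/checkout", False, 900.0),
    ("/search", True, 80.0),
    ("/search", True, 100.0),
]


def summarize(records):
    """Return (success ratio, median latency in ms) for a non-empty list of records."""
    ok = sum(1 for _, succeeded, _ in records if succeeded)
    latencies = [latency for _, _, latency in records]
    return ok / len(records), median(latencies)


def summarize_by_endpoint(records):
    """Ask the same question again, once per endpoint."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[0]].append(record)
    return {endpoint: summarize(group) for endpoint, group in grouped.items()}


service_view = summarize(requests)               # the whole-service question
endpoint_view = summarize_by_endpoint(requests)  # the same question, per endpoint
print(service_view)
print(endpoint_view)
```

With this made-up data the service-wide view says five out of six requests succeed, which sounds fine until the per-endpoint view shows /checkout failing half the time. Same data, same question, different scope, very different answer.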

So looking at this, and then looking at your company, you’ll notice that a lot of companies are realistically only somewhere between steps 1 and 3. That’s okay! It’s completely fine to not go further as long as the questions you can ask that are meaningful to the business aren’t captured by anything more sophisticated. Because after all, if you have no need to ask more nuanced questions, why would you need to develop further sophistication in your observability strategy?

Some companies deeply need to be able to ask very nuanced questions around how humans and technology interoperate in a variety of unanticipated areas with a lot of unknown unknowns under very tight operating constraints. Some only really need to know “code go in, money get made.” That’s not a failure of the business; the only failure here is investing disproportionately to your need.

Leadership Insight: That said, while the only failure of observability is investing disproportionately to your need, most companies are either investing too much or too little into observability.

In my experience, I see most companies investing too much money into observability with very little meaningful return on investment because they keep treating it as a tech and tooling problem rather than a research capability.

Tying Things Together

We had the Control Theory definition of observability, and the Cognitive Systems Engineering definition of observability, and then I presented my definition of observability:

Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.

We also went over what the “meaningful questions” and “useful answers” bit means, and we went over the process of developing an ability. When we combine those two, we get something that actually really reminds me of the five levels of expertise in the Dreyfus model of skill acquisition (novice, advanced beginner, competent, proficient, expert).

Which, honestly, I love that; you absolutely should be thinking of observability as developing an organization-wide capability of asking meaningful questions and getting useful answers. Of course, once you have a useful answer, you have the final part: acting on it.

Learning, without action, isn’t learning; it’s fundamentally a process. And processes? Processes are messy, they require action, they require movement, they require doing, they require re-evaluating the process, they require evolving the process, they require wrangling with the human condition itself.

Just like observability.

To put it simply, observability is organizational learning.

The Trap of Soulless Productivity

2024-04-03T00:00:00Z

If there’s one thing I wish I could burn entirely to the ground and wipe away all traces and remnants of, it’s the misplaced notion that the productivity of Knowledge Work can be managed, measured, analyzed, and optimized as if all one needed to do was drip feed heroin up the arse of their hapless workers.

What is Knowledge Work™, you ask? There are two concepts of Knowledge Work that I’m thinking about right now. The first is Knowledge Work as Imagined, and the second is Knowledge Work as Done. (I’m temporarily ignoring the actual literature definitions of Knowledge Work for the sake of ranting out some frustration. Forgive me pls)

Knowledge Work as Imagined is when you take the best of humanity, you embrace it, and you turn the lovely unbridled enthusiasm and exploratory nature of humanity into a powerful self-feeding engine that paints the world with the colors of the human soul itself as it learns to understand the world around it. It’s art, beauty, love, and life. It’s this amazing fucking thing that happens when you take a bunch of humans and you stick them in a pile and say “go forth and learn to love the world.”

Knowledge Work as Done is what happens when you take art and artistry and creativity and imagination and the soulful awe inspiring wonder of a child and you figure out how to forcibly shove it into something that is roughly shaped like an assembly line.

Knowledge Work as Done is where the love of the world goes to die, it’s where one of the most unique and beautiful aspects of the human mind gets turned into its most terrible weapon, it’s the snake that eats its tail, it’s the adult world equivalent of taking the quiet artist, giving them a wedgie, and shoving them into a high-school locker while you laugh at them and take all their pictures and shove them into chat jippity do dah, zippity day, my oh my, we’re gonna IPO today.

It’s a disgrace.

It doesn’t have to be this way, of course. We could be a lot better at this; we could be infinitely better at this, even. But, that requires understanding what makes Knowledge Work tick, what makes it… Work, and how one might nourish it and encourage it to grow rather than brutally ripping it out by the roots and screaming at it until it learns to behave. In short, understanding Knowledge Work means understanding the human condition itself, and taking a dark look at how we managed to turn humans from a social, equitable animal that has unlimited curiosity and a desire to help each other succeed into a raving, bloodthirsty mass of hyperindividualistic demons solely bent on hedonistic self-exploitation at the expense of the other. Seriously, how the fuck did we do that? How? How did we so deeply and fundamentally break humanity like this?

Now you might be reading this and going “Hazel, that’s a lotta emotions, goodness; but, be real now, how do you actually expect a company to pay millions of dollars for knowledge workers and not want to optimize them?” Well, you, my dears, are probably not thinking this, but this is unfortunately a realistic question one might ask when attempting to be Doing a Capitalism™.

Sure, fair enough, let me rephrase that question a bit:

“How does one measure creativity, the growth of institutional knowledge, and the value of that knowledge in terms of dollars per hour?”

Which is really what you’re asking when trying to define productivity for Knowledge Work. But it probably feels like a more ridiculous question now, doesn’t it? (That’s because it is)

As for the answer to that question? About dollars per hour and Knowledge Work? Here it is: One can no more abuse a dog into loving them than one can “productivity” a knowledge worker into generating a positive ROI.

In fact, you can replace “measuring productivity” with “inflicting animal abuse” and get an accurate idea of what’ll work and what won’t. If it sounds like animal abuse, it won’t actually measure productivity for Knowledge Work.

Here’s an example!

BEFORE: I’d like to [[measure productivity]] by [[tracking the lines of code per hour produced and withholding promotions from the bottom 10% performers]]

AFTER: I’d like to [[inflict animal abuse]] by [[tracking the lines of code per hour produced and using shock treatment on the bottom 10% animals]]

Sounds horrific, doesn’t it? Guess what: it doesn’t work. Amazing. Who woulda thunk. Fear and abuse and spreadsheet hacking don’t help people be creative and share ideas? Astounding really.

Humans want to be creative, humans want to love each other, humans want to make the world a better place, humans want to do art, humans want to be art, humans want to inspire, humans want to be inspired, humans want to learn, humans want to teach, humans want to heal the world, humans want to heal each other, humans want to collaborate, humans want to build, humans want to be beautiful, humans want to find beauty, humans want to create, humans want to be awed, humans love this fucking universe.

One of the best things for me recently has been watching all of the research fly out about how humans work, how they cooperate, how they really learn, and guess what? It’s not even a “you can have your cake and eat it too” thing. It’s literally “you can stop eating coal and start eating cake”. Seriously! Humans are wired to be productive by sharing, by loving, by growing.

I spent my entire life thinking I had to put that aside when Doing Capitalism in order to be successful. That’s not true!

It just breaks my heart that we have so much out there still that’s stuck in this old way of thinking that the only way to have humans create efficiently is to torture them into submission and rip out their very souls and dump them into the Capitalism Monster. It’s beyond aggravating to have to explain that, no, one can’t measure productivity, but they can measure belonging, and safety, and learning, and all of these wonderful ideas.

Not only is that “fine”, it’s better.

I Miss the Days of Humanity

2024-04-12T00:00:00Z

I miss the forums.
I miss the forums so much it hurts.
I miss when research was about discovery and learning and sharing.

I miss when humanity felt like it had hope,
when human interaction was plentiful,
when genuine connection wasn’t rarer than gold.

I miss the days before our souls were destroyed for the sake of the market,
before our knowledge was plundered,
before our humanity exploited.

I miss when the song of humanity was sung in the streets.
I miss it, even though I was born after the war was lost.

Now we whisper the truth and shout the lies,
but this was not the fault of AI.
We whisper the truth and drown in the noise,
but this was not the fault of academia.
We whisper the truth and bury it in disguise,
but this was not the fault of the internet.
We whisper the truth and watch as it dies,
but this was not the fault of humanity.

We whisper the truth because we no longer know it,
and know of no one or no place or no where to find it.
Our oracles were slaughtered; our teachers starved.

If someone is lost in the desert, and goes many days without food or water,
when they are rescued, you must do something very important:
Do not feed them, they will die; do not water them, they will drown.

Their bodies are not ready yet,
they will burst under the weight of life,
they must be brought back slowly.

What will happen to us when we starve ourselves
of our humanity
for decades?

What horrors will we encounter
as we burn our souls?

Worse: how, on earth, in the heavens,
will we heal?

How does one flame in the darkness,
in the howling wind,
find another flame to huddle with,
to keep warm,
to share the connection of humanity
and the joy of learning?

How does one go on as the world gets snuffed out?

How does one heal the garden
where the salt was sowed,
where the poison was poured,
where the rocks were thrown?

How does one heal that which has scarred so heavily
it may never grow life again?

When we recover,
as we eventually will,
what will be left of us?

The nightmare of humanity lost started
with those who seek to find it.
Publish and perish,
draw blood from the stone,
turn lead to gold.

But then it grew.

It became the price to pay to participate in society:
“give us your humanity, give us your thoughts”,
“let us profit from the words you pen,
from the inscriptions you carved,
from the art you created”.
The price was free;
the cost was everything.

But then it grew.

As we built machines to move dirt,
ever faster, ever farther, ever higher.
As we built machines to construct buildings ever greater.

So we did with words, structure, and thoughts.
We built parrots to speak sounds of saying,
words of no meaning,
thoughts of no thinking.

As we built machines to speak words,
ever faster, ever farther, ever higher.
We build today machines to construct noise ever louder,
that we might drown out every ounce of humanity.

Today, you can find me, you can hear my voice,
but tomorrow?
I cannot promise to you