<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Hazel Weakly</title>
  <subtitle>A feed of the latest blog posts</subtitle>
  <link href="https://hazelweakly.me/atom.xml" rel="self" />
  <link href="https://hazelweakly.me/" />
  <updated>2026-01-05T00:00:00Z</updated>
  <id>https://hazelweakly.me/</id>
  <author>
    <name>Hazel Weakly</name>
  </author>
  <entry>
    <title>Observations of Leadership (Part Two)</title>
    <link href="https://hazelweakly.me/blog/observations-of-leadership-part-two/" />
    <updated>2026-01-05T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/observations-of-leadership-part-two/</id>
    <content type="html">&lt;p&gt;Hey again! Welcome back to part two of me reflecting on the past few quarters and writing down my answers to John Cutler and Tom Kerwin’s questions on how leaders navigate uncertainty and ambiguity. If you’re lost, part one is &lt;a href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/&quot;&gt;here&lt;/a&gt;. I started writing my reflections to this a while ago (almost two years now!) and decided that I actually wanted to separate out every edition of this by a few years. That’s how long it takes for feedback cycles to truly hit at my level, anyway, and it’s a good excuse to practice self reflection. I’ve also grown to enjoy seeing how my thinking matures over the years, and this is a natural way to do that.&lt;/p&gt;&lt;p&gt;My life has also changed quite a bit since the last post. I left that company about six months after writing that post and joined a major financial firm as a software architect, specialising in architecting distributed systems and zero trust networks. After working on some very large scale architectural plans there, I immigrated to the Netherlands and joined a different financial firm as a software architect, with the goal of contributing towards resilience efforts across the entire firm. Consequently, I’ve gone through quite a lot of ego death in the last year; it’s humbling to be an immigrant. Additionally, no matter how you try to develop the skill of reasoning about large scale changes, there’s only so much experience you can get at startups; my last two companies being mature enterprises have given me the opportunity to add much needed depth to my perspectives. While I’m at it, you’ll notice that my examples get a bit more vague here; working in a regulated industry doesn’t make it easy to go concrete on &lt;em&gt;details&lt;/em&gt;. 
Regardless, I’ll do my best to share what I’ve learned–luckily the lessons are, I think, largely conveyable without numbers or UUIDs or getting myself in trouble.&lt;/p&gt;&lt;p&gt;Without further ado, here we go!&lt;/p&gt;&lt;h2 id=&quot;blend-diverse-perspectives&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-two/#blend-diverse-perspectives&quot;&gt;&lt;span&gt;Blend Diverse Perspectives&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Can you tell me about a time you needed to make space for many diverse perspectives, including those that you found particularly challenging? How did that inform collective decisions and actions moving forward?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/blend-diverse-perspectives&quot;&gt;https://cutlefish.substack.com/i/142017363/blend-diverse-perspectives&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This has been a huge area of growth and maturity for me. In smaller companies, I have found that it’s often accurate (or at least a functional mental model) to think of things as “us vs. them”. In other words, you don’t really tend to run into situations where you need to make space for diverse perspectives. The strategy is often more around taking those diverse perspectives and winning them over to your way of thinking. There’s not a lot of room in a small company for multiple diverse perspectives to run simultaneously, so it makes sense that you primarily prioritise consensus rather than loose alignment.&lt;/p&gt;&lt;p&gt;However, at larger companies, you run into the inevitability of complex systems. One of the most unavoidable aspects of a complex system is that as you have multiple actors operating, you inevitably will find them operating at cross-odds with each other. These actors are not necessarily on opposite sides, and often are even on the same team. 
They could even be trying to work for the same purpose! But ultimately they’re going to operate at cross-odds and appear to undermine each other.&lt;/p&gt;&lt;p&gt;Not only is this unavoidable, not only is this inevitable, but it’s actually also–surprisingly–not undesirable. This was one of the more interesting things for me to learn. It turns out that operating at cross-odds creates friction, which is obvious. What’s not obvious is that this particular flavour of friction is incredibly valuable and deeply informative. When you know actors are working towards compatible goals and yet are operating at cross-odds, it forces you to consider the design space and the solution space in a more diverse and interesting way.&lt;/p&gt;&lt;p&gt;So now, when I’m thinking about working at cross-odds, I want to make sure that it happens. I don’t want to avoid it; after all, it’s unavoidable. Given that, I want to ensure that it happens in the &lt;em&gt;right&lt;/em&gt; places, at the &lt;em&gt;right&lt;/em&gt; time, and in the right amounts. Otherwise you have too much similarity going on and the organisation is not seeing enough of the solution space or problem space to understand where there are complexities. If everybody gets aligned immediately and is spontaneously cooperative, it’s either not an interesting problem or the solution is too simple to work in practice.&lt;/p&gt;&lt;p&gt;As an example, I was at a company and we were doing a proof of concept. The team had been rolled out, we were part of a new department, and we were focusing on trying to prove out a very large potential initiative. One of the problems that I ran into immediately was that I couldn’t quite understand why we were doing things the way that we were doing them; it felt pointless, like we were working at cross-odds with the rest of the firm. To illustrate: we had a few different subsections of the project, and the more I dug into them, the more I realised a few quirks. 
One subsection of the project could be entirely consumed by a separate and existing team in a different part of the organisation. Another subsection of the project could have been handled by yet another team who were already subject matter experts. A third subsection of the project was actually something that was novel and interesting, but was ultimately unlikely to succeed for reasons outside of our control. Lastly, the scope, roadmap, and timeline were utterly impractical and inconsistent: It felt very much like something that was trying to boil the ocean, but with a set of resources that was only feasible for boiling a puddle.&lt;/p&gt;&lt;p&gt;As the project developed, it completed a proof of concept that was considered “wildly successful” and generated lots of positive attention. Naturally, shortly after, multiple staged re-orgs happened. During the re-orgs, the project got split up across multiple different teams and multiple different areas of the organisation, and its roadmap was modified. Now, behind the scenes, what I came to understand was that this was how innovation actually happened at a company of this particular political structure; I hadn’t understood the success criteria initially, and once I did, it all made sense.&lt;/p&gt;&lt;p&gt;The initial design was set up that way precisely so that the reorganisations could happen. In addition, the project scope was overly ambitious for two reasons. One, so that when it was cut down, the critical pieces could be kept. Two, because as the project moved into multiple different parts of the organisation, each party had their own set of priorities and needed sufficient political leverage to demonstrate why they needed more funding to achieve the goals stated. In other words, a lot of the “wrongly” sized aspects of the project occurred as we absorbed various roadmap political agendas of stakeholder teams; when they “acquired” their part of “our” project, they needed to cut their roadmap some and adjust.
However, if our roadmaps were secretly already partially pre-merged, then the “cuts” &lt;em&gt;look&lt;/em&gt; extensive but are actually not that bad; in addition, there’s now ample leverage to demonstrate that extra funding is needed to accommodate disruptions from the multiple re-orgs (despite underlying teams staying almost entirely the same). When things inevitably got delayed or disrupted later, and when budget and scope creep occurred, leaders could reference the “original” plans and use them to provide a stronger argument for more resources. If &lt;em&gt;they&lt;/em&gt; made the argument without validating anything, they’d look greedy as a leader, but if the initial proof of concept organisation made the argument, and did so despite everyone’s best efforts to keep costs low… Well, that’s a different story. It further turns out that the executive(s) in charge of the organisation had a habit of making severe budget cuts and then adding back funding as needed, so this was a compensatory measure to prepare for that.&lt;/p&gt;&lt;p&gt;Ultimately, once I understood this, the way it changed collective decisions and actions moving forward was that it became critically important to understand your stakeholders. Not in the nebulous “stakeholder management” sense, but in a very concrete manner and with a specific strategy behind it. Your stakeholders need levers to do political maneuvering. They have resources, they have a budget, they have a timeline, and what &lt;em&gt;you&lt;/em&gt; need to do is understand how to help them win at what they’re trying to do; it’s an entire unspoken language. Essentially, you need this notion of reciprocity, and you need to be able to develop reciprocity with a multitude of teams in various organisations.
Importantly, you also need reciprocity with leaders who also want that mutual reciprocity… But who also have no problem simultaneously playing this as a zero-sum game and stabbing you in the back if they think that it’ll give them an advantage.&lt;/p&gt;&lt;p&gt;The paradox of cooperation becomes that we need to simultaneously all buy in to the idea that there’s this reciprocity, and yet also simultaneously never truly believe it for a moment. Which felt like a bunch of stupid horseshit when I first encountered this type of thinking. It’s why a lot of people say “they don’t like politics in companies”. After all, this type of thinking feels &lt;em&gt;extremely&lt;/em&gt; toxic at a small or even medium or even semi-large enterprise (and at almost any startup, it certainly is). But at a large enough size of company, the complexities of the interdependencies and subsequent politics involved become such that this is fairly inevitable. It also means that, surprisingly, there are ways to keep it healthy. Of course, it’s not going to be healthy if it’s “backstabbing” (and definitely not if we &lt;em&gt;call&lt;/em&gt; it that), but friction is interesting and this evolves into an extremely high bandwidth way of exploring otherwise insurmountable solution spaces.&lt;/p&gt;&lt;p&gt;In other words, it’s an outcome of the nuances that emerge at that many layers of overlapping strategy. You begin having emergent behaviour and meta-cognition and game playing and meta-gaming; it’s just going to turn out that way. 
So, weirdly, learning how to scale your strategy isn’t necessarily about scaling it to &lt;em&gt;more&lt;/em&gt; players, it can very much be about taking those collaborative decisions and enabling them to happen in an eventually consistent way, across different levels of gameplay and theorizing.&lt;/p&gt;&lt;p&gt;I found all of this fascinating to experience.&lt;/p&gt;&lt;h2 id=&quot;patience-and-self-repair&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-two/#patience-and-self-repair&quot;&gt;&lt;span&gt;Patience and Self-Repair&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Can you share an example of a time you chose not to intervene in a situation, allowing it to resolve on its own? What informed this decision, and how did it turn out?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/patience-and-self-repair&quot;&gt;https://cutlefish.substack.com/i/142017363/patience-and-self-repair&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;When I think of this question, one of the things that comes to mind for me is understanding why I am not intervening. For me, that’s actually the real question. Am I trying to save my bandwidth? Am I trying to teach them a lesson? Am I helping them build the skills of autonomy? Or is it some other reason? What is actually going on here?&lt;/p&gt;&lt;p&gt;Rather than developing a bunch of ad-hoc explanations for all this, I tried to look for something grounded in a scientific theory, because the arbitrary nature of deciding for myself when or when not to intervene felt uncomfortable. Luckily, I am a massive nerd and find a lot of utility in reaching for psychology here, so I did end up finding something useful.&lt;/p&gt;&lt;p&gt;There is a concept in psychology called psychological affordances. 
I find it to be something that now guides me in these decisions and has the benefit of wrapping up all of the different factors and blending them together into something that I can reason about. Psychological affordances gave me a way to understand when the conditions are right for an intervention and what that intervention may look like. For example, one intervention is not doing anything, so it turns out that whether or not you intervene in a situation is the wrong way to think about it. You &lt;strong&gt;are&lt;/strong&gt; always intervening because a “non-action” is just as impactful as any other action you are taking. It is, in fact, an option of equal weight. Not making a choice isn’t an option as an intentional leader.&lt;/p&gt;&lt;p&gt;One particular example that comes to mind here is when I was working on developer experience at a prior company. We had a particular item that was listed as a major driver for developer experience in the surveys that we pulled out. This particular driver was “tech debt.” For several quarters in a row, it was consistently in the top 5 priorities that the company needed to address. However, it never made sense to address tech debt, and the reason was that these psychological affordances weren’t there; the conditions were “never right.” For starters, nobody in leadership wanted to hear about it. There was always something around the corner that they wanted to focus on, always another thing that was a higher priority, and there was this perpetual feeling amongst the leadership of “we’re almost at the finish line and then…” You can’t sell someone on an intervention that sounds like putting on the brakes when their entire world revolves around pushing towards the finish.&lt;/p&gt;&lt;p&gt;More importantly, the way that roadmapping and prioritisation was structured in the company meant that engineers didn’t &lt;em&gt;have&lt;/em&gt; a lot of say over their roadmap but were &lt;em&gt;told&lt;/em&gt; they did.
In reality, it was the product managers that were most responsible for roadmapping. The Product organisation had two huge issues. The first was that very few of them believed in tech debt as a concept and felt that engineers should just deal with it invisibly. Some even told me to my face that it was just an excuse engineers gave when complaining too much! The second problem was that product managers had almost no visibility into how the engineering teams actually worked and none of the leadership teams talked to each-other at any level lower than the C-suite. From the product perspective, they just designed a roadmap, then the engineering leaders said okay, and then would invisibly fail to deliver. Product leadership couldn’t understand why, due to lack of communication, and even if they did understand they didn’t know how to help developers because they were not given the information about what was actually happening. Critically, product managers were not reliably told the consequences or the impact of certain choices that had been made, and so kept running the engineering teams into “expensive” choices such as re-architecting or database schema changes.&lt;/p&gt;&lt;p&gt;All of that is to say that “tech debt” was essentially a loaded term and in-actionable. It was, however, a very common language to describe symptoms, and people used it to describe a multitude of various different things which had nothing to do with each other. Which also meant that there was no suitable intervention; there’s nothing we could do from the leadership team to actually fix the problem, because the problem wasn’t the tech debt, it was all the other things.&lt;/p&gt;&lt;p&gt;To address this, what I would do is every quarter I would look at the tech debt driver, I would read through commentary, and then I would go find &lt;em&gt;&lt;strong&gt;literally anything else&lt;/strong&gt;&lt;/em&gt; to accomplish. 
When I found another initiative that had the right conditions to be executed on, I’d massage it so that, when executed, it addressed “a thing” tech debt was being used as a proxy for.&lt;/p&gt;&lt;p&gt;For example, one intervention was giving all the product managers access to the developer experience tools. Once I showed them how to use everything, to see the developer feedback, and to see what concerns teams had had, they were able to go and have meaningful conversations with the engineering managers and help set timelines accordingly. Additionally, it turned out that while the engineering leaders were under the impression that they didn’t have any influence on the roadmap, the C-suite had the impression that the engineering leaders &lt;em&gt;did&lt;/em&gt; in fact have control over their roadmap. So, I worked with C-suite, engineering, and product leaders to make sure that all the implicit assumptions were laid out and clarified. Once assumptions regarding what the actual bandwidth of an engineering team should be were made explicit, and once teams parted out time accordingly, the tech debt problem started to go away “magically” on its own.&lt;/p&gt;&lt;p&gt;It was never actually about tech debt, it was about communication. Naturally, had I &lt;em&gt;called&lt;/em&gt; it a communication problem, nobody would’ve talked with me; what kind of self respecting leader has bad communication issues? They were swamped with meetings! Of course they’re communicating!&lt;/p&gt;&lt;p&gt;Well yes, but also.&lt;/p&gt;&lt;h2 id=&quot;anticipate-effects&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-two/#anticipate-effects&quot;&gt;&lt;span&gt;Anticipate Effects&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Can you tell me about a time you needed to try to anticipate the unintended side effects of a difficult decision?
What did you watch out for?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/anticipate-effects&quot;&gt;https://cutlefish.substack.com/i/142017363/anticipate-effects&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Oh man, this is a tricky one. This is especially tricky because uh, yeah, okay, anyway. Rest assured that while this will be light on concrete details, it’ll hopefully be just as insightful.&lt;/p&gt;&lt;p&gt;A while back, I had to look through and compare a couple different architectural choices of a very large project. If we made the wrong choice, it could have negative downstream effects for multiple decades, and the project’s impact potentially involved an exorbitant amount of money. Consequently, when a company has a decision this large, they really want to make sure that you are anticipating side effects; more practically speaking, you should really be doing as much work as you can to cover your ass.&lt;/p&gt;&lt;p&gt;I understood that many other people on the project were going to cover risk from multiple conventional angles. Thus, what I tried to do was discover and address dark swan events. This meant that, for a variety of reasons (which I &lt;em&gt;will&lt;/em&gt; be hand-waving over), I ended up having to look at the corporate capture potential and supply chain continuity of multiple open source projects. One motivation for doing this was to determine whether or not nation state actors could get involved, and how that might affect the threat model of the project. In addition, I also traced every dependency of the possible architectural choices in order to understand and ensure that the underlying projects could continue to be maintained indefinitely (by us, by vendors, or otherwise). If that &lt;em&gt;wasn’t&lt;/em&gt; the case, assurances were needed that potential vendor(s) in question had mechanisms in place to deal with that.
There were, of course, many regulatory questions and various functional concerns, but the supply chain question ended up being the trickiest one to anticipate.&lt;/p&gt;&lt;p&gt;Looking at how the political and economic climate of 2025 ended up shaking out… Whew. When you’re right, sometimes you’re really fucking right, huh? &lt;sup&gt;(Have I mentioned how I hate being right?)&lt;/sup&gt;&lt;/p&gt;&lt;p&gt;Anyway, in addition to the supply chain question, there was another line of inquiry that had to do with understanding the true aims of the project I was working on. The reason for this was that the solution that we had &lt;em&gt;did&lt;/em&gt; address the presented business case. But arguably, if you wanted to truly solve the underlying &lt;em&gt;need&lt;/em&gt; that the business had, the way to do it was to rethink the approach to the domain entirely, rather than solve the problem with the same operating model. Essentially, I grew to believe that what needed to happen was a rethinking of the vendor relationships, IT Service Management, and the operating model around open source consumption entirely. It appeared to be the case that our strategic rethinking needed to be a transition away from playing a mutually cooperative game towards playing a maximally adversarial game. Unfortunately, when you do that, the risk that you can tolerate is substantially lower; you also have to weigh your tradeoffs in a very different way.&lt;/p&gt;&lt;p&gt;Summarising the research, I ended up writing out all of these concerns as well as strategies for addressing them appropriately and de-risking them. I then put all these into a series of documents that I published internally. These documents were then read… And politely, for the most part, ignored.&lt;/p&gt;&lt;p&gt;It made sense that they were ignored because they ran counter to a lot of the political agenda of the project itself internally, but what I found fascinating was the series of events that would occur later.
About ten or so months after I wrote these documents, pretty much point for point, every concern that I had listed out stopped being hypothetical and started being very concrete and real possibilities. One by one, all of the dark swan events that I had looked for started occurring.&lt;/p&gt;&lt;p&gt;Fuck me, I &lt;em&gt;hate&lt;/em&gt; being right.&lt;/p&gt;&lt;p&gt;Luckily, because I &lt;em&gt;had&lt;/em&gt; written all these documents, people were then able to refer to them and start strategising accordingly. However, it’s a bit difficult for me to convey &lt;em&gt;how&lt;/em&gt; I did that, because the advice that I’d give would sound vague, hand-wavy and wishy-washy: The honest answer is that this was pure intuition for me, and I just listened to the vibes and trusted my gut.&lt;/p&gt;&lt;p&gt;I suppose the best way for me to try and explain it is that I seek to understand where there are inflection points that result in having to change the way that you think about something. If you approach an event or if you approach a strategy as an incremental refinement, that’s how the majority of human progress happens. That’s good! Incremental is awesome, lean into that. But if you get &lt;em&gt;locked&lt;/em&gt; into that mode of thinking, to where you can no longer find these points of inflection that require rethinking the rules in the game, then you can get caught very unaware when those events start to happen.&lt;/p&gt;&lt;p&gt;Another thing I keep in mind is that these events never happen as a one-off thing, they happen as a chain of cascading failures in your approach to strategising; your future-sensing “intuition” as leaders starts failing everywhere at once.&lt;/p&gt;&lt;p&gt;Looking back, it was a good thing that people ended up ignoring all the things that I wrote. I was horribly annoyed about it (because nobody likes being a Cassandra), but the distance helps reflect here.
&lt;em&gt;Had&lt;/em&gt; my concerns been “taken seriously”, and had they been addressed at the time, it would have cost the project its success because we would have been solving problems that did not politically exist yet. Later, when we did run into the problems and encounter them, we were able to do the pivot correctly, in part because the architects in question had been talking to me and had factored in some contingencies where possible. Of course, we had a team of competent people, so naturally we were able to address the problem once everyone agreed that it existed.&lt;/p&gt;&lt;p&gt;Therein lies the rub; as it stood, now it looks like we could predict the future better than &lt;em&gt;multiple&lt;/em&gt; executives; because that’s statistically infeasible (given that they have more information than us on average), it gave people the large enough upset needed to realise they needed to update their mental models. That, more than anything, is what allowed the pivot to happen correctly.&lt;/p&gt;&lt;p&gt;It was far more important to create a scenario where everyone had to update their mental model than it was to be right about something “early.” Rather than feeling almost a year late like I initially thought it was, the timing ended up being right around the time that it needed to be.&lt;/p&gt;&lt;h2 id=&quot;curiosity-and-light-touch&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-two/#curiosity-and-light-touch&quot;&gt;&lt;span&gt;Curiosity and Light Touch&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Describe a moment you caught yourself making a snap judgment and instead opted for curiosity. 
What prompted this change in approach?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/curiosity-and-light-touch&quot;&gt;https://cutlefish.substack.com/i/142017363/curiosity-and-light-touch&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Honestly, I am still pretty bad at this. I like to think of myself as someone that strives for curiosity, but I am honestly such a judgy bitch at times. My intuition is pretty good, so I’m usually not out of line with it, but it’s really hard to hold back and approach with curiosity at times. Especially when some of these choices just &lt;em&gt;really&lt;/em&gt; make you question everything you’re reading.&lt;/p&gt;&lt;p&gt;Something that helps me is I try to ask myself, “What are the environmental conditions that make what someone’s doing the most reasonable and rational possible choice to make?” When I dig into those environmental conditions and contributing factors, then I can figure out how to ask questions that are genuinely curious. What I don’t want to do is end up asking leading questions in bad faith; “just how stupid do you think you are” feels mighty cathartic to ask at times, but it’s never helpful.&lt;/p&gt;&lt;p&gt;Now for a pet peeve of mine: Leading with curiosity does not work unless you are truly and genuinely curious.&lt;/p&gt;&lt;p&gt;I see a lot of people asking questions with their ears sewed shut and it pisses me off; it’s a surefire way to make sure someone doesn’t answer the questions you &lt;em&gt;do&lt;/em&gt; want to know the answers to. Which means that I don’t lead with curiosity unless I can figure out how to make myself truly curious and willing to understand the answer that they give me. If I can’t find that spark of curiosity, the bitchy snap judgment is usually accurate (or, more likely, it’s a signal for me to delegate this to another person with the right type of curiosity). 
Either way, I’m not going to be a productive leader in that moment, so I either turn it into an opportunity for someone else, or I shut the initiative down with clear and candid feedback.&lt;/p&gt;&lt;p&gt;I had a really interesting experience where this happened when I was working on improving developer experience. One of the items that I had been looking at carefully over the last couple quarters was a driver called Focus Time. It turns out that this driver had been in the top 10 priorities, but slowly it had risen to become the top priority that everybody was clamoring about. The problem is that when I looked at this, while I saw very clear signs from the &lt;em&gt;developers&lt;/em&gt;, when I talked to the engineering leaders and managers, they were entirely unaware or unsympathetic. Somehow the signal was getting entirely lost.&lt;/p&gt;&lt;p&gt;This made me curious, so I decided to dig in and looked at all of the calendars of the developers and pulled some statistics. We found out that the developers, on average, had about two hours of meetings a week. When I saw this number, my first thought was to laugh. My second thought was “you whiny little bitches, just get over it.”&lt;/p&gt;&lt;p&gt;Now honestly, that’s a fairly valid reaction; it’s a &lt;em&gt;very&lt;/em&gt; easy thing to think as a leader when you have over 35 hours of meetings a week and a full load on your plate that you have to get done on top of that. Then to turn around and listen to developers complain about a mere two hours spread over a whole week? It’s honestly hilarious. It’s so hard to be empathetic and curious in that moment.&lt;/p&gt;&lt;p&gt;Consequently, I initially was convinced that either the data was wrong, or the developers were being little babies. (Honestly, my money was kind of on the latter.
It was &lt;em&gt;very&lt;/em&gt; hard to take that data seriously!)&lt;/p&gt;&lt;p&gt;Because we had other major drivers that were more actionable and immediately impactful, I sidelined this for a few quarters until the signal had persisted long enough that it was unarguably present. When it finally became clear that the signal from the developers didn’t match the data &lt;em&gt;and&lt;/em&gt; was significant, &lt;em&gt;and&lt;/em&gt; I had dealt with all the obvious road-blockers that it &lt;em&gt;could’ve&lt;/em&gt; been… Well, time to get curious! Because I couldn’t trust the data &lt;s&gt;or take it seriously&lt;/s&gt;, I started by directly talking to all the developers openly and curiously, just casually interviewing them to understand what was actually going on.&lt;/p&gt;&lt;p&gt;What I encountered was quite illuminating. It turned out that several different things were going on.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Firstly, the developers were promised a “No Meetings Thursday” by the leadership team. This promise had been ignored for quarters which meant that the developers didn’t &lt;em&gt;feel&lt;/em&gt; that leaders respected their time.&lt;/li&gt;&lt;li&gt;Another thing that was happening was that the organisation had quadrupled the amount of headcount per quarter that it wanted to hire, but it did not increase the amount of engineers on the interviewing teams. This subsequently crushed those engineers under meeting fatigue. Whoops! (This is honestly surprisingly easy to do as very few people-related things in an organisation automatically scale)&lt;/li&gt;&lt;li&gt;Yet another contributing factor ended up being that a significant amount of meeting load was entirely invisible. During interviews engineers would casually say, “oh, I’m stuck in meetings all day,” and I would check their calendar and they didn’t have any meetings on the calendar. 
Their workload was full of invisible meetings: 20-minute meetings that silently turned into 4-hour meetings, or impromptu meetings that weren’t on calendars.&lt;/li&gt;&lt;li&gt;Finally, the last contributing factor was that it was “audit season” when the survey ran; many developers were stuck in very boring and painful compliance-related meetings.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Once I was able to approach the problem with curiosity, the required interventions became quite simple.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;First, we wrote a document on meeting etiquette and spread that around. This included guidance on “when it can be an email”, when to schedule a meeting, how frequently to make it recur, advice to chunk meetings together rather than spreading them apart, and more.&lt;/li&gt;&lt;li&gt;We then empowered the developers to turn down meetings.&lt;/li&gt;&lt;li&gt;We also ensured that “No Meetings Thursday” was actually respected (for ICs at least).&lt;/li&gt;&lt;li&gt;We asked developers and managers to update meetings on their calendars and make the calendar reflect the actual meeting load the developers had.&lt;/li&gt;&lt;li&gt;We increased the number of people in the interviewing loop rotations and wrote procedures to prevent the load mismatch from happening again.&lt;/li&gt;&lt;li&gt;Finally, while compliance meetings are unavoidable, they can be accounted for; engineering managers were instructed to reduce ticket load appropriately for people in meetings. (Compliance meetings &lt;em&gt;are&lt;/em&gt; work for ICs.)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;After all these interventions happened, the average meeting time for developers “rose” to nearly 6 hours a week. Additionally, it took less than a quarter for Focus Time to become the lowest priority driver to improve. 
It’s funny, because I theoretically could’ve solved it several quarters earlier… But I couldn’t take the problem seriously enough (and neither could anyone else); once I was able to identify a way to approach it with curiosity, the solution fell out fairly naturally.&lt;/p&gt;&lt;h2 id=&quot;bothand&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-two/#bothand&quot;&gt;&lt;span&gt;Both/And&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Can you tell me about a time you faced something that looked like a simple trade-off on the surface—an either/or situation—but it turned out to be a both/and situation? How did you navigate the situation?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/bothand&quot;&gt;https://cutlefish.substack.com/i/142017363/bothand&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Can I? &lt;em&gt;Can I?&lt;/em&gt; Pffh, I can hardly think of a decision that &lt;em&gt;wasn’t&lt;/em&gt; more complex than it appeared at first glance. (I run into these a lot in architecture.)&lt;/p&gt;&lt;p&gt;What I find is that often people aren’t thinking of things as a trade-off space, and they’re approaching it as a binary choice. When you think of the trade-off as a trade-off &lt;em&gt;space&lt;/em&gt;, then it actually becomes a lot easier to understand that you may be looking at the wrong axis and you need to reorient yourself towards a different way of thinking. While that tends to make the problem become more complex, it also makes the problem statement more honest. 
It’s in that new honesty of the problem statement that the choices have to evolve and get correspondingly more complex in order to match the true complexity of the domain.&lt;/p&gt;&lt;p&gt;I had a situation with one project where the aim of the project was to achieve two goals: deprecate an existing component, and expand the scope of the project to address more needs. The trade-off on the surface is one of prioritisation: Do you develop the new functionality first, or do you prove value immediately by replacing the existing component? This one ended up being tricky because it was very easy to pick one or the other and the idea of a “both/and” situation was not at &lt;em&gt;all&lt;/em&gt; obvious. Politically speaking, the option to replace the existing component fit well with how the company liked to show value first and then roll out enhancements after the fact. That’s a very natural progression. On the other hand, we didn’t really know with certainty what the new functionality would look like, or how it might be implemented. The downside of replacing first was that we might get quite far into a solution before realizing that the (more important) new functionality wasn’t even feasible in the way we thought it was.&lt;/p&gt;&lt;p&gt;When I’m faced with a trade-off like this, one of the first things I do is try to understand what the trade-off space &lt;em&gt;actually&lt;/em&gt; is. These choices, after all, aren’t really about the binary choice of “do the new roadmap” vs “do the old roadmap.” This is really about developing an evolving set of capabilities for the business with a set of complex intersecting criteria. When I dug in, picked apart the overlapping criteria, and started mapping out all of the different capabilities, it quickly became obvious that the true choice to be made was not about what part of the functionality to develop next. 
The choice was more fundamentally about what types of foundational capabilities and technologies we needed to develop as a company in order to be ready for what lies ahead in the next 3-5 years.&lt;/p&gt;&lt;p&gt;The answer to that, naturally, ended up being: “a bunch of things that aren’t on the roadmap at all anywhere.”&lt;/p&gt;&lt;p&gt;Nobody wants to hear that answer. Absolutely &lt;em&gt;nobody&lt;/em&gt; wants to hear that answer. However, when you &lt;em&gt;do&lt;/em&gt; run into that answer, getting leaders to believe it, and getting leaders to understand how they need to re-evaluate the situation, becomes really critical. Luckily, when you get past that hurdle, the both/and situation turns into something you can collaboratively problem solve together.&lt;/p&gt;&lt;h2 id=&quot;intervene-safely&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-two/#intervene-safely&quot;&gt;&lt;span&gt;Intervene Safely&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Can you tell me about a time when you and your team tried out a new way of working, interacting, or behaving? How did you decide what to try? How did you figure out whether to double down or change again?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/intervene-safely&quot;&gt;https://cutlefish.substack.com/i/142017363/intervene-safely&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This is an interesting question for me because despite the fact that I have spent a large chunk of my career designing interventions, analyzing them, and understanding how to get teams to work differently… I feel like this question is the one that, so far, I have been the least able to answer well. 
Honestly, I don’t know if I have a good story for this!&lt;/p&gt;&lt;p&gt;One thing that comes to mind is a situation on a team that I was managing where I was noticing a few things. Engineers were silently getting stuck and doing individual problem solving for days at a time; that was slowing them down way too much. Institutional knowledge was also getting locked up inside individuals and it wasn’t spreading across the team effectively. The team also skewed very bimodal, having both very junior and very experienced engineers; it was important to ensure that everyone felt comfortable and safe speaking up when they didn’t know something.&lt;/p&gt;&lt;p&gt;We also had hundreds of tedious fucking tasks to do. Tasks with ETAs of anywhere from twenty minutes to three weeks. Just horrible shit to work with and figure out, honestly.&lt;/p&gt;&lt;p&gt;I had initially attempted addressing this in the very traditional way of splitting tasks out into tickets, writing up all the tickets individually, assigning them to people, and then asking for daily updates. Super standard, and super ineffective. It wasn’t working! Information stayed siloed, people were invisibly delayed, and progress wasn’t being made. It was just horrific. In response, we changed our way of working to a combination of mob programming and pair programming.&lt;/p&gt;&lt;p&gt;Initially, we went with mob programming: I grabbed the entire team together and we spent two days triaging all the tickets together by going into the actual codebases and assessing out loud whether the ticket was going to be easy, medium, or difficult. 
When everyone was sharing their internal thoughts on this, all of a sudden, all the weird tips and tricks that the experts knew started coming out; all of the institutional knowledge of what would be difficult started spilling out and getting documented as well.&lt;/p&gt;&lt;p&gt;After that, it became substantially easier to assign people to work on the easy problems first. First, we went through a few tickets together as a mob. Beginning to end, we did the whole ticket together as a team and basically ran it like a collaborative live Twitch streaming experience. When everybody was in the flow and rhythm of knowing how these tasks could be accomplished, they could then break it down into pair programming, which was &lt;em&gt;initially&lt;/em&gt; to be done only via zoom and on camera. After an initial adjustment period, I stopped requiring cameras on and live pairing sessions for work, and let people do it organically as the task demanded.&lt;/p&gt;&lt;p&gt;The results were magical and throughput shot up like a rocket. I knew that this had worked to change the way of working for our team when, for other projects and problems down the road, people started doing the mobbing and then pairing dynamic naturally. The way people shared problems and learning with each other changed as well. Overall, the ability of the team to collaborate became a very, very tight mesh. This ended up serving us incredibly well later on and helped make this team one of the best performing teams I’ve ever had the pleasure of leading.&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-two/#conclusion&quot;&gt;&lt;span&gt;Conclusion&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In the last article, I ended this with “I don’t even know how to conclude this.” I’m happy to report that one thing about me is consistent: I still don’t know how to conclude this. 
However, this time I’m going to spare you a ~1.500 word conclusion.&lt;/p&gt;&lt;p&gt;See you in a few years!&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>To Be a Leader of Systems</title>
    <link href="https://hazelweakly.me/blog/to-be-a-leader-of-systems/" />
    <updated>2025-11-18T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/to-be-a-leader-of-systems/</id>
    <content type="html">&lt;p&gt;Picture with me, if you will, the absurdity of finding yourself swimming in the middle of the ocean. First think about the ocean and how deep and infinitely vast it is; then about how improbable it is to even fully grasp the notion of how large the ocean is, of how deep it is, of how wide it is; think about how there is so much of it we will never know, and so much we can never know.&lt;/p&gt;&lt;p&gt;Let that sink in, deeply and fully.&lt;/p&gt;&lt;p&gt;Now, as a person who sees systems, who intuits chaos, who can grasp these concepts of swirling infinities, you have to sit with the uncomfortable idea that if you were to find yourself stranded in the middle of the ocean, it is a death sentence of all but certainty. This might be fine if it were just you; after all, life happens, you know? Sometimes things are bizarre, sometimes luck just runs out. However, I’ve often noticed that people who become known for seeing systems often become in charge of them. In other words, you’re probably a leader–either by name, by identity, or by purpose.&lt;/p&gt;&lt;p&gt;But to &lt;em&gt;be&lt;/em&gt; a leader is to understand that you’ll find yourself stranded in the middle of the ocean one day. Not just you, but everyone you lead. And you’ll need to chart a course. In the ever-changing winds, the ever-shifting tides, the unknown weather, and with an inability to see up or down or basically anywhere except a few minutes away. You won’t have the time to find your bearings even if you could. Yet, somehow, in this sea of swirling and infinite complexities and probabilities, in the midst of incalculable odds, you will find yourself needing to have simultaneously several different things:&lt;/p&gt;&lt;p&gt;Firstly, you need to hold in your head the knowledge that you will probably–no, almost certainly–fail. You will need to hold that uncomfortableness in your heart and carry it with you always. Never to share, but always to hold. 
Ground yourself in the impossibilities of what you’ll attempt, lest your hubris lead you to ruin.&lt;/p&gt;&lt;p&gt;Secondly, you will need to hold firm and irrationally deep conviction. Unshakable. Unshatterable. Conviction that you will succeed. This is the conviction that you have to share, that you have to use to lead people along with you, that you have to embody fully and without reservation. Because you have to keep going. You &lt;em&gt;must&lt;/em&gt; keep faith. Up until the very last stroke. Up until the very last moment. Going past the moments beyond when you thought you could go. To succeed often requires–and always demands–of you to embrace the utterly insane notion that you cannot fail.&lt;/p&gt;&lt;p&gt;Thirdly, you will need to prepare and make ready everyone around you. Because to fail as a leader is not when you lose conviction and go down the wrong path. To fail as a leader is when you do not prepare people for when that inevitable failure happens. This will feel in uncomfortable contrast to the necessary conviction that you cannot fail. Unfortunately, leadership is defined by contradictions. You, as a leader, have to hold these unresolvable contradictions together. There is no reconciling them. There is no addressing that cognitive dissonance. Your only path will be to hold that storm of disquiet in yourself, even as it slowly anguishes you.&lt;/p&gt;&lt;p&gt;If there ever was a trick to this, it cannot be described simply. But if I were to try, I would say that the trick is–I think–to learn how to dance in the rain. Plant gardens you will never grow. Prepare futures you will never see. Create victories you will never celebrate. Build memories you will never recall. Tell stories that will never be heard. Hear sorrows that will never be healed.&lt;/p&gt;&lt;p&gt;Lead people with kindness and empathy towards impossibilities; not because you can make them possible, but because you know both that you can and cannot. 
Smile at the chaos and laugh at the swirling vortex of insanity. Pick out the pinpricks of light buried inside that encompass what humans find worth living and weave them together. That tapestry is the cloth, fixed atop your mast of conviction, that you will use to sail your way through the seas of entropy.&lt;/p&gt;&lt;p&gt;Above all: Feel deeply, live fully, lead truly, hold empathy, and inside your heart the intellectual humility required to be able to see the beauty in the chaos. Oh, and learn to dance in the rain.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Scaling Innovation: Building Ecosystems</title>
    <link href="https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/" />
    <updated>2025-10-21T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/</id>
    <content type="html">&lt;p&gt;Innovation is a tricky subject. The precise details of how to do it are not well studied, at least not under the name “innovation”. In addition, multiple disciplines have advanced research that overlaps significantly, but the multidisciplinary integration lags behind by decades, making shared empirical research or identifying more global patterns difficult.&lt;/p&gt;&lt;p&gt;Nevertheless, I consider it dearly important to share and study and discuss the topic of innovation, and so I will provide some research notes here, as well as my overall interpretation of them. Most of this is going to be a compilation of research and the synthesis of it; consequently, I will note where my hypothesis and personal opinions sprinkle in. Otherwise, you may assume that the information here comes from somewhere with strong empirical backing (it will also be cited).&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;One final thing that’s important to know about me is I do not particularly resonate with the phrase “strong opinions weakly held.” I prefer to think of it as “open worldviews, deeply researched.”&lt;/p&gt;&lt;p&gt;Which is to say, my worldview is &lt;em&gt;open&lt;/em&gt;, and my opinions are always open for challenge and debate, but I will &lt;em&gt;insist&lt;/em&gt; on evidence proportional to the strength of the claim; extraordinary claims require extraordinary evidence, particularly if they present counter-factually to decades of empirical research. Naturally, that doesn’t mean that the extraordinary claim can’t be true (quantum physics is a great example), but I am likely to be quite grumpy if an under-researched &lt;em&gt;opinion&lt;/em&gt; is used to counter viewpoints that are backed by densely-cited empirical research.&lt;/p&gt;&lt;p&gt;That said, I hold anecdotal evidence in high regard and consider qualitative evidence to be equal (if not frequently superior) to most “objective” quantitative data. 
In particular, I am dearly fond of the phrase “what works in practice can work in theory,” and hold that close to my heart whenever I research complex and interdisciplinary topics, particularly those involving matters of humanity.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;With that out of the way, I would like to make a brief apology towards the many disciplines of science and research that I am about to enthusiastically smash together. I am not an expert in most of these, merely a well-read and curious practitioner. That said, I’ve done my best to cite where appropriate, and if I’ve missed a citation or if something needs clarification, please let me know.&lt;/p&gt;&lt;h2 id=&quot;tying-disciplines-together&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#tying-disciplines-together&quot;&gt;&lt;span&gt;Tying Disciplines Together&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;My personal hypothesis is that you can tie together the following domains of research (Resilience Engineering, Social-Ecological Systems, and Cumulative Culture from Cognitive Psychology), and map them fairly directly between each other. 
This forms the basis of much of my thinking around understanding, building, and operating Complex Adaptive Systems.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://www.resilience-engineering-association.org/resources/where-do-i-start/&quot;&gt;Resilience Engineering&lt;/a&gt;: My definition of resilience engineering is “the science of understanding, improving, operating, and building complex adaptive systems.” The outcomes and artifacts of this science are a collection of frameworks, theories, and models which help us do that.&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Socio-ecological_system&quot;&gt;Socio-Ecological Systems (SES)&lt;/a&gt;: This is an interdisciplinary lens of inquiry, originating from public economics, systems ecology, and complexity theory. The study revolves around SES, which can be defined as “a set of critical resources (natural, social, economic, and cultural) whose flow and use is regulated by a combination of ecological and social systems.” The full definition is more complex and contains additional factors, but this aspect gets us closer to the heart of what is “unique” about this lens of inquiry. This can also alternatively be viewed as a lens of inquiry within Resilience Engineering.&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://royalsocietypublishing.org/doi/10.1098/rstb.2015.0192&quot;&gt;Cumulative Culture&lt;/a&gt;: This is a lens of inquiry, largely within cognitive psychology, although it is also interdisciplinary to an extent. Cumulative culture is the study of how humans accumulate progress over time, such that the output is greater than any individual could produce on their own. 
It lays the foundation of understanding how groups of humans continually “ratchet” up their competence over time in order to accomplish increasingly sophisticated outcomes.&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://johncarlosbaez.wordpress.com/2021/06/25/complex-adaptive-system-design-part-10/&quot;&gt;Operads&lt;/a&gt;: I’m including these as a separate top-level item because of how fundamental they seem to be. Operads are a mathematical object that abstracts “operations that compose many objects into a single object.” Operads and their algebras represent a rigorous mathematical framework for modeling a Complex Adaptive System as an overlapping set of networks, each with a valid set of operations, and a way to build and develop compositions of those operations such that the result is well formed. To put that in plain English: Operads provide a mathematical formalism to build “correctly functioning” complex adaptive systems out of independent and adaptive yet understandable components. As a note to my future self, I need to revisit this and see if Operads are still the nicest mathematical framework, or if the work that has been emerging with &lt;a href=&quot;https://arxiv.org/pdf/2505.18329&quot;&gt;double categories&lt;/a&gt; is easier to wrap one’s head around for the applications I’m imagining. I need to take extra care to ensure that I actually &lt;em&gt;need&lt;/em&gt; all the machinery of the categorical construction, because sometimes a cute lil abstract object is entirely sufficient, and they’re much cuddlier to boot.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;In addition, I have grown fond of a few particular tools for tying things together and going from theory to action. 
Wardley maps are one of them, and I will reference the more relevant material, but I won’t provide a full introduction here.&lt;/p&gt;&lt;p&gt;Summarizing the above, I personally hold the viewpoint that any sufficiently advanced system exhibits the properties of an ecosystem, a culture, and a complex adaptive system. Additionally, I also think all three of these are the same thing.&lt;/p&gt;&lt;p&gt;Rather than repeatedly referencing ecosystems, cultures, complex adaptive systems, etc., as separate concepts, I will reference them by name when the particulars matter (i.e., when citing particular fields of research), &lt;strong&gt;and otherwise will refer to the collective concept as a CAST (Complex Adaptive System Thingy).&lt;/strong&gt; Yes, it’s thingy; one can’t take themselves too seriously if they wish to study and understand wibbly wobbly system-shaped stuffs after all.&lt;/p&gt;&lt;p&gt;My hypothesis is that given the above, we then see:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Cumulative culture as a way to understand under what environmental conditions humans achieve optimal collaboration.&lt;/li&gt;&lt;li&gt;How to optimise and facilitate human innovation, and how to understand when one has started to compromise the human element.&lt;/li&gt;&lt;li&gt;Socio-ecological systems then provide a way to understand what properties of the system result in effective environments, as well as ways to understand and optimise the efficiency of the environment by being able to visualise it as an economy.&lt;/li&gt;&lt;li&gt;Resilience engineering comes in to provide the most generally formed “laws of the universe” for CASTs, ways to strategically visualise and orient towards desired system performance, and a robust toolkit for helping humans handle the non-intuitive nature of CASTs.&lt;/li&gt;&lt;li&gt;Lastly, Operads provide a mathematical framework under which one can validate that automation built to compose complex components does not result in a globally incoherent 
system.&lt;/li&gt;&lt;/ul&gt;&lt;h2 id=&quot;understanding-ecosystems&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#understanding-ecosystems&quot;&gt;&lt;span&gt;Understanding Ecosystems&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I posit that any sufficiently large CAST contains within it at least one commons, if not several. Consequently, while it may be natural to look at the research around SES and Commons as something that does not apply to enterprises, I would argue that it certainly does. Some examples of commons within any large enterprise, in my opinion, are:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;The Network&lt;/li&gt;&lt;li&gt;Technical Strategy&lt;/li&gt;&lt;li&gt;Infrastructure&lt;/li&gt;&lt;li&gt;Fiscal Budget&lt;/li&gt;&lt;li&gt;Platforms and Software Frameworks&lt;/li&gt;&lt;li&gt;Inner-Source Software&lt;/li&gt;&lt;li&gt;Road-maps&lt;/li&gt;&lt;li&gt;…&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;In particular, the research around SES was originally started to specifically address common-pool resources and public goods, but services and the general problem of “balancing resource use and the system maintenance” all see quite direct applicability from this line of inquiry.&lt;/p&gt;&lt;h3 id=&quot;socio-ecological-system-framework&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#socio-ecological-system-framework&quot;&gt;&lt;span&gt;Socio-Ecological System Framework&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The most approachable framework here is the SES framework.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;ul class=&quot;italic&quot;&gt;&lt;li&gt;Solid boxes denote first-tier categories: resource systems, resource units, governance systems, and actors.&lt;/li&gt;&lt;li&gt;Each first-tier category contains multiple variables at the second tier as well as lower 
tiers.&lt;/li&gt;&lt;li&gt;Action situations are where all the action takes place as inputs are transformed by the actions of multiple actors into outcomes.&lt;/li&gt;&lt;li&gt;Dashed arrows denote feedback from action situations to each of the top tier categories. The dotted-and-dashed line that surrounds figure indicates that while the system can be considered a logical whole, it will always be influenced by external factors.&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;/p&gt;&lt;figure class=&quot;flow bordered-box has-caption&quot;&gt;&lt;picture&gt;&lt;source srcset=&quot;https://hazelweakly.me/images/Yv-yyWruab-300.avif 300w, https://hazelweakly.me/images/Yv-yyWruab-600.avif 600w, https://hazelweakly.me/images/Yv-yyWruab-1000.avif 1000w&quot; sizes=&quot;100vw&quot; type=&quot;image/avif&quot;&gt;&lt;source srcset=&quot;https://hazelweakly.me/images/Yv-yyWruab-300.webp 300w, https://hazelweakly.me/images/Yv-yyWruab-600.webp 600w, https://hazelweakly.me/images/Yv-yyWruab-1000.webp 1000w&quot; sizes=&quot;100vw&quot; type=&quot;image/webp&quot;&gt;&lt;img alt=&quot;Revised social-ecological system (SES) framework with multiple first-tier components. 
The quoted bullet list above describes the image.&quot; srcset=&quot;https://hazelweakly.me/images/Yv-yyWruab-300.jpeg 300w, https://hazelweakly.me/images/Yv-yyWruab-600.jpeg 600w, https://hazelweakly.me/images/Yv-yyWruab-1000.jpeg 1000w&quot; title=&quot;Revised social-ecological system (SES) framework with multiple first-tier components.&quot; class=&quot;w-full&quot; decoding=&quot;async&quot; height=&quot;646&quot; sizes=&quot;100vw&quot; src=&quot;https://hazelweakly.me/images/Yv-yyWruab-300.jpeg&quot; width=&quot;1000&quot;&gt;&lt;/picture&gt;&lt;figcaption&gt;Revised social-ecological system (SES) framework with multiple first-tier components.&lt;/figcaption&gt;&lt;/figure&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;em&gt;(&lt;a href=&quot;https://www.jstor.org/stable/26269580&quot;&gt;McGinnis, Michael D., and Elinor Ostrom. “Social-Ecological System Framework: Initial Changes and Continuing Challenges.” Ecology and Society 19, no. 2 (2014).&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;&lt;h3 id=&quot;design-principles-for-managing-common-resources&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#design-principles-for-managing-common-resources&quot;&gt;&lt;span&gt;Design Principles for Managing Common Resources&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;These have been empirically validated across decades of research in multiple contexts, multiple cultures, and multiple environments, and so on. 
They are about as close to a law of the universe as we can get, essentially.&lt;/p&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Principle&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;1A&lt;/em&gt;&lt;/td&gt;&lt;td&gt;User boundaries&lt;/td&gt;&lt;td&gt;Clear boundaries between legitimate users and nonusers must be clearly defined.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;1B&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Resource boundaries&lt;/td&gt;&lt;td&gt;Clear boundaries are present that define a resource system and separate it from the larger biophysical environment.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;2A&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Congruence with local conditions&lt;/td&gt;&lt;td&gt;Appropriation and provision rules are congruent with local social and environmental conditions.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;2B&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Appropriation and provision&lt;/td&gt;&lt;td&gt;The benefits obtained by users from a common-pool resource (CPR), as determined by appropriation rules, are proportional to the amount of inputs required in the form of labour, material, or money, as determined by provision rules.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;3&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Collective-choice arrangements&lt;/td&gt;&lt;td&gt;Most individuals affected by the operational rules can participate in modifying the operational rules.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;4A&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Monitoring users&lt;/td&gt;&lt;td&gt;Monitors who are accountable to the users monitor the appropriation and provision levels of the users.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;4B&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Monitoring the resource&lt;/td&gt;&lt;td&gt;Monitors who are accountable to the users monitor the condition of the 
resource.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;5&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Graduated sanctions&lt;/td&gt;&lt;td&gt;Appropriators who violate operational rules are likely to be assessed graduated sanctions (depending on the seriousness and the context of the offense) by other appropriators, by officials accountable to the appropriators, or by both.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;6&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Conflict-resolution mechanisms&lt;/td&gt;&lt;td&gt;Appropriators and their officials have rapid access to low-cost local arenas to resolve conflicts among appropriators or between appropriators and officials.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;7&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Minimal recognition of rights to organise&lt;/td&gt;&lt;td&gt;The rights of appropriators to devise their own institutions are not challenged by external governmental authorities.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;em&gt;8&lt;/em&gt;&lt;/td&gt;&lt;td&gt;Nested enterprises&lt;/td&gt;&lt;td&gt;Appropriation, provision, monitoring, enforcement, conflict resolution, and governance activities are organised in multiple layers of nested enterprises.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;&lt;em&gt;(&lt;a href=&quot;http://www.jstor.org/stable/26268233&quot;&gt;Cox, Michael, Gwen Arnold, and Sergio Villamayor Tomás. “A Review of Design Principles for Community-Based Natural Resource Management.” Ecology and Society 15, no. 
4 (2010).&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;&lt;h3 id=&quot;applying-the-design-principles&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#applying-the-design-principles&quot;&gt;&lt;span&gt;Applying the Design Principles&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;As such, whenever I think about “how do I design something that can grow to be adopted by a company as a way to collectively manage some resources,” I reach for this as a way to check and guide my work. If a solution doesn’t satisfy these criteria, it will eventually fail in some manner at scale, or it will only succeed temporarily.&lt;/p&gt;&lt;p&gt;Consequently, if I’m designing a system and want to ensure that it is temporary and does &lt;em&gt;not&lt;/em&gt; scale (as a way to shorten the time to production), then I will intentionally choose as many of these to violate as I can. To clarify, I don’t do that in order to maliciously ruin the system, but rather because, in my experience, shortcuts in system design do not actually end up saving time unless they violate many of these principles.&lt;/p&gt;&lt;p&gt;The worst outcomes that I tend to see are when multiple shortcuts are introduced that do &lt;em&gt;not&lt;/em&gt; violate these principles, yet only half of the principles are fulfilled. The system then is capable of evolution just enough to limp on far beyond any intended lifespan, yet the friction in the system is so great that it becomes a source of continual pain for everyone participating in it.&lt;/p&gt;&lt;p&gt;Thus, I strive to be bimodal in my application of these principles. By intentionally failing as many of them as I can, the knowledge I gain as these things successfully fail to scale means that a new scalable solution is far clearer to build. 
Likewise, by fulfilling as many of these as I can, I can ensure that the resulting system is flexible enough to work in practice, long term, and so getting the design right from day zero becomes much less important.&lt;/p&gt;&lt;h3 id=&quot;more-of-ostrom&#39;s-work&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#more-of-ostrom&#39;s-work&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;More of Ostrom’s Work&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;This will be expanded later as I have time. I’m just going to dump a bunch of info here for now in order to list out the relevant concepts.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://web.pdx.edu/~nwallace/EHP/OstromPolyGov.pdf&quot;&gt;Elinor Ostrom. “Beyond Markets and States: Polycentric Governance of Complex Economic Systems”&lt;/a&gt;&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“A core effort is developing a more general theory of individual choice that recognises the &lt;strong&gt;central role of trust in coping with social dilemmas.&lt;/strong&gt;”&lt;/p&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;Trust and reciprocity are core ideas of success in systems. Trust is the &lt;em&gt;most essential element&lt;/em&gt; to overcoming social dilemmas. This is specifically &lt;strong&gt;trusting that others are reciprocators.&lt;/strong&gt; In other words: collective benefit is directly derived from how much cooperation actors exhibit. 
Cooperation is directly derived from how much trust actors have that other participants are reciprocators.&lt;/li&gt;&lt;li&gt;“One-size-fits-all” policies are not effective.&lt;/li&gt;&lt;li&gt;There are four types of goods, and they can be categorised by two factors: difficulty of excluding potential beneficiaries and subtractability of use.&lt;/li&gt;&lt;li&gt;Providing opportunities for “cheap talk” (or any form of communication) reduces over-consumption of common-pool resources.&lt;/li&gt;&lt;li&gt;Polycentric governance models, the governance of a CAST: many centres of decision making that are formally independent of each other, either by being truly independent or by being interdependent. &lt;ul&gt;&lt;li&gt;Importantly, this shows empirically that complex systems are not chaotic systems. Even if a sufficiently complex CAST approaches the appearance of chaos, it can still be governed with better-than-chaotic approaches despite the potential loss of globally coherent collective knowledge.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Crucially, polycentric governance of resource and infrastructure systems is more efficient, more effective, and more understandable.&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;table class=&quot;2x2 bordered-box&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Common Pool Resources&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;High Difficulty of Excluding Potential Beneficiaries.&lt;/em&gt; &lt;br&gt; &lt;em&gt;High Subtractability of Use.&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;&lt;td class=&quot;flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Public Goods&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;High Difficulty of Excluding Potential Beneficiaries.&lt;/em&gt; &lt;br&gt; &lt;em&gt;Low Subtractability of Use.&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Private Goods&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;Low Difficulty of Excluding 
Potential Beneficiaries.&lt;/em&gt; &lt;br&gt; &lt;em&gt;High Subtractability of Use.&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;&lt;td class=&quot;flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Toll Goods&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;Low Difficulty of Excluding Potential Beneficiaries.&lt;/em&gt; &lt;br&gt; &lt;em&gt;Low Subtractability of Use.&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;h2 id=&quot;ecosystems-as-a-future-sensing-engine&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#ecosystems-as-a-future-sensing-engine&quot;&gt;&lt;span&gt;Ecosystems as a Future Sensing Engine&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;One can identify ecosystems as a competitive strategy and can either play the game as an &lt;em&gt;adversarial&lt;/em&gt; competition or as a &lt;em&gt;collaborative&lt;/em&gt; competition. Adversarial in this sense means the us-versus-them sense (like the capitalistic market), not the “let’s maliciously step outside of the rules and ruin them” sense. The notion of collaborative competition, here, is my idea. The idea is that, given a large enough internal ecosystem, we benefit from artificially creating enough diversity in order to stimulate innovation faster than may naturally occur otherwise.&lt;/p&gt;&lt;p&gt;I find it fruitful to approach designing ecosystems for collaborative competition by thinking of them through an Innovate Leverage Commoditise (ILC) model. &lt;a href=&quot;https://blog.gardeviance.org/2014/03/understanding-ecosystems-part-i-of-ii.html&quot;&gt;This blog post by Simon Wardley&lt;/a&gt; contains a fantastic overview and example of this and how it can look in the real world. Quoting Simon here, we can see why this is directly applicable to most large enterprises.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;The purpose of a platform (and hence an API) is to create an ecosystem. 
The &lt;strong&gt;value is in the ecosystem&lt;/strong&gt;. The &lt;strong&gt;ecosystem is a future sensing engine&lt;/strong&gt;. Correctly used (under an ILC model) you can create network effects whereby …&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Your apparent rate of innovation&lt;/li&gt;&lt;li&gt;Your customer focus&lt;/li&gt;&lt;li&gt;Your efficiency&lt;/li&gt;&lt;li&gt;Your stability of revenue&lt;/li&gt;&lt;li&gt;Your ability to maximise opportunity&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;… All increase, SIMULTANEOUSLY, with the size of your ecosystem and NOT the physical size of your company.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;(emphasis mine)&lt;/p&gt;&lt;p&gt;This works by tracing through the following chain of thought.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Competition enables new higher order systems. This is backed by cumulative culture, as well as decades of anecdotal success by Wardley and other change agents operating in companies and governments of various sizes.&lt;/li&gt;&lt;li&gt;Evolution operates in multiple overlapping cycles at the macro and micro level. This is borne out by every field studying CASTS.&lt;/li&gt;&lt;li&gt;Studying patterns of utility consumption provides sufficient data to strategically sense the future. You can see this particularly well with AWS, which has applied this model almost since its inception.&lt;/li&gt;&lt;/ol&gt;&lt;h3 id=&quot;the-opportunity&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#the-opportunity&quot;&gt;&lt;span&gt;The Opportunity&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Looking at the diagram below, the numbers labeled 1 2 3 (in red) correspond to the below chain of thought. 
This chain of thought pattern matches on the above one intentionally to drive the point home a bit more thoroughly.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;If you commoditise a component into an industrialised form, that enables others to innovate on top&lt;/li&gt;&lt;li&gt;Then you can leverage consumption behaviour in order to detect successful innovations.&lt;/li&gt;&lt;li&gt;Finally, you can commoditise any identified success to become a fast follower without incurring prohibitive R&amp;D risk. This repeats indefinitely.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;/p&gt;&lt;figure class=&quot;flow bordered-box has-caption&quot;&gt;&lt;picture&gt;&lt;source srcset=&quot;https://hazelweakly.me/images/1g99zauvbx-300.avif 300w, https://hazelweakly.me/images/1g99zauvbx-600.avif 600w, https://hazelweakly.me/images/1g99zauvbx-986.avif 986w&quot; sizes=&quot;100vw&quot; type=&quot;image/avif&quot;&gt;&lt;source srcset=&quot;https://hazelweakly.me/images/1g99zauvbx-300.webp 300w, https://hazelweakly.me/images/1g99zauvbx-600.webp 600w, https://hazelweakly.me/images/1g99zauvbx-986.webp 986w&quot; sizes=&quot;100vw&quot; type=&quot;image/webp&quot;&gt;&lt;img alt=&quot;An ILC diagram showing the zig zag upwards motion of the commodification and industrialization of individual innovations&quot; srcset=&quot;https://hazelweakly.me/images/1g99zauvbx-300.jpeg 300w, https://hazelweakly.me/images/1g99zauvbx-600.jpeg 600w, https://hazelweakly.me/images/1g99zauvbx-986.jpeg 986w&quot; title=&quot;The zig-zag evolution of innovation via cooperative competition&quot; class=&quot;w-full&quot; decoding=&quot;async&quot; height=&quot;762&quot; sizes=&quot;100vw&quot; src=&quot;https://hazelweakly.me/images/1g99zauvbx-300.jpeg&quot; width=&quot;986&quot;&gt;&lt;/picture&gt;&lt;figcaption&gt;The zig-zag evolution of innovation via cooperative competition&lt;/figcaption&gt;&lt;/figure&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;em&gt;(&lt;a 
href=&quot;https://blog.gardeviance.org/2014/03/understanding-ecosystems-part-i-of-ii.html&quot;&gt;Wardley, S. “Understanding Ecosystems” (2014).&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;&lt;p&gt;The reason this works so well is that there are two different types of risk here. Firstly, there is the R&amp;D risk that you externalise onto the people operating in the ecosystem. Secondly, there is the risk of scale. Taking a solution that has proven to work at one scale, and then actually scaling up, creates a lot of challenges and risk in budget and operational skill. You, as the caretaker of an ecosystem, can actually take on that type of risk because you are better equipped to handle the economies of scale. Your consumers, on the other hand, are better equipped to handle the economies of diversity. Building a game that plays to the strengths of both allows you to actually work collaboratively together, even if you’re not on the same team, and even though it appears to operate like a competition.&lt;/p&gt;&lt;p&gt;In an adversarial competition, one must balance how rapidly and pervasively they consume and commoditise successes in order to make it financially worthwhile for actors to build on top of the existing commodities in your ecosystem. In a cooperative competition, one must balance how rapidly and proficiently they consume and commoditise successes in order to ensure that they can curate the growing ecosystem effectively, and also to ensure that innovators have an incentive to work with you to make it easier for the hand-off to occur.&lt;/p&gt;&lt;p&gt;I like to think of this as a garden (and indeed, Wardley uses that metaphor occasionally as well). If one harvests the garden too aggressively, the garden dies from resource starvation. If one harvests it too little, the garden dies from becoming overgrown and unruly beyond manageable capacity. 
In many enterprises, we can think of our platforms as ecosystems that are composed of inner overlapping sub-ecosystems, each with their own set of components and various levels of evolution (but often at the product-to-utility maturity of evolution). It follows, then, that if a company’s platform adopts components and commoditises them too slowly, teams will be reluctant to build and innovate because the cost of carrying the burden of R&amp;D will be too high. Likewise, if that company’s platforms commoditise too rapidly, teams will not innovate because the platforms will lack the stability required to be built on top of, and the teams will experience prohibitive amounts of churn fatigue.&lt;/p&gt;&lt;p&gt;To expand on that, one can imagine enabling product teams at the “uncharted” side to innovate, rather than requiring innovation to happen solely through platform teams. This would be de-risked on the product side by allowing them to solve their immediate problems faster than otherwise could happen. Likewise, it would be de-risked on the platform side by incentivizing collaboration between the platform team and product team. In that case, the platform team would aim to act in more of an advisory or consultation role. 
Ultimately, this would have the goal of ensuring that successful innovations can seamlessly be adopted or graduated into the platforms.&lt;/p&gt;&lt;p&gt;Essentially, what we form here is a positive feedback loop in the form of a &lt;strong&gt;bidirectional trust building engine&lt;/strong&gt;, where&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Actors become increasingly trusting that &lt;strong&gt;platforms reciprocate investment and innovation by rewarding actors with a better platform and reduced operational burden&lt;/strong&gt;, with which they can continually ratchet up their capabilities and build increasingly valuable products.&lt;/li&gt;&lt;li&gt;Simultaneously, platforms become increasingly trusting that &lt;strong&gt;actors reciprocate investment and commoditization by building on top of platforms and working to reduce burden of adoption by delivering high-quality innovations&lt;/strong&gt; that can be efficiently and effectively commoditised.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;In my opinion, this is &lt;em&gt;similar&lt;/em&gt;, but not entirely equivalent, to concepts like the &lt;a href=&quot;https://charity.wtf/2018/12/02/software-sprawl-the-golden-path-and-scaling-teams-with-agency/&quot;&gt;golden path&lt;/a&gt;, &lt;a href=&quot;https://www.oreilly.com/videos/oscon-2017/9781491976227/9781491976227-video306724/&quot;&gt;the paved road&lt;/a&gt;, and the &lt;a href=&quot;https://medium.com/booking-com-infrastructure/how-reliability-and-product-teams-collaborate-at-booking-com-f6c317cc0aeb&quot;&gt;reliability collaboration model&lt;/a&gt;. I also think that this serves as a very handy approach, combined with Ostrom’s design principles, towards solving the problem of continually growing standardization without stifling innovation or preventing necessary specialization. 
By bringing both into a consistent framework, we turn an otherwise tension-filled trade-off of “I want to optimise for my product use case” versus “I want to optimise for ease of uniform operation” into a mutual trade-off space of “how can I innovate in a way that helps you help me best” and “how can I commoditise in a way that helps you help me best.”&lt;/p&gt;&lt;h2 id=&quot;understanding-cumulative-culture&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#understanding-cumulative-culture&quot;&gt;&lt;span&gt;Understanding Cumulative Culture&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;What follows is a very rough dump of my internal notes which will be cleaned up later. I do apologise for the mess, but I apologise only mildly.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://pmc.ncbi.nlm.nih.gov/articles/PMC5053256/&quot;&gt;Eureka! What is innovation, how does it develop, and who does it?&lt;/a&gt; &lt;ul&gt;&lt;li&gt;Idea. Innovation is independent invention, asocial learning, and/or modification of social learning, but it must be novel.&lt;/li&gt;&lt;li&gt;Innovation should be useful and/or transmitted.&lt;/li&gt;&lt;li&gt;(insert image from paper)&lt;/li&gt;&lt;li&gt;Innovations don’t need to be intentional, but they do need to lead to learning.&lt;/li&gt;&lt;li&gt;Innovation classification. Low (unlearned chance via an individual, not repeated). Medium (individually learned, repeated by individual). High (individually learned, acquired by others).&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://onlinelibrary.wiley.com/doi/10.1111/mila.12335&quot;&gt;Cumulative culture and complex cultural traditions.&lt;/a&gt; &lt;ul&gt;&lt;li&gt;Idea. Four distinct trends associated with cumulative culture: adaptiveness, complexity, efficiency, and disparity.&lt;/li&gt;&lt;li&gt;Adaptiveness. 
As variants are transmitted over time, they can accumulate modifications, and some of these bring about a concomitant increase in the biological fitness of individuals who bear or express these variants.&lt;/li&gt;&lt;li&gt;Complexity. There are three competing accounts that overlap and conflict. Unit Counting, Skillfulness, and Interactive Complexity &lt;ul&gt;&lt;li&gt;Unit counting is the increase in actions and tools involved to produce a behaviour.&lt;/li&gt;&lt;li&gt;Skillfulness is how many people are required to transmit cultural knowledge and how difficult expertise is.&lt;/li&gt;&lt;li&gt;Interactive complexity is the number of components in a tradition and degree of interaction between them.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Efficiency. Lowering costs associated with acquiring or performing behaviours.&lt;/li&gt;&lt;li&gt;Disparity. Accumulating increasing numbers of qualitatively distinct cultural traditions. Disparity also likely has to come before complexity.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://link.springer.com/article/10.1007/s13752-020-00351-w&quot;&gt;Where does cumulative culture begin? A plea for a sociologically informed perspective.&lt;/a&gt; &lt;ul&gt;&lt;li&gt;Idea: cumulative culture as a form of behaviour that could not have been invented by an individual alone.&lt;/li&gt;&lt;li&gt;Care giving and culture giving go hand in hand.&lt;/li&gt;&lt;li&gt;Social Learning: Emulation, Imitation, Modification, Teaching, high fidelity transmission &lt;ul&gt;&lt;li&gt;Transfer mechanisms that are simpler than social learning are: peering, participation, co-performance, or engagement with a material environment altered by group members.&lt;/li&gt;&lt;li&gt;Migratory patterns of sheep and moose evolve via social learning, same for homing pigeons.&lt;/li&gt;&lt;li&gt;Post-natal environment is the continuation of the universe as a stimulation learning environment. 
the community is a social uterus&lt;/li&gt;&lt;li&gt;Group-specific practice patterns are sub-action level.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Habitus: social environment, a supply of social latent solutions plus social learning. &lt;ul&gt;&lt;li&gt;A cumulative matrix of perceptions, appreciations, and actions&lt;/li&gt;&lt;li&gt;embodied cultural and social capital&lt;/li&gt;&lt;li&gt;located on the sub-action level of dispositions and competency; not dissimilar from “attitude,” conceptually&lt;/li&gt;&lt;li&gt;evolves via osmosis-like action&lt;/li&gt;&lt;li&gt;Behaviors of humans: evolutionary, historical, ontogenetic&lt;/li&gt;&lt;li&gt;Cultural niche construction.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Metaphor: ratchet&lt;/li&gt;&lt;li&gt;alternative metaphor: mountaineering effect (development is path dependent)&lt;/li&gt;&lt;li&gt;idea: habitus. Embodied cultural and social capital &lt;ul&gt;&lt;li&gt;A system of durable and transposable dispositions which, integrating all past experiences, functions at every moment as a matrix of perceptions, appreciations, and actions, and makes possible the achievement of infinitely diversified tasks, thanks to analogical transfers of schemes permitting the solution of similarly shaped problems.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://royalsocietypublishing.org/doi/10.1098/rstb.2015.0192&quot;&gt;Innovation in the Collective Brain&lt;/a&gt; &lt;ul&gt;&lt;li&gt;idea: our societies are a collective brain&lt;/li&gt;&lt;li&gt;Within these collective brains, the three main sources of innovation are serendipity, recombination, and incremental improvement.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;cumulative-culture-and-developer-experience&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; 
href=&quot;https://hazelweakly.me/blog/scaling-innovation-building-ecosystems/#cumulative-culture-and-developer-experience&quot;&gt;&lt;span&gt;Cumulative Culture and Developer Experience&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Additionally, cumulative culture &lt;a href=&quot;https://www.drcathicks.com/post/a-cumulative-culture-theory-of-developer-problem-solving-new-preprint&quot;&gt;has a link to developer experience and developer productivity&lt;/a&gt; through the work of a good friend of mine, &lt;a href=&quot;https://www.drcathicks.com/&quot;&gt;Dr. Cat Hicks.&lt;/a&gt;&lt;/p&gt;&lt;p&gt;The abstract is below (emphasis mine)&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Understanding how developers problem-solve within ecosystems of practice, tooling, and social contexts is a critical step in determining which factors dampen, aid or accelerate software innovation. However, industry conceptions of developer problem-solving often focus on overly simplistic measures of output, over-extrapolate from small case studies, rely on conventional definitions of “programming” and short-term definitions of performance, fail to integrate the new economic features of the open collaborative innovation that marks software progress, and fail to integrate rich bodies of evidence about problem-solving from the social sciences. We propose an alternative to individualistic explanations for software developer problem-solving: a Cumulative Culture theory for developer problem-solving. This paper aims to provide an interdisciplinary introduction to underappreciated elements of developers’ communal, social cognition which are required for software development creativity and problem-solving, either empowering or constraining the solutions that developers access and implement. 
&lt;strong&gt;We propose that despite a conventional emphasis on individualistic explanations, developers’ problem-solving&lt;/strong&gt; (and hence, many of the central innovation cycles in software) &lt;strong&gt;is better described as a cumulative culture where collective social learning&lt;/strong&gt; (rather than solitary and isolated genius) &lt;strong&gt;plays a key role in the transmission of solutions, the scaffolding of individual productivity, and the overall velocity of innovation.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;– Hicks, C. M., &amp; Hevesi, A. (2024, November 21). A Cumulative Culture Theory for Developer Problem-Solving. &lt;a href=&quot;https://doi.org/10.31234/osf.io/tfjyw&quot;&gt;https://doi.org/10.31234/osf.io/tfjyw&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;What I take from this is that developer productivity and the overall velocity of innovation are best thought of as a product of collective social learning. Fortunately, collective social learning is well studied in humans as a cumulative culture, and we would do well to learn from that existing and rich body of work. Bringing everything together, I take this to mean that, in my opinion, focusing on a collective environment directly implies that the facilitation and creation of that environment becomes a top priority for a high performing enterprise, and we can understand how to operationalise this facilitation by framing it in the lens of socio-economic systems.&lt;/p&gt;&lt;p&gt;To put it concisely: The journey to high performance in a community is equivalent to pursuing the art of learning. For an enterprise, that means the enterprise should build towards having and maintaining a culture that seeks to understand itself better.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;p&gt;This blog post is, as of October 2025… Roughly a quarter of the way done. 
I still need to flesh out all of the skeletons and add all of the Resilience Engineering and Mathematics in here. Rest assured, I will probably get to that. I have every intention! But I don’t control the future.&lt;/p&gt;&lt;p&gt;Anyways, if you’re reading this and have suddenly realised that the blog post is old enough to go to school and I have yet to remove this notice, please find me somewhere on the internet and poke me about it.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Stop Building AI Tools Backwards</title>
    <link href="https://hazelweakly.me/blog/stop-building-ai-tools-backwards/" />
    <updated>2025-05-16T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/stop-building-ai-tools-backwards/</id>
    <content type="html">&lt;p&gt;I’ve been reading this week about how humans learn, and effective ways of transferring knowledge. In addition, I’ve also had AI in the back of my mind, and recently I’ve come to the realization that not only is our industry building AI tools poorly, we’re building them backwards. Which, honestly, is really depressing to me because there is so much unrealised potential that we have available–is it not enough that we built the LLMs unethically, and that they waste far more energy than they return in value? On top of that, it doesn’t take that much extra effort to build the tooling in a way that facilitates how humans work together; the tooling could be built to improve our capabilities by making everybody more effective, rather than by deskilling critical reasoning loops for practitioners. Here’s how that might look.&lt;/p&gt;&lt;h2 id=&quot;the-human-part&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#the-human-part&quot;&gt;&lt;span&gt;The Human Part&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;First: How we learn. My favorite (evidence backed) theory on how humans learn is &lt;a href=&quot;https://www.learningscientists.org/blog/2024/3/7/how-does-retrieval-improve-new-learning&quot;&gt;Retrieval Practice&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;The short of it is that humans don’t really learn when we download info into our brain, we learn when we &lt;em&gt;expend effort&lt;/em&gt; to pull that info out. This has some big implications for designing collaborative tooling!&lt;/p&gt;&lt;p&gt;Second: What we learn. It turns out, the “thing” that we learn most effectively is not &lt;em&gt;knowledge&lt;/em&gt; as we typically think of it, it’s &lt;em&gt;process&lt;/em&gt;. This should be intuitive, if we put it into a bit of a more natural context. 
Imagine learning baking for a moment: Do you teach someone to bake a cake by spitting out a fact sheet of ingredients and having them memorise it? Or do you teach them the process?&lt;/p&gt;&lt;p&gt;Third: How we level up. Humans are really bad at “novel” innovation, which is a bit tragic because novel innovation seems to be the thing that the tech industry thinks of when it talks about developer productivity. We surround ourselves with the myth of the solo genius, we benchmark developers on individual contributions, and expect people to implement code by themselves. Yet, it turns out that sustained solo innovation is both extremely rare, and also not that important in the grand scheme of things. It’s much more like the sprinkles on top of a cupcake, rather than the main course; simply put, it’s not how innovation &lt;em&gt;generally&lt;/em&gt; happens.&lt;/p&gt;&lt;p&gt;However! We’re really good at cumulative iteration. Humans are turbo optimised for communities, basically. This is why brainstorming is so effective… But usually only in a group. There is an entire theory in cognitive psychology about cumulative culture that goes directly into this and shows empirically how humans work in groups. Humans learn collectively and innovate collectively via copying, mimicry, and iteration on top of prior art. You know that quote about standing on the shoulders of giants? It turns out that it’s not only a fun quote, but it’s fundamentally how humans work.&lt;/p&gt;&lt;p&gt;Also, innovation and problem solving? Basically the same thing. 
If you get good at problem solving, propagating learning, and integrating that learning into the collective knowledge of the group, then the infamous Innovator’s Dilemma disappears.&lt;/p&gt;&lt;p&gt;So, combine all of those bits of information together, what do we get?&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Humans learn and teach via process&lt;/li&gt;&lt;li&gt;Processes need to take a goldilocks amount of effort to be effective&lt;/li&gt;&lt;li&gt;Cumulative iteration &gt; solo developer problem solving&lt;/li&gt;&lt;li&gt;We build tools to &lt;em&gt;help&lt;/em&gt; us think, not to think &lt;em&gt;for&lt;/em&gt; us&lt;/li&gt;&lt;/ol&gt;&lt;h2 id=&quot;the-ai-part&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#the-ai-part&quot;&gt;&lt;span&gt;The AI Part&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Now, here’s the main pattern I see AI tooling doing:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Click AI button -&gt; ✨ magic ✨&lt;/li&gt;&lt;li&gt;View data -&gt; AI Suggestions&lt;/li&gt;&lt;li&gt;Action prompt -&gt; AI initiation&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;What’s missing? Human retrieval and task initiation, process reinforcement, collective knowledge transfer, and iterative improvements… Y’know, the whole set of criteria that humans &lt;em&gt;need&lt;/em&gt; in order to be effective? This is wild, we’re taking the &lt;em&gt;one&lt;/em&gt; thing humans are good at and making AI do it. But AI is bad at it! Even worse: if humans get bad at it then we’ve lost the one thing we had going for us as a species!&lt;/p&gt;&lt;p&gt;Which means we end up deskilling humans faster than we improve AI, and the humans can’t improve the AI because we’re no longer feeding AI the high quality data it can use to augment human excellence. It’s a self-reinforcing feedback loop… Spiraling rapidly downwards into ineffective systems. 
I’m already seeing negative consequences of this–constantly–and it’s heartbreaking.&lt;/p&gt;&lt;p&gt;However, a small change to how AI interactions are built can help reverse this.&lt;/p&gt;&lt;h2 id=&quot;building-better-ai-tools&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#building-better-ai-tools&quot;&gt;&lt;span&gt;Building Better AI Tools&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Oftentimes, people try to provide an analogy of AI as an intern or as a co-worker, and candidly, I don’t like either of these. This is because it doesn’t, to me, convey an intuition around the right way to build (or interact with) an AI tool that will help you get better at doing what you do. So instead, as an analogy, I like to imagine AI as an “absent-minded instructor”, not as a coworker. It’s prone to forgetting details, but ultimately there to guide you; most importantly, the goal of the instructor is to make sure you learn and learn &lt;em&gt;how&lt;/em&gt; to learn!&lt;/p&gt;&lt;p&gt;If you want to be a bit snarky about it, you can alternatively think of AI as a very overconfident rubber duck that exclusively uses the Socratic method, is prone to irrelevant tangents, and is weirdly obsessed with quirky hats. Whatever floats your ducky.&lt;/p&gt;&lt;p&gt;So, I’m going to walk through one of the anti-patterns I see in AI tooling and fix it by taking an evidence-based teaching process and imagining it augmented with AI. The teaching process, by the way, is: &lt;strong&gt;Explain, Demonstrate, Guide, Enhance.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;If you’ve ever been in scouting, you’ll recognise this as their EDGE method with a small difference; rather than “enable”, I’m using “enhance”. The reason for that is because “enable” is about having someone perform the action, but we are already sprinkling human actions all the way through the process. 
Instead, “enhance” is going to be about feeding that human action into the next iteration of problem solving, so that the next time someone does something, they get even better. Ideally, we want to encourage and inspire even more ambitious tasks, guiding people towards increasingly effective actions.&lt;/p&gt;&lt;p&gt;(The theory behind EDGE and similar methodologies is &lt;a href=&quot;https://www.cell.com/trends/cognitive-sciences/abstract/S1364-6613(10)00208-1&quot;&gt;Retrieval Practice&lt;/a&gt;. It turns out to be highly general, and there’s a million ways to do it, but I picked this for the example because it matches how I would teach an early career engineer the process of managing an incident, as well as the mental models and strategies I use when thinking through said process.)&lt;/p&gt;&lt;p&gt;The running example is gonna be incident management with observability tooling being used to diagnose and remediate the incident. While we’re at it, the anti-pattern we’re going to fix is “Given a prompt sent to a human, immediately initiate a response with AI.” I picked this one because it’s the one I see the most marketing on and it also has some of the most damaging potential for human expertise: in short, it’s the one thing you &lt;em&gt;absolutely don’t ever, for any reason,&lt;/em&gt; want to implement with incident management and observability tooling.&lt;/p&gt;&lt;h2 id=&quot;what-better-ai-tooling-looks-like&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#what-better-ai-tooling-looks-like&quot;&gt;&lt;span&gt;What Better AI Tooling Looks Like&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Let’s set the stage of the story… It’s way-too-late o’clock and our human is fast asleep. But what’s that I hear? &lt;sup&gt;(the author writes, ironically, being profoundly Deaf…)&lt;/sup&gt; Oh no! 
The pager!&lt;/p&gt;&lt;p&gt;Something’s on fire!&lt;/p&gt;&lt;p&gt;What does the human do? Well, they’re going to acknowledge that they’re responding to the incident, and then… They’re going to start by opening up the observability tool, right? So that’s where we’re going to start.&lt;/p&gt;&lt;p&gt;The most important thing here is that when the human opens the observability tool, they have to &lt;em&gt;actively, and with some effort,&lt;/em&gt; recall (or retrieve) the process of what to do next. This is &lt;strong&gt;crucial.&lt;/strong&gt; If your process is so baroque and messed up that people can’t remember what to do next, you should stop reading this article and fix that; AI can’t save you from a 97-step-guide-to-hating-your-life.&lt;/p&gt;&lt;p&gt;But alrighty, the human has recalled the process of incident management! Heck yeah! Now, we’ve got our fancy AI tooling because we’re living in the ✨ future ✨. What should AI do, here? (NO, it’s not “auto investigate.” Auto fix? NOPE!)&lt;/p&gt;&lt;p&gt;Let’s walk through that EDGE (Explain, Demonstrate, Guide, Enhance) process and see what helpful, human-enhancing AI tooling looks like. Additionally, I’m a little weird, so I’m going to call any AI-assisted action here an ‘interaction’ to help reinforce that effective AI tooling is about amplifying human effectiveness. 
Remember: AI should be an amplifier, not an obfuscator.&lt;/p&gt;&lt;h3 id=&quot;explain&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#explain&quot;&gt;&lt;span&gt;Explain&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s what some &lt;strong&gt;good&lt;/strong&gt; interactions look like&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Suggest missing steps (ex: “have you tried turning it off and on?”, “can you roll back the deployment before investigating further?”, “do you wanna filter?”)&lt;/li&gt;&lt;li&gt;Pull up the incident process guide (and help explain it)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Here’s what some &lt;strong&gt;bad&lt;/strong&gt; interactions look like&lt;/p&gt;&lt;ul&gt;&lt;li&gt;“click this button to perform an action”&lt;/li&gt;&lt;li&gt;“explain this error” tooltips or buttons&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Why? Because they remove human retrieval from the process and humans have no way to interact with the interface and evolve it from providing an unhelpful interaction to providing a helpful one. This is going to be a running theme. 
Retrieval is something that needs constant reinforcement so that humans continue to get increasingly effective at it.&lt;/p&gt;&lt;h3 id=&quot;demonstrate&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#demonstrate&quot;&gt;&lt;span&gt;Demonstrate&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s what some &lt;strong&gt;good&lt;/strong&gt; interactions look like&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Turn human query into system query syntax (eg turning “what are the top 10 slowest endpoints for the service I care about?” into the query syntax of your observability tool)&lt;/li&gt;&lt;li&gt;Turns human asks into UI discovery (eg a human says “can I see the SLOs for this service and the downstream customers?” and the AI provides a link to the SLO page of the tool)&lt;/li&gt;&lt;li&gt;Turn task execution questions into dynamic 15 second demos (eg a human asks “how do I compare two time ranges?” -&gt; Provide a short animation of the process, or an interactive click-through-these-steps)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;I know, I know, it’s &lt;em&gt;so&lt;/em&gt; tempting to provide a button that says “click me to do the thing”. DON’T. Not only does it deskill the human, but what if you mess it up and waste everyone’s time? Trust is crucial for developer tooling and you will &lt;em&gt;not&lt;/em&gt; get it back.&lt;/p&gt;&lt;p&gt;Lastly, think about it: when’s the last time you clicked an “auto do the thing” button and then &lt;em&gt;didn’t&lt;/em&gt; want to do several follow-up modifications of that same process? Making humans chew through a zillion tokens in order to get a simple task done is a great way to take your friction-reducing interaction and turn it into a friction-&lt;em&gt;introducing&lt;/em&gt; interaction.&lt;/p&gt;&lt;p&gt;As an aside: Yes, humans should be able to add data to this. 
If I’m pairing with a developer that I’m mentoring and I’m teaching them how to do a thing, I want the AI to be able to demonstrate similar things in the future using my actions as a starting point. Any human recall task is extremely high quality data for training and fine-tuning. Use it!&lt;/p&gt;&lt;h3 id=&quot;guide&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#guide&quot;&gt;&lt;span&gt;Guide&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s what some &lt;strong&gt;good&lt;/strong&gt; interactions look like&lt;/p&gt;&lt;ul&gt;&lt;li&gt;“You seem stuck on X. Do you want to try investigating Y?” (if and only if the human provided a high level plan of what they’re going to investigate)&lt;/li&gt;&lt;li&gt;“Do you want to ping the code owner? Would you like to view the documentation for the service?”&lt;/li&gt;&lt;li&gt;human: “I’m stuck” -&gt; AI: “what are you stuck on?” -&gt; (human answer) -&gt; AI response&lt;/li&gt;&lt;li&gt;Suggest mental model(s) for concept Z, providing references to company documentation&lt;/li&gt;&lt;li&gt;“Should we document that? Is this something we need to page another team about?” or other questions a helpful human might ask during an incident&lt;/li&gt;&lt;li&gt;“Can you tell me what steps you’re trying to accomplish?”&lt;/li&gt;&lt;li&gt;Validating responses by assessing how sensible they seem, cross checking information the human provides with information the AI can verify, asking the human for clarification if the AI detects inconsistencies&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Here’s what some &lt;strong&gt;bad&lt;/strong&gt; interactions look like&lt;/p&gt;&lt;ul&gt;&lt;li&gt;“im stuk, pls help”. 
Make the human give you an answer before providing a response; do it socratic style if you have to&lt;/li&gt;&lt;li&gt;Providing information humans didn’t ask for&lt;/li&gt;&lt;li&gt;Correcting human responses or doing fact-checking in an authoritative tone&lt;/li&gt;&lt;li&gt;“Guiding” but it feels like backseat driving by someone who would rather do it themselves&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;(In general: if you give someone a “continue” button, or a generic “provide next hint/step/action” button, they will probably learn to just spam the button and then they will break things accidentally because it’s there. It breaks the human reasoning loop.)&lt;/p&gt;&lt;h3 id=&quot;enhance&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#enhance&quot;&gt;&lt;span&gt;Enhance&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s what some &lt;strong&gt;good&lt;/strong&gt; interactions look like&lt;/p&gt;&lt;ul&gt;&lt;li&gt;After/during an action, suggest an incremental improvement (eg: filtering by time range -&gt; provide five-minutes-before-the-alert-fired as an option)&lt;/li&gt;&lt;li&gt;Revealing UI: if someone performs a compound action, give them a shortcut next time (eg: click on trace -&gt; copy trace_id? Dynamically surface a “copy trace ID” button)&lt;/li&gt;&lt;li&gt;Comparing services A and B repeatedly? Suggest split UI&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;You can also &lt;em&gt;suggest&lt;/em&gt; enhancements to existing Processes&lt;/p&gt;&lt;ul&gt;&lt;li&gt;If the tool identifies people performing N queries to grab data? 
Suggest infra pipeline improvements&lt;/li&gt;&lt;li&gt;Suggest alert refinements if the alerts aren’t actionable often enough&lt;/li&gt;&lt;li&gt;Detect manual indirect correlation (eg when people are relying on intuition), suggest instrumentation improvements &lt;ul&gt;&lt;li&gt;Here’s a real example: I had a team of people who would open an observability tool during debugging, look for slow endpoints, and then manually drill down to find abnormally slow database queries, and then intuit if it was because the database query plans had become suboptimal or that some cache had busted. Putting that information &lt;em&gt;in&lt;/em&gt; the telemetry was not an idea they had thought about, but it was very helpful for them!&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Turn a scratchpad of notes into post incident learning material&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Notice how careful I am to avoid any enhancements that remove human reasoning from the loop? That’s intentional!&lt;/p&gt;&lt;p&gt;In fact: most enhancement suggestions are of the form of adding &lt;em&gt;more&lt;/em&gt; recall prompts. They literally help embed micro-learning deeper into the process, organically.&lt;/p&gt;&lt;p&gt;As a bonus: it helps people &lt;em&gt;observing&lt;/em&gt; &lt;a href=&quot;https://link.springer.com/article/10.1007/s13752-020-00351-w&quot;&gt;learn via osmosis&lt;/a&gt;, even if they’re not actively involved in taking actions. Also, did you know there’s actual &lt;em&gt;real&lt;/em&gt; support for the idea that humans learn at the sub-action level just by observing? It’s not necessarily the primary mechanism, but it contributes to the &lt;em&gt;propagation&lt;/em&gt; of said knowledge and helps spread “how we approach doing” throughout teams very well. Humans are so neat, seriously. 
&lt;sup&gt;Ok, side tangent over.&lt;/sup&gt;&lt;/p&gt;&lt;h3 id=&quot;describing-the-pattern&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#describing-the-pattern&quot;&gt;&lt;span&gt;Describing the Pattern&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;That was a lot of information! One thing that I want to look at, zooming back out a little bit, is that there is a general set of principles here:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Reinforce human learning&lt;/li&gt;&lt;li&gt;Help humans work better together&lt;/li&gt;&lt;li&gt;Accelerate human execution in-process, don’t remove it&lt;/li&gt;&lt;li&gt;&lt;em&gt;Never&lt;/em&gt; go from blank to outcome&lt;/li&gt;&lt;li&gt;Tools should take the &lt;em&gt;right&lt;/em&gt; amount of effort to use&lt;/li&gt;&lt;li&gt;Incorporate team learning into the tool’s output&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;another-example:-code-gen&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#another-example:-code-gen&quot;&gt;&lt;span&gt;Another Example: Code Gen&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;As a bonus, here’s another example of utilizing this pattern (I’ll be much briefer this time). It’s a task that every developer does: code writing! It turns out, you shouldn’t use AI to generate the code (first).&lt;/p&gt;&lt;p&gt;Instead, work “backwards” with the AI. Generate rough documentation, rough/high-level architecture diagrams, then a testing plan, then the tests, then stubbed feature-flagged code… &lt;em&gt;THEN&lt;/em&gt; generate the code.&lt;/p&gt;&lt;p&gt;Once the code passes the tests, work backwards over the entire process and use the existing code to improve the tests, flesh out the testing plan, polish the architecture diagrams, and finalise the documentation.&lt;/p&gt;&lt;p&gt;Why? 
Because if you ask a human “is this right?” when they don’t have a solution in mind, you’re asking a validation-style question that humans can’t assess. That’s not retrieval, and even worse, we’re &lt;em&gt;really&lt;/em&gt; bad at it.&lt;/p&gt;&lt;p&gt;Alternatively, ask socratic-shaped questions such as “what should X do? How should it look? What’s the data flow? How should it behave?”&lt;/p&gt;&lt;p&gt;Every step is retrieval!&lt;/p&gt;&lt;p&gt;(As a bonus, LLM builders or fine-tuners now have a reliable source of extraordinarily high signal-to-noise ratio code if people follow these retrieval-driven-development patterns. Why AI tools don’t heavily encourage this is beyond me, especially as they’re all desperate for more high quality data.)&lt;/p&gt;&lt;p&gt;Anyway…&lt;/p&gt;&lt;h2 id=&quot;untapped-potential&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#untapped-potential&quot;&gt;&lt;span&gt;Untapped Potential&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I skimmed over cross-functional possibilities, because nobody in software engineering is super focused on that right now, unfortunately (especially in platform engineering where these types of tools are being built a lot). It makes sense: budgets are tight, teams are scrambling, helping “not us” out isn’t the highest priority at the moment. I get it. But, truly, I think that cross-functional assistance is one of the highest impact areas of AI if done right.&lt;/p&gt;&lt;p&gt;Here’s one example of cross-functional potential. 
Imagine production is down, and customer support is getting a ton of emails about what’s happening, what’s impacted, is my stuff okay, etc. Here’s what &lt;em&gt;could&lt;/em&gt; be possible, if anyone built it…&lt;/p&gt;&lt;p&gt;Customer Support could phrase a few questions for the dev team, send &#39;em over, and get a two-phased answer.&lt;/p&gt;&lt;p&gt;First, an immediate rough draft from AI, saying something like&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“Hi, this is AI’s guess at the answer. Don’t send it to customers! But just FYI for you. Also I’m pinging the devs to make sure it’s correct.”&lt;/p&gt;&lt;p&gt;– Hypothetical AI response to the Customer Support team&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Neither Customer Support nor the developer teams are stupid; if the answer that the AI is providing sounds like gibberish, it’ll help the teams understand that they might need to get some face time with each other; crucially, this can be fielded by non-ICs on either end if the ICs are deep in the middle of a focus crunch.&lt;/p&gt;&lt;p&gt;As for the second phase of the answer: the developer team gets that series of questions from Customer Support. It might sound something like this&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“Hey customer support wants to know X Y and Z. Here’s the answers I gave them, are they right? Is there anything you’d change? Please let me know if this information is accurate enough to use for responding to customer questions.”&lt;/p&gt;&lt;p&gt;– Hypothetical AI pinging the developer team in an incident&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The development team (or their engineering manager, product manager, or someone else in the loop) can then review those answers and fix &#39;em if necessary, which is &lt;em&gt;much&lt;/em&gt; faster than interrupting developer flow. This is an ok place for that! 
Also, this is still close enough to retrieval because we’re actively asking developers to confirm that the information is sufficiently accurate; it’s not &lt;em&gt;always&lt;/em&gt; close enough to retrieval, but mid-incident, this shaping helps take a low priority “not now” into an interaction that the team can perform without disrupting their flow.&lt;/p&gt;&lt;p&gt;This is only the surface of the potential for improving cross-functional collaboration, however. What if the AI answer is deeply incorrect? Or the team needs to write a brand new answer? Rather than having the team perform a context heavy translation of the problem (in the moment when they’d rather do literally anything else), give them the ability to write a fully technical, jargon heavy, and fragmented answer, and then use the AI to help rewrite that. Suppose the developers look at the first AI answer and reject it and reply back with&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“yeah no. what’s going on is that zk is borked, our sidekiq is backed up and redis is grumpy, we’re mid thru traffic redir to a new AZ, we did blue and yellow. orange seems fine already? idk”&lt;/p&gt;&lt;p&gt;– A jargon heavy in-context summary during an incident&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;While that’s a useless reply for Customer Support as is (and probably useless to &lt;em&gt;anyone&lt;/em&gt; not actively responding to the incident), AI could turn that into a friendlier answer, and prompt developers for missing bits (like ETA).&lt;/p&gt;&lt;p&gt;Plus, you probably have multiple tiers of support, too. Do you have business partners with technical experts asking through support for the “real answer”? What about tier-1 consumers? 
AI could help make it feasible to give both answers in an accurate way (after double checking with the team that the AI didn’t mess up the expansion).&lt;/p&gt;&lt;p&gt;That could then further be integrated with Customer Support software so that they see the live incident info, know when incidents are happening or resolved, and view live answers so that they aren’t stuck fielding questions they don’t know answers to.&lt;/p&gt;&lt;p&gt;There’s a ton of potential here, but until leadership teams begin perceiving building software for internal improvements to be as impactful, value wise, as shipping features, platform engineering teams likely won’t be able to build this type of thing. In addition, without existing demand, it’ll be hard to sell it or create the environmental conditions necessary for vendor integrations to start organically appearing. Sigh.&lt;/p&gt;&lt;h2 id=&quot;close_incident&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/stop-building-ai-tools-backwards/#close_incident&quot;&gt;&lt;span&gt;/close_incident&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;You know, it’s funny. Originally, I thought to myself, “oh, this will be a short article and I’m just going to kind of bang it out…”, and then it turns out that there’s a reason my bio tagline says “I have thoughts. Lots of thoughts. They never stop thinking. They never stop thunking.” I’m sure next time I’ll remember to keep it concise-er. Maybe.&lt;/p&gt;&lt;p&gt;Oh yeah, conclusion. Gotta get that catchy takeaway, right? (ahem)&lt;/p&gt;&lt;p&gt;When it comes right down to it, we are building our AI tooling backwards. The backwards tooling is resulting in skill deficiencies and is de-skilling people by taking the one &lt;em&gt;single&lt;/em&gt; thing that humans are really, really good at and attempting to have AI replace–rather than augment–that part of us. 
Naturally, we also managed to pick the thing that AI itself is extraordinarily bad at: cumulative learning in a collaborative fashion (after all: it can neither reason, nor work collaboratively). To make matters worse, we feed those two broken processes into each other, creating a feedback loop that completely derails the effectiveness of human/computer interaction.&lt;/p&gt;&lt;p&gt;Seriously, we need to cut that out. We don’t &lt;em&gt;have&lt;/em&gt; to do that, either! (I’m not even making this up! There’s evidence for this! Science!)&lt;/p&gt;&lt;p&gt;If you build tools for collaborative learning, if you prioritise assisting and augmenting a human driven process over outputting exponential amounts of noise, then what you’re going to end up doing is building tooling that helps humans get better at getting better. That, in turn, then helps the tooling get better, which then helps the humans get better; the result is the creation of a reinforcing positive feedback loop rather than a reinforcing negative feedback loop. Please, y’all, put the emphasis on humanity back into our tooling rather than pretending nothing matters, as if somehow humans will supposedly be irrelevant in a few years. Although, arguably, that human focus was never in our tooling in the first place–I mean, let’s be real here.&lt;/p&gt;&lt;p&gt;Systems tooling is ripe for revolutionary changes in how they’re imagined, how they’re implemented, and how they’re valued. But those changes will &lt;em&gt;never&lt;/em&gt; materialise if we don’t build them to be human-first. Don’t just keep humans in the loop, remember that humans &lt;em&gt;are&lt;/em&gt; the loop.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>The Future of Observability: Observability 3.0</title>
    <link href="https://hazelweakly.me/blog/the-future-of-observability-observability-3-0/" />
    <updated>2024-12-09T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/the-future-of-observability-observability-3-0/</id>
    <content type="html">&lt;p&gt;Observability, so hot right now. Over the years, we’ve seen observability go from an unknown concept to a ubiquitous phrase that everyone is desperate to stamp on their products. We’ve seen projects come, evolve, and die. We’ve seen technologies emerge out of the ashes, born from the tears of SREs long departed. Yet, amongst all of this growth, all of this innovation, one question remains: and then &lt;em&gt;what&lt;/em&gt;?&lt;/p&gt;&lt;p&gt;You see, it turns out observability is pretty useless because it doesn’t &lt;em&gt;do&lt;/em&gt; anything. Not by itself, that is. Which makes sense! Computers don’t do anything until you turn them on; bikes don’t go forward unless you pedal them; raw materials won’t turn themselves into a building without blueprints and labour. But observability? Somehow it’s this thing that we’ve been able to grow into a multi billion dollar industry composed entirely out of “get a bunch of data, and then…”&lt;/p&gt;&lt;p&gt;That’s it.&lt;/p&gt;&lt;p&gt;Just “and then”. Nothing else, nothing more. That’s right! GET ALL THE THINGS! UPLOAD &lt;strong&gt;THE WHOLE INTERNET&lt;/strong&gt; INTO A ZIP FILE CALLED &lt;code&gt;final_super_legit-awesome.observability.pdf.com&lt;/code&gt;. And then…&lt;/p&gt;&lt;p&gt;Will someone please tell me what the fuck the “and then” part is supposed to be? Anyone? And then &lt;strong&gt;WHAT&lt;/strong&gt;? Draw the rest of the fucking owl already.&lt;/p&gt;&lt;p&gt;If you remember seeing a previous post I wrote on &lt;a href=&quot;https://hazelweakly.me/blog/redefining-observability&quot;&gt;redefining observability&lt;/a&gt;, I brought this up as something I specifically wanted to address in the new definition of observability that I proposed. 
In fact, this “and then” problem is a huge reason why my definition is&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;A process through which one develops the ability to ask meaningful questions, get useful answers, &lt;strong&gt;and act effectively on what you learn.&lt;/strong&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;See that? The “act effectively” part. That, that right there, that’s the stuff. Gimme more of that. I still don’t see nearly enough of that. Honestly, I don’t really see it talked about anywhere, period.&lt;/p&gt;&lt;p&gt;Which is wild, right? Imagine a sales team that had a motto of “we create sales leads!” and then never talked about the “and then” part. Imagine a marketing department whose mission was “we create impactful campaigns to raise awareness!” and then never bothered to think about what comes after that. Imagine an engineering organisation who aligned around the most beautiful strategy of “we design reliable systems” and then forgot to care about writing the fucking code in the first place.&lt;/p&gt;&lt;p&gt;Seriously? Who would hire those jokers? Who would trust products produced by that malarkey?&lt;/p&gt;&lt;p&gt;Yet I see so much talk about observability out there that’s all about data, gathering data, thinking about data, cleaning data, making systems observable, making systems monitorable, …&lt;/p&gt;&lt;p&gt;And then what?&lt;/p&gt;&lt;p&gt;Alerts! They cry! We’ll get SLOs! SLIs! So many fucking configuration files! We’ll instrument your deployments! You’ll be able to slice and dice! The whole world’s your oyster!&lt;/p&gt;&lt;p&gt;&lt;strong&gt;And. Then. What?&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;(Crickets chirp. 
Angels weep.)&lt;/p&gt;&lt;h2 id=&quot;observability-through-the-ages&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/the-future-of-observability-observability-3-0/#observability-through-the-ages&quot;&gt;&lt;span&gt;Observability Through the Ages&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Let’s back up. How’d we get here, anyways? You can’t understand what’s next without knowing where you came from; you can’t draw the rest of the owl if you don’t know what an owl is in the first place.&lt;/p&gt;&lt;p&gt;Okay, we’ll go back in time to when we had observability 1.0 and start there. Back then we just called it “instrumenting your system so you can unfuck it up later” like a buncha heathens. We had the three pillars of logs, traces, metrics, and all of their derivatives (y’know, RUM, APM, dashboards, …).&lt;/p&gt;&lt;p&gt;Life was great, right?&lt;/p&gt;&lt;p&gt;As an engineering executive, all you needed to do and worry about was to implement the three pillars… And then what? Eh, don’t worry about that, just implement the pillars. They’re capabilities! It’s a maturity model! Just allocate the headcount, implement the thing, watch number go from zero to mature, and sleep tight knowing that High Impact Awesomesauce is happening.&lt;/p&gt;&lt;p&gt;Well, okay, it turns out we need to back up a little bit because that isn’t really what happens in reality. Despite implementing the three pillars, you run into a lot of limitations, all stemming from a central foundational choice: multiple sources of truth, with no ability to correlate them, leads to an inability to ask meaningful questions. Charity refers to this when she talks about how observability 1.0 ends up &lt;a href=&quot;https://www.honeycomb.io/blog/one-key-difference-observability1dot0-2dot0&quot;&gt;“making decisions at write time about how you and your team would use the data in the future”&lt;/a&gt; in her nice observability 1.0 vs 2.0 article. 
I like that phrasing a lot, but I personally like to emphasise the “inability to ask meaningful questions” part a bit more than the “write-time decisions” and “existence of pillars” parts. That’s just me, though.&lt;/p&gt;&lt;p&gt;However, I do want to note that these are equivalent ideas: I’m not reinventing anything here. That might get missed, so I wanna really spell it out. These concepts are equivalent. If anything, they’re duals to each other; different facets of the same shape:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Observability 1.0 is defined by pillars&lt;/li&gt;&lt;li&gt;Observability 1.0 defines the questions you can ask at write time&lt;/li&gt;&lt;li&gt;Observability 1.0 results in an inability to ask meaningful questions&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Now, let’s go forward in time a little bit to where observability 2.0 comes in. The thing that defines observability 2.0 is a combination of one data format, one storage location, and one source of truth. Charity would argue that the structured log events are the data format that is required by observability 2.0; I think anything that lets you build relational data structure on top works. Consequently, that implies that a time series, logs, and any other temporal data structure that can be decorated with metadata works fine. While, yes, that’s &lt;em&gt;true&lt;/em&gt;… Seriously, use structured log events, you’ll hate yourself a whole lot less.&lt;/p&gt;&lt;p&gt;Regardless! The big shift here from an &lt;em&gt;implementation&lt;/em&gt; standpoint is uniformity in data format, storage location, and source of truth. And from a capabilities standpoint, you get the ability to correlate information together. Now, rather than defining what questions you ask at write time, you define what correlations you can make at write time. 
I can’t overstate what an improvement this is; it’s a game changer.&lt;/p&gt;&lt;p&gt;Again, a lot of these different ideas are all equivalent and are really more like different ways to articulate the same point.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Observability 2.0 is defined by structured wide events as a single source of truth&lt;/li&gt;&lt;li&gt;Observability 2.0 defines the correlations you can make at write time&lt;/li&gt;&lt;li&gt;Observability 2.0 results in the potential to ask meaningful questions&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;See that last one? Now we have the potential to ask meaningful questions. Oh hell yeah, this is awesome. There’s only one teensy tiny problem I have with this, though: “and then what?”&lt;/p&gt;&lt;p&gt;You can ask meaningful questions now… But there’s still that one last question lingering in the back of the mind.&lt;/p&gt;&lt;h2 id=&quot;and-then-what&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/the-future-of-observability-observability-3-0/#and-then-what&quot;&gt;&lt;span&gt;And Then What?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Firstly, I’m going to make a somewhat controversial claim in that you can get observability 2.0 just fine with “observability 1.0” vendors. The only thing you need from a UX standpoint is the ability to query correlations, which means any temporal data-structure, decorated with metadata, is sufficient. Hell if you hate yourself enough, you don’t actually need the temporal part to be a real clock, a logical clock works just fine.&lt;/p&gt;&lt;p&gt;Now, is that hard as fuck with observability 1.0 tooling? Yeah, generally; there’s a reason you don’t really do that. 
I mean, you can &lt;em&gt;also&lt;/em&gt; &lt;a href=&quot;https://sourceforge.net/projects/brainfix/&quot;&gt;implement your entire backend in brainfuck&lt;/a&gt; too, but… &lt;em&gt;Why&lt;/em&gt;?&lt;/p&gt;&lt;p&gt;The point I’m really making here is that the tooling and/or vendor choice(s) don’t actually restrict or limit the capabilities you get out of them from a purely technical standpoint. Which, naturally, goes both ways. I can’t tell you how many times I’ve run into people using observability 2.0 tooling, super modern vendors, really excellent tooling, and getting absolutely zero value out of it. Slicing and dicing auto-instrumented code with zero manual instrumentation, &lt;em&gt;wrong&lt;/em&gt; instrumentation, broken service graphs, disconnected distributed tracing, and every other crime under the sun. Not only is it quite &lt;em&gt;possible&lt;/em&gt; to hold the tool wrong, but damn y’all, I remain fairly convinced that “holding it wrong” is the case in by far the vast majority of observability implementations out there.&lt;/p&gt;&lt;p&gt;Which means, there’s gotta be something else here; it’s not just the ones and zeroes, because those aren’t the thing holding us back as an industry. So in comes observability 3.0. Or rather, my prediction on what observability 3.0 is going to look like. There’s a technical component to it, sure, but the main one is social.&lt;/p&gt;&lt;p&gt;Are you ready? 
Here are my predictions:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Observability 3.0 backends are going to look a &lt;em&gt;lot&lt;/em&gt; like a data lake-house architecture&lt;/li&gt;&lt;li&gt;Observability 3.0 will expand query capabilities to the point that it mostly erases the distinction between pay now / pay later, or “write time” vs “read time”&lt;/li&gt;&lt;li&gt;Observability 3.0 will, more than anything else, be measured by the value that &lt;em&gt;non-engineering functions&lt;/em&gt; in the business are able to get from it&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;That last one is the big one, it’s the whole point of the damn thing, and it’s the entire reason for the 3.0 instead of calling this a 2.5 or something like that. The critical difference in observability 2.0 and 1.0 vs 3.0, for me, is that the success of rolling out observability 1.0 or 2.0 relies entirely on how valuable it is for the engineering organisations in the business. For observability 3.0, that’ll still be important, but the success of it will be mostly defined by how non-engineering functions are able to use it.&lt;/p&gt;&lt;p&gt;Remember my definition of observability? Here it is, for posterity:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;A process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Observability 1.0 gave us lots of useful answers, observability 2.0 gives us the potential to ask meaningful questions, and observability 3.0 is going to give us the ability to act effectively on what we learn.&lt;/p&gt;&lt;p&gt;In the next few years, I’m going to be looking at observability vendors pretty critically, looking for indicators that they’re indexing on the “act effectively” part. I’m especially going to be looking for vendors to think deeply about how they bring the whole rest of the business in on this too. 
There’s a lot of things moving underneath the surface that make me think this is going to be a much bigger thing soon. In fact, if anything, I wouldn’t be surprised if this type of thinking defined meaningful innovation in observability for the next few years.&lt;/p&gt;&lt;p&gt;I also don’t think observability 3.0 is incompatible with existing vendors! You don’t have to rip out your existing stack to get observability 3.0 and, honestly, please don’t. Or at least, not yet. But frankly, if your vendors can’t help you deliver meaningful value to the entire business then why are they even there?&lt;/p&gt;&lt;p&gt;Naturally, that extends to engineering leadership as well; if they can’t figure out how to turn headcount into business results, exit them. We’ve had multiple decades as an industry to figure out how to deliver meaningful business value in a transparent manner, and if engineering leaders can’t catch up to other C-suites in that department soon, I don’t expect them to stick around another decade.&lt;/p&gt;&lt;p&gt;Back to observability: Overall, I’m excited to see how the observability product offerings get refined over the next few years to become increasingly valuable to those outside of engineering. Today, I don’t think we do a very good job as an industry of bringing the rest of the business along for the ride, which is why I wrote this post, but I don’t think there’s anything stopping us from doing this once we start treating it as important. After all, if the business can’t effectively learn, then there &lt;em&gt;is&lt;/em&gt; no “and then what.”&lt;/p&gt;&lt;p&gt;So what is the “and then what”, you ask? Effective and collaborative action. Organizational learning. &lt;a href=&quot;https://osf.io/preprints/psyarxiv/tfjyw&quot;&gt;Collective social learning.&lt;/a&gt; That’s what.&lt;/p&gt;&lt;p&gt;Always has been.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>You Have One Voice</title>
    <link href="https://hazelweakly.me/blog/you-have-one-voice/" />
    <updated>2024-10-21T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/you-have-one-voice/</id>
    <content type="html">&lt;p&gt;I was originally going to call this post “What makes a programming language real?” because I saw some people picking a fight on the internet about this type of topic, yet again, and it got me thinking as to why we even broach the topic in the first place. Surely, one might think, a programming language can just exist peacefully without being questioned as to its legitimacy, right? Well, clearly not. But, that brings to mind for me: why exactly do we care so much? What’s the point?&lt;/p&gt;&lt;p&gt;However, I realised there’s actually a more important point here, lying underneath the surface. We’re human, which means that we all have one voice, one life, and one source of energy. So, why do we spend time tearing others down when we can build them up?&lt;/p&gt;&lt;h2 id=&quot;all-languages-are-real-because-no-language-is-real&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/you-have-one-voice/#all-languages-are-real-because-no-language-is-real&quot;&gt;&lt;span&gt;All Languages Are Real Because No Language Is Real&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I’m going to start this by talking about the idea of what makes a programming language real. So, I’m going to be giving you my personal opinion on things, and then we’ll go over how I got here, why I changed my mind, and where I’m at now. I’m going to try and avoid generalizing this to &quot;the community should… &quot; because, honestly, most of y’all are adults and you’ve already mostly made up your mind; it’s not really a productive use of anyone’s time to try and actually sway people one way or another, so I’m not going to. But, I think it’s important to express why exactly I approach this topic the way I do (spoiler alert: it has very little to do with the question as literally phrased).&lt;/p&gt;&lt;p&gt;So, first off: I think all programming languages are real and equally legitimate. 
Java, C, C++, Go, Rust, Haskell, and so on? Great. Typescript, Elixir, Coffeescript, and other transpiled languages? Excellent. HTML, CSS, XML, YAML, and other “markup/configuration”-esque languages? Absolutely, those all count.&lt;/p&gt;&lt;p&gt;Yes, HTML counts too; yes, so does Yaml; yes, even CSS; yes to all of them. Not only do I think all of these count as real programming languages, I think they’re all equally valid and legitimate.&lt;/p&gt;&lt;p&gt;I’m going to go ahead and pause here for a moment so that y’all can get a new cup of coffee, clean up your keyboard, yell at the clouds, touch grass, or otherwise centre yourself. I’ll wait.&lt;/p&gt;&lt;p&gt;Ready?&lt;/p&gt;&lt;p&gt;Okay, cool. Before I get into why I think these are legitimate, I want to talk about the young me from a while back, several years ago, who did &lt;em&gt;not&lt;/em&gt; have the same opinion. Young me absolutely would’ve dunked on people for thinking CSS or HTML was a real language; not only that, but PHP? Being as legitimate as Haskell or Java? Seriously? Pffh. Not only did I have a fairly hard line for what “real” and “not real” meant, but I also had a fairly nuanced taxonomy and tier list for how legitimate a language was and how appropriate it was to use for a certain task. Only &lt;em&gt;absolute idiotsssssss&lt;/em&gt; would use a language for an inappropriate use, &lt;em&gt;obviously&lt;/em&gt;. I was vocal about it, as well; you absolutely would’ve heard my opinion on this if you knew young(er) me.&lt;/p&gt;&lt;h2 id=&quot;gatekeep-me-not&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/you-have-one-voice/#gatekeep-me-not&quot;&gt;&lt;span&gt;Gatekeep Me Not&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I didn’t change my mind on languages for a while. 
My special interest and passion used to be Programming Language Theory, and I very seriously deliberated and nearly went to pursue a PhD for it, with a particular interest in optimizing compilers and type theory. However, something very specific clicked for me the first time I read an article on the internet of how another person had changed their mind and had stopped dunking on languages (it was &lt;a href=&quot;https://blog.aurynn.com/2015/12/16-contempt-culture&quot;&gt;Aurynn Shaw, with her lovely article on contempt culture&lt;/a&gt;, which you should totally read).&lt;/p&gt;&lt;p&gt;Her reasoning? As I understood it, what happened for her was that someone laid out for her that many of the journeys and paths into tech that she was criticizing were ones dominated by women. So consequently, if she spends her time criticizing those languages or frameworks, then she’s inadvertently targeting things primarily used by certain demographics of people. And, well, is that really what she wants to do with her time? After she sat down and reflected on this… She decided to not be part of the problem.&lt;/p&gt;&lt;p&gt;That article had quite the impact on me, to say the least. Not necessarily because of its main point, but because of two larger hidden points behind it.&lt;/p&gt;&lt;p&gt;The first point. Tech is not a vacuum, nor is it apolitical; regardless of how objective we might want to make an analysis of technology, the people who build it and use it and think about it remain. Those people are going to find part of their identity in the tech that they use, and that is a feature of humanity, not a bug. We should lean into that! It’s awesome! But it also means that when we belittle and attack technology, we are inevitably attacking groups of identities that choose to associate with that technology.&lt;/p&gt;&lt;p&gt;The second point. When we declare the legitimacy of something, as a society, we often do so at the expense of another thing. 
Likewise, when we declare the illegitimacy of something, as a society, we often do so in order to belittle or ostracise or otherwise hurt a particular group associated with it. Whether intentional or not, that is a profound and inescapable result. It doesn’t “have” to be that way; there’s definitely ways to say “this tech is legitimate” without criticizing another tech choice, but be real, how many times have you actually seen that happen? Yeah, I thought so. So, as a person in technology that people look up to, I have a choice: I can spend my energy putting groups of people down, or I can spend my energy lifting them up.&lt;/p&gt;&lt;p&gt;I decided a long time ago that I will always choose to spend my energy lifting people up rather than tearing them down. This remains, today, one of my most central viewpoints that I try very hard to adhere to. Even if I want to vent, or want to rant, or get so enormously frustrated with a certain technology product or language or community that I want to talk about it, I try my very hardest to talk about things in a constructive manner rather than an inherently negative and unproductive manner. If you ever catch me saying “this tool is garbage”, feel free to call me out, because I will absolutely rephrase that.&lt;/p&gt;&lt;h2 id=&quot;experiencing-the-shift&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/you-have-one-voice/#experiencing-the-shift&quot;&gt;&lt;span&gt;Experiencing The Shift&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I’m going to get a bit more specific here for a second. I’ve been talking fairly broadly, but now I’d like to zoom into a community that I’ve been in and adjacent to for the entirety of my tech career, which is the front-end and design community. 
Did y’all know that for years I thought I wanted to be a graphic designer, and I even bought books in grade school and practiced designs and got pretty solidly good at it before I even graduated high school? Given that most people on the internet know me for sociotechnical, organisational, and infrastructure related stuff, that might come as a surprise, but I was very much a digital artist vibes kinda person in high school. It’s one of the reasons I’ve put &lt;em&gt;so&lt;/em&gt; much time into the design of my blog; I care about it a lot!&lt;/p&gt;&lt;p&gt;But I’m going to be real here. In the last decade or so, the trend I’ve noticed more than anything else in the front end community is two things:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The rise of ReactJS, and subsequently the creation of the full-stack javascript engineering role as a discipline&lt;/li&gt;&lt;li&gt;The feminization of graphic/web design, and its subsequent loss of respect, lowering in pay, and cutting of headcount industry wide&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;These have not been amazingly sudden, but the schism created has been stark, divisive, severe, and sustained. Chris Coyier talked about this in his article &lt;a href=&quot;https://css-tricks.com/the-great-divide/&quot;&gt;The Great Divide&lt;/a&gt;, although he didn’t point out some of the things I’m saying here. While I don’t have hard data to back up the second point, it’s a trend I’ve personally noticed and have heard from several others as well. 
It’s not particularly a new thing either, because this has happened in several other industries too; in fact, there’s this thing that seems to happen whenever an industry shifts its perceived gender.&lt;/p&gt;&lt;p&gt;Male dominated industries tend to have a few qualities:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;they’re &lt;em&gt;perceived&lt;/em&gt; as being legitimate&lt;/li&gt;&lt;li&gt;they’re &lt;em&gt;perceived&lt;/em&gt; as being difficult&lt;/li&gt;&lt;li&gt;they’re &lt;em&gt;perceived&lt;/em&gt; as being merit based&lt;/li&gt;&lt;li&gt;they’re almost always more high paying&lt;/li&gt;&lt;li&gt;they’re almost always more respected&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;When they shift to being female dominated, they lose all of those qualities. For example, when nursing went from being male dominated to female dominated, you could chart in real time the public perception of it along all of these qualities; it’s not that the nature of the work changed, it’s that we don’t want to respect women. Likewise, when I noticed the front-end community split into a “respected” engineering + “disrespected” design chasm, I saw the exact same thing happen.&lt;/p&gt;&lt;p&gt;Seeing a community go from uplifting CSS and heralding things such as the &lt;a href=&quot;https://csszengarden.com/&quot;&gt;CSS zen garden&lt;/a&gt; as a feat of engineering to seeing just about nobody give a single flying fuck about the incredible works of &lt;a href=&quot;https://labs.jensimmons.com/&quot;&gt;Jen Simmons&lt;/a&gt; is… Surreal. 
I can’t even say “this happened in my lifetime” because I’m too fucking young to be saying shit like “this happened in my lifetime.”&lt;/p&gt;&lt;p&gt;But we’re here now, and it’s been very interesting to reflect on how the changing language I’ve seen utilised, the technological choices people have made, and how engineering organisations and communities approach certain problems has fundamentally shifted public perception of this type of work in such a profound way that it’ll take decades to undo the damage. It’s breathtaking. How the hell did we get here, anyway?&lt;/p&gt;&lt;p&gt;I’m not going to put on a tinfoil hat and say that us shitting on CSS as being “a fake language” ruined the web industry and caused billions of dollars of economic damage. Really, I’m not. What I am going to say, though, is that things have consequences, and systematically devaluing an entire industry of people is going to have very unintended and far reaching consequences. We need to think about that a lot harder than we do now.&lt;/p&gt;&lt;h2 id=&quot;human-legitimacy&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/you-have-one-voice/#human-legitimacy&quot;&gt;&lt;span&gt;Human Legitimacy&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;You know what about this really worries me? I get really worried when I look at the state of the tech community and what we’re about to go through in the next two decades. We are not ready to undergo the changes that the industry is about to go through, and a huge issue preventing this is precisely the “legitimacy” gap I brought up. As long as we tear down each other rather than build each other up, we’re going to keep fucking up this tech thing and ruining it for everyone, “elite in-crowd” included. What happens when we tear down all the people who would’ve grown up to help shape the future of the industry? 
That’s right, we end up not having a real future.&lt;/p&gt;&lt;p&gt;It’s not just about tech choices, it’s also people, and their communities, their backgrounds, and their cultures, too. How many people shit on Americans from the south, with their “unrefined” and “unintelligent” southern accent? How many people grow up having to learn how to lose the accent they were born with in order to be taken seriously in the tech field? How many people in developing countries are going to get relegated to being viewed as trash off-shore keyboard monkeys because that’s all the “hot and trendy” tech market thinks they’re good for?&lt;/p&gt;&lt;p&gt;How many legitimate, wonderful, brilliant people are we going to massacre at the altar of elitism and tech exceptionalism before we realise that we’re destroying all of the humanity in the world for the sake of nothing?&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Cut it out, already.&lt;/strong&gt; People are people, we’re all one group, and we’re all on the same fucking planet. Javascript is legitimate, HTML is legitimate, people who do wordpress are legitimate, compiler engineers are legitimate.&lt;/p&gt;&lt;p&gt;You’re all legitimate and wonderful and amazing and capable of so much incredible stuff, especially if we learn how to lift each other up and support each other rather than tearing each other down.&lt;/p&gt;&lt;h2 id=&quot;you-have-one-voice&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/you-have-one-voice/#you-have-one-voice&quot;&gt;&lt;span&gt;You Have One Voice&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;At the end of the day, you have one voice. Humans are single threaded; no matter how many plates we have spinning in the air, there’s only one word that can come out of your voice at a time. 
There’s only one thing you can type at a time, and only one thing you can do with each moment of your life.&lt;/p&gt;&lt;p&gt;Whenever you spend time on one thing, you also &lt;em&gt;don’t&lt;/em&gt; spend time on everything else. This compounds severely with the understanding we now have of negativity being more impactful to human memory than positivity.&lt;/p&gt;&lt;p&gt;So, you have a choice: you’re going to be remembered for the impact that you’ve had and the time that you’ve spent achieving that impact. Do you want to be remembered as someone who tore down other people? Or as someone who lifted them up? Do you want to be a negative impact on the world? Or a positive one?&lt;/p&gt;&lt;p&gt;You have one voice. One life. One moment at a time. What are you going to do with it?&lt;/p&gt;&lt;p&gt;As for me, I made my choice a long time ago: With every word I say, and every moment I have, I will try my hardest to lift up my communities. Because we’re all in this together, and the world we can make together is so breathtakingly beautiful, why would you not want to be a part of building that?&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>OpenTelemetry Challenges: Handling Long-Running Spans</title>
    <link href="https://hazelweakly.me/blog/opentelemetry-challenges-handling-long-running-spans/" />
    <updated>2024-10-10T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/opentelemetry-challenges-handling-long-running-spans/</id>
    <content type="html">&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Long running spans are one of my biggest “we don’t really actually have a good standard solution for this” issues in opentelemetry. They’re something I’ve run into before, weirdly frequently, and especially so when attempting to try and instrument front-end or mobile facing systems.&lt;/p&gt;&lt;p&gt;It turns out, though, that the issues here with long running spans are actually pretty similar to the issues with interrupted, partial, or unclosed spans. In fact, they’re really mostly the same thing (with the added bonus that if you do tail sampling your sampling decision is going to happen before the span ends, which exacerbates the problem by turning functioning spans into partial and broken ones)&lt;/p&gt;&lt;p&gt;Anyway, I talk about this and more in this article and even point out a really cool solution from &lt;a href=&quot;https://embrace.io&quot;&gt;Embrace&lt;/a&gt; that addresses this in an opentelemetry compatible way (hint: does the phrase “write ahead log” make you quiver with excitement? If so, you should definitely read this)&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;a href=&quot;https://thenewstack.io/opentelemetry-challenges-handling-long-running-spans/&quot;&gt;https://thenewstack.io/opentelemetry-challenges-handling-long-running-spans/&lt;/a&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>The 4 Evolutions of Your Observability Journey</title>
    <link href="https://hazelweakly.me/blog/the-four-evolutions-of-your-observability-journey/" />
    <updated>2024-10-03T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/the-four-evolutions-of-your-observability-journey/</id>
    <content type="html">&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;When going on an observability journey, there tends to be a few concrete phases that every company goes through. Understanding how those unfold and take shape as you mature your observability practices can help you identify when you’ll run into certain types of challenges, and when you’ll start really wanting certain tools and practices to help address those challenges.&lt;/p&gt;&lt;p&gt;That said, when you’re communicating about this to others, you might often find that it’s difficult to explain how you know where you are in the journey, or articulate the issues you’re running into. Often, people express difficulty getting a shared understanding around this, which is where mnemonics and mental models can come in handy.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Check out my first post on The New Stack! &lt;a href=&quot;https://embrace.io/&quot;&gt;Embrace&lt;/a&gt; reached out to me a while back and we ended up chatting about mobile observability challenges and quite a few other things. This is the result of some of those conversations! Take a look!&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://thenewstack.io/the-4-evolutions-of-your-observability-journey/&quot;&gt;https://thenewstack.io/the-4-evolutions-of-your-observability-journey/&lt;/a&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Cache Me Not, Cache Me, Cache Me Not</title>
    <link href="https://hazelweakly.me/blog/cache-me-not-cache-me-cache-me-not/" />
    <updated>2024-09-19T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/cache-me-not-cache-me-cache-me-not/</id>
    <content type="html">&lt;p&gt;Caching is hard. So hard. But also, we are so fucking bad at it. Every time I have to use a public wifi setup I have a joker moment. Does absolutely nobody test shit on anything less than wired symmetric gigabit anymore?&lt;/p&gt;&lt;p&gt;Web SPA apps are some of the worst for this. Motherfucker, you have the same fucking iconography for three years, why does it load correctly and then ALL OF THE ICONS FAIL ONCE I DROP TO A SHIT INTERNET CONNECTION?!&lt;/p&gt;&lt;p&gt;I didn’t even reload the page?! The fuck are you doing?&lt;/p&gt;&lt;p&gt;But seriously, caching is hard, it’s really hard, but you can make life WAY easier for yourselves when building an SPA if you do this super simple thing.&lt;/p&gt;&lt;p&gt;Break down all your content into two axes:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;push vs pull&lt;/li&gt;&lt;li&gt;owned vs user&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Whenever possible, turn your pull assets into push assets and all of your user assets into owned assets.&lt;/p&gt;&lt;p&gt;What do I mean by that? Let’s break it down further. There’s four categories:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Push + Owned&lt;/li&gt;&lt;li&gt;Push + User&lt;/li&gt;&lt;li&gt;Pull + Owned&lt;/li&gt;&lt;li&gt;Pull + User&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;I’m making up terminology but fuck it, I’m drinking coffee and eating my donut and doing this live at a coffee donut shop, so bite me. &lt;strong&gt;Push&lt;/strong&gt; means that the asset is pushed to a central server and then distributed. &lt;strong&gt;Pull&lt;/strong&gt; means the asset is referenced and the central server has to “pull” the content. &lt;strong&gt;Owned&lt;/strong&gt; means it’s owned by the central server. &lt;strong&gt;User&lt;/strong&gt; means it’s user-submitted content.&lt;/p&gt;&lt;p&gt;Every content caching strategy on the web (and most non-web stuff too, tbh) fucks this up because the web isn’t designed to handle this notion of push vs pull or user vs owned. 
However, the web &lt;em&gt;is&lt;/em&gt; designed extraordinarily well to handle the concept of pull + owned, and with service workers, it’s also extremely equipped to handle push + owned. Mobile apps and other non-web stuff are also designed to handle push + owned extremely well, so it’s even more disappointing how bad things have gotten with this lately.&lt;/p&gt;&lt;p&gt;So, make everything owned, and make it all push if possible, &lt;em&gt;especially&lt;/em&gt; if it’s an SPA or a native application of some sort.&lt;/p&gt;&lt;h2 id=&quot;push-+-owned&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/cache-me-not-cache-me-cache-me-not/#push-+-owned&quot;&gt;&lt;span&gt;Push + Owned&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Push + Owned is weird, so let’s go over that first. Content is pushed to a central server and then distributed. And it’s owned by the central server. This is ideal, because it means that your expiration time can be infinite. You only expire content on the client when explicitly told to.&lt;/p&gt;&lt;p&gt;Got that? &lt;strong&gt;You. Only. Expire. Content. On. The. Client. When. Told. To.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Make everything push + owned if you can. That means not returning a 304, it means not even trying the web request. Put that shit in STORAGE. UI icons are the classic example.&lt;/p&gt;&lt;p&gt;It turns out, however, that you can make a shit ton of other stuff push + owned if you try a little harder. User profile picture? If the user doesn’t change their profile picture, it’s guaranteed to be the same. Guess what fukers, expire it on new upload but otherwise OWN IT. IT DOESN’T NEED TO BE RE-FETCHED. Custom emojis n shit? Own them. 
Expire them only when they’re changed.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“But Hazel how does the client check if they’re expired?”&lt;/p&gt;&lt;p&gt;– an unenlightened programmer&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Use “stale-while-revalidate”. Ur welc’&lt;/p&gt;&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Store asset&lt;/li&gt;&lt;li&gt;Use stale-while-revalidate access patterns&lt;/li&gt;&lt;li&gt;Should work offline&lt;/li&gt;&lt;/ul&gt;&lt;h2 id=&quot;push-+-user-and-pull-+-owned&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/cache-me-not-cache-me-cache-me-not/#push-+-user-and-pull-+-owned&quot;&gt;&lt;span&gt;Push + User &amp; Pull + Owned&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;These sound super different, but they’re actually almost identical. You should handle both of these with hashed URLs.&lt;/p&gt;&lt;p&gt;If a user pushes content to a server to be referenced, you can hash the URL and treat it immutably. So, do that! Likewise, if your content that you own needs to be pulled, since you own it, you can reference it with a stable URL and then… You got it, treat it immutably.&lt;/p&gt;&lt;p&gt;A great example of the push + user content would be a user uploading an image to a comment on a social forum. You have no way to &lt;em&gt;store&lt;/em&gt; this, reasonably, and the space is infinite, so you’re referencing it. But you can save yourself a ton of trouble if you have them upload the content rather than link it, and then you can host it on your own CDN, cache the URL, and reference it immutably.&lt;/p&gt;&lt;p&gt;An example of the pull + owned content would be content you own, but that is dynamic in a way that, e.g., iconography isn’t; a solid example of that would be temporary or other “in-content” assets. 
Dynamically generated content is a prime example, and so are assets that another part of the central server owns but you need a central source of truth and so you can’t re-upload it. This is particularly common when you switch from monolithic backends to more micro or service oriented architectures; you might end up not owning a lot of the content that you “own”.&lt;/p&gt;&lt;p&gt;That’s because ownership in this case means you can tell when the content changed because your service was in charge of changing it; that’s not possible in a lot of microservice architectures, which explains the rise of event driven architectures and being able to subscribe to the content-change notifications of another service. So, you’re left with content-addressed URIs.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Load asset&lt;/li&gt;&lt;li&gt;Use infinite TTL + hashed URLs&lt;/li&gt;&lt;li&gt;Should not re-fetch across page/app reloads&lt;/li&gt;&lt;/ul&gt;&lt;h2 id=&quot;pull-+-user&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/cache-me-not-cache-me-cache-me-not/#pull-+-user&quot;&gt;&lt;span&gt;Pull + User&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Let’s talk about Pull + User since that’s the other weird pattern not covered by “standard caching how-to” guides on the internet.&lt;/p&gt;&lt;p&gt;That’s where it’s user generated content, but not owned by the server. Posting gifs into the chat is a prime example; linking a blog post and generating a media upload for that is another.&lt;/p&gt;&lt;p&gt;Guess what: this pattern fits for highly dynamic user-generated content, which means it’s the content &lt;em&gt;users link to each other in-platform.&lt;/em&gt;&lt;/p&gt;&lt;p&gt;Stable URL, short TTL. &lt;strong&gt;YES, SHORT TTL.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;You would be absolutely fucking shocked how much traffic the pull + user generates. 
It’s atrocious, but it’s also extraordinarily cache hostile, and has a half life of minutes to hours.&lt;/p&gt;&lt;p&gt;You would also be shocked how many assumptions caching it will break on behalf of users.&lt;/p&gt;&lt;p&gt;It breaks all of their assumptions. They’re gonna change something and refresh the page and get mad when it doesn’t update. All of your bug reports lie here.&lt;/p&gt;&lt;p&gt;Debounce + throttle? Sure. Micro-TTL? Yes. Cache? &lt;em&gt;Never.&lt;/em&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Fetch asset&lt;/li&gt;&lt;li&gt;Use short TTL + stable URLs&lt;/li&gt;&lt;li&gt;Should (almost) always re-fetch&lt;/li&gt;&lt;li&gt;Content should change even though the URL is the same&lt;/li&gt;&lt;/ul&gt;&lt;h2 id=&quot;tldr&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/cache-me-not-cache-me-cache-me-not/#tldr&quot;&gt;&lt;span&gt;tl;dr&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Anyways long story short don’t just give up because “lol caching is hard.” Understand your users and what the fuck you’re doing, make a strategy that builds a mental model for your developers, and then make doing the right thing easy.&lt;/p&gt;&lt;p&gt;As an example, an amazing fetching interface for encoding this concept would be&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Push + Owned: &lt;code&gt;storeAsset(URI)&lt;/code&gt;&lt;/li&gt;&lt;li&gt;Push + User: &lt;code&gt;storeContent(URI)&lt;/code&gt;&lt;/li&gt;&lt;li&gt;Pull + Owned: &lt;code&gt;loadAsset(URI)&lt;/code&gt;&lt;/li&gt;&lt;li&gt;Pull + User: &lt;code&gt;loadContent(URI)&lt;/code&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Then, you can attach automatic policies to this along with fallbacks and other appropriate handling. You can even build tests to make sure that certain parts of the page only contain the “right” type of assets! 
A comment inside of a forum would never contain a Push + Owned asset, for example, because that’s user content. Likewise, your UI will never contain anything &lt;em&gt;but&lt;/em&gt; Push + Owned content. You can also even test that all of your Push + Owned stuff works offline!&lt;/p&gt;&lt;p&gt;I’ve never seen anyone build that interface, and I’ve never seen these types of tests, but fuk me it would fix so much. Give it a go, and stop fucking up my coffee shop experience, thanks!&lt;/p&gt;&lt;p&gt;Now if you don’t mind, I’m going to go sip my coffee and reload my broken pages even though we both know that’s not going to fix the issue.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Home Baked Abstractions, Store Bought Implementations</title>
    <link href="https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/" />
    <updated>2024-08-21T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/</id>
    <content type="html">&lt;p&gt;I like to home roll abstractions, but commoditise implementations.&lt;/p&gt;&lt;p&gt;What I mean by that is a fairly simple rule that has a very powerful effect, but can be tricky to find the right balance.&lt;/p&gt;&lt;p&gt;Home rolling the abstraction, to me, means deeply exploring and fleshing out an abstraction from whole cloth, whether it be an interface, or a mental model, or a collaborative workflow, or a template, or… Anything. But to do that effectively requires context from the team, the company, the industry, and what makes you &lt;em&gt;you&lt;/em&gt;. You can’t “off the shelf” ship a meaningful abstraction around semantic metadata, for example, but it’s invaluable to &lt;em&gt;have&lt;/em&gt; one. Why? Well, because an abstraction to me is something you use to help shape and articulate the desired emergent behaviour of groups and systems; thus, by definition, the emergent behaviour is very specific to your current context.&lt;/p&gt;&lt;p&gt;Using a commoditised implementation, on the other hand, has a fairly simple litmus test: is the implementation largely outside of your company, and can it survive the subject matter expert on your team leaving? Kubernetes, Jira, Salesforce, Spark, Postgres, etc, are all great examples of commoditised implementations. This is all about improving optionality, business continuity, reducing risk, and increasing leverage; while it can help shape your abstraction, it’s not really about shaping the emergent behaviour of a system, it’s about shaping the solution space you use to solve your problems with.&lt;/p&gt;&lt;p&gt;Why is it tricky to find the right balance? Because doing so requires integrating the implementation into the abstraction, and that’s where the glue work and expertise lies. 
&lt;a href=&quot;https://www.linkedin.com/posts/charity-majors_a-home-rolled-framework-builds-a-level-of-activity-7232009821837832192-0-d0&quot;&gt;Charity Majors called this balance “vendor engineering”&lt;/a&gt; and others have called it product management, but whatever the label, it’s a very real thing, and it’s extraordinarily difficult to nail down.&lt;/p&gt;&lt;p&gt;If you go too far, you’ve built an implementation on top of another and exacerbated the very problem that using a commodity was trying to prevent. Some common examples of this are:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Home grown Kubernetes operators that do “everything”&lt;/li&gt;&lt;li&gt;Magical Jira templates and dashboards that require arcane incantations to fire off reams of “automation”&lt;/li&gt;&lt;li&gt;Customizing Salesforce to the point that onboarding new people requires a 5 week “forget everything you ever learned” crash course&lt;/li&gt;&lt;li&gt;Writing a custom batch, stream, ETL, WTF, BBQ, pipeline in Spark&lt;/li&gt;&lt;li&gt;Implementing buckets of custom business logic inside of Postgres&lt;/li&gt;&lt;li&gt;Building your own continuous delivery pipeline with thousands of lines of bash, nested CI workflows, and custom YAML templating tooling&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Do you see a trend? It’s not just a thing that programmers do! (Although I do confess that most of my examples are tech oriented because that’s the main audience of my blog posts.) The trap lies in when you embed the company context so deeply into the tool that the implementation can’t ever change and that the emergent behaviour becomes uncontrollable. You lose all of the optionality of a commoditised solution and all of the power of shaping emergent behaviour via hand rolling. 
It becomes the worst of both worlds.&lt;/p&gt;&lt;h2 id=&quot;abstraction-product-project-glue-code&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#abstraction-product-project-glue-code&quot;&gt;&lt;span&gt;Abstraction? Product? Project? Glue Code?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Is this really an Abstraction? Is it a Product? Or maybe Ways of Working? Could it be Culture? Words are hard. Whatever it is, we’re defining &lt;em&gt;something&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;I’ve asked around a bit and, while a lot of people can articulate this thing, and that it exists, and that it’s important, I don’t think we really have a name for it. I’m not sure abstraction is the right name for it, honestly, because it’s already quite the overloaded word. This is really about the process through which a collective group of people figure out a concept and then figure out how to conceptualise and chunk that concept together into more tangible and malleable shapes so that we can build bigger ideas on top of it.&lt;/p&gt;&lt;p&gt;That said, you might be thinking “what’s the point in all this definition nonsense?” The point of doing this definition work in the first place is to give people a &lt;a href=&quot;https://hazelweakly.me/blog/engineering-language&quot;&gt;shared language&lt;/a&gt; to work from, so they can build that understanding and actually ship innovation rather than functionality.&lt;/p&gt;&lt;p&gt;Imagine where we would be today if we didn’t have mathematical notation. Have you ever tried to read the original… Anything? All of the physics and mathematics that we did back then and all of the scientific thought was all around rhetoric; we just said the words out loud and kinda tried to figure it out. 
We made a lot of progress, but it was astoundingly difficult to communicate the ideas; now, we can just write out a single formula and communicate five pages worth of text in two lines of equations. That ability to chunk concepts up, figure out how to express them better, and then build up ideas on top of them… Whatever that is, that’s what I’m talking about here. It’s a thing for absolutely anything that requires collective understanding, thinking, and building up of concepts over time.&lt;/p&gt;&lt;p&gt;We don’t really have a name for that, but I’m going to call it abstraction for now.&lt;/p&gt;&lt;h3 id=&quot;the-evolution-of-an-abstraction&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#the-evolution-of-an-abstraction&quot;&gt;&lt;span&gt;The Evolution of an Abstraction&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Personally, one of the success indicators I use to figure out whether or not I’m building a useful abstraction for others is whether or not people can guess how to solve a problem that doesn’t quite fit an existing pattern and then do it correctly in a way that works. 
In other words, I am explicitly thinking about the emergent behaviour(s) and trying to craft things that result in the desired emergent outcomes rather than thinking too hard about the first order results.&lt;/p&gt;&lt;p&gt;So when thinking about abstractions, think about the emergent behaviour, and think about whether or not people can intuitively explore the solution space provided.&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre&gt;  Desirable Emergent Behavior
+ Intuitive Solution Space
= On the right track
&lt;/pre&gt;&lt;p&gt;When you start off trying to build an abstraction that’s meaningful, you’re not really going to have something that resembles an abstraction for a surprisingly long time. It’s going to look a lot more like an MVP, or a prototype, or a proof of concept, or a ritual, or a beta product. Abstractions take time, and they go through stages as you flesh them out and figure out what they look like and how they actually fit into everything.&lt;/p&gt;&lt;p&gt;In my experience, there’s typically around three stages:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The MVP / Prototype / Proof of Concept thingy&lt;/li&gt;&lt;li&gt;Chaos. Sobbing. Here Be Dragons. ???&lt;/li&gt;&lt;li&gt;An Abstraction!&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Naturally, this doesn’t fill a lot of people with hope, because now the first thought they have is how to cross that giant gap in-between “some MVP prototype thingy” and “an internalised concept that people utilise seamlessly to navigate a solution space in a way that results in desired outcomes.”&lt;/p&gt;&lt;p&gt;So, uh, how do?&lt;/p&gt;&lt;h2 id=&quot;crossing-the-mvp-chasm&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#crossing-the-mvp-chasm&quot;&gt;&lt;span&gt;Crossing the MVP Chasm&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;What do you do when you have too much to do, too little time to do it, and not enough resources or people to do it with?&lt;/p&gt;&lt;p&gt;Well, step zero is usually crying.&lt;/p&gt;&lt;p&gt;Seriously, feel free to cry and vent and get emotional about it; I mean it! It’s hard to be impossibly resource constrained, and it sucks, and it’s gonna feel like you’re being set up to fail, and it’s absolutely okay to have a very human and natural reaction to finding yourself in that situation. 
Listening to your emotions now and feeling them will help you regulate yourself emotionally before you get into the difficult work of alignment building.&lt;/p&gt;&lt;p&gt;Second, get a support channel together; this probably won’t be your manager, or the team you’re working with, although they should ideally be quite supportive and helpful! You need to be able to talk to someone (ideally more than one) about the struggles you’re facing and get an objective opinion on how you’re working with others. This is honestly deeply crucial; what you’re fundamentally doing here is you’re switching from thinking of things in terms of a set of functionality or a list of features or implementation and glue-code into figuring out how to get an entire engineering organisation to literally change their language and how they conceptualise and approach an entire problem space.&lt;/p&gt;&lt;p&gt;That. is. very. fucking. hard.&lt;/p&gt;&lt;p&gt;And it will burn you the &lt;em&gt;fuck out&lt;/em&gt; if you’re not prepared.&lt;/p&gt;&lt;p&gt;Finally, realise that this is essentially change agency and so you become effective by learning a bag of tricks and mostly throwing them heuristically at the wall until you find something that works, and then roll with it. Here’s my bag of tricks.&lt;/p&gt;&lt;h3 id=&quot;make-change-easy-to-handle-not-easy-to-do&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#make-change-easy-to-handle-not-easy-to-do&quot;&gt;&lt;span&gt;Make Change Easy to Handle, Not Easy to Do&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Lots of people get caught up making change easy to do, but the real secret is actually making change easy to handle. 
The sooner you can get to a point where the &lt;code&gt;N+1&lt;/code&gt;th iteration can be propagated out seamlessly everywhere, the sooner you start to win, cause then you can start shipping stuff knowing you can patch it up or extend it or modify it later.&lt;/p&gt;&lt;p&gt;Some techniques to do this are:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Write “validation” scripts that just check if something was done right&lt;/li&gt;&lt;li&gt;Write debugging or other “how do I investigate X,Y,Z” flow charts to capture how you think about things&lt;/li&gt;&lt;li&gt;Implement Continuous Delivery&lt;/li&gt;&lt;li&gt;Write a &lt;a href=&quot;https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-the-key-to-gradual-automation/&quot;&gt;“do-nothing” script&lt;/a&gt; to capture how you hack on something&lt;/li&gt;&lt;li&gt;Make your stuff easy to apply a patch to. Being able to run a “merge this doc with this patch” in a for loop on a directory of files is incredibly powerful&lt;/li&gt;&lt;li&gt;Be able to revert or roll-back changes&lt;/li&gt;&lt;li&gt;Make things bootstrappable and regularly test that it is bootstrappable&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;define-the-&amp;quot;first&amp;quot;-mvp-as-ability-to-iterate&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#define-the-%22first%22-mvp-as-ability-to-iterate&quot;&gt;&lt;span&gt;Define the “first” MVP as Ability to Iterate&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;MVP is a nebulous concept. Nobody really tells you this, but there are two eternal truths to an MVP:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;It’s never an MVP because it’s always missing critical functionality&lt;/li&gt;&lt;li&gt;Any MVP, when shipped, no matter how feature incomplete, immediately becomes load bearing&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;So, fuck it, ship it as soon as you can iterate on it in place. 
MVP, then, becomes “ability to complete the project” rather than “ability to use the project,” and the difference in framing there is &lt;em&gt;crucial&lt;/em&gt;.&lt;/p&gt;&lt;h3 id=&quot;areweplatformyet&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#areweplatformyet&quot;&gt;&lt;span&gt;AreWePlatformYet&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;You should have an “are we X yet” style presentation somewhere. It should be able to be automatically updated. It should be unambiguous.&lt;/p&gt;&lt;p&gt;One trap I’ve fallen into with this in the past is relying on stakeholders as the final “done” of the puzzle. While that is technically correct, the important part of progress, politically, is having a script that returns “green checkmark, all systems go.” Because external stakeholders, it turns out, can’t actually evaluate the done-ness of a result until &lt;em&gt;far&lt;/em&gt; after it is actually completed. Think: “Oh I’ve been using it for a few months now, I guess it works.”&lt;/p&gt;&lt;p&gt;Naturally, that’s not going to fly for a project that has a deadline because it’s the most surefire way to ensure you miss your deadline. So, &lt;code&gt;https://arewedoneyet.YOUR_COMPANY.TLD/YOUR_PROJECT&lt;/code&gt;. Make it, own it, update it, get it ready before you even write your code.&lt;/p&gt;&lt;p&gt;Does it have to be a specific TLD? Nah, it’s really about what works for the company. Some companies have a high trust culture of internal wikis and so the wiki is absolutely the right place for that (in which case the URL could be a CNAME redirect to the wiki). 
That said, I find a ton of value in having an extremely visible url that’s stable, because it acts as an external interface point for others in the company.&lt;/p&gt;&lt;p&gt;When you give someone a link to an internal wiki, you’re giving them information, but when you’re giving them a link to “arewedoneyet.COMPANY” you’re saying “you can check this page and treat it as the truth, I promise to update it, and it will be extremely usable and understandable to everyone who needs it. This page is for YOU, not for me.” It’s actually that last point that’s the most important. Wiki pages and roadmaps are useful for those doing the project or managing it, but for people not in the know? Those without context? It can be inscrutable.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://www.arewewebyet.org/&quot;&gt;The “are we web yet” page by the Rust community is a prime example of this&lt;/a&gt;. Giant bold caption, answering the question in one sentence. A C-level exec could stop there and get an answer in 5 seconds. Beautiful! For those who need more, the answer is available from the perspective of the consumer, which is invaluable. For example, the first question is “can this replace laravel” and the answer is “not yet”. You find that out in 10 seconds, and the hundreds of people building hundreds of projects and thousands of lines of code with all of their own committees and PMs and repos and everything? That all gets abstracted into something eminently usable.&lt;/p&gt;&lt;p&gt;Plus, the people using the project can immediately see how well you understand their needs. If they’re confused by the page, then that’s a huge warning sign because it means the project communication isn’t clear somewhere. 
But with internal docs or wiki pages, people can feel like “oh well that’s okay, it’s more for them and not for me”; the website really acts as a forcing function to align people and put that emphasis where it belongs (in my opinion).&lt;/p&gt;&lt;p&gt;I have a few regrets about the projects I’ve architected and led, and not having this be a real, fully updating, automated website has consistently been one of them. It’s easy to think “okay but that’s a lotta effort”, but it’s significantly less effort than spending weeks in meetings convincing people that you’re making the progress that you’re making and getting them to understand the shape of the project. Linking to an outdated wiki page and saying “ok all the information on here is mostly wrong but this bit is right and this…” is just fucking embarrassing for both you and the person your boss is going to relay that information to. It’s so much more than a roadmap, it’s more like a blend of marketing and sales and user research than a roadmap (shoutout to &lt;a href=&quot;https://asbjornbrandt.com/&quot;&gt;Asbjørn Brandt&lt;/a&gt; for giving me the inspiration that this is more like marketing than a roadmap). Do yourself a favor, make the website.&lt;/p&gt;&lt;h3 id=&quot;make-a-giant-whammy-reset-button&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#make-a-giant-whammy-reset-button&quot;&gt;&lt;span&gt;Make a giant whammy reset button&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;When you’re rapidly iterating on things, you’re not going to really understand how things go together until after they go together. But after you’ve set up half of the work, you’re probably going to run into something that’s slightly stuck. It’s probably due to caches, or due to a retry loop, or due to something pointing slightly wrong somewhere else and not able to update. 
It might even be due to the bug that you fixed but you can’t actually really fix it cause, well, you fixed the bug and now you gotta unstick things.&lt;/p&gt;&lt;p&gt;Giant.&lt;br&gt; Whammy.&lt;br&gt; Reset.&lt;br&gt; Button.&lt;/p&gt;&lt;p&gt;I wish I had built this in almost every project I’ve worked on, and I wish more products built and considered this in their own implementations. Is the thing broken? janky? Is it a you or them problem? Who fuckin knows.&lt;/p&gt;&lt;p&gt;Whammy reset button lets you know very definitively whether or not it’s you, and it effectively acts as a soft bootstrapping mechanism. It’s super awesome, everything should have one.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Hazel, you cannot seriously be advocating for a giant, unauthenticated “drop all tables and wipe the caches and reset everything to scratch and terminate all existing processes and rest…”&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;I absolutely am. Obviously, don’t turn it on by default, or in production, but you should have one because it makes the iteration vastly smoother and it gives you a very important “when all else fails…” step in any debugging workflow.&lt;/p&gt;&lt;h3 id=&quot;provide-mental-models-for-evaluating-timelines&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#provide-mental-models-for-evaluating-timelines&quot;&gt;&lt;span&gt;Provide Mental Models For Evaluating Timelines&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;You’re going to run into people who disagree with how you’re doing things, you’re going to run into people who agree but are slightly confused, you’re going to run into people who “don’t get it”, and you’re going to run into people who are acting adversarially against the project goals (often not maliciously).&lt;/p&gt;&lt;p&gt;The solution? Progress updates! 
Just kidding, nobody reads those.&lt;/p&gt;&lt;p&gt;This one is purely a cultural and political one. The real question behind this is twofold, with a third component:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;How do you justify the resources (time, headcount, money, etc) you’ve consumed so far&lt;/li&gt;&lt;li&gt;How do you justify the additional resources you think you need to complete the project&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;And then, given those two factors, how do you explain and motivate any differences in those two that have happened since the last time you communicated this. Ideally, not only is the motivation clear and the explanation understandable, but the expectations you’ve set around the resources consumed and predicted don’t cause uncomfortable questions for leadership down the road.&lt;/p&gt;&lt;p&gt;Here’s how &lt;em&gt;not&lt;/em&gt; to do it:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“hey yeah so, uhh, oh yeah, thanks for asking… yeah it turns out the project took like 5 times longer to complete so far and we’ve done about 20% of the progress we anticipated by now. Oh this is the first time you’re hearing this? yeah we could’ve communicated that more proactively, and we tried, but this ended up being a lot more exploratory than we thought, and…”&lt;/p&gt;&lt;p&gt;– Somebody about to do real bad on their next performance review&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;A better way to do it, again, involves understanding the culture of the company and its leadership, how they value progress, what they reward, and so on. 
This is something that, honestly, I’m still getting better at; it’s probably one of the more difficult and harder to define things, especially since it’s not actually the same as the AreWePlatformYet question, even though it really feels the same.&lt;/p&gt;&lt;p&gt;To me, this really boils down to “how do I teach leadership an effective mental model around how work effectively happens in the technical domain I’m in” and “how do I help them get an intuition for when things are going well and when they are not.” Leaders want to unblock people, accelerate work, de-risk outcomes, and globally prioritise a shared pool of limited resources so that the company objectives happen most effectively. If you don’t teach them how to do that well, they’re going to come up with their own mental model for this and, odds are, you are &lt;em&gt;not&lt;/em&gt; gonna like it. This is not a “leadership bad, they so dum” thing; this is a “leadership needs to constantly balance apples to oranges to bananas to pears globally, and do so in an empathetic and equitable manner” thing. So help them do that.&lt;/p&gt;&lt;p&gt;How I’m going to try doing this next time is by establishing a sort of “explore, expand, extract” style model for types of work.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;When we’re in the explore phase, progress is unclear, and we’re throwing shit at the wall to see what sticks. &lt;ul&gt;&lt;li&gt;Investment needs to be low effort, low friction, high iteration speed&lt;/li&gt;&lt;li&gt;Deadlines don’t exist, but timeboxing does&lt;/li&gt;&lt;li&gt;Problems? who fuckin knows&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;When we’re in the expand phase, we’ve figured out a path forward to address functional requirements and we’re cranking it up as fast as it can go. 
&lt;ul&gt;&lt;li&gt;Investment needs to be high effort, low friction, high iteration speed&lt;/li&gt;&lt;li&gt;Deadlines exist, but they’re all just “asap” and prioritization isn’t really a thing&lt;/li&gt;&lt;li&gt;Problems can now be enumerated and burndown charts might exist, but it’s a Done/Not-Done granularity and progress may flap&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;When we’re in the extract phase, we’ve scaled things out to where it’s functional, and now we’re optimizing and balancing non-functional requirements. &lt;ul&gt;&lt;li&gt;Investment needs to be “just the right amount” of effort, “appropriate” friction, and iteration speed can be low&lt;/li&gt;&lt;li&gt;Deadlines exist, can be prioritised, and can be depended on by external stakeholders in the company&lt;/li&gt;&lt;li&gt;Problems can be resourced and prioritised and predicted&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Then updates can get split across the three categories and what you’re looking for is for stuff to gradually move “up” the ladder of explore to expand to extract. Honestly, this seems solid to me, but I have no idea how effective it would actually be in the real world.&lt;/p&gt;&lt;p&gt;One important detail: this has to also be bi-directional. One failure mode I’ve seen before is that, despite how clearly I might communicate expected timelines, that doesn’t mean that the leaders or stakeholders in question will repeat those accurately. You need to also get a very clear picture of what &lt;em&gt;their&lt;/em&gt; understanding of your timeline is and what their understanding of the justification(s) is. If they aren’t persuaded by the justifications, you need to know that immediately. 
That said, their understanding, satisfaction, and interpretation are all information you will have to proactively seek out because their dissatisfaction might be unconscious or a fuzzy hunch feeling rather than an explicit disagreement.&lt;/p&gt;&lt;h3 id=&quot;understand-what-the-people-want&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#understand-what-the-people-want&quot;&gt;&lt;span&gt;Understand What The People Want&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Stakeholders almost never actually communicate about the things they want in an accurate way. Even if they do, there are often other things inside of those asks or left unstated that can end up mattering more to successful communication than delivering the actual asks. One thing in particular that’s very important to do is to help people understand the solution space and how to better judge the quality of your navigating through it, especially as that relates to fulfilling their needs. Importantly, this is very different than evaluating timelines.&lt;/p&gt;&lt;p&gt;To rephrase: The goal here is to help people understand their options and communicate about how well you’re delivering their needs, which requires them having an ability to understand those options and judge how well the implementation is going to do what they need it to do. This isn’t necessarily adversarial! It often isn’t! But if done in a culture with low psychological safety, this will absolutely be the most adversarial and emotionally taxing part of your journey in many ways. Not because it’s difficult, but because this is going to be where you might have bad faith actors coming in and refusing to acknowledge that you’re building a solution that works for them. 
If your project or abstraction fails for a reason that feels systemically or deeply unfair, something in here is a likely culprit (and again: it doesn’t have to be malicious or intentional).&lt;/p&gt;&lt;p&gt;You have a few different main groups of people who are going to care about this:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Future users&lt;/li&gt;&lt;li&gt;Your leadership&lt;/li&gt;&lt;li&gt;Stakeholders&lt;/li&gt;&lt;li&gt;“The market”&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Digging into those, I have a few different things I think of when figuring out what people actually want and how you can help them succeed.&lt;/p&gt;&lt;h4 id=&quot;future-users&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#future-users&quot;&gt;&lt;span&gt;Future users&lt;/span&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Future users are going to be the people who are most impacted by everything you do and how well you deliver on the abstraction. They’re going to be the ones who are using it and speaking that language day in and day out, so how well they can understand it, articulate it, and use it to their advantage really matters. The single fundamental principle here is making sure that they can see and hold the tangible idea of the abstraction and play around with it as soon as possible. 
The supporting principle here is that it’s deeply important to be able to give them multiple different avenues for feedback, and to be able to incrementally iterate on that feedback.&lt;/p&gt;&lt;p&gt;Most of all, however, the biggest thing that you need is to cultivate psychological safety in the users so that they feel able to experiment with the abstraction, tell you their true thoughts about it, and help you shape it to better fit the needs of them and the company.&lt;/p&gt;&lt;p&gt;Here are some concrete ideas you can do to help make this more successful:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Create a feedback channel in your company’s chat application of choice (or a mailing list)&lt;/li&gt;&lt;li&gt;Host office hours where you answer live Q&amp;A and go over the abstraction and some of the things that it enables&lt;/li&gt;&lt;li&gt;Find a team and migrate something they use to utilise the abstraction, use that to find weak points you haven’t considered, and then address those&lt;/li&gt;&lt;/ul&gt;&lt;h4 id=&quot;your-leadership&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#your-leadership&quot;&gt;&lt;span&gt;Your Leadership&lt;/span&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Your leadership are going to be the people who are most directly in charge of everything you do and while they aren’t responsible &lt;em&gt;for&lt;/em&gt; the creation of the abstraction, they are responsible for the outcomes of that abstraction. If an abstraction here helps shape and articulate the emergent behaviour of a company and how it navigates the solution space, then it stands to reason that an abstraction is actually one of the most vital things leadership cares about. 
Unlike projects, abstractions here are directly a thing that leadership cares about; your success in being able to create an abstraction that results in emergent behaviour that’s aligned with the company goals is a direct success criterion for them.&lt;/p&gt;&lt;p&gt;Which means, if this goes badly, they’re going to behave like they’re taking it &lt;em&gt;awfully&lt;/em&gt; personally. Not in a bad way, necessarily, but this gets to the heart of the “enabling” aspect of leadership in a way that few other things do.&lt;/p&gt;&lt;p&gt;Success here fundamentally means that you understand how your leadership thinks about those emergent behaviours and what leading indicators they utilise to understand whether or not the right emergent behaviour is shaping out. Additionally, you’re going to be looking for the “unsaid” things that they’re concerned about; often when leaders talk about a concern, there’s a hidden one underneath it that’s more valuable, and you’re going to need to extract that one out if you want to be able to help them help you succeed.&lt;/p&gt;&lt;p&gt;Here are some concrete ideas you can do to help make this more successful:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Create an explicit value stream mapping of the emergent behaviour&lt;/li&gt;&lt;li&gt;Come up with a list of “&lt;a href=&quot;https://designtom.medium.com/how-to-do-discovery-and-delivery-at-the-same-time-with-pivot-triggers-3dada51c58a3&quot;&gt;pivot triggers&lt;/a&gt;”, and identify what the pivot options are after the pivot trigger trips&lt;/li&gt;&lt;li&gt;Identify a list of concrete actions or events that are considered to be manifestations of the ideal emergent behaviour. 
This one is particularly valuable because you’re calibrating &lt;em&gt;both&lt;/em&gt; of your abilities to predict organisational responses to change&lt;/li&gt;&lt;li&gt;Figure out ways to convert value, cost, and trade-offs from one value system into another (i.e. currency vs time, headcount vs opportunity cost, whatever works for you)&lt;/li&gt;&lt;/ul&gt;&lt;h4 id=&quot;stakeholders&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#stakeholders&quot;&gt;&lt;span&gt;Stakeholders&lt;/span&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;The stakeholders are the people that aren’t directly in charge of the leadership, and they’re also not the direct users, but they’re people who are going to be speaking the language of the abstraction and they’re going to need to be able to communicate with the users who are going to be using it on a daily basis. A huge, yet often overlooked, aspect of success in the creation of an abstraction is whether or not the stakeholders involved can learn how to communicate with the abstraction. This doesn’t mean that they understand it: domain experts often need to build abstractions that don’t translate well outside of said domain, but that doesn’t mean they can’t be used to communicate outside the domain. In fact, the ability to communicate the abstraction outside of the domain in which it “belongs” is likely one of the most important success criteria for determining the longevity of the abstraction in the organisation. 
As the abstraction goes from being an innovation to a novelty to a product to a commodity, you’re going to see the scope of who utilises it in the company widen over time.&lt;/p&gt;&lt;p&gt;Stepping back for a brief moment, though, what’s really happening here is that you have the people who are going to be using the abstraction, and the people that are building the abstraction; importantly, they are two separate groups of people. One of the biggest force multipliers of programming comes from the fact that those two groups are usually the &lt;em&gt;same group&lt;/em&gt;. When they are not the same group, a ton of the magic of programming goes away and you’re going to need to learn how to actually think in terms of group and social dynamics and investigate ways to improve those.&lt;/p&gt;&lt;p&gt;A very handy thing to utilise for this is &lt;a href=&quot;https://www.liberatingstructures.com/ls-menu/&quot;&gt;Liberating Structures&lt;/a&gt;. Liberating structures work at the level of how people meet, plan, decide, and work together in order to make things go better. 
In other words, they’re exactly what you want and they help you build up the layers of understanding an abstraction bit by bit &lt;em&gt;with&lt;/em&gt; the people who are going to be using it, rather than despite them.&lt;/p&gt;&lt;p&gt;Here are some concrete ideas you can do to help make this more successful:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Invite and solicit explicit stakeholders to office hours or focus group sessions in order to determine whether or not the abstraction is “usable” by them&lt;/li&gt;&lt;li&gt;Create a set of mental models or analogies that future users can use with stakeholders, and then make sure they can use those&lt;/li&gt;&lt;li&gt;Take advantage of &lt;a href=&quot;https://www.liberatingstructures.com/2-impromptu-networking/&quot;&gt;impromptu networking&lt;/a&gt; to get people engaged and help surface the hidden knowledge and vocabulary that they’re already using&lt;/li&gt;&lt;li&gt;A method called &lt;a href=&quot;https://www.liberatingstructures.com/11-shift-share%20/&quot;&gt;Shift &amp; Share&lt;/a&gt; might be just the thing you need in order to get people from “this is the abstraction” to “this is how I use it” rapidly and organically&lt;/li&gt;&lt;li&gt;One of my favorites is this one: creating an &lt;a href=&quot;https://www.liberatingstructures.com/27-agreement-certainty-matrix/&quot;&gt;Agreement Certainty matrix&lt;/a&gt;. A challenge that I often encounter is that it’s difficult for people to understand what is a “simple” notion in an abstraction, and what is a “give me a research team and five years” notion. This helps map that space out! 
And importantly, if you find it confusing to map onto this diagram, it’s an excellent sign that the abstraction isn’t a good one.&lt;/li&gt;&lt;/ul&gt;&lt;h4 id=&quot;&amp;quot;the-market&amp;quot;&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#%22the-market%22&quot;&gt;&lt;span&gt;“The Market”&lt;/span&gt;&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;No abstraction is complete without considering the context in which it resides. In the same way, no abstraction can be built without considering the market within which it resides. Success here means a few things:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;You timed the market well and built the abstraction at the right time&lt;/li&gt;&lt;li&gt;You developed the right abstraction for the right market&lt;/li&gt;&lt;li&gt;The incentives and collaboration structures you built up around the abstraction made sense in the market you’re considering&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;It’s also worth noting that when I say market here, there are actually several markets and they can all be in play at the same time. You have the financial markets, socioeconomic market, political market, popularity market, and so on. Any system in which you exchange assets of value counts as a market, and you’re often in multiple markets at the same time. We &lt;em&gt;tend&lt;/em&gt; to just consider the financial market because it’s convenient and acts as a proxy for many other markets, but inside a company you have a lot more markets to consider that might be just as valuable, if not more so.&lt;/p&gt;&lt;p&gt;In other words, what we’re really talking about here is a system, not a market; people just tend to get a mental freeze when you talk about The System or Systems Thinking, so explaining it in terms of several different markets where you have exchanges of value is much more approachable for a lot of people. They’re the same thing, though. 
When I’m personally thinking about the system or market, I often want to go backwards; I think about how I can influence change, and then I think about what the outcomes are, and then I start winding time backwards and figuring out what might go wrong, and then I find out what interventions can be enacted in order to minimise the wrong futures and encourage the right futures to grow. Importantly, you’re also going to have to figure out the introduction and trigger inflection points: an introduction inflection point is “the conditions at which it makes most sense to introduce this” and the trigger inflection point is “the conditions at which the intervention needs to be re-evaluated, terminated, or modified.”&lt;/p&gt;&lt;p&gt;So, the steps, for me, are:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Figure out what my change points are and find the ones that are applicable for me in my current “market”, with my current resources&lt;/li&gt;&lt;li&gt;Identify the outcomes, both ideal and not ideal, that could happen from a change&lt;/li&gt;&lt;li&gt;Wind time back to identify what interventions could’ve been attempted&lt;/li&gt;&lt;li&gt;Identify the introduction inflection points and the trigger inflection points for the interventions&lt;/li&gt;&lt;li&gt;Execute the interventions when the introduction inflection point occurs, then monitor the progress and wait for the triggers to trip&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Donella Meadows has a fantastic blog post about leverage points that can influence changes &lt;a href=&quot;https://donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/&quot;&gt;here&lt;/a&gt;, which is well worth the read. 
I’m going to list them in reverse order (they list them from least effective to most effective; I will list them from most effective to least effective).&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The power to transcend paradigms.&lt;/li&gt;&lt;li&gt;The mindset or paradigm out of which the system — its goals, structure, rules, delays, parameters — arises.&lt;/li&gt;&lt;li&gt;The goals of the system.&lt;/li&gt;&lt;li&gt;The power to add, change, evolve, or self-organise system structure.&lt;/li&gt;&lt;li&gt;The rules of the system (such as incentives, punishments, constraints).&lt;/li&gt;&lt;li&gt;The structure of information flows (who does and does not have access to information).&lt;/li&gt;&lt;li&gt;The gain around driving positive feedback loops.&lt;/li&gt;&lt;li&gt;The strength of negative feedback loops, relative to the impacts they are trying to correct against.&lt;/li&gt;&lt;li&gt;The lengths of delays, relative to the rate of system change.&lt;/li&gt;&lt;li&gt;The structure of material stocks and flows (such as transport networks, population age structures).&lt;/li&gt;&lt;li&gt;The sizes of buffers and other stabilizing stocks, relative to their flows.&lt;/li&gt;&lt;li&gt;Constants, parameters, numbers (such as subsidies, taxes, standards).&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Which is handy, and awesome, but that might feel like a lot of things and it can feel overwhelming to look at this list; here’s a simplified version of the list that strips out all of the items that generally require authority:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The gain around driving positive feedback loops.&lt;/li&gt;&lt;li&gt;The strength of negative feedback loops, relative to the impacts they are trying to correct against.&lt;/li&gt;&lt;li&gt;The structure of material stocks and flows (such as transport networks, population age structures).&lt;/li&gt;&lt;li&gt;Constants, parameters, numbers (such as subsidies, taxes, standards).&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;If you’re looking at this list and you’re going 
“huh, it seems like you took out basically all the most impactful stuff,” you would be right; there’s a reason that empowering others is so critical to effective leadership. Push the ability to change a system &lt;em&gt;down&lt;/em&gt; the authority ladder and watch as vast amounts of issues magically disappear right before your very eyes. However, there’s a pretty cool thing here: when you build an abstraction, this is one of the very few times that anyone in the company has the power to directly touch and influence a lot of the most impactful leverage points in a system.&lt;/p&gt;&lt;p&gt;Therefore, for &lt;em&gt;abstractions&lt;/em&gt; specifically, the real list is more like this:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The power to transcend paradigms.&lt;/li&gt;&lt;li&gt;The mindset or paradigm out of which the system — its goals, structure, rules, delays, parameters — arises.&lt;/li&gt;&lt;li&gt;The goals of the system.&lt;/li&gt;&lt;li&gt;The rules of the system (such as incentives, punishments, constraints).&lt;/li&gt;&lt;li&gt;The structure of information flows (who does and does not have access to information).&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;That’s the real power of abstraction: when done right, it changes the very fabric of how a collective perceives knowledge itself.&lt;/p&gt;&lt;h2 id=&quot;the-balancing-act-in-practice&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#the-balancing-act-in-practice&quot;&gt;&lt;span&gt;The Balancing Act in Practice&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Well, that was a lot of words. I know I’m a verbose writer, but whew.&lt;/p&gt;&lt;p&gt;Feel free to take some time and grab another cup of coffee before digging into the next section. 
We’re going to be going over how this actually works in practice, using a semi mashed up running example of abstractions I’ve architected and implemented at various points in my career. I’ll be referring to “the company” or “a company”, or “a project”, but that is more of a linguistic shorthand than a reference to a &lt;em&gt;specific&lt;/em&gt; company or project &lt;s&gt;and this paragraph definitely does not exist in order to reduce my legal liability.&lt;/s&gt;&lt;/p&gt;&lt;p&gt;One of the most difficult things about dealing with abstractions, technical leadership, and honestly leadership in general, is reckoning with the absolutely massive difference between the nice and neatly bundled theory vs the messy non-linear real world. So, this is gonna make the blog post a lot longer, but it’d be quite the disservice to talk about how something works without actually walking through how it worked out for me and the lessons I learned along the way.&lt;/p&gt;&lt;h3 id=&quot;context-and-circumstances&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#context-and-circumstances&quot;&gt;&lt;span&gt;Context and Circumstances&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s an example of something that I went through at a previous company I worked at. 
(And again: Whenever I reference “a company” or “a project”, it’s really an amalgamation of several companies, projects, and such, plus some details changed, and so on…)&lt;/p&gt;&lt;p&gt;The company context was:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Culturally &lt;ul&gt;&lt;li&gt;Individual autonomy was prioritised over realised productivity.&lt;/li&gt;&lt;li&gt;There was a hero culture and employees who shoved things through and got it done while burning out doing 80+ hour weeks were idolised.&lt;/li&gt;&lt;li&gt;The shared understanding of the product among leadership was “this should be a very simple piece of software, all of our complexity is in the sales and in hitting a critical point for network effects to kick in.”&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Strategically &lt;ul&gt;&lt;li&gt;The Go To Market strategy heavily leaned on specific one-on-one engagements with customers.&lt;/li&gt;&lt;li&gt;Talent acquisition revolved around hiring “undiscovered and potentially inexperienced smart generalists with lots of potential” and having them do everything end to end.&lt;/li&gt;&lt;li&gt;Product diversification was simultaneously a top concern and a low priority.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Politically &lt;ul&gt;&lt;li&gt;Work that wasn’t explicitly a feature designed to close a sale was heavily de-prioritised and under-resourced.&lt;/li&gt;&lt;li&gt;Alpha mentality and individualism were rewarded.&lt;/li&gt;&lt;li&gt;“Disagree and commit” was more like “publicly agree, privately do your own thing anyway” in practice.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Operationally &lt;ul&gt;&lt;li&gt;Software ran in a heavily regulated environment.&lt;/li&gt;&lt;li&gt;Multi-cloud was embraced as a strategic need out of necessity.&lt;/li&gt;&lt;li&gt;Mergers &amp; Acquisitions were utilised heavily as a growth mechanism, and so the ability to accommodate diversity across ways of working and tooling was 
required.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Some of the circumstances (for the CTO org), then, were:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Software lived in multiple different monorepos.&lt;/li&gt;&lt;li&gt;“Lots of tech debt” was a disproportionately heavy complaint.&lt;/li&gt;&lt;li&gt;Everything was built out of very short-term solutions cobbled together on the fly.&lt;/li&gt;&lt;li&gt;Time pressure dominated every concern.&lt;/li&gt;&lt;li&gt;The complexity of the solution space was growing exponentially and was subject to power law distributions.&lt;/li&gt;&lt;li&gt;“One way to do things” solutions were a non starter.&lt;/li&gt;&lt;li&gt;Top down mandates or enforcement were attempted, but were largely unsuccessful except in very rare cases.&lt;/li&gt;&lt;li&gt;Nobody was aware of what any other team was doing.&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;articulating-the-problem&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#articulating-the-problem&quot;&gt;&lt;span&gt;Articulating the Problem&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The problem(s) to solve, in question, was:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;How do we deploy software safely, rapidly, and reliably&lt;/li&gt;&lt;li&gt;How do we make it so that a centralised function can build and improve core infrastructure&lt;/li&gt;&lt;li&gt;How do we get to the point where we can prepare a multi-cloud playbook for integrating acquisitions and mergers&lt;/li&gt;&lt;li&gt;How do we enable change management and migrations without interrupting engineers&lt;/li&gt;&lt;li&gt;How do we do this in a way that is compatible with a highly regulated industry&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The existing solution had been cobbled together in various different ways, and wasn’t effectively meeting the above concerns (as it had never been designed for those concerns). 
The few commonalities, if they existed at all, were &lt;em&gt;generally&lt;/em&gt;:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;GitHub Actions was involved&lt;/li&gt;&lt;li&gt;Kubernetes was the platform of choice&lt;/li&gt;&lt;li&gt;Much of the complexity was buried in orphaned thousands-of-lines-long bash scripts&lt;/li&gt;&lt;li&gt;Most actual functionality was invoked 3-5 layers of indirection deep&lt;/li&gt;&lt;li&gt;Various CLI tools were used in an ad-hoc manner&lt;/li&gt;&lt;li&gt;Deployment mechanisms were imperative, mutable, and stateful in mindset&lt;/li&gt;&lt;li&gt;There was a complete lack of standardization around anything&lt;/li&gt;&lt;li&gt;Engineers commonly complained that the solution was brittle, easily broken, and poorly understood&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;In addition, one common thing engineers said was they didn’t want to deploy Yet Another Thing and maintain it, so there was an extreme reluctance to consider “additional complexity” (measured only by number of services running or number of integration points). As a consequence, things were horrifically complex and inefficient.&lt;/p&gt;&lt;h3 id=&quot;this-is-finetm&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#this-is-finetm&quot;&gt;&lt;span&gt;This Is Fine™&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Which, honestly, is fine; if you give a bunch of engineers the power to do whatever they want to, no training in how to solve the problem, and no time to solve it, you’re going to end up with this situation. And that can be super workable if you’re willing to accept the outcome! Not everything needs to be perfect, or well designed, or even coherent; sometimes stuff just needs to sorta-kinda work most of the time.&lt;/p&gt;&lt;p&gt;It can be very uncomfortable for engineers to encounter something that feels like tech debt, or feels broken, or inefficient, and not really be allowed to fix it. I get that! 
Deeply! But “fixing” has to involve the entire context of the company, and sometimes things aren’t actually a problem or a priority for the company, even if they feel like a huge problem to an engineer. One of the difficult parts of being a technical leader is being able to effectively advocate for problems, while also setting them in their proper context so that they can be understood outside of the CTO org.&lt;/p&gt;&lt;p&gt;In this case, the limits of the solution had been reached, and a comprehensive abstraction around deployment needed to be developed. Naturally, “deployment” in this case really meant about 5-10 different concepts and capabilities in a trench-coat.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Deploy the thing&lt;/li&gt;&lt;li&gt;Progressively release the thing&lt;/li&gt;&lt;li&gt;Be able to roll back or revert changes&lt;/li&gt;&lt;li&gt;“Break glass” capabilities&lt;/li&gt;&lt;li&gt;Secrets management&lt;/li&gt;&lt;li&gt;Environment variable management&lt;/li&gt;&lt;li&gt;Environment / Context aware modifications for deployments&lt;/li&gt;&lt;li&gt;Actionable notifications on progress, failure, and current status&lt;/li&gt;&lt;li&gt;Observability into the whole process&lt;/li&gt;&lt;li&gt;Psychological safety for the engineers: they needed to feel like they understood this and could own fixing their application and its deployment process&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;One important thing to understand is that while it seems obvious for me to lay the problem out like this, very few people at the company would’ve actually agreed with the entirety of the list. Which should be completely expected; abstractions are built over time, and language develops very slowly and unevenly among groups of people, so of course you will never run into a situation in which you can draw out the entire scope of a problem from your perspective and get others to immediately agree with it. 
That’s where alignment work comes in, and it’s one of the reasons why it’s so valuable, and why technical leadership should be more deeply understood and explicitly developed at companies who are solving technically complex problems.&lt;/p&gt;&lt;p&gt;What I ended up architecting for this was (among other things): a combination of various implementations, some glue code, and a multi-stage plan for migration, simplification, and learning. Crucially, this doesn’t really start looking like an abstraction until things are sufficiently far along, and that can be demoralizing to realise because sometimes you don’t get to see the abstraction take place even though it’s supposed to be there.&lt;/p&gt;&lt;h2 id=&quot;defining-the-abstraction-mvp&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#defining-the-abstraction-mvp&quot;&gt;&lt;span&gt;Defining the &lt;s&gt;Abstraction&lt;/s&gt; MVP&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Ok, sick, we have an idea of what the problem is and we have an idea of what the context is, so… Given all of that, what the fuck does a good abstraction look like? 
Turns out, that’s a really hard problem, so don’t solve that; instead, first ask yourself “what does the abstraction definitely not look like?”&lt;/p&gt;&lt;p&gt;Oh neat, I have some things right off the bat that disqualify certain implementations:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;One size fits all solutions&lt;/li&gt;&lt;li&gt;Requiring the product development teams to stop what they’re doing and refactor their code, their infrastructure, or their current pipelines&lt;/li&gt;&lt;li&gt;Making assumptions around how the deployment looks&lt;/li&gt;&lt;li&gt;Fully custom or in-house solutions&lt;/li&gt;&lt;li&gt;Anything that can’t be incrementally improved or delivered on is a non starter: there’s MVP functionality, but MVP+1 needs to be right around the corner&lt;/li&gt;&lt;li&gt;Inability to self service&lt;/li&gt;&lt;li&gt;Too many layers of indirection&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Now that we have this, we can also start to think about some things that we &lt;em&gt;do&lt;/em&gt; need.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;If the power law distribution is going to hold, we need solutions that have exponential leverage&lt;/li&gt;&lt;li&gt;If multi-cloud is a thing, the solution needs to not &lt;em&gt;require&lt;/em&gt; any cloud provider (but in practice it can assume certain defaults)&lt;/li&gt;&lt;li&gt;If autonomy is valued over productivity, the solution needs to allow teams to shove their own thing into it somehow&lt;/li&gt;&lt;li&gt;If self service is a need, people need a cookbook that lets them apply very standard and methodical solutions to common problems&lt;/li&gt;&lt;li&gt;If we don’t want custom / in-house solutions, we need to choose implementations that let us minimise custom glue&lt;/li&gt;&lt;li&gt;If we can’t have one size fits all and we need out of band deployment, the solution has to allow for “custom deployment code”&lt;/li&gt;&lt;li&gt;If we can’t require teams to refactor their code, it means we’re doing the migration, and so that needs to be possible 
with the team size and resources&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Once we have that, you also need to think about the order in which things are going to be built and how you can get all of the things you need even if you can’t build them all at once.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Hmm… If the platform team is doing the migration, the cookbook isn’t required immediately, because it only makes sense after things have been lifted over and teams want to touch stuff&lt;/li&gt;&lt;li&gt;If we need to do out of band migrations, patching things on the fly is definitely a day-one concern&lt;/li&gt;&lt;li&gt;If a single cloud or default(s) can be assumed, we just need to make sure the multi-cloud stuff is &lt;em&gt;possible&lt;/em&gt; and then we can worry about it later&lt;/li&gt;&lt;li&gt;If we focus on making the MVP+1 as easy to deploy and as rapid as possible, we can shrink the size of the MVP&lt;/li&gt;&lt;li&gt;If teams generally don’t have all 10 of the deployment capabilities/concepts, then we only really need the overlapping subset, which turns out to be &lt;em&gt;only&lt;/em&gt;: &lt;ul&gt;&lt;li&gt;deploy the thing&lt;/li&gt;&lt;li&gt;secrets management&lt;/li&gt;&lt;li&gt;modifying deployments based on environment&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Phew. That’s a lot. 
I’m going to skip through all of the rest of “how we chose specific implementation stuff” because, honestly, while it’s interesting, a lot of it comes down to “how can you make the argument compelling” and “what do you have experience with.”&lt;/p&gt;&lt;p&gt;The shape of the solution ended up being:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;ArgoCD as the deployment mechanism into Kubernetes&lt;/li&gt;&lt;li&gt;Reverse engineering the various different deployment mechanisms and shoehorning them into ArgoCD via Configuration Management Plugins&lt;/li&gt;&lt;li&gt;Utilizing the cloud hosted secrets manager with the &lt;em&gt;idea&lt;/em&gt; (not yet an abstraction) being “give people ways to embed magic strings into yaml that turn into secrets.” argocd-vault-plugin, helmfile, vals, and External Secrets Operator ended up being the main implementation choices.&lt;/li&gt;&lt;li&gt;Shoving sufficient amounts of metadata into ArgoCD so that ApplicationSets had enough information to suitably deploy the right thing into the right environment&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;And that was the MVP. Which, honestly, doesn’t look like an abstraction yet.&lt;/p&gt;&lt;p&gt;It shouldn’t, because it’s not one; it’s a hodgepodge of nonsense glued together in a way that lets you build the abstraction, but it isn’t quite defined enough to actually &lt;em&gt;be&lt;/em&gt; an abstraction. It’s still just a miserable pile of yaml transformation pipelines, but it now has the benefit of being an upstream solution with a robust community and available enterprise support if you need it. 
However, that is already a &lt;em&gt;massive&lt;/em&gt; win.&lt;/p&gt;&lt;p&gt;But let’s talk about what you need in order to actually go from MVP to A Real Abstraction.&lt;/p&gt;&lt;h2 id=&quot;defining-the-abstraction&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#defining-the-abstraction&quot;&gt;&lt;span&gt;Defining the Abstraction&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Earlier, we talked about abstractions a little bit, and came up with this concept of two things that let you know you’re on the right track.&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre&gt;  Desirable Emergent Behavior
+ Intuitive Solution Space
= On the right track
&lt;/pre&gt;&lt;p&gt;Digging more into that, let’s talk briefly about the desirable emergent behaviour we wanted and the indicators of whether or not a solution space is intuitive.&lt;/p&gt;&lt;h3 id=&quot;desirable-emergent-behavior&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#desirable-emergent-behavior&quot;&gt;&lt;span&gt;Desirable Emergent Behavior&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;For this particular context and circumstances, we really wanted a few things to happen as a result of this:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Discoverability of best practices happening organically&lt;/li&gt;&lt;li&gt;Reduction of repeated work&lt;/li&gt;&lt;li&gt;Service creation based on domain considerations rather than how much work it is to set up infrastructure&lt;/li&gt;&lt;li&gt;Tighter involvement of other functional areas with engineering as observability and uniformity goes up&lt;/li&gt;&lt;li&gt;Services refactored to take advantage of more ergonomic options that are now available&lt;/li&gt;&lt;li&gt;Usage of new or existing vendors goes up as integration points can now be done in a 1:many fashion&lt;/li&gt;&lt;li&gt;Exponential curve of complexity lowers&lt;/li&gt;&lt;li&gt;Duplicated services naturally start to merge&lt;/li&gt;&lt;li&gt;Engineers build personalised value-adds on top&lt;/li&gt;&lt;li&gt;help-desk requests stop re-occurring repeatedly for the same type/instance of problem&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;intuitive-solution-space&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#intuitive-solution-space&quot;&gt;&lt;span&gt;Intuitive Solution Space&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Some indicators of this being an intuitive solution space were:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Engineers guess at how to do secrets 
management and do it correctly&lt;/li&gt;&lt;li&gt;Questions on how to “Add one more thing to X” start to disappear&lt;/li&gt;&lt;li&gt;“Can you tell me what you tried” inquiries for debugging results in approaches that closely mirror what a platform team would attempt&lt;/li&gt;&lt;li&gt;“Is X possible” questions are novel and interesting and point to gaps in the functionality rather than gaps in documentation or a leaky abstraction&lt;/li&gt;&lt;li&gt;Instances of “I did the thing basically mostly right but forgot a weird edge-case or did it in the wrong place” rarely occur and can be systemically addressed&lt;/li&gt;&lt;/ul&gt;&lt;h3 id=&quot;details-of-the-abstraction&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#details-of-the-abstraction&quot;&gt;&lt;span&gt;Details of the Abstraction&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;As an end goal of the abstraction, the idea was:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;There’s a well understood concept of a Golden Path, where known solutions and known tech stacks have well-oiled ways of operating&lt;/li&gt;&lt;li&gt;Engineers can co-locate their code and their “basic building block stuffs” together&lt;/li&gt;&lt;li&gt;No infrastructure needs to be written if the golden path isn’t deviated from&lt;/li&gt;&lt;li&gt;Mild deviation doesn’t result in a cliff of “fuck you, you’re on your own, write it all from scratch”&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Here’s how that idea generally works. 
You have two paths: making a new service, and changing an existing one.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Making a new service &lt;ul&gt;&lt;li&gt;You can create a repository easily with a template&lt;/li&gt;&lt;li&gt;“Hello world” already works and you can deploy without any further steps&lt;/li&gt;&lt;li&gt;Everything else can follow the “existing service” workflow(s), which simplifies the amount of considerations both the product and platform teams have to contend with&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Changing an existing service &lt;ul&gt;&lt;li&gt;You can simply modify the &lt;code&gt;.your-company.deployment.yaml&lt;/code&gt; file&lt;/li&gt;&lt;li&gt;If that doesn’t work because you need something &lt;em&gt;overridden&lt;/em&gt;, you can define infrastructure code that will be merged with the existing setup and override it&lt;/li&gt;&lt;li&gt;If that doesn’t work because you need something &lt;em&gt;added&lt;/em&gt;, you can define infrastructure code that will be added in with the existing setup&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;The secret third option: YOLO on your own &lt;ul&gt;&lt;li&gt;There should be a well defined list of “all the things the platform and golden path give you”&lt;/li&gt;&lt;li&gt;There should be a well understood set of tooling that helps verify whether or not a service fulfills all the needed criteria. 
The golden path and the platform use it, but there’s nothing stopping you from using it&lt;/li&gt;&lt;li&gt;Should your secret third option become sufficiently fleshed out and widely utilised, it can “graduate” into the platform and you can gradually wean yourself off of needing to run it yourself&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;One possible way this could work might be:&lt;/p&gt;&lt;pre&gt;your-service-root/
  src/
  infra/ (might not exist)
    env1/
       # we need weird settings here because Reasons
       route53.override.tf
    env2/
       # the normal stuff works here, we just also want
       # a stable secondary url for stakeholder reasons
       route53.tf
  .your-company.deployment.yaml
&lt;/pre&gt;&lt;p&gt;And then the contents of the deployment yaml file could be:&lt;/p&gt;&lt;pre class=&quot;highlight highlight-yaml&quot;&gt;&lt;span class=&quot;pl-c&quot;&gt;# .your-company.deployment.yaml&lt;/span&gt;
&lt;span class=&quot;pl-c&quot;&gt;# Caveat: This is probably a not-great design&lt;/span&gt;
&lt;span class=&quot;pl-ent&quot;&gt;service&lt;/span&gt;:
  &lt;span class=&quot;pl-ent&quot;&gt;service1&lt;/span&gt;:
    &lt;span class=&quot;pl-ent&quot;&gt;stack&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;python3_12&lt;/span&gt;
    &lt;span class=&quot;pl-ent&quot;&gt;name&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;some-name&lt;/span&gt;
    &lt;span class=&quot;pl-ent&quot;&gt;dir&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;./src/service1&lt;/span&gt;

&lt;span class=&quot;pl-ent&quot;&gt;global&lt;/span&gt;:
  &lt;span class=&quot;pl-ent&quot;&gt;team&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;team-name&lt;/span&gt;
  &lt;span class=&quot;pl-ent&quot;&gt;product&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;product-name&lt;/span&gt;

&lt;span class=&quot;pl-ent&quot;&gt;components&lt;/span&gt;:
  - &lt;span class=&quot;pl-s&quot;&gt;base&lt;/span&gt;
  - &lt;span class=&quot;pl-s&quot;&gt;route53&lt;/span&gt;
  - &lt;span class=&quot;pl-s&quot;&gt;multi-region&lt;/span&gt;
  - &lt;span class=&quot;pl-s&quot;&gt;sqs&lt;/span&gt;
  - &lt;span class=&quot;pl-s&quot;&gt;rds:postgres&lt;/span&gt;

&lt;span class=&quot;pl-ent&quot;&gt;environment&lt;/span&gt;:
  &lt;span class=&quot;pl-ent&quot;&gt;env1&lt;/span&gt;:
    &lt;span class=&quot;pl-ent&quot;&gt;service&lt;/span&gt;:
      &lt;span class=&quot;pl-ent&quot;&gt;service1&lt;/span&gt;:
        &lt;span class=&quot;pl-ent&quot;&gt;name&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;jk-we-named-it-differently-here&lt;/span&gt;
&lt;/pre&gt;&lt;p&gt;The sky’s the limit, but try not to make the config file its own programming language, and if these files end up having to be more than 5-10 lines long “most of the time”, you’re probably designing yourself into a corner somewhere. Take advantage of automatic discovery as much as you can to prevent people from having to specify redundant things. It’s also worth noting that this particular design is not great; I’m certainly not advocating for this one, I’m just showing an example of something that should be very familiar to a lot of people who have tried to do this themselves at their own companies.&lt;/p&gt;&lt;p&gt;Also, when you can, try utilizing existing specifications or existing configuration formats to make your life a lot easier. The &lt;a href=&quot;https://oam.dev/&quot;&gt;Open Application Model&lt;/a&gt; is an example of something you could take inspiration from to avoid reinventing the entire thing from scratch; &lt;a href=&quot;https://containers.dev/&quot;&gt;DevContainers&lt;/a&gt; is another source of inspiration; &lt;a href=&quot;https://devcenter.heroku.com/articles/procfile&quot;&gt;Procfile&lt;/a&gt; is yet another. There are plenty out there, but the more you can point at something else and draw inspiration from it, the easier time you’ll have onboarding others and focusing on the differentiating value you’re providing.&lt;/p&gt;&lt;p&gt;Now, a lot of engineers reading this might be recoiling in horror. Self made configuration files? Custom bespoke concepts wired together with custom tooling? This sounds like a horrific nightmare to be avoided at all costs!&lt;/p&gt;&lt;p&gt;Well, they’re not exactly wrong; that’s the voice of trauma speaking from dozens of lived experiences of this exact thing going awry over the years. In fact, as a fun exercise, can you spot some of the little antipatterns and things that could go wrong in the example deployment yaml file I gave? There’s a lot! 
Which is exactly why we’re not doing this; or rather, we don’t do this all at one go, and instead we build up the abstraction over time in several phases. Abstractions don’t have to be perfect, and they never will be, but as long as they can change and evolve as we do, they’ll end up servicing us well.&lt;/p&gt;&lt;h2 id=&quot;part-one:-the-mvp&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#part-one:-the-mvp&quot;&gt;&lt;span&gt;Part One: The MVP&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Before we go off and build an abstraction, we’re going to enter a very messy phase that I’m going to call the MVP. You need to be able to fuck around and find out. It’s absolutely necessary, and you can’t skip it (I’m serious).&lt;/p&gt;&lt;p&gt;Think of every knowledge revolution that’s happened in history and you’ll realise there’s a fairly predictable pattern that happens.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;A bunch of people discover the revolutionary concept independently all at roughly the same time&lt;/li&gt;&lt;li&gt;Tons of very bad manifestations and articulations of the idea occur, basically all of them fail&lt;/li&gt;&lt;li&gt;One good articulation gets kinda successful, and one “not terrible” articulation gets super successful&lt;/li&gt;&lt;li&gt;People bemoan that the perfect conceptualization doesn’t win, but we repeat the entire process over again in 30 years anyways&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Which sounds cynical, but it’s not, it’s merely a consequence (to me) of the fact that everyone is going to come in with a different context and understanding of something, and over-fitting an abstraction to a problem means that it might perfectly solve that problem, but it might be only understandable and graspable to a small set of the population. Which, paradoxically, makes it a bad abstraction. 
“World’s best conceptualization of an idea” can’t hold a candle to “whatever the hell we can manage to teach grade-school kids in school” for the sole reason that those grade-school kids are going to go on to change the world, and so whatever sticks for them is going to form the foundations of the next generation’s mental models.&lt;/p&gt;&lt;p&gt;Abstractions at companies work the same way; embracing that as a quirk of how humans work makes your life a lot easier.&lt;/p&gt;&lt;p&gt;So, anyways, this is where the “&lt;a href=&quot;https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)&quot;&gt;rule of three&lt;/a&gt;” in software engineering comes from, for me. It’s also where the design of all of my successful projects has come from. Here’s my secret to finding a good starting point for an abstraction.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;“Hey here’s a problem, let’s try and solve it in the most simplified way possible that probably won’t work super well”&lt;/li&gt;&lt;li&gt;Repeat that three times, stir, let it marinate, and leave it out overnight to grow a little moldy&lt;/li&gt;&lt;li&gt;Stare at the mold and figure out how and &lt;em&gt;why&lt;/em&gt; it’s growing… And figure out what needs to be done to prevent that&lt;/li&gt;&lt;li&gt;Build a “Real Good Abstraction” and spend sliiiiightly more time on it, and then GOTO 1.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;That’s it. Turns out, overthinking this means more stakeholders get involved and people start over-designing stuff and making it perfect before anyone’s actually had a chance to let the mold grow on it.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;MVPs HAVE TO GET MOLDY BEFORE THEY CAN TURN INTO ABSTRACTIONS.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;In fact, that’s exactly what I did. Do you remember the nice little abstraction I sketched out? That great proof of concept of how things might work?&lt;/p&gt;&lt;p&gt;I never showed anybody that. Not a single person. 
I hadn’t gotten the MVP moldy yet so why would I skip that part?&lt;/p&gt;&lt;p&gt;Here’s the MVP that I listed above for the running example, for posterity.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;ArgoCD as the deployment mechanism into Kubernetes&lt;/li&gt;&lt;li&gt;Reverse engineering the various different deployment mechanisms and shoehorning them into ArgoCD via Configuration Management Plugins&lt;/li&gt;&lt;li&gt;Utilizing a Cloud Secrets Manager with the &lt;em&gt;idea&lt;/em&gt; (not yet an abstraction) being “give people ways to embed magic strings into yaml that turn into secrets.” argocd-vault-plugin, helmfile, vals, and External Secrets Operator ended up being the main implementation choices.&lt;/li&gt;&lt;li&gt;Shoving sufficient amounts of metadata into argocd allowed for applicationsets to have enough information to suitably deploy the right thing into the right environment&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Notice how this looks absolutely nothing like a deployment yaml thingy? That’s cause it doesn’t, and shouldn’t.&lt;/p&gt;&lt;p&gt;What we actually did was this:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Build a Kubernetes cluster that can bootstrap ArgoCD&lt;/li&gt;&lt;li&gt;Create a “bootstrap” folder that uses an ApplicationSet to deploy a directory of ApplicationSet or applications into the argocd namespace&lt;/li&gt;&lt;li&gt;Create a few “default” ApplicationSet (one for cluster addons, one for certain clusters, one for…)&lt;/li&gt;&lt;li&gt;Ahh fuck never mind, that was too many steps&lt;/li&gt;&lt;li&gt;Revert all the default ApplicationSets&lt;/li&gt;&lt;li&gt;Keep the one ApplicationSet that deploys the bootstrap folder&lt;/li&gt;&lt;li&gt;Make a new ApplicationSet for &lt;em&gt;every single service we are deploying&lt;/em&gt; and ONE. BY. ONE. 
figure out how the fuck to deploy it.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The proof of concept that I did was taking a semi-broken helmfile application that worked in the old cluster systems (running a very outdated version of Kubernetes), and modifying it to work in the new cluster by post-rendering the crap out of it with kustomize, which proved a very important thing: we could live migrate the old infrastructure code to the new clusters, without downtime, and without interrupting anyone else or even having them be aware of our efforts.&lt;/p&gt;&lt;p&gt;That single proof of concept de-risked the migration and defined the MVP as what would work and what could be incremented on; once that was done, all systems were a go.&lt;/p&gt;&lt;h2 id=&quot;part-two:-the-migration&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#part-two:-the-migration&quot;&gt;&lt;span&gt;Part Two: The Migration&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Migrations have three stages, speeches have three stages, and written works have three stages. Three is a very powerful number. If you have one part, you only have a single point and you can’t draw or represent direction or context; this means you can’t build a solution, you can only solve a problem. If you have two parts, you only have a line; while you can now represent context, you can’t draw anything other than a completely straight line; this removes the ability of people to “change their mind” or be taken along with you during the journey. Two parts can be strong and stunning, think flash fiction, where the third part is removed; however, they’re fundamentally brittle at any length beyond “small” and are begging to be built up into something larger.&lt;/p&gt;&lt;p&gt;Three parts, on the other hand, lets you represent a curve. 
The most important part of a curve is that it can look straight to one person, look like a curve to someone else, and look like a &lt;em&gt;different&lt;/em&gt; curve to another person. The second most important part of a curve is that you can bend the middle without changing the destination or having to start over at a new starting point. Curves are flexible, and the most foundational curve is one with three parts.&lt;/p&gt;&lt;p&gt;Any good migration curves and winds its way through multiple narratives as people build a collective understanding around it. Thus, migrations have three stages.&lt;/p&gt;&lt;p&gt;Here were ours:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;De-risk the migration&lt;/li&gt;&lt;li&gt;Shove everything into the new clusters in a bulk, sloppy manner&lt;/li&gt;&lt;li&gt;Turn the data on for things one by one and clean up the mess&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;We talked about the derisking of the migration, where we took a slightly non trivial thing (tiny bit of state, needed modifications, needed patches, etc) and proved that it could work. Everything else after that was just about shoving weirder and more broken stuff through the same hole and figuring out how to clean that up in the future.&lt;/p&gt;&lt;p&gt;Remember: It’s not an abstraction yet, it’s still a migration, and we’re still in the land of MVP. I wanted step two to be the step that grew all the mold, but there was enough flexibility to allow for mold to continue to grow even in step three. Why? Because we could still iterate, which means we can accept that mold grows at any point and handle it as it comes. That said… Selfishly, it’s easiest to deal with mold in step two, so the best way for me to handle that is to encourage it to grow during step two.&lt;/p&gt;&lt;p&gt;How did I encourage the mold to grow? 
Well, that’s pretty easy (ish).&lt;/p&gt;&lt;p&gt;I took every single service that we were running in the top priority clusters and stubbed out a proof of concept that mostly worked and then let the team loose on it. Then, we defined what “done” meant.&lt;/p&gt;&lt;h3 id=&quot;getting-to-done...-ish&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#getting-to-done...-ish&quot;&gt;&lt;span&gt;Getting to Done… Ish&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;I’m a big fan of having multiple flavors of done. They’re extremely crucial for a migration that’s also an upgrade, because “identical behaviour” isn’t possible, and “it mostly runs without errors” isn’t sufficient.&lt;/p&gt;&lt;p&gt;Here were our stages of done:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Done: The application CRD is in argocd, doesn’t cause syntax errors, and updates correctly&lt;/li&gt;&lt;li&gt;Done Done: The resources get created correctly&lt;/li&gt;&lt;li&gt;Done Done Done: The containers start properly and have no errors in them other than data connectivity ones&lt;/li&gt;&lt;li&gt;Done Done Done Done: The containers can read all data and mutate all data and are fully live&lt;/li&gt;&lt;li&gt;Done Done Done Done Done: The team has chosen to cut over to the new cluster and we can decommission the old service in the outdated Kubernetes cluster&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The last stage of done was not something that we could control, so I removed it from the migration scope, which left us with four levels of done. Why remove the last stage? Because this company in question prioritised individual autonomy over team productivity, so cross team collaboration was something you should never put as a blocker for your team’s progress.&lt;/p&gt;&lt;p&gt;Now, selfishly, here’s why I personally did things this way. 
I’m going to take off my “explaining things” hat and put on my “Hazel is going to be vulnerable” hat. I have a few weaknesses as an engineering leader, two of which are on major display here:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;One of my biggest strengths is leveling up people one on one, and I’m excellent at leveling up organisations, but I struggle with leveling up a team of engineers and getting them more effective. Something about that middle zone is just difficult for me to wrap my head around.&lt;/li&gt;&lt;li&gt;I have a lot of hesitation to go off and build things, because I’m always worried that I will build a solution so complex and perfectly shaped for a problem that I end up being the only one who can understand it. It’s a common failure mode for me, and although I can almost always address it, I didn’t have time to get this wrong on this project.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Above anything else, I needed this solution to outlive me, and I needed people to be able to function without me there. So the quickest way I could see to make that true was to unleash a team of people on the stubbed out stuff I built and then make myself entirely available to answer questions, level up the team, and make sure they understood it. In doing so, I could help write documentation, or help explain things in a way that would hopefully prevent these failure modes.&lt;/p&gt;&lt;p&gt;Candidly, I was only partially successful there. One thing I should’ve done differently is recognise that I needed to write a lot more documentation; I tried to pass the documentation off to people as a learning exercise of taking what I taught them and writing it down, but realised too late that this is a learning mode that basically only works for me. I’ve yet to meet someone else who learns optimally this way (although I’m sure they exist); I needed to sit down and write the notes &lt;em&gt;while&lt;/em&gt; I taught people, and then write the documentation from that. 
Another thing I should’ve done differently is model my thinking flow better; there’s a fairly predictable flow chart you can follow to mechanically migrate a service, and while I tried to write it out, it ended up looking like “step one… look at the vibes. step two, pick the right solution. step three, just do the thing”.&lt;/p&gt;&lt;p&gt;We did fix that! Mostly. The team was successfully able to get on-boarded and I was able to onboard a second team into helping with the migration afterwards as well, but I wasted some time having to onboard the second team because I hadn’t realised my deficiencies with the first team. More importantly, I didn’t set them up to &lt;em&gt;feel&lt;/em&gt; successful, and I didn’t set them up to have a very objective sense of what level of Done to get to and what that looked like, so they never felt confident in their own skills; that was probably my biggest regret in projects like this. It’s great to have teams be productive, but it’s vastly more important to have them &lt;em&gt;feel&lt;/em&gt; productive and capable.&lt;/p&gt;&lt;p&gt;Speaking of which, being productive is a tricky thing because as things are changing in all parts of the project and across multiple clusters, you need to ask yourself: what does progress, and thus iteration, look like?&lt;/p&gt;&lt;h2 id=&quot;part-three:-the-iteration&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#part-three:-the-iteration&quot;&gt;&lt;span&gt;Part Three: The Iteration&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;(I’m putting my “explaining this concept” hat back on, now)&lt;/p&gt;&lt;p&gt;Hey, wait a minute; earlier, I talked about there being only three steps:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The MVP / Prototype / Proof of Concept thingy&lt;/li&gt;&lt;li&gt;Chaos. Sobbing. Here Be Dragons. 
???&lt;/li&gt;&lt;li&gt;An Abstraction!&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Why do we now have four? Well, there’s a good reason for that; firstly, “sobbing architect” doesn’t look good on a resume, and secondly, you won’t ever be able to pitch that to your team.&lt;/p&gt;&lt;p&gt;The real reason, though, is for a similar reason to why I like three stages the most. You need some flexibility and wiggle points in how you do things, and although the migration itself has that flexibility in it, you often need to call it out a bit. Which is to say, this Iteration stage is really also the final part of the migration bundled up into it. It’s a great psychology trick, instead of having the last part of a migration be “and now we slog through a giant burn down list and one by one have everyone hate you as your deadlines slip”, you can re-frame it! Now we’re talking about having the MVP be done and we’re working on iterating the solution until it’s usable by everyone, so you end up with a giant list of small projects that are all going to go from start to completion in a smooth linear order from the perspective of the team in question.&lt;/p&gt;&lt;p&gt;I can’t really overstate how important it is to have this messy back and forth iteration &lt;em&gt;appear&lt;/em&gt; linear to your stakeholders; they want a linear narrative, and it makes sense to give them one; part of your skill in building abstractions is going to be turning these messy loops into a linear progression. My favorite way is to “unroll” the loops so that you end up with a breadth first traversal of the loops and each one can be invested in only as appropriate and time-effective.&lt;/p&gt;&lt;p&gt;Which is exactly how the iteration worked. Remember those four stages of done? 
Let’s bring that down to three.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Done: The application CRD is in argocd, doesn’t cause syntax errors, and updates correctly&lt;/li&gt;&lt;li&gt;Done Done: The resources get created correctly&lt;/li&gt;&lt;li&gt;Done Done Done: The containers start properly and have no errors in them other than data connectivity ones&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The last stage is the only one that we need the teams involved with, so we can do the messy iteration per service and slowly push all the services through each level of done.&lt;/p&gt;&lt;h3 id=&quot;iteration-overflow&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#iteration-overflow&quot;&gt;&lt;span&gt;Iteration Overflow&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Now, let’s take a small step back for a second and point out one of the most hidden and dangerous failure modes of a project like this: iteration is a non-terminating and non-finite mode of operation. What I mean is that you might tweak a thing, and then tweak all the downstream projects, then tweak a thing, then tweak the downstream, and… You’ll never actually be done. So the progress of the project needs to do a certain set of things in order to make iteration safe to do.&lt;/p&gt;&lt;p&gt;Safety for a complex system like this comes in two parts: a safety property and a liveness property. Communicating safety, on the other hand, carries another thing: you need a pivot trigger for you to go “this isn’t working, we need to re-evaluate”. For this project, they were:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Safety property: All changes upstream (ie tweaking argocd itself, adding new plugins, fixing a global ApplicationSet generator) do not break downstream&lt;/li&gt;&lt;li&gt;Safety property: All applications never regress from the current level of done. 
If something is Done Done, all changes must keep that application at Done Done (or push it forward)&lt;/li&gt;&lt;li&gt;Liveness property: If a change does not enable an application moving forward into a new level of Done, we save it for later&lt;/li&gt;&lt;li&gt;Pivot trigger: If an application is taking more than &lt;code&gt;(time to complete project / number of applications) - safety factor&lt;/code&gt; days, we table it. One thing that would’ve helped this project a lot would have been to explicitly lay this pivot trigger out and explain it. While I had it intuitively in my head, I needed to communicate it a lot better with people and that caused progress to appear to stall externally even though everything was fine internally&lt;/li&gt;&lt;li&gt;Pivot trigger: If an application needs new functionality from argocd in order to be migrated over, immediately flag this, save it for later, and wait for more things to need the new feature before building it. That last bit is important and prevents you from going off into the weeds and building specific functionality for the various edge cases when it won’t be cost effective to do so.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;So, we have all of this written out and thought about… What did iteration look like for this project?&lt;/p&gt;&lt;p&gt;We had a burndown chart! 
That got turned into tickets, and the tickets were tracked; this made more sense for how leaders and stakeholders wanted to think about the project progress, and the flexibility internally let us do this while also shaping the work in a way that was most effective.&lt;/p&gt;&lt;p&gt;Specifically:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Product Boundary &lt;ul&gt;&lt;li&gt;Environment &lt;ul&gt;&lt;li&gt;App N - Ticket for “Done”&lt;/li&gt;&lt;li&gt;App N - Ticket for “Done Done”&lt;/li&gt;&lt;li&gt;App N - Ticket for “Done Done Done”&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Now you can very handily create an automatic burndown thing that’s mostly accurate just from ticket progress, but people can still say “oh for this environment we need to fix something” or “oh for all the clusters in this product boundary we need to fix something” and the work still has a natural place to go. Where does the iteration go, you might ask? The iteration gets buried in the tickets and as long as the safety and liveness properties hold, we don’t ever have to “re-open” a ticket or structure our work in a way that looks like we’re redoing work. This is particularly helpful in an environment where platform engineering is poorly understood and people don’t want projects that are largely exploratory in nature.&lt;/p&gt;&lt;p&gt;The next part of iteration, and largely the most important part for actually &lt;em&gt;doing&lt;/em&gt; the work, is figuring out what your iteration loops are and how to streamline them. 
This was something I was still figuring out, but here are some loops I identified:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Needing to change something in a terraform module (at the very root of everything) &lt;ul&gt;&lt;li&gt;The best iteration loop here was doing a merge party on zoom once the proof of concept was done so that PRs could be approved and merged in rapid-fire fashion&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Needing to change something in argocd itself &lt;ul&gt;&lt;li&gt;Just commit to the WIP branch!&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Needing to change something for an application &lt;ul&gt;&lt;li&gt;Just commit to the WIP branch!&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;You can see a pattern: I made iteration super painless for everything by sticking everything in a WIP branch so that PR approvals weren’t required for the gitops “commit and see if it fixed things” workflow. However, long lived branches are evil, so I layered a second practice on top of this: we would regularly merge the branch back into the main branch and then cut a new branch named the exact same name and hard-refresh all of the argocd clusters; it wasn’t ideal, but it was the closest thing we could get to the best of both worlds.&lt;/p&gt;&lt;p&gt;One problem I never fully solved was that, ideally, I wanted merging to the main branch to signify a certain level of Done and work in the WIP branch to signify a more fluctuating state of done. It turns out this just doesn’t really work with gitops and you can’t really do that, so we ended up just merging the branch in with a snapshot-like strategy. However, the snapshot branch thing is really ugly; everyone has to know when you did the merge and re-create of the branch so that they don’t force push an old version of the branch up to the new one, and you have to hard-refresh anything that will complain about the missing branch (like argocd). 
It would be nice to have a different method for that, but picking a different branch name every time would require mutating the configs in argocd every time; perhaps that would’ve been better? Who knows!&lt;/p&gt;&lt;p&gt;If it isn’t obvious by now, iteration is highly dependent on how your teams work and how you communicate things to leaders and stakeholders; figure that out, and then your ways of working will find a happy-ish middle spot to land.&lt;/p&gt;&lt;h2 id=&quot;part-four:-the-abstraction&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#part-four:-the-abstraction&quot;&gt;&lt;span&gt;Part Four: The Abstraction&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Here we are, close to the end of our journey! It’s been a long fucking ride, eh? Here’s what we’ve done so far:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;We’ve talked about the MVP and fleshed it out&lt;/li&gt;&lt;li&gt;We figured out how the whole migration strategy works&lt;/li&gt;&lt;li&gt;We did the iteration work and got the team working effectively and making steady progress&lt;/li&gt;&lt;li&gt;Finally, the main migration is finished!&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;And… We still don’t have an abstraction yet. Really! All that work and we’re actually only finished with step one. There’s a reason this type of work is something that’ll burn you the fuck out. Most people never get to the abstraction stage, and most projects end up just being some iteration of the MVP thing. Honestly? That’s fine. A lot of companies never need more than that and they’re kidding themselves if they think otherwise.&lt;/p&gt;&lt;p&gt;Which means it’s worth pointing out: Do we actually need to go further? 
We’re kinda… Done, are we not?&lt;/p&gt;&lt;p&gt;Here are some signs I use to figure out whether or not we actually need to do the work of creating this abstraction concept:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;If we want to continually onboard new people into the concept&lt;/li&gt;&lt;li&gt;If we want to continually get better at onboarding new projects into the concept&lt;/li&gt;&lt;li&gt;If we need to figure out how to communicate about this at higher levels&lt;/li&gt;&lt;li&gt;If this is going to become a concept that’s embedded into how the rest of the company does work&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Thinking back on this company, what do we have? (And again: Whenever I reference “a company” or “a project”, it’s really an amalgamation of several companies, projects, and such, plus some details changed, and so on…)&lt;/p&gt;&lt;p&gt;Here are the relevant bits of the company context:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Strategically &lt;ul&gt;&lt;li&gt;The Go To Market strategy heavily leaned on specific one-on-one engagements with customers.&lt;/li&gt;&lt;li&gt;Product diversification was simultaneously a top concern and a low priority.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Operationally &lt;ul&gt;&lt;li&gt;Multi-cloud was embraced as a strategic need out of necessity.&lt;/li&gt;&lt;li&gt;Mergers &amp; Acquisitions were utilised heavily as a growth mechanism, and so the ability to accommodate diversity across ways of working and tooling was required.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Oooh, look at that! We want to onboard new people and projects, communicate about this at higher levels and with customers, and it’ll shape how we identify and execute mergers and acquisitions. So, yes, we probably do want to keep going and do a real abstraction around this.&lt;/p&gt;&lt;p&gt;I chose a few different names for the abstraction, and wanted to see which ones stuck. 
We had:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;De-risking change&lt;/li&gt;&lt;li&gt;Progressive Delivery&lt;/li&gt;&lt;li&gt;The Golden Path&lt;/li&gt;&lt;li&gt;The Kubernetes Repave&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Naturally, the name that stuck was “repave”: repaved clusters, “is X repaved”, “we should repave Y”, and so on. Not only does this give a weird connotation but it also doesn’t always convey things correctly. However, it’s the one that stuck, and it wasn’t any of the options I had. That’s probably going to happen to you; do your best to communicate something clearly but be willing to adopt whatever language starts floating around and attach it to your abstraction. All the best abstractions don’t get to pick their own names, anyway.&lt;/p&gt;&lt;h3 id=&quot;the-dream&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#the-dream&quot;&gt;&lt;span&gt;The Dream&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Now, this section is going to be forward facing, because I never got to build this part. While it’s really cool to be able to stick around long enough to build the abstraction and actually see it flourish, one of the hardest parts of leadership is knowing that, realistically, the odds of you actually being able to see that abstraction through to completion are slim. Too many things change around you: company priorities, politics, market concerns, and more are all competing and interacting in odd ways. As often happens, that meant that my time at a company often ended before the abstraction could even be started, or maybe it got started but not finalised, or maybe it got finalised but never got to evolve over time.&lt;/p&gt;&lt;p&gt;If you get to stick around and do this part, though, it’s really cool. 
You get to see the fruits of your labour be born and turn into wild and wondrous things, beyond what you could’ve ever imagined.&lt;/p&gt;&lt;p&gt;Here’s part of the abstraction that I wanted to build for this project. It was composed of a few properties:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Segmentation:&lt;/strong&gt; The ability to break things apart and separate them, dynamically, at runtime. Think traffic routing, blue/green, progressive deployment, and so on.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Simulation:&lt;/strong&gt; The ability to test something or poke it, the ability to experiment, to investigate. Think chaos engineering, fault injection, load testing, fake data, and more.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Traceability:&lt;/strong&gt; The ability to see any one action propagate throughout a system. Observability as commonly defined falls under here, but so do compliance, auditability, and security, and the tying of work to changes.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Bidirectionality:&lt;/strong&gt; The ability to propagate a change forwards and backwards through a system. Rollbacks, reverting, and transaction semantics all fall under here.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Augmentability:&lt;/strong&gt; The ability to annotate any action with information at any point throughout the system, or decorate that action with another action. Event-driven architecture is a common way to think about this, but there are others.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Congruence:&lt;/strong&gt; The order independence of actions. What comes first, the chicken or the egg? The database or the app? The data migration or the updated code? Here’s a better one: What if the question simply didn’t matter?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Segmentation, specifically, was where a lot of the core abstraction lay. 
I had envisioned that there would be a way for developers to build applications and specify certain properties that needed to be held, and that would route the application and traffic into certain shapes. Imagine, you write your application and specify a simple file:&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre class=&quot;highlight highlight-yaml&quot;&gt;&lt;span class=&quot;pl-ent&quot;&gt;compliance&lt;/span&gt;:
  &lt;span class=&quot;pl-ent&quot;&gt;fips&lt;/span&gt;: &lt;span class=&quot;pl-c1&quot;&gt;true&lt;/span&gt;
  &lt;span class=&quot;pl-ent&quot;&gt;fedramp&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;Moderate&lt;/span&gt;
  &lt;span class=&quot;pl-ent&quot;&gt;hitrust&lt;/span&gt;: &lt;span class=&quot;pl-c1&quot;&gt;r2&lt;/span&gt;
  &lt;span class=&quot;pl-ent&quot;&gt;soc2&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;Type 2&lt;/span&gt;
&lt;span class=&quot;pl-ent&quot;&gt;product&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;our_product_name&lt;/span&gt;
&lt;span class=&quot;pl-ent&quot;&gt;services&lt;/span&gt;:
  &lt;span class=&quot;pl-ent&quot;&gt;queues&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;{}&lt;/span&gt;
  &lt;span class=&quot;pl-ent&quot;&gt;data&lt;/span&gt;:
    &lt;span class=&quot;pl-ent&quot;&gt;object&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;{}&lt;/span&gt;
    &lt;span class=&quot;pl-ent&quot;&gt;postgres&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;{}&lt;/span&gt;
  &lt;span class=&quot;pl-ent&quot;&gt;some_3p_vendor&lt;/span&gt;: &lt;span class=&quot;pl-s&quot;&gt;{}&lt;/span&gt;
&lt;/pre&gt;&lt;p&gt;And everything else flowed from inferring the structure of the code. Detect the framework from the docker image, figure out the routes from an OpenAPI spec, and provide simple integration points for people to build their application with. Need authentication? Shell out to the internal SDK and call the auth service. Feature flags? We have them! Just hit the internal endpoint that’s super easy to remember, or use our internal SDK that makes it even easier.&lt;/p&gt;&lt;p&gt;You get the idea. Importantly, no Terraform really has to be written, because this all ends up dynamically generating infrastructure for everyone as needed, with documentation on how it all works. Even more importantly, this ends up making most of what you write semi cloud agnostic (or easier) because you can shim things.&lt;/p&gt;&lt;p&gt;Take S3 as an object store, for example; hardcoding the concept of S3 is bad, but utilizing an object store is great. So exposing the ability to have “an object store” but not requiring that to be S3 means that you can decide to build a tiered S3 api compatible solution, utilizing something like &lt;a href=&quot;https://github.com/seaweedfs/seaweedfs&quot;&gt;seaweedfs&lt;/a&gt;. In memory caching can be handled via something like &lt;a href=&quot;https://pelikan.io/&quot;&gt;pelikan&lt;/a&gt; rather than immediately assuming memcache or redis (or a managed solution of those). You don’t even need to update the code!&lt;/p&gt;&lt;p&gt;You can also do this in environments where it makes sense, and then avoid it in environments where it doesn’t; high compliance environments, for example, mean that storing data on disk is sometimes annoyingly hard and you’d prefer to externalise it, but you could probably cut off a large chunk of your object storage costs in non-production by using a local storage alternative with the same API. 
While this type of optimization might be overkill for most environments, making it &lt;em&gt;possible&lt;/em&gt; and &lt;em&gt;feasible&lt;/em&gt; to do, even at lower scale, unlocks absolutely massive potential and optionality for companies; something everyone is desperately hunting for right now.&lt;/p&gt;&lt;p&gt;Pivoting to exploring a separate area of the solution space, one of the coolest and most exciting capabilities I’ve yet to see people really flesh out is magical headers in requests. Imagine having three different headers that you can set anywhere in your application: an idempotency header, a version header, and a dry-run header.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;An idempotency header: This means that the action associated with this request is guaranteed to be idempotent and should be handled correctly by any stateful service. It also means we can automatically duplicate these occasionally when testing and ensure that they &lt;em&gt;are&lt;/em&gt;.&lt;/li&gt;&lt;li&gt;A version header: This lets you route services to the right version of the backend appropriately, including ones with various feature flags, and lets you fall back correctly if your service mesh logic is configured well.&lt;/li&gt;&lt;li&gt;A dry-run header: This means that the request isn’t real. Treat it like it &lt;em&gt;is&lt;/em&gt; real, but don’t trigger any human actions with this request (or if you do, send it to a testing team). This type of header is invaluable for sending shadow traffic through your production system. You can also name it something else and use it to start flagging real production traffic as “hmm, this is weird, let’s pretend this request didn’t happen but we need to push it through rather than reject it.” (A common scenario when a failing backend is sending garbage data out but the backend appears healthy. 
You take the backend down, but you flag any requests that it made after the fact as garbage.)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;I had hoped to be able to flesh those concepts out and see what stuck, what worked, and what would’ve ended up being overkill.&lt;/p&gt;&lt;p&gt;For progressive delivery, &lt;a href=&quot;https://argo-cd.readthedocs.io/&quot;&gt;ArgoCD&lt;/a&gt;, rather than having an ApplicationSet for every single service, would end up sitting at the organisation level and picking up repositories that were labeled correctly, and doing a few things:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;For the main branch, deploy the production version of the service in a progressive way, allowing for rollbacks and automatically toggling a feature flag off if the SLOs for it failed. Entire deployments would automatically stop if errors were detected, leaving the service up but isolated for future debugging.&lt;/li&gt;&lt;li&gt;For any PRs, deploy an ephemeral version of the service based on labels set on the PR, allowing for any individual PR to iterate on multiple services as needed.&lt;/li&gt;&lt;li&gt;For “major PRs”, due to compliance reasons, allow for embedding enough metadata into the PR that ArgoCD could be utilised programmatically along with some custom glue to collect together all the required information for creating a FedRAMP SCR. This would work automatically, and only if the appropriate compliance flags were set, so that developers and security could work together effectively.&lt;/li&gt;&lt;li&gt;Feature flags could override various aspects of the deployment.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Eventually, &lt;a href=&quot;https://www.vcluster.com/&quot;&gt;vCluster&lt;/a&gt; would’ve been introduced for a very fun reason: it makes the ephemeral environment concept much more robust. It also does something very interesting: it provides a vanilla Kubernetes API inside of another Kubernetes cluster. 
Which means that multi-cloud and hybrid deployments involving multiple cloud providers and on-prem can be attempted with the same exact codebase and any customizations that a team needs to do will carry over across clouds since they’re only responsible for programming to the vanilla Kubernetes API. Having vCluster in there also makes a multi-cluster capable routing solution like &lt;a href=&quot;https://linkerd.io/&quot;&gt;linkerd&lt;/a&gt; far more powerful as multi-cluster routing becomes much more commonplace.&lt;/p&gt;&lt;p&gt;Then, at some point, we would be setting TTLs on everything so that absolutely nothing in the cluster lives for longer than a few days; in fact, the cluster itself would only end up living for about a week. Ideally, all of the clusters would be multi region and blue-green in design, shunting traffic from one region to another as the clusters got decommissioned and re-created automatically.&lt;/p&gt;&lt;p&gt;Cluster upgrades, the very initial reason for the entire project, would become an automatically solved issue, and we would be able to stay on the bleeding edge of Kubernetes with very little active work required to do so.&lt;/p&gt;&lt;p&gt;For simulation, one of the most exciting solutions I’ve seen in a while is &lt;a href=&quot;https://shadowtraffic.io/&quot;&gt;ShadowTraffic&lt;/a&gt;, and it would be a huge boon for developers to be able to mock something out quickly and integrate it into how they do things.&lt;/p&gt;&lt;p&gt;Lastly, utilizing something like &lt;a href=&quot;https://tailscale.com/&quot;&gt;tailscale&lt;/a&gt;, or perhaps &lt;a href=&quot;https://mirrord.dev/&quot;&gt;mirrord&lt;/a&gt;, would enable one of the most exciting developer productivity unlocks to me: utilizing a hybrid solution of your local development setup + the ephemeral service launched in your PR in order to hack on something and see the results in real time when you hit save in your editor.&lt;/p&gt;&lt;p&gt;I’d love to see this built out one 
day.&lt;/p&gt;&lt;h2 id=&quot;final-thoughts&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/home-baked-abstractions-store-bought-implementations/#final-thoughts&quot;&gt;&lt;span&gt;Final Thoughts&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Building a good abstraction is an act of mortality and vulnerability. You will be planting seeds that you won’t always get to water; you will not eat the fruit you bear; and you will not live to see the shade your trees offer to those who come after you. It will often feel like you’ve stumbled around producing failure after failure, or not making any change at all. It’s heartbreaking watching things grow and develop for years, only to have yourself ripped away right before completion, or experience a shift in company priorities severe enough to break the entire abstraction. Being human means, to me, wanting to build abstractions that enable others to build things that are beautiful. Which means that every time things get ripped away, a part of that humanity gets shattered.&lt;/p&gt;&lt;p&gt;Understanding the process of developing abstractions, especially as a leader, is really about understanding the process of grief. Even if you get to build the abstraction, it won’t be the one you pictured, or envisioned. You’re going to need to take the seeds you’ve borne, carefully curated, and lovingly built up over time… And watch them die. Grieve for that which could’ve been, and embrace the beauty that you see now, and live for the potential that can be.&lt;/p&gt;&lt;p&gt;Just make sure of one very important thing: don’t grieve for something that has not yet died. It’s a common trap leaders get into: believing that something will inevitably die (because it will) and grieving for a loss that has not yet occurred. 
You can’t do that with this type of work, because it prevents you from being able to make the changes you need to; bracing for the impact means that you will be one of the instruments responsible for causing it to die, and you’ll become the object of your own grief. If nothing else burns you out and destroys your faith in humanity, that will.&lt;/p&gt;&lt;p&gt;Please, lean into the vulnerability and plant the flowers. Love them as deeply as you can, even if you know you’ll one day see them trampled, even if you know that what sprouts won’t be what you planted. Keep that part of you that recognises the inevitable as carefully separated from the part of you that loves and hopes for the brighter future. Show it to nobody. As a leader, this is something nobody talks about, but you lose the ability to hold this grief and share it with another person at your company. Even fellow leaders at your company will not be someone you can share this with, because to do so will cause them to not buy into what you’re doing. It’s not deception, it’s just reality; you need trust, and there are certain flavors of vulnerability that erode trust as much as there are flavors of vulnerability that build it.&lt;/p&gt;&lt;p&gt;There is a bright side to all of this, though. One secret about death that’s hard for many western societies to understand is that death and life are two sides of the same coin; the death of one thing is the space of another’s growth. The soil that a beautiful garden will be planted in is made of the stories of thousands of gardens that bloomed, lived, loved, and died. Never forget that your grief must also be joy; the grief of the past brings with it the joy of the future.&lt;/p&gt;&lt;p&gt;To build an abstraction is to hold the heart of your humanity in your hands. Plant your soul into the ground, and be reborn.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Pick Your Distributed Poison</title>
    <link href="https://hazelweakly.me/blog/pick-your-distributed-poison/" />
    <updated>2024-06-20T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/pick-your-distributed-poison/</id>
    <content type="html">&lt;p&gt;One of the hardest things for people to understand with distributed systems is that eventual consistency is the same thing as eventual inconsistency. The very same pattern that lets you non atomically deal with things also ensures that eventually you’ll have a system that doesn’t match your understanding. Resources will go stale, things will go missing, stuff will exist without ever having been created, and data will be destroyed that never got manifested.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“How do you prevent this?”&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;You don’t. You figure out what flavor of wrong you want and what type of inconsistency is tolerable to you and you embrace the suffering and learn to mitigate the particularly painful outliers that bite you.&lt;/p&gt;&lt;p&gt;Is bootstrapping your worst enemy? Regularly destroy and recreate the system to ensure no cycles exist in it. Of course, that means it will inevitably incur emergent instability and resource leaks. What’s your preference?&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“I know! I’ll keep a fresh system around and recreate it to ensure no cycles, and I’ll keep an old one around to ensure no long term leaks exist”&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Okay, suit yourself. I see you enjoy the wonders of non-deterministic metastability that comes from adaptive concurrency controls. Oh, you don’t? So you have hard isolation between the two systems? I see. That gets you non-deterministic metastability but without needing adaptive concurrency controls. Fascinating innit?&lt;/p&gt;&lt;p&gt;Dangling, stale, metastable, zombie. 
That only touches the very surface.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;“This system only restarts with warmed caches”&lt;/li&gt;&lt;li&gt;“This system can’t be rebooted and scaled up at the same time”&lt;/li&gt;&lt;li&gt;“This system can do anything except be highly available during updates”&lt;/li&gt;&lt;li&gt;“This system can only be restarted in topo-sort order”&lt;/li&gt;&lt;li&gt;“This system has a deadlock if you drain it geographically from east to west during daylight saving time”&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Pick your choice of madness, but don’t pretend you won’t be drinking it dry.&lt;/p&gt;&lt;p&gt;My poison? I prefer reproducible and bootstrappable systems. That’s my thing. I want cold caches, constant work, and young state. It minimises, for me, the number of things I need to keep in working memory.&lt;/p&gt;&lt;p&gt;Of course, I pay the price: I lose the ability to detect leaks, stale references, clean shutdowns, and long-lived properties. I also lose out on emergent performance, a large amount of adaptability, and entire methodologies of systems safety. Living in ground zero means I never touch the sky.&lt;/p&gt;&lt;p&gt;Reproducible and bootstrappable systems get a lot of love among neurodivergent people. For good reason: they’re very friendly to those with little working memory but vast amounts of working context. They’re harder to reason about, though, funnily enough. The path to running is never the same as the running loop.&lt;/p&gt;&lt;p&gt;For all my love of liveness and safety properties when it comes to reasoning about systems, I ironically build ones that rely as little on them as possible.&lt;/p&gt;&lt;p&gt;But, I’ll take my poison. Neat, if you please. I prefer to sip it slowly and savor the madness within.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>I Miss the Days of Humanity</title>
    <link href="https://hazelweakly.me/blog/i-miss-the-days-of-humanity/" />
    <updated>2024-04-12T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/i-miss-the-days-of-humanity/</id>
    <content type="html">&lt;p&gt;I miss the &lt;em&gt;forums&lt;/em&gt;.&lt;br&gt; I miss the &lt;em&gt;forums&lt;/em&gt; so much it hurts.&lt;br&gt; I miss when research was about &lt;em&gt;discovery&lt;/em&gt; and &lt;em&gt;learning&lt;/em&gt; and &lt;em&gt;sharing&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;I miss when &lt;em&gt;humanity&lt;/em&gt; felt like it had &lt;em&gt;hope&lt;/em&gt;,&lt;br&gt; when &lt;em&gt;human&lt;/em&gt; interaction was plentiful,&lt;br&gt; when &lt;em&gt;genuine connection&lt;/em&gt; wasn’t rarer than gold.&lt;/p&gt;&lt;p&gt;I miss the days before &lt;em&gt;our souls&lt;/em&gt; were &lt;code&gt;destroyed&lt;/code&gt; for the sake of the &lt;code&gt;market&lt;/code&gt;,&lt;br&gt; before &lt;em&gt;our knowledge&lt;/em&gt; was &lt;code&gt;plundered&lt;/code&gt;,&lt;br&gt; before &lt;em&gt;our humanity&lt;/em&gt; &lt;code&gt;exploited&lt;/code&gt;.&lt;/p&gt;&lt;p&gt;I miss when the song of &lt;em&gt;humanity&lt;/em&gt; was &lt;em&gt;sung&lt;/em&gt; in the streets.&lt;br&gt; I miss it, even though I was born after the war was lost.&lt;/p&gt;&lt;p&gt;Now &lt;em&gt;we&lt;/em&gt; whisper the truth and shout the &lt;code&gt;lies&lt;/code&gt;,&lt;br&gt; but this was not the fault of &lt;em&gt;AI&lt;/em&gt;.&lt;br&gt; &lt;em&gt;We&lt;/em&gt; whisper the truth and drown in the &lt;code&gt;noise&lt;/code&gt;,&lt;br&gt; but this was not the fault of &lt;em&gt;academia&lt;/em&gt;.&lt;br&gt; &lt;em&gt;We&lt;/em&gt; whisper the truth and bury it in &lt;code&gt;disguise&lt;/code&gt;,&lt;br&gt; but this was not the fault of the &lt;em&gt;internet&lt;/em&gt;.&lt;br&gt; &lt;em&gt;We&lt;/em&gt; whisper the truth and watch as it &lt;code&gt;dies&lt;/code&gt;,&lt;br&gt; but this was not the fault of &lt;em&gt;humanity&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;&lt;em&gt;We&lt;/em&gt; whisper the truth because &lt;em&gt;we&lt;/em&gt; no longer know it,&lt;br&gt; and know of no &lt;em&gt;one&lt;/em&gt; or no place or no where to find it.&lt;br&gt; &lt;em&gt;Our oracles&lt;/em&gt; were 
slaughtered; &lt;em&gt;our teachers&lt;/em&gt; starved&lt;/p&gt;&lt;p&gt;If some&lt;em&gt;one&lt;/em&gt; is lost in the desert, and goes many days without food or water,&lt;br&gt; when &lt;em&gt;they&lt;/em&gt; are rescued, &lt;em&gt;you&lt;/em&gt; must do something very important:&lt;br&gt; Do not feed &lt;em&gt;them&lt;/em&gt;, &lt;em&gt;they&lt;/em&gt; will die; do not water &lt;em&gt;them&lt;/em&gt;, &lt;em&gt;they&lt;/em&gt; will drown.&lt;/p&gt;&lt;p&gt;&lt;em&gt;Their bodies&lt;/em&gt; are not ready yet,&lt;br&gt; &lt;em&gt;they&lt;/em&gt; will burst under the weight of &lt;em&gt;life&lt;/em&gt;,&lt;br&gt; &lt;em&gt;they&lt;/em&gt; must be brought back slowly.&lt;/p&gt;&lt;p&gt;What will happen to &lt;em&gt;us&lt;/em&gt; when &lt;em&gt;we&lt;/em&gt; starve &lt;em&gt;ourselves&lt;/em&gt;&lt;br&gt; of our &lt;em&gt;humanity&lt;/em&gt;&lt;br&gt; for decades?&lt;/p&gt;&lt;p&gt;What &lt;code&gt;horrors&lt;/code&gt; will &lt;em&gt;we&lt;/em&gt; encounter&lt;br&gt; as &lt;em&gt;we&lt;/em&gt; burn &lt;em&gt;our souls&lt;/em&gt;?&lt;/p&gt;&lt;p&gt;Worse: how, on earth, in the heavens,&lt;br&gt; will &lt;em&gt;we&lt;/em&gt; heal?&lt;/p&gt;&lt;p&gt;How does one &lt;em&gt;flame&lt;/em&gt; in the darkness,&lt;br&gt; in the howling wind,&lt;br&gt; find another &lt;em&gt;flame&lt;/em&gt; to huddle with,&lt;br&gt; to keep warm,&lt;br&gt; to share the connection of &lt;em&gt;humanity&lt;/em&gt;&lt;br&gt; and the joy of &lt;em&gt;learning&lt;/em&gt;?&lt;/p&gt;&lt;p&gt;How does &lt;em&gt;one&lt;/em&gt; go on as the world gets snuffed out?&lt;/p&gt;&lt;p&gt;How does &lt;em&gt;one&lt;/em&gt; heal the garden&lt;br&gt; where the &lt;code&gt;salt&lt;/code&gt; was &lt;code&gt;sowed&lt;/code&gt;,&lt;br&gt; where the &lt;code&gt;poison&lt;/code&gt; was &lt;code&gt;poured&lt;/code&gt;,&lt;br&gt; where the &lt;code&gt;rocks&lt;/code&gt; were &lt;code&gt;thrown&lt;/code&gt;?&lt;/p&gt;&lt;p&gt;How does &lt;em&gt;one&lt;/em&gt; heal that which has scarred so heavily&lt;br&gt; it may never 
grow life again?&lt;/p&gt;&lt;p&gt;When &lt;em&gt;we&lt;/em&gt; recover,&lt;br&gt; as &lt;em&gt;we&lt;/em&gt; eventually will,&lt;br&gt; what will be left of &lt;em&gt;us&lt;/em&gt;?&lt;/p&gt;&lt;p&gt;The nightmare of &lt;em&gt;humanity&lt;/em&gt; lost started&lt;br&gt; with &lt;em&gt;those&lt;/em&gt; who seek to find it.&lt;br&gt; &lt;em&gt;Publish&lt;/em&gt; and &lt;code&gt;perish&lt;/code&gt;,&lt;br&gt; draw &lt;em&gt;blood&lt;/em&gt; from the &lt;code&gt;stone&lt;/code&gt;,&lt;br&gt; turn &lt;em&gt;lead&lt;/em&gt; to &lt;code&gt;gold&lt;/code&gt;.&lt;/p&gt;&lt;p&gt;But then it grew.&lt;/p&gt;&lt;p&gt;It became the &lt;code&gt;price&lt;/code&gt; to &lt;code&gt;pay&lt;/code&gt; to participate in society:&lt;br&gt; “&lt;code&gt;give&lt;/code&gt; us your &lt;em&gt;humanity&lt;/em&gt;, &lt;code&gt;give&lt;/code&gt; us your &lt;em&gt;thoughts&lt;/em&gt;”,&lt;br&gt; “let us &lt;code&gt;profit&lt;/code&gt; from the words &lt;em&gt;you&lt;/em&gt; pen,&lt;br&gt; from the &lt;code&gt;inscriptions&lt;/code&gt; &lt;em&gt;you&lt;/em&gt; carved,&lt;br&gt; from the &lt;code&gt;art&lt;/code&gt; &lt;em&gt;you&lt;/em&gt; created”.&lt;br&gt; The price was free;&lt;br&gt; the cost was everything.&lt;/p&gt;&lt;p&gt;But then it grew.&lt;/p&gt;&lt;p&gt;As &lt;em&gt;we&lt;/em&gt; built &lt;code&gt;machines&lt;/code&gt; to move &lt;code&gt;dirt&lt;/code&gt;,&lt;br&gt; ever faster, ever farther, ever higher.&lt;br&gt; As &lt;em&gt;we&lt;/em&gt; built &lt;code&gt;machines&lt;/code&gt; to construct &lt;code&gt;buildings&lt;/code&gt; ever greater.&lt;/p&gt;&lt;p&gt;So &lt;em&gt;we&lt;/em&gt; did with words, structure, and thoughts.&lt;br&gt; &lt;em&gt;We&lt;/em&gt; built &lt;code&gt;parrots&lt;/code&gt; to &lt;code&gt;speak&lt;/code&gt; sounds of saying,&lt;br&gt; &lt;code&gt;words&lt;/code&gt; of no &lt;em&gt;meaning&lt;/em&gt;,&lt;br&gt; &lt;code&gt;thoughts&lt;/code&gt; of no &lt;em&gt;thinking&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;As &lt;em&gt;we&lt;/em&gt; built 
&lt;code&gt;machines&lt;/code&gt; to speak &lt;code&gt;words&lt;/code&gt;,&lt;br&gt; ever faster, ever farther, ever higher.&lt;br&gt; &lt;em&gt;We&lt;/em&gt; build today &lt;code&gt;machines&lt;/code&gt; to construct &lt;code&gt;noise&lt;/code&gt; ever louder,&lt;br&gt; that &lt;em&gt;we&lt;/em&gt; might &lt;code&gt;drown&lt;/code&gt; out every ounce of &lt;em&gt;humanity&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;Today, &lt;em&gt;you&lt;/em&gt; can find &lt;em&gt;me&lt;/em&gt;, &lt;em&gt;you&lt;/em&gt; can hear &lt;em&gt;my&lt;/em&gt; voice,&lt;br&gt; but tomorrow?&lt;br&gt; I cannot &lt;em&gt;promise&lt;/em&gt; to &lt;em&gt;you&lt;/em&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>The Trap of Soulless Productivity</title>
    <link href="https://hazelweakly.me/blog/soulless-productivity/" />
    <updated>2024-04-03T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/soulless-productivity/</id>
    <content type="html">&lt;p&gt;If there’s one thing I wish I could burn entirely to the ground and wipe away all traces and remnants of, it’s the misplaced notion that the productivity of Knowledge Work can be managed, measured, analysed, and optimised as if all one needed to do was drip feed heroin up the arse of their hapless workers.&lt;/p&gt;&lt;p&gt;What is Knowledge Work™, you ask? There are two concepts of Knowledge Work that I’m thinking about right now. The first is Knowledge Work as Imagined, and the second is Knowledge Work as Done. (I’m temporarily ignoring the actual literature definitions of Knowledge Work for the sake of ranting out some frustration. Forgive me pls)&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Knowledge Work as Imagined&lt;/strong&gt; is when you take the best of humanity, you embrace it, and you turn the lovely unbridled enthusiasm and exploratory nature of humanity into a powerful self-feeding engine that paints the world with the colours of the human soul itself as it learns to understand the world around it. It’s art, beauty, love, and life. 
It’s this amazing fucking thing that happens when you take a bunch of humans and you stick them in a pile and say “go forth and learn to love the world.”&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Knowledge Work as Done&lt;/strong&gt; is what happens when you take art and artistry and creativity and imagination and the soulful awe inspiring wonder of a child and you figure out how to forcibly shove it into something that is roughly shaped like an assembly line.&lt;/p&gt;&lt;p&gt;Knowledge Work as Done is where the love of the world goes to die, it’s where one of the most unique and beautiful aspects of the human mind gets turned into its most terrible weapon, it’s the snake that eats its tail, it’s the adult world equivalent of taking the quiet artist, giving them a wedgie, and shoving them into a high-school locker while you laugh at them and take all their pictures and shove them into chat jippity do dah, zippity day, my oh my, we’re gonna IPO today.&lt;/p&gt;&lt;p&gt;It’s a disgrace.&lt;/p&gt;&lt;p&gt;It doesn’t have to be this way, of course. We could be a lot better at this; we could be infinitely better at this, even. But, that requires understanding what makes Knowledge Work tick, what makes it… Work, and how one might nourish it and encourage it to grow rather than brutally ripping it out by the roots and screaming at it until it learns to behave. In short, understanding Knowledge Work means understanding the human condition itself, and taking a dark look at how we managed to turn humans from a social equitable animal that has unlimited curiosity and a desire to help each other succeed into a raving, bloodthirsty mass of hyperindividualistic demons solely bent on hedonistic self exploitation at the expense of the other. Seriously, how the fuck did we do that? How? 
How did we so deeply and fundamentally break humanity like this?&lt;/p&gt;&lt;p&gt;Now you might be reading this and going “Hazel, that’s a lotta emotions, goodness; but, be real now, how do you actually expect a company to pay millions of dollars for knowledge workers and not want to optimise them?” Well, &lt;em&gt;you&lt;/em&gt;, my dears, are probably not thinking this, but this is unfortunately a realistic question one might ask when attempting to be Doing a Capitalism™.&lt;/p&gt;&lt;p&gt;Sure, fair enough, let me rephrase that question a bit:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;“How does one measure creativity, the growth of institutional knowledge, and the value of that knowledge in terms of dollars per hour?”&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Which is really what you’re asking when trying to define productivity for Knowledge Work. But it probably feels like a more ridiculous question now, doesn’t it? (That’s because it is)&lt;/p&gt;&lt;p&gt;As for the answer to that question? About dollars per hour and Knowledge Work? Here it is: One can no more abuse a dog into loving them than one can “productivity” a knowledge worker into generating a positive ROI.&lt;/p&gt;&lt;p&gt;In fact, you can replace “measuring productivity” with “inflicting animal abuse” and get an accurate idea of what’ll work and what won’t. 
If it &lt;em&gt;sounds&lt;/em&gt; like animal abuse, it won’t actually measure productivity for Knowledge Work.&lt;/p&gt;&lt;p&gt;Here’s an example!&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;BEFORE: I’d like to &lt;code&gt;[[measure productivity]]&lt;/code&gt; by &lt;code&gt;[[tracking the lines of code per hour produced and withhold promotions for the bottom 10% performers]]&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;AFTER: I’d like to &lt;code&gt;[[inflict animal abuse]]&lt;/code&gt; by &lt;code&gt;[[tracking the lines of code per hour produced and using shock treatment on the bottom 10% animals]]&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Sounds horrific, doesn’t it? Guess what: it doesn’t work. Amazing. Who woulda thunk. Fear and abuse and spreadsheet hacking doesn’t help people be creative and share ideas? Astounding really.&lt;/p&gt;&lt;p&gt;Humans want to be creative, humans want to love each other, humans want to make the world a better place, humans want to do art, humans want to &lt;em&gt;be&lt;/em&gt; art, humans want to inspire, humans want to be inspired, humans want to learn, humans want to teach, humans want to heal the world, humans want to heal each other, humans want to collaborate, humans want to build, humans want to be beautiful, humans want to find beauty, humans want to create, humans want to be awed, humans love this fucking universe.&lt;/p&gt;&lt;p&gt;One of the best things for me recently has been watching all of the research fly out about how humans work, how they cooperate, how they &lt;em&gt;really&lt;/em&gt; learn, and guess what? It’s not even a “you can have your cake and eat it too” thing. It’s literally “you can stop eating coal and start eating cake”. Seriously! 
Humans are &lt;em&gt;wired&lt;/em&gt; to be productive &lt;em&gt;by&lt;/em&gt; sharing, by loving, by growing.&lt;/p&gt;&lt;p&gt;I spent my entire life thinking I had to put that aside when Doing Capitalism in order to be successful. That’s not true!&lt;/p&gt;&lt;p&gt;It just breaks my heart that we have &lt;em&gt;so&lt;/em&gt; much out there still that’s stuck in this old way of thinking that the only way to have humans create efficiently is to torture them into submission and rip out their very souls and dump them into the Capitalism Monster. It’s beyond aggravating to have to explain that, no, one can’t measure productivity, but they can measure belonging, and safety, and learning, and all of these wonderful ideas.&lt;/p&gt;&lt;p&gt;Not only is that “fine”, it’s better.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Redefining Observability</title>
    <link href="https://hazelweakly.me/blog/redefining-observability/" />
    <updated>2024-03-15T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/redefining-observability/</id>
    <content type="html">&lt;p&gt;Observability is a bit of a hot topic, and while it’s increasingly been playing a larger role in engineering strategy, I think the way it’s presented can often cause a lot of leaders to miss the value or to over-index on the wrong things. I’m going to present the current definitions of observability that are widely used in engineering and other disciplines, and then introduce my definition; I’ll also be going over what motivated me to develop my definition, and the deficiencies I encounter in the other definitions, especially when it comes to the failure modes of understanding.&lt;/p&gt;&lt;p&gt;For leaders who are pressed for time, I’m going to try something new with this blog post: I’m going to have pulled out sections labeled “leadership insight” so that you can skim this and pull out the key points. Let me know if that’s useful for you!&lt;/p&gt;&lt;h2 id=&quot;definitions-of-observability&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/redefining-observability/#definitions-of-observability&quot;&gt;&lt;span&gt;Definitions of Observability&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;“Observability”, or o11y as it’s often called by aficionados, has two main definitions that people tend to use when talking about it. The first comes from control theory and the second comes from cognitive systems engineering.&lt;/p&gt;&lt;h3 id=&quot;observability:-control-theory&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/redefining-observability/#observability:-control-theory&quot;&gt;&lt;span&gt;Observability: Control Theory&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s the first definition:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.&lt;/p&gt;&lt;p&gt;– Rudolf E. 
Kálmán&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This was a definition that came out of studying linear dynamical systems and rose to prominence in software engineering largely through the efforts of thought leaders in the space bringing the concept over and applying it in a new domain; in particular, Charity Majors is often attributed as being one of the major (hah) voices in bringing this definition into the mainstream attention of software engineering.&lt;/p&gt;&lt;p&gt;Whenever an engineer talks about observability, the odds are very high that this is the definition they have in mind.&lt;/p&gt;&lt;h3 id=&quot;observability:-cognitive-systems-engineering&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/redefining-observability/#observability:-cognitive-systems-engineering&quot;&gt;&lt;span&gt;Observability: Cognitive Systems Engineering&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s the second definition:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Observability is &lt;em&gt;feedback that provides insight into a process&lt;/em&gt; and refers to the work needed to extract meaning from available data.&lt;/p&gt;&lt;p&gt;– David D. Woods’ and Eric Hollnagel’s Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, (Taylor &amp; Francis, 2006), p. 121.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This definition is one that was brought to my attention by the lovely &lt;a href=&quot;https://ferd.ca/&quot;&gt;Fred Hebert&lt;/a&gt;. 
If you’re talking with someone who’s in the cognitive systems engineering space, resilience engineering space, or system safety engineering space, this is the definition they most likely have in mind.&lt;/p&gt;&lt;h3 id=&quot;observability:-hazel&#39;s-definition&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/redefining-observability/#observability:-hazel&#39;s-definition&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;Observability: Hazel’s Definition&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Now, here’s &lt;em&gt;my&lt;/em&gt; definition:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.&lt;/p&gt;&lt;p&gt;– Hazel Weakly&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Naturally, I am not biased in the slightest; it’s merely a natural consequence of me being awesome that this is the best definition out there (just kidding). That said, you might be sitting here and wondering what exactly makes these particular definitions different. Let’s go over that.&lt;/p&gt;&lt;h2 id=&quot;why-do-we-need-a-new-definition-of-observability&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/redefining-observability/#why-do-we-need-a-new-definition-of-observability&quot;&gt;&lt;span&gt;Why Do We Need a New Definition of Observability?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;To me, the point of having a good definition of a concept is that when you have one, that definition should be usable both as a way to centre understanding of a concept, but also to influence the direction in which you explore said concept, and guide you towards grasping all of the &lt;em&gt;implications&lt;/em&gt; of said exploration. 
As an example, one of the problems I have with the control theory definition of observability is that it gives you absolutely zero idea of where to start, where you are, or how to get there. If your system is fully observable, and you &lt;em&gt;know&lt;/em&gt; that it’s observable… Cool, awesome, that’s neat. The rest of us have no idea what the fuck is going on and would like a map of how to get there.&lt;/p&gt;&lt;p&gt;Another problem I have with the control theory definition of observability is that it completely removes the people from the equation; it doesn’t &lt;em&gt;literally&lt;/em&gt; remove them, but you probably aren’t going to think about humans at all when you read that definition. Be real, did you read that definition and go “ah yes this sounds like a people problem”? Probably not, and that’s an issue.&lt;/p&gt;&lt;blockquote class=&quot;border-primary flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Leadership Insight:&lt;/strong&gt; Most implementations of “observability” fail because it’s treated as a tooling problem rather than a strategic capability. Investment in observability is much more similar to Business Intelligence and Market Research than it is to Infrastructure and IT.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The fact that observability is often sold as a tool to infrastructure teams is throwing out the entire point of the idea by burying it in the implementation. Nobody buys PowerBI because they need to invest in “super fancy ass spreadsheet generation capabilities” or some shit like that, and likewise you shouldn’t be buying an observability vendor because you need a way to store system diagnostic information, it literally doesn’t make sense–observability is not a data problem.&lt;/p&gt;&lt;p&gt;So, the control theory definition makes it really hard to think about the people, and it doesn’t give you a starting point, ending point, or a strategy of how to get there. 
Well, that’s not great, so how about the cognitive systems engineering one?&lt;/p&gt;&lt;p&gt;Honestly, I like that one a lot more, and I wish we had popularised that one over the control theory one–while the control theory one helps guide the idea of the &lt;em&gt;implementation&lt;/em&gt; of what an effective component of observability looks like, it doesn’t actually help the practitioner understand what’s going on. That doesn’t mean it’s perfect though: one really glaring thing that is missing from it (and the control theory definition) is the point behind why you care about this in the first place. You have “provide insight into a process” and “the work needed to extract meaning from that insight” and, honestly, why do you care? In addition, there’s still the problem of not really knowing where you are, where you need to go, and how to know that you got there.&lt;/p&gt;&lt;blockquote class=&quot;border-primary flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Leadership Insight:&lt;/strong&gt; A glaring deficiency in existing definitions of observability, to me, is the inability to know how many resources to invest in developing observability as a capability as well as how to invest those resources effectively.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Which leads me to why I like my definition the most:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;I like definitions of concepts that capture the motivation in addition to the essence&lt;/li&gt;&lt;li&gt;Motivating definitions, to me, also contain an implicit sense of direction&lt;/li&gt;&lt;li&gt;If we’re defining a capability, it should be defined as an infinite and incremental process&lt;/li&gt;&lt;li&gt;Learning, without action, isn’t learning, and a definition about evolution that doesn’t include the action step isn’t complete&lt;/li&gt;&lt;/ol&gt;&lt;h2 id=&quot;observability-gone-wrong&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; 
href=&quot;https://hazelweakly.me/blog/redefining-observability/#observability-gone-wrong&quot;&gt;&lt;span&gt;Observability Gone Wrong&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;This is probably my biggest gripe with the current direction of observability. Engineering has always been a bit of a silo from the rest of the business; it’s understandable, of course, you have a very specialised field filled to the brim with a very rapidly evolving internally focused set of concerns–no wonder it’s going to look completely alien to others. Much of the medical field is the same way, and so is the legal field, to give two other examples. However, Engineering had the golden chance of a century: Here we are with complex sociotechnical systems encompassing essentially “every fucking thing a business does to business business” and we have this awesome concept of “we need to understand what we’re doing” and what did we do?&lt;/p&gt;&lt;p&gt;We completely and utterly fucked it up by defining observability to mean “gigachad-scale JSON logs parser with a fancy search engine.” Really? &lt;em&gt;Really?&lt;/em&gt; That’s the “we solve Real Serious Business Problems™” strategy we went with?&lt;/p&gt;&lt;p&gt;It just feels so tragic; what a waste of potential for building avenues of cross-functional understanding and communication.&lt;/p&gt;&lt;h2 id=&quot;meaningful-questions&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/redefining-observability/#meaningful-questions&quot;&gt;&lt;span&gt;Meaningful Questions&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;So okay, fuck it, let’s throw away the current concept of observability and think seriously for a moment: What does it mean to &lt;em&gt;ask meaningful questions&lt;/em&gt;?&lt;/p&gt;&lt;p&gt;Here’s what that means to me. 
A meaningful question requires a few different components:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Anyone in the company should be able to ask a question&lt;/li&gt;&lt;li&gt;That question should be meaningful to &lt;em&gt;them&lt;/em&gt;&lt;/li&gt;&lt;li&gt;“Meaningful” is not a concept that has any restraints or limitations or domains: if it’s meaningful, you should be able to ask it&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;I’m going to expand on that “meaningful” part because I think it’s particularly necessary and that most people have far too limited of an idea of what should be possible here. Imagine you have a group of people collaborating together on understanding a problem; you’re going to have a context of understanding that spans more than one person, and you can roughly understand that context to be a composite of multiple parts. Let’s break up components of “meaning” into things you can combine together to get a composite scope for your question:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;The “vertical” context, in the sense of stream aligned teams&lt;/li&gt;&lt;li&gt;The “horizontal” context, in the sense of functional areas.&lt;/li&gt;&lt;li&gt;The size of the subgroup in question: the individual, the team, the vertical, the organisation, the enterprise, the market, and so on.&lt;/li&gt;&lt;li&gt;The time period in question: past, present, future, in six months, monthly, “every time we have a board meeting”, “if/when our competitor has an IPO”, etc&lt;/li&gt;&lt;li&gt;The audience in question: a service, a team, an organisation, a customer segment, an industry, a group of services, a cluster, a computer, …&lt;/li&gt;&lt;li&gt;There’s a lot more you could add, depending on what you care about, but you get the idea&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Let’s take the question “are we healthy” and blend that with various composite scopes in order to get a few examples of meaningful questions to illustrate this more concretely.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;I am an Engineer on Team A that is working 
on service A1. Is service A1’s &lt;code&gt;/health&lt;/code&gt; endpoint returning a successful response 99.9% of the time over a 5 minute interval?&lt;/li&gt;&lt;li&gt;I am an Engineering Manager of Team A that works on services A1, A2, and A3; is our team within our stated SLAs with our customers for the quarter?&lt;/li&gt;&lt;li&gt;We are the Senior Engineering Manager and Senior Product Manager overseeing teams A, B, and C. Are we communicating effectively with each other, are we understanding each other, and are we building things that are in alignment with both our vertical’s OKRs as well as the rest of the organisation?&lt;/li&gt;&lt;li&gt;I am an Engineering Director of Org ABC, are we making the right trade-offs between feature work and reliability work so that we can maximise value delivery while not compromising on engineering health, employee attrition, customer satisfaction, and fiscal concerns?&lt;/li&gt;&lt;li&gt;I am a Product Manager, of these 50 features, which ones have the most synergy with what our GTM research is indicating we need to develop, and which ones can be designed in a way that our engineers have room to bake in reliability work &lt;em&gt;into&lt;/em&gt; the product implementation so we can maximise roadmap velocity?&lt;/li&gt;&lt;li&gt;I am a Director of Customer Success that oversees customer support for the services of Org ABC, are we building the right internal tools to maximally enable our CSE function while also gaining the ability to understand what classes of customer support to automate or proactively mitigate?&lt;/li&gt;&lt;li&gt;I am the VP of Engineering, are we designing our engineering culture and engineering process in a way that maximises productivity and ensures alignment of development work with the company north star?&lt;/li&gt;&lt;li&gt;I am the CTO, are we preparing our architecture to strategically position ourselves against the market today as well as ensuring that we build capabilities that allow us to rapidly 
innovate five years in the future?&lt;/li&gt;&lt;li&gt;I am the CISO, what is our business continuity profile, how does our risk profile look, and are we working effectively with other functions to ensure that appropriate trade-offs are being made to keep us in the clear in a cost-effective manner?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;I could write hundreds of these, but the point is more that “are we healthy” is meaningful in so many ways that it’s going to be a different question, not only for every person who asks it, but &lt;em&gt;every time a person asks that question&lt;/em&gt;. Asking the same question twice is not something that should be happening, because you won’t be the same company that you were when you asked the question last. Even if you asked the question yesterday, or an hour ago, you’re a different company now, with different context, different aims, different information, different everything.&lt;/p&gt;&lt;blockquote class=&quot;border-primary flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Leadership Insight:&lt;/strong&gt; You will never ask the same question twice. 
That’s why observability is a &lt;em&gt;process&lt;/em&gt; of &lt;em&gt;capability development&lt;/em&gt;.&lt;/p&gt;&lt;/blockquote&gt;&lt;h2 id=&quot;useful-answers&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/redefining-observability/#useful-answers&quot;&gt;&lt;span&gt;Useful Answers&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;If we have a better understanding of what a meaningful question is, that’s cool, but that isn’t super useful for the business if we don’t have an idea of what a useful answer is.&lt;/p&gt;&lt;p&gt;For me, useful answers also have a few different components:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The answer should be useful by way of concretely moving them closer to achieving &lt;em&gt;stated or unstated business goals.&lt;/em&gt; Answers that are theoretically useful or maybe useful or “huh that’s neat” or “I might use that someday I guess” don’t count.&lt;/li&gt;&lt;li&gt;The answer’s utility should not require the answer to be “correct” or “factual” in any way.&lt;/li&gt;&lt;li&gt;While questions only need to be meaningful to someone, answers should try to be useful to everyone.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;That’s… A lot harder than it looks. But luckily we have a saving grace: throw away your desire to have truthful, factual, or correct answers to meaningful questions.&lt;/p&gt;&lt;p&gt;Seriously, I mean it. I don’t mean it in a “we live in a post truth world” bullshit way, I mean it in the understanding of reality that comes when you realise that because everyone’s context and understanding and interpretation of the world is different, there is no way to ever arrive at a definition of “correctness” or “truth” or “fact” that is also useful for a situation that is not absolute and objective. This might terrify you, but lean into it and let it liberate you. 
Answers are useful if they let you move forward with concrete action: that’s it.&lt;/p&gt;&lt;blockquote class=&quot;border-primary flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Leadership Insight&lt;/strong&gt;: If you’re asking a meaningful question, it’s not going to have an objective answer; it’s subjective by definition because the meaning itself is subjective.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;You know that phrase that everyone loves to quote? “Disagree and commit”? I hate it. I think it’s a phrase that causes a lot more harm than good because it’s quoted so often out of context and used frequently as a cudgel by leadership to force top-down consensus when it was originally intended to be a reminder to leaders to trust the people you hired.&lt;/p&gt;&lt;p&gt;That said, if you take the concept of trusting those you work with, and you throw away the oppositional and aggressive framing it’s buried in, you get something really cool: trust the questions people ask and utilise the answers they learn.&lt;/p&gt;&lt;p&gt;Get rid of “disagree and commit” and lean into “ask meaningful questions, get useful answers, and act on what you learn.” As a leader, it’s your job to help enable as many answers as possible to be meaningful to the business.&lt;/p&gt;&lt;h2 id=&quot;process-of-development&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/redefining-observability/#process-of-development&quot;&gt;&lt;span&gt;Process of Development&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I want to tackle the other part of my definition now, which is that we have this process and it’s a process through which one develops an ability. What does that mean? It means you start out &lt;em&gt;being fucking terrible at it&lt;/em&gt; and that is a Feature, Not a Bug™.&lt;/p&gt;&lt;p&gt;Think back to the first time you tried to do anything in engineering, or marketing, or sales, or any other part of your professional career. 
Not only was it natural for you to be bad at something, it was actually a good thing; getting things wrong is a necessary and integral part of the learning process itself. It’s through correction, evolution, enhancement, and iteration that you develop so many vital skills and hone your intuition. If you didn’t have that, and you just made the right choices, you’re not smart, you’re just lucky. Leaders don’t like being lucky for a reason: it doesn’t scale, and it’s terrible luck to be lucky.&lt;/p&gt;&lt;p&gt;What that means to me for observability is that at the beginning, you’re going to be severely limited in the breadth, depth, scope, and nuance of your questions. But that’s okay! The simple questions are still meaningful questions to ask. This is something I see people trip up on a lot, so I want to hammer it home here.&lt;/p&gt;&lt;p&gt;In an ongoing process of iterative development, the progress itself &lt;em&gt;is&lt;/em&gt; the output. You can’t ask a sophisticated question without having first asked a simple one; that’s just not how it works. Imagine going into a fiscal planning meeting and asking “hey what’s the Discounted Cash Flow analysis broken out for our various business units” and everyone’s still busy clarifying what each business unit needs to declare as CapEx vs OpEx. 
Not only are you talking completely past everyone and derailing the entire meeting, but &lt;em&gt;you are going to get the wrong answer&lt;/em&gt; and you will set yourself up for failure in the future by trying to ask a question like that before you have the basics down.&lt;/p&gt;&lt;blockquote class=&quot;border-primary flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Leadership Insight:&lt;/strong&gt; Asking the basics is not a sign of incompetence, it’s a sign of trusting the process and developing your observability “muscle.”&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;For computer systems, your basics are probably going to look something like this (in order of increasing sophistication):&lt;/p&gt;&lt;ol&gt;&lt;li&gt;“Is our service reachable internally”&lt;/li&gt;&lt;li&gt;“Is our service reachable externally”&lt;/li&gt;&lt;li&gt;Ok, cool cool cool, uptime is a lie, whatever: what is our uptime anyway?&lt;/li&gt;&lt;li&gt;Is our service reasonably performant?&lt;/li&gt;&lt;li&gt;Is our service reasonably cost effective? &lt;ul&gt;&lt;li&gt;This is where “traditional” monitoring usually stops&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Repeat all of the above but for each sub-service&lt;/li&gt;&lt;li&gt;Repeat all of the above but for each endpoint &lt;ul&gt;&lt;li&gt;This is where “modern observability” starts to really differentiate itself&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Repeat all of the above, but from the perspective of an individual end user &lt;ul&gt;&lt;li&gt;This is where SLOs start to really become necessary as a tool for asking questions&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;From the perspective of an individual end user, what’s the performance of an end-to-end request, segmented by every point in the chain? &lt;ul&gt;&lt;li&gt;This requires distributed tracing&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Which of these various tuning options has the best performance characteristic? 
&lt;ul&gt;&lt;li&gt;A/B testing and other variation functionality become invaluable here&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;How does our system behave in various situations that we might not have accounted for? &lt;ul&gt;&lt;li&gt;This is where chaos testing, fault injection, and other experimentation strategies start&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Where are the most effective points in the system to leverage humans for adaptive capacity? &lt;ul&gt;&lt;li&gt;(your next $1 billion startup goes here)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;So looking at this, and then looking at your company, you’ll notice that a &lt;em&gt;lot&lt;/em&gt; of companies are only realistically somewhere between 1 and 3. That’s okay! It’s completely fine to not go further as long as the questions that are &lt;em&gt;meaningful&lt;/em&gt; to the business don’t require anything more sophisticated. Because after all, if you have no need to ask more nuanced questions, why would you need to develop further sophistication in your observability strategy?&lt;/p&gt;&lt;p&gt;Some companies deeply need to be able to ask very nuanced questions around how humans and technology interoperate in a variety of unanticipated areas with a lot of unknown unknowns under very tight operating constraints. 
Some only really need to know “code go in, money get made.” That’s not a failure of the business; the only failure here is investing disproportionately to your need.&lt;/p&gt;&lt;blockquote class=&quot;border-primary flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Leadership Insight:&lt;/strong&gt; That said, while the only failure of observability is investing disproportionately to your need, most companies are investing either too much or too little into observability.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;In my experience, I see most companies investing too much money into observability with very little meaningful return on investment because they keep treating it as a tech and tooling problem rather than a research capability.&lt;/p&gt;&lt;h2 id=&quot;tying-things-together&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/redefining-observability/#tying-things-together&quot;&gt;&lt;span&gt;Tying Things Together&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;We had the Control Theory definition of observability, and the Cognitive Systems Engineering definition of observability, and then I presented my definition of observability:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;We also went over what the “meaningful questions” and “useful answers” bits mean, and we went over the process of developing an ability. When we combine those two, we get something that actually really reminds me of the five levels of expertise in the Dreyfus model of skill acquisition (novice, advanced beginner, competent, proficient, expert).&lt;/p&gt;&lt;p&gt;Which, honestly, I love that; you absolutely should be thinking of observability as developing an organisation-wide capability of asking meaningful questions and getting useful answers. 
Of course, once you have a useful answer, you have the final part: acting on it.&lt;/p&gt;&lt;p&gt;Learning, without action, isn’t learning; it’s fundamentally a process. And processes? Processes are messy, they require action, they require movement, they require &lt;em&gt;doing&lt;/em&gt;, they require re-evaluating the process, they require evolving the process, they require wrangling with the human condition itself.&lt;/p&gt;&lt;p&gt;Just like observability.&lt;/p&gt;&lt;p&gt;To put it simply, observability is organisational learning.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Engineering Language as a Vehicle of Innovation</title>
    <link href="https://hazelweakly.me/blog/engineering-language/" />
    <updated>2024-03-08T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/engineering-language/</id>
    <content type="html">&lt;p&gt;Something that I find missing in almost every software company is this thing that I’m not sure I’ve seen explicitly called out anywhere, but I’m going to call it an Engineering Language. This Engineering Language is something that I’m going to attempt to describe, motivate, outline, and then illustrate with an example.&lt;/p&gt;&lt;h2 id=&quot;engineering-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#engineering-language&quot;&gt;&lt;span&gt;Engineering Language&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The Engineering Language is something that I would consider to be a living embodiment of how engineers speak, think about, describe, and express what they think in that problem domain. It’s not a programming language, or a DSL; it’s similar to a Design Language, but for software engineering and architecture more directly. The Engineering Language is the tool that you use to build foundations of thought and mental models and concepts themselves, so that one can coordinate the intangible nothingness of abstraction itself.&lt;/p&gt;&lt;p&gt;I think this Engineering Language is comprised of three things: An abstraction language, a protocol language, and an interface language. Together, those three things make up something that is greater than the sum of its parts.&lt;/p&gt;&lt;h2 id=&quot;motivating-the-engineering-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#motivating-the-engineering-language&quot;&gt;&lt;span&gt;Motivating the Engineering Language&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;“If there is no language, are there thoughts we can think?” It’s an interesting question, but I find it unsatisfactory; here’s a different question that keeps me up at night. 
“How can I share the thought that I think, if even language is insufficient and inadequate for this task?”&lt;/p&gt;&lt;p&gt;Let me utter a word into the air, let me breathe this thought into your mind’s eye: Echo.&lt;/p&gt;&lt;p&gt;What does that word convey to you? I had an extremely specific mental image in my head when I wrote that word, and I know that I could spend hundreds of words, or even dozens of pages, explaining that mental image and still you would not share it with me perfectly. Do you have any idea how discouraging it is to spend your entire life’s work building mental abstractions and making them concrete in a technical sense and a human sense, yet be completely unable to even convey so much as a single thought? It’s ridiculous; we can build towers, we can see countless infinities, we can push boundaries unimaginable, we can kiss the stars, but we can’t even share a thought with each other? How many tens of thousands of years have we spent building this ability to speak and express oneself? And for what? Absolutely nothing?&lt;/p&gt;&lt;p&gt;For all the weaknesses and inadequacies that language has, however, nothing else comes close to enabling the same fidelity of communicating thought. There’s a reason that the pen is mightier than the sword, after all. It seems to me, then, that if one is to scale the act of creating complex thought and nuanced abstractions and building the scaffolding upon which we construct towers of ideology that we call understanding… That one needs language.&lt;/p&gt;&lt;p&gt;There’s a more concrete motivation here as well. One of the most beautiful aspects of education and knowledge is that we’ve managed to figure out how to take a messy, non-linear, ball of mud that is “knowledge” and turn it into something that is fascinatingly incremental. 
Somehow, we’ve managed to figure out a path through which you can start from counting and the alphabet and end up with mathematics, philosophy, linguistics, and more.&lt;/p&gt;&lt;p&gt;While that in and of itself is fascinating, there’s something in there that I think is even more amazing: it feels linear. How the absolute fuck did we manage to build a vehicle of transmitting knowledge that is mostly somehow linear in feeling even though the world is messy, information explosion is combinatorial, cardinality is uncountable, and certainty is unknowable? How? How did we do that? We don’t celebrate this miracle of knowledge nearly enough, in my opinion; of all of our achievements among humanity, this should rank as one of the greatest.&lt;/p&gt;&lt;p&gt;I’m going to switch gears for a second and talk about a theoretical business. Imagine this business, which is going to solve a problem, with a product or a service or whatnot, and tackle a certain market. In order to do so, one might start writing some software and doing some market research, validating things, learning about the domain, and so on. Something curious will eventually happen: No matter how carefully one writes the software, or how adaptable one tries to remain, the company will eventually reach two critical points of solidity:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Some evolution in the software will disproportionately become exponentially difficult relative to its “actual” complexity&lt;/li&gt;&lt;li&gt;Some evolution in market strategy, positioning, or product development, will disproportionately become exponentially difficult relative to its “actual” complexity&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;But somehow, this isn’t the case with language? It isn’t the case with the things we learn? How? 
How is it so different?&lt;/p&gt;&lt;p&gt;If we are to achieve this sort of linearity of growth as knowledge for a business domain develops, and if we are to do so in a way that lets us express this knowledge and make it concrete through computation, then surely we need a language of some sort. An Engineering Language.&lt;/p&gt;&lt;h2 id=&quot;outlining-the-engineering-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#outlining-the-engineering-language&quot;&gt;&lt;span&gt;Outlining the Engineering Language&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;As I said earlier, I think the Engineering Language has three parts to it:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;An abstraction language&lt;/li&gt;&lt;li&gt;A protocol language&lt;/li&gt;&lt;li&gt;An interface language&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Let’s break that down a bit.&lt;/p&gt;&lt;h3 id=&quot;abstraction-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#abstraction-language&quot;&gt;&lt;span&gt;Abstraction Language&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;When we talk about abstraction, what makes a &lt;em&gt;good&lt;/em&gt; abstraction? I think a good abstraction is one that is both opaque and transparent. A good abstraction is opaque in the sense that it is not necessary to ever reason about something underneath the layer of the abstraction; it should not leak, it should not break, it should deliver on what it promises, it should behave as an abstraction rather than a leaky shortcut. 
A good abstraction is transparent in the sense that it is not necessary to know the abstraction in order to reason about something below it, at no point is the interaction of the abstraction “magical”, at no point does the abstraction require you to have 100% knowledge of the abstraction and 100% knowledge of the thing it abstracts and 100% knowledge of how those two mesh together; lastly, a good abstraction is derivable in that if you see a new instance of it, it behaves logically in a way that you can reason about the implementation accurately.&lt;/p&gt;&lt;p&gt;Abstractions become useful precisely when they are able to be depended on &lt;em&gt;and&lt;/em&gt; ignored, when they are able to be mixed and integrated, built on top of and built around. Abstractions should exist to coexist.&lt;/p&gt;&lt;p&gt;Which means, of course, that only some things &lt;em&gt;can&lt;/em&gt; be a good abstraction; fundamentally, how you design the lower layers of your infrastructure and your software and your sociotechnical system will dictate quite literally the constraints of what can and cannot be expressible as an abstraction &lt;em&gt;at all&lt;/em&gt;. No amount of papering over something will let you break the laws of physics, no amount of fudging the numbers will make time run backwards, and no amount of magical bullshit sprinkles will solve fundamental limitations of distributed systems, and no technical solution will ever solve a people problem.&lt;/p&gt;&lt;p&gt;But if you only have some things that can be a good abstraction, surely you need a language to express and help enumerate the possible abstractions one can build. Not only that, but the language should help you express why those are good abstractions, why certain others aren’t, and help other people build combinations of abstractions and towers of them in a way that preserves the coherence and alignment at scale. 
That is something I don’t really see anyone doing, but it’s sorely sorely needed.&lt;/p&gt;&lt;h3 id=&quot;protocol-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#protocol-language&quot;&gt;&lt;span&gt;Protocol Language&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;If an abstraction is a mental construct turned into a tangible building block of conceptual thought, then protocols are the cement through which you build the towers of your imagination. Any system needs communication, coordination, coherence, adaptive capacity, failure handling, modularity, and more. All of those things have one thing in common: You build the facilities which enable those by building a protocol.&lt;/p&gt;&lt;p&gt;But again; the shape of your system determines the shape of what can be a good protocol, which means you need a language for defining and conceptualizing what it even means for something to &lt;em&gt;be&lt;/em&gt; a protocol and to interface with other protocols.&lt;/p&gt;&lt;h3 id=&quot;interface-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#interface-language&quot;&gt;&lt;span&gt;Interface Language&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;This one is tricky. We have abstractions, and we have protocols, so what makes an interface different from those? To stretch the construction analogy a bit more: if abstractions are bricks, and protocols are cement, then interfaces are the blueprints that let everything flow through the building correctly.&lt;/p&gt;&lt;p&gt;Abstractions enable growth by allowing one to compose ideas, protocols enable growth by allowing one to compose systems, and interfaces enable growth by allowing one to compose interactions.&lt;/p&gt;&lt;p&gt;Naturally, I love interfaces, and have a horrible time explaining what I mean here; I’ll give it a shot. 
I don’t really mean interfaces in the &lt;code&gt;abstract interface List&amp;lt;T&amp;gt;&lt;/code&gt; sense; that’s useful, but also far too low level. As a slightly better example, one could think of kubernetes as a protocol, as an abstraction, as an interface, or as any combination of those; when building a platform for others, I prefer to think of it internally as an interface and externally as a protocol. Internally, I use it as an interface and build things with it and compose all the possible interaction points people might have with the distributed system and glue them together in a coherent way; but I don’t expose the interface really, I expose the protocol so that people know how to communicate with the system. It’s a subtle difference, and I’m not sure I’m explaining it well.&lt;/p&gt;&lt;h2 id=&quot;an-illustrated-example&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#an-illustrated-example&quot;&gt;&lt;span&gt;An Illustrated Example&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;This example is either going to make a ton of sense, or absolutely zero sense. Let’s look at something that has managed to do this quite well: The web browser.&lt;/p&gt;&lt;h3 id=&quot;browser-abstraction-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#browser-abstraction-language&quot;&gt;&lt;span&gt;Browser Abstraction Language&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;What are the building blocks of a browser? What makes good ones, bad ones, weird ones, or even just possible ones? I think, honestly, that there’s only two main ones.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;HTML + Accessibility Object Model + CSS&lt;/li&gt;&lt;li&gt;URLs&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;HTML, CSS, and the Accessibility Object Model are the main languages that let you even conceive and describe what it means to “be in” the browser at all. 
They help define the capabilities of it, the limitations of it, and shape what it means to be the web in a tactile sense.&lt;/p&gt;&lt;p&gt;But URLs? They &lt;em&gt;are&lt;/em&gt; the web. URLs are the most defining aspect of the web and are so key that they are simultaneously an abstraction language, a protocol language, and an interface language.&lt;/p&gt;&lt;p&gt;Javascript doesn’t count here; it’s not an abstraction, it’s an interface. It doesn’t create new abstractions, it surfaces ways you can interact with them; the fact that only some things are exposed via Javascript is a perpetual wart and flaw in the design of modern browsers and it continues to be a glaring omission in their design.&lt;/p&gt;&lt;h3 id=&quot;browser-protocol-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#browser-protocol-language&quot;&gt;&lt;span&gt;Browser Protocol Language&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;When we think of protocols, we likely start thinking of: TCP, UDP, service workers, https, http/2, http/3, and websockets; we might get into an argument about whether or not those last ones count, or whether or not http/2 and http/3 are different protocols or not, but we certainly use all of these like protocols.&lt;/p&gt;&lt;p&gt;They’re not a protocol &lt;em&gt;language&lt;/em&gt;, though; they’re manifestations of that language.&lt;/p&gt;&lt;p&gt;The protocol language of the browser is simple: it’s the URL. &lt;code&gt;protocol://domain/sub/resource?key=value&amp;amp;metadata&lt;/code&gt; Look at that thing. 
It’s glorious, it’s gorgeous; contained within that language are the empires of thousands of libraries, millions of lines of code, dozens of protocols, and more.&lt;/p&gt;&lt;p&gt;The language of the URL helps shape what it even means to be able to think about building a protocol for the web, and it’s why we can instinctively feel like REST is a “web native” RPC, but most others, such as gRPC, are not.&lt;/p&gt;&lt;p&gt;Fuckin love URLs&lt;/p&gt;&lt;h3 id=&quot;browser-interface-language&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#browser-interface-language&quot;&gt;&lt;span&gt;Browser Interface Language&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;There are two interfaces I want to talk about here. I’m going to intentionally avoid the accessibility interfaces (and lack of them) here because I’ll blow a gasket and rant for a few thousand words if I get started on that.&lt;/p&gt;&lt;p&gt;ANYWAYS&lt;/p&gt;&lt;p&gt;The two interfaces that I want to talk about are URLs and Javascript. What makes a URL an interface here? Well, simply that how people interact with the web browser or initiate the web browser and do all of that is… Through URLs. Want to open the browser? Most people now actually just click on a URL anywhere on the computer, any time, anywhere, and expect a browser to open up spontaneously.&lt;/p&gt;&lt;p&gt;That’s honestly remarkable. It’s absurd how pervasive that idea is; can you imagine literally anything else in computing where, regardless of whether it’s an iPhone or Android or a desktop or a laptop or any OS in the last 20 years, everything works the same way? Click link, see site, never think about whether or not you need to start the browser first. Truly magical. Now &lt;em&gt;that’s&lt;/em&gt; an interface language.&lt;/p&gt;&lt;p&gt;It’s a language in the sense that it lets you know the limitations and lets you conceive of new possibilities. 
Did anyone imagine deep linking was going to be a thing in mobile apps back in 2004? Of course not; we didn’t even have iPhones yet. (Yes yes I hear you shouting in the background there Plan 9, shh, it’s ok)&lt;/p&gt;&lt;p&gt;Javascript is, well, Javascript; of all the interfaces with the browser, very few are as raw and deeply embedded as the programming engine through which we decided to shove the entirety of all and everything through.&lt;/p&gt;&lt;h2 id=&quot;where-am-i-going-with-this&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/engineering-language/#where-am-i-going-with-this&quot;&gt;&lt;span&gt;Where Am I Going With This?&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;If you’ve made it this far, congratulations, you got to read my ramblings for a bit on an Engineering Language, with apologies for being a bit tired while writing this and not proofreading it in the slightest before yeeting it onto the internet.&lt;/p&gt;&lt;p&gt;But really, what’s the point here? The point for me, simply, is that I don’t think tech companies are thinking enough about what it means to build a language for engineering. How does one go about building something in a way that you can distribute tools of thought that are so deeply embedded in people’s workflows that they learn to conceptualise thought intuitively in a way that’s aligned with the direction that you need to go? How do you make software architecture where coherence with the company vision is an &lt;em&gt;emergent property&lt;/em&gt;? Is that even a thing people think is possible? I think it is, I just also think we suck at doing that.&lt;/p&gt;&lt;p&gt;I see it in a really bad way where you get software debt built up in such a way that you can’t meaningfully explain to anyone why one idea takes two weeks and another takes two years to implement. What a waste of talent and time all around. 
I’d love to see a world where instead of pontificating about tech debt or agile practices or wanking about the OKR-go-round, we figured out how to actually build cross-functional communication in a meaningful sense. What if a product manager actually had the language to have a meaningful conversation with a software architect and a solution engineer and marketing and UX research? What if we were able to build software in a way that we could proactively identify opportunities for alignment and that such opportunities for synergetic product and feature development happened &lt;em&gt;naturally&lt;/em&gt; and &lt;em&gt;organically&lt;/em&gt;?&lt;/p&gt;&lt;p&gt;Anyone who thinks they have cracked the formula for doing so is lying; there’s no way we’ve figured this out as an industry, and I’m doubtful we ever will figure out an actual methodology and pedagogy for teaching this type of thing. That said, I think it’s possible to do so for &lt;em&gt;a&lt;/em&gt; company and &lt;em&gt;a&lt;/em&gt; set of circumstances.&lt;/p&gt;&lt;p&gt;Whoever figures it out for &lt;em&gt;their&lt;/em&gt; company and &lt;em&gt;their&lt;/em&gt; circumstances is going to massively increase their chances of success. That is, if they can get everyone speaking the same language.&lt;/p&gt;&lt;p&gt;Which is, of course, an entirely separate problem with its own massive difficulties.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Observations of Leadership (Part One)</title>
    <link href="https://hazelweakly.me/blog/observations-of-leadership-part-one/" />
    <updated>2024-03-01T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/observations-of-leadership-part-one/</id>
    <content type="html">&lt;p&gt;I read &lt;a href=&quot;https://cutlefish.substack.com/p/tbm-274-how-capable-leaders-navigate&quot;&gt;this post&lt;/a&gt; from John Cutler and Tom Kerwin recently on how leaders navigate uncertainty and ambiguity and it intrigued me. I decided to give my shot at answering these as a writing exercise and as an opportunity for self reflection. The past few quarters have seen a lot of change for me, and I haven’t taken the time I need to reflect as much as I would otherwise wish; this seems like as good of an opportunity as any. For each of these, I’m going to copy in the interview question and then answer it very similarly to how I would answer it during an interview (but without any of the time or brevity constraints). I’m actually quite curious to see what other people have to say about my answers, and what answers others have of their own.&lt;/p&gt;&lt;p&gt;As a brief bit of background, I’m going to be referring to my current job quite a bit, but how I’m doing so is probably going to be a bit confusing because it’s been a very unusual journey. 
Here’s the very shortened timeline:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;I came into the company as an IC.&lt;/li&gt;&lt;li&gt;Shortly after doing so, our Head of Infrastructure and Engineering Manager (same person) left; I stepped up to assume the role in the interim while we looked for a new hire.&lt;/li&gt;&lt;li&gt;After one quarter (and some change), we hired our new Head of Infrastructure, and I stayed on as “just” the Engineering Manager of the team for another quarter.&lt;/li&gt;&lt;li&gt;At the start of the year, we made the decision to transfer a Director from elsewhere in the company into my role, as the role had expanded.&lt;/li&gt;&lt;li&gt;In doing so, I stepped into my current role as Principal Architect of the Platform Organization (which is what I was essentially hired to do in the beginning).&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;I do plan to write about this in the future in more detail, because I think there were a lot of things to unpack and a lot of things to learn; frankly, we don’t write enough about the “interim” roles and how to set them up for success. So much of the writing on leadership out there assumes a 2-3+ year timescale; it’s not &lt;em&gt;wrong&lt;/em&gt; for doing so, but there were quite a few things I didn’t do effectively because I didn’t have experience in being an &lt;em&gt;interim&lt;/em&gt; leader (or, well, any sort of leadership, to be honest). But, this is not that article; this is the article where I go &lt;em&gt;way&lt;/em&gt; too in depth on all of these questions.&lt;/p&gt;&lt;p&gt;It’s going to be quite long, sorry-not-sorry. 
This is also going to have to be a multi part series because I started writing this a week ago and only made it through five of the questions before realizing how long it had already gotten.&lt;/p&gt;&lt;h2 id=&quot;accept-we-are-part-of-the-problem&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#accept-we-are-part-of-the-problem&quot;&gt;&lt;span&gt;Accept We Are Part of the Problem&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Can you share a specific instance when you recognised your contribution to a problem? What led to this realization, and how did it influence your actions in the future?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/accept-we-are-part-of-the-problem&quot;&gt;https://cutlefish.substack.com/i/142017363/accept-we-are-part-of-the-problem&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Firstly, I love this question, what a banger to start things out with. It’s not about failure, it’s about learning and growth, but in a different perspective than I see most leadership questions tackling.&lt;/p&gt;&lt;p&gt;Here’s an instance for you, looking back into my time most recently as an Interim Director of Infrastructure. To put it kindly, I stepped into this role because there was an urgent need at the company and I was able to address it; in no way was I particularly qualified for it, and I most certainly was not experienced. 
I’m going to lay out the situation briefly, break it down into external factors, internal factors, and then address the part where I realised later (with the help of my SVP) what I could’ve done differently; in full transparency, I’m still working on the “how did it influence your actions” part myself.&lt;/p&gt;&lt;p&gt;Getting to the situation in question, as I perceived it: We had a critical under-investment in infrastructure, resulting in a team that was extremely underwater, had far too high work in progress, and was unable to even communicate the problem in a way that external stakeholders could understand. When I came in, one of the first things I did was to address this by attempting to increase visibility here. By all accounts, I was wildly successful: During my tenure so far, we’ve gone from 5 ICs and one manager to multiple teams, including a dedicated Data Infrastructure team, dedicated Developer Experience team, a platform team, and infrastructure team. We have an amazing SVP now (note: titles are a bit fuzzy here still, my usage of titling here reflects scope more than reality), and we’ve been able to hire what is the most diverse and welcoming organisation in the company. I can’t stress this enough: I am enormously proud of this organisation.&lt;/p&gt;&lt;p&gt;Now, let’s get to the part where I fucked up: to put it directly, I did an &lt;em&gt;okay&lt;/em&gt; job at showcasing the severity of the situation, and I could’ve done much better. One of the things that’s so difficult about leadership is that you can really only start to realise this type of thing by the nature of the conversations you have months down the road after it’s a bit too late to directly address them. 
If I were to break down an ideal scenario for what I could’ve done, it would be:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Recognise and position myself as an interim director with the sole focus of preparing for the next change in leadership&lt;/li&gt;&lt;li&gt;In doing so, one of the highest impact things I could’ve done was: documenting, describing, and quantifying the scope of the problem. I did the describing part really well, I’m proud of that; but I did precious little documentation of it, which led to repeated conversations and some uncomfortable moments for my SVP as she came in and had very little ability to immediately present clear and quantifiable cases to the rest of leadership for the problem that she and I were both able to articulate.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The ability &lt;em&gt;to&lt;/em&gt; articulate the problem at all is something I helped develop, but the impact of that was greatly diminished by not quantifying and documenting the problem. I paid for that mistake a lot over the next two quarters and I’m still paying it down now. The repeated conversations, the lack of ability to transfer understanding over, and the difficulty in presenting information in a way that our CTO could push up and do effective global resource management for the company in a way that best meets its needs was a big miss here; while we had a significant amount of contributing factors there, my inexperience played a huge part as well.&lt;/p&gt;&lt;p&gt;That said, I really enjoyed the opportunity to learn in a very short amount of time exactly &lt;em&gt;what&lt;/em&gt; type of information people need in order to express these types of thorny issues; I’m very good at identifying and describing them, and I’ve been unusually good at convincing people and aligning them around solutions, but you have to go several steps further than that in leadership. 
It’s not enough to get everyone to go “yeah that’s great, let’s solve this problem”; that’s just the beginning.&lt;/p&gt;&lt;p&gt;You have to be able to present the information and package it up in a way that it can be measured and balanced against the needs of the &lt;em&gt;entire&lt;/em&gt; company, all the way up to the board if necessary. That’s hard! Most people take years to learn that this type of packaging is even necessary or what it even looks like! I’m beyond fortunate to have had a crash course in this while still being able to have the right outcome that we needed at the time.&lt;/p&gt;&lt;h2 id=&quot;encourage-new-interaction-patterns&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#encourage-new-interaction-patterns&quot;&gt;&lt;span&gt;Encourage New Interaction Patterns&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Describe a situation where you facilitated new ways for people to interact or share information. Or a situation where you exposed people to new kinds of information or experiences. What prompted you to make the change, and what was the outcome?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/encourage-new-interaction-patterns&quot;&gt;https://cutlefish.substack.com/i/142017363/encourage-new-interaction-patterns&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This is a fun one! I really love thinking about interaction patterns; they’re so influential in determining how people think about problems, how they tackle them, and how you can attempt to influence various emergent properties of a group of people. 
Consequently, I really like to think of this in terms of “what is the collaboration outcome that we need and how do we address those deficiencies while emphasizing and leaning into what we’re already good at?”&lt;/p&gt;&lt;p&gt;Here’s one that I really liked: in my team, when I was the interim Director and Engineering Manager, we had this problem of siloed information; because everyone had been so underwater for so long, the vast majority of work was interrupt driven. What I mean by interrupt driven work here is work that is primarily driven by asks from others and external demand rather than being planned or orchestrated; while that might be considered a normal flow of work for some infrastructure teams, it’s not optimal for teams that do more than “call desk” style support, and so we needed to find a way to address that. Consequently, people ended up specializing in the interruptions they could solve the quickest, and so we had “the person who knows how to do X”, and “the person who knows how to do Y” and so on. It became &lt;em&gt;really&lt;/em&gt; risky to make most changes in infrastructure when that person wasn’t available.&lt;/p&gt;&lt;p&gt;That wasn’t a situation we could particularly afford, especially as I was trying desperately to prevent people from burning out, healing those who already had burned out, and grow the bus factor of the team while also trying to set up the future organisation for success. I made a few changes to attempt to improve things, but they weren’t ultimately particularly successful:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;I set up a support Slack channel, so that other teams could reach out to us for any issues, and wired it up into Jira. 
This was fantastic and worked really well &lt;ul&gt;&lt;li&gt;Previously, they had just DM’d various engineers on my team directly and so it was impossible to quantify the work being done, or share knowledge about what was going on, and we didn’t even have an effective way of announcing outages or planned maintenance.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;I attempted to encourage pairing on problems together&lt;/li&gt;&lt;li&gt;I had the entire team lean into the interrupt driven work rather than try to do planned work and then splinter off into hero work as things inevitably required immediate attention &lt;ul&gt;&lt;li&gt;While interrupt driven work isn’t necessarily ideal, since we were &lt;em&gt;so&lt;/em&gt; underwater, focusing entirely on it was more effective than attempting to work like a team that had triple the bandwidth of ours.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;These were all great steps, but the ones that I missed were:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;em&gt;doubling down harder&lt;/em&gt;: We still had instances of people doing project style work rather than having &lt;em&gt;everyone&lt;/em&gt; doing interrupt driven work. If you’re going to lean into something, you need to really lean in.&lt;/li&gt;&lt;li&gt;Leaning into interrupt driven work was an attempt to minimise work in progress to a manageable level. While it worked, it would’ve worked vastly better by turning the entire team into a mob programming team. We did this after our new Director joined and the change was incredible; it wasn’t enough to have everyone working in the same area, they needed to work on the same thing at the same time, together. Not only did this speed up the entire team, but they grew closer together, collaborated better, and huge chunks of siloing disappeared overnight. Did I mention that we’re fully remote? We are. We still did mob programming, and it was amazing. 
I highly recommend it as a way of accelerating a team in the storming and norming phases.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Going back to the things that I did… Despite not quite doing the optimal thing here, what happened was really effective, if only for a particularly interesting and non-obvious reason: It was an extraordinarily compelling and straightforward thing to showcase to leadership. Nothing sells “we’re under-resourced here” quite like saying “I switched the team from 20% support work to 100% support work and we’ve barely moved the needle.”&lt;/p&gt;&lt;p&gt;Having a paper trail of an ever growing queue of work for the first time also helped tremendously here; it put a semi quantifiable number on the complaints and grumblings that people had. It also turned out that because our team was so quiet and bogged down, it wasn’t &lt;em&gt;noticeable&lt;/em&gt;; it had been under-resourced because people weren’t even able to understand how under-resourced it was. Changing that and making those new types of information available to leadership and the rest of the company fundamentally changed how they viewed us, and we learned quite a few interesting things.&lt;/p&gt;&lt;p&gt;Here’s some examples of discoveries I didn’t expect:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;The engineering leadership team was under the impression that all the teams did their own infrastructure work end to end. 
The reality was that my team helped a &lt;em&gt;lot&lt;/em&gt; with support&lt;/li&gt;&lt;li&gt;Each vertical of the company was frustrated with my team not responding to them: they each thought we spent all of our time on the other verticals and ignored them&lt;/li&gt;&lt;li&gt;It turned out that 80% of our time was spent on support for product success, security, and compliance; we had so much toil that we didn’t even have time to automate it or reduce it&lt;/li&gt;&lt;li&gt;There was an &lt;em&gt;incredible&lt;/em&gt; amount of rework and redundancy going on: Because communication was so ad-hoc and boundaries weren’t clear, people would ask IT for problems we solved and vice versa, they’d get bounced around between different channels, and we’d have the same conversations about the same issues over and over&lt;/li&gt;&lt;li&gt;Every DNS change in Route53 took about 20 people-hours of meetings to communicate between product success, engineering, IT, and infrastructure; triple that if it was “cross concern” between two parts of the company that didn’t typically interact&lt;/li&gt;&lt;li&gt;“Percentage of work complete and accurate” was abysmal; very rarely would something get fixed without having to get re-fixed, or addressed later; misunderstandings happened constantly, and it meant that our ticket queue never really went down even if items got completed&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Turns out “done” vs “done done” vs “done done for realsies actually done” vs “done so done that you don’t have to do it ever again” are all vastly different concepts. However, if you notice, none of the results here are really in the area that I wanted them to be: all of the benefit here was external facing rather than for the team. 
So it should come as zero surprise to any experienced leaders reading this that my team struggled with confidence that the company was happy with their work or liked them or appreciated them.&lt;/p&gt;&lt;p&gt;It should also come as no surprise that despite the visibility upward to leadership about the problem happening very quickly and that resulting in rapid change… The team didn’t &lt;em&gt;feel&lt;/em&gt; this for another quarter or two. They stuck around because they trusted me, and I’m eternally grateful for that, and I love them dearly; but I could’ve done so much more to help the &lt;em&gt;team&lt;/em&gt; itself with the outcomes of the interaction pattern changes. Luckily, I have great people I can learn from now, and my org is in such a wonderful spot now that it’s phenomenal to be able to take the opportunity to reflect, learn, and grow.&lt;/p&gt;&lt;h2 id=&quot;patient-divergence&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#patient-divergence&quot;&gt;&lt;span&gt;Patient Divergence&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Tell me about a time when you guided a team through a complex issue without rushing toward a solution. How did you manage this process, and what led to finally deciding on a path forward?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/patient-divergence&quot;&gt;https://cutlefish.substack.com/i/142017363/patient-divergence&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The situation I want to talk about here is how we decided as an organisation to invest more heavily in developer experience, the process around that, and how we were able to do that quite quickly. 
The initial situation was one that I’m sure quite a few leaders are familiar with, but I want to take this opportunity to lay things out a bit more explicitly for people who might not be familiar with how things generally work in terms of feedback loops at the organisation level.&lt;/p&gt;&lt;p&gt;How this generally works in tech (and likely most companies, but I can’t speak to those) is you have roughly three personas of “experiencing” the company: Executives, Management, and ICs. Each has their own goals, strategies, and tools available to them to help steer the company in the right direction (which I’m going to call a “lever”); broadly speaking, Executives have levers of alignment, Management have levers of communication, and ICs have levers of execution. In addition (and I am grossly oversimplifying here), every company, particularly a startup, is going to find itself in one of &lt;em&gt;roughly&lt;/em&gt; three phases: Exploration, Expansion, or Extraction. One of the difficulties that comes here in the Executive and Management level is that any advice, tool, strategy, goal, or whatever else that you receive or attempt to implement is only going to work if it’s a match for the particular stage of a company; consequently, you essentially have to throw out your understanding of “how to run a company” every time you switch stages.&lt;/p&gt;&lt;p&gt;When I came in, I would categorise our company as one that had just gone through the Exploration stage and was now entering into the most awkward phase, Expansion. It turns out this isn’t quite right: we have two to three main business markets and each one is in a different stage; in addition, each company that we’ve merged with or acquired also was in a different stage, so what you really had from the perspective of the Executive layer was a very fuzzy matrix of approaches, strategies, and tools, and everyone came in with a slightly different toolset and rationale for said toolset. 
This isn’t a worst case scenario for the business (it’s quite normal), but it’s close to a worst case scenario when it comes to &lt;strong&gt;understanding how to build and utilise effective communication streams and feedback loops so that information travels bidirectionally in a way that people feel valued and heard&lt;/strong&gt;.&lt;/p&gt;&lt;p&gt;Coming back to the initial situation that I found myself in when I joined the company: it should now come as zero surprise that we were having a particularly difficult time getting good feedback from ICs, acting on said feedback, and doing so in a way that they felt heard and valued. As a consequence, many of the issues that actually mattered to ICs weren’t acted on or even identified; most of which would be issues that we could broadly categorise as “developer experience.”&lt;/p&gt;&lt;p&gt;I was actually deeply fortunate here: I came in as an IC, and then stepped into an interim hybrid Engineering Manager and Director of Infrastructure within a month of joining, so I got to see all three perspectives almost simultaneously, and I attribute a very large amount of my ability to effectively and rapidly zero in on “the real issues” at our company to this unique start. As such, I was able to categorise a lot of things in ways that helped ICs feel heard and understood, but then translate those issues into something that the management and executive layers could actually see.&lt;/p&gt;&lt;p&gt;One of the first things I did here was to quantify and outline the issue in a way that presented enough evidence to the executive layer that investing in a developer experience platform would be cost effective and a force multiplier for helping figure out what to do next. In our case, we utilised DX because I was familiar with the tool and research behind it and made a strong case that the qualitative feedback mechanism of a survey would offer a much more rapid and tangible ROI. 
In a cheeky sense: “Hey, a lot of our devs complain that our infrastructure and tooling is really broken, we can’t use quantitative reasoning to measure any of this because the tools don’t work” is a surprisingly effective argument and it’s essentially the one I used.&lt;/p&gt;&lt;p&gt;While we did settle on the developer experience platform of choice somewhat quickly because I pushed hard for it, I was very careful to lay out that we had a 1-2 quarter plan for procuring the platform, using it, and actually understanding what we needed to do with it. One of the additional critical things that I used to help make the choice easier was to use this as an opportunity to communicate from the top-down that the leadership team is investing in figuring out how to understand ICs better. That worked so well that we had a noticeable bump in developer trust in leadership within a quarter, before we had even been able to use the platform to make any real changes; I really can’t overstate the importance of making sure your organisation, at every level, feels &lt;em&gt;heard&lt;/em&gt; and &lt;em&gt;respected&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;Lastly, the path forward here was “building a path to build a path”, in a sense, and that was actually also very important; we had recently gone through a lot of turmoil in the organisation due to people feeling like the infrastructure team wasn’t communicating, and being able to set up large multi-quarter initiatives in a way that let us start communicating about them immediately was crucial. 
Communicating early and often was so important to the success of this, and if anything, my only regrets are that I could’ve communicated earlier and more often; I slowed down a bit after things started “working” and change started happening, but doubling down on the communication would’ve likely helped some.&lt;/p&gt;&lt;p&gt;However, there’s a danger there in over communicating to the point where people don’t see change happening at the rate that you’re communicating and then it sounds like you’re all talk with no action (ironically, this was a frequent piece of feedback for me in the last two months; you can’t really win here). The balance and nuance in what it means to be an effective communicator and a transparent one gets even fuzzier and more complicated when you’re in leadership because there are extremely valid psychological safety concerns in being “too” transparent. In addition, one can find themselves communicating about the wrong things, or with the wrong ratio of frequency to message importance, and so on; one of the hardest lessons of leadership I had to learn was truly understanding what it means to communicate less and be less transparent in being more effective as a leader.&lt;/p&gt;&lt;p&gt;As someone who really values transparency (and can handle “too much” transparency), it honestly particularly irked me to discover that there are, in fact, extremely legitimate reasons behind most leaders erring on the side of less transparency. 
I don’t have any easy answers there, of course; it’s one of the hardest skills to develop in leadership and I’m continually working on it myself.&lt;/p&gt;&lt;h2 id=&quot;identify-plausible-contributors-multiple-&amp;quot;causes&amp;quot;&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#identify-plausible-contributors-multiple-%22causes%22&quot;&gt;&lt;span&gt;Identify Plausible Contributors / Multiple “Causes”&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Discuss a complex problem you’ve encountered with numerous contributing factors. How did you tackle this complexity, and what was your method for deciding what to do next?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/identify-plausible-contributors-multiple-causes&quot;&gt;https://cutlefish.substack.com/i/142017363/identify-plausible-contributors-multiple-causes&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;I’m going to talk about the same team and a very similar time-frame here again. We had a problem with our infrastructure: it was really really fragile. Nobody liked it, and quite a few people agreed that we really would probably be better off rewriting it all from scratch.&lt;/p&gt;&lt;p&gt;So naturally, we didn’t do that.&lt;/p&gt;&lt;p&gt;With a team that was so underwater, rewriting something from scratch and then migrating the entire company from one set of kubernetes clusters and AWS accounts to another one was a recipe for unmitigated disaster. Absolutely in no way would I ever commit a team to a death sentence like that. Well, not without understanding the problem really well. 
The contributing factors were things that were tricky to pin down, but easy to intuit if you have a “gut” for infrastructure:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;A very young company had hired generalists that were smart and built things they understood and that fit the use case&lt;/li&gt;&lt;li&gt;Certain technological bets were made prematurely that ended up being ones that only pan out well with a certain amount of investment&lt;/li&gt;&lt;li&gt;The business context changed and the technological choices weren’t re-evaluated&lt;/li&gt;&lt;li&gt;The required amount of infrastructure expertise wasn’t invested in&lt;/li&gt;&lt;li&gt;Everyone who had set up the system had left&lt;/li&gt;&lt;li&gt;The blend between “infrastructure” and “application concerns” had started fuzzy and gotten fuzzier&lt;/li&gt;&lt;li&gt;Non local initiatives had been made that had compounding effects on each other: in particular, pursuing certain markets, certain regulatory statuses, certain application level architectural decisions, and certain GTM strategies all came together in an exponentially complex way that nobody could’ve foreseen at the time&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;In short: it wasn’t anyone’s “fault,” but we sure ended up in quite the predicament, and much of the complexity was embedded in human interaction and how the intersection between locally smart choices resulted in disproportionately complex consequences. More importantly: most of the “real” fixes weren’t isolated to infrastructure, but touched upon ways of working and architectural assumptions in our compliance, regulation, security, infrastructure, product design, roadmaps, and more.&lt;/p&gt;&lt;p&gt;What I decided to attempt to do was to try and quantify the business continuity risks that our infrastructure posed, and then outline the various changes needed to address certain categories of business continuity risk. 
We had other types of risk as well: burnout, knowledge siloing, scale issues, scaling people issues, and so on, but the business continuity one is the one that moves the needle the most on resource allocation during budget planning, which is what we needed the most at the time.&lt;/p&gt;&lt;p&gt;To get more information here, we took the approach of trying to document whenever we ran into an issue with our infrastructure in a way that disrupted other teams, and phrasing things in terms of a two part “here’s the hack fix” and “here’s the requirements for the real fix” layout; that was combined with me attempting to cross correlate that with understanding what scenarios we might run into that would immediately shift our risk/reward ratio.&lt;/p&gt;&lt;p&gt;Eventually, we came to understand that the biggest things that would shift our risk vs reward of doing a rewrite were:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Can we incrementally do it (ie: stop at any time)&lt;/li&gt;&lt;li&gt;Can we &lt;em&gt;actually fucking finish it&lt;/em&gt;&lt;/li&gt;&lt;li&gt;Can we do it without creating a “now we have twice the operational burden” problem&lt;/li&gt;&lt;li&gt;Is it worth the cost of everything we drop in order to do it&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Ultimately, the answer ended up being yes, but several things had to happen for that to be true:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;We got more headcount and doubled the size of the team; letting us actually have enough bandwidth to tackle the issue for the first time while still doing Keep The Lights On work&lt;/li&gt;&lt;li&gt;Changes in compliance and regulatory requirements meant that many of the affordances we relied on previously wouldn’t work going forward; this changed the risk vs reward substantially&lt;/li&gt;&lt;li&gt;We were able to figure out how to narrow down the scope of the rewrite enough that doing it became more feasible&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Did I manage this process well? 
Eh…, I did okay; my inexperience really showed up here. This was a problem that I unintentionally solved mostly in my head and, while I took the team with me, I should’ve externalised the process and externalised the information in a better way.&lt;/p&gt;&lt;p&gt;It would’ve been amazing to have something like an ADR (architectural decision record) but for hypothetical needs. The “hack vs doing it right” set of trade-offs that we had was all informal discussion; while it was amazingly helpful, it would’ve been much more impactful to lay it out in a way that external stakeholders could see it and reason about it, and while we were able to have conversations with others for the first time of “hey this thing you want isn’t possible because X, Y, Z” it would’ve been a huge benefit for them to be able to take a document and share it with their leaders so that people could connect the dots for themselves and tie their business goals to ours in a way that would help everyone involved plan for success.&lt;/p&gt;&lt;p&gt;We did an alright job there, but it relied on me being very charismatic, good at communicating with others, and having &lt;em&gt;lots&lt;/em&gt; of meetings; while I’m happy to own that I’m good at communication, I’m disappointed that I put the team in a place where their success depended on me being able to have the right conversations at the right time with the right people. That simply doesn’t scale, and our success ended up feeling more like luck than anything else.&lt;/p&gt;&lt;p&gt;With all that said, there was one pivot point that changed the entire conversation for the rewrite: We had just gotten budget for 3 new headcount, and I had just learned about a new regulation requirement that would force us to upgrade our kubernetes clusters within a few months. 
In addition, we had already previously decided that they were not upgradeable at this point and that any forced upgrades would require a rewrite; so, when we got to the point where we realised an upgrade was mandatory, the conversation switched from an “if” to a “how.”&lt;/p&gt;&lt;p&gt;What I &lt;em&gt;am&lt;/em&gt; proud of, in that moment, is that we had done the work required for that to be an instant and clear conclusion for the entire infrastructure team; everyone understood the trade-offs and alignment was unanimous. While that could’ve been communicated externally better, it’s so difficult to have that type of hard decision be straightforward, and I can take a bit of pride in having helped set up the conditions for it to become a straightforward decision.&lt;/p&gt;&lt;h2 id=&quot;power-of-the-present&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#power-of-the-present&quot;&gt;&lt;span&gt;Power of the Present&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Often as leaders we struggle with the tension between two extremes. At one extreme, we push for a big leap towards our opinionated vision about where we want to get to. At the other, we start where we are right now, figure out what’s working, and take small steps to change the present situation. 
Can you describe a situation where you needed to explore this tension?&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://cutlefish.substack.com/i/142017363/power-of-the-present&quot;&gt;https://cutlefish.substack.com/i/142017363/power-of-the-present&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This is something that I’m struggling with right now, actually!&lt;/p&gt;&lt;p&gt;On one hand, I have this fantastic and grand vision in my head for how we might build out the software engineering experience in a way that lets our Product and GTM functions be really effective in the type of market that we find ourselves in. There are some very challenging things we have to do, and some core competencies that we have to build; if we pull off what I hope to, we’re going to end up building some extraordinarily innovative approaches to tackling this type of market, which would result in us having a world class ability to handle extremely diverse and nuanced industries that resist standard approaches towards digitization.&lt;/p&gt;&lt;p&gt;On the other hand, our CI pipelines are kinda janky and a lot of developers don’t feel like our test-suites are adequate or that we have sufficient monitoring in place to even detect when a service they work on is down, much less functional. &lt;em&gt;Sooooo&lt;/em&gt;, y’know, there’s a ways to go before we get to build the innovative vision.&lt;/p&gt;&lt;p&gt;The big tension here, for me, comes from trying to determine how one is going to iterate; iteration is key in evolving and improving the situation, but it can be extremely difficult to iterate certain things. Feature flags help a lot, but you don’t really get those for infrastructure in the same way, and if your infrastructure team is so underwater that they can barely handle what they have now, gradually and incrementally building out “the new thing” while struggling under the burden of what you have to do now is simply not going to work. 
One thing I did to explore the tension was to break down things that were causing this tension into a few categories: fixable, unfixable, workable, and unworkable. Fixable is fairly self explanatory and it’s a property of whether or not you can remediate the issue in some way that &lt;em&gt;actually&lt;/em&gt; solves it; workable is a little fuzzier, I’m using this to mean the spectrum of how much of a concern is this to the business &lt;em&gt;only&lt;/em&gt;, whether it be from the perspective of legal, compliance, risk, or anything else. I should note that I didn’t actually have these categories laid out so cleanly when I did this, and I’m more going back and looking at what I did and making sense of it after the fact.&lt;/p&gt;&lt;p&gt;That said, if we build these out, we have a “fixable/unfixable” and “workable/unworkable” split, so we can pull out one of my favorite tools, which is a 2x2 matrix (as an aside, seriously, I’m addicted to those, they’re so helpful for my brain for some reason). 
Laying them out, you get four categories:&lt;/p&gt;&lt;table class=&quot;2x2 bordered-box&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Fixable and Unworkable&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;Highest priority to address&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;&lt;td class=&quot;flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Fixable and Workable&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;Lowest priority, but quickest wins&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Unfixable and Unworkable&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;Identify and escalate&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;&lt;td class=&quot;flow&quot;&gt;&lt;p&gt;&lt;strong&gt;Unfixable and Workable&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;Label, quantify, and move on&lt;/em&gt;&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;h3 id=&quot;fixable-and-unworkable&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#fixable-and-unworkable&quot;&gt;&lt;span&gt;Fixable and unworkable&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;These were the highest priority things to address: they were actively breaking the team or the organisation, and we could fix them. 
The hard part here is really about &lt;em&gt;finding&lt;/em&gt; these and appropriately labeling them: a lot of people want to label things they dislike as “unworkable” but doing so is a surefire way to lose trust in leadership.&lt;/p&gt;&lt;h3 id=&quot;fixable-and-workable&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#fixable-and-workable&quot;&gt;&lt;span&gt;Fixable and workable&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;These were workable, so they’re automatically lower priority, &lt;em&gt;but&lt;/em&gt; if they’re fixable, then they’re great things to stick in and sprinkle in with your higher priority stuff. Often because something &lt;em&gt;is&lt;/em&gt; workable, it’s de-prioritised but it can be a source of morale drain or impedance; giving the team permission to work on those things can build a lot of trust with them, and they’re often things that can be completed much quicker too.&lt;/p&gt;&lt;p&gt;While you can run the risk of appearing like you’re only working on “workable” stuff, when done right, it’s incredibly effective in being able to deliver a constant stream of improvements without necessarily meaningfully slowing down the high importance work.&lt;/p&gt;&lt;h3 id=&quot;unfixable-and-unworkable&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#unfixable-and-unworkable&quot;&gt;&lt;span&gt;Unfixable and unworkable&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;This is something to escalate, and this category of problem is the one that keeps me up at night. Not only can we not fix this with the current capabilities that we have, it’s actively breaking something essential that we need to function as an organisation. 
Identifying these should be your second highest priority after identifying just enough work for the team to have things to do because the consequences of not knowing what these are and being unable to quantify the risk are absolutely massive.&lt;/p&gt;&lt;h3 id=&quot;unfixable-and-workable&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#unfixable-and-workable&quot;&gt;&lt;span&gt;Unfixable and workable&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Label this and move on; the things that are great to label it with are:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;a name that signifies it’s not tech debt&lt;/li&gt;&lt;li&gt;a sufficiently low priority to signal that you don’t care right now&lt;/li&gt;&lt;li&gt;&lt;em&gt;the conditions required for this to move into a different category&lt;/em&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;That last bit is important enough that I’m going to repeat it; things that are unfixable and workable are very dangerous, because they can be ignored, but if one flips to any other state, it could turn out quite negatively: either people will see you as being inconsistent with what you choose to work on, or it’ll silently flip into “unworkable” and you won’t notice and that’s going to cause a lot of damage to the business.&lt;/p&gt;&lt;p&gt;Now that I’ve rambled on a bit about the theory of all of this, the situation that we had was one where all four categories each had more work than my team could actually accomplish, so it didn’t matter what we did, because before we could finish any of the highest urgency work, more work would escalate into being in that very high urgency state. As it was, the only thing I could really do was minimise the amount of work in progress, free up as much bandwidth for the team as possible, and buy them as much time as I could while I addressed the unfixable and unworkable issues with leadership directly. 
The first was headcount, and justifying said headcount in a way that aligned the success of the org with the success of the company; this was a bit difficult because the company was very much in expansion mode, and so &lt;em&gt;everyone&lt;/em&gt; urgently needed headcount. Us getting that headcount meant that we took it from the rest of the org, which is exactly what happened, but the argument for doing so had to essentially become “this will accelerate everyone else more than hiring more headcount for them will” and one of the key pieces of information there was showing that we had become the blocker to essentially all progress in the organisation.&lt;/p&gt;&lt;p&gt;Which, most leaders would likely immediately tell you, stepping into a leadership role and then immediately identifying the vast majority of all blockers for the entire CTO org as being under you as a direct responsibility is… Beyond risky. I’m not saying that you shouldn’t own up to the reality of a situation, to be clear. This type of thing is far more about the implications of what happens after you do something like that, politically; what’s happening is you, in essence, become the responsible “root” cause for every missed goal, target, milestone, etc., in the entire company until you fix the blocking problem. You become the highest priority for headcount and resourcing, but you now have to manage the expectations of the entire organisation, which is going to see a three quarter massive whiplash between “taking the blame and promising to fix it” and anything actually improving.&lt;/p&gt;&lt;p&gt;If you want to appear ineffective as a leader, this is a stellar strategy, because it gives you no opportunity to actually prove out your worth before people start judging you based on the actions of things that happened before you came on. 
However, being an interim leader, I knew I had essentially zero shot at actually becoming the long term director, and so I wasn’t particularly fussed about maximizing the short view in favor of success in the long view; it absolutely tanked me, but it set up my successor for a ton more success than they would’ve had otherwise, and it helped break the cycle of rotating management that had plagued infrastructure for three years. Totes worth, 10/10, would fuck up my reputation again in a heartbeat.&lt;/p&gt;&lt;h2 id=&quot;conclusion&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#conclusion&quot;&gt;&lt;span&gt;Conclusion&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I don’t even know how to conclude this, honestly; writing this has been a lot of fun, and I’m only a third of the way through. I suppose if I had to attempt to summarise a lot of the key themes here when it comes to dealing with uncertainty and ambiguity, there are a few things that emerge for me: understanding the problem, empathy, communication, and that I did better than I thought I did.&lt;/p&gt;&lt;h3 id=&quot;understanding-the-space-of-the-problem&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#understanding-the-space-of-the-problem&quot;&gt;&lt;span&gt;Understanding the Space of the Problem&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;I’ve come to terms with the fact that my brain works in a very unusual way; it’s one of the biggest gifts I have and a core differentiator in my ability to do work. It also does mean, however, that I’m cautious when giving advice to other people. 
Just because it works for me doesn’t mean it’s going to work for you, and honestly, it’s less likely to work for you if it works for me.&lt;/p&gt;&lt;p&gt;All of that said, something that works for me very well is having a sort of spatial relationship between things; I do symbolic manipulation and mental spatial navigation very very well, and I abuse that fact as much as possible in all of the reasoning that I do. If I can find a mental model that lets me do that, it helps me learn more concepts better, and if I have trouble recognizing how to solve a problem, I try to break it down to something that lets me spatially reason about it or symbolically manipulate it. It doesn’t work for everything, but it works for most everything, and it helps me build a ton of bridges of understanding; it turns out that if you build a symbolic representation of something, since most other people haven’t, it gives you a second modality of information with which to check your understanding. Being able to check understanding with others in multiple modalities or multiple analogies is a form of cross referencing that I find incredibly useful.&lt;/p&gt;&lt;p&gt;It also does something very very useful for me: It helps me get a navigational structure in place. So much stuff out there is only ever explained in a way that’s not actionable or constructable; if you can’t construct a solution out of the description of the problem, you either don’t understand the problem, or nobody else knows how to get the solution either. 
Building that path to a solution is going to get action and alignment actually happening, but it can only start once you have the space of a problem sort of laid out.&lt;/p&gt;&lt;h3 id=&quot;empathy-goes-an-incredibly-long-way&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#empathy-goes-an-incredibly-long-way&quot;&gt;&lt;span&gt;Empathy Goes an Incredibly Long Way&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;This is something I’ve talked about a lot on my blog before. I love empathy, and it’s my secret superpower for getting things done at an organisational level (although the fact that it’s a “secret superpower” rather than a basic tool of communication is… Anyway). However, there is something that was very difficult for me to learn and it was a bit of a bombshell for me when it really started to finally click. It turns out that empathy and effectiveness, at the leadership level, are pretty close to the same thing.&lt;/p&gt;&lt;p&gt;There are four parts of empathy: tuning into your feelings, expressing your feelings, tuning into their feelings, and responding to their feelings with understanding. Doing that well requires that you’re able to connect with something inside of you that has been where they’ve been. The need for us to have &lt;em&gt;something&lt;/em&gt; in common with that situation is why we have such a hard time empathizing with people who are on sufficiently different walks of life than us. At the individual level, we’re talking about individual experiences and individual emotions, but at the organisational level, we’re talking about organisational experiences and organisational emotions.&lt;/p&gt;&lt;p&gt;It should go without saying that organisations express emotions extremely differently than individuals, but don’t worry, organisations absolutely do have emotions and do express them, we just call it values and we express those values through culture. 
But, how does one connect with the values in a company in a way that they map their individual emotions to the values of a company? How does that happen in a way that you can map your expression in a way that actually results in the company itself being able to feel that empathy? You understand the values, you participate in the culture, and the organisation “feels” that through your participation in it… Which is basically your effectiveness as a leader, is it not?&lt;/p&gt;&lt;p&gt;Of course, to any one individual, it’s going to look completely indecipherable; you’re either going to come off as cold and disjointed, or completely unhinged; to the organisation, you’re going to either be entirely invisible or insignificant or unaligned with the needs of the company. You really can’t win, and the more you manage to do well at balancing the needs and perception differences of the individual vs the group vs the organisation, the more you’re going to build up a skill-set that looks suspiciously like narcissism and sociopathic behaviour. It’s to the point that any individual leader that tells you they can reliably tell the difference between effectiveness due to empathy and effectiveness due to behaving like a sociopath is lying; if they can tell, it’s because of some other reason (but don’t worry, there are almost always tells elsewhere).&lt;/p&gt;&lt;p&gt;Anyway, it was a weird trip for me to realise that empathy goes hand in hand with behaviour that is, externally, occasionally unsettling; no wonder leadership is often described as being extraordinarily lonely. How do you even begin to get good at a skill-set like that? Especially when getting good at that skill-set will make most of the people you love in your life less able to relate to you? 
The cognitive dissonance required to be an effective leader in a capitalistic system is &lt;em&gt;wild&lt;/em&gt; and unbelievably damaging to most.&lt;/p&gt;&lt;h3 id=&quot;communication-is-the-whole-job&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#communication-is-the-whole-job&quot;&gt;&lt;span&gt;Communication is the Whole Job&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Well… Communication is the whole job, except for the parts where it’s not. Communication isn’t the same as execution, which isn’t the same as strategy, or mission, or values, or objectives, or any of the other myriad of things that we use to build up an organisation of effective people and point them in the same general direction and have them build something great together. However, it really is sort of the same thing at the same time?&lt;/p&gt;&lt;p&gt;Just because communication isn’t strategy doesn’t mean that strategy isn’t communication, because it absolutely is, and it communicates a great deal of information when you learn how to read into it and interpret it and use that second layer of information to communicate more things when building out your &lt;em&gt;own&lt;/em&gt; strategy to complement someone else’s strategy. That goes for all of the others as well; they’re all a stream of communication and communicate information in their own way, and you have to learn how to utilise them as the “thing” as well as the tool of communication that they are, while not forgetting that communication itself is also directly required, especially for people who haven’t learned how to read into all of the other streams of communication that are embedded into all of the other things out there.&lt;/p&gt;&lt;p&gt;In particular, something that isn’t built into the rest of the things we communicate about is expectations. They’re kinda there? Sorta? Kinda sorta, but not really. 
Anyone who says that strategy or objectives accurately help set expectations is absolutely full of it; even if you explicitly call out expectations when writing those two things, they are absolutely not going to be the expectations that anyone actually has or sets. Likewise, anyone who has expectations and communicates them out, but doesn’t actually participate in making sure that those expectations were understood appropriately as well as communicated out to everyone else who needs them, is not going to be an effective leader.&lt;/p&gt;&lt;p&gt;This was probably something that I was the weakest at. I made a lot of mistakes in finding that balance between communicating, over communicating, setting the right expectations, not managing them, not updating them soon enough, and so on. I also made mistakes when it came to communicating appropriate expectations and sentiment with how other people were being perceived at the job, and that ended up causing pain in a lot of places that didn’t need to be there.&lt;/p&gt;&lt;p&gt;Communication is fucking hard, and it’s one of the most painful things to mess up, even though it sounds so non-damaging because of how intangible it is. That said, if you are willing to be humble and learn from your mistakes, leveling up your communication skills in every aspect is going to be one of the quickest and highest leverage things you can do to accelerate your own growth and effectiveness as a leader.&lt;/p&gt;&lt;h3 id=&quot;i-did-okay-really&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/observations-of-leadership-part-one/#i-did-okay-really&quot;&gt;&lt;span&gt;I Did Okay, Really&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;I had a very unusual situation and I made the best of it; while I wasn’t as effective as I could’ve been, I learned a tremendous amount, and was able to set up my director for success and play a part in getting us to where we are today. 
In the end, I’m really proud of what I was able to accomplish and I’m deeply looking forward to being able to help continue to make this a wonderful place to work.&lt;/p&gt;&lt;p&gt;I feel very fortunate to work somewhere that I can actually look forward to the positive changes that this might actually make to society, and I get to heal trauma and celebrate queerness and grow a diverse workforce? I helped build the culture in the platform organisation where all of that is possible? It legitimately makes me tear up thinking about that sometimes.&lt;/p&gt;&lt;p&gt;Fuck yeah, I did okay.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>The Power of Being New: A Proven Recipe for High Impact</title>
    <link href="https://hazelweakly.me/blog/the-power-of-being-new--a-proven-recipe-for-high-impact/" />
    <updated>2023-07-17T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/the-power-of-being-new--a-proven-recipe-for-high-impact/</id>
    <content type="html">&lt;p&gt;When starting a new job as a software engineer, it’s natural to feel the pressure of delivering immediate value and meeting the expectations of your role. However, there’s a unique opportunity during this initial period that often goes unnoticed: nobody &lt;em&gt;expects&lt;/em&gt; you to actually do useful work right away. So not only can you feel free to identify and solve problems that others might have grown accustomed to or overlooked, you’ll have a fresh set of eyes that have not yet grown accustomed to the pains of the job.&lt;/p&gt;&lt;p&gt;While you’ll lack in-depth knowledge of the existing systems or workflows, this is actually a good thing here! You’re going to run into every single problem possible during the onboarding process, like a cartoon character running straight into a rake over and over. Embrace the pain, it builds character (well, not really, but it provides really good opportunities).&lt;/p&gt;&lt;p&gt;I’ve been able to take this approach and do some pretty cool things with it in my career.&lt;/p&gt;&lt;p&gt;At a previous company I onboarded, improved onboarding documentation, synthesised cross-org inefficiencies, wrote a technical doc on developer productivity and how it fit into the company, got buy-in, implemented it, shepherded it, and onboarded the entire org, all in one month–the month that I joined.&lt;/p&gt;&lt;p&gt;Some things that made this possible:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Before joining, I knew they were interested in this so I was already looking for it.&lt;/li&gt;&lt;li&gt;I have years of experience in internalizing other people’s workflows and improving them without wrecking them. I am &lt;em&gt;very&lt;/em&gt; good at it.&lt;/li&gt;&lt;li&gt;Most importantly? 
It didn’t require a lot of deep context, and I knew how to implement changes in a way that was opt-in without breaking individual workflows.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The other time I delivered strong results immediately after being hired was when I came into a company, onboarded myself, broke major communication silos, internalised a very poorly communicated product, repaired trust between multiple teams, and broke a 3 month roadblock. In my first 2 weeks. I became the tech lead of the infrastructure team in those first two weeks as well. I had to reshape some mental models, coach and mentor some people, and start improving some practices while planning for 2 weeks, 3 months, and 6 months down the road concurrently.&lt;/p&gt;&lt;p&gt;Again, a lot of that didn’t require deep context in order to &lt;em&gt;start&lt;/em&gt; the changing process and know where to go; the important part was to utilise my empathy, listen to others, understand their viewpoints, and solve their problems. That’s the magic, by the way.&lt;/p&gt;&lt;p&gt;I’m going to break this down into a series of steps. Here’s the formula:&lt;/p&gt;&lt;h2 id=&quot;step-1:-take-notes-and-absorb&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/the-power-of-being-new--a-proven-recipe-for-high-impact/#step-1:-take-notes-and-absorb&quot;&gt;&lt;span&gt;Step 1: Take Notes and Absorb&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;During the onboarding process, make it a habit to take detailed notes. If something is weird, take notes. If something is confusing, write it down. If something goes wrong, make a note! Not only will this help you understand the organisation’s systems and processes, but it will also allow you to identify potential areas for improvement. 
How an organisation addresses its shortcomings is often more valuable than how well it gets things right.&lt;/p&gt;&lt;p&gt;Ask “why” a &lt;em&gt;lot&lt;/em&gt;, and ask people for their opinions as well. Then write those down, all of them. Very rarely do people document the whys, and even when they do, they’re very rarely consistent among different viewpoints. Those tidbits of information can be crucial in helping you later.&lt;/p&gt;&lt;p&gt;Take notes about the people too. Your notes on first impressions will be extremely important for identifying biases. If someone says something about another person, write that down too; reverse engineering how people think about each other can bring up a lot of interesting points and subtleties. For example, if someone says something is bad and terrible, is it &lt;em&gt;really&lt;/em&gt; bad and terrible, or did they have a really negative experience with another coworker and now they’re biased? That’s totally possible! And we want to hold space for that, because it’s completely valid to have had negative experiences and have your current ones be filtered through that; but, as the new person, you’ll want to be aware of this so that you don’t perpetuate biases or grudges. It’ll open you up to being able to be a healer in a space as well, should you need to be.&lt;/p&gt;&lt;p&gt;Take notes and absorb. 
Talk less, write more.&lt;/p&gt;&lt;h2 id=&quot;step-2:-identify-opportunities-for-improvement&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/the-power-of-being-new--a-proven-recipe-for-high-impact/#step-2:-identify-opportunities-for-improvement&quot;&gt;&lt;span&gt;Step 2: Identify Opportunities for Improvement&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;(that don’t involve “Big Changes”)&lt;/p&gt;&lt;p&gt;Questions to ask yourself during the onboarding process:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Are there repetitive tasks that could be automated?&lt;/li&gt;&lt;li&gt;Are there manual processes that could benefit from streamlining?&lt;/li&gt;&lt;li&gt;Are there missing steps in the documentation?&lt;/li&gt;&lt;li&gt;Are there people that have to coordinate efforts when those efforts could be centralised?&lt;/li&gt;&lt;li&gt;Is information scattered everywhere, out of date, wrong, or all of the above?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Here’s the important magic that makes this work:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;None of these changes touch existing code&lt;/li&gt;&lt;li&gt;None of these changes affect an existing developer’s workflow&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;You don’t have the context and understanding required yet to change someone’s workflow and not piss them off, so… Don’t do that. Easy, right? Knowing &lt;em&gt;what&lt;/em&gt; you can change is half the battle; naturally, reading everything I write cause it’s awesome and hilarious is the other half of the battle. 
The battle ain’t the war though, so, y’know; measure twice, break prod &lt;em&gt;(only)&lt;/em&gt; once, and all that.&lt;/p&gt;&lt;h2 id=&quot;step-3:-ask-people-what-sucked&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/the-power-of-being-new--a-proven-recipe-for-high-impact/#step-3:-ask-people-what-sucked&quot;&gt;&lt;span&gt;Step 3: Ask People What Sucked&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Okay, here’s the deal. You’re going to start asking people questions, but this is very easy to fuck up, and if you fuck it up, you’re going to set a bad impression that will be very difficult to undo later.&lt;/p&gt;&lt;p&gt;Luckily, there’s a simple process for success here. Here’s the rule:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Ask them about what hurts&lt;/li&gt;&lt;li&gt;Listen. Listen harder. Keep listening. Listen until your ears bleed. &lt;strong&gt;&lt;em&gt;SHUT THE FUCK UP&lt;/em&gt;&lt;/strong&gt;. Listen.&lt;/li&gt;&lt;li&gt;Take notes. So many notes. Get things in their wording, then repeat it back rephrased to make sure you understand what they’re saying; ask them to validate that.&lt;/li&gt;&lt;li&gt;Use every active listening strategy you know. This will be very draining, and that’s fine. You’re here to listen and that takes real emotional energy.&lt;/li&gt;&lt;li&gt;VALIDATE THEIR FEELINGS. EMPATHIZE WITH THEM. DO NOT FIX IT.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;I know you want to. Stop it. Bad. No fixy.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;DO.&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;NOT.&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;FIX.&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;THE.&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;PROBLEM.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;This step is about understanding your coworkers and how they observe their work environment, how they think, and what causes them pain. 
Soon, you’ll be able to think about maybe fixing it, but right now is the time for connecting with them as humans, holding space for their frustrations, and letting them be heard.&lt;/p&gt;&lt;p&gt;You will be absolutely shocked and heartbroken when you find out how many of these engineers will feel heard for the &lt;em&gt;first time ever&lt;/em&gt; in their entire tenure at this company. No matter how perfect and amazing the company culture is, I guarantee this will be true.&lt;/p&gt;&lt;p&gt;Just… Be there for them, okay? You can fix the problems later; but right now, they need to know you can hear them as they are and understand what they have to say.&lt;/p&gt;&lt;h2 id=&quot;step-4:-fix-local-development-environments&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/the-power-of-being-new--a-proven-recipe-for-high-impact/#step-4:-fix-local-development-environments&quot;&gt;&lt;span&gt;Step 4: Fix Local Development Environments&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Here’s where the fixing happens, and here’s how to do it:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Find every channel you can in slack related to your team, and the teams immediately interacting with your team&lt;/li&gt;&lt;li&gt;Join all of them&lt;/li&gt;&lt;li&gt;Scroll through the last 2-6 months of scrollback in &lt;em&gt;every channel&lt;/em&gt;&lt;/li&gt;&lt;li&gt;Write down or note every single problem people talked about relating to CI, the test suite, local developer environments, “hey this is broken locally but works in CI”, etc.&lt;/li&gt;&lt;li&gt;???&lt;/li&gt;&lt;li&gt;Win a Nobel prize for fixing the world’s most complicated problem&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;One often overlooked area where you can make an immediate impact is in local development environments. This is because people often know exactly what’s wrong, and usually they even know pretty much exactly how to fix it. 
So why does nobody do it, despite it having obvious immediate impact and even calculable efficiency payoffs?&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;figure class=&quot;flow bordered-box has-caption&quot;&gt;&lt;picture&gt;&lt;source srcset=&quot;https://hazelweakly.me/images/xkQuz2obgz-300.avif 300w, https://hazelweakly.me/images/xkQuz2obgz-600.avif 600w, https://hazelweakly.me/images/xkQuz2obgz-1000.avif 1000w&quot; sizes=&quot;100vw&quot; type=&quot;image/avif&quot;&gt;&lt;source srcset=&quot;https://hazelweakly.me/images/xkQuz2obgz-300.webp 300w, https://hazelweakly.me/images/xkQuz2obgz-600.webp 600w, https://hazelweakly.me/images/xkQuz2obgz-1000.webp 1000w&quot; sizes=&quot;100vw&quot; type=&quot;image/webp&quot;&gt;&lt;img alt=&quot;The dril candle tweet modified to say Fixing CI $200, Make Test Suite Work $150, Unfuck the Build System $800, Waste Time Working Around Shit Tools $3,600, Tweak Vim Configs $150, someone who is good at the thot leadership please help me budget this. my dev team is dying&quot; srcset=&quot;https://hazelweakly.me/images/xkQuz2obgz-300.jpeg 300w, https://hazelweakly.me/images/xkQuz2obgz-600.jpeg 600w, https://hazelweakly.me/images/xkQuz2obgz-1000.jpeg 1000w&quot; title=&quot;Humans are really really bad at doing certain types of math, that&#39;s why.&quot; decoding=&quot;async&quot; height=&quot;653&quot; sizes=&quot;100vw&quot; src=&quot;https://hazelweakly.me/images/xkQuz2obgz-300.jpeg&quot; width=&quot;1000&quot;&gt;&lt;/picture&gt;&lt;figcaption&gt;Humans are really really bad at doing certain types of math, that&#39;s why.&lt;/figcaption&gt;&lt;/figure&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;There are two types of developer environments: those that are sorta broken, and those that are super broken for the Junior Engineers but never get fixed because the Senior Engineers have workarounds that let them be productive. 
You can get a feel for which situation your company is in pretty quickly just from looking at slack; it’s a fun superpower once you get good at it.&lt;/p&gt;&lt;p&gt;I do this with every company I join and people are flabbergasted at how quickly I understand what’s weird about a company and how to navigate the quirks, or how I know who to go to for what. It’s like showing up at a bar with your friend and their friends and you somehow know about &lt;em&gt;all&lt;/em&gt; of the relationship drama, all of the weird nonsense, all the inside jokes, and everything else. It’s great; I highly recommend.&lt;/p&gt;&lt;p&gt;Once you’ve identified the 3-5 most annoying or repetitive problems: ask in chat casually “hey has anyone noticed this as a problem?” You really want to avoid jumping straight into “hey imma solve this.” Don’t do that, nobody wants that; even if you’re right, people will legitimately be offended if someone comes in and fixes their shit without asking. It’s like if you invite a friend over to be a roommate and the first thing they do is organise your sock drawer; like, okay, maybe it needed a little bit of organizing, but are you for real? Meanwhile the attic is molding, the sink is flooding, the laundry machine is haunted, and the basement has a cryptid in it.&lt;/p&gt;&lt;p&gt;Anyways. Socks are not the point here. The point is that IF AND ONLY IF everyone chimes in with haven’t you people ever heard of rm -rf node_modules, bro, it’s much better do than try and fix all of these constant ills and agonies OHHHHH.&lt;/p&gt;&lt;p&gt;Wait, where was I? Anyways, if everyone chimes in with “yeah that sucks,” &lt;em&gt;then&lt;/em&gt; offer to come up with a solution for it, and if people like the solution, offer to fix it.&lt;/p&gt;&lt;p&gt;You’ll be the hero, angels will weep, the heavens will open, rainbows will glisten, fairies will frolic, etc etc. 
Here’s the key part though.&lt;br&gt; And I mean it.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;ONLY FIX IT IF PEOPLE SAY IT SUCKS.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Remember, you’re new here. You still have zero prestige among the team, and zero trust; you need to meet them where they’re at and address things &lt;em&gt;they&lt;/em&gt; care about. Now, once you deliver this, that’ll change. People will think you’re amazing, cause you are; they’ll think you’re brilliant too, cause you followed my advice, and I’m a smart cookie. It’s a win win, really.&lt;/p&gt;&lt;h2 id=&quot;the-takeaway:-empathy-empathy-empathy-empathy-empathy&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/the-power-of-being-new--a-proven-recipe-for-high-impact/#the-takeaway:-empathy-empathy-empathy-empathy-empathy&quot;&gt;&lt;span&gt;The Takeaway: Empathy, Empathy, Empathy, Empathy, Empathy&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Embrace your role as the new person and leverage that. Take notes. So many fuckin notes. Seriously. I usually take about 15 pages of notes in my first month of working somewhere and it has never not paid off. Do some of the shit work that nobody wants to do, and nobody can ever prioritise, but you can! Because you’re currently suffering from it! Awesomesauce! Radical. &lt;em class=&quot;text-1&quot;&gt;poggers.&lt;/em&gt;&lt;/p&gt;&lt;p&gt;Remember, being new is not a disadvantage–it’s an opportunity to make a difference by being vulnerable and open-minded. It’s also important to remember that the difference you can make now is in the relationships you form, the people you listen to, and the things you can do for others. Embrace it, and be there for them. If you can do that, they’ll remember you forever.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>So You Want to Hire for Developer Tooling</title>
    <link href="https://hazelweakly.me/blog/so-you-want-to-hire-for-developer-tooling/" />
    <updated>2023-07-14T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/so-you-want-to-hire-for-developer-tooling/</id>
    <content type="html">&lt;p&gt;I see you want to hire a developer to work on internal developer tooling, developer experience, and the generally intangible but admirable goal of “making life better for devs”. That’s awesome; you’ve got one hell of a challenge ahead of you. This role is extremely difficult to hire for. In my opinion, and in my experience, it’s been the most difficult role in the company outside of senior leadership, and the most likely to fail; if there ever was a role that burns people out, it’s this one. Tread carefully, and good luck. You’ll need it.&lt;/p&gt;&lt;p&gt;You probably have some questions, such as:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;What do they even do? (If you’re really confident you can already answer this question, I urge you to throw that confidence away and light it on fire. It will not help you here.)&lt;/li&gt;&lt;li&gt;How do I interview this person?&lt;/li&gt;&lt;li&gt;What should I look for?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;I’m going to go over all of these, but first I want to provide some background into why I’m talking about this.&lt;/p&gt;&lt;p&gt;I have been this person multiple times in various companies. It has been a mixed bag, to put it mildly.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;In one company, I was hired as “the first devops person”. They fundamentally misunderstood what they needed and were institutionally incapable of handling or addressing cross cutting concerns. Once I realised that what they wanted was purely hourly labour of cheaper toil, I built them a “what they needed but not what they asked for” platform by scraping together enough time from various teams and then left once it was operational.&lt;/li&gt;&lt;li&gt;In another company, I was hired as a Staff Security-oriented SRE but they actually needed tooling expertise more so I built that for them. 
It went well, but they didn’t go out of their way to actually hire for that.&lt;/li&gt;&lt;li&gt;I have been hired for a role (stability / infrastructure / resilience) and had people hire me with the generic “backend software engineer” interview loops. The loop itself went alright, but that was more me being abnormally good at both backend &lt;em&gt;and&lt;/em&gt; this rather than any indicator of their skill in placing me. That company underleveled me significantly and I left shortly after when it became obvious that they were incapable of seeing the value that the role was intended to capture.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;There’s a trend here, really, and I think it’s a common one. If a company hires “smart people who do things,” they seem to be very prone to fucking this up. I’m not sure &lt;em&gt;why&lt;/em&gt; this correlation seems to exist (although I have my suspicions), but I have noticed it repeatedly.&lt;/p&gt;&lt;p&gt;To wit, I would personally not see this role as a dev tools role; I would also not see it as operationally oriented. What you’re looking for, I think, is someone who can take “developer experience” and push that forward holistically by whatever way is necessary. The hardest things they will have to do is gain the trust of the entire engineering organisation, buy-in for their approach, and deliver perceived value and improvements.&lt;/p&gt;&lt;p&gt;I’m reminded of the concept that there are several inflection points at which organisations change and their needs evolve; importantly, the nature of how work becomes visible and how coordination happens fundamentally shifts. 
Anecdotally, I’ve found these numbers to be true–you may recognise them as being related to Dunbar’s number(s): 5, 15, 50, 150, 500, 1500.&lt;/p&gt;&lt;p&gt;Here’s how I personally apply them to the general bucket of “not product engineering”, which includes but isn’t limited to: infrastructure, operations, and developer experience.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;5 Engineers: The number of engineers you can have and work without docs. The true bliss of “yolo driven development.”&lt;/li&gt;&lt;li&gt;15 Engineers: You now need documentation, but still don’t need “real” infrastructure (or pretty much anything else).&lt;/li&gt;&lt;li&gt;50 Engineers: &lt;ul&gt;&lt;li&gt;The threshold by which it makes sense to have one person specialised on infrastructure, ops stuff, developer environments, CI/CD, etc.&lt;/li&gt;&lt;li&gt;Start building what will become the internal platform; but don’t build the platform yet, it’s still too early.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;150 Engineers: &lt;ul&gt;&lt;li&gt;The threshold where it makes sense to transition from people-driven coordination to process-driven coordination.&lt;/li&gt;&lt;li&gt;You should have something that &lt;em&gt;resembles&lt;/em&gt; an internal platform, but it’s not a full platform yet.&lt;/li&gt;&lt;li&gt;If you don’t have anyone who truly understands Progressive Delivery and quality assurance, you need one.&lt;/li&gt;&lt;li&gt;Knowledge management as an institutional capability is no longer optional and is likely sorely overdue.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;500 Engineers: &lt;ul&gt;&lt;li&gt;The threshold by which developer environments, cost optimization, infrastructure, security, all pay for themselves as fully separate and independent teams of expertise &lt;em&gt;in addition to&lt;/em&gt; people closer to the teams who work to improve these functions.&lt;/li&gt;&lt;li&gt;You should have an internal platform that is fully fleshed out.&lt;/li&gt;&lt;li&gt;Enabling experimentation, 
progressive delivery, and effective testing as an expertise is no longer optional.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;1500 Engineers &lt;ul&gt;&lt;li&gt;Developer experience, infrastructure, cost visibility, security, etc., should be embedded into the culture, exist as teams within organisations, and also as a separate organisation.&lt;/li&gt;&lt;li&gt;The idea that you can have basically any engineering function without hiring industry experts in that function should seem both insulting and laughable; even if you hire them as consultants, you should understand deeply that success means leveraging others and you now have the funding to fully do so.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Figure out where in here you are and how much catching up you have to do. This role probably doesn’t make sense until you’re at 50 engineers, but it’s not a bad idea to start thinking about it at 15.&lt;/p&gt;&lt;h2 id=&quot;how-to-fuck-up-before-you-even-start-hiring&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/so-you-want-to-hire-for-developer-tooling/#how-to-fuck-up-before-you-even-start-hiring&quot;&gt;&lt;span&gt;How To Fuck Up Before You Even Start Hiring&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;ol&gt;&lt;li&gt;Not having an answer for “How does this role demonstrate value.”&lt;/li&gt;&lt;li&gt;Not having significant buy-in across the entire CTO org for recognizing the need of this role and the benefits it will deliver.&lt;/li&gt;&lt;li&gt;“We just need someone to implement X, Y, and Z and migrate us from a few tools.” No.&lt;/li&gt;&lt;li&gt;Literally any desire for this role involving the word “kubernetes”. 
It’s a fantastic tool; that is not this role.&lt;/li&gt;&lt;li&gt;Not having a good picture for how consensus happens, a good process around moving from decision to action to execution, or a willingness to implement one top-down from senior leadership.&lt;/li&gt;&lt;li&gt;Doing things way before you’re ready: &lt;ul&gt;&lt;li&gt;For example, self service catalogues are great. Implementing Spotify’s Backstage before you’re at 500-1,500 engineers is a mistake.&lt;/li&gt;&lt;li&gt;“Let’s have the dev tools person implement observability” is going to end badly.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Culture, then process, then tooling, then process, then culture. &lt;ul&gt;&lt;li&gt;“It could’ve been an email” applies to overengineering your CI pipelines just as much as it does to useless meetings.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Not having a way to get visibility into your actual needs &lt;ul&gt;&lt;li&gt;If I were to be in this role again, I would be fine doing interviewing tours amongst all the EMs and tech leads every month. 
However, companies like &lt;a href=&quot;https://getdx.com&quot;&gt;getdx&lt;/a&gt; exist now to automate the vast majority of that toil; use them to set this role up for success.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h2 id=&quot;what-should-developer-tooling-people-work-on&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/so-you-want-to-hire-for-developer-tooling/#what-should-developer-tooling-people-work-on&quot;&gt;&lt;span&gt;What Should Developer Tooling People Work On&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In reality, this list should be informed by actual answers from engineers, where the “dev tools person” interviews everyone and figures out:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;immediate pain points&lt;/li&gt;&lt;li&gt;medium term plans&lt;/li&gt;&lt;li&gt;long term goals&lt;/li&gt;&lt;li&gt;shared frustrations&lt;/li&gt;&lt;li&gt;things teams aren’t aware of but cause friction&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;That list should then be categorised and prioritised, and an appropriate allocation of time should be spent on it. In my experience, immediate pain points need 80%+ of the time allocation for the first quarter, because nobody ever hires for this role before it’s too late. Eventually, a 30/30/30 split of immediate pain, medium term plans, and knowledge sharing is a great place to be. You’ll notice I didn’t allocate any time to items 3, 4 and 5; that was intentional.&lt;/p&gt;&lt;p&gt;Being the only hire in this role means they won’t get to work on the long term goals because there’s absolutely no way to make meaningful progress on them quickly enough for it to matter. Long term goals should be turned into medium term goals, and frustration and friction points are things where leading without authority starts to come into play; progress there is made by sharing knowledge, writing process, showing, demonstrating, and teaching, not by plowing ahead on massively scoped projects. 
When leadership without authority happens successfully, along with delivery of value in the short and medium term, the ROI for more people doing this will become apparent, and demand for headcount will organically happen.&lt;/p&gt;&lt;p&gt;As a leader, you’ll know this role is being executed successfully when cross-team and cross-functional collaboration starts to happen more; another strong indicator will be when other managers and leaders start to ask for more headcount in the developer tooling and infrastructure functions.&lt;/p&gt;&lt;p&gt;All of that said, the list of projects below is pretty much guaranteed to be positive ROI; I haven’t gone wrong by picking something off this list and rolling with it if I didn’t have a more compelling first option:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Fully automated developer onboarding and local developer environments&lt;/li&gt;&lt;li&gt;Comprehensive documentation strategies, testing strategies&lt;/li&gt;&lt;li&gt;Building out Progressive Delivery as a capability - the ability to roll back deploys, deploy feature flags, and drive feature flag driven development&lt;/li&gt;&lt;li&gt;Build system performance improvements and reliability improvements&lt;/li&gt;&lt;li&gt;Roll out a comprehensive philosophy and approach to observability, including (but not limited to): cost consciousness, performance, distributed tracing in production and CI&lt;/li&gt;&lt;li&gt;Find one cross-functional collaboration point, automate aspects of it, and reduce friction there &lt;ul&gt;&lt;li&gt;Nothing says “I know how to improve things where it actually hurts” like bringing more visibility into tickets and making it easier to open and close them&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Find a new project a team is about to do, sit in on planning, and take notes. 
Look for opportunities to notice when multiple teams are trying to solve the same problem, and bridge that communication gap.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Crucially here, the takeaway is that I would expect this person to succeed if, and likely only if, there is some visibility into showing what the actual needs of the company are, and they have the ability to globally prioritise needs as well as locally drive improvements.&lt;/p&gt;&lt;h2 id=&quot;how-do-i-screen-for-this-role&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/so-you-want-to-hire-for-developer-tooling/#how-do-i-screen-for-this-role&quot;&gt;&lt;span&gt;How Do I Screen For This Role&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Here’s the gist: This role requires leading without authority. It is not about programming. It is not about technical skills. It is not about architecture.&lt;/p&gt;&lt;p&gt;If you screen for those, you will probably fail to hire someone who will succeed in this role. If you utilise a whiteboard algorithms interview, you will &lt;em&gt;actively screen out&lt;/em&gt; everyone who is qualified to do this role; they will be capable of doing the interview just fine, they will just tell you to fuck off. They will be right.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;If you do not reach the offer stage with at least 50% of the pipeline being women and at least 40% of the pipeline being other underrepresented minorities, you fucked up.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Frankly, who the fuck do you think is &lt;em&gt;more qualified&lt;/em&gt; to lead without authority and work within systems to drive change than those who have been systematically oppressed, denied leadership roles and opportunities, and have had to succeed despite that? 
If you are screening out the experts in sociotechnical systems, you are doing it wrong; put this article down and fix your pipeline.&lt;/p&gt;&lt;p&gt;If you want to hire someone who knows how to pull off a developer experience transformation and build all of that out, the things that would highlight that strength are:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Skip the coding interview. You’ll probably need &lt;em&gt;some&lt;/em&gt; technical aptitude, but this is best measured with a coding review, or even better yet, an architecture review.&lt;/li&gt;&lt;li&gt;Lean in on questions that ask them how they drive organisational change. You’re asking for someone to be an expert in leading without authority and doing so is incredibly challenging even with leadership buy-in. If this doesn’t go well, it will probably be the reason they quit, and hiring their backfill will be 10x harder than it should be.&lt;/li&gt;&lt;li&gt;My favorite architecture/technical question here is asking people to walk through how they build a paperclip maximiser. I personally call it an addition function. Here’s the question: “let’s say I want to add two numbers and return a result, how do I scale that taking into account people, coordinating teams, software architecture, and infrastructure?” You’re going to be looking for people who can walk the evolution of a company and point out how the nature of coordination, tooling requirements, architecture needs, etc., fundamentally change &lt;em&gt;both as the software scales as well as the organisation&lt;/em&gt;.&lt;/li&gt;&lt;li&gt;People don’t do well in this role if they don’t recognise the sociotechnical nature of the work; they will also not do well in this role if &lt;em&gt;you&lt;/em&gt; don’t recognise the sociotechnical nature of the work. Empowering the social humanity with technology and humanizing the technical systems is key to this role and most people don’t seem to understand how to do that. 
Look for indicators of this thinking throughout answers.&lt;/li&gt;&lt;li&gt;Ask about times they have done something intentionally that is not a best practice. Example: One of my favorite stories to tell is when I turned off all of the on-call for the entire company. Leadership refused to prioritise stability, the alerts were not actionable, and fatigue was burning teams out; so I turned it off rather than fight against leadership priorities. That’s the kind of thinking that will be required to succeed here; working with the dysfunctions of an organisation to improve the health of the engineers is really the value here, not migrating a CI system.&lt;/li&gt;&lt;li&gt;Look for the types of questions &lt;em&gt;they&lt;/em&gt; ask when interviewing &lt;em&gt;you&lt;/em&gt;. High quality questions speak volumes. Some great questions would be: &lt;ul&gt;&lt;li&gt;how does the company think about value, ROI, and what incentivises work&lt;/li&gt;&lt;li&gt;what does high impact mean at the company&lt;/li&gt;&lt;li&gt;what does leading without authority look like at the company&lt;/li&gt;&lt;li&gt;what does success in this role look like. When I ask this, I always poke and prod at the answer; I want to know why it looks like that and not like a different way. Look for someone who can ask this question and follow it up with the “why not X instead” so that they understand the outcome behind success rather than the simple outputs&lt;/li&gt;&lt;li&gt;what are the pain points people currently have, and how would one measure addressing those?&lt;/li&gt;&lt;li&gt;how does the company build consensus, how do decisions get made, and how do decisions turn into action&lt;/li&gt;&lt;li&gt;what are the dysfunctions of the company and quirks of its communication gaps? 
(I have never had a company answer this effectively or accurately, but discovering the delta between the honest effort to answer and reality is very illuminating during my first 90 days)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If you &lt;em&gt;don’t&lt;/em&gt; have good answers to those questions, by the way, this role will not be successful, and you have more fundamental problems in engineering to address first.&lt;/p&gt;&lt;p&gt;Fundamentally, this role interviews best in interviews where people know what high quality expertise looks like and allow them to just talk. Like can identify like very rapidly in most cases. Which means, quite honestly, that if you don’t have a good artistic sense and aesthetic for what high quality engineering truly looks like, you will be unable to hire for this role effectively, regardless of your process. If you fail to hire for this role, consider that a strong indicator, and take the opportunity to reflect on the implications of that.&lt;/p&gt;&lt;p&gt;Candidly, you should worry less than you think you need to about having an “objective” interview process. This person will have to lead without authority, institute company wide change, and is going to be hired into the most difficult role to succeed in outside of senior leadership. “The Vibes feel good; I would trust this person to tell me very uncomfortable things about stuff I am personally proud of” is absolutely something to aim for over most everything else. However, the full implications of this are not always obvious. For example, you will need to be very painfully aware that if you don’t have diversity in senior leadership, hiring someone who is not a white male will likely not turn out well. The exceptions to this, in my opinion, prove the rule; I have personally been both the exception and the rule, here. 
Many companies will be uncomfortable with this; which is one reason why this role is so prone to failure.&lt;/p&gt;&lt;p&gt;This role is truly a Sociotechnical Engineer, in every sense of the term; they will expose the weaknesses of your company in ways you are not prepared for, and they will challenge the status quo in ways that are painful. Embrace it. Be prepared to grow as much, if not more, than they do.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Why is Browser Observability Hard</title>
    <link href="https://hazelweakly.me/blog/why-is-browser-observability-hard/" />
    <updated>2023-07-10T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/why-is-browser-observability-hard/</id>
    <content type="html">&lt;p&gt;So the big thing that makes everything so difficult for browsers is that opentelemetry has a concept of a lifecycle for telemetry that doesn’t map very well to how you ergonomically propagate context and correlate traces together. Opentelemetry works super super well in cases where you have a very linear callstack that’s fully synchronous in design. Something like &lt;code&gt;request -&gt; function A(a1, a2, a3...)&lt;/code&gt; &lt;code&gt;-&gt;&lt;/code&gt; &lt;code&gt;function B(b1, b2, b3...)&lt;/code&gt; &lt;code&gt;-&gt; ... -&gt;&lt;/code&gt; &lt;code&gt;function N(n1,n2,n3...) -&gt; response&lt;/code&gt; where the total lifetime of that is “reasonably short.” That is, to put it mildly, not the case in front-end systems. Front-End systems are event based inherently and work based off asynchronous callbacks and event loops, which is one of the architecture styles that fits most poorly into the “tree-like” structure that otel wants you to give it. Technically opentelemetry can work with and express anything that’s a directed acyclic graph (by way of using both links and parent/child relationships carefully), but using links is really annoying in most SDKs and it’s universally unclear how to most clearly initiate “child” spans if you don’t have visibility into the lifespan of the callee vs that of the caller.&lt;/p&gt;&lt;p&gt;On top of that, there’s React; simultaneously the best and worst thing to happen to frontend development. 
In addition to the browser being async and event-loop driven, React is a runtime on top of this which specifically is designed to:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;give you no control over the lifetime of any root span&lt;/li&gt;&lt;li&gt;encourage you to make the lifetimes of every node as long as possible (for efficiency reasons)&lt;/li&gt;&lt;li&gt;not give you lifecycle hooks granular enough to synchronise your span lifecycle to that of a component&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Even if 3 was solved by introducing the concept of “on component creation, on component render, on component removal, on component re-render” and whatever else was required for creating autoinstrumentation, that wouldn’t really work meaningfully. For one thing, you would have to build that into react.js itself and not anything on top of it. For another, root spans that can last indefinite amounts of time don’t work well in opentelemetry. Some people don’t refresh their browser tabs for weeks or months! It’s the last issue that makes front end stuff so difficult for opentelemetry. It’s just really not designed to make it ergonomic to go “page load happened, *three weeks later*, oh look button press”. So how the fuck do you actually meaningfully instrument that? You can, of course, but you need to make almost everything a root span and correlate them together casually via attributes and, hopefully, also some links. Which won’t be ideal from a querying perspective, but is more honest than other approaches.&lt;/p&gt;&lt;p&gt;Lastly, the browser doesn’t support grpc, data loss is more common, compression is vital, and the weight of instrumentation size is extremely important because blowing up someone’s data plan is inconsiderate. So this is one of the areas in which the user starts to pay really heavily for high cardinality, and data volume needs to be very judiciously monitored. 
You also don’t really have the option of running an in-browser version of a telemetry collector, but that’s exactly what you need a lot of the time to do the most effective curtailing of bandwidth. Even if that existed as a thing, it would bloat the page with even more javascript, cost the user even more battery life to run on their 5 year old phone, and make the user experience even worse.&lt;/p&gt;&lt;p&gt;There’s also api authentication issues with browsers needing to be able to send telemetry to an endpoint without being authenticated. Honeycomb solves that pretty well, but you need to think decently hard about that if you build an /api/telemetry endpoint (which you probably should). Which is a lot more work than “just yeet this straight to honeycomb for a proof of concept and then we can figure out collectors and refinery and whatever later.”&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://opentelemetry.io/docs/concepts/signals/baggage/baggage&quot;&gt;Baggage&lt;/a&gt; is how you’re “supposed” to build context that can be shared between services so that you can correlate a backend trace with a frontend trace. It’s probably one of the most confusing, least-out-of-the-box experiences you will ever encounter, and there’s no useful way to set that up nicely without really understanding what you’re doing in both the frontend and the backend. Which is another super difficult thing about the frontend. Rolling your own way to tie together every service instead of having that be the “normal” thing that self discovers the connections is, imo, a sign of immaturity in the space.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Values of Convenience: Why Do We Not Make Life Better For Others?</title>
    <link href="https://hazelweakly.me/blog/values-of-convenience/" />
    <updated>2023-05-16T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/values-of-convenience/</id>
    <content type="html">&lt;p&gt;I was asked recently for my thoughts on a wonderful article about software correctness, human convenience, and flossing, and I ended up dumping out an entire blog post worth of thoughts. So, this blog post serves as both a reminder to myself to write more, and also a sincere apology to my wonderfully patient friend, &lt;a href=&quot;https://kellyshortridge.com&quot;&gt;Kelly&lt;/a&gt;, who graciously puts up with me dumping absolutely unholy amounts of text into their phone at all hours of the day.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://www.hillelwayne.com/post/flossing/&quot;&gt;I really liked the blog post, by the way&lt;/a&gt;. Hillel is an excellent writer, and I find myself agreeing with just about everything he’s ever written. He’s got some fascinating takes, and I find them so grounded in reality and experience. One thing that can be pretty difficult, especially with Formal Methods or other “Big Math” computer science topics is that it can become so easy to get so deeply inside your realm that you become wholly divorced from the concept of anyone ever having to actually &lt;em&gt;learn&lt;/em&gt; it, much less &lt;em&gt;apply&lt;/em&gt; it. Not all things need to be applied, of course, or even learned; but there’s this extraordinary clarity that comes from having polished an idea to a fine shine on the frustrated tears of students or inexperienced engineers that is very difficult to replicate in any other manner. Consequently, his work really resonates with me. “Proving systems right” is so inherently human; after all, what is a proof other than a miserable pile of arguments, and what is “correctness” other than a human ideal laced with emotion, not yet sullied by the ravages of reality.&lt;/p&gt;&lt;p&gt;One takeaway that I have from the post is that there’s an idea that I don’t see explored a lot, and it’s one of what exactly “smoothness” &lt;em&gt;looks&lt;/em&gt; like. 
What does it mean for something to be convenient to use? And, more importantly, why the fuck does it matter at all? If it’s good for you, why don’t you floss? If it’s healthy for you, why don’t you eat salad more often?&lt;/p&gt;&lt;p&gt;I want to take a moment here and think about this from the other direction. Rather than thinking about what &lt;em&gt;smoothness&lt;/em&gt; is, what does inconvenience look like? What does friction feel like? I think we, as humans, really want to experience friction like a hill; we really want to feel like there’s a smoothly rising slope and you can sort of calculate how much friction you’re willing to endure in order to get a certain trade-off. “Oh, that’s 2 frictions? But 4 goody-goody-yum-yum points? Sure, I’ll take that; it’s within my friction budget for the week.”&lt;/p&gt;&lt;p&gt;&lt;em&gt;Pffh.&lt;/em&gt;&lt;/p&gt;&lt;p&gt;In reality, I think friction is a lot more like a thousand cliffs of varying size. But, not only are the cliffs of varying size, you are cursed with a few inconvenient truths.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Every friction cliff will be a different size to each individual&lt;/li&gt;&lt;li&gt;The second you scale your cliff, you will immediately forget how high it was&lt;/li&gt;&lt;li&gt;Every single cliff, no matter how tiny, can completely derail your ability to progress&lt;/li&gt;&lt;li&gt;Every time you attempt to estimate the size of the cliff you scaled, you will underestimate it&lt;/li&gt;&lt;li&gt;You will eventually forget that people are not on the same journey of cliffs as you&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;So, really, you’re fucked; completely and utterly fucked. Forget the curse of knowledge, or expert blindness; you’re doomed to eventually be the cliff that someone else must scale.&lt;/p&gt;&lt;p&gt;Joy! 
And now that we’ve thought about that cheerful note, let’s go to the next part of the article that really stood out to me, which is helpfully depressing to me for entirely separate reasons.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Similarly, in academia, UI/UX is low prestige work. […] The incentive structures are all messed up.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;I think there’s potentially some very interesting implications here, and I want to unpack those. For the record, I agree with Hillel here, but this immediately brought to mind for me a single scorching thought:&lt;/p&gt;&lt;p&gt;&lt;em&gt;“That’s because it’s woman’s work.”&lt;/em&gt;&lt;/p&gt;&lt;p&gt;We’re currently trapped in a &lt;s&gt;dystopian hellscape&lt;/s&gt; patriarchal society where the undercurrent of the internet and technological cultures is one of egocentric bias and rugged individualism.&lt;/p&gt;&lt;p&gt;On the egocentric side, the incentives around removing barriers seem to be non-existent or even dis-incentivised entirely. Why make life easier for the next person if it “devalues” your own resume of achievements? For people to stand on &lt;em&gt;your&lt;/em&gt; shoulders in an egocentric zero-sum society, it would imply that you had not achieved greatness yourself. Far from the “standing on the shoulders of giants” ideals that we love to pretend we believe in, we seem to tend towards viewing that sort of progress as being at the expense of those who came before, as if the very act of forging ahead diminishes the path itself and those who laid it.&lt;/p&gt;&lt;p&gt;On the patriarchal side of things, I notice a similar pattern around work that’s “feminine” (and thus codified as inferior in nature) being work that focuses on community, empathy, helping others, and enriching culture. 
You can see this in how we value the work of nurses and teachers; when their work shifted from being one of dominance and control to that of nurturing and care, the careers became associated with women and with that came a lowering of respect and pay.&lt;/p&gt;&lt;p&gt;What does it mean for something to be convenient, and what does it mean for something to have friction? If we think of convenience as making life better for others, as working to build that which breathes wholeness into the soul of a community, then friction is merely the absence of that life. The cliffs of friction are the same as the cliffs of neglect; malevolence isn’t required, disinterest alone can build cliffs that no one could ever hope to scale.&lt;/p&gt;&lt;p&gt;What is this convenience? This enriching of the other? I want to tug on that a bit, and not in the least because I’ve been thinking a lot lately about Christopher Alexander and adrienne maree brown. (Warning: In the interest of brevity, I am about to condense hundreds of pages of nuanced literature into a few sentences and I humbly beg forgiveness for doing so, and as an aside: I will be eternally grateful to &lt;a href=&quot;https://erinkissane.com&quot;&gt;Erin Kissane&lt;/a&gt; for writing this &lt;a href=&quot;https://erinkissane.com/patterns-prophets-and-priests&quot;&gt;amazing blog post&lt;/a&gt;, among others, that really exposed me for the first time to the duality of these two writers).&lt;/p&gt;&lt;p&gt;One thing that’s currently fascinating about them, to me, is that they seem to approach the same problem from opposite sides. They both want to build a healthier world that gives life to humans: Alexander through the question of how to build community-producing structure, and adrienne through the question of how to build structure-producing community. 
But the longer I think about it, the less coincidental I find it that Alexander–a white man–focuses on the systemic structure, approaching the problem from the top down; while adrienne–a mixed-race black queer woman–focuses on the community, seeing it as an essential and necessary prerequisite to the very idea of being able to build a structure in the first place, and consequently approaches the problem from the bottom up.&lt;/p&gt;&lt;p&gt;What is convenience?&lt;br&gt; What is convenience but a miserable pile of humanity?&lt;br&gt; What is convenience but a mirror that reflects purely the ability of one to reach another and thereby forge raw human connection from the aethers of desire?&lt;/p&gt;&lt;p&gt;What is convenience but the idea that in order to build a tower to reach the heavens, you must first reach into the heart of humanity itself.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Mother of All Outages</title>
    <link href="https://hazelweakly.me/blog/mother-of-all-outages/" />
    <updated>2023-04-19T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/mother-of-all-outages/</id>
    <content type="html">&lt;p&gt;Y’all ready for a story about one of the wildest &lt;s&gt;fuckups&lt;/s&gt; production outages I ever took part in? Buckle up; we’re going for a ride far, far away from any security cameras.&lt;/p&gt;&lt;h2 id=&quot;setting-the-scene&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/mother-of-all-outages/#setting-the-scene&quot;&gt;&lt;span&gt;Setting the Scene&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;At a previous job we had some fairly intense mismanagement. No tech debt was ever allowed to be handled. No good deed was ever unpunished. No non-white-male person was paid a market salary.&lt;/p&gt;&lt;p&gt;Y’know, the usual.&lt;/p&gt;&lt;p&gt;We had all of our infrastructure set up by one lonely SRE person for years. Then I came on, and two engineers from other teams joined the SRE team.&lt;/p&gt;&lt;p&gt;Our tech stack for the backend servers? VMs with Nomad, AWS, and sparkles. Amazingly cost effective, quite honestly.&lt;/p&gt;&lt;p&gt;Because business, the company had recently gone through a massive round of layoffs; they were contrite, they were distraught, they were thorough in their assurances to everyone that there wouldn’t be any more layoffs. Naturally, I knew they were lying; I knew it before they did, but I saw it plain as day.&lt;/p&gt;&lt;p&gt;Due to all of &lt;em&gt;*gestures*&lt;/em&gt; this, the engineering department scored MASSIVELY badly in happiness. 
They were looking at staggeringly terrible end-of-year attrition rates.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;I’m sure this had absolutely nothing to do whatsoever with the encrypted anonymous spreadsheet that “Someone Who Isn’t Me” started and spread around the entire engineering org to bring some salary transparency to light.&lt;/p&gt;&lt;/blockquote&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;The fact that fem presenting people were massively underpaid and that quite a few people living in lower cost of living areas got extremely bad salaries also had nothing to do with this, I’m sure.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Naturally the solution to impending attrition woes was to do nothing. Haha. Business. &lt;strong&gt;&lt;em&gt;BIZZ. NIZZ.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;h2 id=&quot;drain-in-the-membrane&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/mother-of-all-outages/#drain-in-the-membrane&quot;&gt;&lt;span&gt;Drain in the Membrane&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;About a month before PAIN day, the person who set up all the infrastructure tech… Left.&lt;/p&gt;&lt;p&gt;I totally get it; greener pastures, better pay, less &lt;s&gt;illegal corporate exploitation&lt;/s&gt; drama. Excellent choice, really; who could blame him? Looking back, I kinda wish I had made the same choice at the time. But now, behold! 
I was now the expert in the stack (kind of).&lt;/p&gt;&lt;p&gt;I mean, I wasn’t the expert they wanted or needed, but I &lt;em&gt;was&lt;/em&gt; “The Person Who Is Currently Here” which is kind of the same thing except for where it’s not.&lt;/p&gt;&lt;p&gt;That said, everything continued to work flawlessly for a very very long time until one fateful day (about one month after The Expert left).&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;One small side tangent: our nomad servers looked like this&lt;/p&gt;&lt;ul&gt;&lt;li&gt;3 controller nodes + N worker nodes.&lt;/li&gt;&lt;li&gt;The controllers also ran consul and vault.&lt;/li&gt;&lt;li&gt;All of the observability infrastructure, integration with AWS, cron jobs, timers, event processing, etc, ran on the nomad worker nodes&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;h2 id=&quot;the-fateful-day-of-pain&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/mother-of-all-outages/#the-fateful-day-of-pain&quot;&gt;&lt;span&gt;The Fateful Day of PAIN&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I get a message from a coworker that our observability stuff is just dead. Completely gone. Can’t get it working.&lt;/p&gt;&lt;p&gt;So I looked at the clusters … Turns out everything was down. Consul just shat the bed, and nothing could reach each other.&lt;/p&gt;&lt;p&gt;“How do we fix it?” they asked.&lt;/p&gt;&lt;p&gt;“Fuck if I know. None of us set it up” I replied, helpfully, like a broken Clippy from a pirated Word install.&lt;/p&gt;&lt;p&gt;However, I did get a debriefing on maintenance, and got to learn some of the quirks of the system.&lt;/p&gt;&lt;p&gt;One was you had to be very careful when restarting the nomad servers, but it was generally fine if you did an expand + cycle + shrink.&lt;/p&gt;&lt;p&gt;So, I made the decision to try that. 
And here’s where I fucked up:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;I learned later that the expand cycle shrink mentioned by the former coworker was for the &lt;em&gt;worker&lt;/em&gt; nodes only. (Obvious miscommunication in retrospect)&lt;/li&gt;&lt;li&gt;For controllers, going from 3 to 4 causes split brain.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The second point was also obvious in retrospect. I was working in a broken system that nobody understood in a toxic company under pressure from people who never once prioritised doing the right thing or addressing tech debt or, forbid, prevention of issues. Of course my choices were bad.&lt;/p&gt;&lt;p&gt;Long story short: I split brained the cluster and then cycled it. &lt;em&gt;&lt;strong&gt;WOO!&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;This caused a very important thing to happen.&lt;/p&gt;&lt;p&gt;You remember me saying “oh Consul and Vault are also on those nodes”?&lt;/p&gt;&lt;p&gt;When you split the brain, the new nodes don’t join a quorum. Thus, state isn’t transferred.&lt;/p&gt;&lt;p&gt;… oh &lt;em&gt;fuck&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;We lost 3 years of secrets, credentials, configurations, etc.&lt;br&gt; Some of which didn’t exist anywhere else.&lt;br&gt; State replication of Consul had never been set up.&lt;br&gt; State replication of Vault had never been set up.&lt;br&gt; We had no backups of anything and no way to get them back.&lt;/p&gt;&lt;p&gt;✨ Gone ✨&lt;/p&gt;&lt;p&gt;Not only that. But now &lt;em&gt;everything&lt;/em&gt; was on fire because the controller nodes were &lt;em&gt;completely&lt;/em&gt; broken (they were already 90% broken but now they were 100% broken).&lt;/p&gt;&lt;p&gt;Luckily we had infrastructure as code! We can fix this! 
Right??&lt;/p&gt;&lt;p&gt;No.&lt;br&gt; We needed to bootstrap.&lt;br&gt; Nothing can help now.&lt;/p&gt;&lt;h2 id=&quot;the-strap-and-the-boot&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/mother-of-all-outages/#the-strap-and-the-boot&quot;&gt;&lt;span&gt;The Strap and the Boot&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I spent the next week, the next weekend, and on into the week after that rebuilding everything and reverse engineering stuff. I pored over chat logs, buried secrets, git histories, and hidden AWS configs. We got 90% of it back, but the other stuff was gone forever.&lt;/p&gt;&lt;p&gt;14 days after the incident and nothing had been fixed yet, despite the clusters now having been rebuilt and made fully operational. Why?&lt;/p&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;BOOTSTRAPPING&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;We had health checks, crash loops, healthz, all that shit. None of it is ever calibrated from a cold start. You can’t.&lt;/p&gt;&lt;p&gt;We had dependency loops, cycles in services, we had missing stuff that wasn’t in the code, we had code that had never been run, we had code that was for future use and code that was retroactively added to guess how things were set up.&lt;/p&gt;&lt;p&gt;We got about 25% of everything back up, kind of, sorta; if you squinted, you could see where things were supposed to go, vaguely, that is.&lt;/p&gt;&lt;h2 id=&quot;pissing-on-faces-and-pretending-it&#39;s-rain&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/mother-of-all-outages/#pissing-on-faces-and-pretending-it&#39;s-rain&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;Pissing on Faces and Pretending it’s Rain&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Some of this restoration actually took place during an Engineering on-site. 
Myself and the one other SRE person left on the team worked during the entire on-site when we were supposed to be having fun; we dug into logs, pored over services, and attempted to baby things back to life one by one.&lt;/p&gt;&lt;p&gt;Then it was Friday. What happened that Friday? AWS released Serverless RDS and somehow, our RDS cluster got completely corrupted. No health checks failed, no alarms went off; pure, silent, deadly corruption. I had two options: try to fix the database, or restore it from a snapshot. Normally, this doesn’t matter; however, this database was years old, it was ancient, it was one of the first things ever set up.&lt;/p&gt;&lt;p&gt;And restoring from a snapshot means changing the ARN. But that ARN? It was a string that was hardcoded into almost every piece of infrastructure. Changing that ARN would take days of pleading with the gods of chaos. So naturally I tried really fuckin hard to not need to; unfortunately, I lost that battle.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Monday, 8am PST&lt;/strong&gt;, I restored the database from a snapshot.&lt;br&gt; &lt;strong&gt;Monday, 9am PST&lt;/strong&gt;, people start disappearing from Slack.&lt;br&gt; &lt;strong&gt;Monday, 9:30am PST&lt;/strong&gt;, calendar appointments with HR, the CTO, and CEO, start popping up on calendars of engineers.&lt;br&gt; &lt;strong&gt;Monday, 10:00am PST&lt;/strong&gt;, people are posting Titanic gifs in chats, frantically sending each other email addresses and phone numbers.&lt;br&gt; &lt;strong&gt;Monday, 10:30am PST&lt;/strong&gt;, we figure out that about 90-95% of the engineering org is going to be laid off.&lt;br&gt; &lt;strong&gt;Monday, 11am PST&lt;/strong&gt;, I have a meeting with HR and cheerfully explain that they’re now spending $10k a month on idle CI machines, and $6k a month for an RDS instance that isn’t connected to anything. 
I helpfully offer that they can call me or The Expert if they need help with repairing the current dumpster fire in the future, and say that we would be more than happy to give our consulting rates if asked.&lt;br&gt; &lt;strong&gt;Monday, 12:00pm PST&lt;/strong&gt;, my work laptops are wiped remotely.&lt;/p&gt;&lt;h2 id=&quot;the-wailing-of-cassandra&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/mother-of-all-outages/#the-wailing-of-cassandra&quot;&gt;&lt;span&gt;The Wailing of Cassandra&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Now, fuck-ups happen, incidents happen, that’s fine. Why is this one the Mother of All Outages, for me? I don’t even know; I suppose it’s because the whole thing felt so fucking pointless. The whole damned thing, from beginning to end. Pointless. And to lay off &lt;em&gt;everyone&lt;/em&gt; with the system broken beyond repair? To assume that you can keep on limping, wasting slowly away while clinging to the dying embers of a celestial god for eternity? I, to this day, have never understood the decisions that were made; there were many, but this was by far the least understandable.&lt;/p&gt;&lt;p&gt;It just kills me because I saw this coming. I actually made bets with The Expert about how long it would take for this to happen after he left. We guessed 2 weeks to 2 months (we were right). We both underestimated the severity, though. 
By a &lt;em&gt;lot&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;But we knew it was coming.&lt;/p&gt;&lt;p&gt;The other thing is that the entire thing could’ve been avoided had we been given time to move our Vault over to the managed enterprise Vault licence.&lt;/p&gt;&lt;p&gt;That.&lt;br&gt; We.&lt;br&gt; Already.&lt;br&gt; Fucking.&lt;br&gt; Bought.&lt;br&gt; To.&lt;br&gt; Prevent.&lt;br&gt; This.&lt;/p&gt;&lt;h2 id=&quot;epilogue&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/mother-of-all-outages/#epilogue&quot;&gt;&lt;span&gt;Epilogue&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The company severely downsized the engineering org later. And, from back-channel news, I discovered that almost every engineer that made the cut left soon after.&lt;/p&gt;&lt;p&gt;The observability stack? Still not working.&lt;/p&gt;&lt;p&gt;But then, neither is anything else.&lt;/p&gt;&lt;p&gt;A month or two after that, they re-hired The Expert to bring the system back up; the consultant fees he charged were nearly the same as his original salary, but for 10% of the hours. Once he did that and documented it, the company had apparently planned to migrate the system to Kubernetes so that external consultants could maintain it. As far as I know, this was never completed.&lt;/p&gt;&lt;p&gt;To this day, The Expert occasionally consults for them here and there, reaping the ghosts of horrors past.&lt;/p&gt;&lt;p&gt;No one really seemed to do the math and realise that this cost the company more than what they saved by laying off 90% of their engineers. Having never once learned their lessons, they weren’t about to start now, I suppose.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Scaling Mastodon: The Compendium</title>
    <link href="https://hazelweakly.me/blog/scaling-mastodon/" />
    <updated>2022-11-27T00:00:00Z</updated>
    <id>https://hazelweakly.me/blog/scaling-mastodon/</id>
    <content type="html">&lt;p&gt;This blog post will be kept up to date as I find out more information and publish my findings. It’s currently organised in no particular order, as a collection of several fragmented thoughts.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h2 id=&quot;nginx&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#nginx&quot;&gt;&lt;span&gt;Nginx&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;nginx-config-for-object-storage&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#nginx-config-for-object-storage&quot;&gt;&lt;span&gt;Nginx config for object storage&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The nginx config used &lt;a href=&quot;https://stanislas.blog/2018/05/moving-mastodon-media-files-to-wasabi-object-storage/#setting-up-a-nginx-reverse-proxy-with-cache-for-the-bucket&quot;&gt;to proxy to an object storage with a cache&lt;/a&gt;&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;you will have to tune nginx by increasing its &lt;code&gt;worker_rlimit_nofile&lt;/code&gt; and &lt;code&gt;worker_connections&lt;/code&gt; values.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&quot;&gt;scaling a mastodon server&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;ok kewl, good to remember I suppose.&lt;/p&gt;&lt;p&gt;You may also need to remediate &lt;a href=&quot;https://github.com/mastodon/mastodon/pull/21840&quot;&gt;https://github.com/mastodon/mastodon/pull/21840&lt;/a&gt; via setting your response timeout to 300s in nginx instead of 30 or even 60s.&lt;/p&gt;&lt;p&gt;Edit: That should hopefully no longer be the case, sweet.&lt;/p&gt;&lt;h2 id=&quot;postgres&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; 
href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#postgres&quot;&gt;&lt;span&gt;Postgres&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;the-sobbing-sysadmin&#39;s-guide-to-postgres-tuning&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#the-sobbing-sysadmin&#39;s-guide-to-postgres-tuning&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;The Sobbing SysAdmin’s Guide to Postgres Tuning&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;IF THE MASTODON INSTANCE BANS ME FOR FUCKING UP THE HARD DRIVE I WILL FACE GOD AND WALK BACKWARDS INTO HELL&lt;/p&gt;&lt;p&gt;– postgres&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;There are a few rules of tuning postgres: The first is that you have to do it. The second is that nobody knows how to do it.&lt;/p&gt;&lt;p&gt;Now that we know the rules, let’s go forth and explain how to deal with &lt;code&gt;max_connections&lt;/code&gt; specifically. This section is gonna be written as if I know what I am talking about but please be assured that I most certainly do not.&lt;/p&gt;&lt;p&gt;A rule of thumb for rails and postgres: Thou shalt not &lt;em&gt;ever&lt;/em&gt; fuck up and manage to get more DB connections going than we have in &lt;code&gt;max_connections&lt;/code&gt; for postgres. However, thou shalt &lt;em&gt;also&lt;/em&gt; keep max_connections as low as fucking possible because absolutely everything in postgres falls over and shits the bed if you start getting hard contention due to trying to have more connections than is allowed.&lt;/p&gt;&lt;p&gt;In this case postgres won’t literally shit the bed, but your sidekiq queues will be unable to connect to postgres until you’re below &lt;code&gt;max_connections&lt;/code&gt; again. “Oh that’s fine” says the clueless person. 
“I will just set &lt;code&gt;max_connections&lt;/code&gt; to above 9000” says the fool.&lt;/p&gt;&lt;p&gt;New rule of thumb: If you have to set postgres &lt;code&gt;max_connections&lt;/code&gt; to above 512, don’t.&lt;/p&gt;&lt;p&gt;Why? Well, why do you need that many? You probably don’t, and adding more will cause latent system instability later on. To the best of my understanding, there are a few things going on.&lt;/p&gt;&lt;p&gt;Here’s what I think we keep running into:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;a mastodon sysadmin says “oh wow the sidekiq queues are slow, we need to add more workers”&lt;/li&gt;&lt;li&gt;this adds more connections to postgres, which &lt;a href=&quot;https://brandur.org/postgres-connections&quot;&gt;degrades performance slightly&lt;/a&gt;&lt;/li&gt;&lt;li&gt;postgres starts “doing more IO”&lt;/li&gt;&lt;li&gt;performance counterintuitively goes down because queries start taking longer&lt;/li&gt;&lt;li&gt;&lt;code&gt;GOTO 10&lt;/code&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;At some point you’re going to run out of &lt;code&gt;max_connections&lt;/code&gt;. If you raise it to an absurd number, say above 1024, the &lt;em&gt;next&lt;/em&gt; issue you’re probably going to run into is that your storage system can’t actually handle the IO demands you’re theoretically placing on it.&lt;/p&gt;&lt;p&gt;Here’s what the above sequence looks like from the system’s point of view:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;just &lt;em&gt;having&lt;/em&gt; connections will slowly cause more and more slowdown over time&lt;/li&gt;&lt;li&gt;Which means more of those connections will slowly become active as things take longer and longer&lt;/li&gt;&lt;li&gt;More active connections hammer the IO &lt;em&gt;way&lt;/em&gt; harder&lt;/li&gt;&lt;li&gt;Which slows things down&lt;/li&gt;&lt;li&gt;*the server sobbing* “please please im already dying”&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;So what number do you actually want to set it to? 
Luckily, &lt;a href=&quot;https://www.cybertec-postgresql.com/en/tuning-max_connections-in-postgresql/&quot;&gt;this postgres tuning guide&lt;/a&gt; has a “helpful” formula that explains how to find an ideal limit:&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre&gt;max_connections &lt; max(num_cores, parallel_io_limit) /
                  (session_busy_ratio * avg_parallelism)
&lt;/pre&gt;&lt;p&gt;So clearly, don’t set your postgres &lt;code&gt;max_connections&lt;/code&gt; to anything more than *insert magic numbers*.&lt;br&gt; OBVIOUSLY.&lt;br&gt; EASY.&lt;/p&gt;&lt;p&gt;Ever tried to figure out the performance characteristics and “average parallelism” of a rails application?&lt;/p&gt;&lt;p&gt;&lt;strong&gt;AN ERRAND FOR FOOLS WHO DRINK THE MILK OF INNOCENCE.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;If you use a connection pooler like &lt;code&gt;pgbouncer&lt;/code&gt;, you get to conveniently avoid this most of the time, since you naturally don’t really need to set postgres &lt;code&gt;max_connections&lt;/code&gt; beyond 500-ish. However, &lt;em&gt;why&lt;/em&gt; you need to do so is never really explained. So here’s the explanation: because any value of &lt;code&gt;max_connections&lt;/code&gt; over 999 will cause your children to be devoured by Australian evil spirits.&lt;/p&gt;&lt;p&gt;(But seriously, you can get as much as a &lt;a href=&quot;https://aws.amazon.com/blogs/database/performance-impact-of-idle-postgresql-connections/&quot;&gt;46% drop in queries per second&lt;/a&gt; in some cases.)&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;db_pool-notes-from-nora&#39;s-blog&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#db_pool-notes-from-nora&#39;s-blog&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;&lt;code&gt;DB_POOL&lt;/code&gt; notes from Nora’s blog&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;The &lt;code&gt;DB_POOL&lt;/code&gt; variable controls how many database connections a Ruby on Rails process will use. (&lt;code&gt;MAX_THREADS&lt;/code&gt; controls this for Puma, the server used in web.) In addition, the web service takes a variable called &lt;code&gt;WEB_CONCURRENCY&lt;/code&gt; to control how many processes it runs. 
Similarly, streaming has &lt;code&gt;STREAMING_CLUSTER_NUM&lt;/code&gt; to control the number of processes.&lt;/p&gt;&lt;p&gt;The sum of &lt;code&gt;MAX_THREADS&lt;/code&gt; times &lt;code&gt;WEB_CONCURRENCY&lt;/code&gt; in web, &lt;code&gt;STREAMING_CLUSTER_NUM&lt;/code&gt; times &lt;code&gt;DB_POOL&lt;/code&gt; in streaming, and all the sidekiq &lt;code&gt;DB_POOL&lt;/code&gt; variables, must be less than &lt;code&gt;max_connections&lt;/code&gt; in your Postgres config. If it’s more, you’ll experience database contention.&lt;/p&gt;&lt;p&gt;In the example above, assuming the rest of the configuration is default and you have 200 database connections available, I’d set the following:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;web: &lt;code&gt;MAX_THREADS = 10&lt;/code&gt;, &lt;code&gt;WEB_CONCURRENCY=3&lt;/code&gt; for 30 connections&lt;/li&gt;&lt;li&gt;streaming: &lt;code&gt;STREAMING_CLUSTER_NUM = 3&lt;/code&gt;, &lt;code&gt;DB_POOL = 15&lt;/code&gt; for 45 connections&lt;/li&gt;&lt;li&gt;sidekiq-default-push-pull: &lt;code&gt;DB_POOL = 25&lt;/code&gt;, &lt;code&gt;-c 25&lt;/code&gt; for 25 connections&lt;/li&gt;&lt;li&gt;sidekiq-default-pull-push: &lt;code&gt;DB_POOL = 25&lt;/code&gt;, &lt;code&gt;-c 25&lt;/code&gt; for 25 connections&lt;/li&gt;&lt;li&gt;sidekiq-pull-default-push: &lt;code&gt;DB_POOL = 25&lt;/code&gt;, &lt;code&gt;-c 25&lt;/code&gt; for 25 connections&lt;/li&gt;&lt;li&gt;sidekiq-push-default-pull: &lt;code&gt;DB_POOL = 25&lt;/code&gt;, &lt;code&gt;-c 25&lt;/code&gt; for 25 connections&lt;/li&gt;&lt;li&gt;sidekiq-push-scheduler: &lt;code&gt;DB_POOL = 5&lt;/code&gt;, &lt;code&gt;-c 5&lt;/code&gt; for 5 connections&lt;/li&gt;&lt;li&gt;sidekiq-push-mailers: &lt;code&gt;DB_POOL = 5&lt;/code&gt;, &lt;code&gt;-c 5&lt;/code&gt; for 5 connections&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;For a sum of 185 connections. 
This means there will be 15 loose database connections for things like migrations and manually connecting to the database to do queries and maintenance.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/&quot;&gt;Scaling Mastodon in the face of an exodus&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;when-to-pgbouncer&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#when-to-pgbouncer&quot;&gt;&lt;span&gt;When to pgbouncer&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;If you start running out of available Postgres connections (the default is 100) then you may find PgBouncer to be a good solution.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://docs.joinmastodon.org/admin/scaling/#pgbouncer-why&quot;&gt;why pgbouncer&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;note: this implies that nobody actually tries to run past 100 connections without pgbouncer. There’s probably a reason for this, annoying as it is. 
(Ruby + &lt;code&gt;activerecord&lt;/code&gt; in particular seems to be quite prone to doing blocking IO inside a database transaction cause why not).&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;When you reach the point where it makes sense to move Postgres to its own physical machine, I recommend maintaining pgBouncer on each machine that wants to connect to it, rather than putting pgBouncer on the same machine as Postgres.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&quot;&gt;scaling a mastodon server&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;note: read replicas are suggested to be unneeded even at 128k active users.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;idle-hands-are-the-devil&#39;s-workshop&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#idle-hands-are-the-devil&#39;s-workshop&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;Idle Hands are the Devil’s Workshop&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;A happy postgres is one where the amount of idle transactions is low (but not constantly &lt;em&gt;zero&lt;/em&gt;). 
Think of the 80/20 rule as a nice rule of thumb; if more than 20% of your connections are idle… That’s not great.&lt;/p&gt;&lt;p&gt;If you want to look at this, &lt;a href=&quot;https://stackoverflow.com/a/53208173&quot;&gt;Stack Overflow&lt;/a&gt; has an example of a useful SQL query you can run in postgres.&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre class=&quot;highlight highlight-sql&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;select&lt;/span&gt;  &lt;span class=&quot;pl-k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt;
    (&lt;span class=&quot;pl-k&quot;&gt;select&lt;/span&gt; state, &lt;span class=&quot;pl-c1&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;pl-k&quot;&gt;*&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; pg_stat_activity  &lt;span class=&quot;pl-k&quot;&gt;where&lt;/span&gt; pid &lt;span class=&quot;pl-k&quot;&gt;&lt;&gt;&lt;/span&gt; pg_backend_pid() &lt;span class=&quot;pl-k&quot;&gt;group by&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;pl-k&quot;&gt;order by&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;) q1,
    (&lt;span class=&quot;pl-k&quot;&gt;select&lt;/span&gt; setting::&lt;span class=&quot;pl-k&quot;&gt;int&lt;/span&gt; res_for_super &lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; pg_settings &lt;span class=&quot;pl-k&quot;&gt;where&lt;/span&gt; name&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;$$superuser_reserved_connections$$) q2,
    (&lt;span class=&quot;pl-k&quot;&gt;select&lt;/span&gt; setting::&lt;span class=&quot;pl-k&quot;&gt;int&lt;/span&gt; max_conn &lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; pg_settings &lt;span class=&quot;pl-k&quot;&gt;where&lt;/span&gt; name&lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt;$$max_connections$$) q3;
&lt;/pre&gt;&lt;p&gt;This will return a table looking something like this:&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre&gt;        state        | count | res_for_super | max_conn
---------------------+-------+---------------+----------
 active              |    12 |             3 |     300
 idle                |   127 |             3 |     300
 idle in transaction |     6 |             3 |     300
                     |     6 |             3 |     300
(4 rows)
&lt;/pre&gt;&lt;p&gt;The table has a count for each state (active, idle, idle in transaction). &lt;code&gt;res_for_super&lt;/code&gt; is for connections reserved for superuser access, and &lt;code&gt;max_conn&lt;/code&gt; is the max connections you’ve specified in your settings. They’re duplicated on every row since they’re their own columns, but it’s fine; I’m sure there’s a prettier query that can give you this information but this one works.&lt;/p&gt;&lt;p&gt;If you’ll notice, there’s quite a few idle connections here. That’s because the server is in a state of low database usage. This is why you want to use something like pgbouncer so that you can keep the number of idle connections as low as possible in order to prevent overprovisioning your &lt;code&gt;max_connections&lt;/code&gt;.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;postgres-calculator-math&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#postgres-calculator-math&quot;&gt;&lt;span&gt;Postgres Calculator Math&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Here’s the calculator math that I’ve &lt;s&gt;stolen from &lt;a href=&quot;https://nora.codes/&quot;&gt;nora&lt;/a&gt;&lt;/s&gt; come up with.&lt;/p&gt;&lt;p&gt;Let’s assume the following systemd services (annotated with every setting that causes a connection to postgres). The &lt;code&gt;@Nx&lt;/code&gt; here denotes a systemd unit template file where &lt;code&gt;N&lt;/code&gt; is the number of units you’ve started that correspond to this sidekiq queue. There are also several &lt;code&gt;DB_POOL&lt;/code&gt; variables. 
Since they are &lt;em&gt;all&lt;/em&gt; different yet called the same environment variable, I am changing them to be unique here so that it makes sense in a calculation formula.&lt;/p&gt;&lt;p&gt;So, here’s the list of services:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;code&gt;mastodon-web&lt;/code&gt; &lt;ul&gt;&lt;li&gt;&lt;code&gt;WEB_CONCURRENCY&lt;/code&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;MAX_THREADS&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;mastodon-streaming&lt;/code&gt; &lt;ul&gt;&lt;li&gt;&lt;code&gt;STREAMING_CLUSTER_NUM&lt;/code&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;DB_POOL_streaming&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;mastodon-sidekiq-push@N1&lt;/code&gt; &lt;ul&gt;&lt;li&gt;&lt;code&gt;DB_POOL_push&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;mastodon-sidekiq-pull@N2&lt;/code&gt; &lt;ul&gt;&lt;li&gt;&lt;code&gt;DB_POOL_pull&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;mastodon-sidekiq-scheduler@N3&lt;/code&gt; &lt;ul&gt;&lt;li&gt;&lt;code&gt;DB_POOL_scheduler&lt;/code&gt;&lt;/li&gt;&lt;li&gt;note: you should &lt;em&gt;never&lt;/em&gt; have more than one scheduler running, however you may set &lt;code&gt;DB_POOL&lt;/code&gt; and concurrency to whatever you want it to be.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;mastodon-sidekiq-mailing@N4&lt;/code&gt; &lt;ul&gt;&lt;li&gt;&lt;code&gt;DB_POOL_mailing&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;mastodon-sidekiq-default@N5&lt;/code&gt; &lt;ul&gt;&lt;li&gt;&lt;code&gt;DB_POOL_default&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;code&gt;mastodon-sidekiq-ingress@N6&lt;/code&gt; &lt;ul&gt;&lt;li&gt;&lt;code&gt;DB_POOL_ingress&lt;/code&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;And, of course, you’re running postgres somewhere. 
Postgres has a &lt;code&gt;max_connections&lt;/code&gt; set in its configuration somewhere.&lt;/p&gt;&lt;p&gt;The formula for total connections is:&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre&gt;total_mastodon_connections =
  (WEB_CONCURRENCY * MAX_THREADS) +
  (STREAMING_CLUSTER_NUM * DB_POOL_streaming) +
  (N1 * DB_POOL_push) +
  (N2 * DB_POOL_pull) +
  (N3 * DB_POOL_scheduler) +
  (N4 * DB_POOL_mailing) +
  (N5 * DB_POOL_default) +
  (N6 * DB_POOL_ingress)
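
Worked example (a sketch; every value below is a made-up assumption, not a recommendation):
  WEB_CONCURRENCY=3, MAX_THREADS=10, STREAMING_CLUSTER_NUM=3, DB_POOL_streaming=15,
  N1=2, DB_POOL_push=25, N2=2, DB_POOL_pull=25, N3=1, DB_POOL_scheduler=5,
  N4=1, DB_POOL_mailing=5, N5=1, DB_POOL_default=25, N6=1, DB_POOL_ingress=25

total_mastodon_connections =
  (3 * 10) + (3 * 15) + (2 * 25) + (2 * 25) + (1 * 5) + (1 * 5) + (1 * 25) + (1 * 25)
  = 30 + 45 + 50 + 50 + 5 + 5 + 25 + 25
  = 235

Against a hypothetical max_connections of 300, that is 235 / 300, or roughly 78% of the limit.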
&lt;/pre&gt;&lt;p&gt;Now, if this number is over &lt;code&gt;max_connections&lt;/code&gt; in your postgres configuration, you’ve lost. In fact, if this number is more than 90% of &lt;code&gt;max_connections&lt;/code&gt;, you’re probably much closer to IMPENDING DOOM than you would ever feel comfortable with in public.&lt;/p&gt;&lt;p&gt;Last miscellaneous note: you want postgresql behind a proxy even if it’s on a single node. It’s just too liable to be painful otherwise. Without one, you have to stop all clients to get the database back online.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;bouncey-bouncy-bounce-bounce-bounce&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#bouncey-bouncy-bounce-bounce-bounce&quot;&gt;&lt;span&gt;Bouncey Bouncy Bounce Bounce Bounce&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;PgBouncer is a single-threaded process which means it only uses a single CPU. […]&lt;/p&gt;&lt;p&gt;In general, a single PgBouncer can process up to 10,000 connections. 1,000 or so can be active at one time. 
[…]&lt;/p&gt;&lt;p&gt;Adjusting connection counts may also require you to adjust some system limits to allow PgBouncer to utilise the number of sockets required […]&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://www.crunchydata.com/blog/postgres-at-scale-running-multiple-pgbouncers&quot;&gt;postgres at scale&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;When do you need more than one pgbouncer?&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;ul&gt;&lt;li&gt;PgBouncer’s CPU usage is 100%.&lt;/li&gt;&lt;li&gt;Application queries through PgBouncer wait times increase while Postgres itself is not similarly loaded.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;– &lt;a href=&quot;https://www.crunchydata.com/blog/postgres-at-scale-running-multiple-pgbouncers&quot;&gt;postgres at scale&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Likely causes are that pgbouncer can’t keep up with the number of connections to the database, or that the result sets being returned are too large.&lt;/p&gt;&lt;p&gt;How to test: Run this SQL query on the postgres database.&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre class=&quot;highlight highlight-sql&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;select&lt;/span&gt; state, &lt;span class=&quot;pl-c1&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;pl-k&quot;&gt;*&lt;/span&gt;)
&lt;span class=&quot;pl-k&quot;&gt;from&lt;/span&gt; pg_stat_activity
&lt;span class=&quot;pl-k&quot;&gt;where&lt;/span&gt; backend_type &lt;span class=&quot;pl-k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&#39;&lt;/span&gt;client backend&lt;span class=&quot;pl-pds&quot;&gt;&#39;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;pl-k&quot;&gt;group by&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;;
&lt;/pre&gt;&lt;p&gt;If your idle connection count is zero (or very close to zero), pgbouncer is bottlenecked.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;select-&#39;bottle&#39;-from-&#39;neck&#39;-where-id-unknown&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#select-&#39;bottle&#39;-from-&#39;neck&#39;-where-id-unknown&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;SELECT ‘bottle’ FROM ‘neck’ WHERE id = unknown&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;If you want to find some bottlenecks in your database, according to &lt;a href=&quot;https://mastodon.social/@AndresFreundTec&quot;&gt;@AndresFreundTec@mastodon.social&lt;/a&gt;, you can run the query below and analyse its output as a starting point.&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre class=&quot;highlight highlight-sql&quot;&gt;&lt;span class=&quot;pl-k&quot;&gt;SELECT&lt;/span&gt; backend_type, state, wait_event_type, wait_event, &lt;span class=&quot;pl-c1&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;pl-k&quot;&gt;*&lt;/span&gt;)
  &lt;span class=&quot;pl-k&quot;&gt;FROM&lt;/span&gt; pg_stat_activity
    &lt;span class=&quot;pl-k&quot;&gt;WHERE&lt;/span&gt; pid &lt;span class=&quot;pl-k&quot;&gt;&amp;lt;&amp;gt;&lt;/span&gt; pg_backend_pid()
      &lt;span class=&quot;pl-k&quot;&gt;AND&lt;/span&gt; wait_event_type IS DISTINCT &lt;span class=&quot;pl-k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;pl-s&quot;&gt;&lt;span class=&quot;pl-pds&quot;&gt;&#39;&lt;/span&gt;Activity&lt;span class=&quot;pl-pds&quot;&gt;&#39;&lt;/span&gt;&lt;/span&gt;
  &lt;span class=&quot;pl-k&quot;&gt;GROUP BY&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;1&lt;/span&gt;, &lt;span class=&quot;pl-c1&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;pl-c1&quot;&gt;3&lt;/span&gt;, &lt;span class=&quot;pl-c1&quot;&gt;4&lt;/span&gt;
  &lt;span class=&quot;pl-k&quot;&gt;ORDER BY&lt;/span&gt; &lt;span class=&quot;pl-c1&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;pl-k&quot;&gt;*&lt;/span&gt;) &lt;span class=&quot;pl-k&quot;&gt;DESC&lt;/span&gt;;
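
-- Hedged follow-up sketch, separate from the query above (the example output
-- further down belongs to the query above, not to this one). If many backends
-- are waiting on wait_event_type = Lock, this lists each waiting backend with
-- the pids that block it. Assumes PostgreSQL 9.6+ for pg_blocking_pids();
-- the column aliases are illustrative.
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, wait_event, left(query, 60) AS query_snippet
  FROM pg_stat_activity
  WHERE wait_event_type = &#39;Lock&#39;
  ORDER BY pid;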
&lt;/pre&gt;&lt;p&gt;Here’s an example of what that would look like:&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre&gt;  backend_type  |        state        | wait_event_type |      wait_event      | count
----------------+---------------------+-----------------+----------------------+-------
 client backend | idle                | Client          | ClientRead           |    52
 client backend | active              | Lock            | relation             |    13
 client backend | idle in transaction | Client          | ClientRead           |    11
 client backend | active              | Client          | ClientRead           |     2
 checkpointer   |                     | Timeout         | CheckpointWriteDelay |     1
 client backend | active              |                 |                      |     1
(6 rows)
&lt;/pre&gt;&lt;p&gt;If you’re write latency bound, the query will show a lot of &lt;code&gt;WALWrite&lt;/code&gt; wait events.&lt;/p&gt;&lt;p&gt;Setting &lt;code&gt;synchronous_commit = off&lt;/code&gt; can alleviate that (although understand roughly what it’s doing first). &lt;a href=&quot;https://www.percona.com/blog/2020/08/21/postgresql-synchronous_commit-options-and-synchronous-standby-replication/&quot;&gt;Here’s a nice explainer&lt;/a&gt;&lt;/p&gt;&lt;p&gt;One particular warning, also from @AndresFreundTec, is that setting &lt;code&gt;synchronous_commit = off&lt;/code&gt; means your transactions aren’t immediately guaranteed to be durable. That… Should be fine for Mastodon… I think&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;memory-x-memory-the-not-anime&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#memory-x-memory-the-not-anime&quot;&gt;&lt;span&gt;Memory X Memory the not-anime&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Database tip: if you need to increase &lt;code&gt;max_connections&lt;/code&gt; on PostgreSQL, make sure to check what &lt;code&gt;work_mem&lt;/code&gt; is set to. 
If &lt;code&gt;max_connections X work_mem&lt;/code&gt; is more than double the RAM you have on the server, maybe lower &lt;code&gt;work_mem&lt;/code&gt;.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://m6n.io/@fuzzychef/109366609465907548&quot;&gt;@fuzzychef@m6n.io&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2 id=&quot;redis&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#redis&quot;&gt;&lt;span&gt;Redis&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;lies-damned-lies-and-redis&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#lies-damned-lies-and-redis&quot;&gt;&lt;span&gt;Lies, damned lies, and redis&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Mastodon uses Redis. [It supports], &lt;code&gt;SIDEKIQ_REDIS_URL&lt;/code&gt;, &lt;code&gt;CACHE_REDIS_URL&lt;/code&gt; and just &lt;code&gt;REDIS_URL&lt;/code&gt;. (Actually, Mastodon supports &lt;code&gt;REDIS_HOST&lt;/code&gt;, &lt;code&gt;REDIS_PORT&lt;/code&gt; etc variants separately for all three).&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://docs.joinmastodon.org/admin/scaling/#redis&quot;&gt;mastodon scaling docs for redis&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;note: redis is used for BOTH volatile cache and persistent data. Sidekiq, list feeds, home feeds, and the streaming API all need to live in a persistent redis whose data shouldn’t be lost.&lt;/p&gt;&lt;p&gt;note: This is written as if using a separate redis for cache and persistent data is optional. It is not.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://github.com/mperham/sidekiq/wiki/Using-Redis#multiple-redis-instances&quot;&gt;read and weep&lt;/a&gt;.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;[…] it’s important that Sidekiq be run against a Redis instance that is not configured as a cache but as a persistent store. 
[…] I recommend using two separate Redis instances, each configured appropriately, if you wish to use Redis for caching and Sidekiq. Redis namespaces do not allow for this configuration and come with &lt;a href=&quot;https://www.mikeperham.com/2015/09/24/storing-data-with-redis&quot;&gt;many other problems&lt;/a&gt;, so using discrete Redis instances is always preferred.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Speaking of which, the blog post linked from that sidekiq wiki page is very nice. You should read it: &lt;a href=&quot;https://www.mikeperham.com/2015/09/24/storing-data-with-redis&quot;&gt;storing data with redis&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;This one is also potentially useful: &lt;a href=&quot;https://severalnines.com/blog/performance-tuning-redis/&quot;&gt;performance tuning for redis&lt;/a&gt;.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;how-to-redis-correctly&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#how-to-redis-correctly&quot;&gt;&lt;span&gt;How to redis correctly&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;So, from the &lt;a href=&quot;https://www.mikeperham.com/2015/09/24/storing-data-with-redis&quot;&gt;storing data with redis&lt;/a&gt; post:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;There are several questions to answer when determining how to use Redis for different datasets:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Can I flush the dataset without affecting other datasets?&lt;/li&gt;&lt;li&gt;Can I tune the persistence strategy per dataset? For transactional data, you want real-time persistence with AOF. For cache, you want infrequent RDB snapshots or no persistence at all.&lt;/li&gt;&lt;li&gt;Can I scale Redis per dataset? Redis is single-threaded […] Datasets in the same Redis instance will share that budget. What happens when your traffic spikes and the cache data uses the entire budget? 
Now your job queue slows to a crawl.&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;p&gt;The conclusion is that, for mastodon’s two needs (cache + storage), you &lt;em&gt;must&lt;/em&gt; use two separate redis instances, or you won’t be able to set a separate persistence strategy for each. Everything else is practically irrelevant; if you can’t change the persistence strategy per dataset, there’s no point in using redis for both use cases.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Redis itself can be scaled using either Redis Sentinel or Redis Cluster.&lt;/p&gt;&lt;p&gt;For Sidekiq, only the Sentinel option is viable, as Sidekiq uses a small number of frequently updated keys. With Sentinel, we get fail-over, but we won’t increase the server’s throughput.&lt;/p&gt;&lt;p&gt;For the home feed caches, we might use Redis Cluster, which will distribute the many cache keyes across available nodes.&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://softwaremill.com/the-architecture-of-mastodon/&quot;&gt;the architecture of mastodon&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2 id=&quot;storage&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#storage&quot;&gt;&lt;span&gt;Storage&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;object-storage&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#object-storage&quot;&gt;&lt;span&gt;Object Storage&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;At this point it’s worth mentioning that if you want to go further [beyond ~20k users], you’ll need to be using object storage (S3 or similar) for user file uploads, or else manually figure out a shared filesystem between all of the machines in your cluster (very likely possible, but probably not worth it compared to even just self-hosting Minio)&lt;/p&gt;&lt;p&gt;– &lt;a 
href=&quot;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&quot;&gt;scaling a mastodon server&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;note: we have learned that a shared file system does not actually work unless it is a &lt;em&gt;local&lt;/em&gt; shared file system. Sidekiq is too latency sensitive otherwise.&lt;/p&gt;&lt;p&gt;note: the parenthetical gives it away. Nobody running mastodon at scale has ever tried to do it without object storage, and we have unwittingly run into edge cases where scaling advice leads us astray here. Remediation is to move to an object storage solution sooner rather than later.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;nfs:-no-fucking-scale&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#nfs:-no-fucking-scale&quot;&gt;&lt;span&gt;NFS: No FUCKING Scale&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;That’s it, that’s the &lt;s&gt;tweet&lt;/s&gt; toot.&lt;/p&gt;&lt;p&gt;Don’t use NFS for anything; the mastodon documentation claims you can use it. You cannot. Don’t even think about it. 
Run object storage locally if you have to; projects like seaweedfs make that simpler now, and it’s a very good idea.&lt;/p&gt;&lt;h2 id=&quot;sidekiq-and-ruby&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#sidekiq-and-ruby&quot;&gt;&lt;span&gt;Sidekiq and Ruby&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;h3 id=&quot;sidekiq-scaling-indications&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#sidekiq-scaling-indications&quot;&gt;&lt;span&gt;Sidekiq scaling indications&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;These jobs are split into as many tiny jobs as we can manage, because that’s how you can make parallelise them best and thus make the most optimal use of hardware and horizontal scaling. But if you’ve got 10 threads and 22,000 followers, do not be surprised that there are delays. In fact, that is how the need for scaling Sidekiq shows itself: the dreaded backlog.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&quot;&gt;scaling a mastodon server&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;note: this is a sidekiq scaling indicator. However, we can’t scale sidekiq beyond what the database and filesystem allow.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Actually, there are more reasons that the backlog can grow, such as if there’s a technical issue causing individual jobs to take longer than they normally would, or getting stuck indefinitely reducing the effective number of threads available for processing&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&quot;&gt;scaling a mastodon server&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;note: this is very buried but very important. 
Indicators of sidekiq backlog growing can &lt;em&gt;also&lt;/em&gt; be jobs getting stuck. We encountered this with NFS.&lt;/p&gt;&lt;p&gt;note: Hypothesis. We ended up wanting to scale workers up because we were getting a lot of stuck workers due to file system issues. Then when things resolved, we actually had too &lt;em&gt;many&lt;/em&gt; workers hitting the database all at once, then we got too much database contention which locked up &lt;em&gt;those&lt;/em&gt; workers, leading us to reduce workers, causing a vicious cycle depending on which was misbehaving more, postgres or NFS.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;sidekiq-queues-and-how-they-hate-you&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#sidekiq-queues-and-how-they-hate-you&quot;&gt;&lt;span&gt;Sidekiq queues and how they hate you&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;One thing to remember is that there should only be one scheduler in your entire cluster, and it doesn’t need many threads (5 is fine). […] It’s just that default is the most important one, with push and ingress being close second. mailers is also important but even just 25 threads will get you very far because the rate of sending e-mails isn’t that high.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&quot;&gt;scaling a mastodon server&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;note: the ellipsis here is frustrating. There is an entire paragraph that sums up to “you can setup your queues a bunch of ways. Nobody’s ever done performance measurements on them lol good luck bro”&lt;/p&gt;&lt;p&gt;note: my personal hypothesis is as follows. Given the math calculation from Nora’s blog post, each thread in each process has its own separate database connection. 
As such, &lt;code&gt;thread * process&lt;/code&gt; is always the math we need to use for everything. With all of that in mind, we should see a negligible difference in overhead between &lt;code&gt;sidekiq -q single-queue&lt;/code&gt; (&lt;code&gt;xN&lt;/code&gt;) and &lt;code&gt;-q q1,q2,q3,q4&lt;/code&gt; (&lt;code&gt;xN&lt;/code&gt;). The difference washes out and database connections are not necessarily used more efficiently unless we can somehow use fewer sidekiq processes.&lt;/p&gt;&lt;p&gt;I fleshed this math out more in the &lt;a href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#postgres-calculator-math&quot;&gt;postgres math section&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;note: tl;dr, single queue for each service. use systemd service templates. ramp them up as needed. rely entirely on pgbouncer to not cause database contention even though it’s fucking ridiculous that we would need to do that.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;sidekiq-memory-fragmentation&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#sidekiq-memory-fragmentation&quot;&gt;&lt;span&gt;Sidekiq memory fragmentation&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;One of the most important things we’ve learned over the years about Sidekiq is that a bad interaction between the C-Ruby runtime and the malloc memory allocator included in Linux’s glibc can cause extremely high memory usage. I’ll talk about what causes this bad interaction in a later email, but for now, let’s just concentrate on the effects.&lt;/p&gt;&lt;p&gt;Sidekiq with high concurrency settings, when running on Linux, can have what looks like a “memory leak”. A single Sidekiq process can slowly grow from 256MB of memory usage to 1GB in less than 24 hours. 
However, rather than a leak, this is actually memory fragmentation.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://us11.campaign-archive.com/?u=1aa0f43522f6d9ef96d1c5d6f&amp;id=997fbd1c2c&quot;&gt;wisdom of the ancients from a mailing list&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;locks-that-bind-and-binds-that-lock&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#locks-that-bind-and-binds-that-lock&quot;&gt;&lt;span&gt;Locks That Bind and Binds that Lock&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Rails effectively does this&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Start db transaction&lt;/li&gt;&lt;li&gt;Upload image to media storage&lt;/li&gt;&lt;li&gt;INSERT or UPDATE statement&lt;/li&gt;&lt;li&gt;Commit the transaction&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Let’s walk through the implications of this briefly. But first, go ahead and scream into the void; it’ll be helpful.&lt;/p&gt;&lt;p&gt;Now wasn’t that refreshing?&lt;/p&gt;&lt;p&gt;Ok, so the implications here:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;If your media storage is the file system, you can hit the file system with the database &lt;em&gt;and&lt;/em&gt; the media storage at the same time.&lt;/li&gt;&lt;li&gt;If your file system is the &lt;em&gt;same&lt;/em&gt; file system you can cause slowdowns on that disk from two separate directions that are both now mutually related.&lt;/li&gt;&lt;li&gt;If that file system is NFS, now the network is involved &lt;em&gt;inside&lt;/em&gt; that database transaction&lt;/li&gt;&lt;li&gt;Oh also this is in sidekiq so it’s all parallel and concurrent&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The lesson here is use an object storage from day one if you can. Preferably one that doesn’t live on the same disk as postgres. NFS in particular is going to be a &lt;em&gt;very&lt;/em&gt; poor choice here. 
It’s bad enough that, honestly, mastodon documentation should warn against using it rather than presenting it as an option.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;mastodon-web-and-mastodon-streaming&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#mastodon-web-and-mastodon-streaming&quot;&gt;&lt;span&gt;mastodon-web and mastodon-streaming&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;At some point you will definitely want Puma to be on a separate machine from Sidekiq, and then have more machines with Puma, and more machines with Sidekiq. […] Just don’t forget that once your Puma isn’t on the same machine as your nginx, you will need to specify &lt;code&gt;TRUSTED_PROXY_IP&lt;/code&gt; with the internal IP of the load balancer so that Puma can correctly parse users’ IP addresses for stuff like rate limiting. […] use an upstream block in your nginx configuration to list these Pumas and nginx will do the load balancing between them.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;note: separate machines with just puma and just sidekiq are what we need to start moving towards.&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;The streaming API will get you pretty far on default configuration, but at some point it too will not be able to answer all of the connections. 
[…] The moment when this becomes necessary can be difficult to detect, because for people who’ve already connected, the streaming API will continue to work, it’s new connections that will be rejected.&lt;/p&gt;&lt;p&gt;– &lt;a href=&quot;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&quot;&gt;scaling a mastodon server&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;you-can-read-me-but-you&#39;ll-never-clock-me&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#you-can-read-me-but-you&#39;ll-never-clock-me&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;You can read me but you’ll never clock me&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Note, however, that the Sidekiq jobs will need to perform both reads &amp; writes from the main node, hence the scaled-up [read only replicas] are only for the other clients (web, mobile, streaming).&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://softwaremill.com/the-architecture-of-mastodon/&quot;&gt;the architecture of mastodon&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;sidekiq-cool-triq&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#sidekiq-cool-triq&quot;&gt;&lt;span&gt;Sidekiq cool triq&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Your sidekiq command in a systemd service should look a little different than most guidelines actually show.&lt;/p&gt;&lt;p&gt;The most important things are using systemd templates and structuring the sidekiq start command to only have one queue. 
Here’s a truncated example of a pull queue sidekiq:&lt;/p&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre&gt;# mastodon-sidekiq-pull@.service
[Unit]
Description=mastodon-sidekiq-pull
After=network.target

[Service]
Type=simple
User=mastodon
# ... snip
Environment=&quot;DB_POOL=10&quot;
ExecStart=/usr/bin/bundle exec sidekiq -c $DB_POOL -q pull
# ... snip
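#
# Usage sketch, not part of the original unit. Assumptions: the file is
# installed as /etc/systemd/system/mastodon-sidekiq-pull@.service, and the
# instance names 1 and 2 are arbitrary. Each name after the @ starts a
# separate sidekiq process running -c $DB_POOL threads:
#   systemctl daemon-reload
#   systemctl start mastodon-sidekiq-pull@1.service mastodon-sidekiq-pull@2.service
#   journalctl -u mastodon-sidekiq-pull@1.service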
&lt;/pre&gt;&lt;p&gt;A few notes:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;code&gt;DB_POOL&lt;/code&gt; and the &lt;code&gt;-c NN&lt;/code&gt; number need to match up. They don’t &lt;em&gt;have&lt;/em&gt; to, but… they should.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ONLY ONE QUEUE IN THE SYSTEMD SERVICE&lt;/strong&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Here’s why.&lt;/p&gt;&lt;p&gt;This starts one process and creates 10 connections to the database. The overhead between this vs 2 systemd units with 10 threads is basically zero. There are reasons to do it (concurrency vs parallelism, and ruby has a GIL which limits parallelism capabilities), but the number of DB connections is the same. So, realistically, there’s almost zero downside to having more systemd services.&lt;/p&gt;&lt;p&gt;BUT. You get separate logs for each sidekiq queue, so you can quickly narrow down which one is erroring. You also get the ability to scale up an individual queue better on demand. Keep the &lt;code&gt;-c&lt;/code&gt; number smaller (no more than 25) and make more as needed, it’s fine. That’s why these are systemd templates.&lt;/p&gt;&lt;p&gt;If you want to look at a very nice approach in more detail, see this &lt;a href=&quot;https://www.eigenmagic.com/2022/12/11/better-mastodon-sidekiq-scaling-with-systemd-environmentfile/&quot;&gt;blog post by Justin Warren&lt;/a&gt;. It does this quite well but lets you reuse the same template and modify parameters via environment files rather than by editing the systemd templates themselves. (I would suggest following the other advice I’ve given here: one queue per systemd service, don’t do weighted queues, etc; keep sidekiq as simple as possible).&lt;/p&gt;&lt;p&gt;Last bits of advice for sidekiq systemd services. 
Here are the magical numbers for the queues, as tested by &amp;lt;hachyderm.io&amp;gt;, for you to use.&lt;/p&gt;&lt;ul&gt;&lt;li&gt;For the default, push, and pull sidekiq queues: Set &lt;code&gt;DB_POOL&lt;/code&gt; to 10 and set &lt;code&gt;-c&lt;/code&gt; to the value of &lt;code&gt;$DB_POOL&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;For the ingress and scheduler queue: Set &lt;code&gt;DB_POOL&lt;/code&gt; to 5 and set &lt;code&gt;-c&lt;/code&gt; to the value of &lt;code&gt;$DB_POOL&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;For the mailer queue: Set &lt;code&gt;DB_POOL&lt;/code&gt; to 1 and set &lt;code&gt;-c&lt;/code&gt; to the value of &lt;code&gt;$DB_POOL&lt;/code&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Ingress is very very CPU bound and will suck up a whole CPU core with very few threads, so be prepared to spin up multiple processes for the ingress queue. I used to say set the &lt;code&gt;DB_POOL&lt;/code&gt; to 20, but I have changed this to 10 to reflect real world usage; shit just runs nicer at 10.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;ruby-knobs&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#ruby-knobs&quot;&gt;&lt;span&gt;Ruby Knobs&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Regular&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Regular.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;link as=&quot;font&quot; crossorigin=&quot;&quot; data-helmet=&quot;VictorMono-v1.5.3-Italic&quot; href=&quot;https://hazelweakly.me/fonts/VictorMono-v1.5.3-Italic.woff2&quot; rel=&quot;preload&quot; type=&quot;font/woff2&quot;&gt;&lt;pre&gt;WEB_CONCURRENCY controls the number of worker processes
MAX_THREADS controls the number of threads per process
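
Worked example with illustrative values (assumed for the arithmetic, not a recommendation):
WEB_CONCURRENCY=3 and MAX_THREADS=5 means 3 processes x 5 threads = 15 threads in total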
&lt;/pre&gt;&lt;p&gt;The environment variables above apply to mastodon-web and &lt;em&gt;only&lt;/em&gt; mastodon-web. The sidekiq queue has two knobs: processes and threads. Each &lt;code&gt;mastodon-sidekiq-${queue}@N&lt;/code&gt; instance creates 1 new process. Each process can allocate &lt;code&gt;X&lt;/code&gt; threads according to the &lt;code&gt;-c X&lt;/code&gt; setting in the ExecStart of the systemd service.&lt;/p&gt;&lt;p&gt;As a further annoyance, &lt;code&gt;DB_POOL&lt;/code&gt; is the third hidden and extremely fucked up knob you have for the sidekiq services. &lt;code&gt;DB_POOL&lt;/code&gt; can be different from the concurrency but it should always be &lt;code&gt;DB_POOL &gt;= X&lt;/code&gt; (where &lt;code&gt;X&lt;/code&gt; is the concurrency in &lt;code&gt;sidekiq -c X -q ...&lt;/code&gt;). As a simplification, I don’t really ever see anyone set &lt;code&gt;DB_POOL&lt;/code&gt; to anything other than exactly &lt;code&gt;X&lt;/code&gt;.&lt;/p&gt;&lt;p&gt;So, &lt;code&gt;DB_POOL&lt;/code&gt; is local to sidekiq and also applies only to sidekiq (and also mastodon-streaming because surprise! Chuckles, that’s why). However, you have two notions of pool here. One that’s local to that particular sidekiq queue, and one that’s relevant to postgres. Postgres has a &lt;code&gt;max_connections&lt;/code&gt; setting, which is the global limit on connections shared by every client.&lt;/p&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;crawling-up-the-elephant&#39;s-trunk&quot; tabindex=&quot;-1&quot;&gt;&lt;a href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#crawling-up-the-elephant&#39;s-trunk&quot; class=&quot;header-anchor&quot;&gt;&lt;span&gt;Crawling Up the Elephant’s Trunk&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;What’s going on &lt;em&gt;in&lt;/em&gt; there? Where do duplicate queues come from? How does that get resolved? 
Some helpful people on discord have given me some clues :)&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;Mastodon does a series of operations synchronously, but none of them are atomic. one of those operations is setting the deduplication key, another one is creating a local db entry, another one is scheduling a background task to blast out the post to all the instances you have followers on&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://hachyderm.io/@untitaker&quot;&gt;@untitaker@hachyderm.io&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;In theory, this whole sequence should behave as one atomic operation. In practice, none of the steps are atomic, so every single step has the ability to cause duplications and errors.&lt;/p&gt;&lt;p&gt;The order of operations here is (thanks &lt;a href=&quot;https://hachyderm.io/@unlambda&quot;&gt;@unlambda@hachyderm.io&lt;/a&gt;):&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Check deduplication key&lt;/li&gt;&lt;li&gt;Write status to db&lt;/li&gt;&lt;li&gt;Process tags and update tag tables (in a separate transaction)&lt;/li&gt;&lt;li&gt;Process mentions and update mention stats (in a separate transaction)&lt;/li&gt;&lt;li&gt;Schedule background jobs&lt;/li&gt;&lt;li&gt;Do some updates to tables which indicate “potential friendships” (mastodon will suggest people that you might want to follow based on who you have replied to)&lt;/li&gt;&lt;li&gt;Set deduplication key in redis&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;This will cause issues in several scenarios, but the following scenario is the one you’ll notice if your DB is overloaded:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;You have a large backlog of DB transactions and the request takes a while&lt;/li&gt;&lt;li&gt;Everything works, but nginx gives up after 60 seconds&lt;/li&gt;&lt;li&gt;However, rails does not cancel the request and continues to process it&lt;/li&gt;&lt;li&gt;The app/webserver/etc retries after getting a timeout indication &lt;ul&gt;&lt;li&gt;repeat until the app stops retrying&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The summary here is 
that there’s a hidden component to scaling here:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Your server API response time should not be longer than the connection timeout&lt;/li&gt;&lt;li&gt;If it is, you’ll create a bunch of duplicate work for yourself and exacerbate the issue&lt;/li&gt;&lt;/ul&gt;&lt;div class=&quot;py-4&quot;&gt;&lt;hr&gt;&lt;/div&gt;&lt;h3 id=&quot;eating-rce-and-porridge-for-breakfast&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#eating-rce-and-porridge-for-breakfast&quot;&gt;&lt;span&gt;Eating RCE and porridge for breakfast&lt;/span&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Are you seeing a lot of &lt;code&gt;Mastodon::RaceConditionError&lt;/code&gt;s in your logs for sidekiq?&lt;/p&gt;&lt;p&gt;According to &lt;a href=&quot;https://github.com/mastodon/mastodon/issues/15525#issuecomment-898671270&quot;&gt;this issue&lt;/a&gt;, that might be totally expected.&lt;/p&gt;&lt;p&gt;This &lt;em&gt;particular&lt;/em&gt; issue has been fixed:&lt;/p&gt;&lt;blockquote class=&quot;flow&quot;&gt;&lt;p&gt;I am constantly getting this as well, and from what I can tell it’s because the retry timeout for RedisLock is set to a default of 10 seconds while the default expiration time for RedisLock is 15 minutes, per #16291&lt;/p&gt;&lt;p&gt;When called from ActivityPub::Activity::Announce this causes Sidekiq to retry until its timeout and then throw Mastodon::RaceConditionError which causes another Sidekiq retry.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;But only by making the timeout and expiration the same, 15 minutes. 
The issue itself still remains and you will run into it if your sidekiq queue times ever go above 15 minutes.&lt;/p&gt;&lt;h2 id=&quot;references&quot; tabindex=&quot;-1&quot;&gt;&lt;a class=&quot;header-anchor&quot; href=&quot;https://hazelweakly.me/blog/scaling-mastodon/#references&quot;&gt;&lt;span&gt;references&lt;/span&gt;&lt;/a&gt;&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&quot;&gt;https://gist.github.com/Gargron/aa9341a49dc91d5a721019d9e0c9fd11&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://docs.joinmastodon.org/admin/scaling/&quot;&gt;https://docs.joinmastodon.org/admin/scaling/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/&quot;&gt;https://nora.codes/post/scaling-mastodon-in-the-face-of-an-exodus/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://gist.github.com/Gargron/40afa9dc37629dfc78d6656f0ca33293&quot;&gt;https://gist.github.com/Gargron/40afa9dc37629dfc78d6656f0ca33293&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://blog.joinmastodon.org/2017/04/scaling-mastodon&quot;&gt;https://blog.joinmastodon.org/2017/04/scaling-mastodon&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://github.com/mperham/sidekiq/wiki/Kubernetes&quot;&gt;https://github.com/mperham/sidekiq/wiki/Kubernetes&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://us11.campaign-archive.com/?u=1aa0f43522f6d9ef96d1c5d6f&amp;id=997fbd1c2c&quot;&gt;https://us11.campaign-archive.com/?u=1aa0f43522f6d9ef96d1c5d6f&amp;id=997fbd1c2c&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://github.com/mperham/sidekiq/wiki/Using-Redis#multiple-redis-instances&quot;&gt;https://github.com/mperham/sidekiq/wiki/Using-Redis#multiple-redis-instances&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;https://pgtune.leopard.in.ua/&quot;&gt;https://pgtune.leopard.in.ua/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a 
href=&quot;https://www.percona.com/blog/2020/08/21/postgresql-synchronous_commit-options-and-synchronous-standby-replication/&quot;&gt;https://www.percona.com/blog/2020/08/21/postgresql-synchronous_commit-options-and-synchronous-standby-replication/&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;</content>
  </entry>
</feed>