The most expensive engineering work I’ve done was cutting an API’s response time from 94 seconds to 2.4. The least valuable part was the code change. The valuable part was the week before, when I sat in a conference room with a marker and drew the wallet-analysis pipeline on the wall. Which boxes were the database. Which were the queue. Which were the cache. Where the round-trips fanned out and where they merged again. By the time the diagram was finished, the 97 percent latency cut was already obvious. The diff was a formality.
That’s the work nobody pays for and a lot of engineers underestimate. The senior part of senior engineering isn’t about knowing more languages or picking the right database. It’s about thinking in a way that survives contact with production load. What follows is the working set of mental models I keep coming back to. None of it is original. All of it has earned its way into how I read systems.
What a system actually is
Russell Ackoff’s shortest definition is the one I trust: a system is never the sum of its parts; it’s the product of their interactions. That phrasing rewires what you’re looking at. Your pipeline isn’t the database plus the queue plus the workers plus the cache plus the load balancer. It’s what they do to each other when traffic spikes. When one of them lies about being healthy. When the queue grows faster than the workers drain it. None of those interactions live in any individual component, which is why an architecture diagram of the boxes doesn’t actually describe the system.
Donella Meadows wrote the cleanest version of this for non-engineers in Leverage Points: Places to Intervene in a System. Her vocabulary is stocks (what accumulates) and flows (what moves). Translated for us, queue depth is a stock, request rate is a flow, and the relationship between them is where backpressure, retries, and drainage live. That relationship is the part most architecture diagrams don’t draw. They can’t. The rectangles-and-arrows language doesn’t support it. So we keep drawing the cast list and calling it a script.
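To make the stock-and-flow framing concrete, here is a toy simulation, my own sketch rather than anything from Meadows, with every number invented: queue depth is the stock, arrivals and drains are the flows, and a bounded buffer stands in for the crudest possible backpressure.

```python
# Toy stock-and-flow model: queue depth is a stock, arrivals and drains are flows.
# All numbers are invented for illustration.
def simulate(arrival_rate, drain_rate, seconds, max_depth=None):
    depth = 0  # the stock: items sitting in the queue
    for _ in range(seconds):
        accepted = arrival_rate
        if max_depth is not None:
            # crude backpressure: shed anything the buffer can't absorb this tick
            accepted = max(0, min(arrival_rate, max_depth - depth + drain_rate))
        depth = max(0, depth + accepted - drain_rate)
    return depth

# Arrivals outrun the drain by 50 req/s: the stock grows without bound.
print(simulate(arrival_rate=450, drain_rate=400, seconds=300))                  # 15000
# Same flows, bounded buffer: the stock stabilises and the excess is shed.
print(simulate(arrival_rate=450, drain_rate=400, seconds=300, max_depth=1000))  # 1000
```

The rectangles-and-arrows diagram shows the queue and the workers. It does not show that 50-per-second gap, and the gap is the system.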
Gall’s Law and the only architecture that ships
John Gall wrote one sentence in 1975 that I’d tape to the wall of every architecture meeting if I could: “A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work.” This is why CORBA died and HTTP won. It’s also why every successful platform you can name was, at some point in its past, embarrassingly small.
I think about PandaTerminal’s wallet indexer a lot. Version one was a Python script that wrote CSVs and ran by hand. By the time it was indexing 250 million Ethereum addresses across four trillion rows, it had become a fan-out of workers behind a job queue, a partitioned store, a hot-path cache, and a streaming WebSocket layer. The architecture diagram of v15 looks intimidating. The path from v1 to v15 was a sequence of incremental, working systems. Nobody could have designed v15 in a planning document. We tried. We failed. Then we shipped v2.
Where leverage actually lives
Meadows ranks twelve places where you can intervene in a system, from shallow to deep. The shallow end is the stuff most engineers spend their careers on: parameters (autoscaling thresholds, cache TTLs), buffers (pool sizes, KV-cache limits), stocks and flows (request paths, retry storms). Then it gets interesting. Negative feedback loops, which is where circuit breakers and backpressure live. Information flows, which is observability. Rules, which is your retry budget and idempotency contract. Self-organisation, which is autoscaling and leader election. Goals, which is what you measure and what you choose to ignore. Paradigms, which is whether your team believes reliability is a feature of code or a feature of the org chart. Most teams I’ve worked with do not agree on that last question, and that disagreement is the source of about half their incidents.
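For the negative-feedback level, the cleanest code-shaped example I have is a circuit breaker. This is a toy sketch of my own, not any particular library's implementation, and the threshold and reset window are picked out of thin air:

```python
import time

class CircuitBreaker:
    """Toy negative feedback loop: failures open the circuit, time closes it.
    Threshold and reset window are illustrative, not recommendations."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means closed: traffic flows normally

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast instead of piling on")
            self.opened_at = None  # half-open: let one probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open: stop feeding the failure
            raise
        self.failures = 0  # success counteracts the error signal
        return result
```

The loop is the point: errors feed back into the decision to send more traffic, instead of the default, which is no feedback at all.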
The 57% AWS bill cut at RonianAlytics looked, from the outside, like a parameter intervention. Fewer ECS tasks. Smaller RDS class. Tighter Lambda concurrency. The actual unlock was much deeper in the list, at the level of goals. The team had been optimising for “don’t page the on-call,” which produces generous over-provisioning, because over-provisioning is the cheapest insurance against pages. We changed the goal to “cost per successful request.” Once that was the number on the dashboard, the parameter changes wrote themselves. Nobody had to be smart. The system told us what to do.
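For what it's worth, the dashboard number was nothing fancier than this shape; the field names below are invented for illustration:

```python
# Hypothetical shape of the goal metric: total spend over requests that succeeded.
def cost_per_successful_request(total_infra_cost_usd, requests):
    successes = sum(1 for r in requests if r["status"] < 500)
    return total_infra_cost_usd / max(successes, 1)
```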
What to ignore
Hyrum’s Law is the systems-thinking version of an inside joke that turned out to be true: with a sufficient number of users of an API, it does not matter what you promise in the contract; all observable behaviours will be depended on by somebody. That’s not a complaint about your users. It’s a constraint on what “your API” means. Every error message format you ship is a public API. Every ordering you didn’t mean to guarantee is a public API. The timing distribution of your responses is a public API. The system you maintain is the one your users observed, not the one you wrote.
Dan McKinley’s Choose Boring Technology is the corollary I keep returning to. A company gets about three innovation tokens, and they don’t replenish quickly. Spend them on the part of the system where novelty actually pays for itself. The reason boring technology wins isn’t the feature list. It’s that ten thousand other engineers already know how it fails. PostgreSQL is boring. SQS is boring. NGINX is boring. Their failure modes are common knowledge, and common knowledge is the asset.
The opposite trap, which I’ve fallen into more than once, is an aesthetic attachment to formal methods, exotic consistency models, or hand-rolled consensus. Marc Brooker works on serverless inside AWS, and his post Formal Methods: Just Good Engineering Practice? is unusually honest about this. Formal methods solved half his problems. The other half was operability, which no theorem prover catches and which is where everything actually goes wrong.
How systems actually fail
If you read one thing in this whole essay, make it Richard Cook’s How Complex Systems Fail. Eighteen short theses, dense as a haiku, that took me three re-reads to actually internalise. The bits that matter most for engineering teams: complex systems run as broken systems all the time. Production has more degraded components than you think. The system is up because the operators are continuously compensating. Catastrophe requires multiple failures, never one; outages are always compositions. “Root cause” is a story we tell ourselves after the fact, mostly to feel like adults. And every practitioner action is a gamble. Hindsight bias makes good decisions look reckless and bad decisions look obvious, and we only ever review them once the outcome is known.
Lorin Hochstein extends this in the only useful direction. On Surfing Complexity he treats incidents as field studies, not bugs. The interesting question isn’t what broke. It’s why it looked like the right thing to do at the time. If your incident reviews keep ending in answers that sound obvious, your reviews aren’t doing the work.
Kelly Shortridge frames the operational implication in Security Chaos Engineering as safe-to-fail, not fail-safe. You can’t prevent failure in a system this complex. You can only shape the blast radius. Once you accept that, a lot of architecture conversations get easier.
How to operate one
Charity Majors has spent a decade arguing the same thing in different rooms: you can’t not test in production; the only question is whether you do it deliberately. Observability, in her framing, is the ability to ask new questions of your system without shipping new code. Dashboards are the answers to yesterday’s questions. If a fixed dashboard is your only window in, you’re blind to whatever broke that you didn’t already know to look for, and the breakage you didn’t predict is exactly the breakage that takes you down.
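The practical unit in that framing is the wide, structured event: one fat record per request, with whatever high-cardinality fields seemed interesting at write time, queryable later along axes you didn't anticipate. A minimal sketch, with field names I've invented:

```python
import json
import sys
import time
import uuid

def emit_request_event(sink, **fields):
    # One wide, structured event per request. The value is in the arbitrary,
    # high-cardinality fields: tomorrow's question gets answered by slicing
    # on something you recorded today, not by shipping a new metric.
    event = {"event_id": str(uuid.uuid4()), "timestamp": time.time(), **fields}
    sink.write(json.dumps(event) + "\n")  # stand-in for your real event pipeline

# Usage: every field name here is made up; the width is the point.
emit_request_event(
    sys.stdout,
    route="/wallets/{id}/analysis",
    user_id="u_123",
    plan="pro",
    region="eu-west-1",
    cache_hit=False,
    downstream_ms={"postgres": 41.0, "redis": 1.2},
    duration_ms=87.5,
)
```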
On the production-engineering side, my most-bookmarked working set is the Amazon Builders’ Library and Marc Brooker’s personal writing on retries. The headline lesson: retries are a load amplifier by default. They become a load governor only when you pair them with backoff, jitter, and a budget. A retry without a budget is a self-inflicted DDoS waiting for an excuse, and most outages I’ve been on the inside of had one of those somewhere in the call graph.
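The shape Brooker and the Builders' Library argue for looks roughly like the sketch below. The budget here is a generic token bucket of my own, not their exact algorithm, and every constant is illustrative:

```python
import random
import time

class RetryBudget:
    """Crude token bucket: first attempts pay a fraction of a token in,
    retries take a whole token out. Ratios are illustrative."""

    def __init__(self, ratio=0.1, initial=10.0, cap=100.0):
        self.ratio, self.cap, self.tokens = ratio, cap, initial

    def record_attempt(self):
        self.tokens = min(self.tokens + self.ratio, self.cap)

    def can_retry(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def call_with_retries(fn, budget, max_attempts=3, base_s=0.1, cap_s=2.0):
    for attempt in range(max_attempts):
        if attempt == 0:
            budget.record_attempt()  # only first attempts earn budget
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1 or not budget.can_retry():
                raise  # out of attempts or out of budget: stop amplifying load
            # exponential backoff with full jitter, capped
            time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
```

When the downstream is healthy, the budget never binds. When it's down, the budget is the difference between a bad afternoon and a retry storm.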
Pat Helland’s essays are the most underread engineering writing of the past decade. The two that earned permanent space in my head are Data on the Outside vs Data on the Inside and Immutability Changes Everything. The summary that took me years to actually believe: a service boundary is a temporal boundary, not a network one. Anything you receive from outside is in the past. Anything you send is also in the past by the time it arrives. Most of the dumb mistakes I’ve watched myself make in distributed systems came from pretending otherwise.
Closing
The Bezos API mandate, captured by Steve Yegge’s 2011 rant, is the closing example I keep going back to. Bezos forced Conway’s Law into a known shape. Every team exposes its data through a service interface, no back doors, no shared databases, no exceptions. On the surface, it’s an org-design memo. Underneath, it’s a goal-level intervention, one of the deepest of Meadows’ leverage points, that made every internal capability externalisable by default. That’s what AWS became. That’s what senior engineers do, in less dramatic forms: notice that what looks like an architecture decision is usually a goal decision in disguise, and that what looks like an org problem is usually a feedback-loop problem two levels down.
Stay close to the failures. Draw the timing diagram. Ration innovation tokens ruthlessly. The system you maintain is the one your users observe, not the one you wrote. Everything else is downstream.