The Cost of Instability Is Invisible Until It Isn't
The incident that finally breaks an engineering organization is almost never the first time the system has been under that kind of stress. It’s just the first time the stress exceeded the organization’s tolerance for looking away. Every system that collapses spectacularly had a long history of smaller collapses that were caught, patched, and quietly filed away as lessons-learned that nobody actually learned from.
This is the thing about the move-fast culture that the “move fast and break things is bad at scale” critique never quite captures. The critique focuses on the big failures — the outage, the data loss, the incident that gets a postmortem. But the real damage accumulates invisibly, in a thousand small moments where the system degraded below its intended state and the organization chose — explicitly or by default — to treat that degradation as acceptable. The P2 that wasn’t a P0. The latency that crept up 40ms over two quarters. The error rate that doubled but stayed below the alert threshold. Each of these is a lesson, and the lesson the org learns from each one is: this level of degradation is fine.
That’s a training problem, and it compounds. The team that resolves a dozen P2s without ever pushing the reliability conversation upward learns that degradation is a local problem, not an organizational one. The manager who accepts “it’s not affecting customers” as a sufficient answer to a stability question learns that degradation only matters when someone is yelling. The leadership team that reviews a clean incident dashboard every quarter while running a degraded-but-not-down system learns that the dashboard is the reality. Over time, the organization has systematically calibrated itself to tolerate more degradation than it intends to, without anyone having made that decision explicitly.
The Alert Threshold Is a Policy Decision
One of the most consequential decisions in engineering operations gets made with almost no executive visibility: where to set the alert thresholds. This is where the normalization of degradation begins. When a latency alert fires at p95 = 2 seconds because that's the value someone picked when they first configured the monitoring, while the actual user experience degrades significantly at p95 = 800ms, the organization has made a policy decision that it will not know when the experience becomes bad. It will only know when the experience becomes catastrophically bad. Everything between “working as designed” and “catastrophically bad” is invisible.
The teams that avoid this failure don’t just set better alert thresholds — though they do that. They treat the threshold decisions as visible, documented policies that have a stated rationale and a review cadence. The threshold is not wherever the line happened to land when someone configured the monitoring tool. It represents an explicit claim about what constitutes acceptable behavior, who decided that, and when it was last revisited. That distinction sounds operational but it’s actually cultural. It’s the difference between an engineering organization that has a relationship with its own standards and one that doesn’t.
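Concretely, “threshold as policy” can be as simple as storing each threshold next to its own justification. Here is a minimal sketch in Python; the ThresholdPolicy structure, its field names, and the 90-day review cadence are illustrative assumptions, not any particular monitoring tool's API:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ThresholdPolicy:
    """An alert threshold treated as a documented policy, not a leftover config value."""
    metric: str            # what the alert watches, e.g. a p95 latency series
    threshold: float       # the value the alert fires at
    rationale: str         # why this number constitutes "unacceptable"
    decided_by: str        # who made the call
    last_reviewed: date    # when the number was last revisited
    review_every: timedelta = timedelta(days=90)  # assumed cadence

    def is_stale(self, today: date) -> bool:
        """A threshold nobody has revisited is a policy nobody is standing behind."""
        return today - self.last_reviewed > self.review_every

# Hypothetical example: the threshold set where the experience actually
# degrades, not wherever the monitoring tool happened to default.
checkout_latency = ThresholdPolicy(
    metric="checkout_latency_p95_ms",
    threshold=800,
    rationale="User-visible sluggishness appears above ~800ms p95 in session replays",
    decided_by="payments team + product, Q2 reliability review",
    last_reviewed=date(2025, 6, 10),
)

if checkout_latency.is_stale(date.today()):
    print(f"Review overdue: {checkout_latency.metric} threshold "
          f"{checkout_latency.threshold} last reviewed {checkout_latency.last_reviewed}")
```

The point of the structure is not the code; it's that every threshold now answers “why this number, who decided, and when was it last questioned” by construction.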
The executive who hasn’t asked “what would have to be true for our monitoring to miss a significant user experience degradation?” probably doesn’t know the answer. And not knowing the answer is how you end up on a call with a major customer who tells you their experience has been bad for six weeks, while your dashboard shows green.
Degradation as a Leading Indicator
The subtler problem with invisible degradation is what it does to engineering capacity. A system that is chronically slightly broken generates a constant tax on the team — alerts that get acknowledged and ignored, bugs that get triaged and deferred, workarounds that get built into production code because nobody wants to fix the root cause. That tax doesn’t show up in velocity metrics. It shows up in the team’s diminishing ability to take on new work, and in the growing length of the list of things that everyone knows are wrong but that never quite rise to the level of being prioritized.
This is the connection to innovation that usually gets missed. The argument is typically framed as a velocity question — should we move fast or be stable — as if velocity and stability are on a tradeoff curve. They’re not. Velocity is downstream of engineering capacity, and engineering capacity is slowly destroyed by accumulated degradation. The team that is perpetually firefighting a half-broken system is not shipping features at the pace of a team operating on a healthy foundation. The cost just isn’t visible in any of the usual places you look for it.
Fixing this requires something genuinely uncomfortable: regular, explicit conversations about system health at a level of detail that most leadership teams avoid because it feels operational rather than strategic. What is the current degraded-but-not-down list? What’s the user-facing impact? What’s the engineering tax? What would it take to resolve each item? Not as a one-time audit, but as a standing part of the engineering rhythm. The teams that do this consistently end up with cleaner systems not because they dedicate heroic effort to it but because the visibility itself changes the incentives. Degradation that is visible and discussed is degradation that gets fixed. Degradation that is invisible is degradation that accumulates.
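What that standing list can look like in practice, sketched as a structure rather than a process doc; the DegradationItem fields and the example entry below are hypothetical, chosen so that each of the four questions above must be answered per item:

```python
from dataclasses import dataclass

@dataclass
class DegradationItem:
    """One entry on the standing degraded-but-not-down list."""
    summary: str           # what is degraded
    user_impact: str       # observable effect on users, even if nobody is yelling
    engineering_tax: str   # recurring cost: ignored alerts, workarounds, triage time
    cost_to_fix: str       # rough estimate of what resolution would take
    owner: str             # who answers for this item at the next review

# Hypothetical entry: the kind of thing that never trips an alert
# but quietly drains capacity quarter after quarter.
register = [
    DegradationItem(
        summary="Search index rebuild fails roughly weekly; cron retries mask it",
        user_impact="Stale results for up to 6 hours after each failure",
        engineering_tax="~2 engineer-hours/week babysitting retries",
        cost_to_fix="~1 sprint: make the rebuild idempotent, alert on staleness",
        owner="discovery team",
    ),
]

for item in register:
    print(f"[{item.owner}] {item.summary} | impact: {item.user_impact}")
```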
The organization that waits for the spectacular failure to take stability seriously has already made it much more expensive to fix. By then, the debt is structural, the team is trained to tolerate it, and the cleanup is a multi-quarter project competing against a product roadmap. Better to have the conversation while the cost is still a P2.