
Transcript

Mathematics of the Second Attempt: System Design from First Principles


Okay, so everything you build is going to fail. I want to start there. No fluff, no diagram with happy arrows, right? If you're a junior engineer, you're probably spending most of your time thinking about the happy path. You write code and you think, "Okay, the database is up, the network is fast, my function returns what it says it returns." That's not how any of this works. In the real world, the network is a minefield, and I don't mean that metaphorically: cables literally get cut, routers overheat, a data center in Ohio gets hit by a thunderstorm, and suddenly 10% of your packets are just vanishing into nothing. Or maybe Stripe is having what they call a micro outage, which is a very polite way of saying, "We broke something, and you're going to find out."

The first thing I want you to internalize, like really internalize, is this: failure is not an exception. Failure is a feature of the system. It's load-bearing. If you build a system that assumes success, you've built a glass house, and as soon as a single pebble hits it, a 503 error, a timeout, a disk that's just slow, the whole thing shatters. A senior architect doesn't build a system that doesn't fail. They build a system that fails gracefully, that heals itself. That's the shift, right?

So today we're going to talk about what I call, and I think this is a really useful framing, the mathematics of the second attempt. We're going to look at why naive retries can literally act like a DDoS attack against your own infrastructure. We're going to talk about exponential backoff, why you need jitter on top of that, and why idempotency is not optional, it's the foundation. If you don't have it, you can't safely retry anything.

Let me show you the most basic reaction to failure: the retry. You call an API, it returns a 500. Your code says, "Oh, a glitch, let me just try again immediately." I get why this seems reasonable. Maybe it was a transient thing, right? But let me walk you through what actually happens here, because the physics of this are kind of fascinating, and also terrifying. Imagine you have a service that's struggling. It's at 95% CPU. It's barely hanging on. A small spike in traffic hits, and now the service is failing, say, 10% of requests, because it's overloaded. If all your clients have simple, immediate retries, what happens? The traffic from that 10% of failed requests just doubled. You've increased the load on an already dying server, which causes more failures, which causes more retries, which causes more failures. This is what we call a retry storm, and AWS's own engineering documentation talks about it extensively. The observation from teams working on things like Amazon EBS is that retries, when not controlled, can keep load high long after the original issue is resolved. They literally delay recovery. So, within seconds, your 10% failure rate becomes a 100% outage. With your very polite-sounding retry logic, you have basically finished off your own server.

Okay, so the solution to the retry storm is exponential backoff. The idea is almost embarrassingly simple. If you fail, don't try again immediately. Wait. And if you fail again, wait longer. We use a power-of-two curve: attempt one fails, you wait two seconds; attempt two fails, wait four seconds; attempt three fails, wait eight seconds, up to some cap. AWS's guidance puts that cap typically somewhere in the 30 to 60 second range. You don't want exponential functions running uncapped, because, and this is a fun math fact, 2 to the 10th power is about 1,000, so by around the tenth retry you'd be waiting on the order of 1,000 seconds. That's roughly 17 minutes. So you cap it.
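If you want to see the shape of that loop, here's a minimal sketch in Python. The names (call_with_backoff, the call_service callable) and the constants are just illustrative, mirroring the numbers from above; this is a sketch of the pattern, not any particular library's API:

```python
import time

BASE_DELAY = 2      # seconds: the wait after the first failure
MAX_DELAY = 60      # the cap, in the 30 to 60 second range mentioned above
MAX_ATTEMPTS = 5

def call_with_backoff(call_service):
    """Retry a flaky call with capped exponential backoff (no jitter yet)."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_service()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # out of attempts, let the caller decide what to do
            # 2s, 4s, 8s, 16s, ... but never more than MAX_DELAY
            delay = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
            time.sleep(delay)
```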
By backing off, you're giving the downstream service room to breathe. You're giving the engineers at AWS time to fix the router. You're being what I'd call a good citizen of the network.

But, and this is the part people miss, and I think it's the most interesting part, exponential backoff alone is not enough. The reason is subtle, and honestly beautiful. Picture this: a thousand clients all hit a server at the same time. The server crashes. All 1,000 clients fail at exactly t = 0. They all have the same backoff algorithm, so they all wait exactly 2 seconds. Then, at t = 2.000 seconds, all 1,000 clients retry in the exact same millisecond. The server, which was trying to recover, bless its heart, just gets punched in the face by a thousand simultaneous requests. It crashes again. They all wait four seconds. At t = 6.000 seconds, they all hit it again. This is called the thundering herd, and here's what makes it so insidious: you did everything right. You added backoff, but your clients are still synchronized. They're acting like a coordinated army. AWS engineers who worked on EBS and Lambda actually observed this in production: clients on regular intervals would line up and trigger at the same moment, like the first few seconds of a minute, or right after midnight for daily jobs.

The solution is jitter, which is really just a fancy word for randomness. Instead of sleeping exactly 2 seconds, you sleep a randomized amount of time around 2 seconds. One client waits 1.8 seconds, another waits 2.1 seconds, another waits 2.4 seconds. This smears the requests across time. Instead of a spike of a thousand requests in one millisecond, the server sees, and I like this metaphor, a gentle rain of requests over a full second. Much easier to handle. AWS published a detailed analysis of different jitter strategies: full jitter, equal jitter, decorrelated jitter. The empirical finding was that no-jitter exponential backoff was, quote, "the clear loser. It takes more work and more time than any of the jittered approaches." The return on implementation complexity is huge. So, if you are writing a retry loop and you don't have random somewhere in your sleep timer, you're doing it wrong. Randomness is the secret sauce of stability, which is a weird sentence, but it's true.
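Here's the earlier sketch with full jitter added, which is one of the strategies from that AWS analysis: instead of sleeping the whole computed delay, each client sleeps a random amount between zero and that delay. Again, the names and constants are illustrative assumptions, not a specific library's implementation:

```python
import random
import time

BASE_DELAY = 2
MAX_DELAY = 60
MAX_ATTEMPTS = 5

def call_with_jittered_backoff(call_service):
    """Capped exponential backoff with full jitter to de-synchronize clients."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_service()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            cap = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
            # Full jitter: sleep anywhere in [0, cap], so a thousand clients
            # smear their retries across time instead of hitting the server
            # in the same millisecond.
            time.sleep(random.uniform(0, cap))
```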
Okay, this is the most important concept in this whole video, and I want you to really sit with it: idempotency. Here's the scenario. You're building a payment system. The user clicks buy. Your server sends a request to Stripe: "Charge this card a hundred dollars." The request goes out, Stripe receives it, charges the card, and sends back a success response. But the network is flaky. The success packet gets lost. Your server times out, thinks the request failed, and retries. If you simply send the "charge a hundred dollars" request again, Stripe charges the card a second time. The user is now out two hundred dollars. They're angry, and your support queue is on fire. This happened because your charge operation was not idempotent. An idempotent operation is one where, no matter how many times you perform it, the result is the same as the first time. In math, people write this as f(f(x)) = f(x). In software, "delete user 10" is idempotent: whether you delete once or ten times, the user is gone. But "charge a hundred dollars" is not idempotent by default. "Create user" is not idempotent by default.

The fix is an idempotency key, sometimes called a nonce. Stripe actually implements this. The flow is: one, your app generates a unique ID for this transaction, say rec-abc-123. Two, you send "charge a hundred dollars, idempotency key = rec-abc-123" to Stripe. Three, Stripe checks its database: "Have I seen rec-abc-123 before?" No, so it charges the card and records that rec-abc-123 is done. Four, the network fails and you retry with the same key. Five, Stripe checks: "Yes, I've seen this one. Here's the cached success response." No double charge. This is a first principle of API design: every write operation in a distributed system must be idempotent. If it's not, you cannot safely retry, and if you can't safely retry, you can't survive a network glitch. The AWS Well-Architected Framework calls this out explicitly: idempotency is a precondition for implementing retry logic safely.
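As a rough sketch of what the receiving side has to do, here's the shape of that check in Python. The charge_card() function is hypothetical, and a plain dictionary stands in for the durable store a real service like Stripe would use; it's an illustration of the pattern, not Stripe's actual implementation:

```python
import uuid

# Hypothetical in-memory idempotency store; a real service persists this in a
# database so that retries survive process restarts.
_seen_keys = {}

def charge(amount_cents, idempotency_key, charge_card):
    """Charge at most once per idempotency key, even if the caller retries."""
    if idempotency_key in _seen_keys:
        # Already did this work: return the cached response instead of
        # charging the card a second time.
        return _seen_keys[idempotency_key]
    result = charge_card(amount_cents)
    _seen_keys[idempotency_key] = result
    return result

# The client generates the key once per logical operation and reuses the same
# key on every retry of that operation, e.g.:
#   key = str(uuid.uuid4())
#   charge(10_000, key, charge_card)   # safe to call again with the same key
```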
Sometimes retrying is a bad idea, full stop. Imagine you're calling a third-party service, and it's not just glitchy, it's gone. Its data center is underwater, metaphorically, hopefully. If you keep retrying, even with backoff and jitter, you're still holding threads open. You're waiting for timeouts. You're wasting your own resources being polite to a corpse. This is where we use the circuit breaker pattern. It works exactly like the electrical circuit breaker in your house. In the closed state, everything's normal: requests flow through, and the breaker counts failures. If the failure rate hits, say, 50% over the last minute, the circuit trips into the open state. For the next 30 seconds, it doesn't even try to call the service. It immediately returns an error: service is down, don't bother. This is called failing fast, and it protects your system from being dragged down by a dead dependency. Users get an immediate answer instead of waiting 30 seconds for a timeout that will never resolve. Then there's the half-open state: after those 30 seconds, the breaker lets one request through, a scout. If that succeeds, the circuit closes and everything's back to normal. If it fails, it trips again. The circuit breaker is how you build a system that bends and snaps back, rather than one that just stays broken.

What happens when you've retried 10 times, the backoff is maxed out, and the message still won't process? You can't delete it. If it's a new-order message, deleting it means you just lost money. But you can't keep it in the main queue forever either; it'll block all the healthy messages behind it like a car accident on a highway. The solution is the dead letter queue, or DLQ. When a message fails its final retry, the system moves it out of the main queue and into the DLQ: quarantine. The main system keeps humming along. Later, an engineer or a script can look at the DLQ and ask, "Why did these fail?" Bug in the code? Fix the bug, replay the messages. User's credit card expired? Send them an email. One-time weirdness? Replay them manually. A DLQ is basically garbage collection for your business logic. No data is ever truly lost, even in total failure. This is incredibly important and honestly dramatically underrated.

I have to talk about timeouts, because a missing timeout might honestly be the number one thing that kills distributed systems. If you call a service without specifying a timeout, your thread will wait forever. If that service hangs, maybe it's deadlocked, maybe it's just slow, your thread stays open. Then another thread stays open. Soon, all 100 of your web server threads are stuck waiting for something that will never answer. Your server is now a brick. First principle: never make a network call without a timeout. And I mean at every layer: connection timeout, read timeout, total request timeout. There's a pro pattern called deadline propagation that I really like. If the user's browser is only going to wait 10 seconds for a response and your web server calls service A, tell service A, "You have 9 seconds." If service A then calls service B, "You have 8 seconds." This prevents what's called zombie processing, where service B is working really hard computing a result that the user has already given up on. The computation is just happening in a vacuum.

Let me talk about a mindset shift that I think separates junior from senior engineers. Junior engineers focus on MTBF, mean time between failures. They try to make the system never fail. They spend weeks writing perfect code. Senior engineers focus on MTTR, mean time to recovery. They accept that failure is inevitable, and they spend their time making the system recover faster. Think about the math. If your system fails once a year but takes 24 hours to fix, that's a bad system. If your system fails every single day but heals itself in 100 milliseconds via circuit breakers and retries, that's a world-class system. The question isn't, "How do I prevent this from ever breaking?" It's, "If this database goes away, what does the user see? If this API call fails, does the whole page go blank, or do we just hide one small widget? Can we degrade gracefully?" Netflix is the canonical example here. If the recommendation service is down, Netflix doesn't show you an error screen. They show you a generic popular-movies list. You don't even know the system is partially broken. That is the ceiling of system design: invisible failure handling.

So let me synthesize. What does the architecture of a resilient system actually look like? It's a system that treats every interaction as a negotiation with a potentially broken partner. Specifically: one, expect failure. Don't be surprised when an API returns a 500; be surprised when it doesn't. Two, retry smartly. Exponential backoff to give the system space, and jitter, always jitter, to prevent synchronization. AWS's empirical analysis shows jittered backoff consistently outperforms non-jittered backoff. Three, be idempotent. Every write operation needs a key so you can retry without fear. This is the foundation of distributed trust. Four, trip the circuit. Use circuit breakers to fail fast when a dependency is gone. Stop trying to reach a dead service. Five, quarantine misfits. Dead letter queues ensure no data is lost, even in total failure. Six, set deadlines. Timeouts everywhere, propagated down the call stack. When you combine these patterns, you stop being a victim of the network and start being its master. You stop building software and start engineering systems.

Next up, we're going to move away from the machine and talk about the human side. Video 13 is about the microservice trap. We've talked about how to build big distributed systems, but the next question is, should we? What does decoupling actually cost you versus the simplicity of just keeping it as a monolith? See you there.