Production goes sideways. It’s part of the job.
Over time, I’ve found that the biggest difference between a chaotic incident and a controlled one isn’t the technology—it’s the mindset of the people involved.
A few principles I try to keep in front of me when something is on fire:
- Assume the system is guilty, not the person. Most “human error” is really a design or process issue.
- Narrate what you’re doing. Saying “I’m going to restart service X on node Y” keeps the team aligned and avoids conflicting changes.
- Prefer reversible experiments. If we can roll it back, we can afford to move fast.
- Capture breadcrumbs. A few quick notes during the incident are worth hours of reconstructed guesses later.
Outages are painful, but they’re also where we learn what our systems really do—versus what the diagram says they do.