LARPing as a genius until it works

How to gain from an incident

Recently, I was involved in a couple of incidents at my company. No, they weren't my fault, but I did get roped in since I was on call. Incident response is a bit of a black box to me since

  1. incidents don't happen often
  2. the language of an SRE is slightly foreign to me

Sometimes, I feel like I don't have much to contribute other than, "Let me check if the latest deployment has anything to do with this." This time around, I tried to make an effort to actually write down what helped resolve our issues and what you can do to make the most of your next incident.

FYI, we use Kubernetes at my place, so just know that heavily influences my suggestions.

A good starting checklist

1. Check if a recent deployment has anything to do with it

2. Double-check if a recent deployment has anything to do with it

Seriously, start here first. If things are not acting as they used to, it's probably due to some new code that has been introduced. This is also the easiest to remedy. Start by reverting the deployment to the previous image. The benefit of doing this first is that you skip any checks your CI/CD pipeline would run if you merged a revert first. At my company, rolling back to the last image is just a simple button press, but you may have to consult an SRE to do this step for you. Next, make a revert PR, get the right eyes on it, and merge that puppy ASAP.
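If your rollback isn't a button press, Kubernetes can do it from the terminal. A sketch, assuming a deployment named `my-service` in a `prod` namespace (both names are placeholders for your own):

```shell
# Roll the deployment back to the previous ReplicaSet (i.e., the last image)
kubectl -n prod rollout undo deployment/my-service

# Watch the rollout until the rolled-back pods are ready
kubectl -n prod rollout status deployment/my-service

# Confirm which image is actually running now
kubectl -n prod get deployment my-service \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

Remember this only buys you time; the revert PR still needs to land so the next deploy doesn't reintroduce the bad code.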

3. Check to make sure that the pods are healthy

For the services that are affected, check to see if those pods are healthy. Have they restarted recently? Are they stuck in some crash loop? From there, start digging. I personally love to use stern, which lets me run something like stern <name-of-service> | grep "some error". Sometimes you can forgo grep and just drink from the firehose to see if things look normal. It's a little bit like reading the Matrix, but it can be helpful for spot checking. Now that you're looking at those logs, are you noticing any errors or discrepancies? If so, that's probably a huge clue as to why your service(s) are failing.
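That triage might look something like this. A sketch, assuming a `prod` namespace and an `app=my-service` label (swap in your own names):

```shell
# Any recent restarts or pods stuck in CrashLoopBackOff?
kubectl -n prod get pods -l app=my-service

# Dig into a suspicious pod: events, failed probes, OOMKills
kubectl -n prod describe pod <pod-name>

# Tail logs across every matching pod and filter for errors
stern -n prod my-service | grep -i "error"
```

The RESTARTS column and the Events section at the bottom of `describe` are usually where the first real clue shows up.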

4. What logging is in place that can help you determine and narrow down the issue?

There are an endless number of monitoring dashboards you can use. Figure out what tools you have at your disposal. More than likely, someone else has already asked the question you're attempting to answer! Use the dashboards and metrics they have built to your advantage. Beyond just looking at graphs, you may have traces for your services; use them to check how requests are traveling between your services.

This is also a great time to spot the gaps in your monitoring. "Man, I really wish I could figure out the answer to this." Well, get on it! Whether it's creating a new dashboard or adding some additional metrics/logs to your code, your future self and your team will thank you for your hard work.

5. Check the status pages of other vendors

Your services depend on other services. This could range from hosting providers like GCP/AWS to other SaaS companies you're making API calls to. If you've exhausted everything else, be sure to check any vendors that are part of your critical path.
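Many vendors host their status page on Statuspage, which exposes a machine-readable summary endpoint alongside the human one. A sketch using GitHub's status page as the example (substitute your own vendor's URL; assumes `jq` is installed):

```shell
# Statuspage-hosted pages serve a JSON summary at /api/v2/status.json
curl -s https://www.githubstatus.com/api/v2/status.json \
  | jq '.status.description'
```

Handy to drop in a runbook so nobody has to remember the URL mid-incident.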

6. I've done everything, now what?

This is probably the point where you're going to need your local wizard to bail you out.

Some tips

Usually, incidents end up with a bunch of folks on a Zoom call. How can you best use this time even if you don't have much to contribute?

Stay calm, listen for instruction

There are some smart folks on the call who breathe dashboards, know the manifests like the back of their hand, and are not scared of anything. And I think that's the most important part: they aren't scared, so why should you be? This stuff happens, so put any emotion to the side and be willing to listen and do what you're told.

Don't be a hero, be a helper

I can't tell you how many times I've been a part of these calls just looking for the "silver bullet". It does exist, but it doesn't come instantly. Don't worry about looking great in front of everyone; just report on the things you think will be the most helpful. Your small piece of advice could turn into the solution!

Have some confidence

Speak up and offer help. Again, don't worry about how you look, just get it done as effectively as possible which includes saying things like "I don't know where that lives".

Stay on the call for as long as you can

This is where the growing happens! There is so much information that you can take in to understand and learn from the situation. Even when the dust has settled and you're no longer needed, you should take advantage of being in the room with the smartest people. These events (should) happen rarely so take advantage of every opportunity!

Be a part of every call

Piggybacking off the above, even if you've not been asked to be on the call, get on the call! Again, take advantage of being around people who are smarter than you and are using their maximum brainpower.

When appropriate, ask questions

You should not do this while everyone is still figuring out a solution. But once things have come to a lull, ask a question to help with your own understanding of the problem. "Can someone explain why we've decided on this solution?", "What are the reasons we can't find the root cause?", "What does that term mean?" Even if you haven't been an active participant, ask a question to get a better understanding of the problem (and mainly to learn).

Write down anything you do learn

Kinda like I'm doing right now.

Conclusion

Incidents aren't necessarily fun, but they can be incredibly advantageous to your development. If you want more responsibility, you need to understand how things work under the hood. And when problems do arise, you need to be able to solve them quickly. The best way to do that is with practice and learning from those who know more than you. Take advantage of the moments when the smartest people are in the room.