Incident Response at Zapnito
As a SaaS provider, when things go wrong you try to get them fixed as quickly as possible. In addition to technical troubleshooting there’s a lot of coordination and communication required. At Zapnito we’ve documented our practices in an incident response framework. This framework is influenced by and modelled on the incident response framework at Heroku.
Heroku's incident response framework is based on the Incident Command System (ICS), and we have adapted it for our own use here at Zapnito. The Incident Command System is used in natural disaster response scenarios. Much of the ICS governs procedures for effective communication, and these can be applied to any urgent incident management process regardless of technology or problem domain. If you need to get a problem fixed urgently and connect geographically dispersed team members, then the ICS should help.
When an incident occurs, we follow these steps:
- Move to a central chat room. Before starting work on the incident, move to a shared “Incident Response” Slack room. This ensures everyone is on the same page about the initial response and that messages are not spread across SMS, email or other Slack channels. Everyone can see what everyone else is saying or has said.
- Designate IC. The Incident Commander (“IC”) is the leader of the response effort. The IC doesn’t fix issues directly or communicate personally with customers. Instead they’re responsible for the health of the incident response: ensuring that the right responders are involved, that everyone has the information they need, that all issues are covered, and that incident resolution is proceeding well overall. By default the IC is the first person to notice the problem, but for significant incidents the role is usually transferred to a dedicated IC, typically the Zapnito CTO.
- Update Customers. Customers are updated by a designated Customer Communications Leader. This is a different role from the IC, and the person in it is responsible for all customer communications. Customer communications will typically be sent via email and/or dedicated customer Slack channels.
- Send out internal SitRep. Next the IC sends out the first Situation Report ("SitRep") to the internal team describing the incident. It includes what we know about the problem, who is working on it and in what roles, and open issues. As the incident evolves, the sitrep acts as a concise description of the current state of the incident and our response to it. A good sitrep provides information to active incident responders, helps new responders get quickly up to date about the situation, and gives context to other observers like customer support staff. When a sitrep is created or updated, it’s distributed internally via email and via the Incident Slack channel.
- Assess problem. The next step is to assess the problem in more detail. The goals here are to gain better information about customer impact (e.g. which users are affected and how, what they can do to work around the problem) and more detail that will help engineers fix the problem (e.g. which internal components are affected, the underlying technical cause). The IC collects this information and reflects it in the sitrep so that everyone involved can see it.
- Mitigate problem. Once the response team has some sense of the problem, it will try to mitigate customer-facing effects if possible. For example, we may put the platform in maintenance mode to reduce load on infrastructure systems, or boot additional instances in our fleet to temporarily compensate for capacity issues. A successful mitigation will reduce the impact of the incident on our customers and end users, or at least prevent the customer-facing issues from getting worse.
- Coordinate response. In coordinating the response, the IC focuses on bringing in the right people to solve the problem and making sure that they have the information they need. The IC may also create a shared Google Doc for the team to collect notes together in real time, or start a high-bandwidth video call to work through issues more quickly than is possible over text chat.
- Manage ongoing response. As the response evolves, the IC acts as an information radiator to keep the team informed about what’s going on. The IC keeps track of who’s active on the response, which problems have been solved and which are still open, the current resolution methods being attempted, and when we last communicated with customers, and reflects this back to the team regularly through the sitrep mechanism. Finally, the IC makes sure that nothing falls through the cracks: that no problems go unaddressed and that decisions are made in a timely manner.
- Post-incident cleanup. Once the immediate incident has been resolved, the IC calls for the team to unwind any temporary changes made during the response. For example, alerts may have been silenced and need to be turned back on. The team double-checks that all monitors are green and that all incidents in Pingdom have been resolved.
- Post-incident follow-up. Finally, the IC will tee up the post-incident follow-up. Depending on the severity of the incident, this could be a quick discussion in our normal weekly operational review or a dedicated internal post-mortem with an accompanying public post-mortem write-up. The post-mortem process often informs changes that we should make to our infrastructure, testing, and process; these are tracked over time within engineering as incident remediation items.
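To make the sitrep described in the steps above more concrete, here is a minimal sketch in Python of how one could be modelled and rendered as text. It is an illustration only, not the tooling we use: the `SitRep` class, its field names, the layout of the rendered message, and the sample incident are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class SitRep:
    """Hypothetical sitrep record; every field name here is illustrative."""

    summary: str                      # what we know about the problem
    responders: Dict[str, str]        # role -> person, e.g. "IC" -> "Alice"
    open_issues: List[str] = field(default_factory=list)
    resolved_issues: List[str] = field(default_factory=list)
    last_customer_update: Optional[datetime] = None

    def resolve(self, issue: str) -> None:
        """Move an issue from the open list to the resolved list."""
        self.open_issues.remove(issue)  # raises ValueError if not open
        self.resolved_issues.append(issue)

    def render(self) -> str:
        """Return a plain-text sitrep suitable for email or chat."""
        lines = [f"SITREP: {self.summary}", "Responders:"]
        lines += [f"  {role}: {who}" for role, who in self.responders.items()]
        lines.append("Open issues:")
        lines += [f"  - {i}" for i in self.open_issues] or ["  (none)"]
        lines.append("Resolved issues:")
        lines += [f"  - {i}" for i in self.resolved_issues] or ["  (none)"]
        if self.last_customer_update is None:
            lines.append("Last customer update: not yet sent")
        else:
            stamp = self.last_customer_update.isoformat(timespec="minutes")
            lines.append(f"Last customer update: {stamp}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Invented example incident, for illustration only.
    report = SitRep(
        summary="Elevated error rates on the platform",
        responders={"IC": "Alice", "Customer Communications": "Bob"},
        open_issues=["Identify root cause", "Capacity shortfall"],
    )
    report.resolve("Capacity shortfall")
    report.last_customer_update = datetime.now(timezone.utc)
    print(report.render())
```

Keeping the sitrep as a single structured record that is re-rendered on every update means each message sent to email or Slack shows the full current state, so a responder joining late does not have to reconstruct it from chat history.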
Zapnito's Incident Response procedure is strongly influenced by Heroku's own highly regarded process. It helps us quickly resolve issues while keeping customers informed about what’s happening.