Postmortem culture: how you can learn from failure
As site reliability engineers, we know how much the way we react to mistakes matters. At Google, fixing bugs and getting systems back up and running after they break is part of our job. When an incident occurs, we fix the underlying issue and return services to their normal operating conditions. But unless we have a formalized process for learning from these incidents, they may recur indefinitely. Left unchecked, incidents can multiply in complexity or even cascade, overwhelming a system and its operators and ultimately affecting our users. This is why we use postmortems to carefully document and share what we learn from any mistake.
A postmortem is the process our team undertakes to reflect on what we learned from our most significant undesirable events. Incidents happen, but not all of them require a postmortem, so the first step in our process is to define the criteria for when we need one. Scenarios that call for a postmortem include visible service disruptions, data integrity impacts, slow customer resolutions, and failed error detections. The next step is to work together on a written record of what happened, why it happened, its impact, how the issue was mitigated or resolved, and what we’ll do to prevent the incident from recurring. We create postmortems for any event we wish to avoid in the future, or when a partner team wishes to document the root cause of a breakdown (or a close call).
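To make these two steps concrete, here is a minimal sketch in Python of how a team might encode its postmortem criteria and the sections of the written record. It is an illustration only, not a description of Google’s tooling: the names `Trigger`, `Incident`, `needs_postmortem`, and `Postmortem` are hypothetical, and your own criteria and record fields will likely differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Trigger(Enum):
    """Example postmortem criteria, taken from the scenarios listed above."""
    VISIBLE_SERVICE_DISRUPTION = "visible service disruption"
    DATA_INTEGRITY_IMPACT = "data integrity impact"
    SLOW_CUSTOMER_RESOLUTION = "slow customer resolution"
    FAILED_ERROR_DETECTION = "failed error detection"


@dataclass
class Incident:
    """Hypothetical incident summary used to decide on a postmortem."""
    title: str
    triggers: set[Trigger] = field(default_factory=set)
    requested_by_partner_team: bool = False  # assumption: a simple flag


def needs_postmortem(incident: Incident) -> bool:
    """Return True if any criterion is met or a partner team asked for one."""
    return bool(incident.triggers) or incident.requested_by_partner_team


@dataclass
class Postmortem:
    """The written record: one field per question the text asks."""
    what_happened: str
    why: str
    impact: str
    mitigation: str
    prevention_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """List the sections that are still empty, in declaration order."""
        text_sections = {
            "what_happened": self.what_happened,
            "why": self.why,
            "impact": self.impact,
            "mitigation": self.mitigation,
        }
        missing = [name for name, value in text_sections.items()
                   if not value.strip()]
        if not self.prevention_actions:
            missing.append("prevention_actions")
        return missing
```

A team could call `needs_postmortem` during incident review, then use `missing_sections` as a lightweight checklist before circulating the document. Keeping the criteria in one explicit list makes it easier to agree on them in advance rather than debating them after each incident.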
For us, it’s not about pointing fingers at any given person or team, but about using what we’ve learned to build resilience and prepare for future issues that may arise along the way. By discussing our failures in public and working together to investigate their root causes, everyone gets the opportunity to learn from each incident and to be involved with any next steps. Documentation of this process provides our team and future teams with a lasting resource that they can turn to whenever necessary.
And while our team has used postmortems primarily to understand engineering problems, organizations everywhere, tech and non-tech alike, can benefit from postmortems as a critical analysis tool after any event, crisis, or launch. We believe a postmortem’s influence extends beyond any single document or team and into the organization’s culture itself. Some of the cultural tenets of the process that we find particularly valuable are:
- Encouraging blameless and constructive feedback. Removing blame from a postmortem can give team members the psychological safety to escalate issues without fear. Check out the manager actions for psychological safety for more suggestions on how to facilitate team discussions.
- Focusing on improvement and resilience. Centering on the importance of improvement and learning can reposition failure as an opportunity for growth and development rather than as a setback.
- Promoting an iterative and collaborative process. Real-time collaboration and an open commenting system for your postmortems can enable the rapid collection of data, ideas, and solutions. Regularly recognizing postmortems with your team and with senior management can also increase support for the solutions you develop in response, and their effectiveness.
Whether you circulate a monthly newsletter with useful postmortem examples or role-play the events of a postmortem with your team each quarter (both things Google has done), there are many ways to introduce a postmortem culture to your organization. Consider experimenting with an exercise from our list of example activities or starting a discussion with your team to kick off a healthy postmortem practice this year.