There’s no such thing as a perfect software product. No matter how stable your application is, there are bound to be occasions where things go wrong in production. To make the most of each incident and learn from it, it’s crucial that engineering teams commit to conducting post-mortem investigations regularly.
This is especially important as companies grow and teams increasingly transition to a remote working environment. Even something that seems small can be analyzed and learned from in order to prevent future, and potentially more serious, vulnerabilities.
Technology providers cannot afford to overlook having best practices in place for conducting a post-mortem investigation of a software incident.
Fixing Software Problems: Key Steps
While there’s no one-size-fits-all solution for every team, there are several fundamental steps that make a post-mortem effective and help ensure that incidents remain rare.
- Collect data during the incident. It’s important to collect as much data as you can in a single location as the incident unfolds. This includes server graphs, snippets from logs, and screenshots showing what was going on at each point in the incident. Not all of it will end up being useful, but it’s good to have everything collected when you start going through the investigation in detail.
- Start the investigation right away. Have one of the developers or managers involved take on the role of lead investigator. That person is in charge of making sure the investigation gets done, the post-mortem document gets filled in, and the debrief gets held. Starting right away makes sure nothing gets lost.
- Review the results within a week. While things are still fresh, hold a debrief to review the post-mortem document as a group, discuss the action items, and make any edits needed. This can be a 30-60 minute video session with the team involved in the incident, as well as representatives from other departments (primarily the customer support team, but any impacted department should attend).
- Share the results. As soon as the debrief is done, everyone should get a chance to learn from it. Post it where the whole company has access to it for transparency – incidents shouldn’t be hidden away.
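As a rough illustration of the first three steps, the sketch below keeps incident evidence in one place, orders it into a timeline, and computes the latest debrief date under the "within a week" guideline. The class names, fields, and sample entries are all hypothetical and not tied to any particular incident tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Evidence:
    """One piece of data captured during the incident (hypothetical shape)."""
    captured_at: datetime
    kind: str   # e.g. "server graph", "log snippet", "screenshot"
    note: str


@dataclass
class IncidentRecord:
    """A single location for everything collected while the incident is ongoing."""
    title: str
    lead_investigator: str
    evidence: list[Evidence] = field(default_factory=list)

    def add(self, captured_at: datetime, kind: str, note: str) -> None:
        self.evidence.append(Evidence(captured_at, kind, note))

    def timeline(self) -> list[Evidence]:
        """Evidence in chronological order, regardless of when it was added."""
        return sorted(self.evidence, key=lambda e: e.captured_at)

    @staticmethod
    def debrief_deadline(resolved_at: datetime) -> datetime:
        """Latest debrief time if the review is held within a week."""
        return resolved_at + timedelta(days=7)


# Made-up sample data, for illustration only.
start = datetime(2022, 3, 1, 14, 0, tzinfo=timezone.utc)
record = IncidentRecord("Example incident", lead_investigator="example-lead")
record.add(start + timedelta(minutes=20), "log snippet", "errors start clearing")
record.add(start + timedelta(minutes=5), "server graph", "latency spike visible")

for item in record.timeline():
    print(item.captured_at.isoformat(), item.kind, item.note)
```

Sorting by capture time rather than insertion order means people can add evidence in whatever order they find it, and the timeline still comes out right.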
Additional Measures for Efficient Software Fixes
These best practices will set up teams for success, but as the future of work evolves, there are new challenges in following them.
For instance, companies now have employees working across many time zones, and a mix of remote and hybrid teams makes scheduling and coordination much more complicated. Several additional measures can help ensure that a post-mortem investigation remains effective, regardless of the environment:
- Assume async. The sooner the debrief needs to happen, the harder it is to find a spot in everyone’s calendars. Rather than pushing the meeting further and further out, do more of the work asynchronously. Make sure the document can stand on its own, and use the quickest communications channels to ask people for their contributions. Also consider recording the debrief (easy with Zoom) so that anyone who couldn’t attend can watch it later and nobody has to worry about missing out.
- Complete the investigation quickly. It’s important to shorten the timeline expectations on the investigation. Collecting the data early avoids having multiple ongoing investigations, and allows everyone involved to get back to their sprint work sooner.
- Simplify the incident document template. Consider simplifying the template so that there are fewer sections to worry about, and make each section as easy as possible to fill in. To still be complete, the document should include sections for:
  - Impact and Scope
  - Trigger (what started the incident)
  - Resolution (what ended up fixing it)
  - Timeline of events
  - Root Cause
  - What went well
  - What didn’t go well
  - Action items
  - Data & Analysis (all the charts)
- Ask for input from customer-facing teams right away. A customer success team always has great input and is able to help fill in gaps in the timeline. Reach out to them early so there’s time for their input to be added into the post-mortem document before the debrief. Waiting for the debrief is too late!
- Track action items in backlogs. Why track action item progress in an incident document when there is already a standard tool for tracking work? As soon as you can, move all action items from post-mortems into team backlogs so they are assigned and don’t get lost. It’s also beneficial to set up automated reports that list outstanding post-mortem actions, driven by a post-mortem label on the items.
- Have a section for “things we should do if we have time.” Realistically, not all action items are actually actionable; some are more aspirational, or are things everyone should keep in mind. To keep the action items clear, use this section for the things you think are important but couldn’t turn into assignable, trackable work. It’s better to have a smaller set of action items that you actually do than a giant list of things you would like to do given infinite time.
- Keep it blameless. This one isn’t actually new, but it’s well worth repeating! Be interested in what happened and what you’re going to do to fix it going forward, not in pointing fingers.
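Two of the measures above lend themselves to a small sketch: a simplified document template and a label-driven report of outstanding action items. The section names come from the list above. The label value, the `BacklogItem` shape, and the placement of the "if we have time" section are assumptions made for illustration. A real work tracker would supply its own ticket model and query API.

```python
from __future__ import annotations

from dataclasses import dataclass

# Sections from the simplified template, with the "if we have time"
# section placed after "Action items" (placement is an assumption).
SECTIONS = [
    "Impact and Scope",
    "Trigger",
    "Resolution",
    "Timeline of events",
    "Root Cause",
    "What went well",
    "What didn't go well",
    "Action items",
    "Things we should do if we have time",
    "Data & Analysis",
]

POST_MORTEM_LABEL = "post-mortem"  # hypothetical label name


@dataclass
class BacklogItem:
    """Hypothetical stand-in for a ticket in the team's work tracker."""
    title: str
    owner: str
    labels: tuple[str, ...] = ()
    done: bool = False


def blank_document(incident_title: str) -> str:
    """Render an empty post-mortem document with one heading per section."""
    lines = [f"Post-mortem: {incident_title}", ""]
    for section in SECTIONS:
        lines += [section, "(to be filled in)", ""]
    return "\n".join(lines)


def outstanding_post_mortem_actions(backlog: list[BacklogItem]) -> list[BacklogItem]:
    """Return open backlog items that carry the post-mortem label."""
    return [
        item for item in backlog
        if POST_MORTEM_LABEL in item.labels and not item.done
    ]


# Made-up backlog, for illustration only.
backlog = [
    BacklogItem("Add alert on queue depth", "dev-a", (POST_MORTEM_LABEL,)),
    BacklogItem("Unrelated feature work", "dev-b"),
    BacklogItem("Document rollback steps", "dev-c", (POST_MORTEM_LABEL,), done=True),
]

print(blank_document("Example incident"))
for item in outstanding_post_mortem_actions(backlog):
    print(f"OPEN: {item.title} (owner: {item.owner})")
```

Filtering on a label keeps the report independent of any one incident document, so action items from every post-mortem show up in the same list.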
Remote work and fast-paced development don’t have to make incidents complicated. By following these best practices, software engineers and team managers can make the most of an incident post-mortem and focus on what matters most: learning from it and making things better for the future.
About the Author:
Jesse van Herk, Senior Manager of Product Engineering, Jobber
The post Best Practices for Fixing Software Problems appeared first on eWEEK.