For the past few days, I have been training my team on troubleshooting production issues, and in this article I would like to share some of my experiences and tips. I also use this topic as an interview question: I ask candidates what their first step would be when troubleshooting a production issue. I get some funny answers and some sensible ones.
Reproducing the issue in the local environment
When we come across an issue in the production environment, we usually cannot investigate much in production directly. Only in rare cases can we investigate on the production environment itself. Otherwise, the first step in troubleshooting is to reproduce the issue in the local environment. This step confirms that the issue really exists, and once we can reproduce it locally, we gain some confidence that it is traceable.
There are instances where an issue appears only in higher environments and does not occur when we try to reproduce it locally. In such instances, the cause is usually either a data discrepancy or a production configuration. These issues become easier to track down if we can anticipate what the cause could be. I have worked on production issues that left no trace anywhere, and it can take several weeks to identify the root cause of such an issue and fix it.
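When an issue shows up only in a higher environment, one quick check is to compare the configuration of the two environments. The sketch below is only an illustration of that idea: the `diff_config` helper and the setting names are hypothetical, and it assumes both configurations are already loaded as flat dictionaries.

```python
from typing import Any


def diff_config(local: dict[str, Any], production: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return the keys whose values differ, mapped to (local, production)."""
    missing = object()  # sentinel for a key that is absent on one side
    differences = {}
    for key in sorted(set(local) | set(production)):
        local_value = local.get(key, missing)
        production_value = production.get(key, missing)
        if local_value != production_value:
            differences[key] = (
                "<missing>" if local_value is missing else local_value,
                "<missing>" if production_value is missing else production_value,
            )
    return differences


# Hypothetical settings, used only for illustration.
local_settings = {"db_pool_size": 5, "cache_enabled": False, "timeout_seconds": 30}
production_settings = {"db_pool_size": 50, "cache_enabled": True, "timeout_seconds": 30, "region": "eu"}

for key, (local_value, production_value) in diff_config(local_settings, production_settings).items():
    print(f"{key}: local={local_value!r} production={production_value!r}")
```

Any key that shows up in the output is a candidate to apply locally when trying to reproduce the issue.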
Error logging should be a mandatory practice for every developer. When people write code, they sometimes go with the flow and forget to log errors. Some developers are also so confident that their code will work in every scenario that they skip error logging altogether. Failing to log errors during development can become a big pain point when we troubleshoot production issues. So, the best practice is to review every line of code we write, look for points where it could fail, and add error logging there.
Good error logs also make life easier for the production support team, who can check the logs and fix some of the errors themselves. If the error logs are not friendly, the production support team has to depend on the developers to troubleshoot every issue. That is where logging plays a vital role. In the applications I write, I make sure I log most of the errors. My production support team is now familiar with some of the error messages, and they sort the issue out themselves when they identify it as a configuration issue or a data issue.
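As a minimal sketch of what logging at a possible failing point can look like, the example below uses Python's standard `logging` module. The `parse_quantity` function, the `orders` logger name, and the order fields are made up for illustration; the point is only that the log entry carries enough context to trace the failure later.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders")  # hypothetical logger name


def parse_quantity(order_id: str, raw_quantity: str) -> int:
    """Convert a quantity field to int, logging context if it fails."""
    try:
        quantity = int(raw_quantity)
    except ValueError:
        # logger.exception logs at ERROR level and includes the traceback.
        logger.exception("Invalid quantity for order %s: %r", order_id, raw_quantity)
        raise
    if quantity <= 0:
        logger.error("Non-positive quantity for order %s: %d", order_id, quantity)
        raise ValueError(f"quantity must be positive, got {quantity}")
    return quantity
```

The error is logged and then re-raised, so the caller still decides how to handle it while the log keeps a record of which order and which value caused the failure.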
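One possible way to make error messages friendlier for a support team is to tag each error with a category and a hint. The sketch below is an assumption about how that could be done, not a description of my actual applications; the class names, the `log_line` method, and the sample message are all hypothetical.

```python
class SupportError(Exception):
    """Base error carrying a category and a hint for the support team."""

    category = "unknown"

    def __init__(self, message: str, hint: str) -> None:
        super().__init__(message)
        self.hint = hint

    def log_line(self) -> str:
        """Format the error as a single line suitable for an error log."""
        return f"[{self.category}] {self} | hint: {self.hint}"


class ConfigurationError(SupportError):
    category = "configuration"


class DataError(SupportError):
    category = "data"


error = ConfigurationError("SMTP host is not set", "Set smtp_host in the environment settings")
print(error.log_line())
```

With a category at the start of each line, the support team can tell a configuration issue from a data issue at a glance and only escalate the rest to the developers.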
Always find the root cause of the issue
Some people have a habit of providing quick fixes for the issues they investigate. For business continuity reasons, we may end up in situations where we have to provide an immediate fix. But it is advisable to always do a Root Cause Analysis (RCA) to identify the exact reason for the failure. If we don’t find the exact reason, the fix that we provide today can itself become a breaking point for the application in the future. Interim fixes are fine, but it is always advisable to take some more time to investigate and provide a full-fledged solution.