Surviving A Critical System Outage

From an IT pro’s perspective, a system outage is the moment when cold sweat runs down your spine because a system that you manage, and that your company depends on, is not functioning. It can be e-mail, an enterprise resource planning (ERP) application, a booking application, a web site, or something else, and all eyes turn to you with one simple question: when will it be up? On top of the cold sweat, your stomach is in knots.

We have all been there before and will be there again from time to time; this is life. However, managing the situation and getting out of it gracefully is often more important than the fix itself. Here is my step-by-step list for surviving such a critical situation.

Don’t panic. Easier said than done, I agree, but don’t panic. Even if you caused it, don’t panic. Don’t waste time trying to convince people that you had no bad intentions. If you did something, admit it to your supervisor and your colleagues, and stay factual. Concentrate on the event: what caused the situation, and what you will do to diagnose and cure it. Remember that this is usually not a life-threatening situation, and even when it is (in a hospital, for example, where a patient is affected), you still need to be calm. The outage may cost you your career, but that is another discussion. Cast away the emotions, think technically and stick to the facts.

Make sure that you reach the users and tell them about the situation. If you have an instant messaging platform, use it. If you have a public announcement system, use it. Tell your users that system X has been down since … and that you are working to restore it. Unless you are a masochist who enjoys phone calls and cubicle visits asking when the issue will be fixed, it is better to tell your users up front, before they start bugging you.

Isolate yourself as much as possible. Even after you have told your colleagues, your supervisor(s) and your users that there is a critical outage, the phone calls will not cease. Ask a colleague to take your calls and answer questions while you concentrate on the issue at hand. This has another benefit: when users call and someone calmly tells them that the team is aware of the situation and working on it, their curiosity is satisfied and they stay calm themselves. You can even designate an intern to take the calls.

In such an outage, you will see different faces of people. The colleague you have lunch with every day may turn into a sadist who watches you struggle, another colleague you have been keeping at a distance may volunteer to take your calls before you even ask, and the sweet, charming Vice President may turn into a witch. To get past the politics, know that they are upset about the situation, not about you. Since you are the one working on it, the frustration is directed at you. The only thing you can control is yourself, not the upset people. It is very easy to lose control in such a scenario and fire back, but don’t do that. Never. Instead, use a little dry humor; say “After we fix the problem we will see what caused it and who is to blame, but for now we need to fix it.” If you fire back, you will regret having lost your temper when everything is over.

In the meantime, stop giving updates to your supervisor every minute. This not only breaks your concentration but also increases your stress. If your supervisor is the one causing the stress, politely ask for some time to see what is going on.

Keep a pen and paper at hand. Write down what you have discovered so far, what you have heard (it may be relevant to what you are doing), which commands you ran and what happened as a result, which error messages you saw, which services or servers you restarted, which registry keys or configuration files you edited, and which actions you backed out. These notes save you, your department and your company.
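If you prefer a keyboard to pen and paper, the same notes can be kept as a timestamped, append-only file. Here is a minimal sketch in Python; the `IncidentLog` class, its field names and the `kind` labels are my own illustration and not part of any particular tool.

```python
import json
from datetime import datetime, timezone


class IncidentLog:
    """Append-only, timestamped notes for one outage (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def note(self, kind, text, result=""):
        # kind is a free-form label such as "finding", "command", "restart",
        # "config-edit" or "backout"; these labels are illustrative only.
        entry = {
            "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "kind": kind,
            "text": text,
            "result": result,
        }
        # One JSON object per line, appended so earlier notes are never rewritten.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    def entries(self):
        # Read the notes back in the order they were written.
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

A call such as `log.note("command", "restarted the mail service", "still failing")` records one line. Appending rather than editing keeps the record honest, which matters when the notes are read later. Keep the file on a machine that is not affected by the outage.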

Do not try to cover anything up. First, it is not ethical. Second, it will slow down the resolution process and in some cases make things worse. Third, when nothing is hidden, a fresh pair of eyes may read between the lines and spot the problem. Fourth, if it comes out after everything is over that you covered something up, you will be embarrassed at best and standing in front of a judge at worst. Systems record events and keep logs, and auditors search them for clues. Don’t try to cover anything up, for any reason.

Don’t try to do everything at once. If there is more than one possible solution to the problem, don’t rush to try them all together. Write down your candidate actions, try one and see the outcome. If it fixes the issue, good. If not, check whether you need to back out of that action before you proceed to the next one on the list. This also saves you from a worsening situation in which two or more actions affect each other and turn the incident into a more complex one. Plus, when everything is over, this step-by-step approach will help you document the action(s) that fixed the problem.
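The one-action-at-a-time routine can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Action` type, the `try_one_at_a_time` function and the `is_fixed` check are hypothetical names, and the sketch always backs out a failed action, whereas in real life you decide case by case whether a back-out is needed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    name: str
    apply: Callable[[], None]     # makes the change
    back_out: Callable[[], None]  # undoes the change


def try_one_at_a_time(actions, is_fixed):
    """Apply actions in order, checking the system after each one.

    Returns (name of the action that fixed the issue, or None; journal lines).
    """
    journal = []
    for action in actions:
        action.apply()
        journal.append(f"applied: {action.name}")
        if is_fixed():
            journal.append(f"fixed by: {action.name}")
            return action.name, journal
        # Not fixed: undo this change so it cannot interact with the next one.
        action.back_out()
        journal.append(f"backed out: {action.name}")
    return None, journal
```

The journal returned at the end is the same record you would otherwise keep on paper: which action was tried, which was backed out, and which one fixed the problem.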

Don’t be afraid of escalation. Things are not working as you thought? Raise your flag and ask your supervisor to get support from the vendor. It is better to pay a couple of hundred dollars and spend a few minutes or hours having the vendor fix the problem than to spend a day and a half stubbornly trying to be the hero. And even after a day and a half, you still run the risk of not having corrected the situation. Remember, you are not there to satisfy your system administrator pride; you are there to do the right things at the right time. What is the point of saving your pride if it costs the company thousands of dollars and days, when the same result was available for hundreds of dollars and hours?

Do the autopsy. When everything is over and the dust has settled, there is one more important thing for you to do. You have to analyze in detail why the incident happened, what you did to fix it, what measures you have taken so that it does not happen again, what monitoring tools you have put in place to see the signals in advance, and what you can do better to inform all parties the next time there is an outage. Remember the pen and paper I told you about? Your notes will provide almost all the information you need for your analysis and documentation.

Finally, the emotional side. An outage is not a reason to criticise yourself, your talents, your capabilities, your determination and so on. Even if you did something wrong, you learned from it, you fixed the situation and now you have more experience. Remember that one of the reasons you are paid is to fix such situations; if things fixed themselves without any intervention, there would be no justification for your existence in the company.

And no matter whose fault it is, yours, somebody else’s or even the hardware’s, there will be system outages; you cannot escape that. You cannot be sure when the next outage will hit, but you can be sure that it will. Think about how you can prepare yourself for the next one.
