Retrospective Log Viewer & Analyzer

 

On-Call support made easy

On-call support is exceptionally stressful for an IT person. Why? First and foremost, the phone call is unexpected. It doesn’t matter whether it comes when you are asleep, on the bus or in the gym: you are taken totally by surprise and you are all on your own. The system you support is highly important, probably a critical system which, if down, causes a big revenue loss for your company. My first on-call support experience for a critical system was a real nightmare. It went more or less like this:

4:59AM [Phone is ringing]

Me: heellooo?
Voice on the phone: Hi, my name is Mike, I am an incident manager. I’m calling because you are listed as a support person for the XX system. We just opened a severity situation because we noticed a significant number of transaction failures, approximately 10%.
Me: yeeeah, ok, hold on.
Mike: We already have database and network administrators on hand if you need their assistance. Please check what is going on and tell us what to do.
Me: Ok, let me just boot my laptop.

5:02

Ok, the laptop is booted. Time to start PuTTY. Let’s check host prod001 first.
login as: oncall
oncall@prod001's password:
Password. Hmm, what was the password? I think I know.

1st try
access denied.

hmm – typo? (it is pretty hard to hit the proper key at 5am :-))

2nd try
access denied.

What? I’m sure I typed it correctly. Should I do it again? Yeah, it must be this one.

3rd - last try.
access denied. Your account is locked. Please contact your administrator.

Oops…

Mike: well? Any updates?
Me: Give me a few more minutes, please. I need to unlock my account.

15 minutes later my account was unlocked and my password reset.

Now, let’s check what is going on.

Log in to prod001
cd /opt/logs
grep ERROR *.log*

This takes ages…

Wow, so many results. Let’s analyze... No, they are all really old ones. I need to refine my search.
grep ERROR *.log
no results
grep WARN *.log
no results

Mike: and?
Me: still checking
Mike: Just to let you know, 30 minutes have passed and, according to our procedures, we have to escalate this issue.

A few minutes later almost 30 people were dialed in on the call.
Checked prod002, prod003, prod004, …, prod009.

Oh, I should also check catalina.out. Let’s try again:

Logging in to prod001
cd /opt/logs
grep ERROR catalina.out
no results
Logging in to prod002, …

5:41

Logging in to prod007
oncall@prod007 password: ********
cd /opt/logs
grep ERROR catalina.out
2011-08-16 07:43:35:089 ERROR
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3210)
        at java.lang.String.<init>(String.java:216)
        at java.lang.StringBuffer.toString(StringBuffer.java:585)
        at java.util.HashMap.get(HashMap.java:300)

That’s it. It looks like we released a bug a few days ago.

Me: I think I’ve found it. Please take a heap dump on prod007 and bounce the application server immediately after that. This will help for a while, and we’ll analyze the problem in more detail today.

5:50

Mike: ok, looks like it’s working again. I am closing this incident. Thank you.

50 minutes after I received the call, the problem was resolved. In the meantime approximately 10 000 transactions failed. Should I mention how stressful it was? Today, this is how it might look:

I get a phone call from the incident manager. 2 minutes later my laptop is booted. I start Retrospective, select a previously defined profile (for on-call purposes I’ve already defined a profile with all the relevant log files I have to monitor), type ERROR and hit search. Retrospective connects, logs in and searches the log files on all hosts with a single click. A few seconds later I see an OutOfMemoryError on host prod007. 3 minutes later I ask the incident manager to take a heap dump and restart the application server. Within the next 2 minutes the application is operational again. 10 minutes after receiving the call, the incident is closed. In the meantime approximately 600 transactions failed, but more than 9000 were rescued.
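
For comparison, the manual routine from the 5 AM story (log in to each host, grep the logs, move on to the next one) can also be scripted by hand. What follows is only a rough sketch of that idea, not a description of how Retrospective works internally. It assumes key-based SSH login for the oncall user, so that no password prompt appears, and the names search_all_hosts and SSH_CMD are made up for this example.

```shell
# Sketch only: run the same grep on several hosts, one after another.
# The oncall user, /opt/logs and catalina.out come from the story above;
# search_all_hosts and SSH_CMD are hypothetical names for illustration.

SSH_CMD="${SSH_CMD:-ssh}"   # can be overridden, e.g. with a stub for a dry run

search_all_hosts() {
    # $1 = text to search for, remaining arguments = host names
    local pattern="$1"
    shift
    local host
    for host in "$@"; do
        # The grep command line is passed to the remote shell, which expands
        # *.log there. -h suppresses file names; sed then prefixes every
        # matching line with the host it came from.
        # Limitation: the pattern must not contain single quotes.
        "$SSH_CMD" "oncall@${host}" \
            "grep -h -- '${pattern}' /opt/logs/catalina.out /opt/logs/*.log" \
            | sed "s/^/${host}: /"
    done
}
```

It could be called as search_all_hosts ERROR prod001 prod002 prod007, and every matching line is printed together with the host it was found on. Hosts are still visited one by one, and old matches are not filtered out, so this covers only part of what the story needed.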

How about you? The next time you receive a support call in the middle of the night, which tool will you choose: this [image] or this [image]?
