Troubleshooting systems and software is an art and a science - what hatchets can you put in your "bag o' hatchets" to help eliminate non-problems while diagnosing symptoms of failure?

Introduction

One of the best trainers I ever had taught the incoming crop of support engineers at Opsware (of which I was a member) that Support is all about applying hatchets to problems to make them easier to handle - when someone is calling for help, they typically have a major problem that is impacting their job, and they needed a solution to it last week. The product we were being trained on came on two full DVD ISO images (it has since grown to three dual-layer DVD ISO images). That's a lot of potential area for errors to occur - whether from bugs in the application or from user mistakes. After a while, you start to see patterns in incoming issues, which allows for quicker resolution of customer complaints - when you've seen the same problem pop up at a dozen locations, as soon as a fix is found for one of them, you can most likely apply that same solution to the other 11 and solve all of those problems "at once".

You will learn about a host of hatchets you can use to narrow down problems from the initial symptom of "it doesn't work" or "it broke" to the root cause, or to a viable workaround.

The techniques described can be applied to other areas as well, but the focus will be on software systems.

Overview

    We will cover an array of hatchets:

  • Stop, Drop, and Roll
  • What Changed
  • Logging Output / Log Files
  • Effective Searching
  • Debugging Tools
  • User Error
  • Post-Mortem Data Collection
  • Pro-Active, Preventative Measures

Stop, drop, and roll

The first thing to do when encountering software issues, whether in the smallest of scripts or in enterprise tools, is to Stop, Drop, and Roll. Yes, those same three words you learned from the fireman as a child for what to do if your clothes ever catch fire.

Famous last words in most cases are, "I know what I'm doing" - you may very well, but always assume first that you don't. This is not to insult your intelligence, but rather to remind you that everyone makes mistakes!

    Things to do before blindly going on:

  • Take note of all error messages returned from the failed process
  • See if the error is something you have seen before (such as "Permission denied")
  • Make sure you are running as the correct user / with proper privileges
  • Make sure you have enough space to continue the task
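
The checks above can be partly automated. What follows is only a minimal sketch in Python, not a tool from the original course: the names `recognize`, `preflight`, and `KNOWN_ERRORS`, the 100 MB default threshold, and the hint texts are all assumptions made for illustration. It looks for familiar error strings in a saved error message, reports which user is running, and checks access and free space for a working directory.

```python
import getpass
import os
import shutil

# Hypothetical table of familiar error strings and a first thing to check.
KNOWN_ERRORS = {
    "Permission denied": "check which user you are running as and the file permissions",
    "No space left on device": "free up disk space or check your quota",
}


def recognize(error_text):
    """Return hints for every known error string found in error_text."""
    return [hint for pattern, hint in KNOWN_ERRORS.items() if pattern in error_text]


def current_user():
    """Return the login name, or None if it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        # getuser() can fail in minimal environments (no USER variable, no passwd entry).
        return None


def preflight(path=".", min_free_bytes=100 * 1024 * 1024, expected_user=None):
    """Return a list of warnings; an empty list means no obvious problem was found."""
    warnings = []

    # Are we running as the user we think we are?
    user = current_user()
    if expected_user is not None and user != expected_user:
        warnings.append(f"running as {user!r}, expected {expected_user!r}")

    # Can we read and write where the task needs to work?
    if not os.access(path, os.R_OK | os.W_OK):
        warnings.append(f"no read/write access to {path!r}")

    # Is there enough room to continue the task?
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        warnings.append(f"only {free} bytes free under {path!r}")

    return warnings
```

Calling `preflight("/some/work/dir", expected_user="someuser")` before retrying a failed task, and passing the saved error output to `recognize`, gives a quick first pass over the list; both arguments here are placeholders, and the thresholds would need adjusting to the task at hand.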

What changed

Were you able to successfully accomplish the task at hand before? Did the script run successfully yesterday, but not today? Can someone else run it correctly when you can't?

If the answer to these, and similar questions, is "yes", then you need to find out what changed between the last time it worked and now.

    Things that may have changed:

  • Your user's permissions
  • The contents of the script/tool
  • Free space you have access to (i.e., maybe you're nearing your quota)
  • System changes (patches, updates, etc.)
  • Availability of remote resources (maybe the tool relies on a file server that is down for maintenance)

If you can undo any of the changes, does the tool work again? For example, if you are nearing your quota space but you delete some files, will it then run? If so, maybe you need your quota expanded. When the network file server is back up, does it run correctly?
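
One way to make "what changed" answerable is to record a few facts while the tool still works and compare them when it stops working. The following is a rough sketch under stated assumptions rather than a prescribed method: the function names, the choice of facts recorded (a hash of the script, its permission bits, free space, and whether some required paths exist), and the JSON baseline file are all hypothetical.

```python
import hashlib
import json
import os
import shutil


def snapshot(script_path, work_dir=".", required_paths=()):
    """Collect a few facts worth comparing between a good run and a bad one."""
    with open(script_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        # Contents of the script/tool.
        "script_sha256": digest,
        # Permission bits on the script, e.g. "0o755".
        "script_mode": oct(os.stat(script_path).st_mode & 0o777),
        # Free space available where the tool works.
        "free_bytes": shutil.disk_usage(work_dir).free,
        # Whether each required path (such as a network mount) is present.
        "paths_present": {p: os.path.exists(p) for p in required_paths},
    }


def save_snapshot(snap, baseline_path):
    """Write a snapshot to a JSON file for later comparison."""
    with open(baseline_path, "w") as f:
        json.dump(snap, f, indent=2)


def load_snapshot(baseline_path):
    """Read a snapshot previously written by save_snapshot."""
    with open(baseline_path) as f:
        return json.load(f)


def diff_snapshots(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

Saving a snapshot after a successful run and calling `diff_snapshots(load_snapshot(baseline), snapshot(...))` after a failure narrows the list of suspects. Free space will almost always differ a little between runs, so that entry is only interesting when the new value is close to zero or to your quota.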





Source:  OpenStax, Debugging and supporting software systems. OpenStax CNX. Aug 29, 2011 Download for free at http://cnx.org/content/col11350/1.2