Top

One of the quickest tools for getting a picture of the current state of a Linux system is top, which displays the processes currently using the most CPU, system uptime, load averages, CPU utilization, memory usage (real and swap), and other items.

Uptime

For a quick view of the system uptime and load averages, run uptime.

Iostat

iostat displays information about the current state of disk I/O on the system.

Netstat

To see which network ports are currently open and listening, use netstat. For example, netstat -an | grep 80 will display the sockets on port 80 (and on 8080, and any other line that has '80' in it, whether in a port number or an address).
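As the example notes, a bare grep 80 also matches 8080 and anything else containing "80". One way to narrow the match is sketched below. It assumes the Linux netstat -an column layout, in which the fourth column is the local address written as address:port; filter_port is a hypothetical helper name, not part of netstat.

```shell
# filter_port PORT: read "netstat -an" output on stdin and keep only the rows
# whose local address (column 4, Linux layout assumed) ends in ":PORT".
filter_port() {
    awk -v port="$1" '$4 ~ (":" port "$") { print }'
}

# Possible usage (not run here):
#   netstat -an | filter_port 80
```

On systems that print addresses in a different format (for example with a dot before the port), the pattern would need adjusting.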

Lsof

lsof ("list open files") will show what process is holding open a network port or file. To see what process is holding port 80, run lsof -i :80.

Ps

To see a list of the process table, run ps. My favorite argument sequence is aux, which gives lots of information back: ps aux
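Because ps aux returns so much information, it often helps to trim it down. The sketch below assumes the usual Linux ps aux column order (USER, PID, %CPU, %MEM, ..., COMMAND in column 11); top_mem is a hypothetical helper name, and it prints only the first word of each command.

```shell
# top_mem [N]: read "ps aux" output on stdin and print the N rows (default 5)
# with the highest %MEM, as "%MEM PID COMMAND". The header row is skipped.
top_mem() {
    awk 'NR > 1 { print $4, $2, $11 }' | sort -k1,1nr | head -n "${1:-5}"
}

# Possible usage (not run here):
#   ps aux | top_mem 10
```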

The similar call on a Solaris machine is: ps -ef

Strace

For a fuller diagnosis of what a given process is doing, strace can be a lifesaver. It essentially wraps around the process in question (either by running strace <program-name>, or by attaching to a running process with strace -p <pid>) and reports each system call the process makes.
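Raw strace output can run to thousands of lines, so a quick tally of which system calls dominate is often a useful first pass. The sketch below assumes a single-process trace saved with strace -o trace.log (no per-line PID prefixes); count_syscalls and trace.log are hypothetical names chosen for illustration.

```shell
# count_syscalls: read saved strace output on stdin and print
# "COUNT SYSCALL" pairs, most frequent first. Lines that do not start with
# a call name followed by "(" (signals, exit notices) are ignored.
count_syscalls() {
    awk '
        /^[a-z_][a-z0-9_]*\(/ {
            name = $0
            sub(/\(.*/, "", name)   # keep only the text before the first "("
            count[name]++
        }
        END { for (name in count) print count[name], name }
    ' | sort -k1,1nr -k2,2
}

# Possible usage (not run here):
#   strace -o trace.log <program-name>
#   count_syscalls < trace.log
```

A process that spends its time in repeated failing open or connect calls usually stands out immediately in such a tally.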

On Solaris, the similar tool is truss .

Gdb

The GNU debugger, gdb, is a massively useful tool in the right hands: it can track individual calls inside a program, set breakpoints, and more. It should be learned by every developer, and known to advanced users.

User error

"User error" is among the most commonly cited errors with software and systems: the operator did something the creators did not expect. To use a ubiquitous car analogy, it's "user error" if the driver hits the gas instead of the brake. One interesting article makes the claim that there is [almost] no such thing as "user error", and that the fault instead lies with developers whose tools are not resilient enough to handle any user. No, a car manufacturer can't make the gas act like the brake when you "meant to stop", but maybe software developers can make their products less error-prone, or at least have them give better errors when they do have a problem.

A spectrum of user-initiated errors:

  • Typos (misspellings, fat-fingering, generally mistyping something)
  • External environmental problems (e.g. unplugging a network cable)
  • Clickos (i.e. misclicks - akin to mistyping)
  • Forgetfulness
  • Etc.

From personal observation, I would guess user error accounts for 70-80% of all errors seen.

Post-mortem data collection

When something has gone so awry that it has violently crashed, or even taken out its host system, it's time for some post-mortem data collection - maybe even forensic analysis.

Core dumps, log files, and even images of whole drives can be investigated during a post-mortem analysis of problems seen: as your technical acumen grows, you'll be able to investigate more parts of these prior to escalating to the tool's support or development teams.

Pro-active, preventative measures

Ideally, we would all live and work in a world where nothing ever failed, and everyone acted the way they are "supposed" to. Sadly, that world does not exist. So what can we do to help prevent issues in the first place, or respond more adeptly when they [inevitably] occur?

Some solutions are simple: add more memory to the system; increase swap space; verify storage quotas; make sure all the resources you need are available; etc. Many can be more complex.

If there is a set of "Known Issues" or release notes that come with a particular product, make sure you read and are aware of them: there is almost nothing more frustrating than finding out there is a known issue, but you didn't check the manuals first!

Asking "why"

If you're on the administrative side of the technical world, and not just the end-user side, the other big thing to remember is to always ask "why". Why did it fail? Why did we miss the known issue? Why were we not notified that a necessary resource was going to be down? Why was there no alert sent about resources nearing their limits? If you can ask (and answer) those, then you should be able to reduce the number of "why" questions you need to ask in the future - because hopefully you're solving problems before they arise.
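The question about missing alerts for resources nearing their limits can be made concrete with a small check on disk usage. This is only a sketch: it assumes the POSIX df -P output format (capacity in column 5, mount point in column 6, no spaces in mount points), and over_threshold is a hypothetical helper name. How the result is delivered (mail, pager, monitoring system) is left open.

```shell
# over_threshold LIMIT: read "df -P" output on stdin and print
# "MOUNTPOINT CAPACITY" for every filesystem at or above LIMIT percent.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header row
            pct = $5
            sub(/%/, "", pct)         # "95%" -> "95"
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}

# Possible usage (not run here), e.g. from a scheduled job:
#   df -P | over_threshold 90
```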

"Future-proofing" - is it possible?

The idea of "future-proofing" is to create an environment that can survive future developments without needing to be changed itself. A common example of this would be to look at the current and expected growth needs of the email infrastructure of an organization, and then size the mail servers to handle 15-25% more than the expected growth (i.e. with 100 users today, adding 20% per year, size the environment today for 200 users in three years: 173 expected, plus ~15%). Or it could mean ensuring that data you are working with today in version 4.3 of some tool will be accessible when upgrading to 7.2 in 4 years.
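The sizing arithmetic in the mail-server example can be reproduced with a few lines of awk. The sketch assumes simple compound growth with a flat headroom percentage on top; size_for is a hypothetical helper name.

```shell
# size_for USERS GROWTH_PCT YEARS HEADROOM_PCT: print the expected user count
# after compound growth, then the count with headroom added, both rounded up.
size_for() {
    awk -v users="$1" -v growth="$2" -v years="$3" -v headroom="$4" '
        function ceil(x,    c) { c = int(x); return (c < x) ? c + 1 : c }
        BEGIN {
            expected = users * (1 + growth / 100) ^ years
            target = expected * (1 + headroom / 100)
            printf "%d %d\n", ceil(expected), ceil(target)
        }
    '
}

# 100 users, 20% yearly growth, 3 years, 15% headroom:
#   size_for 100 20 3 15    # prints "173 199"
```

The second figure, 199, is what the example rounds to 200 users.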

When relying on external vendors, guaranteeing your environment is future-proof may not be possible - they could decide to change database schemas, file formats, etc. Likewise, when relying on expected growth patterns, you may exceed those expectations (requiring additional licenses, hardware, etc), or you may not meet those plans, and have an unnecessarily oversized environment. Several mitigating strategies exist for these eventualities, but are beyond the scope of this lesson.

Closing thoughts

You've completed this module, and so now you're ready to troubleshoot the most ornery problems in the most obscure corners of your system, right? Don't let me discourage you from that lofty goal, but the reality is that becoming a good troubleshooter takes time, practice, lots of exposure, practice, skimming skills, practice, and patience. Oh, and did I mention: practice!

Lots of professions require troubleshooting skills, and each has its own tricks and tips to follow: auto mechanics will check the OBD-II codes and listen to a rattle; electricians look for wiring faults; doctors look at symptoms to come up with a diagnosis. Skills learned in one field may not always translate into another, but if you can learn the basics (which DO all transfer), then gleaning insights from others can only improve your own personal Bag O' Hatchets.

Source:  OpenStax, Debugging and supporting software systems. OpenStax CNX. Aug 29, 2011 Download for free at http://cnx.org/content/col11350/1.2