One of the quickest-to-use tools for a picture of the current state of a Linux system is
top
which displays the top processes currently running, system uptime, load averages, CPU utilization, memory usage (physical and swap), and more.
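If you want top's snapshot in a script or a log file rather than on the interactive screen, batch mode works well. A minimal sketch, assuming the procps version of top common on Linux (guarded so it degrades gracefully where top is unavailable):

```shell
# -b runs top in batch (plain-text) mode; -n 1 exits after one iteration
if command -v top >/dev/null; then
    top -b -n 1 | head -15
else
    echo "top not found"
fi
```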
For a quick view of the system uptime and load averages, run
uptime
.
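The three load-average figures can also be pulled out for scripting with a little awk. A sketch, assuming the Linux uptime output format, which ends with "load average: X, Y, Z":

```shell
# uptime ends with the 1-, 5-, and 15-minute load averages;
# the awk field separator isolates everything after "load average: "
uptime
uptime | awk -F'load average: ' '{print $2}'
```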
iostat
displays information about the current state of disk I/O on the system.
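Note that iostat ships with the sysstat package and is not always installed by default; a guarded sketch:

```shell
# -x adds extended per-device statistics (utilization, await times, etc.)
if command -v iostat >/dev/null; then
    iostat -x
else
    echo "iostat not found; install the sysstat package"
fi
```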
To see what network ports are currently open and listening, use
netstat
. For example,
netstat -an | grep 80
will display what is using port 80 — but note that the pattern also matches 8080, 8000, and any other port number containing '80'.
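To match port 80 alone, anchor the pattern so a digit can't follow it. A sketch using a hypothetical sample of netstat output (against a live system, the equivalent would be `netstat -an | grep -E ':80([^0-9]|$)'`):

```shell
# Hypothetical sample of `netstat -an` output
sample='tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:8000 0.0.0.0:* LISTEN'

# Naive match: hits 80, 8080, AND 8000
printf '%s\n' "$sample" | grep -c 80

# Anchored match: ':80' must not be followed by another digit,
# so only the socket actually bound to port 80 matches
printf '%s\n' "$sample" | grep -cE ':80([^0-9]|$)'
```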
lsof
will show what process is holding open a network port or file. To use "list open files" to see what process is holding port 80, run
lsof -i:80
To see a list of the process table, run
ps
. My favorite argument sequence is
aux
which gives lots of information back:
ps aux
The similar call on a Solaris machine is:
ps -ef
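When hunting for one particular process, piping ps through grep is common; the "bracket trick" keeps grep's own entry out of the results. A sketch (sshd is just a stand-in for whatever process you're after):

```shell
# First few rows of the full process table
ps aux | head -5

# '[s]shd' matches the literal string "sshd" but not the grep command
# line itself (which contains "[s]shd"), so grep doesn't list itself
ps aux | grep '[s]shd' || echo "no sshd process found"
```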
For a fuller diagnosis of what a given process is doing,
strace
can be a lifesaver. It essentially wraps around the process in question (either by launching the program via
strace <program-name>
, or by attaching to a running process with
strace -p <pid>
).
On Solaris, the similar tool is
truss
.
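A quick sketch of the launch style of invocation (guarded, since strace is often not installed by default; the openat filter assumes a reasonably modern strace):

```shell
if command -v strace >/dev/null; then
    # Launch a program under strace, logging only file-open syscalls
    strace -e trace=openat -o /tmp/ls.trace ls /etc >/dev/null 2>&1
    head -3 /tmp/ls.trace
    # Attaching to a running process instead would look like:
    #   strace -p <pid>      (requires appropriate permissions)
else
    echo "strace not found"
fi
```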
The GNU debugger,
gdb
, is a massively useful tool in the right hands: tracing individual calls inside a program, setting breakpoints, and so on. It should be learned by every developer, and known to advanced users.
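A minimal session sketch (the program name ./myprog and the variable are hypothetical; the program would need to be built with -g for debug symbols):

```
$ gdb ./myprog
(gdb) break main        # stop when main() is entered
(gdb) run               # start the program under the debugger
(gdb) backtrace         # show the current call stack
(gdb) next              # step over the next source line
(gdb) print some_var    # inspect a variable (hypothetical name)
(gdb) continue          # resume normal execution
```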
"User error" is among the most commonly cited causes of software and system problems: the operator did something the creators did not expect. To use the ubiquitous car analogy, it's "user error" if the driver hits the gas instead of the brake. One interesting article claims there is [almost] no such thing as "user error": the fault instead lies with developers who build tools that are not resilient enough to handle any user. (No, a car manufacturer can't make the gas pedal act like the brake when you "meant to stop", but software developers can make their products less error-prone, or at least have them give better error messages when something does go wrong.)
When something has gone so awry that it has violently crashed, or even taken out its host system, it's time for some post-mortem data collection - maybe even forensic analysis.
Core dumps, log files, and even images of whole drives can be investigated during a post-mortem analysis of problems seen: as your technical acumen grows, you'll be able to investigate more parts of these prior to escalating to the tool's support or development teams.
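Core dumps are often disabled by default via the shell's core-size limit; a sketch of enabling them and finding where the kernel writes them (paths and defaults vary by distribution):

```shell
# Allow core files of unlimited size in this shell session
ulimit -c unlimited
ulimit -c

# The kernel's core-file naming pattern (Linux-specific path)
cat /proc/sys/kernel/core_pattern 2>/dev/null || echo "no /proc/sys on this system"

# A resulting core file could then be opened with, e.g.:
#   gdb ./myprog core      (program and core names are hypothetical)
```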
Ideally, we would all live and work in a world where nothing ever failed, and everyone acted the way they are "supposed" to. Sadly, that world does not exist. So what can we do to help prevent issues in the first place, or respond more adeptly when they [inevitably] occur?
Some solutions are simple: add more memory to the system; increase swap space; verify storage quotas; confirm that all required resources are available; etc. Others are more complex.
If there is a set of "Known Issues" or release notes that come with a particular product, make sure you read and are aware of them: there is almost nothing more frustrating than finding out there is a known issue, but you didn't check the manuals first!
The idea of "future-proofing" is to create an environment that can survive future developments without itself needing to change. A common example would be to look at the current and expected growth of an organization's email infrastructure, and then size the mail servers to handle 15-25% more than the expected growth (e.g., with 100 users today growing 20% per year, size the environment today for 200 users in three years: 173 expected, plus ~15%). Or it could mean ensuring that data you are working with today in version 4.3 of some tool will still be accessible when you upgrade to 7.2 in four years.
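The sizing arithmetic above can be checked with a one-liner (numbers from the example: 100 users, 20% annual growth, three years, ~15% headroom):

```shell
# 100 * 1.2^3 = 172.8 expected users; adding 15% headroom gives ~199,
# which is why the example rounds the target up to 200
awk 'BEGIN {
    expected = 100 * 1.2 ^ 3
    sized    = expected * 1.15
    printf "expected: %.0f, sized: %.0f\n", expected, sized
}'
```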
When relying on external vendors, guaranteeing your environment is future-proof may not be possible - they could decide to change database schemas, file formats, etc. Likewise, when relying on expected growth patterns, you may exceed those expectations (requiring additional licenses, hardware, etc), or you may not meet those plans, and have an unnecessarily oversized environment. Several mitigating strategies exist for these eventualities, but are beyond the scope of this lesson.
You've completed this module, and so now you're ready to troubleshoot the most ornery problems in the most obscure corners of your system, right? Don't let me discourage you from that lofty goal: but the reality is that becoming a good troubleshooter takes time, practice, lots of exposure, practice, skimming skills, practice, and patience. Oh, and did I mention: practice!
Lots of professions require troubleshooting skills, and each has their own tricks and tips to follow: auto mechanics will check the OBDII and listen to a rattle; electricians look for wiring faults; doctors look at symptoms to come up with a diagnosis. Skills learned in one field may not always translate into another, but if you can learn the basics (which DO all transfer), then gleaning insights from others can only improve your own personal Bag O' Hatchets.