LJ Archive

The Monitoring Issue

In 1935, Austrian physicist Erwin Schrödinger, still flying high after his Nobel Prize win two years earlier, created a simple thought experiment.

It ran something like this:

If you have a file server, you cannot know if that server is up or down...until you check on it. Thus, until you use it, a file server is, in a sense, both up and down. At the same time.

This little brain teaser became known as Schrödinger's File Server, and it's regarded as the first known critical research on the intersection of Systems Administration and Quantum Superposition. (Though, why Erwin chose, specifically, to use a "file server" as an example remains a bit of a mystery—as the experiment works equally well with any type of server. It's like, we get it, Erwin. You have a nice NAS. Get over it.)

...

Okay, perhaps it didn't go exactly like that. But I'm confident it would have...you know...had good old Erwin had a nice Network Attached Storage server instead of a cat.

Regardless, the lessons from that experiment certainly hold true for servers. If you haven't checked on your server recently, how can you be truly sure it's running properly? Heck, it might not even be running at all!

Monitoring a server—to be notified when problems occur or, even better, when problems look like they are about to occur—seems, at first blush, to be a simple task. Write a script to ping a server, then email me when the ping times out. Run that script every few minutes and, shazam, we've got a server monitoring solution! Easy-peasy, time for lunch!
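
For the curious, here is roughly what that lunch-break script might look like. Treat it as a minimal sketch under a pile of assumptions, not a recommendation: it assumes a Linux-style `ping` that accepts `-c 1`, a mail server listening on localhost, and placeholder host names and addresses. The function names (`is_up`, `build_alert`, `check_and_notify`) are made up for illustration.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def is_up(host, timeout_s=5, run=subprocess.run):
    """Return True if one ping to `host` succeeds within `timeout_s` seconds.

    `run` is injectable so the check can be exercised without a network.
    Assumes a Linux-style ping that understands `-c 1` (send one packet).
    """
    try:
        result = run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # No answer before the deadline counts as "down".
        return False
    return result.returncode == 0


def build_alert(host, sender, recipient):
    """Build the notification email for a failed ping."""
    msg = EmailMessage()
    msg["Subject"] = f"ALERT: {host} did not answer ping"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"{host} failed a ping check.")
    return msg


def check_and_notify(host, sender, recipient, send=None, run=subprocess.run):
    """Ping `host`; if it is down, send an alert. Return True if alerted."""
    if is_up(host, run=run):
        return False
    msg = build_alert(host, sender, recipient)
    if send is None:
        # Assumption: an SMTP server is accepting mail on localhost.
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
    else:
        send(msg)
    return True
```

Run from cron every few minutes, a call such as `check_and_notify("fileserver.example.com", "monitor@example.com", "admin@example.com")` would do exactly what the paragraph above describes, and nothing more.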

Whoa, there! Not so fast!

That server monitoring solution right there? It stinks. It's fragile. It gives you very little information (other than the results of a ping). Even for administering your own home server, that's barely enough information and monitoring to keep things running smoothly.

Even if you have a more robust solution in place, odds are there are significant shortcomings and problems with it. Luckily, Linux Journal has your back: this issue is chock-full of advice, tips and tricks for how to keep your servers effectively monitored.

You know, so you're not just guessing if the cat is still alive in there.

Mike Julian (author of O'Reilly's Practical Monitoring) goes into detail on a bunch of the ways your monitoring solution needs serious work in his adorably titled "Why Your Server Monitoring (Still) Sucks" article.

We continue "telling it like it is" with Corey Quinn's treatise on Amazon's CloudWatch, "CloudWatch Is of the Devil, but I Must Use It". Seriously, Corey, tell us how you really feel.

With our cathartic venting session behind us, we've got a detailed, hands-on walk-through of how to use Monit (an open-source process supervisor for Linux) coupled with RRDtool (a GPL'd tool for capturing data over long periods of time, such as from shell scripts, and graphing it) to monitor your server in a fairly simple, and very open-source, way.

Then, get this, we've got a sysadmin from the Computing Centre of the National Institute of Nuclear Physics and Particle Physics in France (Fabien Wernli)—seriously, how cool is that?—walking us through how to create a site-wide, low-latency (we're talking sub-millisecond here) log infrastructure.

Round that out with an interview with Steve Newman (one of the folks who created Writely, which you might know as Google Docs, following Google's acquisition in 2006) on his company, Scalyr, which handles server monitoring and log management—and you've got more server monitoring information than you can shake a stick at.

Or, you can go back to guessing if the cat is still alive. That's fun too.

About the Author

Bryan Lunduke is a former Software Tester, former Programmer, former VP of Technology, former Linux Marketing Guy (tm), former openSUSE Board Member...and current Deputy Editor of Linux Journal as well as host of the (aptly named) Lunduke Show.

Bryan Lunduke