LJ Archive

Bare-Bones Monitoring with Monit and RRDtool

How to provide robust monitoring to low-end systems. By Andy Carlson

When running a critical system, it's necessary to know what resources the system is consuming, to be alerted when resource utilization reaches a specific level and to trend long-term performance. Zabbix and Nagios are two large-scale solutions that monitor, alert and trend system performance, and they each provide a rich user interface. Due to the requirements of those solutions, however, dedicated hardware/VM resources typically are required to host the monitoring solution. For smaller server implementations, options exist for providing basic monitoring, alerting and trending functionality. This article shows how to accomplish basic and custom monitoring and alerting using Monit. It also covers how to monitor long-term trending of system performance with RRDtool.

Initial Monit Configuration

On many popular Linux distros, you can install Monit from the associated software repository. Once installed, you can handle all the configuration with the monitrc configuration file. That file generally is located within the /etc directory structure, but the exact location varies based on your distribution.

The config file has two sections: Global and Services. The Global section allows for custom configuration of the Monit application. The Monit service contains a web-based front end that is fully configurable through the config file. Although the section is commented out by default, you can uncomment items selectively for granular customization. The web configuration block looks like this:


set httpd port 2812 and
    use address localhost
    allow localhost
    allow admin:monit

The first line sets the port on which Monit's web interface listens. The second line sets the address to which the web interface binds, so here it accepts connections only on the local interface. The third line names a host that is allowed to connect. Note that you also can enforce this restriction with a local firewall rule if a firewall is already in place. The fourth line defines a user name/password pair (here, admin and monit) that is required when accessing Monit. There's also a section that allows SSL options for encrypted connections to Monit. Although enabling SSL is recommended when passing authentication data, you also could reverse-proxy Monit through an existing web server, such as nginx or Apache, provided SSL is already configured on the web server. For more information on reverse-proxying Monit through Apache, see the Resources section at the end of this article.

The next items you need to enable deal with configuring email alerts. To set up the email server through which email will be relayed to the recipient, add or enable the following line:


set mailserver mailserver.company.com

Note that if there's a local SMTP server running, the server name of mailserver.company.com in this example may be replaced with localhost.

The next block to enable sets the contents of the email alert messages that will be sent and will look similar to this:


set mail-format {
  from:    Monit <monit@$HOST>
  subject: Monit alert --  $EVENT $SERVICE
  message: $EVENT Service $SERVICE
                Date:        $DATE
                Action:      $ACTION
                Host:        $HOST
                Description: $DESCRIPTION

                Your faithful employee,
                Monit
}

Within this block, different predefined variables are used to provide alert-specific information (denoted by the $ sign). You can modify text within the from, subject or message fields, and you also can add additional data to the message field as desired.

To finish the alerting functionality, you can configure an email address that will receive all email alerts from Monit by adding the following line:


set alert user@domain.com

At this point, the specified email address will receive all alerts generated by Monit. However, so far, no alerts are configured. To begin configuring alerts, let's first look at the Services section mentioned earlier. That section provides some basic monitoring functionality for the local machine, including CPU, memory, swap, filesystem and basic network monitoring. Each of those configuration items provides for the definition of thresholds. After the thresholds are met, actions can be taken, including sending an alert. As an example, the out-of-the-box alert for CPU/memory/swap monitoring looks like this:


check system $HOST
   if loadavg (1min) > 4 then alert
   if loadavg (5min) > 2 then alert
   if cpu usage > 95% for 10 cycles then alert
   if memory usage > 75% then alert
   if swap usage > 25% then alert

Again, note the use of variables to define the host to be monitored. While all of the triggers defined here result in an alert, other actions also can be taken. For more information on these settings, consult the Monit documentation (see Resources).
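To make the threshold logic concrete, here is a rough shell sketch of the two load-average rules shown above. It is only an illustration and not how Monit evaluates its rules internally; the helper names exceeds and check_load are hypothetical.

```shell
#!/bin/bash
# Illustration only: mimics the two loadavg rules from the check above.
# exceeds and check_load are hypothetical helper names, not part of Monit.

# exceeds VALUE LIMIT: succeed when VALUE is strictly greater than LIMIT.
exceeds() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value > limit) }'
}

# check_load LINE: LINE uses the /proc/loadavg format. Print one message
# for each rule whose threshold is exceeded.
check_load() {
    local one five rest
    read -r one five rest <<< "$1"
    if exceeds "$one" 4; then echo "alert: loadavg (1min) $one > 4"; fi
    if exceeds "$five" 2; then echo "alert: loadavg (5min) $five > 2"; fi
}

# Evaluate the current load when /proc/loadavg is available (Linux).
if [ -r /proc/loadavg ]; then
    check_load "$(cat /proc/loadavg)"
fi
```

Running the script on an idle machine prints nothing, just as Monit stays quiet until a threshold is crossed.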

Custom Configuration of Monit

Once initial configuration is complete, you can define custom alerts. It's best to define the custom alerts outside the monitrc file. You do this by defining an include directory in the monitrc file as follows:


include /opt/monit-custom/*

This line includes all configuration files located in the /opt/monit-custom folder.

Next, let's look at two types of monitoring: host checks and program checks. Host checks allow for the monitoring of TCP-based services running on remote hosts. Although you can do basic TCP port connection testing for simpler services, Monit also provides the ability to do HTTP-based content checks to a specific URL. Consider the following example:


check host linuxjournal-website with address www.linuxjournal.com
    if failed
        port 443 protocol https
        with request / with content = "Become a Patron"
    then alert

The first line of the host check defines the identifier within Monit for this host (linuxjournal-website) and the address with which the host will be accessed (www.linuxjournal.com). In this example, the trigger within the host definition contains multiple conditions: the host must be reachable on port 443 using the https protocol, and when the root URL is requested, the response body must contain the text "Become a Patron". If any condition fails, Monit sends an alert. This check could be reconfigured to use port 80 and the http protocol.
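For comparison, the same content check can be approximated by hand from a shell, which can help when debugging a failing host check. This is only a rough sketch, assuming curl is installed; body_contains and check_url are hypothetical helper names, and Monit performs its check internally rather than by calling curl.

```shell
#!/bin/bash
# Hypothetical helpers that approximate the host check by hand.

# body_contains TEXT: succeed when standard input contains TEXT literally.
body_contains() {
    grep -F -- "$1" > /dev/null
}

# check_url URL TEXT: fetch URL and succeed when the body contains TEXT.
# --fail makes curl produce no body on HTTP errors such as 404 or 500.
check_url() {
    curl --fail --silent --show-error --max-time 10 "$1" | body_contains "$2"
}

# Example (needs network access, so it is left commented out):
# check_url https://www.linuxjournal.com/ "Become a Patron" || echo "check failed"
```

Keeping the text match in its own function means it can be tried against a saved copy of the page without touching the network.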

Along with host monitoring, Monit allows the definition of script-based monitors, which are called program checks. Once a script is configured within Monit, it is executed periodically, and Monit can take action based on the script's exit code.

Here's an example of a script that alerts when an SSL certificate expiration date is within a specified number of days:


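#!/bin/bash
# Usage: checkcertexpire.sh <min-days> <hostname> [port]
# Exits 1 if the certificate expires within <min-days> days, 0 otherwise.

# Print the "Not After" date of the certificate served at host $1, port $2.
domainexpiredate() {
    openssl x509 -text -in <(echo -n | \
    openssl s_client -connect "$1:$2" 2>/dev/null | \
    sed -n '/-*BEGIN/,/-*END/p') 2>/dev/null | \
    sed -n 's/ *Not After : *//p'
}

# Print the whole number of days until the certificate expires.
daysleft() {
    local expires
    expires=$(date -d "$(domainexpiredate "$1" "$2")" +%s)
    echo "((($expires-$(date +%s))/24)/60)/60" | bc
}

# Print the port given as $1, or 443 when none was supplied.
defaultport() {
    if [ -z "$1" ]; then
        echo "443"
    else
        echo "$1"
    fi
}

# Exit non-zero when the days remaining are at or below the threshold.
[[ $(daysleft "$2" "$(defaultport "$3")") -le $1 ]] && exit 1 || exit 0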

This script is executed with two arguments: minimum number of days until expiration and the hostname of the server, with an optional third parameter for port number. Here's an example execution of the script:


$ checkcertexpire.sh 31 www.linuxjournal.com
$ echo $?
0

When the script is executed with the two required arguments, it produces no console output. Echoing the return code ($?) after execution shows a value of 0, which indicates that the certificate does not expire within the next 31 days. Configuring this item within Monit requires the following:


check program linuxjournal-ssl
    with path "/etc/monit/scripts/checkcertexpire.sh 31 www.linuxjournal.com"
    if status != 0 then alert

In the same way as the host check, the program check has an identifier within Monit (linuxjournal-ssl, in this case). In the first line of the program check, along with the identifier, is the script to be executed along with the command-line arguments. Note that the trigger indicates that if the exit code is not 0, an alert should be sent.
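The day arithmetic and exit-code convention used by the certificate script can be tried in isolation with fixed dates, which makes the behavior easier to verify than a live certificate does. The sketch below assumes GNU date (for the -d option); days_until and check_expiry are hypothetical names that do not appear in the original script.

```shell
#!/bin/bash
# Hypothetical helpers; GNU date is assumed for the -d option.

# days_until EXPIRY NOW: print the whole days from NOW to EXPIRY.
days_until() {
    local expires now
    expires=$(date -u -d "$1" +%s) || return 2
    now=$(date -u -d "$2" +%s) || return 2
    echo $(( (expires - now) / 86400 ))
}

# check_expiry DAYS EXPIRY NOW: return 1 when DAYS or fewer remain and
# 0 otherwise, following the exit-code convention of the script above.
check_expiry() {
    local left
    left=$(days_until "$2" "$3") || return 2
    if [ "$left" -le "$1" ]; then
        return 1
    fi
    return 0
}

# The date format matches the "Not After" field printed by openssl.
days_until "Mar 31 00:00:00 2025 GMT" "Mar 1 00:00:00 2025 GMT"   # prints 30
```

With a 31-day threshold, those two dates make check_expiry return 1, which is the status that triggers the alert in the program check.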

Collecting Data with RRDtool

RRDtool is a robust tool that lets you collect data over a long period of time. Named after its database format (round-robin database), RRDtool saves time-based data to its database and then lets you retrieve and graph the data. RRDtool can store and graph any numeric value that a command or shell script can produce.

Before capturing data, you must initialize the database. For this example, let's create a database to capture the five-minute load average. Here's the command to initialize this specific database:


rrdtool create loadavg_db.rrd --step 60 \
    DS:loadavg:GAUGE:120:0:10000 RRA:MAX:0.5:1:1500

The first two arguments indicate that a database named loadavg_db.rrd is being created. The --step argument defines the expected time gap between data samplings. In this case, 60 seconds are expected between samplings.

Let's look at the two remaining arguments separately. The first begins with DS and defines a data source named loadavg. Note that the options for this data source are separated by colons. The GAUGE keyword says that each reading is stored as is, rather than being converted to a rate. The 120 is the heartbeat: the maximum number of seconds allowed between updates. If no update arrives within that window, the value is recorded as unknown rather than as a number, which marks a gap in the data feed. The 0 and 10000 are the minimum and maximum values that are accepted; readings outside that range also are stored as unknown.

The argument beginning with RRA defines a round-robin archive, which determines how many values are stored and for how long. MAX is the consolidation function: when several samples are combined into one stored value, the largest is kept. The 0.5 is the xfiles factor, the fraction of samples in a consolidation interval that may be unknown while the stored value is still considered known. The 1 is the number of steps consolidated into each stored value. In this case, every sample is stored individually, so the choice of consolidation function has no effect on the stored values. The final value, 1500, is the number of rows kept in the archive. Since each row covers one 60-second step, this configuration retains 25 hours of data.
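The retention period follows directly from the values in the create command: step length times steps per row times number of rows. As a quick check, here is that arithmetic in shell; rra_retention_seconds is a hypothetical helper name used only for this illustration.

```shell
#!/bin/bash
# rra_retention_seconds STEP STEPS_PER_ROW ROWS
# Print how many seconds of data a round-robin archive can hold.
rra_retention_seconds() {
    echo $(( $1 * $2 * $3 ))
}

# Values from the create command: --step 60 and RRA:MAX:0.5:1:1500
seconds=$(rra_retention_seconds 60 1 1500)
echo "$seconds seconds = $(( seconds / 3600 )) hours"
# prints: 90000 seconds = 25 hours
```

The same function shows the effect of other layouts. For example, consolidating five steps per row with the same 1500 rows would cover 450000 seconds, or 125 hours, at a coarser resolution.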

Now that the data is initialized, you can capture and store it in the database. To maintain accurate periodic data collection, it's best to create a crontab entry and have the data be collected at a desired interval. For this example, you would have the cron job run every minute. To collect data and put it in the database, use the following command:


rrdtool update loadavg_db.rrd --template loadavg \
    N:$(cut -d' ' -f2 /proc/loadavg)

To perform the data collection, the update argument is used along with the database name. The --template argument lets you specify the name of the data source to populate. This is the same loadavg data source that was defined when the database was initialized. The final argument has the form timestamp:value. The N stands for the current time, and the value after the colon is the output of the command substitution, which is the five-minute load average. You could place this command in a script and run it from the crontab every minute. The crontab entry would look like this:


* * * * * /path/to/rrdtool-script.sh

Since all of the time fields contain asterisks, the specified script will run every minute. Once the database has been populated, you can render a graph with the following command:


rrdtool graph loadavg_graph-$(date +"%m-%d-%Y").png \
-w 785 -h 120 -a PNG \
--slope-mode \
--start -86400 --end now \
--font DEFAULT:7: \
--title "5-minute load average" \
--watermark "`date`" \
--vertical-label "load average" \
--lower-limit 0 \
--right-axis 1:0 \
--x-grid MINUTE:10:HOUR:1:MINUTE:120:0:%R \
--alt-y-grid --rigid \
DEF:loadaverage=loadavg_db.rrd:loadavg:MAX \
LINE1:loadaverage#0000FF:"load" \
GPRINT:loadaverage:LAST:"Cur\: %5.2lf" \
GPRINT:loadaverage:AVERAGE:"Avg\: %5.2lf" \
GPRINT:loadaverage:MAX:"Max\: %5.2lf" \
GPRINT:loadaverage:MIN:"Min\: %5.2lf\t\t\t"

The first line calls the RRDtool graph function along with the filename of the image to create. In this instance, the image filename will contain the current date. All of the arguments beginning with -- set up the look and feel of the graph, including labels, axis configuration, image format and the time frame from which to pull the data. For detailed information on these arguments, see the RRDtool documentation.

The line beginning with DEF:loadaverage defines a graph variable named loadaverage, which will have the values from the loadavg data source you created in the database. The line beginning with LINE specifies the color of the graph line and the label to use in the legend. The GPRINT lines print summary statistics at the bottom of the graph. In this case, the last recorded value and the average, maximum and minimum values during the time frame will be displayed. Note that %5.2lf prints the value as a floating-point number with two digits after the decimal point, padded to a minimum total width of five characters.
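The width and precision in the GPRINT format follow the usual C printf conventions, so their effect can be previewed with the shell's own printf. This sketch uses %5.2f as a stand-in for RRDtool's %5.2lf; format_value is a hypothetical helper name.

```shell
#!/bin/bash
# format_value NUMBER: render NUMBER with a %5.2f conversion, that is,
# two decimals and a minimum total width of five characters.
format_value() {
    LC_ALL=C printf '%5.2f\n' "$1"
}

format_value 0.5       # prints " 0.50" (padded to five characters)
format_value 123.456   # prints "123.46" (wider values are not truncated)
```

The padding is what keeps the Cur, Avg, Max and Min figures aligned in the legend when the values have different numbers of digits.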

For ease of capturing daily graphs, you also could add this command to the crontab to run daily with the following entry:


0 0 * * * /path/to/rrdtool-graph.sh

This will run the graph script every day at midnight. The images may now be placed in a folder that is accessible via a browser for easy viewing.
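The crontab entries refer to wrapper scripts that are not listed here. A minimal sketch of what the collection script (/path/to/rrdtool-script.sh) might contain follows. The database location in RRD_FILE and the helper name five_min_load are assumptions made for this example, not details from the setup described above.

```shell
#!/bin/bash
# Hypothetical collection wrapper for the every-minute cron entry.
# The default database location below is an assumption; adjust as needed.
RRD_FILE="${RRD_FILE:-/var/lib/rrd/loadavg_db.rrd}"

# five_min_load LINE: print the second field of a /proc/loadavg line,
# which is the five-minute load average.
five_min_load() {
    local one five rest
    read -r one five rest <<< "$1"
    echo "$five"
}

# Update only when rrdtool, the database and /proc/loadavg are available.
if command -v rrdtool > /dev/null 2>&1 && [ -w "$RRD_FILE" ] \
        && [ -r /proc/loadavg ]; then
    rrdtool update "$RRD_FILE" --template loadavg \
        "N:$(five_min_load "$(cat /proc/loadavg)")"
fi
```

An absolute path to the database is used because cron does not run jobs from the directory in which the database was created. The graph script would need the same treatment for both the .rrd file and the output image.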

Although many monitoring solutions offer robust graphical UIs, Monit and RRDtool together provide basic monitoring, alerting and trending while using a minimum of system resources, and they offer a simple framework for disseminating the data collected.

Resources

About the Author

Andy Carlson has worked in IT for the past 15 years doing networking and server administration along with occasional coding. He is thankful to have chosen a career that he loves, grows in and learns from. He currently resides in Cincinnati, Ohio, with his wife, three daughters and his son. His family is currently in the process of adopting two children internationally. He enjoys playing the guitar, coding, and spending time with family and friends.
