Red Hat Enterprise Linux Cluster Suite

Khurram Shiraz

Issue #163, November 2007

Building a highly available solution using the RHEL cluster suite.

When mission-critical applications fail, so does your business. This often is a true statement in today's environments, where most organizations spend millions of dollars making their services available 24/7, 365 days a year. Organizations, whether they serve external or internal customers, are deploying high-availability solutions to keep their applications running.

In view of this growing demand, almost every IT vendor currently is providing high-availability solutions for its specific platform. Famous commercial high-availability solutions include IBM's HACMP, Veritas' Cluster Server and HP's Serviceguard.

If you're looking for a commercial high-availability solution on Red Hat Enterprise Linux, the best choice probably is the Red Hat Cluster Suite.

In early 2002, Red Hat introduced the first member of its Red Hat Enterprise Linux family of products, Red Hat Enterprise Linux AS (originally called Red Hat Linux Advanced Server). Since then, the family of products has grown steadily, and it now includes Red Hat Enterprise Linux ES (for entry- and mid-range servers) and Red Hat Enterprise Linux WS (for desktops/workstations). These products are designed specifically for use in enterprise environments to deliver superior application support, performance, availability and scalability.

The original release of Red Hat Enterprise Linux AS version 2.1 included a high-availability clustering feature as part of the base product. This feature was not included in the smaller Red Hat Enterprise Linux ES product. However, with the success of the Red Hat Enterprise Linux family, it became clear that high-availability clustering was a feature that should be made available for both AS and ES server products. Consequently, with the release of Red Hat Enterprise Linux version 3 in October 2003, the high-availability clustering feature was packaged into an optional layered product called the Red Hat Cluster Suite, and it was certified for use on both the Enterprise Linux AS and Enterprise Linux ES products.

The RHEL cluster suite is a separately licensed product and can be purchased from Red Hat on top of Red Hat's base ES Linux license.

Red Hat Cluster Suite Overview

The Red Hat Cluster Suite has two major features. One is the Cluster Manager that provides high availability, and the other feature is called IP load balancing (originally called Piranha). The Cluster Manager and IP load balancing are complementary high-availability technologies that can be used separately or in combination, depending on application requirements. Both of these technologies are integrated in Red Hat's Cluster Suite. In this article, I focus on the Cluster Manager.

Table 1 shows the major components of the RHEL Cluster Manager.

Table 1. RHEL Cluster Manager Components

Software Subsystem	Component	Purpose
Fence	fenced	Provides fencing infrastructure for specific hardware platforms.
DLM	libdlm, dlm-kernel	Contains distributed lock management (DLM) library.
CMAN	cman	Contains the Cluster Manager (CMAN), which is used for managing cluster membership, messaging and notification.
GFS and related locks	Lock_NoLock	Contains shared filesystem support that can be mounted on multiple nodes concurrently.
GULM	gulm	Contains the GULM lock management user-space tools and libraries (an alternative to using CMAN and DLM).
Rgmanager	clurgmgrd, clustat	Manages cluster services and resources.
CCS	ccsd, ccs_test and ccs_tool	Contains the cluster configuration services dæmon (ccsd) and associated files.
Cluster Configuration Tool	System-config-cluster	Contains the Cluster Configuration Tool, used to configure the cluster and display the current status of the nodes, resources, fencing agents and cluster services graphically.
Magma	magma and magma-plugins	Contains an interface library for cluster lock management and required plugins.
IDDEV	iddev	Contains the libraries used to identify the filesystem (or volume manager) in which a device is formatted.

Shared Storage and Data Integrity

Lock management is a common cluster infrastructure service that provides a mechanism for other cluster infrastructure components to synchronize their access to shared resources. In a Red Hat cluster, DLM (Distributed Lock Manager) or, alternatively, GULM (Grand Unified Lock Manager) are possible lock manager choices. GULM is a server-based unified cluster/lock manager for GFS, GNBD and CLVM. It can be used in place of CMAN and DLM. A single GULM server can be run in standalone mode but introduces a single point of failure for GFS. Three or five GULM servers also can be run together, in which case the failure of one or two servers can be tolerated, respectively. GULM servers usually are run on dedicated machines, although this is not a strict requirement.
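The "three or five servers" figures are consistent with simple majority quorum: a group of n lock servers stays usable while more than half of them are alive. The following is a minimal sketch of that arithmetic, assuming majority quorum is the rule in play; the function name gulm_tolerated is made up for illustration and is not part of the cluster suite:

```shell
#!/bin/sh
# gulm_tolerated N: print how many of N lock servers may fail while a
# strict majority still survives. Hypothetical helper; it assumes simple
# majority quorum, which matches the 3 -> 1 and 5 -> 2 figures in the text.
gulm_tolerated() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 1 3 5; do
    echo "$n server(s): tolerates $(gulm_tolerated "$n") failure(s)"
done
```

With a single server, the result is zero tolerated failures, which is the single point of failure mentioned above.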

In my cluster implementation, I used DLM, which runs on each cluster node. DLM is a good choice for small clusters (up to two nodes), because it removes the quorum requirements imposed by the GULM mechanism.

Based on DLM or GULM locking functionality, there are two basic techniques that can be used by the RHEL cluster for ensuring data integrity in concurrent access environments. The traditional way is the use of CLVM, which works well in most RHEL cluster implementations with LVM-based logical volumes.

Another technique is GFS. GFS is a cluster filesystem that allows a cluster of nodes to access simultaneously a block device that is shared among the nodes. It employs distributed metadata and multiple journals for optimal operation in a cluster. To maintain filesystem integrity, GFS uses a lock manager (DLM or GULM) to coordinate I/O. When one node changes data on a GFS filesystem, that change is visible immediately to the other cluster nodes using that filesystem.

Hence, when you are implementing a RHEL cluster with concurrent data access requirements (as in the case of an Oracle RAC implementation), you can use either GFS or CLVM. In most Red Hat cluster implementations, GFS is used with a direct access configuration to shared SAN from all cluster nodes. However, for the same purpose, you also can deploy GFS in a cluster that is connected to a LAN with servers that use GNBD (Global Network Block Device) or to iSCSI (Internet Small Computer System Interface) devices.

Both GFS and CLVM use locks from the lock manager. However, GFS uses locks from the lock manager to synchronize access to filesystem metadata (on shared storage), while CLVM uses locks from the lock manager to synchronize updates to LVM volumes and volume groups (also on shared storage).

For nonconcurrent RHEL cluster implementations, you can rely on CLVM, or you can use native RHEL filesystems (such as ext2 or the journaling ext3). For nonconcurrent access clusters, data integrity issues are minimal; I tried to keep my cluster implementations simple by using native RHEL OS techniques.

Fencing Infrastructure

Fencing also is an important component of every RHEL-based cluster implementation. The main purpose of the fencing implementation is to ensure data integrity in a clustered environment.

In fact, to ensure data integrity, only one node can run a cluster service and access cluster service data at a time. The use of power switches in the cluster hardware configuration enables a node to power-cycle another node before restarting that node's cluster services during the failover process. This prevents any two systems from simultaneously accessing the same data and corrupting it. It is strongly recommended that fence devices (hardware or software solutions that remotely power, shut down and reboot cluster nodes) are used to guarantee data integrity under all failure conditions. Software-based watchdog timers are an alternative used to ensure correct operation of cluster service failover; however, in most RHEL cluster implementations, hardware fence devices are used, such as HP ILO, APC power switches, IBM BladeCenter devices and the Bull NovaScale Platform Administration Processor (PAP) Interface.

Note that for RHEL cluster solutions with shared storage, an implementation of the fence infrastructure is a mandatory requirement.

Step-by-Step Implementation of a RHEL Cluster

Implementation of RHEL clusters starts with the selection of proper hardware and connectivity. In most implementations (without IP load balancing), shared storage is used with two or more servers running the RHEL operating system and RHEL cluster suite.

A properly designed cluster, whether you are building a RHEL-based cluster or an IBM HACMP-based cluster, should not contain any single point of failure. Keeping this in mind, you have to remove any single point of failure from your cluster design. For this purpose, you can place your servers physically in two separate racks with redundant power supplies. You also have to remove any single point of failure from the network infrastructure used for the cluster. Ideally, you should have at least two network adapters on each cluster node, and two network switches should be used for building the network infrastructure for the cluster implementation.

Software Installation

Building a RHEL cluster starts with the installation of RHEL on two cluster nodes. My setup has two HP Proliant servers (DL740) with shared fiber storage (HP MSA1000 storage). I started with a RHEL v4 installation on both nodes. It's best to install the latest available operating system version and its updates. I selected v4 update 4 (which was the latest version of RHEL when I was building that cluster). If you have a valid software subscription from Red Hat, you can log in to the Red Hat Network and go to the software channels to download the latest update available. Once you have downloaded the ISO images, you can burn them to CDs using any appropriate software. During the RHEL OS installation, you will go through various configuration selections, the most important of which are the date and time-zone configuration, the root user password setting, firewall settings and OS security level selection. Another important configuration option is network settings. Configuration of these settings can be left for a later stage, especially when building a high-availability solution with an Ether-channel (Ethernet bonding) configuration.

You may need to install additional drivers after you install the OS. In my case, I downloaded the RHEL support package for the DL740 servers (the HP Proliant support pack, which is available from h18004.www1.hp.com/products/servers/linux/dl740-drivers-cert.html).

The next step is installing the cluster software package itself. This package, again, is available from the RHEL network, and you definitely have to select the latest available cluster package. I selected rhel-cluster-2.4.0.1 for my setup, which was the latest cluster suite available at the time.

Once downloaded, the package will be in tar format. Extract it, and then install at least the following RPMs, so that the RHEL cluster with DLM can be installed and configured:

Magma and magma-plugins
Perl-net-telnet
Rgmanager
System-config-cluster
DLM and dlm-kernel
DLM-kernel-hugemem and SMP support for DLM
Iddev and ipvsadm
Cman, cman-smp, cman-hugemem and cman-kernelheaders
Ccs

Restart both RHEL cluster nodes after installing vendor-related hardware support drivers and the RHEL cluster suite.

Network Configuration

For network configuration, the best way to proceed is to use the network configuration GUI. However, if you plan to use Ethernet channel bonding, the configuration steps are slightly different.

Ethernet channel bonding allows for a fault-tolerant network connection by combining two Ethernet devices into one virtual device. The resulting channel-bonded interface ensures that if one Ethernet device fails, the other device will become active. Ideally, connections from these Ethernet devices should go to separate Ethernet switches or hubs, so that the single point of failure is eliminated, even on the Ethernet switch and hub level.

To configure two network devices for channel bonding, perform the following on node 1:

1) Create the bonding device in /etc/modules.conf (on RHEL 4, which uses a 2.6 kernel, the file is /etc/modprobe.conf). For example, I added the following lines on each cluster node:

alias bond0 bonding
options bonding miimon=100 mode=1

Doing this loads the bonding device with the bond0 interface name and passes options to the bonding driver to configure it as an active-backup master device for the enslaved network interfaces.

2) Edit the /etc/sysconfig/network-scripts/ifcfg-eth0 configuration file for eth0 and the /etc/sysconfig/network-scripts/ifcfg-eth1 file for the eth1 interface, so that the two files are identical apart from the DEVICE value, as shown below:

DEVICE=ethX
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

This enslaves ethX (replace X with the assigned number of the Ethernet devices) to the bond0 master device.

3) Create a network script for the bonding device (for example, /etc/sysconfig/network-scripts/ifcfg-bond0), which would appear like the following example:

DEVICE=bond0
USERCTL=no
ONBOOT=yes
BROADCAST=172.16.2.255
NETWORK=172.16.2.0
NETMASK=255.255.255.0
GATEWAY=172.16.2.1
IPADDR=172.16.2.182

4) Reboot the system for the changes to take effect.

5) Similarly, on node 2, repeat the same steps with the only difference being that the file /etc/sysconfig/network-scripts/ifcfg-bond0 should contain an IPADDR entry with the value of 172.16.2.183.

As a result of these configuration steps, you will end up with two RHEL cluster nodes with IP addresses of 172.16.2.182 and 172.16.2.183, each assigned to a virtual bonded interface that is backed by two physical Ethernet adapters.
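After the reboot, it can be useful to confirm which physical adapter the bond is actually using. The bonding driver publishes its state under /proc/net/bonding; the sketch below reads the "Currently Active Slave" line from that file. The function name active_slave is hypothetical, and the default path assumes the bond is named bond0 as in the steps above:

```shell
#!/bin/sh
# active_slave [FILE]: print the interface currently carrying traffic for
# a bond, as reported in the bonding driver's status file.
# Hypothetical helper; the default path assumes a bond named bond0.
active_slave() {
    awk -F': ' '/^Currently Active Slave/ { print $2 }' \
        "${1:-/proc/net/bonding/bond0}"
}
```

Running active_slave on a node, unplugging the cable of the interface it reports and running it again should show the other interface taking over, if active-backup mode is working as intended.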

Now, you easily can use the network configuration GUI on the cluster nodes to set other network configuration details, such as hostname and primary/secondary DNS server configuration. I set Commsvr1 and Commsvr2 as the hostnames for the cluster nodes and also ensured that name resolution in both long names and short names would work fine from both the DNS server and the /etc/hosts file.

A RHEL cluster, by default, uses /etc/hosts for node name resolution. The cluster node name needs to match the output of uname -n or the value of HOSTNAME in /etc/sysconfig/network.

Listing 1. Contents of the /etc/hosts File on Each Server

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain   localhost
172.16.2.182    Commsvr1        Commsvr1.kmefic.com.kw
172.16.2.183    Commsvr2
172.16.1.186    Commilo1        Commilo1.kmefic.com.kw
172.16.1.187    Commilo2        Commilo2.kmefic.com.kw
172.16.2.188    Commserver
192.168.10.1    node1
192.168.10.2    node2
172.16.2.4      KMETSM
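Because the cluster node name has to match what uname -n reports, a quick check that this name really is present in /etc/hosts can save some troubleshooting later. The following is only a sketch; name_in_hosts is a made-up helper, and it does a plain, case-sensitive comparison against the names listed after each address:

```shell
#!/bin/sh
# name_in_hosts NAME [FILE]: succeed if NAME appears as a hostname or alias
# (any field after the address) on a non-comment line of a hosts file.
# Hypothetical helper; the comparison is case-sensitive.
name_in_hosts() {
    awk -v n="$1" '
        $1 ~ /^#/ { next }                                  # skip comments
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}

# Check this node's own name, the one the cluster will use.
if name_in_hosts "$(uname -n)"; then
    echo "$(uname -n) is listed in /etc/hosts"
else
    echo "$(uname -n) is missing from /etc/hosts"
fi
```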

If you have an additional Ethernet interface in each cluster node, it is always a good idea to configure a separate IP network as an additional network for heartbeats between cluster nodes. Note that the RHEL cluster uses, by default, eth0 on the cluster nodes for heartbeats. However, it is still possible to use other interfaces for additional heartbeat exchanges.

For this type of configuration, you simply can use the network configuration GUI to assign IP addresses—for example, 192.168.10.1 and 192.168.10.2 on eth2, and get it resolved from the /etc/hosts file.

Setup of the Fencing Device

As I was using HP hardware, I relied on the configuration of the HP ILO devices as a fencing device for my cluster. However, you may consider configuring other fencing devices, depending on the hardware type used for your cluster configuration.

To configure HP ILO, you have to reboot your servers and press the F8 key to enter into the ILO configuration menus. Basic configuration is relatively simple; you have to assign IP addresses to ILO devices with the name of the ILO device. I assigned 172.16.1.100 with Commilo1 as the name of ILO device on node1, and 172.16.1.101 with Commilo2 as the ILO device name on node2. Be sure, however, to connect Ethernet cables to the ILO adapters, which usually are marked clearly on the back side of HP servers.

Once rebooted, you can use the browsers on your Linux servers to access ILO devices. The default user name is Administrator, with a password that usually is available on the hard-copy tag associated with the HP servers. Later, you can change the Administrator password to a password of your choice, using the same Web-based ILO administration interface.

Setup of the Shared Storage Drive and Quorum Partitions

In my cluster setup environment, I used an HP fiber-based shared storage MSA1000. I configured a RAID-1 of 73.5GB using the HP smart array utility, and then assigned it to both of my cluster nodes using the selective host presentation feature.

After rebooting both nodes, I used HP fiber utilities, such as hp_scan, so that both servers could see this array physically.

To verify the physical availability of shared storage for both cluster nodes, look in the /proc/partitions file for an entry like sda or sdb, depending upon your environment.
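The same check can be scripted and run on both nodes. The sketch below assumes the kernel's partition list in /proc/partitions is the place to look and that the device name appears in its fourth column; disk_visible is a hypothetical helper name:

```shell
#!/bin/sh
# disk_visible DEVICE [FILE]: succeed if DEVICE (for example, sda) is listed
# in the name column of the kernel partition table.
# Hypothetical helper; the default file is /proc/partitions.
disk_visible() {
    awk -v d="$1" '$4 == d { found = 1 } END { exit (found ? 0 : 1) }' \
        "${2:-/proc/partitions}"
}
```

For example, disk_visible sda && echo "shared array is visible" could be run on each node after the rescan.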

Once you find your shared storage on the OS level, partition it according to your cluster storage requirements. I used the parted tool on one of my cluster nodes to partition the shared storage. I created two small primary partitions to hold raw devices, and a third primary partition was created to hold the shared data filesystem:

(parted) select /dev/sda
(parted) mklabel msdos
(parted) mkpart primary ext3 0 20
(parted) mkpart primary ext3 20 40
(parted) mkpart primary ext3 40 40000

I rebooted both cluster nodes and created the /etc/sysconfig/rawdevices file with the following contents:

/dev/raw/raw1           /dev/sda1
/dev/raw/raw2           /dev/sda2

A restart of rawdevices services on both nodes will configure raw devices as quorum partitions:

/home/root> service rawdevices restart

I then created an ext3 filesystem on the third primary partition using the mke2fs command; however, its related entry should not be put in the /etc/fstab file on either cluster node, as this shared filesystem will be under the control of the Rgmanager of the cluster suite:

/home/root> mke2fs -j -b 4096 /dev/sda3

Now, you can create a directory structure called /shared/data on both nodes and verify the accessibility of the shared filesystem from both cluster nodes by mounting that filesystem one by one at each cluster node (mount /dev/sda3 /shared/data). However, never try to mount this filesystem on both cluster nodes simultaneously, as it might corrupt the filesystem itself.

Cluster Configuration

Almost everything required for cluster infrastructure has been done, so the next step is configuring the cluster itself.

A RHEL cluster can be configured in many ways. However, the easiest way to configure a RHEL cluster is to use the RHEL GUI and go to System Management→Cluster Management→Create a cluster.

I created a cluster with the cluster name of Commcluster, and with node names of Commsvr1 and Commsvr2. I added fencing to both nodes—fencing devices Commilo1 and Commilo2, respectively—so that each node would have one fence level with one fence device. If you have multiple fence devices in your environment, you can add another fence level with more fence devices to each node.

I also added a shared IP address of 172.16.2.188, which will be used as the service IP address for this cluster. This is the IP address that also should be used as the service IP address for applications or databases (like for listener configuration, if you are going to use an Oracle database in the cluster).

I added a failover domain, namely Kmeficfailover, with priorities given in the following sequence:

Commsvr1 
Commsvr2

I added a service called CommSvc and then put that service in the above-defined failover domain. The next step is adding resources to this service. I added a private resource of the filesystem type, with a device of /dev/sda3, a mountpoint of /shared/data and a mount type of ext3.

I also added a private resource of the script type (/root/CommS.sh) to service CommSvc. This script starts my C-based application, and therefore, it has to be present in the /root directory on both cluster nodes. It is very important that the script be owned by root and have appropriate (executable, restrictive) permissions on both nodes; otherwise, you can expect unpredictable behavior during cluster startup and shutdown.

Application or database startup and shutdown scripts are very important for a RHEL-based cluster to function properly. RHEL clusters use the same scripts for providing application/database monitoring and high availability, so every application script used in a RHEL cluster should have a specific format.

All such scripts should at least have start and stop subsections, along with a status subsection. When an application or database is available and running, the status subsection of the script should return a value of 0, and when an application is not running or available, it should return a value of 1. The script also should contain a restart subsection, which tries to restart services if the application is found to be dead.

A RHEL cluster always tries to restart the application on the same node that was the previous owner of the application, before trying to move that application to the other cluster node. A sample application script, which was used in my RHEL cluster implementation (to provide high availability to a legacy C-based application) is shown in Listing 2.

Listing 2. Sample Application Script

#!/bin/sh
# Script Name: CommS.sh
# Script Purpose: To provide application
# start/stop/status under Cluster
# Script Author: Khurram Shiraz

basedir=/home/kmefic/KMEFIC/CommunicationServer
case "$1" in
'start')
    cd "$basedir" || exit 1
    su kmefic -c "./CommunicationServer -f Dev-CommunicationServer.conf"
    exit 0
    ;;
'stop')
    # Collect the PIDs of any running CommunicationServer processes
    z=`ps -ef | grep Dev-CommunicationServer | grep -v "grep" | awk '{ print $2 }'`
    if [ -n "$z" ]
    then
        kill -9 $z
        fuser -mk /home/kmefic
    fi
    exit 0
    ;;
'restart')
    # Call this same script again for stop and start
    "$0" stop
    sleep 2
    echo "Now starting......"
    "$0" start
    echo "restarted"
    ;;
'status')
    # Return 0 if the application is running, 1 otherwise
    ps -U kmefic | grep CommunicationSe 1>/dev/null
    if [ $? -eq 0 ]
    then
        exit 0
    else
        exit 1
    fi
    ;;
esac
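Before handing a script like this to the cluster, it is worth exercising its status action by hand and looking at the exit code, because that code is what the text above says the cluster relies on (0 for running, 1 for not running). A minimal sketch, with status_code as a made-up helper name:

```shell
#!/bin/sh
# status_code CMD...: run "CMD... status" quietly and print its exit code.
# Hypothetical helper for checking an application script by hand.
status_code() {
    "$@" status >/dev/null 2>&1
    echo $?
}
```

For example, status_code /root/CommS.sh should print 0 while the application is running and 1 after it has been stopped.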

Finally, you have to add a shared IP address (172.16.2.188) to the service present in your failover domain, so that the service contains three resources: two private resources (one filesystem and one script) and one shared resource, which is the service IP address for the cluster.

The last step is synchronizing the cluster configuration across the cluster nodes. The RHEL cluster administration and configuration tool provides a “save configuration to cluster” option, which will appear once you start the cluster services. Hence, for the first synchronization, it is better to send the cluster configuration file manually to all cluster nodes. You easily can use the scp command to synchronize the /etc/cluster/cluster.conf file across the cluster nodes:

/home/root> scp /etc/cluster/cluster.conf Commsvr2:/etc/cluster/cluster.conf
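After a manual copy, one way to confirm that both nodes hold the same configuration is to compare the version number recorded in the file. The sketch below assumes that cluster.conf carries a config_version attribute on its root cluster element, as files written by the configuration tool normally do; conf_version is a hypothetical helper name:

```shell
#!/bin/sh
# conf_version [FILE]: print the config_version attribute of the <cluster>
# element in a cluster.conf file.
# Hypothetical helper; assumes the attribute sits on the same line as the
# opening <cluster ...> tag.
conf_version() {
    sed -n 's/.*<cluster[^>]*config_version="\([0-9]*\)".*/\1/p' \
        "${1:-/etc/cluster/cluster.conf}" | head -n 1
}
```

Running conf_version on Commsvr1 and on Commsvr2 should print the same number once the file has been synchronized.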

Once synchronized, you can start cluster services on both cluster nodes. You should start and stop RHEL-related cluster services, in sequence.

To start:

service ccsd start
service cman start
service fenced start
service rgmanager start

To stop:

service rgmanager stop
service fenced stop
service cman stop
service ccsd stop

If you use GFS, startup and shutdown of the gfs and clvmd services also has to be included in this sequence.
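The start and stop sequences above can be wrapped in one small function, so that the order is written down once and the stop order is derived by walking the same list backward. This is only a sketch for the non-GFS case; cluster_ctl and the RUN variable are names introduced here for illustration:

```shell
#!/bin/sh
# Start order from the text; stopping walks the same list backward.
SERVICES="ccsd cman fenced rgmanager"
# RUN is a hypothetical hook: set RUN=echo to preview the commands
# instead of executing them.
RUN=${RUN:-}

# cluster_ctl start|stop: run "service NAME ACTION" for each cluster
# service in the proper order, giving up at the first failure.
cluster_ctl() {
    action=$1
    list=$SERVICES
    if [ "$action" = "stop" ]; then
        list=
        for s in $SERVICES; do list="$s $list"; done   # reverse the order
    fi
    for s in $list; do
        $RUN service "$s" "$action" || return 1
    done
}
```

With RUN=echo set, cluster_ctl stop prints the four stop commands in the order given above without touching any service.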

Additional Considerations

In my environment, I decided not to start cluster services at RHEL boot time and not to shut down these services automatically when shutting down the RHEL box. However, if your business requires 24/7 service availability, you easily can enable automatic startup and shutdown of these services with the chkconfig command.
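If you do want the cluster services handled automatically at boot and shutdown, the chkconfig calls might look like the following sketch. The choice of runlevels 3, 4 and 5 is an assumption, and the CHKCONFIG variable is a hook added here only so the commands can be previewed:

```shell
#!/bin/sh
# enable_cluster_at_boot: switch on the four cluster init scripts for
# runlevels 3, 4 and 5 (an assumed choice of runlevels).
# CHKCONFIG is a hypothetical override; set CHKCONFIG=echo to preview.
enable_cluster_at_boot() {
    for svc in ccsd cman fenced rgmanager; do
        ${CHKCONFIG:-chkconfig} --level 345 "$svc" on || return 1
    done
}
```

The order in which the services actually start at boot is determined by the priorities in their init scripts, not by the order of the loop.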

Another consideration is logging cluster messages in a different log file. By default, all cluster messages go into the RHEL log messages file (/var/log/messages), which makes cluster troubleshooting somewhat difficult in some scenarios. For this purpose, I edited the /etc/syslog.conf file to enable the cluster to log events to a file that is different from the default log file and added the following line:

daemon.* /var/log/cluster

To apply this change, I restarted syslogd with the service syslog restart command. Another important step is to specify the time period for rotating cluster log files. This can be done by specifying the name of the cluster log file in the /etc/logrotate.conf file (the default is a weekly rotation):

/var/log/messages /var/log/secure /var/log/maillog /var/log/spooler
/var/log/boot.log /var/log/cron /var/log/cluster {
    sharedscripts
    postrotate
          /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}

You also have to pay special attention to keeping UIDs and GIDs synchronized across cluster nodes. This is important in making sure proper permissions are maintained, especially with reference to the shared data filesystem.
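One way to spot UID or GID drift is to copy the passwd file from the other node and compare the numeric IDs account by account. The following is a rough sketch; id_mismatches is a made-up helper that works on any two passwd-format files:

```shell
#!/bin/sh
# id_mismatches FILE_A FILE_B: for every account present in both
# passwd-format files, print "user uid:gid(A) uid:gid(B)" when the
# numeric IDs differ. Hypothetical helper.
id_mismatches() {
    awk -F: '
        NR == FNR { ids[$1] = $3 ":" $4; next }        # remember file A
        ($1 in ids) && ids[$1] != ($3 ":" $4) {         # compare file B
            print $1, ids[$1], $3 ":" $4
        }
    ' "$1" "$2"
}
```

For example, after scp Commsvr2:/etc/passwd /tmp/passwd.Commsvr2, running id_mismatches /etc/passwd /tmp/passwd.Commsvr2 should print nothing when the two nodes agree. The same function can be pointed at two copies of /etc/group to compare group IDs, because the GID sits in the third field there as well, although the fourth field then is the member list and will be reported if it differs.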

GRUB also needs to conform to the suite environment's specific needs. For instance, many system administrators, in a RHEL cluster environment, reduce the GRUB selection timeout to some lower values, such as two seconds, to accelerate system restart time.

Database Integration with a RHEL Cluster

The same RHEL cluster infrastructure can be used for providing high availability to databases, such as Oracle, MySQL and IBM DB2.

The most important thing to remember is to base your database-related services on a shared IP address—for example, you have to configure Oracle listener based on the shared service IP address.

Next, I explain, in simple steps, how to use an already-configured RHEL cluster to provide high availability to a MySQL database server, which is, no doubt, one of the most commonly used databases on RHEL.

I assume that the MySQL-related RPMs are installed on both cluster nodes and that the RHEL cluster already is configured with a service IP address of 172.16.2.188.

Now, you simply need to define a failover domain using the cluster configuration tool (with the cluster node of your choice having a higher priority). This failover domain will have the MySQL service, which, in turn, will have two private resources and one shared resource (the service IP address).

One of the private resources should be of the filesystem type (in my configuration, it has a mountpoint of /shared/mysqld), and the other private resource should be of the script type, pointing toward the /etc/init.d/mysql.server script. The contents of this script, which should be available on both cluster nodes, are shown in Listing 3 on the LJ FTP site at ftp.linuxjournal.com/pub/lj/listings/issue163/9759.tgz.

This script sets the data directory to /shared/mysqld/data, which is available on our shared RAID array and should be available from both cluster nodes.

Testing for high availability of the MySQL database can be done easily with the help of any MySQL client. I used SQLyog, which is a Windows-based MySQL client. I connected to the MySQL database on Commsvr1 and then crashed this cluster node using the halt command. As a result of this system crash, the RHEL cluster events were triggered, and the MySQL database automatically restarted on Commsvr2. This whole failover process took one to two minutes and happened quite seamlessly.
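If you would rather measure the failover time than estimate it, a small polling loop run from a client machine can do the job. The sketch below retries an arbitrary probe command once a second and reports how long it took to succeed; wait_until_up is a hypothetical helper, and the count is approximate because the probe itself takes time:

```shell
#!/bin/sh
# wait_until_up TIMEOUT CMD...: retry CMD about once a second until it
# succeeds, then print the number of seconds waited. Returns 1 if CMD
# still fails after TIMEOUT seconds. Hypothetical helper.
wait_until_up() {
    limit=$1
    shift
    waited=0
    until "$@" >/dev/null 2>&1; do
        [ "$waited" -ge "$limit" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    echo "$waited"
}
```

Started right after halting the active node, something like wait_until_up 300 mysqladmin -h 172.16.2.188 ping would report roughly how many seconds passed before MySQL answered again on the service IP address.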

Summary

RHEL clustering technology provides a reliable, highly available infrastructure that can be used for meeting 24/7 business requirements for databases as well as legacy applications. The most important thing to remember is that it is best to plan carefully before the actual implementation and test your cluster and all possible failover scenarios thoroughly before going live with a RHEL cluster. A well-documented cluster test plan also can be helpful in this regard.