Copyright (C) 1998, 1999 Philip J. Lewis

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Contact: lewispj@email.com

An Inexpensive High Availability Cluster for Linux

Introduction

Although Linux is widely known as an extremely stable operating system, the fact that standard PC hardware is not quite so reliable must not be overlooked. I have been maintaining Linux servers for a long time, and in most cases where a system has failed, the cause has been server hardware failure. Commercial Unix has a reputation for good clustering and high availability (HA) technologies; Linux has not really been put to the test quite so much in this area.

At my present company, Electec, we rely heavily on email (sendmail and imap4), Windows file sharing (Samba), ftp and dial-up authentication (radius) services on a 24-hour basis for communication with our suppliers, staff and customers, who are spread across many time zones. Until recently, we had all of these services consolidated on one Linux server, and that system has served us very well. However, it is only a matter of time before a hardware failure causes a loss of productivity and revenue. High availability is becoming increasingly important as businesses depend more and more on computers. I decided to design and implement an inexpensive high availability solution for our business-critical needs, without requiring expensive additional hardware or software. This article covers the design aspects, pitfalls and implementation experiences of that solution.

Clusters or Fault-Tolerant Hardware

There are quite a few approaches, and combinations of approaches, to high availability on servers. One is to use a single fault-tolerant server with redundant power supplies, RAID, environmental monitoring, fans, network interface cards and so on. The other is to use several units of non-redundant hardware arranged in a cluster, so that each node (or server) in the cluster can take over from any failure of a partner node. The fault-tolerant server has the advantage that the operating system, application configuration and operation are the same as on a simple, inexpensive server. With a cluster, the application and OS configuration can become very complex, and much advance planning is needed. With a fault-tolerant server, failures are handled so that clients never notice any downtime; the recovery is seamless. Ideally, this should also be true of node failures in a cluster. In many cases, the hardware cost of a cluster is far less than that of a single fault-tolerant server, especially when you do not have to fork out lots of cash for one of the commercial cluster software offerings.
Trade-off Between Cost and Downtime

There is a trade-off between cost and client disruption or downtime. You must ask yourself how much downtime you and your users can tolerate; shorter downtimes usually require a more complex or costly solution. In our case, I decided we could live with approximately five minutes of downtime in the event of a failure, so I chose a cluster rather than a single fault-tolerant server. Many clustering solutions on the Unix market can provide almost zero downtime during a node takeover by taking over sessions and network connections as well. These solutions are mostly expensive and normally require external shared storage hardware. In our case, we can allow sessions and connections to be lost, which greatly simplifies the task of implementing a high availability cluster.

Distribution of Storage Devices

When implementing HA clustering, some recommend a shared storage device, such as a dual-ported RAID box. It is also possible to give each node in the cluster its own storage device and mirror the storage devices as and when necessary. Each avenue has its merits. The shared-storage approach never requires any software mirroring of data between cluster nodes, saving precious CPU, I/O and network resources, and the data accessible from another cluster node is always up to date. The mirroring approach, using separate storage devices, has the advantage that the cluster nodes do not need to be in the same geographical location, which makes it more useful in a disaster-recovery scenario; in fact, if the mirror data were compressed, it could be sent over a WAN connection to a remote node. RAID systems do exist which allow the disks to be geographically distributed and interconnected by optical fibre, but they are rather expensive. Two sets of simple storage devices also cost less than a dual-ported RAID box of similar capacity.

A dual-ported RAID box can, in some cases, introduce a single point of failure into the cluster: if the RAID filesystem is somehow corrupted beyond recovery, it would cause serious cluster downtime. Most RAID systems mirror data at the device level with no regard for which filesystem is in use. A software-based system can mirror files in user space, so if a file becomes unreadable on one node, the same filesystem corruption will not necessarily be copied to the other node. Because of this weakness and the cost factor, I decided to use a separate storage device on each node in the cluster. It should be noted that even if a dual-ported storage device is used, both nodes in the cluster should never mount the same partition read/write simultaneously.

Cluster Load Balancing

It is often desirable to spread the workload evenly across the nodes in a cluster. It would be a waste of resources to have one node do all the hard work until it failed and then have another node take over everything; that would be termed a hot-standby system. Load balancing can take many forms, depending on the service in question. For web servers which provide simple static pages and are mainly read-only, a round-robin DNS solution can be quite effective.
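Round-robin DNS is simply a matter of publishing several address records for the same name. As an illustration only (these records are not part of the configuration described in this article, and the name is hypothetical), a zone-file excerpt using this cluster's two addresses might look like this; BIND then rotates the order of the addresses it returns, spreading clients across both nodes:

www    IN    A    203.1.2.1
www    IN    A    203.1.2.2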
Unfortunately, with read/write or transactional services such as email or database access, seamless load balancing over the cluster is very difficult unless the connection and session information held by the service on one node can be shared with and used by the other nodes. It would also require near-instantaneous disk mirroring and extensive distributed locking, which most daemons will not support without complex modifications. To avoid these drawbacks, I chose a simpler approach which sits between hot-standby and network-level load balancing. In my two-node cluster, I put half of the required services on node 'A' (Serv1) and the other half on node 'B' (Serv2), in a mutual failover configuration: if node 'A' fails, node 'B' takes over all of its services, and vice versa.

Service Suitability for Clustering

I had to decide which services needed to run on the cluster as a whole, which involved comparing how much computing resource each service consumes. On our previous Linux server, Samba and cyrus imap4 were the most resource-intensive services, with ftp, httpd and sendmail in approximately joint second place.

Careful consideration also had to be given to which services are already suited to running on two or more nodes concurrently. Examples include sendmail (for sending mail only, or as a relay), bind, httpd, ftpd (downloading only) and radius. Examples of services which cannot simply be run this way are cyrus imap4, ftpd (uploading) and Samba. The reason Samba cannot run on two servers at once is simply that two servers would be broadcasting the same NetBIOS name on the same LAN; it is not yet possible to have a PDC/BDC (primary and backup domain controller) arrangement with the current stable versions of Samba. The pages of a simple web site, on the other hand, do not change quickly, so mirroring is very effective and parallel web servers can run quite happily without any major problems.

The servers were configured so that each takes primary care of a specific group of services. In my case, I put Samba on Serv1 and cyrus imap4 on Serv2; the other services are divided in a similar way. httpd, bind and radius run on both nodes concurrently.

Node Takeover

In the event of a node failure, the other node takes over all the services of the failed one in a way that minimizes disruption to the network users. This is best achieved by having an unused ethernet card on the surviving node take over the IP and MAC addresses of the failed node. In effect, the surviving node appears to the network users to be both Serv1 and Serv2. MAC address takeover was preferred in order to avoid problems with clients' ARP caches still associating the old MAC address with the taken-over IP address. MAC address takeover, in my opinion, is neater and more seamless than IP takeover alone, but it does have some scalability limitations.
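To make the mechanism concrete, here is a minimal Bourne shell sketch of Serv2 taking over Serv1's identity on its redundant interface. This is not the author's actual takeover code (clusterd drives Red Hat's ifup/ifdown scripts, as described later); the addresses and interface names are taken from the node configuration files shown later in this article, and the netmask is assumed.

#!/bin/sh
# Hypothetical sketch of MAC and IP address takeover, run on serv2
# after serv1 has been declared failed.
FAILED_IPADDR=203.1.2.1            # serv1's MAIN_IPADDR
FAILED_MACADDR=00:10:4B:63:1C:08   # serv1's MAIN_MACADDR
REDUNDANT_INTERFACE=eth2           # serv2's spare LAN card

# The interface must be down before its hardware address can be changed.
ifconfig $REDUNDANT_INTERFACE down
ifconfig $REDUNDANT_INTERFACE hw ether $FAILED_MACADDR
ifconfig $REDUNDANT_INTERFACE $FAILED_IPADDR netmask 255.255.255.0 up

# From this point on, LAN clients see this machine as Serv1 as well as Serv2.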
Networking Hardware

When considering the HA network setup, it was very important to eliminate all single points of failure. Our previous server had many: the machine itself, the network cable, the ethernet hub, the UPS and so on; the list was endless. The network was designed to be inexpensive and reliable, as shown in Figure 1. In the network diagram, you can see that there are three network interface cards (NICs) in each server.

The first NIC in each server provides the main LAN access for clients. Each node is plugged into a separate ethernet switch or hub to give redundancy in case of a switch lock-up or failure (this actually happened to us not long ago). The second NIC creates a private inter-node network over a simple crossover 100BaseTX full-duplex ethernet cable; a crossover cable is far less likely to fail than two cables plugged into an ethernet hub or switch. This link carries the disk mirroring traffic and the cluster heartbeat. It also takes network load off the main interface and provides a redundant network path between the nodes. The third NIC is the redundant LAN access card, used for MAC and IP address takeover in the event of a remote node failure. Again, these cards are plugged into different ethernet switches or hubs for greater network availability.

Cluster Partitioning

If a node fails in some way, it is vital that only one of the nodes performs the IP and MAC address takeover. Determining which node has failed is easier said than done. With a simplistic takeover algorithm, a failure of the heartbeat network would cause both nodes to wrongly perform MAC, IP and application takeover, and the cluster would become partitioned. This would cause major problems on any LAN and would probably result in some kind of network and server deadlock.

One way to prevent this scenario is to have the node which first detects a remote node failure log in remotely via each of the remote node's interfaces and attempt to put it into a standby runlevel (e.g., single-user mode). This runlevel prevents the failed node from attempting to restart itself and thus stops an endless failure-recovery loop. There are problems with this method, though. What if node 'A', which has a failed NIC, decides node 'B' is not responding and remotely puts node 'B' into single-user mode? You would end up with no servers available to the LAN at all. There needs to be some mechanism for deciding which node has actually failed, and one of the only ways to do this in a two-node cluster is to rely on a third party. My method is to use a list of locally accessible devices which can be pinged on the LAN. By a process of arbitration, the node which detects the higher number of unreachable devices gracefully surrenders and goes into the standby runlevel. This is shown in Figure 2.
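A minimal sketch of this arbitration test, assuming a Bourne shell implementation and a reachlist file with one host per line (the real clusterd logic also applies a ping timeout and compares the counts between the nodes), might look like this:

#!/bin/sh
# Count how many hosts in the reachlist cannot be pinged from this node.
# The node with the higher count is the one most likely to be at fault.
REACHLIST=/etc/cluster.d/reachlist
UNREACHABLE=0

for host in `cat $REACHLIST`; do
    ping -c 1 $host >/dev/null 2>&1
    if [ $? -ne 0 ]; then
        UNREACHABLE=`expr $UNREACHABLE + 1`
    fi
done

echo $UNREACHABLE

Each node computes this count; the node reporting more unreachable devices surrenders and enters the standby runlevel, while the other performs the takeover.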
Distributed Mirroring of Filesystems

To implement this solution with minimal risk of data loss, the data on the two servers needs to be mirrored constantly. Ideally, data written to Serv1 would be written simultaneously to Serv2, and vice versa. In practice, a near-perfect mirror would require a substantial kernel implementation, and there would be many hurdles along the way, such as filesystem performance and distributed lock management. One way would be to implement a RAID mirror using disks from different nodes, in other words a cluster filesystem. This is supposed to be possible in the later incarnations of the 2.1 kernel, and probably in the 2.2 kernel, using md, NFS and network block devices; the performance of this solution remains to be evaluated once the 2.2 kernel is released. Another solution, which also remains to be evaluated, is the CODA distributed filesystem.

Synchronization Design

A practical way to keep a mirror of the data on each node is to let the administrator predefine the frequency of the file mirroring, not just for the node as a whole but on a per-file or per-directory basis. With this fine-grained control, the volatility of the data in a particular file, directory or application can be reflected in how often it is mirrored to the other node. For example, fast-changing data such as an imap4 email spool, where users are constantly moving, reading and deleting email, could be mirrored every minute, whereas slow-changing data such as the company's mainly static web pages could be mirrored hourly.

Trade-off Between Mirror Integrity and Excessive Resource Usage

There are trade-offs to consider when mirroring data this way. The major one is between mirror integrity and CPU, disk I/O and network consumption. It would be nice to have my imap4 mail spools mirrored every second; in practice this would not work, because the server takes 15 seconds to synchronize this spool each time. The CPU and disk I/O load could also become so high that the services would be noticeably slowed down, which would seem to defeat the objective of high availability. Even if the CPU had the resources to read the disks in less than a second, transferring the changes between the nodes could still hit a network throughput bottleneck.

The Risk of Data Loss

This mirroring approach does have some flaws. If a file is saved to a Samba file share on Serv1, and Serv1 fails before the share is next mirrored, the file will remain unavailable until Serv1 fully recovers. In the worst case, the Serv1 filesystem will have been corrupted and the file lost forever. Compared to a single server with a backup tape, however, this scenario is less risky, because traditional backups are made far less frequently than the mirroring in the cluster. Of course, a cluster is no replacement for traditional backups, which are still vital for many other reasons.

Resynchronization of Files on Node Recovery

A major design factor is the resynchronization (mirroring back) of files once a failed node has recovered. A reliable procedure must ensure that data which changed on the failover node during the outage is mirrored back to the original node, rather than lost because the original node overwrites or deletes it when restoring that same file or directory. The resynchronization procedure must therefore guarantee that (i) a node cannot perform any mirroring while another node has taken over its services, and (ii) before the services can be started on the original node after a failure, all files associated with them are completely mirrored back to the original node. This must be done while the services are kept off-line on both nodes, to prevent the services from writing to files which are still being written from the other node. Failure to prevent this could result in data corruption and loss.

Mirroring Caveats

There was a brain-teasing problem when using this solution with imap4 and pop3 mail spools. If an email is received and delivered on Serv2, and Serv2 fails before mirroring can take place, Serv1 will take over the mail services and subsequent mail will arrive in Serv1's mail spool. When Serv2 recovers, the email received just before the failure will be overwritten by the new mail received on Serv1. The best way to solve this would be to configure sendmail to queue a copy of any mail destined for itself for delivery to the takeover node as well. If the takeover node is off-line, that mail would simply remain in the sendmail queue.
Once the failed node recovers, the queued mail would be delivered successfully. This method requires that the mail spools and queues not be mirrored at all. It would, however, require two sendmail configurations on both nodes, one for normal operation and one for node-takeover operation, to prevent mail bouncing between the two servers. I will not pretend to be a sendmail expert; if you know how to configure dual-queue sendmail delivery, please let me know. This part is still a work in progress. As a temporary measure, until I implement the above solution, I create backup files when the mail spool is resynchronized; these need manual checking on node recovery, which is time-consuming. I also reduce the exposure by mirroring the mail spool as frequently as possible, which has the unfortunate temporary side effect of making my hard disks work overtime. Similar problems would be encountered when clustering a database service; however, a few large Unix database vendors now provide parallel versions of their products which can operate concurrently across several nodes in a cluster.

The Node Recovery Procedure

A node can fail for various reasons, ranging from an operating system crash, which results in a hang or a reboot, to a hardware failure, which results in the node going into standby mode. A node in standby mode will not recover automatically: the administrator must manually remove a standby lock file and start runlevel 5 on the failed node, confirming to the rest of the cluster that the problem which caused the failure has been resolved. If the OS hangs, the effect is the same as the standby runlevel; but if the reset button is pressed or the system reboots, the node will try to rejoin the cluster, since there is no standby lock file. When a node attempts to rejoin the cluster, the other node detects the recovery and stops all of the cluster services while the disks are resynchronized. Once this has completed, the cluster services are restarted and the cluster is once again in full operation.

Implementation Platform

My choice of Linux distribution is Red Hat 5.1 on the Intel platform. There is, however, no reason why this could not be adapted to another Linux distribution. The implementation is purely in user space; no special drivers are required. There are some basic prerequisites for deploying this system effectively. You will need:

* Two similarly equipped servers, especially in terms of data storage space.
* Three network interface cards per server (recommended; you could get away with two at the expense of some modifications and extra LAN traffic).
* Sufficient network bandwidth between the cluster nodes.

My system is configured as follows:

* 2 x Dell PowerEdge 2300 servers, each complete with:
* 3 x 3C905B 100BaseTX ethernet cards
* 2 x 9GB Ultra SCSI-2 hard disks
* 1 x Pentium II 350MHz CPU

Overview of Cluster Software Configuration

The administrator configures groups of files to be mirrored together by creating small mirror description files in the configuration directory. Below is the description file for my lpd mirroring. Note that directory entries must end with a '/'.

Contents of '/etc/cluster.d/serv1/Flpd':

/var/spool/lpd/
/etc/printcap

The frequency of the mirroring is then controlled by an entry in the '/etc/crontab' file, also shown below.
Each entry executes the sync-app program, which examines the specified service mirror description file and mirrors its contents to the specified server address. In this example, the target address, serv2-hb, is the crossover-cable IP address on Serv2. These crontab entries are from Serv1; they mirror the Samba files every five minutes and the lpd system every hour.

Example crontab entries:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * root /usr/local/bin/sync-app \
    /etc/cluster.d/serv1/Fsmbd serv2-hb
0 * * * * root /usr/local/bin/sync-app /etc/cluster.d/serv1/Flpd serv2-hb

Cluster Daemon Implementation

The brains of the system lie in the cluster daemon, clusterd. It was written in the Bourne shell and will shortly be rewritten in C. The algorithm is outlined as a flowchart in Figure 2. clusterd continuously monitors the ICMP reachability of the other node in the cluster, as well as a list of hosts which are normally reachable from each node, using a simple ping mechanism with a timeout. If the other node becomes unreachable or partially unreachable, clusterd decides which node actually has the failure by counting the number of hosts in the list each node can reach. The node which can reach the fewest hosts is the one put into standby mode. clusterd then starts the failover and takeover procedures on the working node, which continues to monitor whether the failed node recovers; when it does, clusterd controls the resynchronization procedure. clusterd is invoked on each node simply as:

clusterd
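The following is a highly simplified sketch of that monitoring loop, assuming the Bourne shell implementation. The lock file path and the helper names count_unreachable, remote_count_unreachable, start_takeover and wait_for_recovery_and_resync are placeholders invented to illustrate the control flow; they are not names from the real clusterd.

#!/bin/sh
# Simplified outline of the clusterd monitoring loop (illustrative only).
REMOTE_HB=serv2-hb                          # other node's heartbeat address
STANDBY_LOCK=/etc/cluster.d/standby.lock    # hypothetical lock file name

while true; do
    if ping -c 1 $REMOTE_HB >/dev/null 2>&1; then
        sleep 10                            # remote node is alive; keep watching
        continue
    fi

    # Heartbeat lost: arbitrate using the reachlist (see the earlier sketch).
    LOCAL_BAD=`count_unreachable /etc/cluster.d/reachlist`
    REMOTE_BAD=`remote_count_unreachable`   # queried over the main LAN, if possible

    if [ "$LOCAL_BAD" -gt "$REMOTE_BAD" ]; then
        # This node looks like the faulty one: surrender gracefully.
        touch $STANDBY_LOCK
        telinit 1
    else
        # The remote node is faulty: take over its MAC/IP addresses and
        # services, then wait for it to recover and mirror its files back.
        start_takeover
        wait_for_recovery_and_resync
    fi
done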
clusterd has to know which applications and services run on each node, so that it knows which ones to start and stop at failover and takeover time. This is defined in the same configuration directories as the service mirror description files described earlier. The configuration directories are identical on each node and are mirrored across the whole cluster, which makes life easier for the cluster administrator, who can configure the cluster from a single designated node. Within the '/etc/cluster.d/' directory there is a '<node name>.conf' file and a '<node name>' directory for each node in the cluster, plus a reachlist file containing the list of normally reachable external hosts on the LAN. The contents of the '/etc/cluster.d' directory are shown here:

[root@serv1 /root]# ls -al /etc/cluster.d/
total 8
drwxr-xr-x   4 root  root  1024 Nov 15 22:39 .
drwxr-xr-x  23 root  root  3072 Nov 22 14:27 ..
drwxr-xr-x   2 root  root  1024 Nov  4 20:30 serv1
-rw-r--r--   1 root  root   213 Nov  5 18:49 serv1.conf
drwxr-xr-x   2 root  root  1024 Nov  8 20:29 serv2
-rw-r--r--   1 root  root   222 Nov 22 22:39 serv2.conf
-rw-r--r--   1 root  root    40 Nov 12 22:19 reachlist

As you can see, the two nodes are called serv1 and serv2. The configuration directory for serv1 contains the following files:

/etc/cluster.d/serv1:
total 9
drwxr-xr-x   2 root  root  1024 Nov 16 20:30 .
drwxr-xr-x   4 root  root  1024 Nov 15 22:39 ..
-rw-r--r--   1 root  root    23 Oct  8 19:39 Fauth
-rw-r--r--   1 root  root    16 Nov 16 20:30 Fclusterd
-rw-r--r--   1 root  root    34 Oct  8 19:39 Fdhcpd
-rw-r--r--   1 root  root    30 Oct  8 19:39 Flpd
-rw-r--r--   1 root  root    12 Nov 15 22:14 Fnamed
-rw-r--r--   1 root  root    19 Oct  8 19:39 Fradiusd
-rw-r--r--   1 root  root   102 Oct  8 19:39 Fsmbd
lrwxrwxrwx   1 root  root    24 Oct 25 17:33 K10radiusd -> /etc/rc.d/init.d/radiusd
lrwxrwxrwx   1 root  root    20 Oct 25 17:33 K30httpd -> /etc/rc.d/init.d/httpd
lrwxrwxrwx   1 root  root    20 Oct 25 17:33 K40smb -> /etc/rc.d/init.d/smb
lrwxrwxrwx   1 root  root    20 Oct 25 17:33 K60lpd -> /etc/rc.d/init.d/lpd
lrwxrwxrwx   1 root  root    22 Oct 25 17:33 K70dhcpd -> /etc/rc.d/init.d/dhcpd
lrwxrwxrwx   1 root  root    22 Oct 25 17:33 K80named -> /etc/rc.d/init.d/named
lrwxrwxrwx   1 root  root    22 Oct 25 17:33 S20named -> /etc/rc.d/init.d/named
lrwxrwxrwx   1 root  root    22 Oct 25 17:33 S30dhcpd -> /etc/rc.d/init.d/dhcpd
lrwxrwxrwx   1 root  root    20 Oct 25 17:33 S40lpd -> /etc/rc.d/init.d/lpd
lrwxrwxrwx   1 root  root    20 Oct 25 17:33 S50smb -> /etc/rc.d/init.d/smb
lrwxrwxrwx   1 root  root    20 Oct 25 17:33 S60httpd -> /etc/rc.d/init.d/httpd
lrwxrwxrwx   1 root  root    24 Oct 25 17:33 S90radiusd -> /etc/rc.d/init.d/radiusd

The files beginning with the letter 'F' are service mirror description files. Those starting with 'S' and 'K' are symbolic links to the SysVinit start/stop scripts and behave much like the links in the SysVinit runlevel directories: the 'S' services are started when node serv1 is in normal operation, and the 'K' services are killed when node serv1 goes out of service. The number following the 'S' or 'K' determines the order in which the services are started or stopped. clusterd running on node serv2 uses this same '/etc/cluster.d/serv1/' directory to decide which services to start on serv2 when node serv1 has failed. It also uses serv1's service mirror description files (the files starting with 'F') to determine which files and directories must be mirrored back (resynchronized) to serv1 after it has recovered. The configuration directory for node serv2 is as follows:

/etc/cluster.d/serv2:
total 6
drwxr-xr-x   2 root  root  1024 Nov 16 20:29 .
drwxr-xr-x   4 root  root  1024 Nov 15 22:39 ..
-rw-r--r--   1 root  root    88 Oct 15 17:15 Fftpd
-rw-r--r--   1 root  root    29 Oct 15 17:15 Fhttpd
-rw-r--r--   1 root  root    44 Oct  8 18:47 Fimapd
-rw-r--r--   1 root  root    62 Nov 16 20:29 Fsendmail
lrwxrwxrwx   1 root  root    25 Sep 25 17:33 K60sendmail -> /etc/rc.d/init.d/sendmail
lrwxrwxrwx   1 root  root    22 Sep 25 17:33 K80httpd -> /etc/rc.d/init.d/httpd
lrwxrwxrwx   1 root  root    22 Sep 25 17:33 K85named -> /etc/rc.d/init.d/named
lrwxrwxrwx   1 root  root    21 Nov 15 22:52 K90inetd -> /etc/rc.d/init.d/inet
lrwxrwxrwx   1 root  root    21 Nov 15 22:52 S10inetd -> /etc/rc.d/init.d/inet
lrwxrwxrwx   1 root  root    22 Sep 25 17:33 S15named -> /etc/rc.d/init.d/named
lrwxrwxrwx   1 root  root    22 Sep 25 17:33 S20httpd -> /etc/rc.d/init.d/httpd
lrwxrwxrwx   1 root  root    25 Sep 25 17:33 S40sendmail -> /etc/rc.d/init.d/sendmail

As you can see, node serv2 normally runs sendmail, named, httpd, imap4 and ftpd.

Network Control Scripts

Whenever the network interfaces need to be brought up or down, I use Red Hat's supplied 'ifup' and 'ifdown' scripts. This keeps the network interface configuration tightly integrated with the GUI network configuration tools. The node configuration files, '/etc/cluster.d/<node name>.conf', specify which ethernet NIC is used for which purpose on each node in the cluster.
Here are my node configuration files:

[root@serv1 /root]# cat serv1.conf
# config for serv1 main interface & address
MAIN_IPADDR=203.1.2.1
MAIN_INTERFACE=eth0
MAIN_MACADDR=00:10:4B:63:1C:08
# Server's heartbeat interface & address
HEARTBEAT_IPADDR=192.168.111.1
HEARTBEAT_INTERFACE=eth1
# Server's redundant interface
REDUNDANT_INTERFACE=eth2

[root@serv1 /root]# cat serv2.conf
# config for serv2 main interface & address
MAIN_IPADDR=203.1.2.2
MAIN_INTERFACE=eth0
MAIN_MACADDR=00:10:4B:31:46:0F
# Server's heartbeat interface & address
HEARTBEAT_IPADDR=192.168.111.2
HEARTBEAT_INTERFACE=eth1
# Server's redundant interface
REDUNDANT_INTERFACE=eth2

To implement MAC address takeover, there is one important addition to the Red Hat ethernet configuration files: you must add a line to '/etc/sysconfig/network-scripts/ifcfg-eth2' to set the MAC address. eth2 is the redundant interface in my case, so it needs to take over the MAC address of the main service interface on the other node in the cluster. In other words, the MAC address of eth2 on serv2 must be the same as the MAC address of eth0 on serv1. The line 'MACADDR=00:10:4B:63:1C:08' was therefore appended to this file on node serv2; Red Hat's 'ifup' uses this variable when bringing up the interface. A similar modification must be made on each node.

If you are using an ethernet switch (instead of a hub), it will be necessary to set the MAC address cache timeout to a suitable period to avoid the cluster losing communication with the LAN clients after a MAC address takeover. I set ours to 20 seconds on the ports connected directly to the nodes. Consult your switch manual or vendor if you need information on how to do this; it can usually be done via the console cable.

Procedure for setting up a cluster

1. Install the required clusterd software and associated utilities (see Resources).
2. Configure the '<node name>.conf' files with your selected interfaces, IP addresses and MAC addresses.
3. On each node, use the Red Hat netcfg tool to configure the addresses of your three ethernet interfaces. Make the first two (eth0 and eth1) 'activate at boot time' and leave the redundant interface (eth2) inactive. Configure the Red Hat MACADDR variable as described above.
4. Get ssh working as the root user, so that you can log in between the nodes over any interface without entering a password. You may want to use the '.shosts' file for this.
5. Create a 'reachlist' of two or three normally reachable hosts on the shared LAN. SNMP-managed ethernet switches or hubs are ideal candidates.
6. Start the clusterd program from inittab.

Procedure for adding a service to the cluster

1. Install the service (e.g., Samba) on each node as normal.
2. Configure the service as usual for a standalone server, but only on one primary node (i.e., the node where the service will normally run).
3. Create a service mirror description file in that node's configuration directory, listing the directories and files you wish to be mirrored between the nodes. Be sure to include the configuration files (e.g., /etc/smb.conf) and the data directories (e.g., /home/samba/ and /var/samba/). A worked example for Samba is shown below.
4. Create a sync-app crontab entry on the primary node and decide on the mirroring frequency.
5. Remove any start script symbolic links which may be present in the usual SysVinit runlevel directories (e.g., /etc/rc.d/rc5.d/S??smbd).
6. Create start/stop script symbolic links in the node's configuration directory for your service.
7. Double-check your settings.
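As an illustration of steps 3 and 4, here is what the Samba entries on serv1 might look like. The file list is an assumption based on the paths mentioned in step 3, not a copy of the author's actual Fsmbd file; the crontab line matches the five-minute Samba mirroring shown earlier.

Contents of '/etc/cluster.d/serv1/Fsmbd' (hypothetical):

/etc/smb.conf
/home/samba/
/var/samba/

Crontab entry on serv1, mirroring every five minutes to serv2's heartbeat address:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * root /usr/local/bin/sync-app \
    /etc/cluster.d/serv1/Fsmbd serv2-hb

The start/stop links for Samba in '/etc/cluster.d/serv1/' are the S50smb and K40smb symbolic links to /etc/rc.d/init.d/smb shown in the directory listing earlier.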
Centralized Cluster Administration

I have created service mirror description files and crontab entries for /etc/hosts, the passwd/group files and the whole /etc/cluster.d/ directory, so that I can administer the cluster from a single node. This greatly simplifies cluster configuration. To avoid confusion, I found it helpful to create a DNS alias for each service on the cluster, pointing to the primary node for that service; when I need to configure Samba, all I have to do is log in to 'samba.yourdomainname.com'. If the secondary node for a service is configured by mistake, the changes will be ignored until the primary node fails.

Current Software Limitations

Currently, only two nodes may be in a cluster. Scaling this up to more than two nodes should not be too difficult, although a different approach would probably have to be used instead of MAC address takeover, because of the large number of NICs otherwise required for larger clusters.

Other Utilities Used

Several useful utilities enabled me to do efficient mirroring. rsync is an invaluable utility built around the rsync algorithm: it looks for changes in files and mirrors only the parts which have changed, rather than the whole file. It also checks whether a file has been updated, by examining the modification date and file size, before doing any further comparisons. ssh (secure shell) can be used between the nodes in conjunction with rsync, so that the mirrored data is sent over an encrypted and authenticated connection; you can use rsh instead if you prefer. Because rsync compares files using their date and time, it is vital that the nodes agree on the time. I chose to run the netdate utility every hour from cron, against a list of remote trusted time sources. To make sure a failed node boots with the correct time, the CMOS clock is also updated after running netdate.

Synchronization Implementation

rsync is configured so that files which no longer exist in the source directory are deleted from the target directory. This behaviour is necessary to avoid cumulative and excessive disk usage on the target node; without it, a user connected to a Samba file share would effectively be unable to delete any file on the mirrored node, and the same goes for almost all applications. clusterd is configured to create backups of deleted or changed files while the resynchronization procedure is in progress. This helps minimize the risk of data loss in the event of a mirroring failure prior to a node takeover. Removing the backup files afterwards requires some human intervention, once it has been confirmed that no files or data were lost in the node recovery. This is done using the '--backup' option in rsync version 2.2.1. You may find it more CPU-efficient to turn off the rsync algorithm and mirror changed files in full rather than mirroring only the changes; however, this will use more network bandwidth.

Resynchronization Implementation

The resynchronization (mirroring back) procedure is also implemented with rsync. It uses a lock file which disallows any mirroring to another node while a node failure is being handled. sync-app checks for the existence of the lock file before any files are mirrored. This prevents node 'A' mirroring to node 'B' while node 'B' is mirroring the same files to node 'A'.
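Here is a stripped-down sketch of what sync-app does, assuming a Bourne shell implementation; the lock file path and the exact rsync options are illustrative rather than copied from the real program:

#!/bin/sh
# Usage: sync-app <mirror-description-file> <target-host>
# Mirrors every file or directory listed in the description file to the
# same path on the target host, over ssh.
DESC=$1
TARGET=$2
LOCKFILE=/etc/cluster.d/resync.lock    # hypothetical name

# Do not mirror while a failover/resynchronization is in progress.
if [ -f $LOCKFILE ]; then
    exit 0
fi

for path in `cat $DESC`; do
    # -a preserves permissions and times, --delete removes files that are
    # gone from the source, --backup keeps copies of files that would be
    # overwritten or deleted, and -e ssh encrypts the transfer.
    rsync -a --delete --backup -e ssh $path root@$TARGET:$path
done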
Using a Shared Storage Device

clusterd could be used with a shared and/or distributed storage device, if preferred, by removing the resynchronization function and not using sync-app, although I have not tried this yet.

Testing and Results

To test server failure, I simulated the failure of every interface on the cluster. In every case, the cluster took the expected action and shut down the correct server. When the inter-node/heartbeat network failed, the nodes simply carried on normal operation and notified the administrator of the failure; on a point-to-point link of this nature, it is almost impossible to determine which NIC is at fault. I also simulated various network switch and power supply failures, and the results were all as expected. After a node was put into standby mode (single-user mode), I had to remove the standby lock file manually in order to bring the node fully up again. If a node recovered and entered a network runlevel while the standby lock file still existed, the remote node immediately put it back into standby mode to prevent an IP and MAC address clash on the LAN.

The mirroring was tested over a period of several months. The nodes could typically compare 6GB of unchanged data in around 50,000 files in under 45 seconds. After a catastrophic node failure (I pulled the power plug from the UPS), recovery time for the node was around 10 to 15 minutes for fsck disk checking, plus a disk resynchronization time of around 3 minutes (for more than 9GB of data). This represented a cluster services downtime of around 3 minutes for the LAN clients. The failover delay, from the moment a node failed until the remote node had fully taken over, was typically 60 to 80 seconds. The effect on users depended on the service: sendmail, imap4, httpd and ftp simply refused connections for the duration, whereas Samba sometimes momentarily locked up a Windows PC application that had files open at the point of failure. Radius and dhcpd caused no client lock-outs, probably because they use UDP.

Conclusions

On the whole, the cluster gives us much better system availability. It is a vast improvement over the single server; we can now afford to do server maintenance and upgrades during working hours. We have not yet had any real-life catastrophic failures with the new Dell servers, but the test results show a minimal downtime of less than two minutes while a node takes over. We have saved large amounts of capital by implementing a simple high availability cluster without the need for expensive specialist hardware such as dual-ported RAID. This clustering solution is certainly not as advanced as some of the commercial clusters, or as thorough as some of the upcoming open-source Linux-HA project proposals, but it meets our needs. The system has been in full-time production since September 1998, with over 30 LAN clients using the cluster as their primary 'server', and it has proven to be reliable. The company regards the cluster as a business-critical system, and it seems we have achieved our high availability objectives.

Future Improvements

* Use of a high-performance network block device with software RAID-1, which would do away with the periodic mirroring and its associated caveats.
* Evaluation of the CODA distributed filesystem as a cluster filesystem.
* Improved administrator notification and communication with network management systems.
* Use of service watchdogs or agents which monitor the health of a particular service and initiate a node takeover when things look bad.
* Inclusion of a lower-level and more robust heartbeat, which could use the serial port or target-mode SCSI instead of the more complex network interface card.
* A GUI and setup scripts for cluster and node configuration.
* Improved scalability across more than two nodes.
* Dual sendmail queuing and delivery for concurrent operation.

Other Related Links

* Eddie, a high availability and load-balancing system, currently for DNS and httpd: http://www.eddieware.org
* RSF-1, a commercial HA offering for Linux which uses dual-ported RAID: http://www.rsi.co.uk
* Alan's High-Availability Linux Project site: http://www.henge.com/~alanr/ha
* Software RAID HOWTO by Linas Vepstas: http://www.linas.org/linux/Software-RAID/Software-RAID.html
* Linux High Availability HOWTO by Harald Milz: http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
* The Coda Distributed File System: http://www.coda.cs.cmu.edu/
* Slashdot.org contributors: http://slashdot.org

Resources

* clusterd: ftp://ftp.ssc.com/pub/????/clusterd/
* rsync, written by Andrew Tridgell: http://samba.anu.edu.au/rsync
* Secure shell, ssh: http://www.replay.com/redhat/ssh.html
* Red Hat Linux: http://www.redhat.com

Credits

* Electec, Singapore, for being guinea pigs and letting me loose on their networks.
* Dr. Ian Vince McLoughlin, for reading, correcting and suggesting useful improvements to this article.

About the Author

Philip Lewis is from the UK. He graduated from the University of Birmingham in 1994 and has spent three years working in Singapore. He has been writing software since 1983. He now runs his own consultancy company in the UK, designing WAN/LAN infrastructures and writing Linux software. His interests include Linux software development and hacking, telecoms, network security, promoting Linux, making wine and eating good food in Malaysia. He can be reached via email at lewispj@email.com.