BYOC: Build Your Own Cluster, Part I—Design

Nathan R. Vance

Michael L. Poublon

William F. Polik

Issue #277, May 2017

Cluster computing mainly has been relegated to the professional realm. It doesn't have to be that way. Anyone with modest Linux experience can build their one.

Computer clusters are standard tools in scientific computing. Clusters speed up calculations by computing a single task in parallel or by computing multiple tasks simultaneously, thus rapidly solving extremely large or computationally intensive problems. They utilize commodity hardware, resulting in an excellent performance-to-price ratio.

In a basic computer cluster, one computer (the head node) relays instructions to the rest of the computers (compute nodes) across an isolated local network. The compute nodes then carry out their assigned tasks, optionally communicate among themselves, and return the results to the head node. This structure is analogous to a work force: the head node is the manager, who receives jobs from a customer and subcontracts with the compute node workers. When a worker is done, it signals the manager that it is available for another task. When the job is completed, the manager returns the final product to the customer—in this case, a calculated result.

Building a cluster typically is accomplished in one of two different ways. One way is to use configuration software, such as ROCKS, to set up a cluster automagically. Although that method has the obvious strength of convenience, weaknesses include being constrained to the assumptions and supported architectures of the tool. Additionally, it can be difficult to diagnose problems that arise, because such tools conceal what's happening under the hood.

The other method of cluster building is to build your own. It may take more time to complete, but the level of control and understanding of the cluster gained by this method makes it worth it in the long run. Cluster design goals include:

Robustness: the cluster is flexible enough to support many different applications, whether or not known at the time of its initial setup.
Reliability: the cluster, both hardware and software, should be stable for the long-term after the initial setup.
Portability: the same process and tools used in this guide can be used on different distributions with little change.
Scalability: the procedures should be practical if there are 2, 10 or 100 nodes.
No “Magic”: common problems are resolved by straightforward, well documented, easy-to-follow solutions.
Heterogeneity: upgrading in the future is still possible when the hardware may be different.
Simplicity: the approach outlined here should allow people with modest Linux/UNIX experience to build a cluster of their own.
Low-Cost: the cluster uses readily available hardware and free software (Linux).

This three-part series walks through the process of building a cluster that has all of those attributes. This first article gives an overview of the cluster's design, including the setup of computing hardware, networking and storage. The second article will deal with installing the operating system and software on the cluster in a scalable manner. The third article will cover the configuration of tools and services that transform the loose collection of computers to a tightly integrated cluster.

Cluster Hardware

Many hardware components go into building your cluster. These components can be broken down into six general categories: the computers themselves, networking supplies, physical storage of the computers, power distribution equipment, a console to access the computers and spare parts.

Computers

Clusters can be constructed from generic tower computers, but for large clusters, computers specifically designed for high-performance computing can be purchased from suppliers like Supermicro. These systems are preferable to tower PCs because they come in high-density packages, such as 1 or 2 U rack-mountable cases (a U is 1.75" of vertical rack space). In either case, the computers must work well with Linux. For scalability, the compute nodes need to support PXE booting and IPMI control.

The head node typically is more capable than the compute nodes since it is the entry point for the entire cluster. It should support RAIDed hard drives (more on that later) and must have at least two Ethernet ports, one to connect to the outside world and the other for the isolated internal compute node network.
Compute nodes ideally should be small, cheap and plentiful. Depending on the application of the cluster, they could have any combination of powerful CPUs, large amounts of RAM, large hard disks or GPUs.

Networking

Your network switch needs to have at least as many ports as you have compute nodes. Extra ports always are handy in case you decide to add to your cluster in the future. Network cables also are essential.

Physical Configuration

A cluster can be constructed from tower PCs on a shelf; however, a professionally built cluster typically will use special rack-mounted computers. Computers often will be on rails allowing you to slide them out far enough to remove the lid without physically detaching them from the rack. It is advisable to leave enough slack in the cables on the backs of the computers so they can be running while pulled out for diagnostic purposes.

An important consideration is the location for the cluster. A cluster can be rather noisy due to the fans, so put it in a place where you won't mind some extra white noise. A cluster also can generate a lot of heat. If it's large, you'll need ventilation or a dedicated air conditioning unit.

Power Distribution

Computers draw a lot of power, and lots of computers draw lots of power. The circuits they're on must be able to handle the draw, and you'll need power strips to distribute the power. If your cluster is small, a few consumer-grade power strips should be adequate. Otherwise, large rack-mountable power strips exist that report current.

Additionally, the head node and storage unit should be plugged in to an uninterruptible power supply (UPS), so that they don't immediately halt on a power outage, potentially corrupting data.

Access

It isn't practical to have a keyboard, video monitor and mouse (KVM) for every node in the cluster. It is a good idea, however, to have a local KVM hooked up to the head node. This guarantees that you always will be able to access your cluster to perform administrative tasks. There are specialty products, such as a rack-mountable LCD monitor and keyboard, that can serve this purpose well.

Once the cluster is set up, you will be able to access compute nodes using SSH from the head node. Under normal operation, the nodes then can be headless (operate without a KVM). Under abnormal operation, such as when initially setting up the cluster or when diagnosing a problem, you can access the compute nodes using a crash-cart, which is a mobile KVM with long cables that you can plug in to whatever node is having troubles. Another option is a KVM switch. These switches can use standard IO cables (such as USB and VGA), or they can work entirely over IP if the computers' BIOSes support it. The more nodes involved, the pricier the KVM and cables will be.

Spares

Stuff breaks. When you have a lot of stuff (as in a cluster), it breaks often. For example, let's say that you have 100 hard disks among all of the computers in your cluster, and each hard disk is rated to operate for 20 years. This is an annual failure rate of 5%, so you can expect five of them to fail in a year, or roughly one every ten weeks. This analysis applies to all computer components, meaning that in addition to spare hard disks, it's also a good idea to purchase spare RAM, power supplies and possibly even motherboards and CPUs. In a large cluster, it's wise to have spare networking equipment, and in a production environment, an entire spare head node. Spare parts for compute node repairs are not as necessary since dysfunctional nodes simply can be taken offline or cannibalized for parts.

In summary, the parts needed to build your own cluster are as follows:

Head node (optionally a storage node and a spare as well).
RAID storage (integrated directly into the head node or storage node, or as a separate device).
Compute nodes.
Networking switch(es).
Networking cables.
Rack and mounting hardware.
Power strips.
Uninterruptible power supply.
KVM switch and cables.
Spare parts kit (hard drives, RAM, power supplies).

Network Setup

Once you settle on hardware, you need to plan how to connect it up. Let's start with communication. First, give your cluster a name. Names are used as aliases for IP addresses, making it much easier for a human to identify individual computers on a network. A cluster computer uses two different networks: the external network (aka the internet) that only the head node connects to, and the internal network that the cluster uses for internal communication. Therefore, two names must be configured, one for the external network and one for the internal network:

External network: this is used only by the head node. The name on the external network typically is formatted as hostname.domain.suffix, where the hostname is whatever you want, and the domain.suffix pertains to the organization using the cluster. The example used in this guide is name.university.edu.
Internal network: this is used by all nodes in the cluster. The internal name is typically the hostname component from the external name used in conjunction with a numbering scheme. For example, we append two digits to the end of the hostname for each node: name00 (head node), name01 (first compute node) and so on. This scheme limits us to 100 nodes, but it easily can be expanded to accommodate future upgrades.

Naming computers is vital for humans to be able to maintain the cluster, but the computers themselves deal with numeric IP addresses. The method for obtaining an IP address on the external network is up to your network administrator, but you get full reign over the internal network. Two methods exist for assigning IP addresses in the internal network: static and dynamic assignment.

Static assignment: each compute node is configured individually with its own IP address. This contradicts the scalability goal of this guide, because manually configuring IP addresses for a large number of nodes is not practical.
Dynamic assignment: each compute node has an identical configuration and receives its IP address from the head node through the network based on its unique MAC address. This guide uses dynamic IP assignment.

Storage Node

So far in our description of a cluster, we have mentioned a single head node that acts as an access point to the cluster, along with many compute nodes to perform the tasks the cluster receives. Many large clusters will separate out tasks further, especially if the head node becomes a bottleneck for cluster operation. For example, it is common to have a separate storage node to manage the files to which the compute nodes need access, such as application software and each user's home directory.

Disk Partitioning

As opposed to Windows, where partitions are referred to as lettered drives, in Linux, they are mounted under directories called “mount points” in the filesystem. Partitions are useful for keeping data, applications and system software separate for easy backups and re-installations. This section highlights useful Linux partitions assumed in this article:

root (/) — This partition is where the actual operating system resides, and other partitions will be mounted in its filesystem.
/boot — The files Linux uses to boot, including the kernel itself, are located here. For mostly historical reasons, some administrators prefer to keep this on a separate partition, but we will keep them on the same partition as root.
/admin — Disk images, software distributions, kickstart files and backups are stored here. This is vital for the installation of all compute nodes in a scalable way.
/home — User files are located here. We will make this a separate partition from root for ease of backups, upgrading and re-installations.
/export — System-wide application software to be run on compute nodes is stored here. Although sometimes a partition of its own, it can be subsumed under /home/export instead.
/scratch — Hefty computations like those done on clusters often involve writing large temporary files to the hard disk over the process of the computation, then reading those files to complete a result. It is recommended to have a large partition on all compute nodes set aside for this purpose.
swap — Swapping is the process by which, should Linux run out of memory, it writes pages of memory to the swap partition on the hard disk. This can result in allowing memory-intensive software to run on systems with too little RAM. However, swapping is orders of magnitude slower than using RAM. Perhaps a decade ago when RAM was expensive, it would have been advisable to have a large swap partition. But now that RAM is cheap, it is best not to swap at all but instead buy more RAM if memory constraints become a problem, or use software that is designed to use the /scratch partition.

If your nodes use multiple disks, you will have the choice of which ones to use for which partitions. By convention, the root partition should go on the first hard disk, but the rest is up to you. Tables 1 and 2 are examples of single disk partitioning schemes using 1TB hard drives.

Table 1. Head Node

/ (includes boot)	200GB
/admin	200GB
/home (includes export)	rest of space

Table 2. Compute Node

/	200GB
/scratch	rest of space

RAID Devices

When storing large amounts of data, it is highly recommended to utilize a RAID (Redundant Array of Independent Disks) device. This may be integrated directly into the head node, a separate component connected to the head node or part of a separate storage node.

A RAID device works by combining several small physical drives to form one larger, faster virtual drive. A RAID device also can introduce data redundancy, which allows a drive to fail while still preserving the data. This is absolutely vital in a production environment. As mentioned earlier, with many components comes frequent component failures. A drive on a compute node failing isn't the end of the world, because there's nothing on it that can't be re-installed, so clusters often will use a single hard drive for each compute node. However, losing all of the cluster's application or user data would be a disaster, making RAIDing of head node partitions a must.

Several commonly used RAID levels achieve increased size and speed, redundancy or both of these goals.

RAID 0 provides a storage size and performance increase by “striping” data across two or more drives. This means that consecutive data segments are stored on different disks. This may improve read time significantly in some applications; however, one failed drive causes all of the data to be lost. A RAID 0 drive is as large as the size of its smallest drive times the number of drives.
RAID 1 provides redundancy with no storage size or performance increase by “mirroring” data writes to two or more disks, allowing one to go down while still preserving the data. The size of a RAID 1 drive is the same as its smallest drive.
RAID 5 is similar to RAID 0 except that it includes redundant parity information spread across the three or more disks. This allows any one disk to fail without the loss of data. The RAID as a whole will store as much as the smallest drive times one less than the number of drives.
RAID 6 is similar to RAID 5 except that it has two disks worth of redundant parity information spread across four or more drives. This allows for two disks to fail without the loss of data. Storage will be limited to the size of the smallest drive times two less than the number of drives.

Hot Spares are blank drives included with a RAID 5 or 6 device. When one drive fails, the RAID device rebuilds the information formerly on that drive onto the hot spare using the redundant information spread across the other drives. If hot spares are not included, this process would begin only when you manually swap out the failed disk. If you happen to be on vacation and can't get a friend to perform this task, you run the risk of additional drives failing and resulting in data loss before you return.

No matter how many hot spares you provision yourself with, your data isn't 100% safe from disk failures wiping it out. Using a RAID device may make the likelihood of losing your data smaller, but it won't eliminate it. Therefore, if your data is at all important (you're going through the effort of building a cluster in order to obtain it, so it is), make sure you have access to another machine in a different physical location to which you can back up files.

Assembly

After you have gathered all your hardware and planned the configuration, you can begin the fun part of cluster work: actually assembling your cluster. Arrange your nodes for good air flow. When running cables, make sure to use differently colored cables for the internal and external networks, and label them on both ends. It can be both time-consuming and frustrating to track down a problem only to find that it was caused by a swapped network cable. If things are kept consistent among nodes, your life will be much easier when it comes to managing your cluster. Ideally when starting out, your compute nodes all should be identical, both in terms of internal hardware and external cable configuration. In our experience, however, this ideal is seldom maintained over the long-term as the cluster expands or specialized capabilities are added.

Maintenance

Scalability for a cluster requires that it is easy to set up all of the compute nodes with identical configurations. This ability is useful in several scenarios: to install the cluster initially, to re-install a node for maintenance, or to add new nodes to the cluster.

There are two methods for achieving this goal. The first is to perform a complete installation on a single node, save a disk image and write that image to all other nodes. Unfortunately, this strategy results in a loss of support for heterogeneity. If you desire to add nodes of a different architecture than what's already in the cluster, you'd be forced to start from scratch in installing them.

The other method is to script all the changes to the operating system so that they can be applied during an automated install. Such installation scripts typically record basic settings similar to those that would be configured in the system installer, a list of software packages to install beyond the initial system and a script to handle all other modifications.

This installation method solves the issue of heterogeneity in that the installer handles the choice of software, allowing the same script to be used on multiple architectures assuming that all requested software is packaged for the different architectures. Furthermore, a script containing an exhaustive list of modifications from a clean install is an excellent resource when diagnosing future issues or for performing an operating system upgrade. Most major distributions support this method. On CentOS, it is called kickstart.

In practice, a combination of these two methods is used. For example, scripts are used to build the cluster, and an image can be used to replace a bad hard disk on a compute node.

System Software

In Part II of this series we will dive into installing CentOS using kickstart. This procedure involves performing a single manual installation to generate the base kickstart file, then iteratively making modifications until the operating system installs on the head node and compute nodes without manual intervention. The end result will be a functional operating system on every node with networking in place.

In Part III, we will describe the installation and configuration of software that makes these networked computers function as an integrated cluster. This software includes DHCP for IP addresses assignment, NFS to share filesystems over the internal network, passwordless SSH between all nodes, a suite of administrative tools, a local software repository for supplying RPMs to compute nodes, Ganglia for monitoring the cluster and SLURM as a resource manager.

Application Software

When building a cluster, it's important to know what you're going to do with it. You already should have done some research to be sure that software exists to achieve your goal. For example, we built a high-throughput computational chemistry cluster that runs quantum chemical programs and molecular dynamics simulation software. This requires compute nodes with multicore CPUs, GPUs, large amounts of RAM and significant scratch space.

An important consideration is licensing of the application software that will run on the cluster. For example, there are many free or open-source computational chemistry programs for which licensing isn't a problem, such as MOPAC, GAMESS and ORCA. One can purchase a site license for commercial software, such as Gaussian, and use it across the cluster. Other commercial programs like QChem require being keyed to the specific nodes on which they will be running.

Conclusion

In this article, we discussed the considerations that go into designing a cluster computer. To start, we outlined the design goals, a vital one being that the setup is scalable to accommodate any number of nodes without their installation and administration becoming impractical. We then discussed the hardware that goes into the cluster, including the computers, networking equipment, physical storage, power distribution, access and spare parts. We also designed a disk partitioning scheme for the head node and compute nodes that allows for easy backups, upgrades and re-installations. We described the networking of a cluster, including an external network connection and an isolated internal network. Finally, we discussed the physical assembly of the cluster, introduced the importance of maintenance and touched on the cluster's application.

In the next article, we will install the base operating system and set up network connectivity. In the process, we will create two kickstart files, one for the head node and the other for the compute nodes. In the third article, we will turn the group of computers into a single cluster by configuring vital system services to communicate and run cluster software.