Transform a collection of networked computers into a tightly integrated cluster.
In Part I of this series, we covered designing and assembling a computer cluster, and in Part II, we described how to set up a scalable method to perform reproducible installations across an entire cluster. If you've been following along, at this point you should have a cluster with Linux (CentOS 7) installed on every node and a network that allows SSH between nodes as root. Additionally, you should have kickstart files that make it possible to re-install the entire cluster in a scalable manner.
The next step is to configure various Linux services to convert this collection of networked computers into a tightly integrated cluster. So in this article, we address the following:
Firewalld: a firewall on the head node to protect the cluster from external threats.
DHCP: a more in-depth look at assigning IP addresses to compute nodes in a deterministic manner.
/etc/hosts: the file that maps node names to IP addresses.
NFS: share additional filesystems over the network.
SSH/RSH: log in to compute nodes without providing a password.
btools: scripts to run administrative tasks on all nodes in the cluster.
NTP: keep the cluster's clock in sync with the Network Time Protocol.
Yum local repository: install packages to compute nodes without traffic forwarding.
Ganglia: a cluster monitoring suite.
Slurm: a resource manager for executing jobs on the cluster.
Each of the following sections provides a description of the service, how to configure it for use in a cluster and some simple use cases as appropriate. Installing these services converts the networked computers into a useful, functioning cluster.
Firewalld
Firewalld is an abstraction layer for netfilter and is the default firewall for CentOS. Only the head node needs a firewall, because it is the only node in direct contact with the outside world. The compute nodes should already have had their firewalls disabled by their kickstart files so that they don't interfere with internal communication.
See the September 2016 LJ article “Understanding Firewalld in Multi-Zone Configurations” (www.linuxjournal.com/content/understanding-firewalld-multi-zone-configurations) for an in-depth description of how firewalld works. Provided here are a few commands to set up a basic firewall that allows SSH and HTTP from your local institution, drops traffic from the rest of the world and allows all traffic on the internal network:
# firewall-cmd --permanent --zone=internal --add-source=[IP/MASK OF YOUR INSTITUTION]
# firewall-cmd --permanent --zone=internal --remove-service=dhcpv6-client
# firewall-cmd --permanent --zone=internal --remove-service=ipp-client
# firewall-cmd --permanent --zone=internal --add-service=ssh
# firewall-cmd --permanent --zone=internal --add-service=http
# firewall-cmd --permanent --zone=public --remove-service=ssh
# firewall-cmd --permanent --zone=public --remove-service=dhcpv6-client
# firewall-cmd --permanent --zone=public --set-target=DROP
# firewall-cmd --permanent --zone=public --change-interface=[EXTERNAL INTERFACE]
# echo "ZONE=public" >> /etc/sysconfig/network-scripts/ifcfg-[EXTERNAL INTERFACE]
# firewall-cmd --permanent --zone=trusted --change-interface=[INTERNAL INTERFACE]
# echo "ZONE=trusted" >> /etc/sysconfig/network-scripts/ifcfg-[INTERNAL INTERFACE]
# nmcli con reload
# firewall-cmd --reload
In the above, [IP/MASK OF YOUR INSTITUTION] is the network address and netmask for your school or business formatted as 123.45.67.0/24. [EXTERNAL INTERFACE] is the interface connected to the outside world, such as eno1. [INTERNAL INTERFACE] is the interface connected to the internal compute node network, such as eno2.
As with all of the services you will configure, these changes should be added to your ever-growing kickstart file. Note that when adding firewalld commands to your kickstart file in the %post section, you need to use firewall-offline-cmd instead of firewall-cmd. In addition, --remove-service will have to be changed to --remove-service-from-zone, and --permanent is not necessary. For example, the command:
firewall-cmd --permanent --zone=internal --remove-service=ipp-client
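Applying those substitutions, the kickstart version of the same command becomes:

firewall-offline-cmd --zone=internal --remove-service-from-zone=ipp-client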
DHCP
Sooner or later, a compute node will have problems. To make it easy to tell which physical machine is which, you need to configure DHCP to hand out IP addresses based on hardware MAC addresses.
The following procedure logs the MAC addresses of machines that receive an IP address over DHCP in the order requested, then uses this information to set up DHCP so that each node gets the same IP address (and thus the same node name) every time.
To associate IP addresses with specific computers, complete the following steps.
1) If it exists, delete the file /var/lib/dhcpd/dhcpd.leases. Whether or not it existed, create it as a new file:
# touch /var/lib/dhcpd/dhcpd.leases
2) Power off all compute nodes.
3) Power on the nodes in order.
4) On the head node, view the file /var/lib/dhcpd/dhcpd.leases. This should contain entries with the IP and MAC addresses of recently connected compute nodes.
5) Append entries to the very bottom of /etc/dhcp/dhcpd.conf for each node, associating each node number with the order in which its MAC address appears in the file. For example:
host name01 {
    hardware ethernet 00:04:23:c0:d5:5c;
    fixed-address 192.168.1.101;
}
To automate this task, use the following commands, either run from the command line or in a script. Tweak as necessary for your naming scheme:
#!/bin/sh
index=1
for macaddr in `cat /var/lib/dhcpd/dhcpd.leases | grep 'ethernet' | sed 's/.*hardware ethernet //'`; do
    printf 'host name%.2d {\n\thardware ethernet %s\n\tfixed-address 192.168.1.1%.2d;\n}\n' "$index" "$macaddr" "$index"
    index=`expr $index + 1`
done
This script will dump the output to the terminal where it can be copy/pasted into dhcpd.conf. If you are running a terminal that doesn't support copy/paste, you can instead redirect the output.
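For example, if you saved the script as /root/maclist.sh (the name and location are arbitrary), you could append its output directly to the DHCP configuration and restart the service:

# sh /root/maclist.sh >> /etc/dhcp/dhcpd.conf
# systemctl restart dhcpd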
6) Reboot the cluster and ensure that this change takes effect. As with all configuration changes described in this article, make sure this revised dhcpd.conf file finds its way into the head node kickstart file.
/etc/hosts
The /etc/hosts file is used to pair hostnames with IP addresses. The general format is this:
XXX.XXX.XXX.XXX hostname.domain shorthostname
The shorthostname field is optional, but it saves typing in many situations. Make sure that the first line is as follows, since it is required for the proper function of certain Linux programs:
127.0.0.1 localhost.localdomain localhost
After this line will be a mapping for the head node's fully qualified domain name to its external IP address:
123.45.67.89 name.university.edu name
The rest of the file will contain mappings for every node on the internal network. For example, a cluster with a head node and two compute nodes will have a /etc/hosts as follows:
127.0.0.1     localhost.localdomain localhost
123.45.67.89  name.university.edu   name
192.168.1.100 name00
192.168.1.101 name01
192.168.1.102 name02
Remember that 123.45.67.89 and 192.168.1.100 both correspond to the same machine (the head node), but over different network adapters. The hosts file should be identical on the head node and compute nodes. As always, add these modifications to your kickstart files.
NFS
NFS (Network File System) is used to share files between the head node and the compute nodes. We already explained how to configure it in the previous article, but we cover it in more detail here. The general format of the /etc/exports configuration file is:
/exportdir ipaddr/netmask(options)
To configure NFS on your cluster, do the following steps.
1) Modify /etc/exports on the head (or storage) node to share /home and /admin:
/home  192.168.1.100/255.255.255.0(rw,sync,no_root_squash)
/admin 192.168.1.100/255.255.255.0(ro,sync,no_root_squash)
This allows read/write access to /home and read-only access to /admin.
2) Restart NFS on the head node:
# systemctl restart nfs
3) Import the shares on the compute nodes by appending the following lines to /etc/fstab:
name00:/home  /home  nfs rw,hard,intr 0 0
name00:/admin /admin nfs ro,hard,intr 0 0
name00 is the name of the head node, and /home is one of the shares defined in /etc/exports on the head node. The second /home is the location to mount the share on the compute node, and nfs lets Linux know that it should use NFS to mount the share. The remaining items on the line are options that specify how the mountpoint is treated.
4) The shares will be mounted on the compute nodes automatically on boot up, but they can be mounted manually as follows:
# mount /home
# mount /admin
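If a mount fails, it can help to check from a compute node that the head node is actually exporting the shares (showmount is part of the nfs-utils package):

# showmount -e name00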
As always, once you have tested your configuration using one or two compute nodes, modify both kickstart files so that you don't have to apply it to all of them manually.
SSH/RSH
SSH (Secure SHell) and RSH (Remote SHell) both allow access between nodes. In most situations when using Linux, SSH should be used because it uses encryption, making it much more difficult for a malicious person to gain unauthorized access. Trusted networks, like the internal network in the cluster, are exceptions to this rule, because everything is under the protective shelter of the head node. As such, security can and should be more relaxed. It's all one big machine, after all.
Since SSH was built with security in mind, connections require a heftier authentication overhead than is the case for RSH. For this reason, many people building high-performance clusters that use multiple nodes to run a given job will favor RSH for its low-latency communication. However, this can be a moot point. When using parallelization software, such as OpenMPI, SSH or RSH is used only to start jobs, and OpenMPI handles the rest of the communication.
Many nerd wars have been fought about using SSH vs. RSH in a cluster. This isn't the place to duke them out, so we provide instructions for both SSH and RSH in this section. For the rest of this article, we'll assume that you chose the SSH route, but with some minor modifications, you can make everything work with RSH as well.
By default, SSH requires a password in order to access another machine. However, it can be configured to use RSA keys instead. By doing this, you will be able to ssh between machines without using passwords.
This step is absolutely vital since most cluster software, such as Slurm (addressed later in this article), assumes passwordless SSH for communication. Additionally, cluster administration can be a pain when constantly juggling passwords around. Once administrators or users have access to the head node, they should be able to access any other node within the cluster without providing any additional authentication.
For each user that you desire to have passwordless SSH (this includes root), complete the following steps.
1) Start by generating RSA keys. As the user for which you are setting up passwordless SSH, execute the command:
ssh-keygen
Accept the default location (~/.ssh/id_rsa), and leave the passphrase empty. If you supply a passphrase, the key will be encrypted and a passphrase will be necessary to use it, which defeats the purpose.
2) Copy the contents of id_rsa.pub to authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
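Depending on your umask, you also may need to tighten permissions, since sshd silently ignores keys when ~/.ssh or authorized_keys is writable by group or others:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys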
When you set up passwordless SSH for root, you will need to share /root/.ssh over NFS to the compute nodes. Since /home was shared in the NFS section of this article, this does not need to be done for regular users. Edit /etc/exports, and make /root/.ssh a read-only NFS share. On a single compute node (for testing purposes), add /root/.ssh to /etc/fstab and mount it. You now should be able to ssh as root in both directions between the head node and the compute node.
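Following the same patterns used for /home and /admin above, the new /etc/exports line on the head node would look something like this:

/root/.ssh 192.168.1.100/255.255.255.0(ro,sync,no_root_squash)

and the corresponding /etc/fstab entry on the compute node:

name00:/root/.ssh /root/.ssh nfs ro,hard,intr 0 0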
You may have noticed something annoying though. The first time you ssh'd from the head node to a compute node as root, whether just now or earlier on in your cluster building, SSH complained that the authenticity of that compute node couldn't be established and prompted you to continue. You likely said “yes” and went on your merry way, and SSH hasn't complained to you since. However, now it's upset again, and every time you ssh from a compute node back to the head node, you get something like this:
The authenticity of host 'name00 (192.168.1.100)' can't be established.
ECDSA key fingerprint is 01:23:45:67:89:ab:cd:ef:01:23:45:67:89:ab:cd:ef.
Are you sure you want to continue connecting (yes/no)? yes
Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).
The reason adding the host fails (causing you to go through this dialog every time you connect) is because the NFS share is read-only. One way to fix this is to make it read/write, but that solves only half the issue. With that solution, you'd still have to accept this message manually for every node, which is unacceptable for scalability. Instead, edit /etc/ssh/ssh_config on all nodes, and after the line that looks like Host *, insert the following:
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
LogLevel error
If you omit the log level modification, SSH will warn you every time you connect that it's adding the host to the nonexistent (/dev/null) list of known hosts. Although this isn't a major problem, it can fill system logs with non-issues, making debugging future problems more difficult.
When configuring the root user, you may need to edit /etc/ssh/sshd_config on all nodes to allow passwordless root logins. Make sure it has the following values:
PermitRootLogin without-password
PubkeyAuthentication yes
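For this change to take effect, restart sshd on each node after editing the file:

# systemctl restart sshd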
Remember to continue adding these changes to your kickstart files!
If you want to use RSH instead of SSH because it doesn't needlessly encrypt communication, do the following steps on each node in the cluster.
1) Install RSH. On the head node, you can do so using yum:
# yum install rsh rsh-server
On the compute nodes, you will have to add these to the packages section of the kickstart file and re-install.
2) Append the following lines to /etc/securetty:
rsh
rexec
rsync
rlogin
3) Create the file /root/.rhosts where each line follows the pattern [HOSTNAME] root. For example:
name00 root
name01 root
name02 root
4) Create the file /etc/hosts.equiv, in which each line is a hostname. For example:
name00
name01
name02
5) Enable and start the sockets:
# systemctl enable rsh.socket
# systemctl enable rexec.socket
# systemctl enable rlogin.socket
# systemctl start rsh.socket
# systemctl start rexec.socket
# systemctl start rlogin.socket
6) Now you should be able to access any computer from any other computer in the cluster as any user (including root) by executing:
$ rsh [HOSTNAME]
Add these changes to your kickstart files.
btools
Btools are a set of scripts used to automate the execution of commands and tasks across all compute nodes in a cluster. While “atools” would be the commands themselves typed in by some poor system administrator, and “ctools” (which actually exist) are part of a complex GUI cluster management suite, btools are short scripts that fit somewhere in the middle. The listing of btools (bsh, bexec, bpush, bsync) and required support files (bhosts, bfiles) follows. These should be located on the head node. Feel free to modify them as needed.
A few support files are required for the rest of the tools to function. bhosts (Listing 1) contains the list of hostnames of all compute nodes, allowing tools that perform operations across the cluster to iterate over these hosts. bfiles (Listing 2) contains the names of files that define the users on the system. If a script copies these files from the head node to all compute nodes, any users on the head node will be recognized on the compute nodes as well.
bsh (Listing 3) loops through the hosts in bhosts, executing ssh <some command> for each. Another tool, bexec (Listing 4), is similar to bsh in that it executes commands over all nodes, but it executes them in parallel. While bsh waits for the first node to finish before moving on to the second, bexec gets them all started, then collects and displays the logs for the operations.
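To give a sense of how simple these tools are, bhosts is just a list of compute node hostnames, one per line:

name01
name02

and a minimal bsh along the lines of Listing 3 (assuming bhosts also lives in /usr/local/sbin) is little more than a loop over that file:

#!/bin/sh
# bsh: run the given command on every compute node, one node at a time
for host in `cat /usr/local/sbin/bhosts`; do
    echo "***** $host *****"
    ssh "$host" "$@"
done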
Besides executing commands, it is often useful to copy files to all nodes. bpush (Listing 5) copies a file to all compute nodes. Like bexec, it runs on all nodes in parallel, then collects and displays the logs.
Finally, bsync (Listing 6) copies the files defined in bfiles to all compute nodes. This causes all users on the head node to be users on the compute nodes as well.
Each of the executable btools (bsh, bexec, bpush and bsync) needs to be made executable before it can be run from the command line:
# chmod +x /usr/local/sbin/tool_name
This collection of btools is useful for a variety of administrative tasks. For example, you can use bsh to run a command quickly on all nodes, perhaps to check that they all have access to an NFS share:
# bsh ls /admin
A word of caution: bash doesn't pass some things, such as redirection and pipe operators (>, <, |), as parameters to executables. If you want to use bsh to echo "something" >> /some/file on each compute node, you need to use quotes, like this:
# bsh 'echo "something" >> /some/file'
For tasks that take more computation time, such as installing software, use bexec. For example, when you get a yum local repository set up, you can use bexec to install software from it:
# bexec yum -y install package_name
bpush is used to copy a file to all nodes. For example, if you want to test how a configuration works on all compute nodes without putting it in a kickstart and re-installing the cluster, simply bpush it to all nodes and reload the relevant service:
# bsh mv /etc/service/service.conf /etc/service/service.conf.000
# bpush /path/to/service.conf /etc/service/service.conf
# bexec systemctl restart service
Finally, every time you add a user to the cluster by using useradd on the head node, make sure to run bsync so that the new users have access to all nodes:
# useradd user_name
# passwd user_name
# bsync
NTP
NTP is a protocol used by chrony to synchronize the time on a computer with some external source. It should be set up on the head node to synchronize with the outside world and on the compute nodes to synchronize with the head node. It is vital that the entire cluster is in agreement on the time, so there aren't any errors regarding timestamps on files shared over the internal network.
The following steps set up the head node as a chrony server for the compute nodes.
1) Find a working timeserver (if your institution has its own, that's the best) and add it to /etc/chrony.conf on the head node in the following format:
server 192.43.244.18 #time.nist.gov
The comment at the end of the server line is purely optional but can be helpful when looking at the file. If you have chosen more than one server, you may add them in a similar fashion.
2) Comment out the default CentOS pool server lines on both the head node and compute nodes, as they will interfere with the configuration:
server X.centos.pool.ntp.org iburst
3) Allow your compute nodes to be clients of the head node's NTP server by uncommenting the line on the head node:
allow 192.168/16
4) Set a compute node to use the head node as its time server by adding the following line to /etc/chrony.conf on the compute node:
server 192.168.1.100 # head node
5) Enable and start chrony on both head and compute nodes:
# systemctl enable chronyd
# systemctl start chronyd
6) After a few minutes have passed and NTP has had a chance to synchronize, execute the following commands to test your NTP configuration:
# chronyc tracking
# chronyc sources -v
The -v option on the sources command prints cleverly formatted explanations for each column in the output. Together, these commands print out debugging information about the time server connection. They can be run on the head node to show info about your external time server(s) or on a compute node to print information about the time server on the head node.
This may sound like a broken record, but be sure to add all this to the kickstart files.
Yum Local Repository
Creating a local repository on the head node is useful for installing software that isn't available from the installation media and installing updates to the compute nodes. Although this could be achieved in a pinch with traffic forwarding, that method not only poses a security risk but it also bogs down the external network with redundant requests for the same software from each compute node. Instead, it is possible to download some software packages to the head node once, then set up the head node as a yum repository for the compute nodes to access.
The following procedure sets up the head node as a yum server and mirrors a repository onto the head node.
1) Prepare a good spot to store a CentOS repository:
# mkdir -p /admin/software/repo/
2) Sync with a mirror:
# rsync -azHhv --delete some_mirror.org::CentOS/7* /admin/software/repo/
Note the syntax for specifying the folder to be synchronized. If the folder is rsync://mirrors.liquidweb.com/CentOS/7, it would be written as mirrors.liquidweb.com::CentOS/7.
This should use about 40GB of disk space and take several hours. You can use the same command to update your local copy of the repo. Updates will proceed much more quickly than the first copy.
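If you want the mirror to stay current without manual intervention, you can wrap the same rsync command in a cron job on the head node; the schedule and mirror below are placeholders:

# /etc/cron.d/repo-sync: re-sync the local CentOS mirror every Sunday at 2am
0 2 * * 0 root rsync -azHh --delete some_mirror.org::CentOS/7* /admin/software/repo/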
3) Edit the yum configuration files for both the head node and compute nodes in /etc/yum.repos.d/, so that the head node will use itself as the update server and the compute nodes will use their NFS mount of /admin as the update server. Change the line:
#baseurl=http://mirror.centos.org/centos/$releasever...
and comment out all mirrorlist lines. You can automate this with the following sed commands:
# sed -i.000 "s|#baseurl=http://mirror.centos.org/centos|baseurl=file:/admin/software/repo|" /etc/yum.repos.d/*.repo
# sed -i "s|mirrorlist|#mirrorlist|" /etc/yum.repos.d/*.repo
Perform this on the head node and compute nodes (using bsh). It also should happen in their kickstarts. If you have installed additional repos (such as epel on the head node for Ganglia), make sure not to modify their .repo files; otherwise yum will have trouble finding those repos on your hard drive!
4) Update the head node and compute nodes using the head node as a software source:
# yum -y upgrade
# bexec yum -y upgrade
Create Your Own Repo
There are times when you will need to install software packages to your compute nodes that are not available in the repo that you cloned—for example, Ganglia from epel. In this case, it is not efficient to clone the entire epel repository for only a few software packages. Thus, the possible solutions are either to download the rpm files to the head node, bpush them to the compute nodes and bexec rpm -ivh /path/to/software.rpm, or to create your own specialized repository and install them using yum.
To create your own repository, complete the following steps.
1) Create a directory on the head node to house the repo:
# mkdir /admin/software/localrepo
2) Download some RPMs to populate your local repo. For example, get the Ganglia compute node packages from epel, a third-party repository:
# cd /admin/software/localrepo
# yum install epel-release
# yumdownloader ganglia ganglia-gmond ganglia-gmetad libconfuse
3) Create the repo metadata:
# createrepo /admin/software/localrepo
4) Create /etc/yum.repos.d/local.repo containing the following:
[local]
name=CentOS local repo
baseurl=file:///admin/software/localrepo
enabled=1
gpgcheck=0
protect=1
5) Push local.repo to your compute nodes:
# bpush /etc/yum.repos.d/local.repo /etc/yum.repos.d/
6) Install the desired software on the compute nodes:
# bexec yum -y install ganglia-gmond ganglia-gmetad
7) If you want to add software to the local repo, simply repeat steps 2 and 3. Note that yum keeps a cache of available software, so you may need to run the following before you can install added software:
# bexec yum clean all
Be sure that the modifications to the files in /etc/yum.repos.d/ are added to your kickstart files. The other changes occur in /admin, which remains unchanged during a re-install.
Ganglia
It's important to be able to monitor the performance and activity of a cluster to identify nodes that are having problems or to discover inefficient uses of resources. Ganglia is a cluster monitoring software suite that reports the status of all the nodes in the cluster over http.
Before installing Ganglia, you must have httpd installed and have the ganglia, ganglia-gmond, ganglia-gmetad and libconfuse packages available from the local repository.
Complete the following steps to install Ganglia on the cluster.
1) Install httpd:
# yum install httpd
# systemctl enable httpd
# systemctl start httpd
2) On the head node, install ganglia-gmetad, ganglia-gmond and ganglia-web:
# yum -y install ganglia ganglia-gmetad ganglia-gmond ganglia-web
3) On the compute nodes, install ganglia-gmetad and ganglia-gmond:
# bexec yum -y install ganglia ganglia-gmetad ganglia-gmond
4) On the head node, back up and edit /etc/ganglia/gmond.conf such that:
In the cluster block, name is something you will recognize.
In the udp_send_channel block, mcast_join is commented out and the line host = 192.168.1.100 (internal IP of head node) is added.
In the udp_recv_channel block, mcast_join and bind are both commented out.
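After those edits, the relevant blocks of gmond.conf on the head node should look roughly like the following (the cluster name is illustrative, and the default multicast address and port shipped with your version may differ):

cluster {
  name = "mycluster"   # anything you will recognize
  # owner, latlong and url can stay at their defaults
}
udp_send_channel {
  #mcast_join = 239.2.11.71
  host = 192.168.1.100
  port = 8649
  ttl = 1
}
udp_recv_channel {
  #mcast_join = 239.2.11.71
  port = 8649
  #bind = 239.2.11.71
}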
Push the modified gmond.conf file to all compute nodes:
# bpush /etc/ganglia/gmond.conf /etc/ganglia/
5) Enable and start gmetad and gmond:
# systemctl enable gmetad
# systemctl start gmetad
# systemctl enable gmond
# systemctl start gmond
# bexec systemctl enable gmetad
# bexec systemctl start gmetad
# bexec systemctl enable gmond
# bexec systemctl start gmond
6) In /usr/share/ganglia/conf.php, between the <?php and ?> tags, add the line:
$conf['gweb_confdir'] = "/usr/share/ganglia";
7) Edit /etc/httpd/conf.d/ganglia.conf such that it reflects the following:
Alias /ganglia /usr/share/ganglia
<Directory "/usr/share/ganglia">
    AllowOverride All
    Require all granted
</Directory>
And, restart httpd:
# systemctl restart httpd
8) Monitor your cluster from a web browser by going to http://name.university.edu/ganglia.
Be sure to update your kickstart files.
Slurm
In a cluster environment, multiple users share resources. Batch queueing systems run jobs on appropriate hardware resources and queue jobs when resources are unavailable. Some examples of batch queueing systems include PBS, Torque/Maui, SGE (now Oracle Grid Engine) and Slurm.
Slurm (Simple Linux Utility for Resource Management) is a tool for executing commands on available compute nodes. Slurm uses an authentication agent called munge.
Follow the steps below to configure and install Slurm and munge.
1) On the head node, add slurm and munge users:
# groupadd munge
# useradd -m -d /var/lib/munge -g munge -s /sbin/nologin munge
# groupadd slurm
# useradd -m -d /var/lib/slurm -g slurm -s /bin/bash slurm
Then sync them to the compute nodes:
# bsync
2) Slurm uses an authentication agent called munge, which is available from the epel repository:
# yum -y install munge munge-libs munge-devel
Munge must be installed to the compute nodes as well. See the section titled Yum Local Repository for details, then perform the installation using bexec.
3) Generate a key to be used by munge across the cluster:
# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
# chown munge:munge /etc/munge/munge.key
# chmod 400 /etc/munge/munge.key
The same key must be in the /etc/munge/ directory on all compute nodes. Although it is possible to propagate it to all of them, it can be more convenient (vital for re-installations, in fact) to make the entire directory a read-only NFS share. See the section above on NFS for details.
4) Enable and start the munge service on all nodes. Test that munge is installed correctly:
# munge -n | ssh name01 unmunge
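The enable/start step itself is just systemctl on the head node plus bsh for the compute nodes:

# systemctl enable munge
# systemctl start munge
# bsh systemctl enable munge
# bsh systemctl start munge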
5) On the head node, install the packages necessary to build Slurm:
# yum -y install rpm-build gcc openssl openssl-devel pam-devel \
    numactl numactl-devel hwloc hwloc-devel lua lua-devel \
    readline-devel rrdtool-devel ncurses-devel gtk2-devel \
    man2html libibmad libibumad perl-Switch \
    perl-ExtUtils-MakeMaker mariadb-server mariadb-devel
6) On the head node, download and build the latest Slurm distribution. In the following commands, substitute the latest version of Slurm for VERSION (at the time of this writing, the latest version was 16.05.9):
# cd /tmp
# curl -O https://www.schedmd.com/downloads/latest/slurm-VERSION.tar.bz2
# rpmbuild -ta slurm-VERSION.tar.bz2
7) Copy the built RPMs to your local repository and update its metadata:
# cp /root/rpmbuild/RPMS/x86_64/*.rpm /admin/software/localrepo
# createrepo /admin/software/localrepo
Make yum on all machines aware that this change occurred:
# yum -y clean all
# bexec yum -y clean all
8) On all the nodes, install the following packages from the local repo:
# bexec yum -y install slurm slurm-devel slurm-munge slurm-perlapi \
    slurm-plugins slurm-sjobexit slurm-sjstat slurm-seff
And on the head node only, install the Slurm database:
# yum -y install slurm-slurmdbd slurm-sql
9) On all nodes, create and set permissions on Slurm directories and log files:
# mkdir /var/spool/slurmctld /var/log/slurm
# chown slurm: /var/spool/slurmctld /var/log/slurm
# chmod 755 /var/spool/slurmctld /var/log/slurm
# touch /var/log/slurm/slurmctld.log
# chown slurm: /var/log/slurm/slurmctld.log
10) On the head node, create the configuration file /etc/slurm/slurm.conf:
ControlMachine=name
ReturnToService=1
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
# LOGGING AND ACCOUNTING
ClusterName=cluster
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
# COMPUTE NODES
NodeName=name[01-99] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=name[01-99] Default=YES MaxTime=INFINITE State=UP
This file must be the same across all nodes. To ensure that it is, make the /etc/slurm directory an NFS share like you did for munge. Alternatively, you can place this file in /home/export/slurm (which already is an NFS share) and symbolically link it to /etc/slurm on all nodes:
# ln -s /home/export/slurm/slurm.conf /etc/slurm/slurm.conf
# bsh ln -s /home/export/slurm/slurm.conf /etc/slurm/slurm.conf
We didn't use this trick when installing munge, because munge checks that it's reading a regular file rather than a symbolic link.
11) On the head node, enable and start slurmctld:
# systemctl enable slurmctld
# systemctl start slurmctld
And on the compute nodes, enable and start slurmd:
# bsh systemctl enable slurmd
# bsh systemctl start slurmd
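Before submitting anything, you can check that the compute nodes have registered with the controller and appear in the debug partition:

$ sinfo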
12) Test the system. First submit an interactive job on one node:
$ srun /bin/hostname
Then submit a batch job. First, in a home directory, create the file job.sh:
#!/bin/bash
sleep 10
echo "Hello, World!"
And execute the file using the command:
$ sbatch job.sh
Before ten seconds pass (the amount of time this sample job takes to run), check the job queue to make sure something's running:
$ squeue
It should display something like this:
JOBID PARTITION   NAME USER ST TIME NODES NODELIST(REASON)
    1     debug job.sh user  R 0:07     1 name01
Finally, if /home is NFS-mounted read/write, you should see the job's output file slurm-1.out in your current directory with the following contents:
Hello, World!
At this point, the cluster is operational. You can run jobs on nodes using Slurm and monitor the load using Ganglia. You also have many tools and tricks for performing administrative tasks ranging from adding users to re-installing nodes. A few things remain, most notably installing application-specific software, adding parallelization libraries like OpenMPI for running one job over multiple nodes and employing standard administrative good practices.
When installing application software to a single directory (for example, /usr/local/app_name/), install it on the head node. Then share it to the compute nodes by moving the entire installation to /home/export and symbolically linking it to the install location:
# mv /usr/local/app_name /home/export/app_name
# ln -s /home/export/app_name /usr/local/app_name
# bsh ln -s /home/export/app_name /usr/local/app_name
Libraries often are available in the standard repositories that you cloned to the head node when creating the local repository. This is the case with OpenMPI. Installation is as easy as executing a yum install:
# yum -y install openmpi openmpi-devel
# bexec yum -y install openmpi openmpi-devel
If you need a library that is not included in the standard repositories but is still packaged for CentOS, you can download it to your local repo following the procedure discussed in the Create Your Own Repo section of this article. If the library isn't packaged at all for CentOS (perhaps you compiled it from source), you still can bpush it to the compute nodes:
# bpush /path/to/library.so /path/to/library.so
To prevent a computational Tragedy of the Commons, you will want to protect the communal resources, including the head node's disk and CPU. To keep a user from hogging the entire home partition, enforce a quota on disk usage. To prevent long jobs from running on the head node (that's what compute nodes are for), you can write a “reaper” dæmon that checks for long-running user processes and kills them.
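As a deliberately simplistic sketch of such a reaper (the one-hour threshold and the UID cutoff are assumptions to tune for your site), something like the following could be run periodically from cron on the head node:

#!/bin/sh
# reaper.sh: kill head-node processes owned by regular users (UID >= 1000)
# that have been running for more than an hour. Note that this also catches
# idle login shells; a production version would whitelist those.
ps -eo pid=,uid=,etimes= | while read pid uid etime; do
    if [ "$uid" -ge 1000 ] && [ "$etime" -gt 3600 ]; then
        kill "$pid"
    fi
done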
Another useful tool to have in place is the ability to send email from the cluster. For example, many RAID devices have corresponding software that will send an email when a disk goes down. In order for this email to make its way to your inbox, the cluster must be configured to forward the email to the correct recipient(s).
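A common minimal setup is to forward root's mail off the cluster by adding a line like the following to /etc/aliases (the address is a placeholder):

root: admin@university.edu

and then rebuilding the alias database:

# newaliases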
Although outside the scope of this article, it's important for a production cluster to be backed up regularly. You'll want a cron job that tars up the entire home directory and transfers it over the network to a backup machine, preferably in a different building.
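A sketch of such a job, with the backup host, path and schedule as placeholders (and assuming passwordless SSH to the backup machine), might be a cron entry like:

# /etc/cron.d/backup-home: archive /home nightly to a backup machine
30 1 * * * root tar czf - /home | ssh backup.university.edu "cat > /backup/home-$(date +\%F).tar.gz"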
Computer clusters are important tools for massively parallelizable tasks (high-performance computing) and for running many jobs concurrently (high-throughput computing). All too often the setup and maintenance of clusters is left to specialists, or even worse, to complex “magical” configuration software. In this three-part series, we wrote some complex configuration scripts ourselves, but understanding each step strips the process of all magic.
The resulting cluster is flexible enough to be used for many different computationally intensive problems. Furthermore, because of the redundancy built into the hardware and the ease of re-installing compute nodes, the cluster is reliable as well. Our production cluster has been in operation for more than ten years; it has seen multiple hardware and software upgrades, and it still functions reliably.
In a few years, you may decide to upgrade by adding some shiny new compute nodes, which won't be a problem, since all you will have to do is tweak your kickstart file to take advantage of the new capacities or capabilities. Or, perhaps you will want to upgrade to the latest operating system version. Since you used standard technologies and methodologies to build your own cluster (BYOC), future transitions should proceed as smoothly as the original installation!