Failure tolerance with Xen 4 and Remus

Return of the Lost Sibling


The new 4.0 version of Xen is rich in features in its own right; now it also integrates Remus for fault-tolerant high availability.

By Ralf Spenneberg

Legend has it that Remus died at the hand of his twin, Romulus, after an argument about the details of founding Rome. The virtualization industry might not be as martial in its approach, but augurs have pronounced many a stakeholder dead in the past.

Xen is probably the single piece of software most often counted out by the pundits, especially when its code failed to make the Linux kernel [1]. However, version 4.0, released in April 2010, impresses with many features, including the high-availability extension Remus [2].

Features and Kernels

Developers have quite obviously put in a major effort to stem the current trend toward KVM. Xen 4 includes its own hard disk back end called blktap2, the Netchannel2 network back end, improved PCI passthrough for VGA adapters, memory page sharing, and vastly improved response times.

Besides the classic Xen kernel 2.6.18, Xen 4 now also supports 2.6.31 and 2.6.32. This was previously impossible because the hooks a Xen Dom0 kernel requires had been replaced in newer kernels by the Paravirt Ops interface (PVOps) [3]. Thanks to developer Jeremy Fitzhardinge, Xen can now use a PVOps kernel as its Dom0 kernel. Despite this, 2.6.18 remains the reference kernel on which some of the new features run, including Remus.
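If you want to experiment with one of the newer kernels as Dom0, a quick sanity check is whether the pvops build actually includes Dom0 support. The following is a minimal sketch, assuming a kernel built from the Xen developers' pvops branches; the config file path is only an example:

# check a pvops kernel build for Dom0 support (path is an example)
grep -E 'CONFIG_XEN_DOM0|CONFIG_XEN=' /boot/config-2.6.32-xen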

Nonstop

Xen is typically used in combination with a SAN, DRBD, Heartbeat, or Pacemaker as a high-availability solution. The administrator installs the required network service on a virtual machine, which is then launched on a Xen host. A second host uses the storage back end to access the hard disk. Pacemaker or Heartbeat continually monitor availability. If the active Xen host fails, the cluster software starts the virtual machine on a second virtualization server. This strategy helps administrators reduce outages to the few minutes, at most, it takes to boot the virtual machine on the new host.
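For comparison, such a classic setup needs only a few lines of cluster configuration. The following is a minimal sketch, assuming the crm shell and the ocf:heartbeat:Xen resource agent are installed; the resource name and config file path are purely examples:

# example Pacemaker resource for a Xen guest (names and paths are placeholders)
crm configure primitive vm-web ocf:heartbeat:Xen \
    params xmfile="/etc/xen/vm-web.cfg" \
    op monitor interval="30s" timeout="60s" \
    op start timeout="60s" op stop timeout="120s"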

However, many applications will not survive a failure of a couple of minutes. In manufacturing, in particular, the failure of a control unit can take down the whole production line, which can be very expensive. These scenarios require nonstop or fault-tolerant systems.

Several nonstop and fault-tolerance techniques appear in the wild. Hardware vendor Stratus Technologies [4], for example, manufactures nonstop machines under the ftServer brand; the machines consist of two linked systems that are precisely synchronized and execute the same programs. Just one of these two systems talks to the outside world, and the second system, known as the shadow, has exactly the same data and identical code. If the active system detects a hardware error, the second system takes over responsibility for communications.

VMware's vSphere provides a similar kind of fault tolerance in a virtualized environment, but it comes at the price of approximately US$ 2,800 per CPU. External solutions for Xen, such as Kemari [5], SecondSite [6], and Remus, have been around for years.

Many fault-tolerant systems depend on CPU lock stepping [7], which ensures that both CPUs always execute exactly the same instructions. If both CPUs are in an identical environment and have the same code and data, one instance can always replace the other. This approach does assure genuine fault tolerance, but the overhead is enormous.

Remus takes a different approach: It assumes that the two virtual machines don't need exactly the same state. As long as the client doesn't notice any difference between the active system and its shadow, the overall system is fault tolerant in case of a failover.

Remus achieves this goal by triggering a migration of the active virtual machine to a second host every 200ms. The copy of the virtual machine that this process creates is then available as a shadow copy; unlike in a normal live migration, the original machine carries on running and is not destroyed. To reduce the overhead, after the first complete migration Xen only transfers the memory pages dirtied since the previous checkpoint, rather than the whole virtual machine.

Because the shadow copy is only identical to the active system once every 200ms, the Remus developers rely on a workaround: although the active system continues to work during these 200ms, the server buffers all of its outbound network traffic and holds the packets back; from the client's point of view, the response is simply still outstanding.

If the active virtual guest fails, the shadow copy can step in and assume its functionality at this point. Packets that the client sent to the failed guest over existing TCP connections are simply retransmitted by the TCP layer, but this time they reach the shadow instance instead.

Under normal circumstances, Xen synchronizes the shadow copy with the active system after another 200ms. Once synchronization has been completed, the clients receive the packets that were retained and buffered for 200ms. If these response times are not sufficient for your needs, you can reduce the synchronization intervals to 50ms. Whether the additional overhead will affect resource-hungry applications is something you will need to find out through a process of trial and error.
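The interval is passed to the remus tool on the command line. A minimal sketch, assuming your Remus build accepts the -i option (checkpoint interval in milliseconds):

remus -i 50 my-VM target-Dom-0-IP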

Migration

Testing whether live migration works perfectly in your Xen environment is also a good idea. To do so, you first need a storage back end for the virtual machine that is accessible from both Xen Dom0 hosts. A simple NFS export is fine for this test; once your Remus setup is operational, Remus will handle storage replication itself.

All you need to do is export an NFS directory on another machine and mount it at the same path on the two Xen Dom0 hosts. Then give the xm migrate --live my-VM target-Dom-0-IP command to see whether a virtual machine whose virtual disk resides in the NFS export can be migrated.
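Condensed into commands, the migration test looks something like the following sketch; hostnames, IP addresses, and paths are only examples:

# on the NFS server: export the image directory (example subnet)
echo '/images 192.168.0.0/24(rw,no_root_squash,sync)' >> /etc/exports
exportfs -ra

# on both Xen Dom0 hosts: mount the export at the same path
mkdir -p /images
mount -t nfs nfs-server:/images /images

# on the host currently running the guest: migrate it live
xm migrate --live my-VM 192.168.0.2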

If this works, the next test is with Remus. If not, check that the Xend relocation server is enabled in the /etc/xen/xend-config.sxp file and that the xend-relocation-hosts-allow parameter allows access.
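The relevant entries in /etc/xen/xend-config.sxp look something like this; the subnet in the regular expression is only an example and must match your own network, and xend needs a restart after the change:

(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-address '')
(xend-relocation-hosts-allow '^192\\.168\\.0\\.[0-9]+$')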

Testing Remus

The developers recommend using a copy of a virtual machine for the initial test, because the test will damage the virtual disk. Copy the disk image to both Xen Dom0 hosts, start the virtual machine on one of the two systems, and call Remus with the command:

remus --no-net my-VM target-Dom-0-IP

This command launches Remus without protecting the network connections or the disk. Buffering and synchronization for network packets, as referred to earlier, is not performed.
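Put together, the preparation and the Remus call for this first test boil down to just a few commands; the IP address and paths are examples:

# copy the expendable disk image to the second Dom0 host
scp /images/disk.img 192.168.0.2:/images/

# start the guest on the first host and protect it with Remus
xm create /etc/xen/my-VM.cfg
remus --no-net my-VM 192.168.0.2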

The screen should now display messages similar to those shown in Figure 1. At the same time, xm list on the second Xen Dom0 system should show the synchronized and stopped VM (Figure 2). If all of this works, you can add disk replication in your next test.

Figure 1: Every 200ms, Remus synchronizes the two virtual machines; a network buffer holds all connections during this time.

Figure 2: The Remus shadow copy running nicely in the background.

How is disk synchronization handled in this setup? Because Remus only synchronizes the shadow copy at 200ms intervals, the active virtual machine and the shadow copy must not access the same virtual disk; Remus has to handle this synchronization, too. That is actually a benefit: it removes the need for the SAN or DRBD you would otherwise use in a high-availability Xen solution.

The virtual disk paths should be the same on both Xen Dom0 hosts. It makes no difference whether the DomU uses a logical volume, a file, or a separate partition as its hard disk; the only thing that matters is the Xen configuration entry. For a flat file, Remus handles access to the virtual hard disk like this:

disk = [ 'tap:remus:192.168.0.2:9000|aio:/images/disk.img,sda1,w' ]

192.168.0.2:9000 specifies the target computer and port for synchronizing the disks. Remus only uses this connection while actually synchronizing the shadow copy. The virtual machine can't write to the disk until you launch Remus-based synchronization; typically, its boot process will hang right after checking the filesystems (Figure 3).

Figure 3: The shadow copy can't access the hard disk image until Remus synchronization has started.
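In context, a minimal DomU configuration for the protected guest might look like the following sketch; the kernel and initrd names, memory size, and bridge are examples that have to match your own setup:

# /etc/xen/my-VM.cfg (example values)
name    = "my-VM"
memory  = 512
vcpus   = 1
kernel  = "/boot/vmlinuz-2.6.18-xenU"
ramdisk = "/boot/initrd-2.6.18-xenU.img"
vif     = [ 'bridge=xenbr0' ]
disk    = [ 'tap:remus:192.168.0.2:9000|aio:/images/disk.img,sda1,w' ]
root    = "/dev/sda1 ro"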

New Back Ends

Xen uses the new blktap2 hard disk back end and the tapdisk2 userspace process for the synchronization. After you launch synchronization with the

remus --no-net my-VM 192.168.0.2

command, the active machine gains write access to the disk and the virtual machine continues to boot.

To test the setup, you can open the Xen console for the virtual machine (xm console my-VM) and give the top command. Then destroy the instance by issuing the xm destroy my-VM command on the active Xen Dom0 host, or simply pull the plug.

Now connect to the console of the shadow copy and check whether top is still running. To extend this protection to the network connections, run the remus command without the --no-net option. The trick with the top command will then work over SSH, too.
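The complete failover test, condensed into commands (the second block runs on the backup Dom0):

# on the active Dom0: watch the guest, then kill it
xm console my-VM          # log in and run top
xm destroy my-VM          # or simply pull the plug on the host

# on the backup Dom0: the shadow copy should have taken over
xm list
xm console my-VM          # top should still be running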

Of course, Remus-based fault tolerance comes at a price. Constant synchronization in the background slows down the virtual machine. How much performance you lose depends on the virtual machine itself. The more changes made to memory and disk, the more time the synchronization process will consume and the less time the virtual machine has for its work.

Performance

If in doubt, run a test to see whether the Remus speed hit is acceptable for production use. To optimize performance, the developers recommend disabling the Xen scheduler's automatic distribution of Dom0 and the DomUs across multiple cores and pinning them to specific cores instead.
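Pinning can be done at runtime with xm vcpu-pin or permanently in the configuration files; a minimal sketch with example core assignments (check that the options match your Xen version):

# pin Dom0 and the protected guest to separate cores (example layout)
xm vcpu-pin Domain-0 0 0
xm vcpu-pin my-VM 0 1

# or permanently, in the DomU configuration file:
# cpus = "1"

On the hypervisor side, boot parameters such as dom0_max_vcpus and dom0_vcpus_pin serve a similar purpose for Dom0.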

Even so, consider the use of Remus carefully. Fault tolerance works exactly once: after the shadow copy has taken over, you no longer have a second instance to protect you if it fails as well. As of this writing, Xen cannot create a new shadow on the fly; it has to restart the DomU. To avoid damage, you should do this in a planned maintenance window, and before you can do so, you need to synchronize the DomU's virtual hard disk with the new shadow copy.

Where To Next?

The Remus developers still have a long list of to-dos. Right now, Remus provides fault tolerance only on a 2.6.18 Dom0 kernel, for paravirtualized (PV) and fully virtualized (HVM) guests. Future versions will focus on cleaning up the source code and adding the external remus command to the xm control front end. Libvirt support is also planned. Remus still uses its own, very simple heartbeat mechanism for failure detection, although Corosync [8] and Pacemaker offer far more powerful functionality. Future Xen versions should support the integration of these tools.

Alive and Kicking

It remains uncertain whether Linux distributions that discontinued Dom0 Xen support will reintegrate it now. It will be just as interesting to see whether the new version increases the pressure on kernel developers to add Xen Dom0 support to the Linux kernel. KVM offers functionality similar or equivalent to Xen's, and KVM's developers will probably take the new Xen as a challenge to add a fault tolerance feature of their own. The lack of administrative interfaces with Remus support is probably a bigger obstacle to using Remus in production environments, but it is definitely too early to write off Remus and Xen.

INFO
[1] LKML Xen discussion: http://thread.gmane.org/gmane.linux.kernel/800658/focus=800714
[2] Remus: http://nss.cs.ubc.ca/remus
[3] Xen Paravirt Ops kernel: http://wiki.xensource.com/xenwiki/XenParavirtOps
[4] Stratus ftServer: https://www.stratus.com/products/ftserver
[5] Kemari: http://www.osrg.net/kemari
[6] SecondSite: http://dsg.cs.ubc.ca/secondsite/
[7] Lockstepping: http://en.wikipedia.org/wiki/Lockstep_(computing)
[8] Corosync: http://www.corosync.org
THE AUTHOR

Ralf Spenneberg is a freelance Unix/Linux trainer, consultant, and author and manager of Open Source Training Ralf Spenneberg. The second edition of his latest book, VPN on Linux, was published recently.