InfiniBand and Linux

Roland Dreier

Issue #133, May 2005

Learn why letting a remote system on the network scribble in your memory is fine, how user-space applications can send data without bothering the kernel and more facts about the new high-performance interconnect.

After a long gestation, use of InfiniBand (IB) is taking off, and work is under way to add IB support to Linux. At the physical level, IB is similar to PCI Express. It carries data using multiple high-speed serial lanes. The first versions of the InfiniBand specification allowed for only the same signaling rate for each lane, 2.5Gb/s, as PCI Express. The latest version of the specification (1.2), however, adds support for 5Gb/s and 10Gb/s rates per lane. Also, IB supports widths of 1X, 4X, 8X and 12X, while PCI Express supports X1, X2, X4, X8, X12, X16 and X32. The most commonly used IB speed today is 4X at a 2.5Gb/s/lane rate, or 10Gb/s total. But the 12X width combined with the 10Gb/s/lane rate means the current IB spec supports links with an astonishing 120Gb/s of throughput.

Table 1. Comparison Chart

Technology	Data Rate	Cables
USB	12Mb/s	5m
Hi-Speed USB (USB 2.0)	480Mb/s	5m
IEEE 1394 (FireWire)	400Mb/s	4m
Gigagbit Ethernet	1,000Mb/s	100m (cat5 cable)
10 Gigabit Ethernet	10,000Mb/s	10m (copper IB cable), 1+ km ( optical)
Myrinet	2,000Mb/s	10m (copper), 200m (optical)
1X InfiniBand	2,000Mb/s	10m (copper), 1+ km (optical)
4X InfiniBand	8,000Mb/s	10m (copper), 1+ km (optical)
12X InfiniBand	24,000Mb/s	10m (copper), 1+ km (optical)

Because IB is used to build network fabrics, IB supports both copper and optical cabling, while the PCI Express cable specification still is being developed. Most IB installations use copper cable (Figure 1), which can be used for distances up to about 10 meters. IB also allows a variety of optical cabling choices, which theoretically allow for links up to 10km.

Figure 1. Top to Bottom: cat5 Ethernet Cable, 4X InfiniBand Cable and 12X InfiniBand Cable (US quarter coin included for scale)

In past years, IB was pitched as a replacement for PCI, but that no longer is expected to be the case. Instead, IB adapters should continue to be peripherals that connect to systems through PCI, PCI Express, HyperTransport or a similar peripheral bus.

The network adapters used to attach systems to an IB network are called host channel adapters (HCAs). In addition to the fabric's extremely high speed, IB HCAs also provide a message passing interface that allows systems to use the 10Gb/sec or more throughput offered by InfiniBand. To make use of IB's speed, supporting zero-copy networking is key; otherwise, applications will spend all their time copying data.

The HCA interface has three key features that make zero-copy possible: a high-level work queue abstraction, kernel bypass and remote direct memory access (RDMA). The work queue abstraction means that instead of having to construct and process network traffic packet by packet, applications post work requests to queues processed by the HCA. A message sent with a single work request can be up to 4GB long, with the HCA taking care of breaking the message into packets, waiting for acknowledgements and resending dropped packets. Because the HCA hardware takes care of delivering large messages without any involvement from the CPU, applications receive more CPU time to generate and process the data they send and receive.

Kernel bypass allows user applications to post work requests directly to and collect completion events directly from the HCAs queues, eliminating the system call overhead of switching to and from the kernel's context. A kernel driver sets up the queues, and standard memory protection is used to make sure that each process accesses only its own resources. All fast path operations, though, are done purely in user space.

The final piece, RDMA, allows messages to carry the destination address to which they should be written in memory. Specifying where data belongs is useful for applications such as serving storage over IB, where the server's reads from disk may complete out of order. Without RDMA, either the server has to waste time waiting when it has data ready to send or the client has to waste CPU power copying data to its final location.

Although the idea of remote systems scribbling on memory makes some queasy, IB allows applications to set strict address ranges and permissions for RDMA. If anything, IB RDMA is safer than letting a disk controller DMA into memory.

Beyond its high performance, IB also simplifies building and managing clusters by providing a single fabric that can carry networking and storage traffic in addition to cluster communication. Many groups have specified a wide variety of upper-level protocols that can run over IB, including:

IP-over-InfiniBand (IPoIB): the Internet Engineering Task Force (IETF) has a working group developing standards-track drafts for sending IP traffic over IB. These drafts eventually should lead to an RFC standard for IPoIB. IPoIB does not take full advantage of IB's performance, however, as traffic still passes through the IP stack and is sent packet by packet. IPoIB does provide a simple way to run legacy applications or send control traffic over IB.
Sockets Direct Protocol (SDP): the InfiniBand Trade Association itself has specified a protocol that maps standard socket operations onto native IB RDMA operations. This allows socket applications to run unchanged and still receive nearly all of IB's performance benefits.
SCSI RDMA Protocol (SRP): the InterNational Committee for Information Technology Standards (INCITS) T10 committee, which is responsible for SCSI standards, has published a standard for mapping the SCSI protocol onto IB. Work is underway on developing a second-generation SRP-2 protocol.

Many other groups also are studying and specifying the use of IB, including APIs from the DAT Collaborative and the Open Group's Interconnect Software Consortium, RDMA bindings for NFS and IB support for various MPI packages.

Of course, without open-source support, all of these fancy hardware capabilities are a lot less interesting to the Linux world. Fortunately, the OpenIB Alliance is an industry consortium dedicated to producing exactly that—a complete open-source IB stack. OpenIB currently has 15 member companies, including IB hardware vendors, server companies, software companies and research organizations.

Work on the OpenIB software began in February 2004, and the first kernel drivers were merged into Linux's tree in December 2004, right after the tree opened for 2.6.11 following the release of 2.6.10. The first batch of code merged into the kernel is the smallest set of IB drivers that do something useful. It contains a midlayer to abstract low-level hardware drivers from upper-level protocols, a single low-level driver for Mellanox HCAs, an IPoIB upper-level protocol driver and a driver to allow a subnet manager to run in user space.

A few snippets of code from the IPoIB driver should provide some understanding of how one can use the kernel's IB support. To see this code in context, you can look at the complete IPoIB driver, which is in the directory drivers/infiniband/ulp/ipoib in the Linux kernel source.

Listing 1 shows what the IPoIB driver does to allocate all of its IB resources. First, it calls ib_alloc_pd(), which allocates a protection domain (PD), an opaque container that every user of IB must have to hold other resources.

Listing 1. IPoIB Driver Initialization


struct ib_qp_init_attr init_attr = {
    .cap = {
        .max_send_wr  = IPOIB_TX_RING_SIZE,
        .max_recv_wr  = IPOIB_RX_RING_SIZE,
        .max_send_sge = 1,
        .max_recv_sge = 1
    },
    .sq_sig_type = IB_SIGNAL_ALL_WR,
    .rq_sig_type = IB_SIGNAL_ALL_WR,
    .qp_type     = IB_QPT_UD
};

priv->pd = ib_alloc_pd(priv->ca);

priv->cq = ib_create_cq(priv->ca,
            ipoib_ib_completion,
            NULL, dev,
            IPOIB_TX_RING_SIZE +
            IPOIB_RX_RING_SIZE + 1);

if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP))
    goto out_free_cq;

priv->mr = ib_get_dma_mr(priv->pd,
                         IB_ACCESS_LOCAL_WRITE);

init_attr.send_cq = priv->cq;
init_attr.recv_cq = priv->cq,

priv->qp = ib_create_qp(priv->pd, &init_attr);

By the way, proper error checking has been omitted from the listings, although any real kernel code must check the return values of all functions for failure. All of the IB functions that allocate resources and return pointers use the standard Linux method for returning errors by way of the ERR_PTR() macro, which means that the status can be tested with IS_ERR(). For example, the call to ib_alloc_pd() in the real kernel actually looks like:


priv->pd = ib_alloc_pd(priv->ca);
if (IS_ERR(priv->pd)) {
        printk(KERN_WARNING "%s: failed "
               "to allocate PD\n", ca->name);
        return -ENODEV;
}

Next, the driver calls ib_create_cq(), which creates a completion queue (CQ). The driver requests that the function ipoib_ib_completion() be called when a completion event occurs and that the CQ be able to hold at least IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1 work completion structures. This size is required to handle the extreme case when the driver posts its maximum number of sends and receives and then does not get to run until they all have generated completions. Confusingly enough, CQs are the one IB resource not associated with a PD, so we don't have to pass our PD to this function.

Once the CQ is created, the driver calls ib_req_notify_cq() to request that the completion event function be called for the next work completion added to the CQ. The event function, ipoib_ib_completion(), processes completions until the CQ is empty. It then repeats the call to ib_req_notify_cq() so it is called again when more completions are available.

The driver then calls ib_get_dma_mr() to set up a memory region (MR) that can be used with DMA addresses obtained from the kernel's DMA mapping API. Translation tables are set up in the IB HCA to handle this, and a local key (L_Key) is returned that can be passed back to the HCA in order to refer to this MR.

Finally, the driver calls ib_create_qp() to create a queue pair (QP). This object is called a queue pair because it consists of a pair of work queues—one queue for send requests and one queue for receive requests. Creating a QP requires filling in the fairly large ib_qp_init_attr struct. The cap structure gives the sizes of the send and receive queues that are to be created. The sq_sig_type and rq_sig_type fields are set to IB_SIGNAL_ALL_WR so that all work requests generate a completion.

The qp_type field is set to IB_QPT_UD so that an unreliable datagram (UD) QP is created. There are four possible transports for an IB QP: reliable connected (RC), reliable datagram (RD), unreliable connected (UC) and unreliable datagram (UD). For the reliable transports, the IB hardware guarantees that all messages either are delivered successfully or generate an error if an unrecoverable error, such as a cable being unplugged, occurs. For connected transports, all messages go to a single destination, which is set when the QP is set up, while datagram transports allow each message to be sent to a different destination.

Once the IPoIB driver has created its QP, it uses the QP to send the packets given to it by the network stack. Listing 2 shows what is required to post a request to the send queue of the QP.

Listing 2. IPoIB Driver Send Request Posting


priv->tx_sge.lkey            = priv->mr->lkey;
priv->tx_sge.addr            = addr;
priv->tx_sge.length          = len;

priv->tx_wr.opcode           = IB_WR_SEND;
priv->tx_wr.sg_list          = &priv->tx_sge;
priv->tx_wr.num_sge          = 1;
priv->tx_wr.send_flags       = IB_SEND_SIGNALED;
priv->tx_wr.wr_id            = wr_id;
priv->tx_wr.wr.ud.remote_qpn = qpn;
priv->tx_wr.wr.ud.ah         = address;

ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);

First, the driver sets up the gather list for the send request. The lkey field is set to the L_Key of the MR that came from ib_get_dma_mr(). Because the IPoIB is sending packets that are in one contiguous chunk, the gather list has only a single entry. The driver simply has to assign the address and length of the packet. The address in the gather list is a DMA address obtained from dma_map_single() rather than a virtual address. In general, software can use a longer gather list to have the HCA collect multiple buffers into a single message to avoid having to copy data into a single buffer.

The driver then fills in the rest of the fields of the send work request. The opcode is set to send, sg_list and num_sge are set for the gather list just filled in and the send flags are set to signaled so that the work request generates a completion when it finishes. The remote QP number and address handle are set, and the wr_id field is set to the driver's work request ID.

Once the work request is filled in, the driver calls ib_post_send(), which actually adds the request to the send queue. When the request is completed by the IB hardware, a work completion is added to the driver's CQ and eventually is handled by ipoib_ib_completion().

InfiniBand can do a lot, and the OpenIB Alliance is only getting started writing software to do it all. Now that Linux has basic support for IB, we will be implementing more upper-level protocols, including SDP and storage protocols. Another major area we are tackling is support for direct user-space access to IB—the kernel bypass feature we talked about earlier. There's plenty of interesting work to be done on IB, and the OpenIB Project is open to everyone, so come join the fun.

Resources for this article: /article/8131.