Adjusting assembly language routines for higher performance from Alpha processors.
API NetWorks is a leading developer of high-performance, high-density servers. As such, we (and our customers) are acutely sensitive to system performance issues. With the majority of our customers running Linux on Alpha-based systems, we pay close attention to key open-source components and packages to ensure we don't inadvertently leave performance “on the table”.
In the proprietary world, improvements unearthed after software product releases are saved for the next version. In the world of open-source software, the time between improvements can be measured in hours or days. In addition, tuning techniques are shared quickly for the enhancement of the communal knowledge base. With this in mind, we examined the assembly language routines in the Linux kernel, with an eye toward taking advantage of the micro-architectural features of the latest Alpha 21264 (also known as ev6) processor.
We discovered that by rewriting a handful of routines in the architecture-specific portion of the Linux kernel to take advantage of the features of the 21264, we could attain significant speed improvements, amounting to a 40% reduction in overhead for some workloads. This article describes how and why these optimizations work, so that others may be able to use these techniques to boost their Alpha applications performance. For an e-mail server running on a 21264-based Alpha, for example, applying these tuning tips will enable it to scale to higher loads, extending performance beyond where it would normally tend to taper off. End users will feel as though their system is running faster; systems managers will be pleased because they won't have to buy another system as soon to handle increased traffic.
In the case of the Linux kernel, the performance-critical routines are carefully (and conveniently) separated by architecture and mostly implemented in assembly language. While there is a large body of developers continually working to improve the algorithmic performance of the kernel, there are only a handful of developers with the requisite interests, skills and knowledge to enable them to tune these assembly routines to extract the maximum performance from a processor. Most of these developers are tuning code for x86 systems; the relevant Alpha code base had been dormant for quite some time prior to our tuning.
Anecdotal evidence and previous measurements suggest a disproportionately large amount of time was spent inside the Linux/Alpha kernel when running various web and networking servers. Given that the number of known performance-critical routines in the Linux kernel is relatively small (20-30 assembly language routines), it made sense to sort through them to make sure the code was written with the 21264 in mind. A quick reading of these routines revealed they had been carefully hand-scheduled for the previous generation of Alpha CPUs, the 21164 (also called ev5).
While the difference may seem minor, the 21264 differs significantly from the 21164 at the micro-architectural (chip implementation) level, having features that result in roughly double the performance of the 21164 at the same clock speed.
Before detailing the performance improvements, it is useful to note some of the differences between the 21164 and 21264 processors (see Table 1).
After realizing a careful rewrite could yield significant performance gains, we undertook the tuning of the Alpha assembly language routines in the Linux kernel. (There have been attempts to solve dynamically and automatically the problem of efficiently running binaries compiled for older versions of Alpha CPUs on new Alpha CPUs1, but that is beyond the scope of this article.) Specifically, it was clear that about 20 routines in linux/arch/alpha/lib and four to six routines in linux/include/asm-alpha needed to be rewritten to take full advantage of the features of the 21264.
From an internal perspective, the 21264 is a significantly more complex and powerful CPU than the older 21164. Because of these differences, the nature of the performance improvements can come from transformations (discussed below) involving the rescheduling of the code in order to keep the processor from stalling on fetch, avoid branch and branch target penalties, avoid replay traps from occurring where possible, use the instruction latencies and scheduling rules of the 21264, use instructions available in the 21164 and 21264 that had not been used, use instructions newly available in the 21264 and not use some instructions if possible on the 21264.
When writing code to avoid fetch stalling, one must ensure the instructions dynamically encountered in a fetch block do not result in one or more of the functional units being over-subscribed. On the 21264 it is also important to ensure that branch targets are aligned on a fetch block boundary. Given that it fetches four instructions at a time, branch targets should have an alignment of 0mod4 instructions, equivalent to an address alignment of 0mod16.
It is particularly important to avoid branch penalties on the 21264. Sophisticated, trainable branch prediction logic is built in and works effectively if only one control flow change instruction is in a fetch block (a “quad-pack”). In the 21164-tuned kernel assembly language routines, there are a number of places where multiple control-flow change instructions occur within a quad-pack. Additionally, branch targets were aligned to 8mod16 addresses, which often resulted in branch target labels appearing in the middle of a quad-pack. While these sequences run quite well on a 21164, they run relatively slow on a 21264.
Replay traps occur when the processor must roll back the state of memory to force accesses to a particular memory location in order to be sequential, or when there are different-sized accesses to the same memory location. But the code context in the modified routines was such that replay traps were not an issue, so rewriting sequences to avoid replay traps was unnecessary.
The instruction scheduling and slotting rules for the 21264 are too complex to list here, but for those interested in the details, the 21264 Compiler Writer's Guide is an excellent reference.
Byte- and word-sized loads and stores were introduced in the 21164A (ev56) processor, but they were not used in the original versions of the assembly language routines. Prior experience (in the context of static binary translation of applications) has shown performance can be typically improved 10% to 20% by utilizing these instructions. This is particularly true for the stw (store word) and stb (store byte) instructions, as it eliminates memory traffic in a way guaranteed to cause replay traps on the 21264. In the context of the tuned-up kernel routines, these instructions were helpful, but it was typically localized to the tail code of large region copies, while the bulk of the data movement used eight-byte granularity load and store instructions.
The Alpha architecture also features various forms of pre-fetch instructions. Pre-fetch instructions are hints to the memory subsystem to fetch a block of memory to the data cache for future consumption. These do not normally appear in compiled code, as few compilers have enough context available to permit their generation; Compaq's compilers do generate pre-fetch instructions. In the context of moving large amounts of data, it is possible (and desirable) for the assembly language programmer to utilize pre-fetches. The __asm__() feature of gcc enables programmers to insert relevant pre-fetch instructions at key points in routines when rewriting entire routines in assembly language is undesirable. Because they can minimize or prevent data-cache stalls, using these instructions can significantly boost performance.
The 21264 is the first Alpha implementation to include support for three instructions useful for boosting performance: CTLZ, CTTZ and WH64.
The CTLZ and CTTZ instructions count the number of leading/trailing zeros in a 64-bit register and are handy for string manipulation. When a program performs string operations involving pattern matching (strlen() matches on NULL), it is often the case that the byte-number index of the pattern match in an eight-byte value in a register is needed. Without CTTZ, it takes about ten instructions involving multiple CMOVxx (conditional move) instructions to determine this index. The result is a reduction in code size (always useful), as well as a decrease in the number of cycles needed to perform string operations. Also, there are some filesystem primitives involving finding holes in a bitfield where these instructions are useful.
WH64 (write hint for 64-bytes) is a memory subsystem hint that a specified 64-byte region is going to be written to in the near future. The processor can pass this information to the memory subsystem, which can invalidate the target contents and avoid some number of memory system cycles to keep the memory state coherent. Since a process context switch entails moving large amounts of information in memory from one place to another, any improvement in copying performance between kernel-space memory and user-space memory is good news. Meanwhile, program load time is another place in the operating system that depends upon doing a lot of memory-to-memory traffic. The program bits all have to get mapped, and all of the zeroed memory (.bss in executables) must have zeros written to it.
The CMOVxx conditional move instructions on the 21264 have been implemented in the hardware by decomposing them into two separate instructions inside the processor. The result of the latency of a CMOVxx instruction is a minimum of two cycles, and can take up to five cycles, depending upon the number of CMOVs in a given fetch block. In some situations, replacing CMOV instructions with highly predictable conditional branches can result in a performance gain on the 21264. Overall, a good rule of thumb is to try to minimize the number of CMOV instructions if possible.
The data was collected from an API NetWorks' CS20 server, which has dual 833MHz processors with 4MB DDR cache, 1GB of SDRAM and Ultra-160 SCSI disks. Two load-generation tests were run: five builds of the 2.2.18 kernel and five builds of gcc-2.95.3. The average system time (as reported by /usr/bin/time -p) was recorded, using various levels of parallelism with make (see Tables 2 and 3).
A similar version of the experiment was run using the 2.4.2 kernel in default mode (all of the performance patches exist). The results were compared to an unpatched 2.4.2 kernel with most (but not all) of the performance changes reverted.
This experiment was initially performed on an API NetWorks' UP1000 motherboard system, which has a 700MHz processor with 4MB cache, 128MB SDRAM and IDE disks. Again, five builds of the kernel and gcc were run, and the average times were recorded. The kernel used was 2.4.0-test6, with and without the patches.
On a modestly configured 21664 system (the UP1000), the performance increase is significant in terms of reducing the amount of time spent in the kernel, with improvements in the 40% range for some activities (kernel builds). On a more generously configured CS20, we consistently attained speed increases of 14-15% for the measured loads.
We attribute the differences between the UP1000 and CS20 systems to be related to their memory: the UP1000 has an 800MB/sec, 64-bit bus, while the CS20 has a 2.65GB/sec, 256-bit bus.
All of the rewritten routines have appeared in one form or another (some have undergone subsequent rewriting) as part of the 2.4.2 kernel. Additionally, we have put together a patch for 2.2.17 of the kernel and made it available on our corporate web site, http://www.api-networks.com/products/downloads/developer_support/ under “Performance”. Through additional efforts, these improvements have also migrated into glibc and will eventually help improve application performance of user-mode code.