Kernel Korner

Kernel Mode Linux for AMD64

Toshiyuki Maeda

Issue #136, August 2005

When user code runs inside the kernel, system calls become function calls, 50 times faster. How does that affect the performance of a real application, MySQL?

Kernel Mode Linux (KML) is a technology that enables the execution of user processes in kernel mode. I described the basic concept and the implementation techniques of KML on IA-32 architecture in my previous article, “Kernel Mode Linux”, which appeared in the May 2003 issue of Linux Journal (see the on-line Resources). Since then, I have extended KML to support AMD64, or x86-64, architecture, which is a viable 64-bit extension of the IA-32 architecture. In this article, I briefly describe the background of KML and then show the implementation techniques of KML for the AMD64 architecture. In addition, the results of a performance experiment using MySQL are presented.

The Problem of Protection by Hardware

Traditional OS kernels protect themselves by using the hardware facilities of CPUs. For example, the Linux kernel protects itself using a privilege level mechanism and a memory protection mechanism built in to CPUs. As a result, to use the services of the kernel, such as the filesystem or network, user programs must perform costly and complex hardware operations.

In Linux for AMD64, for example, user programs must use special CPU instructions (SYSCALL/SYSRET) to use kernel services. SYSCALL can be regarded as a special jump instruction whose target address is restricted by the kernel. To utilize system services or, in other words, to invoke system calls, a user program executes the SYSCALL instruction. The CPU then raises its privilege level from user mode to kernel mode and jumps to the target address of SYSCALL, which is specified by the kernel in advance. Then, the code located at the target address switches the context of the CPU from the user context to the kernel context by using the SWAPGS instruction. Finally, it executes the requested system service. To return to the user program, the SYSRET instruction reverses these steps.

Some problems exist, however, in this protection-by-hardware approach. One problem is system calls become slow. For example, on my Opteron system, SYSCALL/SYSRET is about 50 times slower than a mere function call/return.

One obvious solution to speed up system calls is to execute user processes in kernel mode. Then, system calls can be only the usual function calls, because user processes can access the kernel directly. Of course, it is dangerous to let user processes run in kernel mode, because they can access arbitrary portions of the kernel.

One simplistic solution to ensure safety is to use virtual machine (VM) techniques such as VMware and Xen. If user programs and a kernel are executed in virtual kernel mode, user programs can access the kernel directly. However, this protection-by-VM approach does not quite work, because the overhead of virtualization is considerable. In addition, although VM can prevent user programs from destroying the host system outside of the VM, it cannot prevent them from destroying the kernel inside the VM. It is unlikely that these difficulties could be solved even if CPUs, such as Intel's Vanderpool and AMD's Pacifica, provide better support for virtualization.

A recommended way to execute user processes in kernel mode safely is to use safe languages, also known as strongly typed languages. The recent advances in static program analysis, or type theory, can be used to protect the kernel from user processes. For example, many technologies already enable this protection-by-software approach, such as Java bytecode, .NET CLI, Objective Caml, Typed Assembly Language (TAL) and Proof-Carrying Code (PCC). I currently am implementing a TAL variant that is powerful enough to write an operating system kernel.

Based on this idea, I implemented Kernel Mode Linux (KML) for IA-32, a modified Linux kernel that can execute user processes in kernel mode, called kernel-mode user processes. My previous article described KML for IA-32. Since then, I have implemented KML for AMD64, because AMD64 has come into widespread use as a possible successor to IA-32. Interestingly, in spite of the similarities between IA-32 and AMD64, the implementation techniques of KML for these two architectures differ considerably. Therefore, I describe the basic concept, usage and implementation techniques of KML for AMD64 in the rest of this article.

How to Use KML for AMD64

KML is provided as a patch to the source of the original Linux kernel. To use KML, all you have to do is patch the original source of the Linux kernel with the KML patch and enable the Kernel Mode Linux option at the configuration phase, as you might do with other kernel patches. The KML patch is available from the KML site (see Resources).

In current KML, programs under the directory /trusted are executed as kernel-mode user processes. Therefore, if you want to execute bash in kernel mode, all you have to do is execute the following commands:

% cp /bin/bash /trusted/bin
% /trusted/bin/bash

How to Speed Up System Call Invocations

In KML for IA-32, system call invocations are translated automatically into fast, direct function calls without modifying user programs. This is possible because the recent GNU C Library for IA-32 has a mechanism to choose one of several methods that the kernel provides for system call invocation, and the KML provides direct function calls as one way of invoking system calls.

However, the GNU C Library for AMD64 doesn't have such a mechanism for choosing among methods of system call invocations. Therefore, I created a patch for the GNU C Library. With the patch, kernel-mode user processes can invoke system calls rapidly, because the invocations automatically are translated to function calls. The patch is available from the KML site (see Resources).

What Kernel-Mode User Processes Can Do

One of advantages of KML is the kernel-mode user processes are almost the same as usual user processes except for their privilege level. That is, kernel-mode user processes can do almost anything that ordinary user processes can do. For example, kernel-mode user processes can invoke all system calls. This means they can use filesystems. They also can call open, read, write and other functions, including network systems, with socket, connect and bind. They even can create processes and threads with fork, clone and execve. In addition, they have their own memory address space that they can access freely. Even if a kernel-mode user process uses tons of memory, the kernel pages out the memory.

Moreover, the scheduling mechanism and the signal mechanism of the original Linux kernel work for the kernel-mode user processes. You can check this by executing the following commands:

% cp /usr/bin/yes /trusted/bin
% /trusted/bin/yes

You should notice that your system does not hang. This is true, because the kernel's scheduler preempts the kernel-mode yes and gives CPU time to other processes. You can stop the kernel-mode yes by sending Ctrl-C. This means the kernel can interrupt the kernel-mode yes and send a signal to kill it.

What Kernel-Mode User Processes Cannot Do

As described in the previous section, kernel-mode user processes are ordinary user processes and can perform almost every task that user processes can perform. However, there are a few exceptions:

Kernel-mode user processes cannot modify their GS segment register, because KML uses the GS segment register internally to eliminate the overhead of SWAPGS instruction.
32-bit binaries cannot be executed in kernel mode on AMD64. KML for AMD64, like other typical OS kernels for AMD64, runs in 64-bit mode and there is no efficient way to let 32-bit programs directly call 64-bit functions.

Please notice that, as in the case of KML for IA-32, these limitations are present only in kernel-mode user processes. Ordinary user processes can alter their GS selector, and IA-32 binaries can be executed if an IA-32 emulation environment is set up.

How KML Executes User Processes in Kernel Mode

The way to execute user processes in kernel mode in AMD64 is almost the same as it is in IA-32. To execute user processes in kernel mode, the only thing KML does is launch user processes with the CS segment register, which points to the kernel code segment instead of user code segment.

In AMD64 CPUs, the privilege level of running programs is determined by the privilege level of their code segment. This is almost the same as in IA-32 CPUs; the only difference is the segmentation memory system is degenerated in AMD64. Although segment registers still are used in 64-bit mode of AMD64, the only segment that the segment registers can use is the 16 EB flat segment. Thus, the role of the segment descriptors is simply to specify privilege levels. Therefore, only four segments—kernel code segment, kernel data segment, user code segment—exist in 64-bit mode.

The Stack Starvation Problem and Its Solution

Although it is fairly easy to execute user processes in kernel mode, as shown in the previous section, there is a big problem—the stack starvation problem. The problem itself is almost the same as that of KML for IA-32, so I describe it briefly here. Further details are available in my previous article.

The original Linux kernel for AMD64 handles interrupts and exceptions by using the legacy interrupt gates mechanism. For each interrupt/exception, the kernel specifies an interrupt handler by using the interrupt gates in advance, typically at boot time. If an interrupt occurs, the AMD64 CPU suspends the running program, saves the execution context of the program and executes the interrupt handler specified in the corresponding interrupt gate.

The important point is the AMD64 CPU may or may not switch stacks before saving the execution context, depending on the privilege level of the suspended program. If the program is running in user mode, the CPU automatically switches from the stack of the running program to the kernel stack, whereas the CPU does not switch stacks if the program is running in kernel mode. The CPU then saves the execution context—RIP, CS, RFLAGS, RSP and SS register—to the stack.

Now, let us assume that a kernel-mode user process accesses its memory stack, which is not mapped by the page tables of the CPU. First, the CPU raises a page fault exception, suspends the process and tries to save the execution context. This cannot be done, however, because the CPU does not switch stacks, and the stack where the CPU is ready to save the context is nonexistent. To signal this serious situation, the CPU tries to raise a special exception, a double fault exception. Again, the CPU tries to access the nonexistent stack to save the context. Finally, the CPU gives up and resets itself. This process is known as the stack starvation problem.

To solve the stack starvation problem, KML for IA-32 uses the task management mechanism of IA-32 CPUs. The mechanism can be used to switch CPU contexts including all registers and all segment registers, when interrupts or exceptions are raised. KML for IA-32 switches stacks using the mechanism when double faults are raised. However, in 64-bit mode on AMD64, the task management mechanism cannot be used because it simply does not exist.

Instead, KML for AMD64 uses the Interrupt Stack Table (IST) mechanism, which is a newly introduced mechanism of the AMD64 architecture. In AMD64, the task state segment (TSS) has fields for seven pointers to interrupt stacks. In addition, each interrupt gate descriptor has a field for specifying whether the CPU should use the IST mechanism instead of the legacy stack switching, and if so, which interrupt stack should be used. If an interrupt occurs that is specified to use the IST mechanism, the CPU unconditionally switches from a user stack to the interrupt stack specified in the interrupt gate descriptor.

In KML for AMD64, all interruptions and exceptions are handled with the IST mechanism. Therefore, even if an interrupt or exception occurs while a kernel-mode user process is running with its %rsp pointing to an invalid memory, the kernel can keep running without any problem, because the CPU switches stacks automatically.

There are two reasons why KML for AMD64 handles not only double faults but also other interrupts and exceptions with the IST mechanism. One reason is that the overhead incurred by the IST mechanism is negligibly small. Therefore, I think it is better to keep it simple. Handling only double faults with the IST mechanism requires complex modifications to the original kernel, as in KML for IA-32. Second, the red zone of the stack is required by System V Application Binary Interface for AMD64 architecture. The red zone is a 128-byte memory range located just below the stack, that is, from %rsp - 8 to %rsp - 128. System V ABI for AMD64 specifies that user programs can use the red zone for temporary data storage and signal handlers, and interrupt handlers should never touch the zone. If KML handles an interrupt with the usual interrupt handling mechanism, this red zone is corrupted, because a stack is not switched. In this case, some CPU contexts are overwritten to the red zone if a kernel-mode user process is running. Therefore, KML for AMD64 handles all interrupts/exceptions with the IST mechanism in order to provide System V ABI to user programs correctly.

There also is a limitation in KML for IA-32: kernel-mode user processes cannot change their CS segment registers. This is not possible because KML for IA-32 requires at least one scratch register to switch from a user stack to a kernel stack manually when exceptions or interrupts are raised. It prepares the register by using the memory where the CS register is saved. This limitation is not applicable to KML for AMD64, because stacks are switched by the IST mechanism. It is not so important, however, to change the CS segment register in 64-bit mode of AMD64 because there can be only two code segments.

Performance Measurement

To see how much performance improvement is possible, I ran the Wisconsin benchmark for MySQL both on the original Linux kernel and on KML, using sql-bench, which comes with MySQL. The experimental environment is shown in Table 1. In the test on KML, both the MySQL server and the benchmark client were executed as kernel-mode user processes, and the patched GNU C Library was used to eliminate the overhead of system call invocations. In addition, the loop count of the test was increased to 10,000, as the default loop count of 10 was too small to produce meaningful results.

Table 1. Experimental Environment

CPU	Opteron 850 (2.4GHz, L2 cache 1MB) x 4
Memory	8GB (Registered DDR1-333 SDRAM)
Hard disk	146GB (Ultra320 SCSI 73GB x 2, RAID-0, XFS)
OS	Linux kernel 2.6.11 (KML_2.6.11_002)
Libc	GNU libc 2.3.5 + patch for KML
MySQL	MySQL 4.1.11

The result is shown in Table 2. The second column shows the total CPU time consumed by the benchmark. The third and forth columns show the breakdown of the total CPU time. The third column shows the CPU time consumed by the user process, and the forth column shows the CPU time consumed by the kernel.

Table 2. Result of Wisconsin Benchmark (in seconds)

	CPU	User	System
Original Linux	753.86	611.78	142.08
KML	728.61	605.95	122.66

The results show that the total CPU time was improved by about 3%. The user CPU time was improved by about 1%, and the system CPU time was improved by about 14%. The result indicates that KML could improve the performance of database applications slightly by eliminating the overhead of system call invocations.

Conclusion and Future Work

KML is a modified Linux kernel that can execute user processes in kernel mode. By executing in kernel mode, the performance of user programs can be improved by, for example, eliminating the overhead of system call invocations. Besides the performance improvement, KML also can be used to ease inspection and debugging of the kernel and development of kernel modules, because kernel-mode user processes can access the kernel and use large amount of memory and CPU time. I now am considering implementing a helper library to provide kernel-mode user processes with an easy way to access kernel functions and data by exporting them as some kind of shared object.

Resources for this article: /article/8327.