GCC Inline Assembly and Its Usage in the Linux Kernel

Dibyendu Roy

Issue #271, November 2016

Learning GCC inline assembly just got one more benefit. Now let's dive into the kernel to see how a few things actually work.

The GNU C compiler allows you to embed assembly language code into C programs. This tutorial explains how you can do that on the ARM architecture. Because the GNU assembler is similar across architectures, including its syntax and most of its directives, the general concepts of inline assembly remain the same for other architectures as well.

Why should you embed assembly code into C? There are at least two reasons:

Optimization: the compiler tends to optimize unless specified otherwise. For some applications, however, hand-written assembly replaces the most performance-sensitive parts. Because the inline assembler does not require separate assembling and linking, it is more convenient than a separately written assembly module. Inline assembly code can use any C variable or function name that is in scope, so it is easy to integrate it with your C code.
Access to processor-specific instructions: C does not support saturating math operations, co-processor instructions or access to the Current Program Status Register (CPSR). C code also does not support the ARM LDREX/STREX instructions, which ARM uses to implement its atomic operations and locking primitives. Inline assembly is the easiest way to access instructions not supported by the C compiler.

Getting Started

Let's start with the simple example shown in Listing 1.

Listing 1. Example Program

#include<stdio.h>

int add(int x, int y)
{
    int result;
    asm volatile("add %[Rd], %[Rm], %[Rn]"
                 : [Rd] "=r" (result)
                 : [Rm] "r" (x), [Rn] "r" (y)
                 );
    return result;
}

int main(void)
{
    int ret;
    ret = add(5, 7);
    printf("the result is = %d\n", ret);
    return 0;
}

The part from the example program in Listing 1 that needs explanation is this:

asm volatile("add %[Rd], %[Rm], %[Rn]"
             : [Rd] "=r" (result)
             : [Rm] "r" (x), [Rn] "r" (y)
             );

Before explaining the code, let's start with the basics. The asm keyword enables you to embed assembler instructions within C code. GCC has two forms of inline asm statements: basic asm and extended asm. A basic asm is one with no operands, while an extended asm includes one or more operands. Basic asm enables you to include assembly language outside any function. The extended form is preferred for mixing C and assembly languages within a function.

Basic asm and Extended asm

A basic asm statement has the following format:

asm [volatile] (Assembly code)

The volatile qualifier is optional here. All basic asm statements are implicitly volatile.

Assembly code is a string that can contain any assembly instruction(s) recognized by the GNU assembler, including directives. The C compiler does not parse or check the validity of the assembly instructions; parsing and syntax checking are done at the assembling stage. A single asm string may contain multiple assembler instructions. You can use a newline followed by a tab (\n\t) to break and indent the code in the next line.

Below is an example of basic asm in the kernel (arch/arm/include/asm/barrier.h):

#define nop() __asm__ __volatile__("mov\tr0,r0\t@ nop\n\t");

This is simply:

asm volatile("mov r0,r0");

The above inline assembly copies the r0 register content to itself. nop() ends up only introducing some delay.

Note that the asm keyword is a GNU extension. Use __asm__ instead of asm when your code is compiled with -ansi or one of the ISO -std options (for example, -std=c99). The Linux kernel uses both __asm__ and asm for compatibility.

An extended asm statement has the following syntax:

asm [volatile] (Assembly code
                : OutputOperands /* optional */
                : InputOperands  /* optional */
                : Clobbers       /* optional */
                )

The volatile qualifier is optional here. However, asm statements may produce side effects while operating on inputs and generating outputs. You may need to use the volatile qualifier to disable certain optimizations in that case.

Assembly code is a string literal that is a combination of fixed text and tokens that refer to the input and output parameters. OutputOperands and InputOperands are optional comma-separated lists of C variables. Clobbers are also an optional comma-separated list of registers or other special values. Read on for more about these.

Coming Back to the Example

The example program from Listing 1 includes an extended asm statement. Colons delimit each operand list after the assembly code. This is the string literal containing the assembly code:

"add %[Rd], %[Rm], %[Rn]"

The output operand consists of a symbolic name enclosed in square brackets, followed by a constraint string and a C variable name enclosed in parentheses:

[Rd] "=r" (result)

The list of input operands uses the same syntax as the output operands:

[Rm] "r" (x), [Rn] "r" (y)

More on Output, Input and Clobbers

Output Operands

OutputOperands has the following format:

[asmSymbolicName] constraint (cvariablename)

An asm statement has zero or more output operands indicating the names of C variables modified by the assembler code. asmSymbolicName specifies a symbolic name for the operand. Square brackets are used to reference this inside the asm statement. The scope of the name is the asm statement that contains the definition.

You also can refer to operands by their position in the operand list (for example, if there are three operands, %0 refers to the first, %1 to the second and %2 to the third). You can rewrite the example code as:

asm volatile("add %0, %1, %2"
             : "=r" (result)
             : "r" (x), "r" (y)
             );

A constraint is a string constant specifying restrictions on the placement of the operand. Refer to the GCC documentation for a full list of supported constraints for ARM and other architectures. The most commonly used constraints are “r”, which places the operand in a general-purpose register; “m”, which refers to any valid memory location; and “I”, an immediate integer (on ARM, one that is valid as an immediate operand of a data-processing instruction). A constraint character may be prefixed with constraint modifiers:

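“=” marks a write-only operand, used for output operands.
“+” marks a read-write operand, which must be listed as an output operand.
“&” marks an early-clobber operand: an output that the assembly code writes before it has finished reading its inputs, so the compiler must not place it in a register shared with any input.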

Output operands are write-only unless marked with “+”, and input operands are read-only. Constraints without any modifiers are read-only. So, it should be clear why the output operand in the example program has "=r" and the input operands have "r".

But, what if your input and output operands are the same? "+r" can be used as a constraint and must be listed as output operands:

asm volatile("mov %[Rd], %[Rd], lsl #2"
             : [Rd] "+r" (x)
             );

The compiler generates the following assembly for it:

#APP
@ 5 "inline_shift.c" 1
    mov r3, r3, lsl #2
@ 0 "" 2
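As a minimal sketch of the read-write constraint inside a complete function, the hypothetical helper shift_left_2() below wraps the same kind of shift. The function name and the plain-C fallback for non-ARM targets are assumptions added so the example builds anywhere; it assumes an ARMv7-A toolchain for the assembly path.

```c
/* Multiply x by 4 with a single shift, using one read-write operand. */
static inline int shift_left_2(int x)
{
#if defined(__arm__)
    /* "+r": the register holding x is both read and written by the asm. */
    __asm__ ("lsl %[Rd], %[Rd], #2"
             : [Rd] "+r" (x));
#else
    /* Assumed fallback so the sketch also compiles on non-ARM hosts. */
    x = (int)((unsigned int)x << 2);
#endif
    return x;
}
```

Because x appears only once, as an output with "+r", the compiler knows the register must hold the old value on entry and the new value on exit.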

Sometimes a compiler may choose the same register for input and output, even if you do not instruct it to do so. If your code explicitly requires different registers for input and output operands, use the "=&" constraint modifier.
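A case where "=&r" matters can be sketched as follows. The hypothetical function add_then_double() writes its output in the first instruction while one input is still needed by the second, so the output must not share a register with that input. The function name and the non-ARM fallback are assumptions for illustration only.

```c
/* Compute (x + y) * 2 in three instructions. */
static inline int add_then_double(int x, int y)
{
    int result;
#if defined(__arm__)
    __asm__ ("mov %[out], %[a]\n\t"          /* out is written here...          */
             "add %[out], %[out], %[b]\n\t"  /* ...but b is still read here     */
             "add %[out], %[out], %[out]"    /* double the sum                  */
             : [out] "=&r" (result)          /* early clobber: no input sharing */
             : [a] "r" (x), [b] "r" (y));
#else
    /* Assumed fallback so the sketch also compiles on non-ARM hosts. */
    result = (x + y) * 2;
#endif
    return result;
}
```

With a plain "=r", the compiler would be free to put result and y in the same register, and the first mov would then destroy y before the add reads it.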

In an output operand, the constraint is followed by cvariablename, which must be an lvalue expression.

Input Operands

Input operands have the same syntax as output operands, but their constraints should not start with “=” or “+”. Register constraints on input operands carry no modifiers, because they are read-only operands. You should never try to modify the contents of input-only operands. Use "+r" when input and output operands are the same, as explained above.

Clobbers

Sometimes inline assembly may modify additional registers, as side effects, apart from those listed in the output operands. In order to make the compiler aware of this additional change, you need to list them in a clobber list. Clobber list items are either register names or the special clobbers. Each clobber list item is a string constant and is separated by commas. When the compiler allocates registers for input and output operands, it does not use any of the clobbered registers. Clobbered registers are available for any use in the assembler code. Let's take a closer look at an inline add program that does not have a clobber list. The inline assembly code may look like this:

#APP
@ 6 "inline_add.c" 1
    add r3, r3, r2
@ 0 "" 2

Here the code uses register r3 and r2. Now let's modify it and list the r2 and r3 registers in a clobber list:

asm volatile("add %[Rd], %[Rm], %[Rn]"
             : [Rd] "=r" (result)
             : [Rm] "r" (x), [Rn] "r" (y)
             : "r2", "r3"
             );

The assembly code:

#APP
@ 6 "inline_add2.c" 1
    add r4, r1, r0
@ 0 "" 2

Notice that the compiler did not use the r2 and r3 registers, as they were mentioned in the clobber list. The assembly code is free to use r2 and r3 as scratch registers.
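The following sketch shows a clobber list doing real work. The hypothetical function add_via_scratch() hard-codes r2 as a scratch register and uses adds, which updates the condition flags, so both "r2" and "cc" are declared as clobbered. The function name and the non-ARM fallback are assumptions; the assembly path assumes an ARMv7-A toolchain.

```c
/* Add x and y, routing x through a hand-picked scratch register. */
static inline int add_via_scratch(int x, int y)
{
    int result;
#if defined(__arm__)
    __asm__ ("mov r2, %[a]\n\t"         /* r2 is named directly in the template */
             "adds %[out], r2, %[b]"    /* adds also sets the condition flags   */
             : [out] "=r" (result)
             : [a] "r" (x), [b] "r" (y)
             : "r2", "cc");             /* tell the compiler about both effects */
#else
    /* Assumed fallback so the sketch also compiles on non-ARM hosts. */
    result = x + y;
#endif
    return result;
}
```

Without "r2" in the clobber list, the compiler might keep a live value, or one of the operands, in r2 across the statement.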

There are also two special clobbers available apart from registers: “cc” and “memory”. The “cc” clobber indicates that the assembler code modifies the condition flags in the CPSR (Current Program Status Register). The “memory” clobber tells the compiler that the inline assembly code performs memory reads or writes on items other than the input and output operands. The compiler flushes register contents to memory so that memory contains the correct values before the inline asm executes. Moreover, the compiler reloads any values it needs from memory after the inline asm statement so that it sees fresh values. This way, the “memory” clobber forms a read-write compiler barrier across the inline asm statement.

In Linux, a compiler barrier is defined as a macro barrier() that is nothing but a memory clobber:

#define barrier() __asm__ __volatile__("": : :"memory")
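A minimal usage sketch of such a barrier, assuming two hypothetical variables payload and ready and a hypothetical publish() helper, looks like this. The empty template emits no instruction, so the example builds on any architecture GCC supports.

```c
#define barrier() __asm__ __volatile__("" : : : "memory")

static int payload;   /* hypothetical data being handed over */
static int ready;     /* hypothetical "data is valid" flag   */

static inline void publish(int value)
{
    payload = value;
    barrier();        /* the compiler may not move the payload store below here */
    ready = 1;
}
```

This constrains only the compiler. The CPU may still reorder the two stores, which is why the processor barriers discussed later in the article exist.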

Important:

Use __asm__ instead of asm when your code is compiled with -ansi or one of the ISO -std options (for example, -std=c99).
The difference between basic and extended asm is the latter has optional output, input and clobber lists separated by colons (:).
Extended asm statements must be inside a function. Only basic asm statements may be outside functions.
Inside a function, extended asm statements typically produce more efficient and robust code.

Inline Assembly in the Linux Kernel

Now that I've gone through the basics of GCC inline assembly, let's move on to a more interesting topic—its usage in the Linux kernel. The rest of this article is architecture-dependent and is discussed with respect to ARMv7-A. Basic knowledge of ARM and assembly language will be helpful in understanding the rest of the material covered here.

A Little Background

In multitasking computers, shared resource accesses must be restricted to only one modifier at a time. This shared resource can be a shared memory location or a peripheral device. Mutual exclusion, a property of concurrency control, protects such shared resources. In a single processor system, disabling interrupts could be a way of achieving mutual exclusion inside critical sections (although user mode cannot disable interrupts), but this solution fails in SMP systems as disabling interrupts on one processor will not prevent others from entering the critical section. Atomic operations and locks are used to enforce mutual exclusion.

Mutual exclusion enforces atomicity. Let's consider the definition of atomicity first. Any operation is atomic if the operation is entirely successful and its result is visible to all CPUs in the system instantaneously, or it's not successful at all. Atomicity is the basis of all mutual exclusion methods.

All modern computer architectures, including ARM, provide hardware mechanisms for atomically modifying the memory locations.

The ARMv6 architecture introduced the concept of exclusive accesses to memory locations for atomically updating memory. The ARM architecture provides instructions to support exclusive access.

LDREX (Load Exclusive) loads the value of a given memory location into a register and tags that memory location as reserved.

STREX (Store Exclusive) stores an updated value from a register back to a given memory location, provided that no other processor has modified that physical address since the matching LDREX. It writes a status value to a register: 0 if the store succeeded, and 1 otherwise. By checking this status, you can confirm whether any other processor has updated the same location in between.

These instructions need hardware support to tag a physical address as “exclusive” by that specific processor.

Note: ARM says:

If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.

The concept of exclusive accesses also is related to the concepts of local and global monitors, memory types, memory access ordering rules and barrier instructions. See the Resources section of this article for more information.

Implementation of Atomic Operations

Atomic integer operations are generally required to implement counters. As protecting a counter with a complex locking scheme is overkill, atomic_inc() and atomic_dec() are preferable. On ARMv6 and later, the atomic functions in the Linux kernel are implemented using LDREX and STREX.

Take a look at atomic_t defined in include/linux/types.h as the following:

typedef struct {
        int counter;
} atomic_t;

After simplifying the macro definitions, the atomic_add() function definition in kernel-4.6.2 (arch/arm/include/asm/atomic.h) looks like Listing 2.

Listing 2. atomic_add() Implementation

static inline void atomic_add(int i, atomic_t *v)                      
{                                                                       
        unsigned long tmp;                                              
        int result;                                                     
                                                                        
        prefetchw(&v->counter);                                         
        __asm__ __volatile__("@ atomic_add\n"                       
"1:     ldrex   %0, [%3]\n"                                             
"       add     %0, %0, %4\n"                                   
"       strex   %1, %0, [%3]\n"                                         
"       teq     %1, #0\n"                                               
"       bne     1b"                                                     
        : "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)               
        : "r" (&v->counter), "Ir" (i)                                   
        : "cc");                                                        
}

Let's take a closer look at the code shown in Listing 2.

The call below issues a PLD (Preload Data) or PLDW (Preload Data with intent to write) instruction. These are memory-system hints that bring the data into the cache for faster access:

prefetchw(&v->counter);

ldrex loads the “counter” value to “result” and tags that memory location as reserved:

ldrex   %0, [%3]

The following adds i to the “result” and stores that to “result”:

add     %0, %0, %4

strex then tries to write “result” back to memory. Two scenarios are possible here:

strex   %1, %0, [%3]

In the first scenario, strex successfully stores the value of “result” into the memory location and returns 0 in “tmp”. This happens only when no other processor has modified the location between the last load and the store by the current processor. In the second scenario, some other processor has modified the same physical memory in between, so the current processor's store fails and strex returns 1 in “tmp”.

This instruction tests equivalence and sets the Z (zero) flag of CPSR if “tmp” is 0 or clears it if “tmp” is 1:

teq     %1, #0

For a successful store, the Z flag is set, so the branch condition is not satisfied and execution falls through. However, if the store fails, the branch is taken and execution starts again from the ldrex instruction. The loop continues until the store succeeds:

bne     1b

All other atomic operations are similar and use LDREX and STREX.
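The same load, modify, conditional-store, retry pattern can be sketched in portable user-space C with C11 atomics. This is an analogue, not the kernel's code: sketch_atomic_add() is a hypothetical name, and the weak compare-exchange stands in for STREX because it too is allowed to fail and be retried. On ARMv7, compilers typically lower it to an LDREX/STREX loop.

```c
#include <stdatomic.h>

/* Atomically add i to *v by retrying until the store is accepted. */
static inline void sketch_atomic_add(int i, atomic_int *v)
{
    int old = atomic_load_explicit(v, memory_order_relaxed);

    /* If *v changed since we read it (or the weak exchange fails
     * spuriously), old is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak_explicit(v, &old, old + i,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;
}
```

Relaxed ordering is used here because, like the kernel's atomic_add(), the sketch promises atomicity of the update only and no ordering against other memory accesses.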

Barriers

If a sequence of memory operations is independent, the compiler or CPU may perform them in any order to optimize execution. For example, these two stores may be issued in either order:

a = 1;
b = 5;

However, to synchronize with other CPUs or with hardware devices, it is sometimes a requirement that memory-reads (loads) and memory-writes (stores) issue in the order specified in your program code. To enforce this ordering, you need barriers. Barriers are commonly included in kernel locking, scheduling primitives and device driver implementations.

Compiler Barrier

The compiler barrier does not allow the compiler to reorder any memory access across the barrier. As discussed before, the barrier() macro is used as a compiler barrier in Linux:

#define barrier() __asm__ __volatile__("": : :"memory")

Processor Barriers

Processor optimizations, such as caches, write buffers and out-of-order execution, can result in memory operations occurring in a different sequence from the program order. A processor barrier is an implied compiler barrier as well. ARM has three hardware barrier instructions:

Data Memory Barrier (DMB) ensures that all memory accesses (in program order) before the barrier are visible in the system before any explicit memory accesses after the barrier. It does not affect instruction prefetch or execution of the next non-memory data access.
Data Synchronization Barrier (DSB) ensures that all pending explicit data accesses complete before any additional instructions execute after the barrier. It does not affect prefetching of instructions.
Instruction Synchronization Barrier (ISB) flushes the pipeline and prefetch buffer(s) so that once ISB has completed, the processor can fetch the next instructions from cache or memory.

Listing 3. Implementation of the Memory Barrier

#define dmb(option) __asm__ __volatile__ ("dmb " #option : : : "memory")
#define dsb(option) __asm__ __volatile__ ("dsb " #option : : : "memory")
#define isb(option) __asm__ __volatile__ ("isb " #option : : : "memory")

SY is the default. It applies to the full system, including all processors and peripherals. Refer to the ARM manual for other options. Linux provides various memory barrier macros that are mapped to the ARM hardware barrier instructions: read memory barrier, rmb(); write memory barrier, wmb(); and full memory barrier, mb(). There also are corresponding SMP versions: smp_rmb(), smp_wmb() and smp_mb(). When the kernel is compiled without CONFIG_SMP, smp_* are simply barrier() macros.
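To show where a full barrier sits in practice, here is a user-space sketch of a writer that publishes data and a reader that consumes it. The names sketch_mb(), writer(), reader(), data and flag are all hypothetical. On 32-bit ARMv7 and later the macro expands to dmb ish; elsewhere it falls back, as an assumption, to GCC's __sync_synchronize() builtin.

```c
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
#define sketch_mb() __asm__ __volatile__ ("dmb ish" : : : "memory")
#else
/* Assumed fallback: GCC's full-barrier builtin on other targets. */
#define sketch_mb() __sync_synchronize()
#endif

static volatile int data;   /* hypothetical payload      */
static volatile int flag;   /* hypothetical "ready" flag */

static inline void writer(int value)
{
    data = value;
    sketch_mb();            /* data must be visible before flag is set */
    flag = 1;
}

/* Returns 1 and fills *out if data has been published, 0 otherwise. */
static inline int reader(int *out)
{
    if (!flag)
        return 0;
    sketch_mb();            /* do not read data ahead of the flag check */
    *out = data;
    return 1;
}
```

The barriers are paired: the one in writer() orders the two stores, and the one in reader() orders the two loads. Using only one of them does not guarantee that the reader sees the published value.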

Spinlock

To execute any critical section code atomically, you need to ensure that no two threads of execution execute the critical section concurrently. As described in Robert Love's Linux Kernel Development, “The term threads of execution implies any instance of executing code. This includes, for example, a task in the kernel, an interrupt handler, a bottom half, or a kernel thread.”

For uniprocessor systems, spinlock implementation boils down to disabling preemption or local interrupts. spin_lock() disables preemption. spin_lock_irq() and spin_lock_irqsave() disable local interrupts. But, this is not sufficient for SMP, as other processors are free to execute the critical section code simultaneously.

Listing 4. Spinlock Implementation

static inline void arch_spin_lock(arch_spinlock_t *lock)
{
    unsigned long tmp;
    u32 newval;
    arch_spinlock_t lockval;

    prefetchw(&lock->slock);
    __asm__ __volatile__(
"1: ldrex   %0, [%3]\n"
"   add %1, %0, %4\n"
"   strex   %2, %1, [%3]\n"
"   teq %2, #0\n"
"   bne 1b"
    : "=&r" (lockval), "=&r" (newval), "=&r" (tmp)
    : "r" (&lock->slock), "I" (1 << TICKET_SHIFT)
    : "cc");

    while (lockval.tickets.next != lockval.tickets.owner) {
        wfe();
        lockval.tickets.owner = ACCESS_ONCE(lock->tickets.owner);
    }

    smp_mb();
}

static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
    smp_mb();
    lock->tickets.owner++;
    dsb_sev();
}

#define wfe()   __asm__ __volatile__ ("wfe" : : : "memory")

#define sev()   __asm__ __volatile__ ("sev" : : : "memory")

Linux uses an improved version of the ticket lock algorithm to implement spinlock. Like atomic instructions, the spinlock implementation uses LDREX/STREX.

The wfe (Wait For Event) and sev (Send Event) ARM instructions need some introduction here. wfe puts the ARM processor into a low-power state until a wake-up event occurs. The wake-up events for wfe include the execution of an sev instruction on any processor in an SMP system, an interrupt, an asynchronous abort or a debug event. While contending for a spinlock, the processor goes into a low-power state instead of busy-waiting, which saves power. The ACCESS_ONCE macro prevents the compiler from optimizing away the repeated read, forcing it to fetch the lock->tickets.owner value each time through the loop. A memory barrier, smp_mb(), is required after you get a lock and before you release it, so that memory accesses inside the critical section are not reordered outside it and other processors observe them in order.

Note: acquiring and releasing a lock should be atomic. Otherwise, more than one thread of execution may acquire the same lock in parallel causing a race condition.
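The ticket idea itself can be sketched in portable C11, independent of the ARM instructions. This is a simplified illustration, not the kernel's layout: struct ticket_lock and its functions are hypothetical names, the two counters are separate words instead of two halves packed into one, and the waiter spins instead of sleeping in wfe.

```c
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;    /* next ticket to hand out          */
    atomic_uint owner;   /* ticket currently allowed to run  */
};

static inline void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

static inline void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a ticket atomically; the ldrex/strex loop does this in Listing 4. */
    unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);

    /* Wait until our ticket is being served. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;
}

static inline void ticket_lock_release(struct ticket_lock *l)
{
    /* Serve the next ticket; release ordering publishes the critical section. */
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Because tickets are handed out in order and served in order, waiters acquire the lock first come, first served, which is the fairness property that motivates ticket locks.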

Semaphore

Semaphores and mutexes can sleep, unlike a spinlock. When a task is holding a semaphore and another task attempts to acquire it, the semaphore places the contending task onto a wait queue and puts it to sleep. When the semaphore becomes available, the scheduler wakes one of the tasks on the wait queue to acquire the semaphore. As you can see in Listing 5, the semaphore implementation uses raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore() to protect its own state. If another task is holding the semaphore, the current task releases the spinlock and goes to sleep (as sleeping is not an option while holding the spinlock), and after waking up, it re-acquires the spinlock. up(), which releases the semaphore, also uses the spinlock. Unlike with mutexes, up() may be called from any context and even by tasks that have never called down().

Listing 5. Semaphore Implementation

int down_interruptible(struct semaphore *sem)
{
    unsigned long flags;
    int result = 0;

    raw_spin_lock_irqsave(&sem->lock, flags);
    if (likely(sem->count > 0))
        sem->count--;
    else
        result = __down_interruptible(sem);
    raw_spin_unlock_irqrestore(&sem->lock, flags);

    return result;
}

Mutex

A call to mutex_lock() may take two different paths. First, it calls __mutex_fastpath_lock() to try to acquire the mutex. If that fails, it falls back to __mutex_lock_slowpath(). In the latter case, the task is added to the wait queue and sleeps until woken up by the unlock path.

Listing 6. Mutex Implementation

void __sched mutex_lock(struct mutex *lock)
{   
    might_sleep();
    /*
     * The locking fastpath is the 1->0 transition from
     * 'unlocked' into 'locked' state.
     */
    __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
    mutex_set_owner(lock);
}

__mutex_fastpath_lock() boils down to a call to atomic_sub_return_relaxed(), an atomic operation that subtracts i from v and returns the result. Similarly, mutex_unlock() uses atomic_add_return_relaxed() to increment the counter atomically.
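The fast-path idea can be sketched in user-space C11 as follows. The names sketch_fastpath_lock(), sketch_slowpath() and slowpath_calls are hypothetical, and the counter convention (1 for unlocked, 0 for locked, negative for locked with possible waiters) is stated here as an assumption consistent with the 1->0 transition described in Listing 6. The stub slow path only counts calls; a real one would queue the task and sleep.

```c
#include <stdatomic.h>

static int slowpath_calls;   /* hypothetical counter, for demonstration only */

/* Stand-in for the slow path: a real one would enqueue and sleep. */
static void sketch_slowpath(atomic_int *count)
{
    (void)count;
    slowpath_calls++;
}

/* Try the 1 -> 0 transition; call fail_fn if the mutex was already held. */
static inline void sketch_fastpath_lock(atomic_int *count,
                                        void (*fail_fn)(atomic_int *))
{
    /* fetch_sub returns the old value, so old - 1 is the new count. */
    if (atomic_fetch_sub_explicit(count, 1, memory_order_acquire) - 1 < 0)
        fail_fn(count);
}
```

In the uncontended case the whole lock operation is a single atomic decrement, which is why the fast path is worth separating from the slow path.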

Wrapping It All Up

This article neither aims to provide algorithmic details of kernel implementation of locks and barriers nor does it provide ARM architecture details. The goal is to provide the basics of GCC inline assembly and show how it can help you better understand the Linux kernel.

Resources

Using the GNU Compiler Collection: https://gcc.gnu.org

ARM Architecture Reference Manual ARMv7-A and ARMv7-R Edition: infocenter.arm.com

ARM Synchronization Primitives Development Article: https://developer.arm.com

Cortex-A Series Programmer's Guide: infocenter.arm.com

Linux Kernel Development 3rd Edition by Robert Love: https://www.amazon.com/Linux-Kernel-Development-Robert-Love/dp/0672329468

Inline assembler (Wikipedia): https://en.wikipedia.org/wiki/Inline_assembler

ARM GCC Inline Assembler Cookbook: www.ethernut.de/en/documents/arm-inline-asm.html

See also the kernel documentation on memory barriers, spinlock and mutex design.