Device Drivers Concluded

Georg V. Zezschwitz

Issue #28, August 1996

This is the last of five articles about character device drivers. In this final section, Georg deals with memory mapping devices, beginning with an overall description of Linux memory management concepts.

Though only a few drivers implement the memory mapping technique, it gives an interesting insight into the Linux system. I introduce memory management and its features, enabling us to play with the console, include memory mapping in drivers, and crash systems...

Address Spaces and Other Unreal Things

Since the days of the 80386, the Intel world has supported a technique called virtual addressing. Coming from the Z80 and 68000 world, my first thought about this was: “You can allocate more memory than you have as physical RAM, as some addresses will be associated with portions of your hard disk”.

To be more academic: Every address used by the program to access memory (no matter whether data or program code) will be translated--either into a physical address in the physical RAM or an exception, which is dealt with by the OS in order to give you the memory you required. Sometimes, however, the access to that location in virtual memory reveals that the program is out of order—in this case, the OS should cause a “real” exception (usually SIGSEGV, signal 11).

The smallest unit of address translation is the page, which is 4 kB on Intel architectures and 8 kB on Alpha (defined in asm/page.h).
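To make the page granularity concrete, here is a minimal user-space sketch that splits an address into a page number and an offset within the page. The helper names (page_size, page_number, page_offset) are made up for this illustration; sysconf(_SC_PAGESIZE) is the user-space way to ask for the page size, while asm/page.h is the kernel-side definition.

```c
#include <assert.h>
#include <unistd.h>

/* Page size of the running system as reported by sysconf().
 * Hypothetical helper, not part of any kernel API. */
unsigned long page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    assert(sz > 0);
    return (unsigned long) sz;
}

/* Index of the page that contains addr. */
unsigned long page_number(unsigned long addr, unsigned long pgsize)
{
    return addr / pgsize;
}

/* Byte offset of addr inside its page. */
unsigned long page_offset(unsigned long addr, unsigned long pgsize)
{
    return addr % pgsize;
}
```

With 4 kB pages, the address 0xB8123 lies in page 0xB8 at offset 0x123.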

When trying to understand the process of address resolution, you will enter a whole zoo of page table descriptors, segment descriptors, page table flags and different address spaces. For now, let's just say the logical address is what the program uses; using a segment descriptor, it is turned into a linear address, which is then resolved to a physical address (or a fault) using the paging mechanism. The Linux Kernel Hacker's Guide spends 20 pages on a rather short description of all these beasties, and I see no chance of making a more succinct explanation.

For any understanding of the building, administration, and scope of pages when using Linux, and of how the underlying technique works (especially on the Intel family), you have to read the Linux Kernel Hacker's Guide. It is freely available by ftp from tsx-11.mit.edu in the /pub/linux/docs/LDP/ directory. Though the book is slightly old [that's a gentle understatement. ED], nothing has changed in the internals of the i386, and other processors look similar (in particular, the Pentium is exactly like a 386).

Pages—More Than Just a Sheet of Memory

If you want to learn about page management, either you start reading the nice guide now, or you believe this short and abstract overview:

Every process has a virtual address space implemented by several CPU registers which are changed during context switches (this is the zoo of selectors and page description pointers). By these registers, the CPU accesses all the memory segments it needs.
Multiple levels of translation tables are used to translate the linear address given by the process to the physical address in RAM. The translation tables all reside in memory. They are automatically looked up by the CPU hardware, but they are built and updated by the OS. They are called page descriptor tables. In these tables there is one entry (i.e., a “page descriptor”) for every page in the process's address space—we're talking of the logical addresses, also called virtual addresses.
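As an illustration of the multi-level lookup, the following sketch splits a 32-bit linear address the way the classic i386 two-level scheme with 4 kB pages does: 10 bits of page directory index, 10 bits of page table index, and 12 bits of offset. The struct and function names are invented for this example; the kernel uses its own macros for the same arithmetic.

```c
#include <assert.h>

/* The three parts of a 32-bit i386 linear address (4 kB pages).
 * Hypothetical type, for illustration only. */
struct i386_linear {
    unsigned int dir_index;    /* entry in the page directory */
    unsigned int table_index;  /* entry in the selected page table */
    unsigned int offset;       /* byte within the 4 kB page */
};

struct i386_linear split_linear(unsigned long linear)
{
    struct i386_linear s;

    assert(linear <= 0xFFFFFFFFUL);          /* 32-bit addresses only */
    s.dir_index   = (unsigned int) ((linear >> 22) & 0x3FF);
    s.table_index = (unsigned int) ((linear >> 12) & 0x3FF);
    s.offset      = (unsigned int) (linear & 0xFFF);
    return s;
}
```

The CPU performs this split in hardware on every access; the OS only has to fill in the tables the two indices select.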

We concentrate now on a few main aspects of pages as seen by the CPU:

A page may be “present” or not—depending on whether it is present in physical memory or not (if it has been swapped-out, or it is a page which has not yet been loaded). A flag in the page descriptor is used to indicate this status. Access to a non-present page is called a “major” page fault. The fault is handled in Linux by the function do_no_page(), in mm/memory.c. Linux counts page faults for each process in the field maj_flt in struct task_struct.
A page may be write-protected—any attempt to write on the page will cause a fault (called “minor page fault”, handled in do_wp_page() and counted in the min_flt field of struct task_struct).
A page belongs to the address space of one task or several of them; each such task holds a descriptor for the page. “Task” is what microprocessor technicians call a process.
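The per-process fault counters can be observed from user space. The sketch below, with a made-up function name, touches freshly allocated pages and asks getrusage() how many faults were charged meanwhile. Note one assumption: on current systems getrusage() classifies ru_minflt as faults served without disk I/O and ru_majflt as faults that needed I/O, which is not necessarily the same split as the present/write-protected distinction described above, so the sketch simply adds the two.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>

/* Touch `pages` freshly allocated pages and return the number of page
 * faults (minor plus major) charged to this process while doing so.
 * Returns -1 on error. Hypothetical helper for illustration. */
long faults_while_touching(size_t pages)
{
    struct rusage before, after;
    long pgsize = sysconf(_SC_PAGESIZE);
    volatile char *buf;
    size_t i;

    assert(pages > 0);
    if (pgsize <= 0 || getrusage(RUSAGE_SELF, &before) != 0)
        return -1;
    buf = malloc(pages * (size_t) pgsize);
    if (!buf)
        return -1;
    /* The first write to each page may fault it in. */
    for (i = 0; i < pages; i++)
        buf[i * (size_t) pgsize] = 1;
    if (getrusage(RUSAGE_SELF, &after) != 0) {
        free((void *) buf);
        return -1;
    }
    free((void *) buf);
    return (after.ru_minflt - before.ru_minflt)
         + (after.ru_majflt - before.ru_majflt);
}
```

The exact count depends on the allocator and the kernel, so treat the result as an indication rather than a fixed number.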

Other important features of pages, as seen by the OS, are:

If multiple processes use the same page of physical memory, they are said to “share” it. The processes hold separate page descriptors for a shared page, and the entries can differ: for example, one process can have write permission on the page and another process may not.
A page may be marked as copy on write (grep for COW in kernel sources). If, for example, a process forks, the child will share the data segments with the parent, but both will be write-protected: the pages are shared for reading. As soon as one program writes onto a page, the page is doubled and the writing program gets a new page; the other keeps its old one, with a decremented “share count”. If the share count is already one, no copy is performed when a minor fault happens, and the page is just marked as writable. Copy-on-write minimizes memory usage.
A page may be locked against being swapped out. All kernel modules and the kernel itself reside in locked pages. As you might remember from the last installment, pages which are used for DMA-transfers have to be protected against being swapped out.
Page descriptors may also point to addresses not located in physical RAM, but rather in the ROM of certain peripherals, RAM buffers for video cards etc., or PCI buffers. Traditionally, on Intel architectures, the range for the first two groups is from 640 kB to 1024 kB, and the range for the PCI buffers is above high_memory (the top of physical RAM, defined in asm/pgtable.h). The range from 640 kB to 1024 kB is not used by Linux, and is tagged as reserved in the mem_map structure. It is the “384k reserved” appearing in the first kernel message after the BogoMips calculation.
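The visible effect of copy on write after a fork can be checked with a few lines of user-space code. This is only a sketch with invented names: the child writes to a variable in the data segment, and the parent verifies that its own copy is untouched. The copying itself happens inside the kernel and cannot be seen directly from here.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value = 1;   /* lives in the data segment */

/* Fork, let the child overwrite shared_value, and return the value
 * the parent still sees afterwards (-1 on error). */
int parent_value_after_child_write(void)
{
    pid_t pid;
    int status;

    assert(shared_value == 1);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* The write faults on the write-protected page, and the
         * child continues on its own private copy. */
        shared_value = 42;
        _exit(shared_value == 42 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return shared_value;       /* still the parent's original page */
}
```

The parent gets 1 back, although the child saw 42 in "the same" variable.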

Virtual memory allows quite beautiful things like:

Demand-loading a program instead of loading it totally into memory at startup: whenever you start a program, it gets its own virtual address space, which is just associated with some blocks on your filesystem and some space for variables, but the memory is allocated and loading is performed only when you really access the different portions of the program.
Swapping, in case your memory gets tight. This means whenever Linux needs memory for itself or a program and unused memory gets tight, it will try to shrink the buffers for the file systems, try to “forget” pages already allocated for program code that is executed (they can be reloaded from disk at any time anyway), or swap some pages containing user data to the swap partition of the hard disk.
Memory Protection. Each process has its own address space and can't look at memory belonging to other processes.
Memory Mapping: Just declare a portion or the whole of a file you have opened as a part of your memory, by means of a simple function call.

Memory Mapping Example

Here we are. The first assumption you should be able to make when thinking about mmaping (Memory Mapping; usually pronounced em-mapping) a character device driver is that you have something like a numbered position and length of that device. Of course, you could count the nth byte in the stream of characters coming from your serial line, but the mmap paradigm applies much more easily to devices that have a well-defined beginning and end.

One character “device” that is used whenever you use svgalib or the X server is /dev/mem: a device representing your physical memory. The X server and svgalib use it to map the video buffer of your graphics adaptor to the user space of the server or the user process.

Once upon a time (am I that old?) people wrote games like Tetris to act on text consoles using BASIC. They tended to write directly into the video RAM rather than using the bloody slow means of BASIC commands. That was exactly like using mmapping.

Looking for a small example to play with mmap(), I wrote a small program called nasty. As you might know, Arabic writing runs right to left. Though I suppose nobody will prefer this style with Latin letters, the following program gives you an idea of this style. Note that nasty only runs on Intel architectures with VGA.

If you ever run this program, run it as root (because you otherwise won't get access to /dev/mem), run it in text-mode (because you won't see anything when using X) and run it with a VGA or EGA (because the program uses addresses specific to such boards). You might see nothing. If so, try to scroll back a few lines (Ctrl-PageUp) to the beginning of your screen buffer.

/* nasty.c - flips right and left on the
 * VGA console --- "Arabic" display */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
int main (int argc, char **argv) {
    FILE    *fh;
    short   *vid_addr, temp;
    int     x, y, ofs;
    fh = fopen ("/dev/mem", "r+");
    if (!fh) {
        /* usually means: not running as root */
        perror ("/dev/mem");
        return 1;
    }
    vid_addr = (short*) mmap (
        /* where to map to: don't mind */
        NULL,
        /* how many bytes ? */
        0x4000,
        /* want to read and write */
        PROT_READ | PROT_WRITE,
        /* no copy on write */
        MAP_SHARED,
        /* handle to /dev/mem */
        fileno (fh),
        /* hopefully the Text-buffer :-)*/
        0xB8000);
    /* mmap() signals failure with MAP_FAILED,
     * not with a NULL pointer */
    if (vid_addr == (short*) MAP_FAILED) {
        perror ("mmap");
        fclose (fh);
        return 1;
    }
    /* each cell is 2 bytes (character and
     * attribute); swap the cells of every
     * 80-column line end for end */
    for (y = 0; y < 100; y++)
        for (x = 0; x < 40; x++) {
            ofs = y*80;
            temp = vid_addr [ofs+x];
            vid_addr [ofs+x] =
              vid_addr [ofs+79-x];
            vid_addr [ofs+79-x] = temp;
        }
    munmap (vid_addr, 0x4000);
    fclose (fh);
    return 0;
}

Playing with mmap()

What could you change in the mmap() call above?

You might change the rights for the mapped pages by removing one of the PROT flags asking for the right to read, write or execute (PROT_READ, PROT_WRITE, and PROT_EXEC) the data range mapped to the user program.

You might decide to replace MAP_SHARED with MAP_PRIVATE, which lets you read the pages without ever writing to the underlying memory. (The copy-on-write flag will be set: you will still be able to write to the mapped text buffer, but the modified content will go to your private copy of the pages and will not be flushed back to the display buffer.)
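The difference between the two flags is easy to try on an ordinary file instead of the video buffer. The following sketch (function name and temporary file path are assumptions made for this example) writes one byte through a mapping and then reads the file back with pread(): with MAP_SHARED the file changes, with MAP_PRIVATE it does not.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a one-byte temporary file containing 'a', write 'X' through a
 * mapping made with map_flags (MAP_SHARED or MAP_PRIVATE), and return
 * the byte found in the file afterwards, or -1 on error. */
int byte_in_file_after_mapped_write(int map_flags)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";
    char c = 'a';
    char *p;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* file vanishes on close */
    if (write(fd, &c, 1) == 1) {
        p = mmap(NULL, 1, PROT_READ | PROT_WRITE, map_flags, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'X';                 /* private: copy on write */
            assert(p[0] == 'X');        /* the mapping itself always changes */
            munmap(p, 1);
            if (pread(fd, &c, 1, 0) == 1)
                result = (unsigned char) c;
        }
    }
    close(fd);
    return result;
}
```

Calling it with MAP_SHARED yields 'X', and with MAP_PRIVATE it yields the original 'a'.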

Changing the offset parameter would allow you to adapt this nasty program to Hercules Monochrome Adapters (by using 0xB0000 as text buffer instead of 0xB8000) or to crash a machine (by using another address).

You might decide to apply the mmap() call to a disk file instead of system memory, converting the contents of the file to our “Arabic” style (be sure that the length you mmap, and every address you access, stays within the real file length). Don't worry if your old mmap man page tells you it is a BSD page: currently the question is who documents the features of Linux rather than who implements them...
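A minimal sketch of that idea, assuming a plain text file with newline-terminated lines: the file is mapped with MAP_SHARED and every line is mirrored in place. The function names are invented for this example, and the mapped length is taken from fstat() so that no access goes beyond the real file length.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Reverse the bytes in the inclusive range [first, last]. */
static void reverse_span(char *first, char *last)
{
    while (first < last) {
        char t = *first;
        *first++ = *last;
        *last-- = t;
    }
}

/* Map the named file read-write and mirror each line in place.
 * Returns 0 on success, -1 on error. */
int mirror_lines(const char *path)
{
    struct stat st;
    char *map;
    size_t len, start = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);                 /* nothing to do for an empty file */
        return st.st_size == 0 ? 0 : -1;
    }
    len = (size_t) st.st_size;
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping stays valid */
    if (map == MAP_FAILED)
        return -1;
    while (start < len) {
        char *nl = memchr(map + start, '\n', len - start);
        size_t stop = nl ? (size_t) (nl - map) : len;
        if (stop > start)
            reverse_span(map + start, map + stop - 1);
        start = stop + 1;          /* skip the newline itself */
    }
    return munmap(map, len);
}
```

No read() or write() call appears anywhere: the changed pages find their way back to the file through the shared mapping.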

Instead of passing NULL as first parameter, you might specify an address to which you wish to map the pages. Using recent Linux versions, this wish will be ignored, unless you add the MAP_FIXED flag. In this case Linux will un-map any previous mapping at that address and replace it with the desired mmap. If you use this (I don't know why you should), make sure your desired address fits a page boundary ((addr & PAGE_MASK) == addr).

At last, we have really hit one of the favorite uses of mmap, especially when you deal with small portions of large files like databases. You will find it helpful (and faster) to map the whole file to memory, in order to read and write it as if it were real memory, leaving all the oddities of caching to the buffer algorithms of Linux. It will usually work much faster than fread() and fwrite().

VMA and other Cyberspaces

The guy who has to care for this beautiful stuff is your poor device driver writer. While support for mmap() on files is done by the kernel (by each file system type, indeed), the mapping method for devices has to be directly supported by the drivers, by providing a suitable entry in the fops structure, which we first introduced in the March issue of LJ.

First, we have a look at one of the few “real” implementations of such support, basing the discussion on the /dev/mem driver. Next, we go on with a particular implementation useful for frame grabbers, lab devices with DMA-support and probably other peripherals.

To begin with, whenever the user calls mmap(), the call will reach do_mmap(), defined in the mm/mmap.c file. do_mmap() does two important things:

It checks the permissions for reading and writing the file handle against what was requested of mmap(). Moreover, tests for crossing the 4GB limit on Intel machines and other knock-out criteria are performed.
If all is well, a struct vm_area_struct variable is generated for the new piece of virtual memory. Each task can own several of these structures, “virtual memory areas” (VMAs).

VMAs require some explanation: they represent the addresses, methods, permissions and flags of portions of the user address space. Your mmaped region will keep its own vm_area_struct entry in the task head. VMA structures are maintained by the kernel and ordered in balanced tree structures to achieve fast access.

The fields of VMAs are defined in linux/mm.h. The number and content might be explored by looking at /proc/pid/maps for any running process, where pid is the process ID of the requested process. Let's do so for our small nasty program, compiled with gcc-ELF. While the program runs, your /proc/pid/maps table will look somewhat like this (without the comments):

# /dev/sdb2: nasty css
08000000-08001000 rwxp 00000000 08:12 36890
# /dev/sdb2: nasty dss
08001000-08002000 rw-p 00000000 08:12 36890
# bss for nasty
08002000-08008000 rwxp 00000000 00:00 0
# /dev/sda2: /lib/ld-linux.so.1.7.3 css
40000000-40005000 r-xp 00000000 08:02 38908
# /dev/sda2: /lib/ld-linux.so.1.7.3 dss
40005000-40006000 rw-p 00004000 08:02 38908
# bss for ld-linux.so
40006000-40007000 rw-p 00000000 00:00 0
# /dev/sda2: /lib/libc.so.5.2.18 css
40009000-4007f000 rwxp 00000000 08:02 38778
# /dev/sda2: /lib/libc.so.5.2.18 dss
4007f000-40084000 rw-p 00075000 08:02 38778
# bss for libc.so
40084000-400b6000 rw-p 00000000 00:00 0
# /dev/sda2: /dev/mem (our mmap)
400b6000-400c6000 rw-s 000b8000 08:02 32767
# the user stack
bfffe000-c0000000 rwxp fffff000 00:00 0

The first two fields on each line, separated by a dash, represent the address range the data is mmaped to. The next field shows the permissions for those pages (r is for read, w is for write, x is for execute, p is for private, and s is for shared). The offset in the file mmaped from is given next, followed by the device and the inode number of the file. The device number represents a mounted (hard) disk (e.g., 03:01 is /dev/hda1, 08:01 is /dev/sda1). The easiest (though slow) way to figure out the file name for the given inode number is:

cd /mount/point
find . -inum inode-number -print
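The field layout of a maps line described above lends itself to a single sscanf() call. Here is a small sketch; the struct and function names are made up for this example, and it only handles the seven fields shown in the listing.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of /proc/pid/maps (hypothetical type). */
struct map_entry {
    unsigned long start, end;      /* mapped address range */
    char perms[5];                 /* e.g. "rw-s" */
    unsigned long offset;          /* offset into the mapped file */
    unsigned int dev_major, dev_minor;
    unsigned long inode;
};

/* Parse one line; returns 1 on success, 0 if the line does not match. */
int parse_maps_line(const char *line, struct map_entry *e)
{
    assert(line != NULL && e != NULL);
    return sscanf(line, "%lx-%lx %4s %lx %x:%x %lu",
                  &e->start, &e->end, e->perms, &e->offset,
                  &e->dev_major, &e->dev_minor, &e->inode) == 7;
}
```

Fed with the /dev/mem line from the listing, it yields the offset 0xb8000 and the permission string "rw-s", whose last character tells a shared mapping from a private one.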

If you try to understand the lines and their comments, please notice that Linux separates data into “code storage segments” or css, sometimes called “text” segments; “data storage segments” or dss, containing initialized data structures; and bss segments (the name historically stands for “block started by symbol”), areas for variables that are allocated at execution time and initialized to zero. As no initial values for the variables in the bss have to be loaded from disk, the bss items in the list show no file device (“0” as a major number is NODEV). This shows another usage of mmap: you can pass the MAP_ANONYMOUS flag, with no file handle behind the mapping, to request portions of free memory for your program. (In fact, some versions of malloc get their memory this way.)
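An anonymous mapping can be sketched in a few lines. The wrapper names anon_alloc and anon_free are invented for this example; the file descriptor argument is -1 because no file stands behind the mapping, and the pages arrive zero-filled, just like bss.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Request len bytes of zero-filled memory straight from the kernel.
 * Returns NULL on failure. Hypothetical helper for illustration. */
void *anon_alloc(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Give the memory back; len must match the allocation. */
int anon_free(void *p, size_t len)
{
    return munmap(p, len);
}
```

Unlike free(), munmap() needs the length again, so a real allocator built this way has to remember it somewhere.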

Your Turn

When your device driver gets the call from do_mmap(), a VMA has already been created for the new mapping, but not yet inserted into the task's memory structure.

The device driver function should comply with this prototype:

int skel_mmap (struct inode *inode,
               struct file *file,
               struct vm_area_struct *vma)

vma->vm_start will contain the address in user space to be mapped to. vma->vm_end contains its end; the difference between these two elements represents the length argument in the user's original call to mmap(). vma->vm_offset is the offset on the mmaped file, identical to the offset argument passed to the system call.

Let's explore how the /dev/mem driver performs the mapping. You find the code lines in drivers/char/mem.c in the function mmap_mem(). If you look for something complicated, you will be disappointed: it calls only remap_page_range(). If you want to understand what happens here, you really should read the 20 pages from the Kernel Hacker's Guide. In short, the page descriptors for the given process address space are generated and filled with links to the physical memory. Note, the VMA structure is for Linux memory management, whereas the page descriptors are directly interpreted by the CPU for address resolution.

If remap_page_range() returns zero, saying that no error has occurred, your mmap-handler should do the same. In this case, do_mmap() will return the address to which the pages were mapped. Any other return value is treated as an error.

A Concrete Driver

It will be difficult to give code lines for all possible applications of the mmap technique in the different character drivers. Our concrete example is of a laboratory device with its own RAM, CPU and, of course, analog to digital converters, digital to analog converters, digital inputs and outputs, and clocks (and bells and whistles).

The lab device we dealt with is able to sample steadily into its memory and report the status of its work when asked via the character channel, which is an ASCII stream-like channel. The command-based interaction is done via the character device driver we implemented and its read and write calls.

The actual mass transfer of data is done independently from that: by sending a command like TOHOST interface address, length, host address, the lab device will raise an interrupt and tell the PC it wants to send a certain amount of data to a given address at the host by DMA. But where should we put that data? We decided not to mix up the clear character communication with the mass data transfer. Moreover, as the user could even upload its own commands to the device, we could make no assumptions about the ordering and the meaning of the data.

So we decided to hand full control to the user and allow him to request DMA-able memory portions mapped to the user address space, and check every DMA request coming from the lab device against the list of those areas. In other words, we implemented something like a skel_malloc and skel_free by means of ioctl() commands and disallowed any transfer to any other region in order to keep the whole thing safe.
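The check of a DMA request against the list of registered areas can be sketched in plain C. The struct below is an assumption modeled on the dma_adr, dma_size and next fields that appear in the driver listing later in this article; the function name is invented. A request is accepted only if it lies completely inside one registered area.

```c
#include <assert.h>
#include <stddef.h>

/* One registered DMA buffer (hypothetical stand-in for dma_desc). */
struct dma_area {
    unsigned long dma_adr;     /* start address of the buffer */
    unsigned long dma_size;    /* its length in bytes */
    struct dma_area *next;
};

/* Return 1 if [adr, adr + len) lies inside a single registered area,
 * 0 otherwise. Written so that adr + len is never computed and thus
 * cannot overflow. */
int dma_request_is_valid(const struct dma_area *list,
                         unsigned long adr, unsigned long len)
{
    const struct dma_area *a;

    if (len == 0)
        return 0;
    for (a = list; a; a = a->next) {
        assert(a->dma_size > 0);
        if (adr >= a->dma_adr &&
            len <= a->dma_size &&
            adr - a->dma_adr <= a->dma_size - len)
            return 1;
    }
    return 0;
}
```

A request that straddles two neighbouring areas is rejected on purpose, since two separately allocated buffers need not be adjacent in memory.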

You might wonder why we did not use the direct mmap(). Mostly, because there is no equivalent munmap. Your driver will not be notified when the mapping to the open file is destroyed. Linux does it all by itself: it removes the vma structure, destroys the page descriptor tables and decreases the reference count for the shared pages.

As we have to allocate the DMA-able buffer by kmalloc(), we have to free it by kfree(). Linux won't allow us to do so when automatically unmapping the user reference, but without the user reference, we don't need the buffer any more. Therefore, we implemented a skel_malloc() which actually allocates the driver buffer and remaps it to the user space as well, and a skel_free() which releases that space and unmaps it (after checking if a DMA-transfer is running).

We could implement the remapping in the user library released with our device driver by the same means used by the nasty program above. But, for good reason, /dev/mem can be read and written only by root, and programs accessing the device driver should be able to run as a normal user, too.

Two tricks are used in our driver. First, we modify the mem_map array telling Linux about the usage and permissions of our pages of physical memory. mem_map is an array of mem_map_t structures, and is used to keep information about all the physical memory.

For all allocated pages we set the reserved flag. This is a quick and dirty method, but it reaches its aim under all Linux versions (starting at least at 1.2.x): Linux keeps its hands off our pages! It considers them as if they were a video buffer, a ROM, or anything else it can't swap or release into free memory. The mem_map array itself uses this trick to protect itself from processes hungry for memory.

The second trick we use is quickly generating a pseudo file which looks something like an opened /dev/mem. We rebuild the mmap_mem() call from the /dev/mem driver, because it is not exported in the kernel symbol table, and simply apply the same small call to remap_page_range().

Moreover, DMA-buffers allocated by our skel_malloc() call are registered in lists in order to check whether a request for a DMA transfer is going to a valid memory area. The lists are also used to free the allocated buffers when the program closes the device without calling skel_free() beforehand. dma_desc is the type of those lists in the following lines, which show the code for the ioctl-wrapped skel_malloc() and skel_free():

/* =============================================
 *
 * SKEL_MALLOC
 *
 * The user desires a(nother) dma-buffer, that
 * is allocated by kmalloc (GFP_DMA) (continuous
 * and in lower 16 MB).
 * The allocated buffer is mapped into
 * user-space by
 *  a) a pseudo-file as you would get it when
 *     opening /dev/mem
 *  b) the buffer-pages tagged as "reserved"
 *     in memmap
 *  c) calling the normal entry point for
 *     mmap-calls "do_mmap" with our pseudo-file
 *
 * 0 or <0 means an error occurred, otherwise
 * the user space address is returned.
 * This is the main basis of the Skel_Malloc
 * Library-Call
 */
/* ------------------------------
 * Ma's little helper replaces the mmap
 * file_operation for /dev/mem which is declared
 * static in Linux and has to be rebuilt by us.
 * But ain't that much work; we better drop more
 * comments before they exceed the code in length.
*/
static int skel_mmap_mem (struct inode * inode,
                   struct file * file,
                   struct vm_area_struct *vma) {
    if (remap_page_range(vma->vm_start,
                   vma->vm_offset,
                   vma->vm_end - vma->vm_start,
                   vma->vm_page_prot))
        return -EAGAIN;
    vma->vm_inode = NULL;
    return 0;
}
static unsigned long skel_malloc (struct file *file,
                            unsigned long size) {
    unsigned long    pAdr, uAdr;
    dma_desc         *dpi;
    skel_file_info   *fip;
    struct file_operations  fops;
    struct file      memfile;
    /* Our helpful pseudo-file only ... */
    fops.mmap = skel_mmap_mem;
    /* ... support mmap */
    memfile.f_op = &fops;
    /* and is read'n write */
    memfile.f_mode = 0x3;
    fip = (skel_file_info*)(file->private_data);
    if (!fip) return 0;
    dpi = kmalloc (sizeof(dma_desc), GFP_KERNEL);
    if (!dpi) return 0;
    PDEBUG ("skel: Size requested: %ld\n", size);
    if (size <= PAGE_SIZE/2)
        size = PAGE_SIZE-0x10;
    if (size > 0x1FFF0) {
        /* too big for us: don't leak dpi */
        kfree_s (dpi, sizeof (dma_desc));
        return 0;
    }
    pAdr = (unsigned long) kmalloc (size,
                           GFP_DMA | GFP_BUFFER);
    if (!pAdr) {
        printk ("skel: Trying to get %ld bytes"
                "buffer failed - no mem\n", size);
        kfree_s (dpi, sizeof (dma_desc));
        return 0;
    }
    for (uAdr = pAdr & PAGE_MASK;
         uAdr < pAdr+size;
         uAdr += PAGE_SIZE)
#if LINUX_VERSION_CODE < 0x01031D
        /* before 1.3.29: entries are plain
         * flag words, as in skel_free() */
        mem_map [MAP_NR (uAdr)] |=
                             MAP_PAGE_RESERVED;
#elif LINUX_VERSION_CODE < 0x01033A
        /* before 1.3.58 */
        mem_map [MAP_NR (uAdr)].reserved = 1;
#else
        /* most recent versions */
        mem_map_reserve (MAP_NR (uAdr));
#endif
    uAdr = do_mmap (&memfile, 0,
            (size + ~PAGE_MASK) & PAGE_MASK,
            PROT_READ | PROT_WRITE | PROT_EXEC,
            MAP_SHARED, pAdr & PAGE_MASK);
    if ((signed long) uAdr <= 0) {
        printk ("skel: A pity - "
                "do_mmap returned %lX\n", uAdr);
        kfree_s (dpi, sizeof (dma_desc));
        kfree_s ((void*)pAdr, size);
        return uAdr;
    }
    PDEBUG ("skel: Mapped physical %lX to %lX\n",
            pAdr, uAdr);
    uAdr |= pAdr & ~PAGE_MASK;
    dpi->dma_adr = pAdr;
    dpi->user_adr = uAdr;
    dpi->dma_size= size;
    dpi->next = fip->dmabuf_info;
    fip->dmabuf_info = dpi;
    return uAdr;
}
/* =============================================
 *
 * SKEL_FREE
 *
 * Releases memory previously allocated by
 * skel_malloc
 */
static int skel_free (struct file *file,
                      unsigned long ptr) {
    dma_desc    *dpi, **dpil;
    skel_file_info  *fip;
    fip = (skel_file_info*)(file->private_data);
    if (!fip) return 0;
    dpil = &(fip->dmabuf_info);
    for (dpi = fip->dmabuf_info;
         dpi; dpi=dpi->next) {
        if (dpi->user_adr==ptr) break;
        dpil = &(dpi->next);
    }
    if (!dpi) return -EINVAL;
    PDEBUG ("skel: Releasing %lX bytes at %lX\n",
            dpi->dma_size, dpi->dma_adr);
    do_munmap (ptr & PAGE_MASK,
        (dpi->dma_size+(~PAGE_MASK)) & PAGE_MASK);
    ptr = dpi->dma_adr;
    do {
#if LINUX_VERSION_CODE < 0x01031D
        /* before 1.3.29 */
        mem_map [MAP_NR(ptr)] &= ~MAP_PAGE_RESERVED;
#elif LINUX_VERSION_CODE < 0x01033A
        /* before 1.3.58 */
        mem_map [MAP_NR(ptr)].reserved = 0;
#else
        mem_map_unreserve (MAP_NR  (ptr));
#endif
        ptr += PAGE_SIZE;
    } while (ptr < dpi->dma_adr+dpi->dma_size);
    *dpil = dpi->next;
    kfree_s ((void*)dpi->dma_adr, dpi->dma_size);
    kfree_s (dpi, sizeof (dma_desc));
    return 0;
}

Some Final Words on PCI

Technology develops, but the ideas often remain the same. In the old ISA world, peripherals located their buffers at the “very high end of address space”, above 640 kB. Many PCI cards now do the same, but nowadays, this is something more like the end of a 32-bit address space (like 0xF0100000).

If you want to access a buffer at these addresses, you have to use vremap() as defined in linux/mm.h to remap the same pages of this physical memory into your own virtual address space.

vremap() works a little bit like the mmap() user call in nasty, but it's much easier:

void * vremap (unsigned long offset,
               unsigned long size);

You just pass the start address of your buffer and its length. Remember, we always map pages; therefore offset and size have to be page-aligned. If your buffer is smaller or does not start on a page boundary, map the whole page and try to avoid accessing invalid addresses.
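The alignment arithmetic for a buffer that does not start or end on a page boundary can be sketched as follows. The names and the fixed 4 kB page size are assumptions of this example (kernel code would use PAGE_SIZE and PAGE_MASK); the rounding expression is the same one used in the do_mmap() call of skel_malloc() above.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/* Page-aligned region covering a buffer (hypothetical type). */
struct page_span {
    unsigned long base;    /* page-aligned start to pass as offset */
    unsigned long length;  /* multiple of the page size to pass as size */
    unsigned long delta;   /* where the buffer begins inside the region */
};

struct page_span covering_pages(unsigned long start, unsigned long size)
{
    struct page_span s;
    unsigned long end;

    assert(size > 0);
    /* round the end of the buffer up to the next page boundary */
    end = (start + size + ~DEMO_PAGE_MASK) & DEMO_PAGE_MASK;
    s.base   = start & DEMO_PAGE_MASK;   /* round the start down */
    s.length = end - s.base;
    s.delta  = start - s.base;
    return s;
}
```

After mapping base and length, the buffer itself is found delta bytes into the returned region; everything else in the mapped pages belongs to somebody else and should be left alone.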

I personally have not tried this, and I'm not sure if the tricks I described above on how to map buffers to user space work with PCI high memory buffers. If you want to give it a try, you definitely have to remove the “brute force” manipulation of the mem_map array, as mem_map is only for physical RAM. Try to replace the kmalloc() and kfree() stuff with the analogous vremap() calls and then perform a second remapping with do_mmap() to user space.

But as you might realize, we've come to an end of this series, and now it is up to you to boldly go where no Linuxer has gone before...

Good Luck!