Last month, we gave an introduction to interrupt-driver block device drivers. This month, we examine common optimizations that can allow an order of magnitude improvement in performance on some hardware, and then cover some of the points that have been left out in our whirlwind tour of block device drivers.
Last month, I ran out of space for this column just as I was about to start a long discussion on optimizations. In the interest of usefulness, I will mention only the most useful optimizations here before going on to discuss initialization.
I won't have any sample code implementations of the optimizations, because these are complex issues that need to be handled in device-specific ways, and just as the code was much more vague when I discussed interrupt-driven drivers than when I introduced the basic, initial device driver, it would either be so large as to take over the entire magazine or be so vague as to be completely useless if I were to try to write some here. I hope that my explanations are useful to you even without code.
Also, I should warn you that the optimizations I talk about, while representative of common optimizations, are not necessarily representative of anything you will find in the Linux source code, except where I explicitly state otherwise. I'm writing on a slightly more theoretical level—about what can be done, rather than what has been done. Things conceptually similar to what I write about have been done, but the details are my own and may not be the best way to go about things. As usual, use this column as an introduction to the kernel source code and read the actual source code for far more insight than I can give you here.
If you are code-starved and don't care about optimizations, jump right to the second part of this article, where I talk about initialization.
One common optimization is coalescing adjacent requests. This means that when the driver is notified or notices that a request has been added to the request queue, it looks through the request list to see if there is a request for the next block (and possibly more blocks beyond that). If so, it sends a request to the hardware to read more than one block with one command, and when the data comes in from the hardware (presumably in an interrupt service routine), it fills both requests before calling end_request(1) (actually, some similar function designed especially for that driver) on either request. After satisfying (or failing to satisfy) the requests, the equivalent of end_request() is called for each request, but without waking up processes waiting on either request until the interrupt has been satisfied.
This will require that you write your own version of end_request(). Although this probably sounds daunting, it isn't as hard as it sounds, because you can use almost all of it as-is. For example, you could copy it verbatim, except instead of doing wake_up(&wait_for_request) at the end, you could add wait_for_request to a list of events to wake up when you are ready. Then you would call this new almost_end_request() function as soon as you have finished processing each request. When you are done handling the entire interrupt and are ready to wake up processes, iterate over the list of events, calling wake_up() on each in turn, from first satisfied to last satisfied.
Note that wake_up() will not cause a context switch directly. The driver will not give up control while running wake_up() to a process being woken up. Instead, wake_up() makes all the processes being woken up “runnable”, and sets the need_resched flag. This flag says that the scheduler ought to be called at the next convenient time, such as when returning from a “slow” interrupt handling routine (including the clock handling routine) or when returning from a system call. This means that the driver will not be pre-empted by calling wake_up(), and so it will be able to wake up all the necessary processes without being pre-empted.
This will likely take several tries to get right. All I can say to help is “Make sure you have backups. Really.”
The only driver in the Linux kernel that I have noticed doing anything like this is the floppy driver; the track buffer works in a similar way, where more than one request may be satisfied by a single read command sent to the hardware. If you are interested in investigating how it works, read drivers/block/floppy.c and search for floppy_track_buffer and read the entire function make_raw_rw_request().
Sounds like a “boondoggle”, doesn't it? Scatter-gather is perhaps a little bit similar in concept to coalescing adjacent requests, but is used with more intelligent hardware, and is perhaps a bit easier to implement. The “scatter” part means that when there are multiple blocks to be written all over a disk (for example), one command is sent out to initiate writing to all those different sectors, reducing the overhead involved in negotiation from O(n) to O(1), where n is the number of blocks or sectors to write.
Similarly, the “gather” part means that when there are multiple blocks to be read, one command is sent out to initiate reading all the blocks, and as the disk sends in each block, the corresponding request is marked as satisfied with end_request(1) or equivalent device-specific code. You will only be able to easily use end_request() unmodified with scatter-gather if each block read or written results in a separate interrupt being generated, and perhaps not even then. The SCSI driver does its own, which is probably the best way to go.
If you want to increase throughput, at the slight expense of response time, you could use timers to help: when your request() is notified that there is a request, and sees that there is only one request outstanding, it could set a timer to go off soon (one or two tenths of a second, perhaps), assuming that while waiting, more requests will spill in to be dealt with, and that when a certain number of requests have been made, or the timer has gone off, whichever comes first, scatter-gather will be applied to the blocks. If the request() routine is called and notices that “enough” (however many that is...) requests have accumulated, it would un-install the timer and process the requests. If the timer were to go off, all requests would be processed.
Note that the timer used should not be the same static timer used for the hardware timeout. Instead, it should be a dynamically allocated timer. See <linux/timer.h> for details on the dynamic timer routines.
I will repeat my standard disclaimer: this is simplified (at least, I'm trying to simplify it...) and if you want more detailed and correct information, study a real driver. The SCSI high-level drivers (scsi.c, sd.c, sr.c) are definitely the place to start. (I don't mention st.c and sg.c because they are character, not block, devices.)
Optimization is almost easy to write about, and sounds glamorous, but error handling is really where the rubber meets the road. I'm not even going to try to really cover error handling in this column. It is a very complex subject, and different for each piece of hardware. If you have read this far, you should have the basic knowledge you need to start reading the block device drivers in the kernel. By reading all the error handling there, you will begin to understand how to do error handling nicely. Consider reading the drivers in this sequence: ramdisk.c, hd.c, and floppy.c.
As you write your driver, you will probably start with very simple error handling. As you use your driver, you are likely to discover more error cases to handle. Sometimes you will find that you need to redesign some components of your driver to get error handling correct. As an example, until very recently, error handling in the floppy driver was not very good. You could mount a write-protected floppy read-write and cause serious problems when you then tried to write to the floppy. Also, if you tried (for instance) to write a tar archive on a write-protected floppy, you would get a stream of errors as the driver reset the floppy drive and kept assuming the error would go away. It took a significant rewrite of the floppy driver to solve that problem correctly.
This isn't really specific to block device drivers, but it is certainly necessary to know. In issue 9, I gave an example initialization function, but it was mostly pseudo-code and didn't cover many of the things that you need to do to work with real devices. For instance, it doesn't explain how to grab a DMA channel, nor does it explain how to grab an IRQ line.
The easiest way to deal with IRQ and DMA is to allocate both the IRQ line and DMA channel while initializing the device. It's not the best way, but it is the easiest way to start out while you are writing your device driver. When you want to figure out exactly when to allocate and free them, you can read other device drivers. The floppy device driver, for instance, has functions floppy_grab_irq_and_dma() and floppy_release_irq_and_dma() which do exactly what they say, and are used not only in the initialization code, but all through the rest of the driver.
The floppy_grab_irq_and_dma() function is a good place to start to learn how to allocate IRQ lines and DMA channels. According to <asm/dma.h>, the IRQ line should be allocated first and released last.
We'll look at IRQs first. request_irq() takes four arguments. The first argument to request_irq() is the number of the IRQ line to allocate. The second is the interrupt service routine to call when an interrupt is received. The third is a flag which is either set to SA_INTERRUPT or something else (presumably 0), which determines whether the argument passed to the interrupt service routine is a pointer to a register structure (0) or the number of the interrupt (SA_INTERRUPT), and also whether the interrupt is a “fast” interrupt handler (SA_INTERRUPT) or a “slow” interrupt handler (0). A “slow” interrupt handler is one where more processing is done when the interrupt handler returns, including possibly running the scheduler to choose a new process to become the active one. A “fast” interrupt handler does as little as possible. The fourth argument is the name of the device driver.
request_irq() is simpler. It takes two arguments, the first of which is the IRQ channel, and the second of which is the name of the device driver.
The corresponding freeing functions are even simpler. free_dma() takes only the number of the DMA channel, and free_irq() takes only the number of the IRQ.
Of course, there is far more to using IRQs, and even more so to using DMA, than allocating and freeing lines and channels, but this is a start. The start, to be pedantic. Read <asm/dma.h>, kernel/dma.c, <asm/irq.h>, and kernel/irq.c for details; they are very readable, and have many useful comments.
K. D. Nguyen sent me some e-mail after reading the January issue of Linux Journal, echoing a wish I have heard from other people as well.
I have been reading two books on Unix device drivers, the KHG, and the Kernel Korner articles since last issue. I feel like I can write some device drivers. But unfortunately, there seems to be something missing from all the books and articles about Unix device drivers. It is the lack of a practice environment. We, the device driver beginners, can only read and look at some device driver code under Linux and try to understand how they work. It would be more fun if there were some hardware or device kit that let us really do some exercise on what we just read about writing Unix device drivers (rather than buying a new color printer and then begging the manufacture for the specs to write a new challenging device driver). Of course, for the meantime, I will keep reading.
There is no real practice environment. The easiest way to start learning is to write a ramdisk driver. Beyond that, many real devices are actually fairly simple. Dive in! It's hard to dabble at the water's edge when you are writing kernel code for a monolithic OS, regardless of whether you are writing code for a simple ramdisk or for a toy device kit or for a real device. The learning curve is quite steep, but that means that in a short time of strenuous learning, you really pick up most of what you need to write basic device drivers.
Also, what features would a practice device support, and what would it do? It's a hard question to answer and one I'm not going to attempt—and I think that manufacturers won't either. Since you are as likely to screw up your entire system writing a driver for a practice device as for a real one, you might as well work on a real device. There are real devices as simple as any toy device.