The simple block driver
example earlier in this series showed how to write the simplest possible
request function. Most block drivers, however, will need greater control
over how requests are built and processed. This article will get into the
details of how request queues work, with an emphasis on what every driver
writer needs to know to process requests. A second article looks at some of the more
advanced features of request queues in 2.6.
Request queues
Request queues are represented by a pointer to struct request_queue
or to the typedef request_queue_t,
defined in <linux/blkdev.h>. One request queue can be
shared across multiple physical drives, but the normal usage is to create a
separate queue for each drive. Request queues must be
allocated and initialized by the block subsystem; this allocation (and
initialization) is done by:
request_queue_t *blk_init_queue(request_fn_proc *request_fn,
                                spinlock_t *lock);
Here request_fn is the
driver's function which will process requests, and lock is a
spinlock which controls access to the queue. The return value is a pointer
to the newly-allocated request queue if the initialization succeeded, or
NULL otherwise.
Since setting up a request queue requires memory
allocation, failure is possible. A couple of other changes from 2.4 should be
noted: a spinlock must be provided to control access to the queue
(io_request_lock is no more), and there is no per-major "default"
queue provided in 2.6.
When a driver is done with a request queue, it should pass it back to the
system with:
void blk_cleanup_queue(request_queue_t *q);
Note that neither of these functions is normally called if a "make request"
function is being used (make request functions are covered in part II).
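As a quick (and hypothetical) illustration, a driver's setup and teardown
code might handle its queue along these lines; the mydev_* names, the
spinlock, and the error handling are stand-ins for whatever the real
driver uses:

#include <linux/blkdev.h>
#include <linux/init.h>

static spinlock_t mydev_lock = SPIN_LOCK_UNLOCKED;
static request_queue_t *mydev_queue;

static void mydev_request(request_queue_t *q);   /* the request function */

static int __init mydev_init(void)
{
    mydev_queue = blk_init_queue(mydev_request, &mydev_lock);
    if (mydev_queue == NULL)
        return -ENOMEM;    /* queue allocation can fail */
    /* ... set up the gendisk, claim hardware resources, etc. ... */
    return 0;
}

static void __exit mydev_exit(void)
{
    /* ... */
    blk_cleanup_queue(mydev_queue);
}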
Basic request processing
The request function prototype has not changed from 2.4; it gets the
request queue as its only parameter. The queue lock will be held when the
request function is called.
All request handlers, from the simplest to the most complicated, will find
the next request to process with:
struct request *elv_next_request(request_queue_t *q);
The return value is the next request that should be processed, or
NULL if the queue is empty. If you look through the kernel
source, you will find references to blk_queue_empty() (or
elv_queue_empty()), which tests the state of the queue. Use of
this function in drivers is best avoided, however. In the future, it could
be that a non-empty queue still has no requests that are ready to be
processed.
In 2.4 and prior kernels, a block request contained one or more buffer
heads with sectors to be transferred. In 2.6, a request contains a list of
BIO structures instead. This list can be
accessed via the bio member of the request structure, but
the recommended way of iterating through a request's BIOs is instead:
struct bio *bio;

rq_for_each_bio(bio, req) {
    /* Process this BIO */
}
Drivers which use this macro are less likely to break in the future. Do
note, however, that many drivers will never need to iterate through the
list of BIOs in this way; for DMA transfers, use blk_rq_map_sg()
(described below) instead.
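For a driver which does walk the list itself, the loop (inside the request
function, with req obtained from elv_next_request()) might look something
like the sketch below; mydev_transfer_bio() is a hypothetical function
which hands one BIO's pages to the hardware, and bio_sectors() is the
kernel macro giving a BIO's size in 512-byte sectors:

struct bio *bio;
unsigned int nsectors = 0;

rq_for_each_bio(bio, req) {
    mydev_transfer_bio(dev, bio);        /* hypothetical */
    nsectors += bio_sectors(bio);
}
/* nsectors can later be handed to end_that_request_first(). */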
As your driver performs the transfers described by the BIO structures, it
will need to update the kernel on its progress. Note that drivers should
not call bio_endio() as transfers complete; the block layer
will take care of that. Instead, the driver should call
end_that_request_first(), which has a different prototype in 2.6:
int end_that_request_first(struct request *req, int uptodate,
                           int nsectors);
Here, req is the request being handled, uptodate is
nonzero unless an error has occurred, and nsectors is the number
of sectors which were transferred. This function will clean up as many BIO
structures as are covered by the given number of sectors, and return
nonzero if any BIOs remain to be transferred.
When the request is complete (end_that_request_first() returns
zero), the driver should clean up the request. The cleanup task involves
removing the request from the queue, then passing it to
end_that_request_last(), which is unchanged from 2.4. Note that
the queue lock must be held when calling both of these functions:
void blkdev_dequeue_request(struct request *req);
void end_that_request_last(struct request *req);
Note that the driver can dequeue the request at any time (as long as it
keeps track of it, of course). Drivers which keep multiple requests in
flight will need to dequeue each request as it is passed to the drive.
If your device does not have predictable timing behavior, your driver
should contribute its timing information to the system's entropy pool.
That is done with:
void add_disk_randomness(struct gendisk *disk);
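Putting the pieces above together, a request function for a simple driver
which transfers data synchronously might look something like the following
sketch. Here mydev_transfer() is a hypothetical function which moves all
of a request's data and returns the number of sectors moved (or a negative
number on error); real drivers will, of course, be more involved.

static void mydev_request(request_queue_t *q)
{
    struct mydev *dev = q->queuedata;   /* hypothetical device structure */
    struct request *req;
    int nsect, uptodate;

    /* The queue lock is held on entry to the request function. */
    while ((req = elv_next_request(q)) != NULL) {
        nsect = mydev_transfer(dev, req);      /* hypothetical */
        uptodate = (nsect >= 0);
        if (nsect < 0)
            nsect = req->nr_sectors;   /* fail the whole request */

        if (!end_that_request_first(req, uptodate, nsect)) {
            /* No BIOs left: retire the request. */
            add_disk_randomness(req->rq_disk);
            blkdev_dequeue_request(req);
            end_that_request_last(req);
        }
    }
}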
BIO walking
The "BIO walking" patch was added in 2.5.70. This patch adds some request
queue fields and a new function to help complicated drivers keep track of
where they are in a given request. Drivers using BIO walking will not use
rq_for_each_bio(); instead, they rely upon the fact that the
cbio field of the request structure refers to the current,
unprocessed BIO, nr_cbio_segments tells how many segments remain
to be processed in that BIO, and nr_cbio_sectors tells how many
sectors are yet to be transferred. The macro:
int blk_rq_idx(struct request *req)
returns the index of the next segment to process. If you need to access the
current segment buffer directly (for programmed I/O, say), you may use:
char *rq_map_buffer(struct request *req, unsigned long *flags);
void rq_unmap_buffer(char *buffer, unsigned long flags);
These functions potentially deal with atomic kmaps, so the usual
constraints apply: no sleeping while the mapping is in effect, and buffers
must be mapped and unmapped in the same function.
When beginning I/O on a set of blocks from the request, your driver can
update the current pointers with:
int process_that_request_first(struct request *req,
                               unsigned int nr_sectors);
This function will update the various cbio values in the request,
but does not signal completion (you still need
end_that_request_first() for that).
Use of process_that_request_first() is optional; your driver may
call it if you would like the block subsystem to track your current
position in the request for I/O submission independently from how much of
the request has actually been completed.
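As a (hypothetical) illustration, a programmed-I/O driver might use these
helpers to push one segment at a time to the device, with completion still
being reported separately through end_that_request_first():

/* Feed the current segment of "req" to the hardware by programmed I/O;
   nsect is the size of that segment in sectors (sketch only). */
static void mydev_pio_segment(struct mydev *dev, struct request *req,
                              unsigned int nsect)
{
    unsigned long flags;
    char *buffer;

    /* Atomic mapping: no sleeping until rq_unmap_buffer(). */
    buffer = rq_map_buffer(req, &flags);
    mydev_pio_write(dev, buffer, nsect << 9);   /* hypothetical */
    rq_unmap_buffer(buffer, flags);

    /* Record that nsect more sectors have been submitted. */
    process_that_request_first(req, nsect);
}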
Barrier requests
Requests will come off the request queue sorted into an order that should
give good performance. Block drivers (and the devices they drive) are free
to reorder those requests within reason, however. Drives which support
features like tagged command queueing and write caching will often complete
operations in an order different from that in which they received the
requests. Most of the time, this reordering leads to improved performance
and is a good thing.
At times, however, it is necessary to inhibit this reordering. The classic
example is that of journaling filesystems, which must be able to force
journal entries to the disk before the operations they describe.
Reordering of requests can undermine the filesystem integrity that a
journaling filesystem is trying to provide.
To meet the needs of higher-level layers, the concept of a "barrier
request" has been added to the 2.6 kernel. Barrier requests are marked by
the REQ_HARDBARRIER flag in the request structure flags
field. When your driver encounters a barrier request, it must complete
that request (and all that preceded it) before beginning any requests after
the barrier request. "Complete," in this case, means that the data has
been physically written to the disk platter - not just transferred to the
drive.
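A driver which queues multiple commands to its hardware might, in its
request function, recognize barriers with a test like the one below; the
drain and cache-flush operations are hypothetical, device-specific steps.

if (req->flags & REQ_HARDBARRIER) {
    /* Everything issued so far must be on the media before this
       request - or anything after it - is started. */
    mydev_wait_for_outstanding(dev);    /* hypothetical */
    mydev_flush_write_cache(dev);       /* hypothetical */
}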
Tweaking request queue parameters
The block subsystem contains a long list of functions which control how I/O
requests are created for your driver. Here are a few of them.
Bounce buffer control: in 2.4, the block code assumed that devices
could not perform DMA to or from high memory addresses. When I/O buffers
were located in high memory, data would be copied to or from low-memory
"bounce" buffers; the driver would then operate on the low-memory buffer.
Most modern devices can handle (at a minimum) full 32-bit DMA addresses, or
even 64-bit addresses. For now, 2.6 will still use bounce buffers for
high-memory addresses. A driver can change that behavior with:
void blk_queue_bounce_limit(request_queue_t *q, u64 dma_addr);
After this call, any buffer whose physical address is at or above
dma_addr will be copied through a bounce buffer. The driver can
provide any reasonable address, or one of BLK_BOUNCE_HIGH (bounce
high memory pages, the default), BLK_BOUNCE_ANY (do not use bounce
buffers at all), or BLK_BOUNCE_ISA (bounce anything above the ISA
DMA threshold).
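So, for example, a (hypothetical) device capable of full 64-bit DMA could
turn bouncing off entirely, while one limited to 32-bit addresses would
only need help above 4GB:

blk_queue_bounce_limit(mydev_queue, BLK_BOUNCE_ANY);    /* full 64-bit DMA */
/* or */
blk_queue_bounce_limit(mydev_queue, 0xffffffffULL);     /* 32-bit DMA only */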
Request clustering control. The block subsystem works hard to
coalesce adjacent requests for better performance. Most devices have
limits, however, on how large those requests can be. A few functions have
been provided so that drivers can tell the block subsystem not to create
requests which must then be split apart again.
void blk_queue_max_sectors(request_queue_t *q, unsigned short max_sectors);
Sets the maximum number of sectors which may be transferred in a single
request; default is 255. It is not possible to set the maximum below the
number of sectors contained in one page.
void blk_queue_max_phys_segments(request_queue_t *q,
                                 unsigned short max_segments);
void blk_queue_max_hw_segments(request_queue_t *q,
                               unsigned short max_segments);
The maximum number of discontiguous physical segments in a single request;
this is the maximum size of a scatter/gather list that could be presented
to the device. The first function controls the number of distinct memory
segments in the request; the second does the same, but it takes into
account the remapping
which can be performed by the system's I/O memory management unit (if
any). The default for both is 128 segments.
void blk_queue_max_segment_size(request_queue_t *q,
                                unsigned int max_size);
The maximum size that any individual segment within a request can be. The
default is 65536 bytes.
void blk_queue_segment_boundary(request_queue_t *q,
                                unsigned long mask);
Some devices cannot perform transfers which cross memory boundaries of a
certain size. If your device is one of these, you should call
blk_queue_segment_boundary() with a mask indicating where
the boundary is. If, for example, your hardware has a hard time crossing
4MB boundaries, mask should be set to 0x3fffff. The
default is 0xffffffff.
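As an example, a driver for a (hypothetical) controller limited to 64
scatter/gather entries of at most 32KB each, whose transfers may not cross
64KB boundaries, might set up its queue this way:

blk_queue_max_sectors(mydev_queue, 128);           /* 64KB per request */
blk_queue_max_phys_segments(mydev_queue, 64);
blk_queue_max_hw_segments(mydev_queue, 64);
blk_queue_max_segment_size(mydev_queue, 32 * 1024);
blk_queue_segment_boundary(mydev_queue, 0xffff);   /* no 64KB crossings */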
Finally, some devices have more esoteric restrictions on which requests may
or may not be clustered together. For situations where the above
parameters are insufficient, a block driver can specify a function which
can examine (and pass judgement on) each proposed merge.
typedef int (merge_bvec_fn) (request_queue_t *q, struct bio *bio,
                             struct bio_vec *bvec);
void blk_queue_merge_bvec(request_queue_t *q, merge_bvec_fn *fn);
Once the given fn is associated with this queue, it will be called
every time a bio_vec entry bvec is being considered for
addition to the given bio. It should return the number of bytes
from bvec which can be added; zero should be returned if the new
segment cannot be added at all. By default, there is no
merge_bvec_fn.
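As a (hypothetical) example, a device which cannot handle a BIO crossing
one of its internal 64KB stripes might supply a function like this:

static int mydev_merge_bvec(request_queue_t *q, struct bio *bio,
                            struct bio_vec *bvec)
{
    unsigned int offset = (bio->bi_sector << 9) + bio->bi_size;
    unsigned int room = 0x10000 - (offset & 0xffff);

    /* Allow at most "room" bytes from bvec; zero means "start a
       new BIO instead." */
    return min(room, bvec->bv_len);
}

/* At initialization time: */
blk_queue_merge_bvec(mydev_queue, mydev_merge_bvec);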
Setting the hardware sector size. The old hardsect_size
global array is gone and nobody misses it. Block drivers now inform the
system of the underlying hardware's sector size with:
void blk_queue_hardsect_size(request_queue_t *q, unsigned short size);
The default is the usual 512-byte sector. There is one other important
change with regard to sector sizes: your driver will always see requests
expressed in terms of 512-byte sectors, regardless of the hardware sector
size. The block subsystem will not generate requests which go against the
hardware sector size, but sector numbers and counts in requests are always
in 512-byte units. This change was required as part of the new centralized
partition remapping.
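A (hypothetical) device with 2KB hardware sectors, for example, would
declare that size when setting up, then shift the 512-byte sector numbers
it receives before talking to the hardware:

blk_queue_hardsect_size(mydev_queue, 2048);

/* Later, in the request function - req->sector is in 512-byte units: */
sector_t hw_sector = req->sector >> 2;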
DMA support
Most block I/O requests will come down to one or more DMA operations.
The 2.6 block layer provides a couple of functions designed to make the
task of setting up DMA operations easier.
void blk_queue_dma_alignment(request_queue_t *q, int mask);
This function sets a mask indicating what sort of memory alignment the
hardware needs for DMA requests; the default is 511.
DMA operations to modern devices usually require the creation of a
scatter/gather list of segments to be transferred. A block driver can
create this "scatterlist" using the generic DMA support routines and the
information found in the request. The block subsystem has made life a
little easier, though. A simple call to:
int blk_rq_map_sg(request_queue_t *q, struct request *rq,
                  struct scatterlist *sg);
will construct a scatterlist for the given request; the return value is the
number of entries in the resulting list. This scatterlist can then be
passed to pci_map_sg() or dma_map_sg() in preparation for
the DMA operation.
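A driver might use it roughly as follows; MYDEV_MAX_SEGMENTS, the mydev
structure, and its dma_dev pointer are all hypothetical, but the
scatterlist should be sized to match the segment limits set on the queue.

/* One entry per possible segment, sized at initialization time. */
static struct scatterlist mydev_sg[MYDEV_MAX_SEGMENTS];

static int mydev_start_dma(struct mydev *dev, struct request *req, int write)
{
    int nseg, nmapped;

    nseg = blk_rq_map_sg(dev->queue, req, mydev_sg);
    nmapped = dma_map_sg(dev->dma_dev, mydev_sg, nseg,
                         write ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
    /* Program the controller with the nmapped segments here. */
    return nmapped;
}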
Going on
The second part of the request queue article
series looks at command preparation, tagged command queueing, and writing
drivers which do without a request queue altogether.