The simple block driver
example earlier in this series showed how to write the simplest possible
request function. Most block drivers, however, will need greater control
over how requests are built and processed. This article will get into the
details of how request queues work, with an emphasis on what every driver
writer needs to know to process requests. A second article looks at some of the more
advanced features of request queues in 2.6.
Request queues
Request queues are represented by a pointer to struct request_queue
or to the typedef request_queue_t,
defined in <linux/blkdev.h>. One request queue can be
shared across multiple physical drives, but the normal usage is to create a
separate queue for each drive. Request queues must be
allocated and initialized by the block subsystem; this allocation (and
initialization) is done by:
request_queue_t *blk_init_queue(request_fn_proc *request_fn,
                                spinlock_t *lock);
Here request_fn is the
driver's function which will process requests, and lock is a
spinlock which controls access to the queue. The return value is a pointer
to the newly-allocated request queue if the initialization succeeded, or
NULL otherwise.
Since setting up a request queue requires memory
allocation, failure is possible. A couple of other changes from 2.4 should be
noted: a spinlock must be provided to control access to the queue
(io_request_lock is no more), and there is no per-major "default"
queue provided in 2.6.
When a driver is done with a request queue, it should pass it back to the
system with:
void blk_cleanup_queue(request_queue_t *q);
Note that neither of these functions is normally called if a "make request"
function is being used (make request functions are covered in part II).
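As a quick (and hypothetical) illustration, a driver's setup and teardown
code might handle its queue along these lines; the mydev_* names, the
spinlock, and the error handling are stand-ins for whatever the real
driver uses:

#include <linux/blkdev.h>
#include <linux/init.h>

static spinlock_t mydev_lock = SPIN_LOCK_UNLOCKED;
static request_queue_t *mydev_queue;

static void mydev_request(request_queue_t *q);   /* the request function */

static int __init mydev_init(void)
{
    mydev_queue = blk_init_queue(mydev_request, &mydev_lock);
    if (mydev_queue == NULL)
        return -ENOMEM;    /* queue allocation can fail */
    /* ... set up the gendisk, claim hardware resources, etc. ... */
    return 0;
}

static void __exit mydev_exit(void)
{
    /* ... */
    blk_cleanup_queue(mydev_queue);
}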
Basic request processing
The request function prototype has not changed from 2.4; it gets the
request queue as its only parameter. The queue lock will be held when the
request function is called.
All request handlers, from the simplest to the most complicated, will find
the next request to process with:
struct request *elv_next_request(request_queue_t *q);
The return value is the next request that should be processed, or
NULL if the queue is empty. If you look through the kernel
source, you will find references to blk_queue_empty() (or
elv_queue_empty()), which tests the state of the queue. Use of
this function in drivers is best avoided, however. In the future, it could
be that a non-empty queue still has no requests that are ready to be
processed.
In 2.4 and prior kernels, a block request contained one or more buffer
heads with sectors to be transferred. In 2.6, a request contains a list of
BIO structures instead. This list can be
accessed via the bio member of the request structure, but
the recommended way of iterating through a request's BIOs is instead:
struct bio *bio;

rq_for_each_bio(bio, req) {
    /* Process this BIO */
}
Drivers which use this macro are less likely to break in the future. Do
note, however, that many drivers will never need to iterate through the
list of BIOs in this way; for DMA transfers, use blk_rq_map_sg()
(described below) instead.
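For a driver which does walk the list itself, the loop (inside the request
function, with req obtained from elv_next_request()) might look something
like the sketch below; mydev_transfer_bio() is a hypothetical function
which hands one BIO's pages to the hardware, and bio_sectors() is the
kernel macro giving a BIO's size in 512-byte sectors:

struct bio *bio;
unsigned int nsectors = 0;

rq_for_each_bio(bio, req) {
    mydev_transfer_bio(dev, bio);        /* hypothetical */
    nsectors += bio_sectors(bio);
}
/* nsectors can later be handed to end_that_request_first(). */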
As your driver performs the transfers described by the BIO structures, it
will need to update the kernel on its progress. Note that drivers should
not call bio_endio() as transfers complete; the block layer
will take care of that. Instead, the driver should call
end_that_request_first(), which has a different prototype in 2.6:
int end_that_request_first(struct request *req, int uptodate,
                           int nsectors);
Here, req is the request being handled, uptodate is
nonzero unless an error has occurred, and nsectors is the number
of sectors which were transferred. This function will clean up as many BIO
structures as are covered by the given number of sectors, and return
nonzero if any BIOs remain to be transferred.
When the request is complete (end_that_request_first() returns
zero), the driver should clean up the request. The cleanup task involves
removing the request from the queue, then passing it to
end_that_request_last(), which is unchanged from 2.4. Note that
the queue lock must be held when calling both of these functions:
void blkdev_dequeue_request(struct request *req);
void end_that_request_last(struct request *req);
Note that the driver can dequeue the request at any time (as long as it
keeps track of it, of course). Drivers which keep multiple requests in
flight will need to dequeue each request as it is passed to the drive.
If your device does not have predictable timing behavior, your driver
should contribute its timing information to the system's entropy pool.
That is done with:
void add_disk_randomness(struct gendisk *disk);
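Putting the pieces above together, a request function for a simple driver
which transfers data synchronously might look something like the following
sketch. Here mydev_transfer() is a hypothetical function which moves all
of a request's data and returns the number of sectors moved (or a negative
number on error); real drivers will, of course, be more involved.

static void mydev_request(request_queue_t *q)
{
    struct mydev *dev = q->queuedata;   /* hypothetical device structure */
    struct request *req;
    int nsect, uptodate;

    /* The queue lock is held on entry to the request function. */
    while ((req = elv_next_request(q)) != NULL) {
        nsect = mydev_transfer(dev, req);      /* hypothetical */
        uptodate = (nsect >= 0);
        if (nsect < 0)
            nsect = req->nr_sectors;   /* fail the whole request */

        if (!end_that_request_first(req, uptodate, nsect)) {
            /* No BIOs left: retire the request. */
            add_disk_randomness(req->rq_disk);
            blkdev_dequeue_request(req);
            end_that_request_last(req);
        }
    }
}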
BIO walking
The "BIO walking" patch was added in 2.5.70. This patch adds some request
queue fields and a new function to help complicated drivers keep track of
where they are in a given request. Drivers using BIO walking will not use
rq_for_each_bio(); instead, they rely upon the fact that the
cbio field of the request structure refers to the current,
unprocessed BIO, nr_cbio_segments tells how many segments remain
to be processed in that BIO, and nr_cbio_sectors tells how many
sectors are yet to be transferred. The macro:
int blk_rq_idx(struct request *req)
returns the index of the next segment to process. If you need to access the
current segment buffer directly (for programmed I/O, say), you may use:
char *rq_map_buffer(struct request *req, unsigned long *flags);
void rq_unmap_buffer(char *buffer, unsigned long flags);
These functions potentially deal with atomic kmaps, so the usual
constraints apply: no sleeping while the mapping is in effect, and buffers
must be mapped and unmapped in the same function.
When beginning I/O on a set of blocks from the request, your driver can
update the current pointers with:
int process_that_request_first(struct request *req,
                               unsigned int nr_sectors);
This function will update the various cbio values in the request,
but does not signal completion (you still need
end_that_request_first() for that).
Use of process_that_request_first() is optional; your driver may
call it if you would like the block subsystem to track your current
position in the request for I/O submission independently from how much of
the request has actually been completed.
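As a (hypothetical) illustration, a programmed-I/O driver might use these
helpers to push one segment at a time to the device, with completion still
being reported separately through end_that_request_first():

/* Feed the current segment of "req" to the hardware by programmed I/O;
   nsect is the size of that segment in sectors (sketch only). */
static void mydev_pio_segment(struct mydev *dev, struct request *req,
                              unsigned int nsect)
{
    unsigned long flags;
    char *buffer;

    /* Atomic mapping: no sleeping until rq_unmap_buffer(). */
    buffer = rq_map_buffer(req, &flags);
    mydev_pio_write(dev, buffer, nsect << 9);   /* hypothetical */
    rq_unmap_buffer(buffer, flags);

    /* Record that nsect more sectors have been submitted. */
    process_that_request_first(req, nsect);
}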
Barrier requests
Requests will come off the request queue sorted into an order that should
give good performance. Block drivers (and the devices they drive) are free
to reorder those requests within reason, however. Drives which support
features like tagged command queueing and write caching will often complete
operations in an order different from that in which they received the
requests. Most of the time, this reordering leads to improved performance
and is a good thing.
At times, however, it is necessary to inhibit this reordering. The classic
example is that of journaling filesystems, which must be able to force
journal entries to the disk before the operations they describe.
Reordering of requests can undermine the filesystem integrity that a
journaling filesystem is trying to provide.
To meet the needs of higher-level layers, the concept of a "barrier
request" has been added to the 2.6 kernel. Barrier requests are marked by
the REQ_HARDBARRIER flag in the request structure flags
field. When your driver encounters a barrier request, it must complete
that request (and all that preceded it) before beginning any requests after
the barrier request. "Complete," in this case, means that the data has
been physically written to the disk platter - not just transferred to the
drive.
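A driver which queues multiple commands to its hardware might, in its
request function, recognize barriers with a test like the one below; the
drain and cache-flush operations are hypothetical, device-specific steps.

if (req->flags & REQ_HARDBARRIER) {
    /* Everything issued so far must be on the media before this
       request - or anything after it - is started. */
    mydev_wait_for_outstanding(dev);    /* hypothetical */
    mydev_flush_write_cache(dev);       /* hypothetical */
}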
Tweaking request queue parameters
The block subsystem contains a long list of functions which control how I/O
requests are created for your driver. Here are a few of them.
Bounce buffer control: in 2.4, the block code assumed that devices
could not perform DMA to or from high memory addresses. When I/O buffers
were located in high memory, data would be copied to or from low-memory
"bounce" buffers; the driver would then operate on the low-memory buffer.
Most modern devices can handle (at a minimum) full 32-bit DMA addresses, or
even 64-bit addresses. For now, 2.6 will still use bounce buffers for
high-memory addresses. A driver can change that behavior with:
void blk_queue_bounce_limit(request_queue_t *q, u64 dma_addr);
After this call, any buffer whose physical address is at or above
dma_addr will be copied through a bounce buffer. The driver can
provide any reasonable address, or one of BLK_BOUNCE_HIGH (bounce
high memory pages, the default), BLK_BOUNCE_ANY (do not use bounce
buffers at all), or BLK_BOUNCE_ISA (bounce anything above the ISA
DMA threshold).
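So, for example, a (hypothetical) device capable of full 64-bit DMA could
turn bouncing off entirely, while one limited to 32-bit addresses would
only need help above 4GB:

blk_queue_bounce_limit(mydev_queue, BLK_BOUNCE_ANY);    /* full 64-bit DMA */
/* or */
blk_queue_bounce_limit(mydev_queue, 0xffffffffULL);     /* 32-bit DMA only */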
Request clustering control. The block subsystem works hard to
coalesce adjacent requests for better performance. Most devices have
limits, however, on how large those requests can be. A few functions have
been provided so that drivers can tell the block subsystem not to create
requests which must then be split apart again.
void blk_queue_max_sectors(request_queue_t *q, unsigned short max_sectors);
Sets the maximum number of sectors which may be transferred in a single
request; default is 255. It is not possible to set the maximum below the
number of sectors contained in one page.
void blk_queue_max_phys_segments(request_queue_t *q,
                                 unsigned short max_segments);
void blk_queue_max_hw_segments(request_queue_t *q,
                               unsigned short max_segments);
The maximum number of discontiguous physical segments in a single request;
this is the maximum size of a scatter/gather list that could be presented
to the device. The first function controls the number of distinct memory
segments in the request; the second does the same, but it takes into
account the remapping
which can be performed by the system's I/O memory management unit (if
any). The default for both is 128 segments.
void blk_queue_max_segment_size(request_queue_t *q,
                                unsigned int max_size);
The maximum size that any individual segment within a request can be. The
default is 65536 bytes.
void blk_queue_segment_boundary(request_queue_t *q,
                                unsigned long mask);
Some devices cannot perform transfers which cross memory boundaries of a
certain size. If your device is one of these, you should call
blk_queue_segment_boundary() with a mask indicating where
the boundary is. If, for example, your hardware has a hard time crossing
4MB boundaries, mask should be set to 0x3fffff. The
default is 0xffffffff.
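As an example, a driver for a (hypothetical) controller limited to 64
scatter/gather entries of at most 32KB each, whose transfers may not cross
64KB boundaries, might set up its queue this way:

blk_queue_max_sectors(mydev_queue, 128);           /* 64KB per request */
blk_queue_max_phys_segments(mydev_queue, 64);
blk_queue_max_hw_segments(mydev_queue, 64);
blk_queue_max_segment_size(mydev_queue, 32 * 1024);
blk_queue_segment_boundary(mydev_queue, 0xffff);   /* no 64KB crossings */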
Finally, some devices have more esoteric restrictions on which requests may
or may not be clustered together. For situations where the above
parameters are insufficient, a block driver can specify a function which
can examine (and pass judgement on) each proposed merge.
typedef int (merge_bvec_fn) (request_queue_t *q, struct bio *bio,
                             struct bio_vec *bvec);
void blk_queue_merge_bvec(request_queue_t *q, merge_bvec_fn *fn);
Once the given fn is associated with this queue, it will be called
every time a bio_vec entry bvec is being considered for
addition to the given bio. It should return the number of bytes
from bvec which can be added; zero should be returned if the new
segment cannot be added at all. By default, there is no
merge_bvec_fn.
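As a (hypothetical) example, a device which cannot handle a BIO crossing
one of its internal 64KB stripes might supply a function like this:

static int mydev_merge_bvec(request_queue_t *q, struct bio *bio,
                            struct bio_vec *bvec)
{
    unsigned int offset = (bio->bi_sector << 9) + bio->bi_size;
    unsigned int room = 0x10000 - (offset & 0xffff);

    /* Allow at most "room" bytes from bvec; zero means "start a
       new BIO instead." */
    return min(room, bvec->bv_len);
}

/* At initialization time: */
blk_queue_merge_bvec(mydev_queue, mydev_merge_bvec);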
Setting the hardware sector size. The old hardsect_size
global array is gone and nobody misses it. Block drivers now inform the
system of the underlying hardware's sector size with:
void blk_queue_hardsect_size(request_queue_t *q, unsigned short size);
The default is the usual 512-byte sector. There is one other important
change with regard to sector sizes: your driver will always see requests
expressed in terms of 512-byte sectors, regardless of the hardware sector
size. The block subsystem will not generate requests which go against the
hardware sector size, but sector numbers and counts in requests are always
in 512-byte units. This change was required as part of the new centralized
partition remapping.
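A (hypothetical) device with 2KB hardware sectors, for example, would
declare that size when setting up, then shift the 512-byte sector numbers
it receives before talking to the hardware:

blk_queue_hardsect_size(mydev_queue, 2048);

/* Later, in the request function - req->sector is in 512-byte units: */
sector_t hw_sector = req->sector >> 2;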
DMA support
Most block I/O requests will come down to one or more DMA operations.
The 2.6 block layer provides a couple of functions designed to make the
task of setting up DMA operations easier.
void blk_queue_dma_alignment(request_queue_t *q, int mask);
This function sets a mask indicating what sort of memory alignment the
hardware needs for DMA requests; the default is 511.
DMA operations to modern devices usually require the creation of a
scatter/gather list of segments to be transferred. A block driver can
create this "scatterlist" using the generic DMA support routines and the
information found in the request. The block subsystem has made life a
little easier, though. A simple call to:
int blk_rq_map_sg(request_queue_t *q, struct request *rq,
                  struct scatterlist *sg);
will construct a scatterlist for the given request; the return value is the
number of entries in the resulting list. This scatterlist can then be
passed to pci_map_sg() or dma_map_sg() in preparation for
the DMA operation.
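A driver might use it roughly as follows; MYDEV_MAX_SEGMENTS, the mydev
structure, and its dma_dev pointer are all hypothetical, but the
scatterlist should be sized to match the segment limits set on the queue.

/* One entry per possible segment, sized at initialization time. */
static struct scatterlist mydev_sg[MYDEV_MAX_SEGMENTS];

static int mydev_start_dma(struct mydev *dev, struct request *req, int write)
{
    int nseg, nmapped;

    nseg = blk_rq_map_sg(dev->queue, req, mydev_sg);
    nmapped = dma_map_sg(dev->dma_dev, mydev_sg, nseg,
                         write ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
    /* Program the controller with the nmapped segments here. */
    return nmapped;
}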
Going on
The second part of the request queue article
series looks at command preparation, tagged command queueing, and writing
drivers which do without a request queue altogether.