The 2.5 development series has brought relatively few changes to the way
device drivers will allocate and manage memory. In fact, most drivers
should work with no changes in this regard. A few improvements have been
made, however, that are worth mentioning; these include some changes to
page allocation and the new "mempool" interface. Note that the
allocation and management of per-CPU data is described in a separate article.
Allocation flags
The old <linux/malloc.h> include file is gone; it is now
necessary to include <linux/slab.h> instead.
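For a typical driver the fix is a one-line change; a minimal sketch (the
function name is hypothetical):

    #include <linux/slab.h>   /* was <linux/malloc.h> in 2.4 */

    static void *my_setup_buffer(void)
    {
        /* kmalloc() and friends are otherwise unchanged. */
        return kmalloc(4096, GFP_KERNEL);
    }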
The GFP_BUFFER allocation flag is gone (it was actually removed in
2.4.6). That will bother few people, since almost nobody used it. There
are two new flags which have replaced it: GFP_NOIO and
GFP_NOFS. The GFP_NOIO flag allows sleeping, but no I/O
operations will be started to help satisfy the request. GFP_NOFS
is a bit less restrictive; some I/O operations can be started (writing to a
swap area, for example), but no filesystem operations will be performed.
For reference, here is the full set of allocation flags, from the most
restrictive to the least:
- GFP_ATOMIC: a high-priority allocation which will not sleep;
this is the flag to use in interrupt handlers and other non-blocking
situations.
- GFP_NOIO: blocking is possible, but no I/O will be
performed.
- GFP_NOFS: no filesystem operations will be performed.
- GFP_KERNEL: a regular, blocking allocation.
- GFP_USER: a blocking allocation for user-space pages.
- GFP_HIGHUSER: for allocating user-space pages where high
memory may be used.
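As a quick illustration of how these flags are chosen in practice, here
is a sketch (the helper function is hypothetical, and error handling is
omitted; kfree() tolerates NULL):

    #include <linux/slab.h>

    static void my_flag_examples(size_t size)
    {
        void *buf;

        /* Ordinary process context: sleeping is fine. */
        buf = kmalloc(size, GFP_KERNEL);
        kfree(buf);

        /* Interrupt handler, or spinlock held: must not sleep. */
        buf = kmalloc(size, GFP_ATOMIC);
        kfree(buf);

        /* On the block I/O path: may sleep, but must not start I/O. */
        buf = kmalloc(size, GFP_NOIO);
        kfree(buf);
    }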
The __GFP_DMA and __GFP_HIGHMEM flags still exist and may
be added to the above to direct an allocation to a particular memory zone.
In addition, 2.5.69 added some new modifiers:
- __GFP_REPEAT: this flag tells the page allocator to "try
harder," repeating failed allocation attempts if need be. Allocations
can still fail, but failure should be less likely.
- __GFP_NOFAIL: try even harder; allocations with this flag must
not fail. Needless to say, such an allocation could take a long time
to satisfy.
- __GFP_NORETRY: failed allocations should not be retried;
instead, a failure status will be returned to the caller immediately.
The __GFP_NOFAIL flag is sure to be tempting to programmers who
would rather not code failure paths, but that temptation should be resisted
most of the time. Only allocations which truly cannot be allowed to fail
should use this flag.
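For example, code which has a reasonable fallback path might prefer to
fail quickly rather than wait for the allocator's retry logic; a
hypothetical sketch:

    #include <linux/slab.h>

    static void *my_opportunistic_alloc(size_t size)
    {
        /* Fail immediately under memory pressure; the (hypothetical)
           caller falls back to a smaller, preallocated buffer. */
        return kmalloc(size, GFP_KERNEL | __GFP_NORETRY);
    }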
Page-level allocation
For page-level allocations, the alloc_pages() and
get_free_page() functions (and variants) exist as always. They
are now defined in <linux/gfp.h>, however, and there
are a few new ones as well. On NUMA systems, the allocator will do
its best to allocate pages on the same node as the caller. To explicitly
allocate pages on a different NUMA node, use:
    struct page *alloc_pages_node(int node_id,
                                  unsigned int gfp_mask,
                                  unsigned int order);
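So, for example, a driver which knows that its device lives on a
particular NUMA node might allocate its buffers there; a sketch (the
node number and order are arbitrary):

    #include <linux/gfp.h>

    static struct page *my_grab_pages(int node_id)
    {
        /* Order 2: a contiguous block of four pages on the given
           node; free it later with __free_pages(page, 2). */
        return alloc_pages_node(node_id, GFP_KERNEL, 2);
    }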
The memory allocator now distinguishes between "hot" and "cold" pages. A
hot page is one that is likely to be represented in the processor's cache;
cold pages, instead, must be fetched from RAM. In general, it is
preferable to use hot pages whenever possible, since they are already
cached. Even if the page is to be overwritten immediately (usually the
case with memory allocations, after all), hot pages are better -
overwriting them will not push some other, perhaps useful, data from the
cache. So alloc_pages() and friends will return hot pages when
they are available.
On occasion, however, a cold page is preferable. In particular, pages
which will be overwritten via a DMA read from a device might as well be
cold, since their cache data will be invalidated anyway. In this sort of
situation, the __GFP_COLD flag should be passed into the
allocation.
Of course, this whole scheme depends on the memory allocator knowing which
pages are likely to be hot. Normally, order-zero allocations (i.e. single
pages) are assumed to be hot. If you know the state of a page you are
freeing, you can tell the allocator explicitly with one of the following:
    void free_hot_page(struct page *page);
    void free_cold_page(struct page *page);
These functions only work with order-zero allocations; the hot/cold status
of larger blocks is not tracked.
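Putting the two together, a driver setting up a buffer that a device
will fill via DMA might do something like the following (a sketch; the
function names are made up):

    #include <linux/gfp.h>

    /* The device will overwrite this page via DMA, so its cache
       contents are useless; ask for a cold page. */
    static struct page *my_get_dma_page(void)
    {
        return alloc_page(GFP_KERNEL | __GFP_COLD);
    }

    /* If the CPU never touched the page, it is presumably still cold. */
    static void my_put_dma_page(struct page *page)
    {
        free_cold_page(page);
    }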
Memory pools
Memory pools were one of the very first changes in the 2.5 series - they
were added to 2.5.1 to support the new block I/O layer. The purpose of
mempools is to help out in situations where a memory allocation must
succeed, but sleeping is not an option. To that end, mempools pre-allocate
a pool of memory and reserve it until it is needed. Mempools make life
easier in some situations, but they should be used with restraint; each
mempool takes a chunk of kernel memory out of circulation and raises the
minimum amount of memory the kernel needs to run effectively.
To work with mempools, your code should include
<linux/mempool.h>. A mempool is created with
mempool_create():
    mempool_t *mempool_create(int min_nr,
                              mempool_alloc_t *alloc_fn,
                              mempool_free_t *free_fn,
                              void *pool_data);
Here, min_nr is the minimum number of pre-allocated objects that
the mempool tries to keep around. The mempool defers the actual allocation
and deallocation of objects to user-supplied routines, which have the
following prototypes:
    typedef void *(mempool_alloc_t)(int gfp_mask, void *pool_data);
    typedef void (mempool_free_t)(void *element, void *pool_data);
The allocation function should take care not to sleep unless
__GFP_WAIT is set in the given gfp_mask. In all of the
above cases, pool_data is a private pointer that may be used by
the allocation and deallocation functions.
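Should the slab-based helpers described below not fit, these routines
can be quite simple; here is a hypothetical pair which treats
pool_data as the size of the object to allocate:

    #include <linux/slab.h>
    #include <linux/mempool.h>

    static void *my_mempool_alloc(int gfp_mask, void *pool_data)
    {
        /* Honor gfp_mask as given: no sleeping unless __GFP_WAIT. */
        return kmalloc((unsigned long)pool_data, gfp_mask);
    }

    static void my_mempool_free(void *element, void *pool_data)
    {
        kfree(element);
    }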
Creators of mempools will often want to use the slab allocator to
do the actual object allocation and deallocation. To do that, create the
slab, pass it in to mempool_create() as the pool_data
value, and give mempool_alloc_slab and mempool_free_slab
as the allocation and deallocation functions.
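Here is a sketch of that pattern, using a made-up my_request structure:

    #include <linux/mempool.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    struct my_request {
        int status;             /* hypothetical contents */
    };

    static kmem_cache_t *my_cache;
    static mempool_t *my_pool;

    static int my_pool_init(void)
    {
        my_cache = kmem_cache_create("my_request",
                                     sizeof(struct my_request),
                                     0, 0, NULL, NULL);
        if (!my_cache)
            return -ENOMEM;

        /* Keep a minimum of 16 preallocated objects on hand. */
        my_pool = mempool_create(16, mempool_alloc_slab,
                                 mempool_free_slab, my_cache);
        if (!my_pool) {
            kmem_cache_destroy(my_cache);
            return -ENOMEM;
        }
        return 0;
    }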
A mempool may be returned to the system by passing it to
mempool_destroy(). You must have returned all items to the pool
before destroying it, or the mempool code will get upset and oops the
system.
Allocating and freeing objects from the mempool is done with:
    void *mempool_alloc(mempool_t *pool, int gfp_mask);
    void mempool_free(void *element, mempool_t *pool);
mempool_alloc() will first call the pool's allocation function to
satisfy the request; the pre-allocated pool will only be used if the
allocation function fails. The allocation may sleep if the given
gfp_mask allows it; it can also fail if memory is tight and the
preallocated pool has been exhausted.
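Continuing the hypothetical slab example above, a driver's I/O
submission path might look like:

    static int my_submit_request(void)
    {
        struct my_request *req;

        /* GFP_NOIO: this path can run while memory is being written
           out, so the allocation must not start new I/O itself. */
        req = mempool_alloc(my_pool, GFP_NOIO);
        if (!req)
            return -ENOMEM;

        /* ... fill in and queue the request ... */

        mempool_free(req, my_pool);
        return 0;
    }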
Finally, a pool can be resized, if necessary, with:
    int mempool_resize(mempool_t *pool, int new_min_nr, int gfp_mask);
This function will change the size of the pre-allocated pool, using the
given gfp_mask to allocate more memory if need be. Note that, as
of 2.5.60, mempool_resize() is disabled in the source, since
nobody is actually using it.