Linux Kernel Analysis: Memory Management

1. struct page

  /* Each physical page in the system has a struct page associated with

  * it to keep track of whatever it is we are using the page for at the

  * moment. Note that we have no way to track which tasks are using

  * a page, though if it is a pagecache page, rmap structures can tell us

  * who is mapping it.

  */

 struct page {

     unsigned long flags;        /* Atomic flags, some possibly

                      * updated asynchronously */

     atomic_t _count;        /* Usage count, see below. */

     union {

         atomic_t _mapcount;    /* Count of ptes mapped in mms,

                      * to show when page is mapped

                      * & limit reverse map searches.

                      */

         struct {    /* SLUB uses */

             short unsigned int inuse;

             short unsigned int offset;

         };

     };

     union {

         struct {

         unsigned long private;        /* Mapping-private opaque data:

                           * usually used for buffer_heads

                          * if PagePrivate set; used for

                          * swp_entry_t if PageSwapCache;

                          * indicates order in the buddy

                          * system if PG_buddy is set.

                          */

         struct address_space *mapping;    /* If low bit clear, points to

                          * inode address_space, or NULL.

                          * If page mapped as anonymous

                          * memory, low bit is set, and

                          * it points to anon_vma object:

                          * see PAGE_MAPPING_ANON below.

                          */

         };

 #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS

         spinlock_t ptl;

 #endif

         struct {            /* SLUB uses */

             void **lockless_freelist;

         struct kmem_cache *slab;    /* Pointer to slab */

         };

         struct {

         struct page *first_page;    /* Compound pages */

         };

     };

     union {

         pgoff_t index;        /* Our offset within mapping. */

         void *freelist;        /* SLUB: freelist req. slab lock */

     };

     struct list_head lru;        /* Pageout list, eg. active_list

                      * protected by zone->lru_lock !

                      */

     /*

      * On machines where all RAM is mapped into kernel address space,

      * we can simply calculate the virtual address. On machines with

      * highmem some memory is mapped into kernel virtual memory

      * dynamically, so we need a place to store that address.

      * Note that this field could be 16 bits on x86 ... ;)

      *

      * Architectures with slow multiplication can define

      * WANT_PAGE_VIRTUAL in asm/page.h

      */

 #if defined(WANT_PAGE_VIRTUAL)

     void *virtual;            /* Kernel virtual address (NULL if

                        not kmapped, ie. highmem) */

 #endif /* WANT_PAGE_VIRTUAL */

 };

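flags: The flags field stores the state of the page, for example whether the page is dirty or whether it is locked in memory. Each bit represents one state independently, so the field can describe at least 32 different states (unsigned long is 32 bits wide on 32-bit machines and 64 bits wide on 64-bit machines).

_count: The _count field stores the page's usage count, that is, how many references to the page exist. When nothing in the kernel references the page any longer, it can be used for a new allocation. Older 2.6 kernels biased this counter so that -1 meant "unused", which is why kernel code should call page_count() instead of reading the field directly; page_count() returns 0 for a free page.

virtual: The virtual field is the page's address in kernel virtual memory. Some memory (so-called high memory) is not permanently mapped into the kernel address space. For such pages this field is NULL, and the page must be mapped dynamically when it is needed.

The page structure is tied to physical pages, not to virtual pages, so what it describes is transient. Even if the data held in a page continues to exist, swapping may mean the data is no longer associated with the same page structure. The kernel uses this structure only to describe what the corresponding physical page holds at the current moment. Its purpose is to describe physical memory itself, not the data stored in that memory. The kernel uses this structure to manage every page in the system, because it needs to know whether a page is free (that is, whether it has been allocated). If a page has been allocated, the kernel also needs to know who owns it. Possible owners include user-space processes, dynamically allocated kernel data, static kernel code, and the page cache.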
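To make the roles of flags and _count concrete, here is a minimal user-space sketch, not kernel code. The type `demo_page`, the helper names, and the bit numbers are all hypothetical and chosen only for illustration; the kernel defines its real PG_* bits in include/linux/page-flags.h and manipulates them atomically.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit numbers, for illustration only. */
enum { DEMO_PG_LOCKED = 0, DEMO_PG_DIRTY = 4 };

/* A toy stand-in for struct page: one flags word and a usage count. */
struct demo_page {
    unsigned long flags;
    int count;
};

/* Each state occupies exactly one bit of the flags word. */
static void demo_set_flag(struct demo_page *p, int bit)
{
    p->flags |= 1UL << bit;
}

static void demo_clear_flag(struct demo_page *p, int bit)
{
    p->flags &= ~(1UL << bit);
}

static bool demo_test_flag(const struct demo_page *p, int bit)
{
    return (p->flags >> bit) & 1UL;
}

/* Take a reference on the page. */
static void demo_get_page(struct demo_page *p)
{
    p->count++;
}

/* Drop a reference; returns true when the last user is gone,
 * i.e. when the page could be handed out again. */
static bool demo_put_page(struct demo_page *p)
{
    assert(p->count > 0);
    return --p->count == 0;
}
```

The sketch is deliberately non-atomic; it only shows that each state is one bit and that a page becomes reusable when its last reference is dropped.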

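2. Memory zones

Some pages sit at particular physical addresses and therefore cannot be used for certain tasks. Because of this restriction, the kernel divides pages into zones, grouping pages with similar properties together.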

 struct zone {

     /* Fields commonly accessed by the page allocator */

     unsigned long        pages_min, pages_low, pages_high;

     /*

      * We don't know if the memory that we're going to allocate will be freeable

      * or/and it will be released eventually, so to avoid totally wasting several

      * GB of ram we must reserve some of the lower zone memory (otherwise we risk

      * to run OOM on the lower zones despite there's tons of freeable ram

      * on the higher zones). This array is recalculated at runtime if the

      * sysctl_lowmem_reserve_ratio sysctl changes.

      */

     unsigned long        lowmem_reserve[MAX_NR_ZONES];

 #ifdef CONFIG_NUMA

     int node;

     /*

      * zone reclaim becomes active if more unmapped pages exist.

      */

     unsigned long        min_unmapped_pages;

     unsigned long        min_slab_pages;

     struct per_cpu_pageset    *pageset[NR_CPUS];

 #else

     struct per_cpu_pageset    pageset[NR_CPUS];

 #endif

     /*

      * free areas of different sizes

      */

     spinlock_t        lock;

 #ifdef CONFIG_MEMORY_HOTPLUG

     /* see spanned/present_pages for more description */

     seqlock_t        span_seqlock;

 #endif

     struct free_area    free_area[MAX_ORDER];

     ZONE_PADDING(_pad1_)

     /* Fields commonly accessed by the page reclaim scanner */

     spinlock_t        lru_lock;

     struct list_head    active_list;

     struct list_head    inactive_list;

     unsigned long        nr_scan_active;

     unsigned long        nr_scan_inactive;

     unsigned long        pages_scanned;       /* since last reclaim */

     int            all_unreclaimable; /* All pages pinned */

     /* A count of how many reclaimers are scanning this zone */

     atomic_t        reclaim_in_progress;

     /* Zone statistics */

     atomic_long_t        vm_stat[NR_VM_ZONE_STAT_ITEMS];

     /*

      * prev_priority holds the scanning priority for this zone.  It is

      * defined as the scanning priority at which we achieved our reclaim

      * target at the previous try_to_free_pages() or balance_pgdat()

      * invokation.

      *

      * We use prev_priority as a measure of how much stress page reclaim is

      * under - it drives the swappiness decision: whether to unmap mapped

      * pages.

      *

      * Access to both this field is quite racy even on uniprocessor.  But

      * it is expected to average out OK.

      */

     int prev_priority;

     ZONE_PADDING(_pad2_)

     /* Rarely used or read-mostly fields */

     /*

      * wait_table        -- the array holding the hash table

      * wait_table_hash_nr_entries    -- the size of the hash table array

      * wait_table_bits    -- wait_table_size == (1 << wait_table_bits)

      *

      * The purpose of all these is to keep track of the people

      * waiting for a page to become available and make them

      * runnable again when possible. The trouble is that this

      * consumes a lot of space, especially when so few things

      * wait on pages at a given time. So instead of using

      * per-page waitqueues, we use a waitqueue hash table.

      *

      * The bucket discipline is to sleep on the same queue when

      * colliding and wake all in that wait queue when removing.

      * When something wakes, it must check to be sure its page is

      * truly available, a la thundering herd. The cost of a

      * collision is great, but given the expected load of the

      * table, they should be so rare as to be outweighed by the

      * benefits from the saved space.

      *

      * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the

      * primary users of these fields, and in mm/page_alloc.c

      * free_area_init_core() performs the initialization of them.

      */

     wait_queue_head_t    * wait_table;

     unsigned long        wait_table_hash_nr_entries;

     unsigned long        wait_table_bits;

     /*

      * Discontig memory support fields.

      */

     struct pglist_data    *zone_pgdat;

     /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */

     unsigned long        zone_start_pfn;

     /*

      * zone_start_pfn, spanned_pages and present_pages are all

      * protected by span_seqlock.  It is a seqlock because it has

      * to be read outside of zone->lock, and it is done in the main

      * allocator path.  But, it is written quite infrequently.

      *

      * The lock is declared along with zone->lock because it is

      * frequently read in proximity to zone->lock.  It's good to

      * give them a chance of being in the same cacheline.

      */

     unsigned long        spanned_pages;    /* total size, including holes */

     unsigned long        present_pages;    /* amount of memory (excluding holes) */

     /*

      * rarely used fields:

      */

     const char        *name;

 } ____cacheline_internodealigned_in_smp;

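name: The name field is a NUL-terminated string holding the name of the zone. The kernel initializes it during boot, in code located in mm/page_alloc.c. The three zones are named DMA, Normal and HighMem.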
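As a rough illustration of how physical addresses map onto these three zones, the sketch below classifies a page frame number. The boundaries are assumptions taken from classic 32-bit x86 (16 MiB for ISA DMA, 896 MiB for directly mapped low memory) and a 4 KiB page size; other architectures and configurations differ, and all names here are hypothetical.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT   12               /* assumes 4 KiB pages */
#define DEMO_DMA_LIMIT    (16ULL << 20)    /* assumed: 16 MiB ISA DMA limit */
#define DEMO_NORMAL_LIMIT (896ULL << 20)   /* assumed: 896 MiB lowmem limit */

enum demo_zone { DEMO_ZONE_DMA, DEMO_ZONE_NORMAL, DEMO_ZONE_HIGHMEM };

/* Zone names as they appear in the name field described above. */
static const char *const demo_zone_names[] = { "DMA", "Normal", "HighMem" };

/* Classify a page frame number by the physical address it covers. */
static enum demo_zone demo_zone_of(unsigned long pfn)
{
    unsigned long long paddr = (unsigned long long)pfn << DEMO_PAGE_SHIFT;

    if (paddr < DEMO_DMA_LIMIT)
        return DEMO_ZONE_DMA;
    if (paddr < DEMO_NORMAL_LIMIT)
        return DEMO_ZONE_NORMAL;
    return DEMO_ZONE_HIGHMEM;
}

/* Convenience wrapper returning the zone name for a frame. */
static const char *demo_zone_name(unsigned long pfn)
{
    return demo_zone_names[demo_zone_of(pfn)];
}
```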

Getting pages

 struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)

 void *page_address(struct page *page)

 unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

 struct page *alloc_page(gfp_t gfp_mask)

 unsigned long __get_free_page(gfp_t gfp_mask)

 unsigned long get_zeroed_page(gfp_t gfp_mask)

 __get_dma_pages(gfp_mask, order)
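The order argument in these interfaces means "2^order contiguous pages", not a byte count. The following user-space sketch shows how a byte size maps to an order, assuming a 4 KiB page size. The function `demo_order_for` is a hypothetical stand-in; the kernel provides get_order() for this purpose.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumed page size */

/* Smallest order such that (DEMO_PAGE_SIZE << order) >= size. */
static unsigned int demo_order_for(size_t size)
{
    unsigned int order = 0;
    size_t span = DEMO_PAGE_SIZE;

    assert(size > 0);
    while (span < size) {
        span <<= 1;     /* each step doubles the number of pages */
        order++;
    }
    return order;
}
```

For example, a request for 8193 bytes needs order 2 (four pages), because two pages are one byte short.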

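Getting memory in byte-sized chunks

 void *kmalloc(size_t size, gfp_t flags)

 void kfree(const void *ptr)

 void *vmalloc(unsigned long size)

 void vfree(const void *addr)

Memory returned by kmalloc() is contiguous both physically and virtually. Memory returned by vmalloc() is contiguous virtually but not necessarily physically. This is also how user-space allocation works: malloc() returns an address in the process's virtual address space (more precisely, in its heap), and although all of a process's memory is virtual memory, nothing guarantees that those virtual addresses are contiguous in physical RAM. vmalloc() is somewhat less efficient than kmalloc(), because turning physically scattered pages into one contiguous virtual range requires setting up dedicated page table entries. vmalloc() is generally used only when a large block of memory is needed.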
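The difference between virtual and physical contiguity can be sketched with a toy single-level page table. This is a user-space illustration under stated assumptions (4 KiB pages, a flat array standing in for the page table); `demo_translate` and the frame numbers used with it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12                          /* assumes 4 KiB pages */
#define DEMO_PAGE_MASK  ((1UL << DEMO_PAGE_SHIFT) - 1)

/* frames[i] is the physical frame backing virtual page i.
 * Consecutive virtual pages may be backed by scattered frames. */
static unsigned long demo_translate(const unsigned long *frames,
                                    size_t npages, unsigned long vaddr)
{
    size_t vpn = vaddr >> DEMO_PAGE_SHIFT;          /* virtual page number */

    assert(vpn < npages);
    return (frames[vpn] << DEMO_PAGE_SHIFT) | (vaddr & DEMO_PAGE_MASK);
}
```

With frames {7, 2, 9}, the adjacent virtual addresses 0xfff and 0x1000 translate to 0x7fff and 0x2000: neighbours in virtual memory, far apart in physical memory, which is the situation vmalloc() creates.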

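The buddy algorithm

I. Algorithm overview

A description of the algorithm can be found on Wikipedia. In outline:

Allocating memory:
1. Look for a free block of suitable size: the smallest power of two that is greater than or equal to the request (a request for 27 is served with 32).
   a. If such a block is found, hand it to the caller.
   b. If not, split a larger block to create one:
      i.   Split a free block that is larger than the request in half.
      ii.  If the minimum block size has been reached, allocate a block of that size.
      iii. Go back to step 1 and look for a suitable block again.
      iv.  Repeat until a suitable block is found.

Freeing memory:
1. Free the block.
   a. Check whether the neighbouring block (its buddy) is also free.
   b. If it is, merge the two blocks, and repeat until a buddy that is still in use is encountered or the upper limit is reached (all memory is free).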
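Two small calculations sit at the heart of these steps: rounding a request up to a power of two, and finding a block's buddy. The sketch below shows both in user-space C. The function names are hypothetical, and the XOR trick assumes block offsets are measured from the start of the managed region and that every block is aligned to its own size.

```c
#include <assert.h>

/* Round n up to the nearest power of two (27 -> 32). */
static unsigned long demo_round_up_pow2(unsigned long n)
{
    unsigned long p = 1;

    assert(n > 0);
    while (p < n)
        p <<= 1;
    return p;
}

/* Offset of the buddy of the block at `offset` with the given size.
 * Two buddies differ only in the bit that equals the block size,
 * so flipping that bit moves from one buddy to the other. */
static unsigned long demo_buddy_of(unsigned long offset, unsigned long size)
{
    assert(size != 0 && (size & (size - 1)) == 0);  /* power of two */
    assert(offset % size == 0);                     /* naturally aligned */
    return offset ^ size;
}
```

When a block is freed, the allocator checks whether the block at `demo_buddy_of(offset, size)` is also free and of the same size; if so, the pair merges into one block of twice the size starting at the lower of the two offsets.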

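That reads rather abstractly, so here is a worked example.

[Figure: buddy allocation of requests A, B, C and D from a 1024K block]

Suppose we start with a single 1024K block and need to allocate 70K for A:
1. Half of 1024K is still larger than 70K, so we split the 1024K block into two 512K halves.
2. Half of 512K is still larger than 70K, so we split a 512K block into two 256K halves.
3. Half of 256K is still larger than 70K, so we split a 256K block into two 128K halves.
4. Half of 128K (64K) is smaller than 70K, so we allocate a 128K block to A.

B, C and D are handled the same way. When memory is freed, adjacent buddies are merged back step by step (merging must undo the splits in reverse order).
A binary tree is a natural data structure for implementing this algorithm.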
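The halving steps of the 70K example can be reproduced with a few lines of user-space C. This is only a sketch of the splitting rule ("keep halving while half a block still fits the request"); `demo_split_for` and `struct demo_split` are hypothetical names.

```c
#include <assert.h>

struct demo_split {
    unsigned long block;    /* size of the block finally handed out */
    unsigned int splits;    /* number of halvings needed to get there */
};

/* Starting from one free block of `total` units, halve it
 * until a further halving would no longer fit `want`. */
static struct demo_split demo_split_for(unsigned long total,
                                        unsigned long want)
{
    struct demo_split r = { total, 0 };

    assert(want > 0 && want <= total);
    while (r.block / 2 >= want) {
        r.block /= 2;
        r.splits++;
    }
    return r;
}
```

Calling `demo_split_for(1024, 70)` yields a 128-unit block after 3 splits (1024 to 512 to 256 to 128), matching steps 1 to 4 above.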

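Implementation

The idea behind this buddy allocator is to track memory with a complete binary tree stored in an array. Each node records the state of the memory block it covers: nodes near the root correspond to large blocks and nodes near the leaves to small ones. During allocation and release, blocks are split and merged by updating these node markers. For a total of 16 units of memory, we build a full binary tree of depth 5. The root is at array index [0] and covers the block of size 16; its two children, at indexes [1~2], cover blocks of size 8; the third level, at indexes [3~6], covers blocks of size 4; and so on.

In the allocation phase we first search for a block of matching size. Suppose the first request is for 3 units, which rounds up to the power of two 4. The whole memory has to be halved repeatedly; getting from 16 down to 4 takes two steps, so we search depth-first from node [0] down to node [3] and mark it as allocated. A second request for 3 marks node [4]. A third request, for 6, needs a block of size 8, and the search ends at node [2], because the block under node [1] is occupied by nodes [3~4].

In the release phase we free the first and second allocations in turn, that is, first [3] and then [4]. After node [4] is released we find that the previously released [3] is its neighbour, so the two nodes are merged immediately. The next request for size 8 can then be satisfied by node [1]. If node [2] is released as well, it merges with [1] and the whole memory returns to its initial state.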
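The allocation and release sequence described above can be traced with a compact user-space sketch. It is one possible realization of the array-based tree, assuming each node stores the size of the largest free block in its subtree (0 means the node itself is allocated); the names and the fixed size of 16 units are illustrative choices, not the kernel's implementation.

```c
#include <assert.h>

#define DEMO_SIZE  16                    /* total units, a power of two */
#define DEMO_NODES (2 * DEMO_SIZE - 1)   /* nodes in the full binary tree */

/* demo_longest[i]: largest free block in the subtree rooted at node i. */
static unsigned int demo_longest[DEMO_NODES];

static unsigned int demo_max(unsigned int a, unsigned int b)
{
    return a > b ? a : b;
}

static void demo_buddy_init(void)
{
    unsigned int node_size = DEMO_SIZE * 2;

    for (unsigned int i = 0; i < DEMO_NODES; i++) {
        if (((i + 1) & i) == 0)          /* i + 1 is a power of two: new level */
            node_size /= 2;
        demo_longest[i] = node_size;
    }
}

/* Returns the unit offset of the allocated block, or -1 if nothing fits. */
static int demo_buddy_alloc(unsigned int size)
{
    unsigned int index = 0, node_size, want = 1;
    int offset;

    if (size == 0 || size > DEMO_SIZE)
        return -1;
    while (want < size)                  /* round up to a power of two */
        want <<= 1;
    if (demo_longest[0] < want)
        return -1;

    /* Descend towards a node of exactly the wanted size, preferring left. */
    for (node_size = DEMO_SIZE; node_size != want; node_size /= 2) {
        unsigned int left = 2 * index + 1;
        index = (demo_longest[left] >= want) ? left : left + 1;
    }
    demo_longest[index] = 0;             /* mark this node as allocated */
    offset = (int)((index + 1) * node_size - DEMO_SIZE);

    while (index) {                      /* refresh the ancestors */
        index = (index - 1) / 2;
        demo_longest[index] = demo_max(demo_longest[2 * index + 1],
                                       demo_longest[2 * index + 2]);
    }
    return offset;
}

/* Frees the block that was allocated at `offset`. */
static void demo_buddy_free(int offset)
{
    unsigned int node_size = 1;
    unsigned int index;

    assert(offset >= 0 && offset < DEMO_SIZE);
    index = (unsigned int)offset + DEMO_SIZE - 1;   /* leaf for this offset */

    /* Climb until the node that was marked allocated is found. */
    for (; demo_longest[index] != 0; index = (index - 1) / 2) {
        node_size *= 2;
        if (index == 0)
            return;                      /* nothing allocated here */
    }
    demo_longest[index] = node_size;

    while (index) {                      /* merge buddies on the way up */
        unsigned int l, r;

        index = (index - 1) / 2;
        node_size *= 2;
        l = demo_longest[2 * index + 1];
        r = demo_longest[2 * index + 2];
        demo_longest[index] = (l + r == node_size) ? node_size
                                                   : demo_max(l, r);
    }
}
```

With this sketch, requests for 3, 3 and 6 units return offsets 0, 4 and 8, which are the blocks covered by nodes [3], [4] and [2]. Freeing offsets 0 and 4 restores node [1] to 8, and freeing offset 8 restores the root to 16.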

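The slab allocator: what is it, and why use it?

Allocating and freeing data structures is one of the most common operations in any kernel. To make frequent allocation and release cheap, programmers often use free lists. A free list holds blocks of already allocated data structures that are ready for use. When code needs a new instance of a data structure, it takes one from the free list instead of allocating memory, which is faster; when the instance is no longer needed, it is put back on the free list instead of being released. In this sense a free list acts as an object cache: fast storage for a frequently used type of object. The familiar process descriptor, struct task_struct, is one structure that is allocated through slab.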
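A free list of this kind can be sketched in a few lines of user-space C. The sketch uses malloc() as the backing allocator and hypothetical names (`demo_obj`, `demo_cache`); it shows the idea only, not how the slab layer is implemented.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_obj {
    struct demo_obj *next_free;   /* link, used only while on the free list */
    int payload;
};

struct demo_cache {
    struct demo_obj *free_list;   /* objects ready for reuse */
    unsigned long fresh_allocs;   /* how often we had to call malloc() */
};

/* Take an object: reuse one from the free list if possible. */
static struct demo_obj *demo_cache_get(struct demo_cache *c)
{
    struct demo_obj *o = c->free_list;

    if (o) {
        c->free_list = o->next_free;
        return o;
    }
    o = malloc(sizeof(*o));
    if (o)
        c->fresh_allocs++;
    return o;
}

/* Give an object back: push it onto the free list instead of freeing it. */
static void demo_cache_put(struct demo_cache *c, struct demo_obj *o)
{
    o->next_free = c->free_list;
    c->free_list = o;
}

/* Really release everything still held on the free list. */
static void demo_cache_drain(struct demo_cache *c)
{
    while (c->free_list) {
        struct demo_obj *o = c->free_list;

        c->free_list = o->next_free;
        free(o);
    }
}
```

Getting an object, putting it back and getting one again returns the same memory without a second call to malloc(), which is the saving a free list is meant to provide.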

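Linux provides the slab layer (the slab allocator) to implement this kind of fast data-structure cache; it acts as a generic caching layer for data structures. The slab layer divides objects into groups called caches, each of which stores one type of object, so there is one cache per object type. Creating a cache and obtaining objects from it are covered later. For example, one cache holds process descriptors (a free list of task_struct structures), while another holds inode objects (struct inode). The kmalloc() interface is built on top of the slab layer and uses a family of general-purpose caches. As the kmalloc() source later in this article shows, it calls kmem_cache_alloc(malloc_sizes[i].cs_cachep, flags), or the cs_dmacachep variant for GFP_DMA requests. The caches are in turn divided into slabs (hence the name of the subsystem). A slab consists of one or more physically contiguous pages; typically it is just a single page. Each cache may consist of multiple slabs.

Each slab contains a number of objects, where an object is an instance of the cached data structure. A slab is in one of three states: full, partial, or empty. A full slab has no free objects (every object in it has been allocated). An empty slab has no allocated objects (every object in it is free). A partial slab has some objects allocated and some free. When some part of the kernel needs a new object, the request is served from a partial slab first. If there is no partial slab, it is served from an empty slab. If there is no empty slab either, a new slab has to be created.

How the slab allocator creates a slab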
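The selection order (partial first, then empty, otherwise grow) can be sketched as follows. This user-space example scans a plain array for simplicity, whereas the kernel keeps the three kinds of slab on separate lists; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum demo_slab_state { DEMO_SLAB_EMPTY, DEMO_SLAB_PARTIAL, DEMO_SLAB_FULL };

struct demo_slab {
    unsigned int inuse;      /* objects currently handed out */
    unsigned int capacity;   /* objects the slab can hold */
};

static enum demo_slab_state demo_state_of(const struct demo_slab *s)
{
    if (s->inuse == 0)
        return DEMO_SLAB_EMPTY;
    if (s->inuse == s->capacity)
        return DEMO_SLAB_FULL;
    return DEMO_SLAB_PARTIAL;
}

/* Pick the slab to allocate from: a partial slab if there is one,
 * otherwise the first empty slab. Returns -1 when every slab is
 * full, i.e. when the caller would have to create a new slab. */
static int demo_pick_slab(const struct demo_slab *slabs, size_t n)
{
    int empty = -1;

    for (size_t i = 0; i < n; i++) {
        enum demo_slab_state st = demo_state_of(&slabs[i]);

        if (st == DEMO_SLAB_PARTIAL)
            return (int)i;
        if (st == DEMO_SLAB_EMPTY && empty < 0)
            empty = (int)i;
    }
    return empty;
}
```

Preferring partial slabs keeps empty slabs empty for as long as possible, so their pages remain easy to give back to the page allocator under memory pressure.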

 /*
  * Interface to system's page allocator. No need to hold the cache-lock.
  *
  * If we requested dmaable memory, we will get it. Even if we
  * did not request dmaable memory, we might get it, but that
  * would be relatively rare and ignorable.
  */
 static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
     struct page *page;
     int nr_pages;
     int i;

 #ifndef CONFIG_MMU
     /*
      * Nommu uses slab's for process anonymous memory allocations, and thus
      * requires __GFP_COMP to properly refcount higher order allocations
      */
     flags |= __GFP_COMP;
 #endif

     flags |= cachep->gfpflags;

     page = alloc_pages_node(nodeid, flags, cachep->gfporder);
     if (!page)
         return NULL;

     nr_pages = (1 << cachep->gfporder);
     if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
         add_zone_page_state(page_zone(page),
             NR_SLAB_RECLAIMABLE, nr_pages);
     else
         add_zone_page_state(page_zone(page),
             NR_SLAB_UNRECLAIMABLE, nr_pages);
     for (i = 0; i < nr_pages; i++)
         __SetPageSlab(page + i);
     return page_address(page);
 }

 /*
  * Interface to system's page release.
  */
 static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 {
     unsigned long i = (1 << cachep->gfporder);
     struct page *page = virt_to_page(addr);
     const unsigned long nr_freed = i;

     if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
         sub_zone_page_state(page_zone(page),
                 NR_SLAB_RECLAIMABLE, nr_freed);
     else
         sub_zone_page_state(page_zone(page),
                 NR_SLAB_UNRECLAIMABLE, nr_freed);
     while (i--) {
         BUG_ON(!PageSlab(page));
         __ClearPageSlab(page);
         page++;
     }
     if (current->reclaim_state)
         current->reclaim_state->reclaimed_slab += nr_freed;
     free_pages((unsigned long)addr, cachep->gfporder);
 }

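Slab allocation is built on the buddy algorithm: the page frames behind each slab are requested from the buddy system. At the lowest level the slab allocator still obtains memory through the page-level functions described earlier, such as __get_free_pages() (the kmem_getpages() listing above calls alloc_pages_node()). This is the foundation for everything else the slab allocator does. Put simply, the slab allocator improves the efficiency of memory use and of the CPU's access to memory. For memory that is requested and released over and over, it keeps a cache from which already prepared memory can be taken directly, so nothing has to be allocated at the moment the memory is needed. For data structures that are constantly allocated and freed, such as file descriptors in the file system and process descriptors, this kind of cache is a great benefit. As an aside, the kernel does a lot of work on caching in general. The page cache used when accessing block devices follows much the same reasoning as slab, except that the page cache reduces the cost of repeatedly reading and writing slow storage devices: block-device data is cached in memory and flushed back periodically.

If requests for a kind of memory area are infrequent, they are handled by a family of general-purpose caches. The kmalloc() interface mentioned earlier is built on the slab layer and uses these general-purpose caches.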

 static inline void *kmalloc(size_t size, gfp_t flags)

 {

     if (__builtin_constant_p(size)) {

         int i = 0;

 #define CACHE(x) \

         if (size <= x) \

             goto found; \

         else \

             i++;

 #include "kmalloc_sizes.h"

 #undef CACHE

         {

             extern void __you_cannot_kmalloc_that_much(void);

             __you_cannot_kmalloc_that_much();

         }

 found:

 #ifdef CONFIG_ZONE_DMA

         if (flags & GFP_DMA)

             return kmem_cache_alloc(malloc_sizes[i].cs_dmacachep,

                         flags);

 #endif

         return kmem_cache_alloc(malloc_sizes[i].cs_cachep, flags);

     }

     return __kmalloc(size, flags);

 }

 static inline void *kzalloc(size_t size, gfp_t flags)

 {

     if (__builtin_constant_p(size)) {

         int i = 0;

 #define CACHE(x) \

         if (size <= x) \

             goto found; \

         else \

             i++;

 #include "kmalloc_sizes.h"

 #undef CACHE

         {

             extern void __you_cannot_kzalloc_that_much(void);

             __you_cannot_kzalloc_that_much();

         }

 found:

 #ifdef CONFIG_ZONE_DMA

         if (flags & GFP_DMA)

             return kmem_cache_zalloc(malloc_sizes[i].cs_dmacachep,

                         flags);

 #endif

         return kmem_cache_zalloc(malloc_sizes[i].cs_cachep, flags);

     }

     return __kzalloc(size, flags);

 }
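In the listings above, each CACHE(x) line pulled in from kmalloc_sizes.h expands to one "if (size <= x) goto found; else i++;" step, so i ends up as the index of the first general-purpose cache large enough for the request. If no size class fits a compile-time-constant size, the call to the undefined function __you_cannot_kmalloc_that_much() turns the mistake into a link error. The loop below is a user-space sketch of the same lookup. The size classes listed are assumptions for illustration; the real list depends on the kernel configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size classes; the real ones come from kmalloc_sizes.h. */
static const size_t demo_sizes[] = { 32, 64, 128, 256, 512, 1024, 2048, 4096 };

#define DEMO_NCLASSES (sizeof(demo_sizes) / sizeof(demo_sizes[0]))

/* Index of the smallest size class that fits `size`,
 * or -1 when the request is larger than every class. */
static int demo_size_class(size_t size)
{
    for (size_t i = 0; i < DEMO_NCLASSES; i++)
        if (size <= demo_sizes[i])
            return (int)i;
    return -1;
}
```

A 100-byte request lands in the 128-byte class, so 28 bytes of that object go unused. This internal fragmentation is the price of serving arbitrary sizes from a small number of caches.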

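Design of the slab layer

The kmem_cache structure

Every cache is represented by a kmem_cache structure. It refers to three lists, slabs_full, slabs_partial and slabs_free (called slabs_empty in some descriptions), which are kept in the kmem_list3 structure. Together these lists contain every slab in the cache. The slab descriptor, struct slab, describes each individual slab.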

 struct kmem_cache {

 /* 1) per-cpu data, touched during every alloc/free */

     struct array_cache *array[NR_CPUS];

 /* 2) Cache tunables. Protected by cache_chain_mutex */

     unsigned int batchcount;

     unsigned int limit;

     unsigned int shared;

     unsigned int buffer_size;

     u32 reciprocal_buffer_size;

 /* 3) touched by every alloc & free from the backend */

     unsigned int flags;        /* constant flags */

     unsigned int num;        /* # of objs per slab */

 /* 4) cache_grow/shrink */

     /* order of pgs per slab (2^n) */

     unsigned int gfporder;

     /* force GFP flags, e.g. GFP_DMA */

     gfp_t gfpflags;

     size_t colour;            /* cache colouring range */

     unsigned int colour_off;    /* colour offset */

     struct kmem_cache *slabp_cache;

     unsigned int slab_size;

     unsigned int dflags;        /* dynamic flags */

     /* constructor func */

     void (*ctor) (void *, struct kmem_cache *, unsigned long);

 /* 5) cache creation/removal */

     const char *name;

     struct list_head next;

 /* 6) statistics */

 #if STATS

     unsigned long num_active;

     unsigned long num_allocations;

     unsigned long high_mark;

     unsigned long grown;

     unsigned long reaped;

     unsigned long errors;

     unsigned long max_freeable;

     unsigned long node_allocs;

     unsigned long node_frees;

     unsigned long node_overflow;

     atomic_t allochit;

     atomic_t allocmiss;

     atomic_t freehit;

     atomic_t freemiss;

 #endif

 #if DEBUG

     /*

      * If debugging is enabled, then the allocator can add additional

      * fields and/or padding to every object. buffer_size contains the total

      * object size including these internal fields, the following two

      * variables contain the offset to the user object and its size.

      */

     int obj_offset;

     int obj_size;

 #endif

     /*

      * We put nodelists[] at the end of kmem_cache, because we want to size

      * this array to nr_node_ids slots instead of MAX_NUMNODES

      * (see kmem_cache_init())

      * We still use [MAX_NUMNODES] and not [1] or [0] because cache_cache

      * is statically defined, so we reserve the max number of nodes.

      */

     struct kmem_list3 *nodelists[MAX_NUMNODES];

     /*

      * Do not add fields after nodelists[]

      */

 };

The kmem_list3 structure

 /*

  * The slab lists for all objects.

  */

 struct kmem_list3 {

     struct list_head slabs_partial;    /* partial list first, better asm code */

     struct list_head slabs_full;

     struct list_head slabs_free;

     unsigned long free_objects;

     unsigned int free_limit;

     unsigned int colour_next;    /* Per-node cache coloring */

     spinlock_t list_lock;

     struct array_cache *shared;    /* shared per node */

     struct array_cache **alien;    /* on other nodes */

     unsigned long next_reap;    /* updated without locking */

     int free_touched;        /* updated without locking */

 };

The struct slab structure

 /*

  * struct slab

  *

  * Manages the objs in a slab. Placed either at the beginning of mem allocated

  * for a slab, or allocated from an general cache.

  * Slabs are chained into three list: fully used, partial, fully free slabs.

  */

 struct slab {

     struct list_head list;

     unsigned long colouroff;

     void *s_mem;        /* including colour offset */

     unsigned int inuse;    /* num of objs active in slab */

     kmem_bufctl_t free;

     unsigned short nodeid;

 };

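Slab allocator interface

A new cache is created with kmem_cache_create():

 struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *))

On success, kmem_cache_create() returns a pointer to the newly created cache; otherwise it returns NULL. It must not be called from interrupt context, because it may sleep. The prototype above is the five-argument form of later kernels; the 2.6-era code quoted in this article passes an additional destructor argument.

When a cache has neither a partial nor an empty slab, the slab layer obtains memory from the page allocator through kmem_getpages(), which in the listing above calls alloc_pages_node(). When available memory runs low and the system tries to free memory for use, or when a cache is explicitly destroyed, kmem_freepages() is called to release the memory.

To destroy a cache, call:

 int kmem_cache_destroy(struct kmem_cache *cachep)

This function destroys the given cache. It is normally called from the exit code of a module that created its own cache. It must not be called from interrupt context either, because it may sleep. Two conditions must hold before calling it:

1. Every slab in the cache must be empty.

2. Nothing may access the cache while the function is running. The function returns 0 on success and a non-zero value on failure.

Allocating slab objects from a cache

Once a cache has been created, objects are obtained from it with void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags), which returns a pointer to an object from the given cache cachep. If no slab in the cache has a free object, the slab layer must obtain new pages through kmem_getpages(), as described earlier, and build a new slab from them.

To release an object and return it to the slab it came from, use void kmem_cache_free(struct kmem_cache *cachep, void *objp), which marks the object objp in cachep as free.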

A slab allocator usage example: task_struct

1. The kernel keeps a pointer to the task_struct cache in a global variable: struct kmem_cache *task_struct_cachep;

2. During kernel initialization, fork_init(), defined in kernel/fork.c, creates the cache with kmem_cache_create(), whose prototype in this kernel version is:

 struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *, struct kmem_cache *, unsigned long), void (*dtor)(void *, struct kmem_cache *, unsigned long))

 void __init fork_init(unsigned long mempages)

 {

 #ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR

 #ifndef ARCH_MIN_TASKALIGN

 #define ARCH_MIN_TASKALIGN    L1_CACHE_BYTES

 #endif

     /* create a slab on which task_structs can be allocated */

     task_struct_cachep =

         kmem_cache_create("task_struct", sizeof(struct task_struct),

             ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);

 #endif

     /*

      * The default maximum number of threads is set to a safe

      * value: the thread structures can take up at most half

      * of memory.

      */

     max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE);

     /*

      * we need to allow at least 20 threads to boot a system

      */

     if(max_threads < 20)

         max_threads = 20;

     init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;

     init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;

     init_task.signal->rlim[RLIMIT_SIGPENDING] =

         init_task.signal->rlim[RLIMIT_NPROC];

 }

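This creates a cache named task_struct that holds objects of type struct task_struct. The third argument requests that each object be aligned to ARCH_MIN_TASKALIGN bytes within the slab, which defaults to L1_CACHE_BYTES.

Every time a process calls fork(), a new process descriptor must be created. This is done in dup_task_struct(), which do_fork() reaches through copy_process():

fork() -> sys_fork() -> do_fork() -> copy_process() -> dup_task_struct()

do_fork()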
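The thread limit computed at the end of fork_init() is easy to check by hand. The sketch below repeats the arithmetic in user space, assuming the divisor is 8 and the floor is 20 as in the 2.6-era source (the comment in the listing confirms the floor of 20). The function name and the sample machine (1 GiB of RAM, 4 KiB pages, 8 KiB kernel stacks) are assumptions for illustration.

```c
#include <assert.h>

/* Mirrors the max_threads calculation in fork_init():
 * thread structures may use at most 1/8 of all pages. */
static unsigned long demo_max_threads(unsigned long mempages,
                                      unsigned long thread_size,
                                      unsigned long page_size)
{
    unsigned long max_threads;

    /* The divisor below must not round down to zero. */
    assert(page_size > 0 && 8 * thread_size >= page_size);

    max_threads = mempages / (8 * thread_size / page_size);
    if (max_threads < 20)        /* at least 20 threads to boot a system */
        max_threads = 20;
    return max_threads;
}
```

For 262144 pages (1 GiB with 4 KiB pages) and 8 KiB stacks this gives 262144 / 16 = 16384 threads, and the RLIMIT_NPROC default set from it is half of that, 8192.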

 /*

  *  Ok, this is the main fork-routine.

  *

  * It copies the process, and if successful kick-starts

  * it and waits for it to finish using the VM if required.

  */

 long do_fork(unsigned long clone_flags,

           unsigned long stack_start,

           struct pt_regs *regs,

           unsigned long stack_size,

           int __user *parent_tidptr,

           int __user *child_tidptr)

 {

     struct task_struct *p;

     int trace = 0;

     struct pid *pid = alloc_pid();

     long nr;

     if (!pid)

         return -EAGAIN;

     nr = pid->nr;

     if (unlikely(current->ptrace)) {

         trace = fork_traceflag (clone_flags);

         if (trace)

             clone_flags |= CLONE_PTRACE;

     }

     p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, pid);

     /*

      * Do this prior waking up the new thread - the thread pointer

      * might get invalid after that point, if the thread exits quickly.

      */

     if (!IS_ERR(p)) {

         struct completion vfork;

         if (clone_flags & CLONE_VFORK) {

             p->vfork_done = &vfork;

             init_completion(&vfork);

         }

         if ((p->ptrace & PT_PTRACED) || (clone_flags & CLONE_STOPPED)) {

             /*

              * We'll start up with an immediate SIGSTOP.

              */

             sigaddset(&p->pending.signal, SIGSTOP);

             set_tsk_thread_flag(p, TIF_SIGPENDING);

         }

         if (!(clone_flags & CLONE_STOPPED))

             wake_up_new_task(p, clone_flags);

         else

             p->state = TASK_STOPPED;

         if (unlikely (trace)) {

             current->ptrace_message = nr;

             ptrace_notify ((trace << 8) | SIGTRAP);

         }

         if (clone_flags & CLONE_VFORK) {

             freezer_do_not_count();

             wait_for_completion(&vfork);

             freezer_count();

             if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE)) {

                 current->ptrace_message = nr;

                 ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);

             }

         }

     } else {

         free_pid(pid);

         nr = PTR_ERR(p);

     }

     return nr;

 }

　　copy_process()

 /*

  * This creates a new process as a copy of the old one,

  * but does not actually start it yet.

  *

  * It copies the registers, and all the appropriate

  * parts of the process environment (as per the clone

  * flags). The actual kick-off is left to the caller.

  */

 static struct task_struct *copy_process(unsigned long clone_flags,

                     unsigned long stack_start,

                     struct pt_regs *regs,

                     unsigned long stack_size,

                     int __user *parent_tidptr,

                     int __user *child_tidptr,

                     struct pid *pid)

 {

     int retval;

     struct task_struct *p = NULL;

     if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))

         return ERR_PTR(-EINVAL);

     /*

      * Thread groups must share signals as well, and detached threads

      * can only be started up within the thread group.

      */

     if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))

         return ERR_PTR(-EINVAL);

     /*

      * Shared signal handlers imply shared VM. By way of the above,

      * thread groups also imply shared VM. Blocking this case allows

      * for various simplifications in other code.

      */

     if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))

         return ERR_PTR(-EINVAL);

     retval = security_task_create(clone_flags);

     if (retval)

         goto fork_out;

     retval = -ENOMEM;

     p = dup_task_struct(current);

     if (!p)

         goto fork_out;

         ..........

}

　　dup_task_struct

 static struct task_struct *dup_task_struct(struct task_struct *orig)

 {

     struct task_struct *tsk;

     struct thread_info *ti;

     prepare_to_copy(orig);

     tsk = alloc_task_struct();

     if (!tsk)

         return NULL;

     ti = alloc_thread_info(tsk);

     if (!ti) {

         free_task_struct(tsk);

         return NULL;

     }

     *tsk = *orig;

     tsk->stack = ti;

     setup_thread_stack(tsk, orig);

 #ifdef CONFIG_CC_STACKPROTECTOR

     tsk->stack_canary = get_random_int();

 #endif

     /* One for us, one for whoever does the "release_task()" (usually parent) */

     atomic_set(&tsk->usage,2);

     atomic_set(&tsk->fs_excl, 0);

 #ifdef CONFIG_BLK_DEV_IO_TRACE

     tsk->btrace_seq = 0;

 #endif

     tsk->splice_pipe = NULL;

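     return tsk;

 }

After a task exits, if no child processes are waiting on it, its process descriptor is freed and returned to the task_struct_cachep slab cache. This is done in free_task_struct() (where tsk is the exiting task), which calls kmem_cache_free():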

 /**

  * kmem_cache_free - Deallocate an object

  * @cachep: The cache the allocation was from.

  * @objp: The previously allocated object.

  *

  * Free an object which was previously allocated from this

  * cache.

  */

 void kmem_cache_free(struct kmem_cache *cachep, void *objp)

 {

     unsigned long flags;

     BUG_ON(virt_to_cache(objp) != cachep);

     local_irq_save(flags);

     debug_check_no_locks_freed(objp, obj_size(cachep));

     __cache_free(cachep, objp);

     local_irq_restore(flags);

 }

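 kmem_cache_free(task_struct_cachep, tsk);

Because the process descriptor is a core part of the kernel and is needed at all times, the task_struct_cachep cache is never destroyed. Objects are only released with kmem_cache_free(); kmem_cache_destroy() is never called on this cache.
End. (This article draws mainly on Linux Kernel Development, Understanding the Linux Kernel, and a full annotation of the Linux 0.11 kernel. I am only beginning to study the Linux kernel, so corrections are welcome.)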
