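Linux treats memory allocation requests from kernel mode and from user mode differently.

The kernel trusts itself, so when kernel code asks for memory, the requested amount is allocated immediately.

The kernel does not trust applications, however. An application that allocates a block of memory is usually only reserving it and does not access it right away. Because there are many applications, memory that has been allocated but not yet touched would account for a large share of the total.

1. The kernel therefore uses the Page Fault exception handler to defer the actual allocation of memory requested by an application.

2. When a user-mode application allocates memory, what it gets back is not a page descriptor (struct page) but a new range of linear addresses that it is now allowed to use.

Unlike the kernel, an application does not occupy its address space contiguously; the space is divided into separate intervals. If the application accesses an address that no interval covers, an exception results. This is the usual source of invalid-access errors such as "Segmentation fault" (SIGSEGV) on Linux.

An access inside an interval can also trigger a Page Fault, because the space is only reserved and no page has been prepared for it yet. The key question is who handles the fault: if the backing store is a file, the file/inode is responsible for preparing the page; if the page was requested dynamically (an anonymous mapping), a zeroed page is returned.

The fault itself is raised by the CPU when address translation fails, but the CPU knows nothing about intervals. The intervals are defined by the operating system, so it is the kernel's Page Fault handler that decides whether the address is invalid and reports the error to the process.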
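The reserve-first, allocate-on-access behaviour described above can be observed from user space. The following is a minimal sketch, assuming a Linux system where mmap supports MAP_ANONYMOUS; the function name anon_pages_are_zeroed is made up for this example. It reserves a range, touches each page once, and checks that every page handed out on first access is zero-filled.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve `npages` pages of anonymous memory, then touch each page once.
 * Returns 1 if every page read back as zero, 0 if a non-zero byte was
 * seen, and -1 if the mapping could not be created.
 */
int anon_pages_are_zeroed(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0)
        return -1;

    size_t len = npages * (size_t)page;

    /* mmap only reserves the address range; no physical page is needed yet. */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int all_zero = 1;
    for (size_t i = 0; i < npages; i++) {
        /* The first access to each page is what makes the kernel supply it. */
        if (p[i * (size_t)page] != 0)
            all_zero = 0;
        p[i * (size_t)page] = 1;    /* writing needs a private page of our own */
    }

    munmap(p, len);
    return all_zero;
}
```

Calling anon_pages_are_zeroed(16) on a typical Linux system is expected to return 1: the mapping call itself is cheap, and the pages only come into existence as the loop touches them.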

Memory Descriptor

The memory descriptor is stored in a struct mm_struct and is referenced through the mm field of the process descriptor (task_struct->mm).

struct mm_struct {
    struct vm_area_struct * mmap;        /* list of VMAs */
    struct rb_root mm_rb;
    struct vm_area_struct * mmap_cache;    /* last find_vma result */
#ifdef CONFIG_MMU
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
    void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
#endif
    unsigned long mmap_base;        /* base of mmap area */
    unsigned long task_size;        /* size of task vm space */
    unsigned long cached_hole_size;     /* if non-zero, the largest hole below free_area_cache */
    unsigned long free_area_cache;        /* first hole of size cached_hole_size or larger */
    pgd_t * pgd;
    atomic_t mm_users;            /* How many users with user space? */
    atomic_t mm_count;            /* How many references to "struct mm_struct" (users count as 1) */
    int map_count;                /* number of VMAs */

    spinlock_t page_table_lock;        /* Protects page tables and some counters */
    struct rw_semaphore mmap_sem;

    struct list_head mmlist;        /* List of maybe swapped mm's.    These are globally strung
                         * together off init_mm.mmlist, and are protected
                         * by mmlist_lock
                         */

    unsigned long hiwater_rss;    /* High-watermark of RSS usage */
    unsigned long hiwater_vm;    /* High-water virtual memory usage */

    unsigned long total_vm, locked_vm, shared_vm, exec_vm;
    unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;

    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

    /*
     * Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    struct mm_rss_stat rss_stat;

    struct linux_binfmt *binfmt;

    cpumask_var_t cpu_vm_mask_var;

    /* Architecture-specific MM context */
    mm_context_t context;

    /* Swap token stuff */
    /*
     * Last value of global fault stamp as seen by this process.
     * In other words, this value gives an indication of how long
     * it has been since this task got the token.
     * Look at mm/thrash.c
     */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;

    /* How many tasks sharing this mm are OOM_DISABLE */
    atomic_t oom_disable_count;

    unsigned long flags; /* Must use atomic bitops to access the bits */

    struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t        ioctx_lock;
    struct hlist_head    ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct __rcu *owner;
#endif

    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
    unsigned long num_exe_file_vmas;
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
    struct cpumask cpumask_allocation;
#endif
};

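Each memory region is described by a struct vm_area_struct.

Each memory descriptor manages its memory regions in two ways:

1. mmap, a linked list of all memory regions in the address space, sorted from low to high addresses. [Convenient for traversal.]

2. mm_rb, a red-black tree of all memory regions in the address space. [Convenient for lookup.]

Other important members:

pgd, the address of the page global directory of this process. [This is the linear address of the page directory; see http://www.cnblogs.com/long123king/p/3506893.html]

map_count, the number of memory regions in the address space.

mmlist (a struct list_head), which links this memory descriptor into a global list.

start_code/end_code/start_data/end_data, the start and end addresses of the code and data segments, as the names suggest.

brk, the current end of the process heap (start_brk is where the heap begins).

context, the architecture-specific context of the address space; on x86 this is mainly the LDT information.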

Memory descriptors of kernel threads

struct task_struct {
    /* ... */
    struct mm_struct *mm, *active_mm;
    /* ... */
};

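A kernel thread runs only in kernel mode, so it never uses the user part of the linear address space (0 to 3GB on 32-bit x86 with the default split) and has no memory regions of its own, because the kernel-mode linear address space is contiguous.

In every process's page tables, the entries that cover the 3GB-4GB range are identical, so a kernel thread can run on any process's page directory. To avoid needlessly flushing the TLB and the CPU hardware caches, the kernel keeps using the page directory of the previously running process whenever it can.

This is why task_struct has two memory descriptor pointers: mm and active_mm.

For an ordinary process the two are equal; both point to the process's own memory descriptor.

A kernel thread has no memory descriptor of its own, so its mm is NULL. When a kernel thread is scheduled to run, the active_mm field of its task_struct is set to the active_mm of the task that was running before it.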
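The handling of mm and active_mm can be sketched as follows. This is a simplified user-space model, not the kernel's context switch code; toy_mm, toy_task and toy_switch_mm are names invented for the illustration, and details such as reference counting and lazy TLB mode are left out.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields discussed above. */
struct toy_mm { int id; };

struct toy_task {
    struct toy_mm *mm;         /* NULL for a kernel thread */
    struct toy_mm *active_mm;  /* address space actually loaded while running */
};

/*
 * Simplified version of what happens to the mm pointers on a context
 * switch from `prev` to `next`. Returns 1 if the page tables would have
 * to be switched, 0 if `next` simply keeps running on the previous ones.
 */
int toy_switch_mm(struct toy_task *prev, struct toy_task *next)
{
    if (next->mm == NULL) {
        /* Kernel thread: borrow the address space of the previous task. */
        next->active_mm = prev->active_mm;
        return 0;
    }
    /* Ordinary process: mm and active_mm are the same descriptor. */
    next->active_mm = next->mm;
    return next->mm != prev->active_mm;
}
```

In this model, switching from a process to a kernel thread and back to the same process never reports a page table switch, which is the saving the borrowed active_mm is meant to provide.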

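Whenever a kernel-mode mapping has to change, the kernel updates the canonical set of kernel page tables rooted at swapper_pg_dir.

The change is then propagated to each process's page directory lazily, by the Page Fault handler, when that process first touches the affected address.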

Memory Region

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task.  A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
    struct mm_struct * vm_mm;    /* The address space we belong to. */
    unsigned long vm_start;        /* Our start address within vm_mm. */
    unsigned long vm_end;        /* The first byte after our end address
                       within vm_mm. */

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next, *vm_prev;

    pgprot_t vm_page_prot;        /* Access permissions of this VMA. */
    unsigned long vm_flags;        /* Flags, see mm.h. */

    struct rb_node vm_rb;

    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap prio tree, or
     * linkage to the list of like vmas hanging off its node, or
     * linkage of vma in the address_space->i_mmap_nonlinear list.
     */
    union {
        struct {
            struct list_head list;
            void *parent;    /* aligns with prio_tree_node parent */
            struct vm_area_struct *head;
        } vm_set;

        struct raw_prio_tree_node prio_tree_node;
    } shared;

    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages.    A MAP_SHARED vma
     * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
    struct list_head anon_vma_chain; /* Serialized by mmap_sem &
                      * page_table_lock */
    struct anon_vma *anon_vma;    /* Serialized by page_table_lock */

    /* Function pointers to deal with this struct. */
    const struct vm_operations_struct *vm_ops;

    /* Information about our backing store: */
    unsigned long vm_pgoff;        /* Offset (within vm_file) in PAGE_SIZE
                       units, *not* PAGE_CACHE_SIZE */
    struct file * vm_file;        /* File we map to (can be NULL). */
    void * vm_private_data;        /* was vm_pte (shared mem) */

#ifndef CONFIG_MMU
    struct vm_region *vm_region;    /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
    struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
#endif
};

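mm_struct indexes its vm_area_struct objects in two ways: the mmap member heads a doubly linked list sorted by ascending linear address, and mm_rb is the root of a red-black tree.

There is still only one vm_area_struct instance per memory region, though; the two data structures simply maintain the same object together.

The benefit is that each data structure serves a different purpose:

1. The red-black tree quickly finds the memory region that contains a given linear address.

2. The doubly linked list is used to walk all memory regions in order.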
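The idea of one object being linked into two indexes at once can be sketched in plain C. This is an illustration only: all toy_* names are invented here, the list is singly linked, and an unbalanced binary search tree stands in for the kernel's red-black tree.

```c
#include <stddef.h>

/* One region object carrying the links for both indexes. */
struct toy_region {
    unsigned long start, end;            /* covers [start, end) */
    struct toy_region *next;             /* address-ordered list link */
    struct toy_region *left, *right;     /* search tree links */
};

struct toy_space {
    struct toy_region *list_head;  /* plays the role of mm->mmap */
    struct toy_region *tree_root;  /* plays the role of mm->mm_rb */
    int count;                     /* plays the role of mm->map_count */
};

/* Link `r` into both indexes. Regions are assumed not to overlap. */
void toy_insert(struct toy_space *s, struct toy_region *r)
{
    /* Sorted list: find the first element that starts above r. */
    struct toy_region **link = &s->list_head;
    while (*link && (*link)->start < r->start)
        link = &(*link)->next;
    r->next = *link;
    *link = r;

    /* Search tree keyed by start address. */
    struct toy_region **t = &s->tree_root;
    while (*t)
        t = (r->start < (*t)->start) ? &(*t)->left : &(*t)->right;
    r->left = r->right = NULL;
    *t = r;
    s->count++;
}

/* Tree walk: return the region containing addr, or NULL. */
struct toy_region *toy_lookup(const struct toy_space *s, unsigned long addr)
{
    struct toy_region *r = s->tree_root;
    while (r) {
        if (addr < r->start)
            r = r->left;
        else if (addr >= r->end)
            r = r->right;
        else
            return r;
    }
    return NULL;
}
```

No region is ever copied: toy_insert only sets pointers inside the one object, so a walk over the list and a lookup through the tree always reach the same instance.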

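Page access attributes

They are stored in three places:

1. Each page table entry has flags that describe the access rights of the page it maps. This is what the x86 hardware checks to decide whether an access is allowed;

2. Each page descriptor (struct page) has flags. These are for the operating system's own checks;

3. Each memory region (vm_area_struct) has flags (vm_flags) that describe the access rights of all pages in the region.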
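How region-level flags can be checked against an access is sketched below. The flag values and the names toy_access_allowed and TOY_VM_* are hypothetical and chosen for this example; the kernel's real check is more involved and partly architecture-specific.

```c
/* Hypothetical flag values for this sketch, not the kernel's definitions. */
enum {
    TOY_VM_READ  = 1 << 0,
    TOY_VM_WRITE = 1 << 1,
    TOY_VM_EXEC  = 1 << 2
};

enum toy_access { TOY_ACCESS_READ, TOY_ACCESS_WRITE, TOY_ACCESS_EXEC };

/* Return 1 if a region with `region_flags` allows `access`, else 0. */
int toy_access_allowed(unsigned long region_flags, enum toy_access access)
{
    switch (access) {
    case TOY_ACCESS_WRITE:
        return (region_flags & TOY_VM_WRITE) != 0;
    case TOY_ACCESS_EXEC:
        return (region_flags & TOY_VM_EXEC) != 0;
    case TOY_ACCESS_READ:
        return (region_flags & TOY_VM_READ) != 0;
    }
    return 0;   /* unknown access type: refuse */
}
```

A single flags word per region is enough because every page in the region shares the same rights; the per-page copies in the page table entries exist so that the hardware can enforce them without consulting the region.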

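Operations on memory regions

find_vma: finds the first memory region whose end address is greater than the target address. The result is the region closest to the address; it does not necessarily contain it.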

/* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma = NULL;

    if (mm) {
        /* Check the cache first. */
        /* (Cache hit rate is typically around 35%.) */
        vma = ACCESS_ONCE(mm->mmap_cache);
        if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
            struct rb_node * rb_node;

            rb_node = mm->mm_rb.rb_node;
            vma = NULL;

            while (rb_node) {
                struct vm_area_struct * vma_tmp;

                vma_tmp = rb_entry(rb_node,
                        struct vm_area_struct, vm_rb);

                if (vma_tmp->vm_end > addr) {
                    vma = vma_tmp;
                    if (vma_tmp->vm_start <= addr)
                        break;
                    rb_node = rb_node->rb_left;
                } else
                    rb_node = rb_node->rb_right;
            }
            if (vma)
                mm->mmap_cache = vma;
        }
    }
    return vma;
}
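The lookup rule used by find_vma, "first region with vm_end > addr", can be tried out in user space. The sketch below is an assumption-laden stand-in: it replaces the red-black tree and the mmap_cache with a binary search over an address-sorted array, and toy_vma and toy_find_vma are names made up for the example.

```c
#include <stddef.h>

struct toy_vma {
    unsigned long start, end;   /* covers [start, end) */
};

/*
 * `vmas` is sorted by address and non-overlapping. Return the first
 * entry whose end is above addr, or NULL if there is none. The result
 * may start above addr, so it does not necessarily contain it.
 */
const struct toy_vma *toy_find_vma(const struct toy_vma *vmas, size_t n,
                                   unsigned long addr)
{
    size_t lo = 0, hi = n;          /* the answer index lies in [lo, hi] */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (vmas[mid].end > addr)
            hi = mid;               /* candidate; look for an earlier one */
        else
            lo = mid + 1;           /* entirely below addr; go right */
    }
    return lo < n ? &vmas[lo] : NULL;
}
```

With regions [0x1000, 0x2000) and [0x4000, 0x6000), looking up 0x3000 returns the second region even though 0x3000 lies in the hole between them, so a caller that needs containment still has to compare addr with the start of the result.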

find_vma_intersection: finds the first memory region that overlaps the target address range.

/* Look up the first VMA which intersects the interval start_addr..end_addr-1,
   NULL if none.  Assume start_addr < end_addr. */
static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr)
{
    struct vm_area_struct * vma = find_vma(mm,start_addr);

    if (vma && end_addr <= vma->vm_start)
        vma = NULL;
    return vma;
}
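The test in find_vma_intersection is one half of the usual overlap check for half-open ranges: find_vma already guarantees start_addr < vma->vm_end, so only vma->vm_start < end_addr remains to be verified. The full condition, written out as a stand-alone helper with an invented name, looks like this:

```c
/*
 * Two half-open ranges [a_start, a_end) and [b_start, b_end) overlap
 * exactly when each one starts before the other ends.
 */
int toy_ranges_intersect(unsigned long a_start, unsigned long a_end,
                         unsigned long b_start, unsigned long b_end)
{
    return a_start < b_end && b_start < a_end;
}
```

Ranges that merely touch, such as [0x1000, 0x2000) and [0x2000, 0x3000), do not intersect under this rule, which matches vm_end being the first byte after the region.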

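get_unmapped_area: finds a hole between memory regions that is large enough for the requested length; the hole is a candidate location for a new memory region.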

unsigned long
get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
        unsigned long pgoff, unsigned long flags)
{
    unsigned long (*get_area)(struct file *, unsigned long,
                  unsigned long, unsigned long, unsigned long);

    unsigned long error = arch_mmap_check(addr, len, flags);
    if (error)
        return error;

    /* Careful about overflows.. */
    if (len > TASK_SIZE)
        return -ENOMEM;

    get_area = current->mm->get_unmapped_area;
    if (file && file->f_op && file->f_op->get_unmapped_area)
        get_area = file->f_op->get_unmapped_area;
    addr = get_area(file, addr, len, pgoff, flags);
    if (IS_ERR_VALUE(addr))
        return addr;

    if (addr > TASK_SIZE - len)
        return -ENOMEM;
    if (addr & ~PAGE_MASK)
        return -EINVAL;

    return arch_rebalance_pgtables(addr, len);
}
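get_unmapped_area itself only validates and dispatches; the actual search is done by the get_area callback. What such a search might look like is sketched below as a first-fit scan from low to high addresses. This is an assumption for illustration, not the kernel's algorithm (which also uses free_area_cache and may search top-down); toy_area, toy_find_hole and TOY_NO_HOLE are invented names.

```c
#include <stddef.h>

struct toy_area {
    unsigned long start, end;   /* covers [start, end) */
};

#define TOY_NO_HOLE ((unsigned long)-1)

/*
 * First-fit search over address-sorted, non-overlapping areas.
 * Returns the lowest address in [base, limit) where `len` bytes fit
 * without touching any area, or TOY_NO_HOLE if nothing fits.
 */
unsigned long toy_find_hole(const struct toy_area *areas, size_t n,
                            unsigned long base, unsigned long limit,
                            unsigned long len)
{
    unsigned long addr = base;

    if (len == 0 || base > limit || len > limit - base)
        return TOY_NO_HOLE;

    for (size_t i = 0; i < n; i++) {
        if (areas[i].end <= addr)
            continue;                       /* area lies below the cursor */
        if (areas[i].start >= addr && areas[i].start - addr >= len)
            return addr;                    /* gap before this area fits */
        addr = areas[i].end;                /* skip past the area */
        if (addr > limit || len > limit - addr)
            return TOY_NO_HOLE;             /* ran out of address space */
    }
    return addr;                            /* space after the last area */
}
```

The comparison len > limit - addr mirrors the overflow-safe form addr > TASK_SIZE - len used in the kernel function above, which avoids computing addr + len directly.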

insert_vm_struct: adds a given memory region to a given memory descriptor.

/* Insert vm structure into process list sorted by address
 * and into the inode's i_mmap tree.  If vm_file is non-NULL
 * then i_mmap_mutex is taken here.
 */
int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
{
    struct vm_area_struct * __vma, * prev;
    struct rb_node ** rb_link, * rb_parent;

    /*
     * The vm_pgoff of a purely anonymous vma should be irrelevant
     * until its first write fault, when page's anon_vma and index
     * are set.  But now set the vm_pgoff it will almost certainly
     * end up with (unless mremap moves it elsewhere before that
     * first wfault), so /proc/pid/maps tells a consistent story.
     *
     * By setting it to reflect the virtual start address of the
     * vma, merges and splits can happen in a seamless way, just
     * using the existing file pgoff checks and manipulations.
     * Similarly in do_mmap_pgoff and in do_brk.
     */
    if (!vma->vm_file) {
        BUG_ON(vma->anon_vma);
        vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
    }
    __vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
    if (__vma && __vma->vm_start < vma->vm_end)
        return -ENOMEM;
    if ((vma->vm_flags & VM_ACCOUNT) &&
         security_vm_enough_memory_mm(mm, vma_pages(vma)))
        return -ENOMEM;
    vma_link(mm, vma, prev, rb_link, rb_parent);
    return 0;
}

do_mmap: allocates a linear address interval (a memory region); the implementation relies on do_mmap_pgoff and mmap_region.

/*
* 'kernel.h' contains some often-used function prototypes etc
*/
#define __ALIGN_KERNEL(x, a)        __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))

static inline unsigned long do_mmap(struct file *file, unsigned long addr,
    unsigned long len, unsigned long prot,
    unsigned long flag, unsigned long offset)
{
    unsigned long ret = -EINVAL;
    if ((offset + PAGE_ALIGN(len)) < offset)
        goto out;
    if (!(offset & ~PAGE_MASK))
        ret = do_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
out:
    return ret;
}

/*
 * The caller must hold down_write(&current->mm->mmap_sem).
 */

unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot,
            unsigned long flags, unsigned long pgoff)
{
    struct mm_struct * mm = current->mm;
    struct inode *inode;
    vm_flags_t vm_flags;
    int error;
    unsigned long reqprot = prot;

    /*
     * Does the application expect PROT_READ to imply PROT_EXEC?
     *
     * (the exception is when the underlying filesystem is noexec
     *  mounted, in which case we dont add PROT_EXEC.)
     */
    if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
        if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
            prot |= PROT_EXEC;

    if (!len)
        return -EINVAL;

    if (!(flags & MAP_FIXED))
        addr = round_hint_to_min(addr);

    /* Careful about overflows.. */
    len = PAGE_ALIGN(len);
    if (!len)
        return -ENOMEM;

    /* offset overflow? */
    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        return -EOVERFLOW;

    /* Too many mappings? */
    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

    /* Obtain the address to map to. we verify (or select) it and ensure
     * that it represents a valid section of the address space.
     */
    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    if (addr & ~PAGE_MASK)
        return addr;

    /* Do simple checking here so the lower-level routines won't have
     * to. we assume access permissions have been handled by the open
     * of the memory object, so we don't do any here.
     */
    vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

    if (flags & MAP_LOCKED)
        if (!can_do_mlock())
            return -EPERM;

    /* mlock MCL_FUTURE? */
    if (vm_flags & VM_LOCKED) {
        unsigned long locked, lock_limit;
        locked = len >> PAGE_SHIFT;
        locked += mm->locked_vm;
        lock_limit = rlimit(RLIMIT_MEMLOCK);
        lock_limit >>= PAGE_SHIFT;
        if (locked > lock_limit && !capable(CAP_IPC_LOCK))
            return -EAGAIN;
    }

    inode = file ? file->f_path.dentry->d_inode : NULL;

    if (file) {
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
                return -EACCES;

            /*
             * Make sure we don't allow writing to an append-only
             * file..
             */
            if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
                return -EACCES;

            /*
             * Make sure there are no mandatory locks on the file.
             */
            if (locks_verify_locked(inode))
                return -EAGAIN;

            vm_flags |= VM_SHARED | VM_MAYSHARE;
            if (!(file->f_mode & FMODE_WRITE))
                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

            /* fall through */
        case MAP_PRIVATE:
            if (!(file->f_mode & FMODE_READ))
                return -EACCES;
            if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
                if (vm_flags & VM_EXEC)
                    return -EPERM;
                vm_flags &= ~VM_MAYEXEC;
            }

            if (!file->f_op || !file->f_op->mmap)
                return -ENODEV;
            break;

        default:
            return -EINVAL;
        }
    } else {
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            /*
             * Ignore pgoff.
             */
            pgoff = 0;
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            break;
        case MAP_PRIVATE:
            /*
             * Set pgoff according to addr for anon_vma.
             */
            pgoff = addr >> PAGE_SHIFT;
            break;
        default:
            return -EINVAL;
        }
    }

    error = security_file_mmap(file, reqprot, prot, flags, addr, 0);
    if (error)
        return error;

    return mmap_region(file, addr, len, flags, vm_flags, pgoff);
}
EXPORT_SYMBOL(do_mmap_pgoff);

unsigned long mmap_region(struct file *file, unsigned long addr,
              unsigned long len, unsigned long flags,
              vm_flags_t vm_flags, unsigned long pgoff)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma, *prev;
    int correct_wcount = 0;
    int error;
    struct rb_node **rb_link, *rb_parent;
    unsigned long charged = 0;
    struct inode *inode =  file ? file->f_path.dentry->d_inode : NULL;

    /* Clear old maps */
    error = -ENOMEM;
munmap_back:
    vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
    if (vma && vma->vm_start < addr + len) {
        if (do_munmap(mm, addr, len))
            return -ENOMEM;
        goto munmap_back;
    }

    /* Check against address space limit. */
    if (!may_expand_vm(mm, len >> PAGE_SHIFT))
        return -ENOMEM;

    /*
     * Set 'VM_NORESERVE' if we should not account for the
     * memory use of this mapping.
     */
    if ((flags & MAP_NORESERVE)) {
        /* We honor MAP_NORESERVE if allowed to overcommit */
        if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
            vm_flags |= VM_NORESERVE;

        /* hugetlb applies strict overcommit unless MAP_NORESERVE */
        if (file && is_file_hugepages(file))
            vm_flags |= VM_NORESERVE;
    }

    /*
     * Private writable mapping: check memory availability
     */
    if (accountable_mapping(file, vm_flags)) {
        charged = len >> PAGE_SHIFT;
        if (security_vm_enough_memory(charged))
            return -ENOMEM;
        vm_flags |= VM_ACCOUNT;
    }

    /*
     * Can we just expand an old mapping?
     */
    vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
    if (vma)
        goto out;

    /*
     * Determine the object being mapped and call the appropriate
     * specific mapper. the address has already been validated, but
     * not unmapped, but the maps are removed from the list.
     */
    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    if (!vma) {
        error = -ENOMEM;
        goto unacct_error;
    }

    vma->vm_mm = mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags;
    vma->vm_page_prot = vm_get_page_prot(vm_flags);
    vma->vm_pgoff = pgoff;
    INIT_LIST_HEAD(&vma->anon_vma_chain);

    if (file) {
        error = -EINVAL;
        if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
            goto free_vma;
        if (vm_flags & VM_DENYWRITE) {
            error = deny_write_access(file);
            if (error)
                goto free_vma;
            correct_wcount = 1;
        }
        vma->vm_file = file;
        get_file(file);
        error = file->f_op->mmap(file, vma);
        if (error)
            goto unmap_and_free_vma;
        if (vm_flags & VM_EXECUTABLE)
            added_exe_file_vma(mm);

        /* Can addr have changed??
         *
         * Answer: Yes, several device drivers can do it in their
         *         f_op->mmap method. -DaveM
         */
        addr = vma->vm_start;
        pgoff = vma->vm_pgoff;
        vm_flags = vma->vm_flags;
    } else if (vm_flags & VM_SHARED) {
        error = shmem_zero_setup(vma);
        if (error)
            goto free_vma;
    }

    if (vma_wants_writenotify(vma)) {
        pgprot_t pprot = vma->vm_page_prot;

        /* Can vma->vm_page_prot have changed??
         *
         * Answer: Yes, drivers may have changed it in their
         *         f_op->mmap method.
         *
         * Ensures that vmas marked as uncached stay that way.
         */
        vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
        if (pgprot_val(pprot) == pgprot_val(pgprot_noncached(pprot)))
            vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    }

    vma_link(mm, vma, prev, rb_link, rb_parent);
    file = vma->vm_file;

    /* Once vma denies write, undo our temporary denial count */
    if (correct_wcount)
        atomic_inc(&inode->i_writecount);
out:
    perf_event_mmap(vma);

    mm->total_vm += len >> PAGE_SHIFT;
    vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
    if (vm_flags & VM_LOCKED) {
        if (!mlock_vma_pages_range(vma, addr, addr + len))
            mm->locked_vm += (len >> PAGE_SHIFT);
    } else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
        make_pages_present(addr, addr + len);
    return addr;

unmap_and_free_vma:
    if (correct_wcount)
        atomic_inc(&inode->i_writecount);
    vma->vm_file = NULL;
    fput(file);

    /* Undo any partial mapping done by a device driver. */
    unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
    charged = 0;
free_vma:
    kmem_cache_free(vm_area_cachep, vma);
unacct_error:
    if (charged)
        vm_unacct_memory(charged);
    return error;
}
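The two checks at the top of do_mmap (the wrap-around test on offset + PAGE_ALIGN(len) and the alignment test on offset) are easy to try in isolation. The sketch below assumes 4 KiB pages and uses invented TOY_* names instead of the kernel's PAGE_SHIFT, PAGE_MASK and PAGE_ALIGN.

```c
#include <errno.h>

#define TOY_PAGE_SHIFT 12                       /* assumed: 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))

/* Round len up to a multiple of the page size; wraps to 0 on overflow. */
unsigned long toy_page_align(unsigned long len)
{
    return (len + TOY_PAGE_SIZE - 1) & TOY_PAGE_MASK;
}

/*
 * Mirror of the checks at the top of do_mmap: on success store the page
 * offset in *pgoff and return 0, otherwise return -EINVAL.
 */
int toy_offset_to_pgoff(unsigned long offset, unsigned long len,
                        unsigned long *pgoff)
{
    if (offset + toy_page_align(len) < offset)
        return -EINVAL;             /* offset + length wrapped around */
    if (offset & ~TOY_PAGE_MASK)
        return -EINVAL;             /* offset is not page aligned */
    *pgoff = offset >> TOY_PAGE_SHIFT;
    return 0;
}
```

Because the arithmetic is unsigned, a sum that is smaller than one of its operands is a reliable sign of wrap-around, which is why the first comparison works without any wider integer type.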

do_munmap: releases a linear address interval, which may cover only part of a memory region.

/* Munmap is split into 2 main parts -- this part which finds
 * what needs doing, and the areas themselves, which do the
 * work.  This now handles partial unmappings.
 * Jeremy Fitzhardinge <jeremy@goop.org>
 */
int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
{
    unsigned long end;
    struct vm_area_struct *vma, *prev, *last;

    if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
        return -EINVAL;

    if ((len = PAGE_ALIGN(len)) == 0)
        return -EINVAL;

    /* Find the first overlapping VMA */
    vma = find_vma(mm, start);
    if (!vma)
        return 0;
    prev = vma->vm_prev;
    /* we have  start < vma->vm_end  */

    /* if it doesn't overlap, we have nothing.. */
    end = start + len;
    if (vma->vm_start >= end)
        return 0;

    /*
     * If we need to split any vma, do it now to save pain later.
     *
     * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
     * unmapped vm_area_struct will remain in use: so lower split_vma
     * places tmp vma above, and higher split_vma places tmp vma below.
     */
    if (start > vma->vm_start) {
        int error;

        /*
         * Make sure that map_count on return from munmap() will
         * not exceed its limit; but let map_count go just above
         * its limit temporarily, to help free resources as expected.
         */
        if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
            return -ENOMEM;

        error = __split_vma(mm, vma, start, 0);
        if (error)
            return error;
        prev = vma;
    }

    /* Does it split the last one? */
    last = find_vma(mm, end);
    if (last && end > last->vm_start) {
        int error = __split_vma(mm, last, end, 1);
        if (error)
            return error;
    }
    vma = prev? prev->vm_next: mm->mmap;

    /*
     * unlock any mlock()ed ranges before detaching vmas
     */
    if (mm->locked_vm) {
        struct vm_area_struct *tmp = vma;
        while (tmp && tmp->vm_start < end) {
            if (tmp->vm_flags & VM_LOCKED) {
                mm->locked_vm -= vma_pages(tmp);
                munlock_vma_pages_all(tmp);
            }
            tmp = tmp->vm_next;
        }
    }

    /*
     * Remove the vma's, and unmap the actual pages
     */
    detach_vmas_to_be_unmapped(mm, vma, prev, end);
    unmap_region(mm, vma, prev, start, end);

    /* Fix up all other VM information */
    remove_vma_list(mm, vma);

    return 0;
}
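The splitting that do_munmap performs for a partial unmapping can be reduced to interval arithmetic on a single region. The following sketch works on plain ranges only; toy_range and toy_unmap_from are names invented for the example, and nothing here touches page tables or real mappings.

```c
struct toy_range {
    unsigned long start, end;   /* covers [start, end) */
};

/*
 * Remove [cut.start, cut.end) from `region`. The surviving pieces are
 * written to out[0..1] in address order and their count (0, 1 or 2) is
 * returned. Two pieces mean the region was split in the middle.
 */
int toy_unmap_from(struct toy_range region, struct toy_range cut,
                   struct toy_range out[2])
{
    int n = 0;

    if (cut.end <= region.start || cut.start >= region.end) {
        out[n++] = region;                  /* no overlap: nothing removed */
        return n;
    }
    if (cut.start > region.start) {         /* piece below the cut survives */
        out[n].start = region.start;
        out[n].end = cut.start;
        n++;
    }
    if (cut.end < region.end) {             /* piece above the cut survives */
        out[n].start = cut.end;
        out[n].end = region.end;
        n++;
    }
    return n;
}
```

The case that returns two pieces is the one that makes the number of regions grow by one, which is why do_munmap checks map_count against sysctl_max_map_count before splitting a region in the middle.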

Page Fault exception

/*
 * This routine handles page faults.  It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 */
dotraplinkage void __kprobes
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    struct vm_area_struct *vma;
    struct task_struct *tsk;
    unsigned long address;
    struct mm_struct *mm;
    int fault;
    int write = error_code & PF_WRITE;
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
                    (write ? FAULT_FLAG_WRITE : 0);

    tsk = current;
    mm = tsk->mm;

    /* Get the faulting address: */
    address = read_cr2();

    /*
     * Detect and handle instructions that would cause a page fault for
     * both a tracked kernel page and a userspace page.
     */
    if (kmemcheck_active(regs))
        kmemcheck_hide(regs);
    prefetchw(&mm->mmap_sem);

    if (unlikely(kmmio_fault(regs, address)))
        return;

    /*
     * We fault-in kernel-space virtual memory on-demand. The
     * 'reference' page table is init_mm.pgd.
     *
     * NOTE! We MUST NOT take any locks for this case. We may
     * be in an interrupt or a critical region, and should
     * only copy the information from the master page table,
     * nothing more.
     *
     * This verifies that the fault happens in kernel space
     * (error_code & 4) == 0, and that the fault was not a
     * protection error (error_code & 9) == 0.
     */
    if (unlikely(fault_in_kernel_space(address))) {
        if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
            if (vmalloc_fault(address) >= 0)
                return;

            if (kmemcheck_fault(regs, address, error_code))
                return;
        }

        /* Can handle a stale RO->RW TLB: */
        if (spurious_fault(error_code, address))
            return;

        /* kprobes don't want to hook the spurious faults: */
        if (notify_page_fault(regs))
            return;
        /*
         * Don't take the mm semaphore here. If we fixup a prefetch
         * fault we could otherwise deadlock:
         */
        bad_area_nosemaphore(regs, error_code, address);

        return;
    }

    /* kprobes don't want to hook the spurious faults: */
    if (unlikely(notify_page_fault(regs)))
        return;
    /*
     * It's safe to allow irq's after cr2 has been saved and the
     * vmalloc fault has been handled.
     *
     * User-mode registers count as a user access even for any
     * potential system fault or CPU buglet:
     */
    if (user_mode_vm(regs)) {
        local_irq_enable();
        error_code |= PF_USER;
    } else {
        if (regs->flags & X86_EFLAGS_IF)
            local_irq_enable();
    }

    if (unlikely(error_code & PF_RSVD))
        pgtable_bad(regs, error_code, address);

    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);

    /*
     * If we're in an interrupt, have no user context or are running
     * in an atomic region then we must not take the fault:
     */
    if (unlikely(in_atomic() || !mm)) {
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }

    /*
     * When running in the kernel we expect faults to occur only to
     * addresses in user space.  All other faults represent errors in
     * the kernel and should generate an OOPS.  Unfortunately, in the
     * case of an erroneous fault occurring in a code path which already
     * holds mmap_sem we will deadlock attempting to validate the fault
     * against the address space.  Luckily the kernel only validly
     * references user space from well defined areas of code, which are
     * listed in the exceptions table.
     *
     * As the vast majority of faults will be valid we will only perform
     * the source reference check when there is a possibility of a
     * deadlock. Attempt to lock the address space, if we cannot we then
     * validate the source. If this is invalid we can skip the address
     * space check, thus avoiding the deadlock:
     */
    if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
        if ((error_code & PF_USER) == 0 &&
            !search_exception_tables(regs->ip)) {
            bad_area_nosemaphore(regs, error_code, address);
 124:             bad_area_nosemaphore(regs, error_code, address);

 125:             return;

 126:         }

 127: retry:

 128:         down_read(&mm->mmap_sem);

 129:     } else {

 130:         /*

 131:          * The above down_read_trylock() might have succeeded in

 132:          * which case we'll have missed the might_sleep() from

 133:          * down_read():

 134:          */

 135:         might_sleep();

 136:     }

 137:  

 138:     vma = find_vma(mm, address);

 139:     if (unlikely(!vma)) {

 140:         bad_area(regs, error_code, address);

 141:         return;

 142:     }

 143:     if (likely(vma->vm_start <= address))

 144:         goto good_area;

 145:     if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {

 146:         bad_area(regs, error_code, address);

 147:         return;

 148:     }

 149:     if (error_code & PF_USER) {

 150:         /*

 151:          * Accessing the stack below %sp is always a bug.

 152:          * The large cushion allows instructions like enter

 153:          * and pusha to work. ("enter $65535, $31" pushes

 154:          * 32 pointers and then decrements %sp by 65535.)

 155:          */

 156:         if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {

 157:             bad_area(regs, error_code, address);

 158:             return;

 159:         }

 160:     }

 161:     if (unlikely(expand_stack(vma, address))) {

 162:         bad_area(regs, error_code, address);

 163:         return;

 164:     }

 165:  

 166:     /*

 167:      * Ok, we have a good vm_area for this memory access, so

 168:      * we can handle it..

 169:      */

 170: good_area:

 171:     if (unlikely(access_error(error_code, vma))) {

 172:         bad_area_access_error(regs, error_code, address);

 173:         return;

 174:     }

 175:  

 176:     /*

 177:      * If for any reason at all we couldn't handle the fault,

 178:      * make sure we exit gracefully rather than endlessly redo

 179:      * the fault:

 180:      */

 181:     fault = handle_mm_fault(mm, vma, address, flags);

 182:  

 183:     if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {

 184:         if (mm_fault_error(regs, error_code, address, fault))

 185:             return;

 186:     }

 187:  

 188:     /*

 189:      * Major/minor page fault accounting is only done on the

 190:      * initial attempt. If we go through a retry, it is extremely

 191:      * likely that the page will be found in page cache at that point.

 192:      */

 193:     if (flags & FAULT_FLAG_ALLOW_RETRY) {

 194:         if (fault & VM_FAULT_MAJOR) {

 195:             tsk->maj_flt++;

 196:             perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,

 197:                       regs, address);

 198:         } else {

 199:             tsk->min_flt++;

 200:             perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,

 201:                       regs, address);

 202:         }

 203:         if (fault & VM_FAULT_RETRY) {

 204:             /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk

 205:              * of starvation. */

 206:             flags &= ~FAULT_FLAG_ALLOW_RETRY;

 207:             goto retry;

 208:         }

 209:     }

 210:  

 211:     check_v8086_mode(regs, address, tsk);

 212:  

 213:     up_read(&mm->mmap_sem);

 214: }
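
The PF_* flags tested throughout the listing are bits of the error code that the x86 CPU pushes when a page fault occurs. The following is a minimal user-space sketch of how those bits can be decoded. The enum values follow the x86 error-code layout; `struct fault_info`, `decode_fault` and `is_kernel_not_present_fault` are hypothetical names introduced for illustration, not kernel functions.

```c
#include <assert.h>
#include <stdio.h>

/* Bit layout of the x86 page-fault error code (names as used in the listing). */
enum pf_error_code {
    PF_PROT  = 1 << 0,  /* 0: page not present, 1: protection violation */
    PF_WRITE = 1 << 1,  /* 0: read access,      1: write access */
    PF_USER  = 1 << 2,  /* 0: kernel mode,      1: user mode */
    PF_RSVD  = 1 << 3,  /* a reserved bit was set in a page-table entry */
    PF_INSTR = 1 << 4   /* the fault was an instruction fetch */
};

/* Hypothetical helper type: one field per error-code bit. */
struct fault_info {
    int protection;
    int write;
    int user;
    int reserved;
    int instruction;
};

struct fault_info decode_fault(unsigned long error_code)
{
    struct fault_info f;

    f.protection  = (error_code & PF_PROT)  != 0;
    f.write       = (error_code & PF_WRITE) != 0;
    f.user        = (error_code & PF_USER)  != 0;
    f.reserved    = (error_code & PF_RSVD)  != 0;
    f.instruction = (error_code & PF_INSTR) != 0;
    return f;
}

/*
 * Same test as the kernel-space branch of the listing: the comment's
 * "(error_code & 4) == 0" and "(error_code & 9) == 0" are exactly
 * PF_USER and PF_RSVD | PF_PROT.
 */
int is_kernel_not_present_fault(unsigned long error_code)
{
    assert(PF_USER == 4 && (PF_RSVD | PF_PROT) == 9);
    return !(error_code & (PF_RSVD | PF_USER | PF_PROT));
}

void print_fault(unsigned long error_code)
{
    struct fault_info f = decode_fault(error_code);

    printf("%s, %s, %s mode\n",
           f.protection ? "protection violation" : "page not present",
           f.write ? "write" : "read",
           f.user ? "user" : "kernel");
}
```

For example, `print_fault(0x6)` describes a user-mode write to a page that is not present, which is the typical demand-paging case.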

The linear address that caused the page fault is saved in the CR2 register:

    /* Get the faulting address: */
    address = read_cr2();
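
When the fault cannot be resolved, the same address is what a user process sees in `si_addr` of the resulting SIGSEGV. The sketch below, assuming Linux on x86 with glibc, touches a page mapped with PROT_NONE (a valid region with no access rights, so the fault ends in the access-error path) and checks the reported address. `fault_address_matches` and the handler are hypothetical names for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;
static void *volatile reported_addr;
static volatile sig_atomic_t reported_code;

/* Record what the kernel reported, then jump back past the faulting store. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    reported_addr = info->si_addr;
    reported_code = info->si_code;
    siglongjmp(recover_point, 1);
}

/* Returns 1 if SIGSEGV reported the address we touched, 0 if not, -1 on setup error. */
int fault_address_matches(void)
{
    struct sigaction sa, old;
    long page = sysconf(_SC_PAGESIZE);
    volatile char *target;
    char *p;

    assert(page > 0);
    p = mmap(NULL, (size_t)page, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    target = p + 123;
    if (sigsetjmp(recover_point, 1) == 0)
        *target = 1;            /* faults: the region exists but forbids access */

    sigaction(SIGSEGV, &old, NULL);
    munmap(p, (size_t)page);
    return reported_addr == (void *)target && reported_code == SEGV_ACCERR;
}
```

The `SEGV_ACCERR` code distinguishes this case from an access outside every region, which is reported as `SEGV_MAPERR`.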

Demand Paging

/*
 * By the time we get here, we already hold the mm semaphore
 */
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, unsigned int flags)
{
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;

    __set_current_state(TASK_RUNNING);

    count_vm_event(PGFAULT);
    mem_cgroup_count_vm_event(mm, PGFAULT);

    /* do counter updates before entering really critical section. */
    check_sync_rss_stat(current);

    if (unlikely(is_vm_hugetlb_page(vma)))
        return hugetlb_fault(mm, vma, address, flags);

retry:
    pgd = pgd_offset(mm, address);
    pud = pud_alloc(mm, pgd, address);
    if (!pud)
        return VM_FAULT_OOM;
    pmd = pmd_alloc(mm, pud, address);
    if (!pmd)
        return VM_FAULT_OOM;
    if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
        if (!vma->vm_ops)
            return do_huge_pmd_anonymous_page(mm, vma, address,
                              pmd, flags);
    } else {
        pmd_t orig_pmd = *pmd;
        int ret;

        barrier();
        if (pmd_trans_huge(orig_pmd)) {
            if (flags & FAULT_FLAG_WRITE &&
                !pmd_write(orig_pmd) &&
                !pmd_trans_splitting(orig_pmd)) {
                ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
                              orig_pmd);
                /*
                 * If COW results in an oom, the huge pmd will
                 * have been split, so retry the fault on the
                 * pte for a smaller charge.
                 */
                if (unlikely(ret & VM_FAULT_OOM))
                    goto retry;
                return ret;
            }
            return 0;
        }
    }

    /*
     * Use __pte_alloc instead of pte_alloc_map, because we can't
     * run pte_offset_map on the pmd, if an huge pmd could
     * materialize from under us from a different thread.
     */
    if (unlikely(pmd_none(*pmd)) && __pte_alloc(mm, vma, pmd, address))
        return VM_FAULT_OOM;
    /* if an huge pmd materialized from under us just retry later */
    if (unlikely(pmd_trans_huge(*pmd)))
        return 0;
    /*
     * A regular pmd is established and it can't morph into a huge pmd
     * from under us anymore at this point because we hold the mmap_sem
     * read mode and khugepaged takes it in write mode. So now it's
     * safe to run pte_offset_map().
     */
    pte = pte_offset_map(pmd, address);

    return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
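
handle_mm_fault walks (and, where needed, allocates) one table per level: pgd, pud, pmd and finally pte. Which entry is used at each level is determined purely by bit fields of the faulting address. The sketch below shows that arithmetic under the assumption of x86-64 with 4-level paging and 4 KiB pages (9 index bits per level, 12 offset bits); other architectures and configurations use different shifts. `struct pt_indices` and `split_address` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: x86-64, 4-level paging, 4 KiB pages. */
#define PAGE_SHIFT   12
#define PMD_SHIFT    21
#define PUD_SHIFT    30
#define PGDIR_SHIFT  39
#define INDEX_MASK   0x1ffu     /* 512 entries per table */

struct pt_indices {
    unsigned pgd, pud, pmd, pte;  /* entry index within each table level */
    unsigned offset;              /* byte offset inside the 4 KiB page */
};

struct pt_indices split_address(uint64_t addr)
{
    struct pt_indices ix;

    /* Only 48-bit canonical addresses fit this 4-level layout. */
    assert((addr >> 47) == 0 || (addr >> 47) == 0x1ffff);

    ix.pgd    = (unsigned)((addr >> PGDIR_SHIFT) & INDEX_MASK);
    ix.pud    = (unsigned)((addr >> PUD_SHIFT)   & INDEX_MASK);
    ix.pmd    = (unsigned)((addr >> PMD_SHIFT)   & INDEX_MASK);
    ix.pte    = (unsigned)((addr >> PAGE_SHIFT)  & INDEX_MASK);
    ix.offset = (unsigned)(addr & ((1u << PAGE_SHIFT) - 1));
    return ix;
}

void print_indices(uint64_t addr)
{
    struct pt_indices ix = split_address(addr);

    printf("pgd=%u pud=%u pmd=%u pte=%u offset=0x%x\n",
           ix.pgd, ix.pud, ix.pmd, ix.pte, ix.offset);
}
```

Two addresses that differ only in the low 12 bits select the same entries at every level, which is why a single fault populates a whole page.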

/*
 * These routines also need to handle stuff like marking pages dirty
 * and/or accessed for architectures that don't do it in hardware (most
 * RISC architectures).  The early dirtying is also good on the i386.
 *
 * There is also a hook called "update_mmu_cache()" that architectures
 * with external mmu caches can use to update those (ie the Sparc or
 * PowerPC hashed page tables that act as extended TLBs).
 *
 * We enter with non-exclusive mmap_sem (to exclude vma changes,
 * but allow concurrent faults), and pte mapped but not yet locked.
 * We return with mmap_sem still held, but pte unmapped and unlocked.
 */
int handle_pte_fault(struct mm_struct *mm,
             struct vm_area_struct *vma, unsigned long address,
             pte_t *pte, pmd_t *pmd, unsigned int flags)
{
    pte_t entry;
    spinlock_t *ptl;

    entry = *pte;
    if (!pte_present(entry)) {
        if (pte_none(entry)) {
            if (vma->vm_ops) {
                if (likely(vma->vm_ops->fault))
                    return do_linear_fault(mm, vma, address,
                        pte, pmd, flags, entry);
            }
            return do_anonymous_page(mm, vma, address,
                         pte, pmd, flags);
        }
        if (pte_file(entry))
            return do_nonlinear_fault(mm, vma, address,
                    pte, pmd, flags, entry);
        return do_swap_page(mm, vma, address,
                    pte, pmd, flags, entry);
    }

    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);
    if (unlikely(!pte_same(*pte, entry)))
        goto unlock;
    if (flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address,
                    pte, pmd, ptl, entry);
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
        update_mmu_cache(vma, address, pte);
    } else {
        /*
         * This is needed only for protection faults but the arch code
         * is not yet telling us if this is a protection fault or not.
         * This still avoids useless tlb flushes for .text page faults
         * with threads.
         */
        if (flags & FAULT_FLAG_WRITE)
            flush_tlb_fix_spurious_fault(vma, address);
    }
unlock:
    pte_unmap_unlock(pte, ptl);
    return 0;
}

entry = *pte;
if (!pte_present(entry)) {
    if (pte_none(entry)) {
        if (vma->vm_ops) {
            if (likely(vma->vm_ops->fault))
                return do_linear_fault(mm, vma, address,
                    pte, pmd, flags, entry);
        }
        return do_anonymous_page(mm, vma, address,
                     pte, pmd, flags);
    }
    if (pte_file(entry))
        return do_nonlinear_fault(mm, vma, address,
                pte, pmd, flags, entry);
    return do_swap_page(mm, vma, address,
                pte, pmd, flags, entry);
}

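If the PTE is empty (pte_none), the page has never been accessed; this is the demand-paging case proper. A VMA that provides a fault operation (typically a file mapping) is handled by do_linear_fault; otherwise do_anonymous_page supplies a zero-filled page.

If the PTE is not present but marks a nonlinear file mapping (pte_file), do_nonlinear_fault handles it. (To be discussed!)

Otherwise the page has been swapped out, and do_swap_page brings it back in from swap.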
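
Demand paging of anonymous memory can be observed from user space. The sketch below, assuming Linux with glibc, maps one anonymous page, asks `mincore` whether it is resident before and after the first access, and checks that the first read returns zero. `demand_paging_demo` and `page_resident` are hypothetical names for illustration; the exact residency results may differ in unusual environments such as some sandboxes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Ask the kernel whether the page at addr is resident: 1, 0, or -1 on error. */
static int page_resident(void *addr, size_t page_size)
{
    unsigned char vec = 0;

    if (mincore(addr, page_size, &vec) != 0)
        return -1;
    return vec & 1;
}

/*
 * Map one anonymous page and report:
 *   *before      residency right after mmap (the range is only reserved)
 *   *zero_filled whether the first read saw a zero byte
 *   *after       residency after the first write
 * Returns 0 on success, -1 on error.
 */
int demand_paging_demo(int *before, int *after, int *zero_filled)
{
    long page_size = sysconf(_SC_PAGESIZE);
    unsigned char *p;

    assert(before && after && zero_filled);
    if (page_size <= 0)
        return -1;

    p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = page_resident(p, (size_t)page_size);
    *zero_filled = (*(volatile unsigned char *)p == 0);  /* first access faults */
    *(volatile unsigned char *)p = 42;                   /* page now private and dirty */
    *after = page_resident(p, (size_t)page_size);

    munmap(p, (size_t)page_size);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux system the page is reported as not resident until it is touched, which matches the "reserved but not yet backed" behaviour described earlier.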

Copy on Write

Modern Unix kernels, including Linux, follow a more efficient approach called Copy On Write(COW).
The idea is quite simple: instead of duplicating page frames, they are shared between the parent
and the child process. However, as long as they are shared, they cannot be modified. Whenever the
parent or the child process attempts to write into a shared page frame, an exception occurs. At this
point, the kernel duplicates the page into a new page frame that it marks as writable. The original
page frame remains write-protected: when the other process tries to write into it, the kernel checks
whether the writing process is the only owner of the page frame; in such a case, it makes the page
frame writable for the process.

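[Memory handling is driven almost entirely by exception handlers:

Page Fault: prepares the required page on demand;

Copy-on-Write: parent and child share pages that are marked read-only; only an attempted write triggers the COW mechanism and creates a separate page.]

[The design of the Linux kernel closely resembles object-oriented design, except that it is driven by exceptions: the exception handler inspects the situation (usually a set of flags) and calls the matching implementation.]

[In detail, function-pointer members of a struct closely resemble a C++ vtable and can be used to implement polymorphism.]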
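
The vtable analogy can be made concrete with a small sketch. It mimics how handle_pte_fault dispatches on `vma->vm_ops->fault`, but every name in it (`struct region`, `struct region_ops`, `handle_fault` and so on) is a hypothetical stand-in, not a kernel type.

```c
#include <assert.h>
#include <stddef.h>

struct region;

/* Table of operations, playing the role of a C++ vtable. */
struct region_ops {
    int (*fault)(struct region *r, unsigned long addr);
};

/* Hypothetical stand-in for a memory region descriptor. */
struct region {
    unsigned long start, end;
    const struct region_ops *ops;   /* NULL for anonymous regions */
    int faults_handled;
};

enum { FAULT_ANON = 1, FAULT_FILE = 2 };

/* "Override" used by file-backed regions. */
static int file_fault(struct region *r, unsigned long addr)
{
    (void)addr;
    r->faults_handled++;
    return FAULT_FILE;
}

/* Default behaviour when a region supplies no fault operation. */
static int anonymous_fault(struct region *r, unsigned long addr)
{
    (void)addr;
    r->faults_handled++;
    return FAULT_ANON;
}

const struct region_ops file_ops = { .fault = file_fault };

/* The caller does not know the concrete kind of region; the pointer decides. */
int handle_fault(struct region *r, unsigned long addr)
{
    assert(r != NULL && addr >= r->start && addr < r->end);
    if (r->ops && r->ops->fault)
        return r->ops->fault(r, addr);
    return anonymous_fault(r, addr);
}
```

A region initialised with `&file_ops` takes the file path, and one whose `ops` is NULL falls back to the anonymous path, without `handle_fault` containing any type-specific code.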

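When a child process is created, the pages of parent and child are pointed at the same page frames (struct page), and those frames are marked write-protected, so any write to such a page frame raises an exception.

Once the kernel catches the exception, it checks whether the page frame is still shared by several processes. If so, it copies the page, marks the copy writable, and hands it to the process that attempted the write. If the requesting process is the only one still using the page frame, the kernel simply marks it writable and lets the write proceed.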
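
The visible effect of copy-on-write can be sketched with `fork`. In the program below, assuming Linux with glibc, the child writes to a private page inherited from the parent and the parent's copy stays unchanged. User space can only observe this isolation, not the sharing of the page frame before the write. `cow_demo` is a hypothetical name for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the parent's view of the byte after the child has written to it,
 * stores the child's view in *child_view, or returns -1 on error.
 */
int cow_demo(int *child_view)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *p;
    int status = 0;
    int parent_view;
    pid_t pid;

    assert(child_view != NULL);
    if (page <= 0)
        return -1;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 1;   /* fault the page in, so fork has a frame to share */

    pid = fork();
    if (pid < 0) {
        munmap((void *)p, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        p[0] = 2;       /* write to a write-protected shared frame: child gets a copy */
        _exit(p[0]);    /* report the child's view through the exit status */
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        munmap((void *)p, (size_t)page);
        return -1;
    }
    *child_view = WEXITSTATUS(status);
    parent_view = p[0];  /* still the value written before fork */

    munmap((void *)p, (size_t)page);
    return parent_view;
}
```

The parent reads back 1 while the child saw 2, so the write in the child went to a separate page.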

brk: Adjusting the Process Heap Size

SYSCALL_DEFINE1(brk, unsigned long, brk)
{
    unsigned long rlim, retval;
    unsigned long newbrk, oldbrk;
    struct mm_struct *mm = current->mm;
    unsigned long min_brk;

    down_write(&mm->mmap_sem);

#ifdef CONFIG_COMPAT_BRK
    /*
     * CONFIG_COMPAT_BRK can still be overridden by setting
     * randomize_va_space to 2, which will still cause mm->start_brk
     * to be arbitrarily shifted
     */
    if (current->brk_randomized)
        min_brk = mm->start_brk;
    else
        min_brk = mm->end_data;
#else
    min_brk = mm->start_brk;
#endif
    if (brk < min_brk)
        goto out;

    /*
     * Check against rlimit here. If this check is done later after the test
     * of oldbrk with newbrk then it can escape the test and let the data
     * segment grow beyond its set limit the in case where the limit is
     * not page aligned -Ram Gupta
     */
    rlim = rlimit(RLIMIT_DATA);
    if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
            (mm->end_data - mm->start_data) > rlim)
        goto out;

    newbrk = PAGE_ALIGN(brk);
    oldbrk = PAGE_ALIGN(mm->brk);
    if (oldbrk == newbrk)
        goto set_brk;

    /* Always allow shrinking brk. */
    if (brk <= mm->brk) {
        if (!do_munmap(mm, newbrk, oldbrk-newbrk))
            goto set_brk;
        goto out;
    }

    /* Check against existing mmap mappings. */
    if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
        goto out;

    /* Ok, looks good - let it rip. */
    if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
        goto out;
set_brk:
    mm->brk = brk;
out:
    retval = mm->brk;
    up_write(&mm->mmap_sem);
    return retval;
}
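
From user space, the program break is usually moved through the C library's `sbrk` wrapper rather than by invoking the system call directly. The sketch below, assuming Linux with glibc and a single-threaded caller that does not allocate in between, grows the heap by one page, touches the new memory, and shrinks it again. `brk_demo` is a hypothetical name for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Grow the program break by one page, use it, shrink back. 0 on success, -1 on failure. */
int brk_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *old_brk, *new_brk;

    assert(page > 0);

    old_brk = sbrk(0);                  /* increment 0: just query the current break */
    if (old_brk == (void *)-1)
        return -1;

    if (sbrk((intptr_t)page) != (void *)old_brk)   /* sbrk returns the previous break */
        return -1;

    new_brk = sbrk(0);
    if (new_brk != old_brk + page)
        return -1;

    /* The range [old_brk, new_brk) is now part of the heap region. */
    old_brk[0] = 'x';
    old_brk[page - 1] = 'y';

    if (sbrk(-(intptr_t)page) == (void *)-1)       /* shrinking is always allowed */
        return -1;
    return sbrk(0) == (void *)old_brk ? 0 : -1;
}
```

As in the kernel code, growing the break only extends the heap region; the new page is backed by physical memory when it is first touched.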
