Linux Kernel Memory Management: Memory Access and Page Faults [Repost]

Reposted from: https://yq.aliyun.com/articles/5865

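This article briefly describes how, on 32-bit x86, the linear address spaces of Linux user processes and kernel threads relate to physical memory, and analyzes why high memory was introduced and how a page fault is handled. It first walks through the execution of a user-mode process, then contrasts it with kernel threads, introduces the concept of high memory, and finally analyzes the page fault path.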

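User processes
By the time a user-mode process created by fork starts running, the kernel has already set up the data structures it needs, such as task_struct, thread_info and mm_struct, and has mapped the address regions of the compiled and linked executable onto the memory regions recorded in the process structures. When execution begins, the process accesses user addresses that have no page-table mapping yet, which raises a page fault. The kernel's fault handler allocates a page and associates it with the faulting user address. It then checks whether the relevant part of the program file is already held in a buffer, since another process running the same program file may have read it in already. If not, that part of the program is read from disk into the buffer; because a buffer head is normally associated with the allocated page, the data actually lands in the physical memory that the page represents. Control then returns to user mode at the faulting address: this time the MMU translates the user address to the page and its physical address, and the instruction is fetched and executed. During this process, if memory is exhausted or the access violates permissions, the user process may receive -ENOMEM or a segmentation fault.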
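The demand-paging behavior described above can be observed from user space. The following is a minimal sketch, assuming a Linux system: it maps anonymous memory, touches each page once, and compares the process's minor fault counter before and after. The helper names `minor_faults` and `faults_from_touching` are made up for this illustration, and the example uses anonymous memory rather than a file-backed program image, so it shows only the "fault, allocate a page, retry" part of the story.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Number of minor page faults this process has taken so far. */
static long minor_faults(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return ru.ru_minflt;
}

/*
 * Map `pages` anonymous pages, write to each one once, and return how many
 * minor faults the writes caused. Returns -1 if the mapping fails.
 * (Hypothetical helper, not part of any library.)
 */
static long faults_from_touching(size_t pages)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    size_t len = pages * (size_t)page_size;

    /* mmap only records the region; no physical page is allocated yet. */
    unsigned char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    volatile unsigned char *p = mem;
    long before = minor_faults();
    for (size_t i = 0; i < pages; i++)
        p[i * (size_t)page_size] = 1;   /* first write faults the page in */
    long after = minor_faults();

    munmap(mem, len);
    return after - before;
}
```

Calling `faults_from_touching(64)` from a small `main` and printing the result usually reports roughly one fault per page, although the exact number depends on kernel configuration (for example transparent huge pages).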
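Kernel threads
A kernel thread has no mm structure of its own; all kernel threads share one kernel address space and one set of kernel page tables. To make system calls and similar transitions convenient, every user process reserves the top 1G of its linear address space for the kernel and leaves the low 3G to user mode. This kernel range occupies the same linear addresses in every user process, and the MMU translates it to the same physical addresses, namely the first 1G of physical memory (strictly speaking, the physical addresses behind parts of the last 128M may change). Kernel threads also access memory through the MMU, so they borrow the page tables of a user process. The kernel does have its own page tables, but it does not use them directly (perhaps to avoid the cost of switching page tables between user mode and kernel mode?). Instead, the top 1G of every process page table shares the kernel page-table mappings, so an access to the top 1G of linear addresses reaches the low 1G of physical memory. Moreover, from the point of view of a user process the kernel address space is only the 3G-4G range. The kernel cannot directly use the 0-3G linear range, because that range belongs to the user process. For one thing, reading and writing it directly could corrupt the process's data. For another, different user processes map the same linear addresses to different physical memory, so an address that a kernel thread resolves through one process's page tables may map to a different physical address a moment later, through the page tables of the next process.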
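The fixed relationship between the top 1G of linear addresses and the low physical addresses is plain offset arithmetic. The sketch below assumes the default 3G/1G split with the kernel starting at 0xC0000000; the function names are illustrative only, although they mirror what the kernel's `__pa()` and `__va()` macros compute for directly mapped memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed 3G/1G split: the kernel part of every address space starts here. */
#define PAGE_OFFSET 0xC0000000u

/* True if the linear address belongs to the kernel's 3G-4G range. */
static bool is_kernel_address(uint32_t vaddr)
{
    return vaddr >= PAGE_OFFSET;
}

/* Physical address behind a directly mapped kernel linear address. */
static uint32_t lowmem_virt_to_phys(uint32_t vaddr)
{
    /* User addresses have no fixed translation; it depends on the process. */
    assert(is_kernel_address(vaddr));
    return vaddr - PAGE_OFFSET;
}

/* Kernel linear address at which a low physical address is visible. */
static uint32_t lowmem_phys_to_virt(uint32_t paddr)
{
    return paddr + PAGE_OFFSET;
}
```

Because the offset is the same in every process page table, the result does not depend on which process the kernel thread happens to be borrowing page tables from. That is exactly the property the 0-3G range lacks.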
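High memory
How, then, can the kernel reach physical memory above 1G? This is where high memory comes in. The basic idea is to take part of the kernel's 1G of linear address space (3G-4G from a user process's point of view, 0-1G from a kernel thread's point of view) and, instead of mapping it permanently, reuse it to map the physical memory that the permanent mapping cannot reach. On 32-bit x86, Linux therefore splits the 3G-4G linear range into a 0-896M part and an 896M-1G part. The first part is mapped permanently: when the kernel accesses linear addresses from 3G to 3G+896M through a process page table, no page fault occurs. An access above 3G+896M, however, may fault because the kernel page tables have been updated while the process page tables have not yet been synchronized with them. The resulting kernel-space page fault synchronizes the kernel page tables into the current process's page tables. Note that vmalloc may already have set up the kernel page tables at allocation time; the synchronization with the current page tables is triggered only later, when a kernel address is accessed through a process page table and faults.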
Linear address space and physical address space on 32-bit x86 Linux
(Figure from Understanding the Linux Virtual Memory Manager)
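The 896M boundary discussed above can be made concrete with a small classification sketch. It assumes the default 3G/1G split and a machine with at least 896M of RAM, and it deliberately simplifies the top 128M into a single "dynamic" region (in a real kernel that range holds the vmalloc area, kmap slots and fixed mappings, separated by guard gaps). The enum and function names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u                 /* start of kernel range (3G) */
#define LOWMEM_SIZE (896u * 1024u * 1024u)      /* permanently mapped memory */
#define LOWMEM_END  (PAGE_OFFSET + LOWMEM_SIZE) /* 3G + 896M = 0xF8000000 */

enum region {
    REGION_USER,        /* 0 .. 3G: belongs to the current user process */
    REGION_DIRECT_MAP,  /* 3G .. 3G+896M: permanent linear mapping */
    REGION_DYNAMIC      /* 3G+896M .. 4G: reused for temporary mappings */
};

/* Which part of the 4G linear address space does vaddr fall into? */
static enum region classify_linear(uint32_t vaddr)
{
    if (vaddr < PAGE_OFFSET)
        return REGION_USER;
    if (vaddr < LOWMEM_END)
        return REGION_DIRECT_MAP;
    return REGION_DYNAMIC;
}

/* Physical memory beyond the permanent mapping counts as high memory. */
static bool phys_is_highmem(uint64_t paddr)
{
    return paddr >= LOWMEM_SIZE;
}
```

Only accesses that land in `REGION_DYNAMIC` can hit the "kernel page table updated, process page table not yet synchronized" situation described in the text.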
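Page faults
A page fault is handled as follows. The faulting linear address may be accessed from either user context or kernel context, although some of the combinations have little practical meaning.
- If the faulting address lies in the kernel linear address space
  - If it is in the vmalloc area, synchronize the kernel page table into the user process page table; otherwise the kernel dies. Note that the context is not distinguished here
- If the fault occurs in interrupt context or with !mm, check the exception table; if there is no matching entry, the kernel dies.
- If the faulting address lies in the user process linear address space
  - If in kernel context, look up the exception table; if there is no matching entry, the kernel dies. This case has little practical significance
  - If in user process context
    - Look up the vma; if found, first check whether the stack needs to be expanded, otherwise enter the normal handling path
    - Look up the vma; if not found, it is a bad area, which usually results in a segmentation fault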
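The decision list above can be written down as a small user-space model. This is a sketch of the list as stated, not of every branch in the real handler; the struct, enum and function names are invented for the illustration, and each boolean stands in for a check the kernel performs on real state.

```c
#include <assert.h>
#include <stdbool.h>

enum outcome {
    SYNC_KERNEL_PGTABLE,  /* vmalloc fault: copy kernel page-table entries */
    KERNEL_DIES,          /* unrecoverable fault on a kernel address */
    TRY_EXCEPTION_TABLE,  /* fix up if an entry exists, otherwise die */
    EXPAND_STACK,         /* grow the stack vma, then handle normally */
    HANDLE_MM_FAULT,      /* normal handling path */
    SEGFAULT              /* bad area: signal the user process */
};

/* Hypothetical summary of the facts the handler inspects. */
struct fault_info {
    bool addr_in_kernel_space;
    bool addr_in_vmalloc_area;
    bool atomic_or_no_mm;        /* interrupt context or !mm */
    bool from_user_mode;
    bool vma_found;
    bool needs_stack_expansion;
};

static enum outcome classify_fault(const struct fault_info *f)
{
    /* Kernel addresses are handled first, whatever the context. */
    if (f->addr_in_kernel_space)
        return f->addr_in_vmalloc_area ? SYNC_KERNEL_PGTABLE : KERNEL_DIES;

    if (f->atomic_or_no_mm)
        return TRY_EXCEPTION_TABLE;

    /* User address touched from kernel context. */
    if (!f->from_user_mode)
        return TRY_EXCEPTION_TABLE;

    /* User address touched from user context. */
    if (!f->vma_found)
        return SEGFAULT;
    return f->needs_stack_expansion ? EXPAND_STACK : HANDLE_MM_FAULT;
}
```

Ordering matters in this model just as in the list: the kernel-address test comes before the context tests, which is why the vmalloc case does not distinguish between contexts.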
The detailed page fault flowchart and code are shown below:
(Figure from Understanding the Linux Virtual Memory Manager)

(Linux 3.19.3, arch/x86/mm/fault.c, line 1044)

/*
 * This routine handles page faults.  It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 *
 * This function must have noinline because both callers
 * {,trace_}do_page_fault() have notrace on. Having this an actual function
 * guarantees there's a function trace entry.
 */
// Handle a page fault.
// Parameters: saved registers, error code, faulting address
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
        unsigned long address)
{
    struct vm_area_struct *vma;
    struct task_struct *tsk;
    struct mm_struct *mm;
    int fault, major = 0;
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

    tsk = current;
    mm = tsk->mm;

    /*
     * Detect and handle instructions that would cause a page fault for
     * both a tracked kernel page and a userspace page.
     */
    if (kmemcheck_active(regs))
        kmemcheck_hide(regs);
    prefetchw(&mm->mmap_sem);

    if (unlikely(kmmio_fault(regs, address)))
        return;

    /*
     * We fault-in kernel-space virtual memory on-demand. The
     * 'reference' page table is init_mm.pgd.
     *
     * NOTE! We MUST NOT take any locks for this case. We may
     * be in an interrupt or a critical region, and should
     * only copy the information from the master page table,
     * nothing more.
     *
     * This verifies that the fault happens in kernel space
     * (error_code & 4) == 0, and that the fault was not a
     * protection error (error_code & 9) == 0.
     */
    // The faulting address lies in kernel space
    if (unlikely(fault_in_kernel_space(address))) {
        if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) { // kernel mode, not a protection or reserved-bit error
            if (vmalloc_fault(address) >= 0) // in the vmalloc area: vmalloc_sync_one syncs the kernel page table into the process page table
                return;

            if (kmemcheck_fault(regs, address, error_code))
                return;
        }

        /* Can handle a stale RO->RW TLB: */
        if (spurious_fault(error_code, address))
            return;

        /* kprobes don't want to hook the spurious faults: */
        if (kprobes_fault(regs))
            return;
        /*
         * Don't take the mm semaphore here. If we fixup a prefetch
         * fault we could otherwise deadlock:
         */
        bad_area_nosemaphore(regs, error_code, address);

        return;
    }

    /* kprobes don't want to hook the spurious faults: */
    if (unlikely(kprobes_fault(regs)))
        return;

    if (unlikely(error_code & PF_RSVD))
        pgtable_bad(regs, error_code, address);

    if (unlikely(smap_violation(error_code, regs))) {
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }

    /*
     * If we're in an interrupt, have no user context or are running
     * in an atomic region then we must not take the fault:
     */
    // In interrupt (atomic) context or with no mm: error
    if (unlikely(in_atomic() || !mm)) {
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }

    /*
     * It's safe to allow irq's after cr2 has been saved and the
     * vmalloc fault has been handled.
     *
     * User-mode registers count as a user access even for any
     * potential system fault or CPU buglet:
     */
    if (user_mode_vm(regs)) {
        local_irq_enable();
        error_code |= PF_USER;
        flags |= FAULT_FLAG_USER;
    } else {
        if (regs->flags & X86_EFLAGS_IF)
            local_irq_enable();
    }

    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

    if (error_code & PF_WRITE)
        flags |= FAULT_FLAG_WRITE;

    /*
     * When running in the kernel we expect faults to occur only to
     * addresses in user space.  All other faults represent errors in
     * the kernel and should generate an OOPS.  Unfortunately, in the
     * case of an erroneous fault occurring in a code path which already
     * holds mmap_sem we will deadlock attempting to validate the fault
     * against the address space.  Luckily the kernel only validly
     * references user space from well defined areas of code, which are
     * listed in the exceptions table.
     *
     * As the vast majority of faults will be valid we will only perform
     * the source reference check when there is a possibility of a
     * deadlock. Attempt to lock the address space, if we cannot we then
     * validate the source. If this is invalid we can skip the address
     * space check, thus avoiding the deadlock:
     */
    if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
        if ((error_code & PF_USER) == 0 &&
            !search_exception_tables(regs->ip)) {
            bad_area_nosemaphore(regs, error_code, address);
            return;
        }
retry:
        down_read(&mm->mmap_sem);
    } else {
        /*
         * The above down_read_trylock() might have succeeded in
         * which case we'll have missed the might_sleep() from
         * down_read():
         */
        might_sleep();
    }

    // The faulting address lies in user space
    // Look up the vma
    vma = find_vma(mm, address);
    // Not found: error
    if (unlikely(!vma)) {
        bad_area(regs, error_code, address);
        return;
    }
    // Check that the address really falls inside the vma
    if (likely(vma->vm_start <= address))
        goto good_area;
    if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
        bad_area(regs, error_code, address);
        return;
    }
    // The fault came from user mode
    if (error_code & PF_USER) {
        /*
         * Accessing the stack below %sp is always a bug.
         * The large cushion allows instructions like enter
         * and pusha to work. ("enter $65535, $31" pushes
         * 32 pointers and then decrements %sp by 65535.)
         */
        if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
            bad_area(regs, error_code, address);
            return;
        }
    }
    // Expand the stack
    if (unlikely(expand_stack(vma, address))) {
        bad_area(regs, error_code, address);
        return;
    }

    /*
     * Ok, we have a good vm_area for this memory access, so
     * we can handle it..
     */
    // The vma is valid
good_area:
    if (unlikely(access_error(error_code, vma))) {
        bad_area_access_error(regs, error_code, address);
        return;
    }

    /*
     * If for any reason at all we couldn't handle the fault,
     * make sure we exit gracefully rather than endlessly redo
     * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
     * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
     */
    // Call the generic page fault handler
    fault = handle_mm_fault(mm, vma, address, flags);
    major |= fault & VM_FAULT_MAJOR;

    /*
     * If we need to retry the mmap_sem has already been released,
     * and if there is a fatal signal pending there is no guarantee
     * that we made any progress. Handle this case first.
     */
    if (unlikely(fault & VM_FAULT_RETRY)) {
        /* Retry at most once */
        if (flags & FAULT_FLAG_ALLOW_RETRY) {
            flags &= ~FAULT_FLAG_ALLOW_RETRY;
            flags |= FAULT_FLAG_TRIED;
            if (!fatal_signal_pending(tsk))
                goto retry;
        }

        /* User mode? Just return to handle the fatal exception */
        if (flags & FAULT_FLAG_USER)
            return;

        /* Not returning to user mode? Handle exceptions or die: */
        no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
        return;
    }

    up_read(&mm->mmap_sem);
    if (unlikely(fault & VM_FAULT_ERROR)) {
        mm_fault_error(regs, error_code, address, fault);
        return;
    }

    /*
     * Major/minor page fault accounting. If any of the events
     * returned VM_FAULT_MAJOR, we account it as a major fault.
     */
    if (major) {
        tsk->maj_flt++;
        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
    } else {
        tsk->min_flt++;
        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
    }

    check_v8086_mode(regs, address, tsk);
}

NOKPROBE_SYMBOL(__do_page_fault);
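Several branches in the listing depend on bits of error_code and on simple address arithmetic. The following user-space sketch mirrors the PF_* bit values used by this kernel version and reproduces three of those checks so they can be experimented with in isolation. The helper function names are invented for this example and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Page fault error code bits, mirroring the kernel's x86 definitions. */
enum pf_error_code {
    PF_PROT  = 1 << 0,  /* 0: page not present, 1: protection violation */
    PF_WRITE = 1 << 1,  /* 0: read access, 1: write access */
    PF_USER  = 1 << 2,  /* 0: kernel mode, 1: user mode */
    PF_RSVD  = 1 << 3,  /* reserved bit set in a page-table entry */
    PF_INSTR = 1 << 4   /* fault was an instruction fetch */
};

/* Same condition the handler tests before trying vmalloc_fault(). */
static bool may_be_vmalloc_fault(unsigned long error_code)
{
    return !(error_code & (PF_RSVD | PF_USER | PF_PROT));
}

/* Same cushion test the handler applies before expanding the stack. */
static bool too_far_below_sp(unsigned long address, unsigned long sp)
{
    return address + 65536 + 32 * sizeof(unsigned long) < sp;
}

/* Write a human-readable description of error_code into buf. */
static int describe_error_code(unsigned long ec, char *buf, size_t len)
{
    return snprintf(buf, len, "%s %s in %s mode%s%s",
                    (ec & PF_PROT) ? "protection violation on"
                                   : "not-present page on",
                    (ec & PF_WRITE) ? "write" : "read",
                    (ec & PF_USER) ? "user" : "kernel",
                    (ec & PF_RSVD) ? ", reserved bit set" : "",
                    (ec & PF_INSTR) ? ", instruction fetch" : "");
}
```

A user-mode write to an unmapped page, for instance, arrives with `PF_WRITE | PF_USER`, which rules out the vmalloc path and sends the fault on to the vma lookup.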
