Linux mem 1.2 User-Space Process Address Space Layout --- mmap() Explained
1. Principles
To enforce privilege separation, Linux divides the address space into a user address space and a kernel address space. Generally the two are comparable in size, each taking about half of the total space (the common 32-bit x86 default is actually a 3G/1G split; the table below uses a symmetric 2G/2G split for simplicity).
For example:
Address mode | Size of each half | User address space | Kernel address space |
---|---|---|---|
32-bit | 2 GB | 0x00000000 - 0x7FFFFFFF | 0x80000000 - 0xFFFFFFFF |
64-bit (48-bit VA) | 128 TB | 0x00000000 00000000 - 0x00007FFF FFFFFFFF | 0xFFFF8000 00000000 - 0xFFFFFFFF FFFFFFFF |
64-bit (57-bit VA) | 64 PB | 0x00000000 00000000 - 0x00FFFFFF FFFFFFFF | 0xFF000000 00000000 - 0xFFFFFFFF FFFFFFFF |
In this article we focus on the user address space.
1.1 Mapping
An address space is essentially the mapping between virtual addresses and physical addresses. In the user address space one more important participant joins the mapping: the file. User-space mappings therefore fall into two classes: file-backed mappings (virtual address, physical address, file) and anonymous mappings (virtual address, physical address).
- 1. File-backed mapping
The detailed relationship among virtual address, physical address, and file is shown in the figure below:
Why drag files into the address-space mapping at all? Because memory is usually a much scarcer resource than file storage, the file is used as a slow backing copy of the memory:
1. At setup time a "lazy" allocation strategy is used: when the virtual address range is allocated, no physical memory is allocated and no mmu mapping is created. Only when the virtual address is actually used does a page fault fire, and the fault handler then allocates the physical memory, reads in the data, and creates the mmu mapping.
2. When the system runs short of memory and needs to reclaim, the physical pages of file-backed mappings can simply be freed, because the file still holds a copy; when the data is really needed again it is reloaded through a page fault.
Why doesn't kernel space use this strategy?
1. The kernel maps very few files (essentially just vmlinux), so it does not cost much memory.
2. Kernel code usually has to respond quickly; page-fault handling would make its latency unpredictable.
3. Kernel code may be running inside all kinds of complicated lock contexts, and handling a page fault there could trigger further exceptions.
- 2. Anonymous mapping
Regions that have no slow backing store can only be mapped anonymously. These are typically data regions, for example global data, the stack, and the heap. The handling strategy for such regions is as follows (a minimal user-space sketch contrasting the two mapping classes comes after this list):
1. Allocation can still be "lazy": the first real use triggers a page fault, and the fault handler allocates zero-filled physical memory and creates the mmu mapping.
2. During memory reclaim the data in these regions cannot simply be dropped; the physical pages can only be freed after their contents have been swapped out to a swap file or zram.
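To make the two classes concrete, here is a minimal user-space sketch. Assumptions: /tmp/example.txt is only a placeholder path, and the file is expected to already exist and be at least one page long.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* File-backed mapping: the page cache backs this memory, so reclaimed
     * pages can be re-read from the file on the next fault. */
    int fd = open("/tmp/example.txt", O_RDWR);   /* placeholder path */
    if (fd < 0)
        return 1;
    char *fmap = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Anonymous mapping: no backing file; pages are zero-filled on first
     * touch and can only be reclaimed by swapping them out. */
    char *amap = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (fmap != MAP_FAILED)
        printf("file-backed byte 0: 0x%02x\n", (unsigned char)fmap[0]); /* fault reads the file */
    if (amap != MAP_FAILED)
        printf("anonymous byte 0 : %d\n", amap[0]);                     /* zero-filled on demand */

    if (fmap != MAP_FAILED)
        munmap(fmap, 4096);
    if (amap != MAP_FAILED)
        munmap(amap, 4096);
    close(fd);
    return 0;
}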
1.2 VMA Management
The user address space is bound to the process, and every process has its own independent one. A process manages its user address space through the mm structure (a kernel thread has no user address space, so its mm is NULL).
active_mm
mm points to the memory descriptor the process owns, while active_mm points to the memory descriptor in use while the process runs. For an ordinary process the two fields are identical. A kernel thread, however, has no memory descriptor, so its mm is always NULL; when a kernel thread gets to run, its active_mm field is initialized to the active_mm of the previously running process (see the schedule() function).
A process uses vmas to manage ranges of virtual addresses. The key members of the vma structure are:
struct vm_area_struct {
unsigned long vm_start; // start virtual address of the region
unsigned long vm_end; // end virtual address (first byte after the region), not the length
unsigned long vm_pgoff; // file offset (in pages) that vm_start maps to
struct file * vm_file; // the mapped file, if this is a file-backed vma
}
The mm structure manages all the vmas in a red-black tree and also strings them together on a linked list:
The common operations on vmas are lookup (find_vma()) and insertion (vma_link()).
find_vma(mm, addr) deserves special attention. From the name you would expect it to return a vma satisfying vma->vm_start <= addr < vma->vm_end, but what it actually returns is the first vma satisfying addr < vma->vm_end. So even when find_vma() returns a vma, addr does not necessarily lie inside it: a very easy mistake to make! A defensive check is sketched below.
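A kernel-style fragment of the defensive check a caller usually needs after find_vma(); this is a sketch, not a complete function, and locking is omitted:

struct vm_area_struct *vma;

vma = find_vma(mm, addr);          /* first vma with addr < vma->vm_end */
if (!vma || addr < vma->vm_start)
    return NULL;                   /* addr lies in a hole, not inside any vma */

/* only here is vma->vm_start <= addr < vma->vm_end guaranteed */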
1.3 mmap
The sections above covered the background; the concrete construction of a user-space mapping starts with mmap(), which is used heavily throughout a process's lifetime.
Normally mmap() only allocates a vma range and records the relationship between the vma and the file; it does not build the real mapping right away. The follow-up work of allocating physical pages, copying file contents and creating the mmu mappings is done later, in the page-fault handler.
1.4 Page Faults
When a user touches an address mapped by the mmap() of the previous section, there is no mmu mapping for it yet, so a page fault occurs.
The fault handler then allocates a physical page, copies in the file contents, and creates the mmu mapping. (A small user-space demo of this lazy behaviour is sketched below.)
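A small user-space demo of the lazy behaviour, counting minor faults while an anonymous mapping is touched for the first time (the exact number can vary, for example with transparent huge pages):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t len = 64 * 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    long before = minor_faults();
    memset(p, 0x5a, len);          /* first touch of each page triggers a fault */
    long after = minor_faults();

    printf("minor faults while touching 64 pages: %ld\n", after - before);
    munmap(p, len);
    return 0;
}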
1.5 layout
Apart from the text and data segments, which are mapped during execve(), everything that follows (loading shared objects and so on) is mapped with mmap(). The starting address of the mmap area is called mm->mmap_base.
Depending on how mmap_base is chosen, the user address space has two layout modes.
- 1. Legacy layout (the mmap area grows from low to high addresses)
In the legacy layout, mmap_base starts at roughly 1/3 of the user address space. The maximum stack size is essentially fixed; what can grow flexibly are the heap and the mmap area, and both grow upward. What is wrong with this layout? The biggest problem is that the heap and the mmap area cannot share space: the heap is limited to roughly 1/3 of task_size and the mmap area to roughly 2/3 of task_size.
- 2. Modern layout (the mmap area grows from high to low addresses)
The modern layout is now the common choice. The heap grows upward and the mmap area grows downward toward it, so the two can share the space in between, which makes much larger allocations possible.
For example (a small user-space sketch for observing this layout follows the listing):
$ cat /proc/3389/maps
00400000-00401000 r-xp 00000000 fd:00 104079935 /home/ipu/hook/build/output/bin/sample-target
00600000-00601000 r--p 00000000 fd:00 104079935 /home/ipu/hook/build/output/bin/sample-target
00601000-00602000 rw-p 00001000 fd:00 104079935 /home/ipu/hook/build/output/bin/sample-target
01b9f000-01bc0000 rw-p 00000000 00:00 0 [heap]
7f5c77ad9000-7f5c77c9c000 r-xp 00000000 fd:00 191477 /usr/lib64/libc-2.17.so
7f5c77c9c000-7f5c77e9c000 ---p 001c3000 fd:00 191477 /usr/lib64/libc-2.17.so
7f5c77e9c000-7f5c77ea0000 r--p 001c3000 fd:00 191477 /usr/lib64/libc-2.17.so
7f5c77ea0000-7f5c77ea2000 rw-p 001c7000 fd:00 191477 /usr/lib64/libc-2.17.so
7f5c77ea2000-7f5c77ea7000 rw-p 00000000 00:00 0
7f5c77ea7000-7f5c77ec9000 r-xp 00000000 fd:00 191470 /usr/lib64/ld-2.17.so
7f5c7809d000-7f5c780a0000 rw-p 00000000 00:00 0
7f5c780c6000-7f5c780c8000 rw-p 00000000 00:00 0
7f5c780c8000-7f5c780c9000 r--p 00021000 fd:00 191470 /usr/lib64/ld-2.17.so
7f5c780c9000-7f5c780ca000 rw-p 00022000 fd:00 191470 /usr/lib64/ld-2.17.so
7f5c780ca000-7f5c780cb000 rw-p 00000000 00:00 0
7ffe4a3ef000-7ffe4a410000 rw-p 00000000 00:00 0 [stack]
7ffe4a418000-7ffe4a41a000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
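A tiny sketch for observing the modern layout from inside a process (addresses change between runs because of ASLR, and small malloc() blocks normally come from the brk heap):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    int on_stack;
    void *on_heap = malloc(16);
    void *mapped  = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* With the modern layout the mmap area sits just below the stack and
     * grows downward, while the heap sits above the executable and grows
     * upward. Compare the output with /proc/self/maps. */
    printf("stack variable : %p\n", (void *)&on_stack);
    printf("mmap'ed region : %p\n", mapped);
    printf("heap block     : %p\n", on_heap);

    munmap(mapped, 4096);
    free(on_heap);
    return 0;
}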
1.6 brk
As mentioned earlier, the heap is a special anonymous mmap region, and a dedicated system call, brk(), is responsible for resizing it.
brk() works on the same principle as mmap(); it is effectively a special case of mmap() that adjusts the heap's end pointer, mm->brk (see the sketch below).
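A minimal sketch of moving the program break directly with sbrk(), which wraps brk(); in real programs glibc's malloc manages the break, so this is for illustration only:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* sbrk(0) returns the current program break, i.e. mm->brk. */
    void *start = sbrk(0);

    /* Grow the heap by one page: internally this is the brk() syscall
     * extending the anonymous heap vma. */
    if (sbrk(4096) == (void *)-1)
        return 1;
    void *grown = sbrk(0);

    printf("brk before: %p\n", start);
    printf("brk after : %p\n", grown);

    /* Shrink it back. */
    sbrk(-4096);
    return 0;
}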
2. Code Walkthrough
2.1 Key Data Structures
mm:
/**
* 内存描述符。task_struct的mm字段指向它。
* 它包含了进程地址空间有关的全部信息。
*/
struct mm_struct {
/**
* 指向线性区对象的链表头。
*/
struct vm_area_struct * mmap; /* list of VMAs */
/**
* 指向线性区对象的红-黑树的根
*/
struct rb_root mm_rb;
/**
* 指向最后一个引用的线性区对象。
*/
struct vm_area_struct * mmap_cache; /* last find_vma result */
/**
* 在进程地址空间中搜索有效线性地址区的方法。
*/
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
/**
* 释放线性地址区间时调用的方法。
*/
void (*unmap_area) (struct vm_area_struct *area);
/**
* 标识第一个分配的匿名线性区或文件内存映射的线性地址。
*/
unsigned long mmap_base; /* base of mmap area */
/**
* 内核从这个地址开始搜索进程地址空间中线性地址的空间区间。
*/
unsigned long free_area_cache; /* first hole */
/**
* 指向页全局目录。
*/
pgd_t * pgd;
/**
* 次使用计数器。存放共享mm_struct数据结构的轻量级进程的个数。
*/
atomic_t mm_users; /* How many users with user space? */
/**
* 主使用计数器。每当mm_count递减时,内核都要检查它是否变为0,如果是,就要解除这个内存描述符。
*/
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
/**
* 线性区的个数。
*/
int map_count; /* number of VMAs */
/**
* 内存描述符的读写信号量。
* 由于描述符可能在几个轻量级进程间共享,通过这个信号量可以避免竞争条件。
*/
struct rw_semaphore mmap_sem;
/**
* 线性区和页表的自旋锁。
*/
spinlock_t page_table_lock; /* Protects page tables, mm->rss, mm->anon_rss */
/**
* 指向内存描述符链表中的相邻元素。
*/
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
* by mmlist_lock
*/
/**
* start_code-可执行代码的起始地址。
* end_code-可执行代码的最后地址。
* start_data-已初始化数据的起始地址。
* end_data--已初始化数据的结束地址。
*/
unsigned long start_code, end_code, start_data, end_data;
/**
* start_brk-堆的超始地址。
* brk-堆的当前最后地址。
* start_stack-用户态堆栈的起始地址。
*/
unsigned long start_brk, brk, start_stack;
/**
* arg_start-命令行参数的起始地址。
* arg_end-命令行参数的结束地址。
* env_start-环境变量的起始地址。
* env_end-环境变量的结束地址。
*/
unsigned long arg_start, arg_end, env_start, env_end;
/**
* rss-分配给进程的页框总数
* anon_rss-分配给匿名内存映射的页框数。s
* total_vm-进程地址空间的大小(页框数)
* locked_vm-锁住而不能换出的页的个数。
* shared_vm-共享文件内存映射中的页数。
*/
unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
/**
* exec_vm-可执行内存映射的页数。
* stack_vm-用户态堆栈中的页数。
* reserved_vm-在保留区中的页数或在特殊线性区中的页数。
* def_flags-线性区默认的访问标志。
* nr_ptes-this进程的页表数。
*/
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;
/**
* 开始执行elf程序时使用。
*/
unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
/**
* 表示是否可以产生内存信息转储的标志。
*/
unsigned dumpable:1;
/**
* 懒惰TLB交换的位掩码。
*/
cpumask_t cpu_vm_mask;
/* Architecture-specific MM context */
/**
* 特殊体系结构信息的表。
* 如80X86平台上的LDT地址。
*/
mm_context_t context;
/* Token based thrashing protection. */
/**
* 进程有资格获得交换标记的时间。
*/
unsigned long swap_token_time;
/**
* 如果最近发生了主缺页。则设置该标志。
*/
char recent_pagein;
/* coredumping support */
/**
* 正在把进程地址空间的内容卸载到转储文件中的轻量级进程的数量。
*/
int core_waiters;
/**
* core_startup_done-指向创建内存转储文件时的补充原语。
* core_done-创建内存转储文件时使用的补充原语。
*/
struct completion *core_startup_done, core_done;
/* aio bits */
/**
* 用于保护异步IO上下文链表的锁。
*/
rwlock_t ioctx_list_lock;
/**
* 异步IO上下文链表。
* 一个应用可以创建多个AIO环境,一个给定进程的所有的kioctx描述符存放在一个单向链表中,该链表位于ioctx_list字段
*/
struct kioctx *ioctx_list;
/**
* 默认的异步IO上下文。
*/
struct kioctx default_kioctx;
/**
* 进程所拥有的最大页框数。
*/
unsigned long hiwater_rss; /* High-water RSS usage */
/**
* 进程线性区中的最大页数。
*/
unsigned long hiwater_vm; /* High-water virtual memory usage */
};
vma:
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
* space that has a special rule for the page-fault handlers (ie a shared
* library, the executable area etc).
*/
/**
* 线性区描述符。
*/
struct vm_area_struct {
/**
* 指向线性区所在的内存描述符。
*/
struct mm_struct * vm_mm; /* The address space we belong to. */
/**
* 线性区内的第一个线性地址。
*/
unsigned long vm_start; /* Our start address within vm_mm. */
/**
* 线性区之后的第一个线性地址。
*/
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */
/* linked list of VM areas per task, sorted by address */
/**
* 进程链表中的下一个线性区。
*/
struct vm_area_struct *vm_next;
/**
* 线性区中页框的访问许可权。
*/
pgprot_t vm_page_prot; /* Access permissions of this VMA. */
/**
* 线性区的标志。
*/
unsigned long vm_flags; /* Flags, listed below. */
/**
* 用于红黑树的数据。
*/
struct rb_node vm_rb;
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap prio tree, or
* linkage to the list of like vmas hanging off its node, or
* linkage of vma in the address_space->i_mmap_nonlinear list.
*/
/**
* 链接到反映射所使用的数据结构。
*/
union {
/**
* 如果在优先搜索树中,存在两个节点的基索引、堆索引、大小索引完全相同,那么这些相同的节点会被链接到一个链表,而vm_set就是这个链表的元素。
*/
struct {
struct list_head list;
void *parent; /* aligns with prio_tree_node parent */
struct vm_area_struct *head;
} vm_set;
/**
* 如果是文件映射,那么prio_tree_node用于将线性区插入到优先搜索树中。作为搜索树的一个节点。
*/
struct raw_prio_tree_node prio_tree_node;
} shared;
/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
/**
* 指向匿名线性区链表的指针(参见"映射页的反映射")。
* 页框结构有一个anon_vma指针,指向该页的第一个线性区,随后的线性区通过此字段链接起来。
* 通过此字段,可以将线性区链接到此链表中。
*/
struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
/**
* 指向anon_vma数据结构的指针(参见"映射页的反映射")。此指针也存放在页结构的mapping字段中。
*/
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
/* Function pointers to deal with this struct. */
/**
* 指向线性区的方法。
*/
struct vm_operations_struct * vm_ops;
/* Information about our backing store: */
/**
* 在映射文件中的偏移量(以页为单位)。对匿名页,它等于0或vm_start/PAGE_SIZE
*/
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
/**
* 指向映射文件的文件对象(如果有的话)
*/
struct file * vm_file; /* File we map to (can be NULL). */
/**
* 指向内存区的私有数据。
*/
void * vm_private_data; /* was vm_pte (shared mem) */
/**
* 释放非线性文件内存映射中的一个线性地址区间时使用。
*/
unsigned long vm_truncate_count;/* truncate_count or restart_addr */
#ifndef CONFIG_MMU
atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
};
2.2 mmap()
arch/x86/kernel/sys_x86_64.c:
SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, off)
{
long error;
error = -EINVAL;
if (off & ~PAGE_MASK)
goto out;
error = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
out:
return error;
}
↓
mm/mmap.c:
SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, pgoff)
{
struct file *file = NULL;
unsigned long retval;
/* (1) 文件内存映射(非匿名内存映射) */
if (!(flags & MAP_ANONYMOUS)) {
audit_mmap_fd(fd, flags);
/* (1.1) 根据fd获取到file */
file = fget(fd);
if (!file)
return -EBADF;
/* (1.2) 如果文件是hugepage,根据hugepage计算长度 */
if (is_file_hugepages(file))
len = ALIGN(len, huge_page_size(hstate_file(file)));
retval = -EINVAL;
/* (1.3) 如果指定了huge tlb映射,但是文件不是huge page,出错返回 */
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
/* (2) 匿名内存映射 且huge tlb */
} else if (flags & MAP_HUGETLB) {
struct user_struct *user = NULL;
struct hstate *hs;
hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (!hs)
return -EINVAL;
len = ALIGN(len, huge_page_size(hs));
/*
* VM_NORESERVE is used because the reservations will be
* taken when vm_ops->mmap() is called
* A dummy user value is used because we are not locking
* memory so no accounting is necessary
*/
file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
VM_NORESERVE,
&user, HUGETLB_ANONHUGE_INODE,
(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (IS_ERR(file))
return PTR_ERR(file);
}
/* (3) 清除用户设置的以下两个标志 */
flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
/* (4) 进一步调用 */
retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
if (file)
fput(file);
return retval;
}
↓
unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff)
{
unsigned long ret;
struct mm_struct *mm = current->mm;
unsigned long populate;
LIST_HEAD(uf);
ret = security_mmap_file(file, prot, flag);
if (!ret) {
if (down_write_killable(&mm->mmap_sem))
return -EINTR;
/* 进一步调用 */
ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
&populate, &uf);
up_write(&mm->mmap_sem);
userfaultfd_unmap_complete(mm, &uf);
/* 如果需要,立即填充vma对应的内存 */
if (populate)
mm_populate(ret, populate);
}
return ret;
}
↓
do_mmap_pgoff()
↓
unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flags, vm_flags_t vm_flags,
unsigned long pgoff, unsigned long *populate,
struct list_head *uf)
{
struct mm_struct *mm = current->mm;
int pkey = 0;
*populate = 0;
if (!len)
return -EINVAL;
/*
* Does the application expect PROT_READ to imply PROT_EXEC?
* 应用程序是否期望PROT_READ暗示PROT_EXEC?
*
* (the exception is when the underlying filesystem is noexec
* mounted, in which case we dont add PROT_EXEC.)
* (exception: when the underlying filesystem is mounted noexec, we do not add PROT_EXEC.)
*/
/* (4.1.1) PROT_READ暗示PROT_EXEC */
if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
if (!(file && path_noexec(&file->f_path)))
prot |= PROT_EXEC;
/* (4.1.2) 如果没有设置固定地址的标志,给地址按page取整,且使其不小于mmap_min_addr */
if (!(flags & MAP_FIXED))
addr = round_hint_to_min(addr);
/* Careful about overflows.. */
/* (4.1.3) 给长度按page取整 */
len = PAGE_ALIGN(len);
if (!len)
return -ENOMEM;
/* offset overflow? */
/* (4.1.4) 判断page offset + 长度,是否已经溢出 */
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
return -EOVERFLOW;
/* Too many mappings? */
/* (4.1.5) 判断本进程mmap的区段个数已经超标 */
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
/* (4.2) 从本进程的线性地址红黑树中分配一块空白地址 */
addr = get_unmapped_area(file, addr, len, pgoff, flags);
if (offset_in_page(addr))
return addr;
/* (4.3.1) 如果prot只指定了exec */
if (prot == PROT_EXEC) {
pkey = execute_only_pkey(mm);
if (pkey < 0)
pkey = 0;
}
/* Do simple checking here so the lower-level routines won't have
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
* 在这里进行简单的检查,因此不必执行较低级别的例程。 我们假定访问权限已由内存对象的打开处理,因此在此不做任何操作。
*/
/* (4.3.2) 计算初始vm flags:
根据prot计算相关的vm flag
根据flags计算相关的vm flag
再综合mm的默认flags等
*/
vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
/* (4.3.3) 如果指定了内存lock标志,但是当前进程不能lock,出错返回 */
if (flags & MAP_LOCKED)
if (!can_do_mlock())
return -EPERM;
/* (4.3.4) 如果指定了内存lock标志,但是lock的长度超标,出错返回 */
if (mlock_future_check(mm, vm_flags, len))
return -EAGAIN;
/* (4.4) 文件内存映射的一系列判断和处理 */
if (file) {
struct inode *inode = file_inode(file);
unsigned long flags_mask;
/* (4.4.1) 指定的page offset和len,需要在文件的合法长度内 */
if (!file_mmap_ok(file, inode, pgoff, len))
return -EOVERFLOW;
/* (4.4.2) 本文件支持的mask,和mmap()传递下来的flags进行判断 */
flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags;
switch (flags & MAP_TYPE) {
/* (4.4.3) 共享映射 */
case MAP_SHARED:
/*
* Force use of MAP_SHARED_VALIDATE with non-legacy
* flags. E.g. MAP_SYNC is dangerous to use with
* MAP_SHARED as you don't know which consistency model
* you will get. We silently ignore unsupported flags
* with MAP_SHARED to preserve backward compatibility.
*/
flags &= LEGACY_MAP_MASK;
/* fall through */
/* (4.4.4) 共享&校验映射 */
case MAP_SHARED_VALIDATE:
if (flags & ~flags_mask)
return -EOPNOTSUPP;
if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
return -EACCES;
/*
* Make sure we don't allow writing to an append-only
* file..
*/
if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
return -EACCES;
/*
* Make sure there are no mandatory locks on the file.
*/
if (locks_verify_locked(file))
return -EAGAIN;
vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
/* fall through */
/* (4.4.3) 私有映射 */
case MAP_PRIVATE:
if (!(file->f_mode & FMODE_READ))
return -EACCES;
if (path_noexec(&file->f_path)) {
if (vm_flags & VM_EXEC)
return -EPERM;
vm_flags &= ~VM_MAYEXEC;
}
if (!file->f_op->mmap)
return -ENODEV;
if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
return -EINVAL;
break;
default:
return -EINVAL;
}
/* (4.5) 匿名内存映射的一系列判断和处理 */
} else {
switch (flags & MAP_TYPE) {
case MAP_SHARED:
if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
return -EINVAL;
/*
* Ignore pgoff.
*/
pgoff = 0;
vm_flags |= VM_SHARED | VM_MAYSHARE;
break;
case MAP_PRIVATE:
/*
* Set pgoff according to addr for anon_vma.
*/
pgoff = addr >> PAGE_SHIFT;
break;
default:
return -EINVAL;
}
}
/*
* Set 'VM_NORESERVE' if we should not account for the
* memory use of this mapping.
* 如果我们不应该考虑此映射的内存使用,则设置“ VM_NORESERVE”。
*/
if (flags & MAP_NORESERVE) {
/* We honor MAP_NORESERVE if allowed to overcommit */
if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
vm_flags |= VM_NORESERVE;
/* hugetlb applies strict overcommit unless MAP_NORESERVE */
if (file && is_file_hugepages(file))
vm_flags |= VM_NORESERVE;
}
/* (4.6) 根据查找到的地址、flags,正式在线性地址红黑树中插入一个新的VMA */
addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
/* (4.7) 默认只是分配vma,不进行实际的内存分配和mmu映射,延迟到page_fault时才处理
如果设置了立即填充的标志,在分配vma时就分配好内存
*/
if (!IS_ERR_VALUE(addr) &&
((vm_flags & VM_LOCKED) ||
(flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
*populate = len;
return addr;
}
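The (4.7) populate branch above only sets *populate; it is vm_mmap_pgoff() that then calls mm_populate() to prefault the range. A minimal user-space sketch of the difference, assuming /tmp/data.bin is an existing file of at least one page (the path is a placeholder):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/data.bin", O_RDONLY);   /* placeholder path */
    if (fd < 0)
        return 1;

    /* MAP_POPULATE asks the kernel to read the file and build the page
     * tables during mmap() itself, so the first access will not fault. */
    void *eager = mmap(NULL, 4096, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);

    /* Without it, mmap() only creates the vma; pages are faulted in lazily. */
    void *lazy = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

    if (eager != MAP_FAILED)
        munmap(eager, 4096);
    if (lazy != MAP_FAILED)
        munmap(lazy, 4096);
    close(fd);
    return 0;
}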
2.2.1 mm->mmap_base (the mmap base address)
current->mm->get_unmapped_area is assigned arch_get_unmapped_area() or arch_get_unmapped_area_topdown() by default; the assignment is made when the process image is set up. The same code path does another important job as well: it initializes the mmap base address:
sys_execve() ... → load_elf_binary() → setup_new_exec() → arch_pick_mmap_layout()
void arch_pick_mmap_layout(struct mm_struct *mm)
{
/* (1) 给get_unmapped_area成员赋值 */
if (mmap_is_legacy())
mm->get_unmapped_area = arch_get_unmapped_area;
else
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
/* (2) 计算64bit模式下,mmap的基地址
传统layout模式:用户空间的1/3处,再加上随机偏移
现代layout模式:用户空间顶端减去堆栈和堆栈随机偏移,再减去随机偏移
*/
arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
arch_rnd(mmap64_rnd_bits), task_size_64bit(0));
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
/*
* The mmap syscall mapping base decision depends solely on the
* syscall type (64-bit or compat). This applies for 64bit
* applications and 32bit applications. The 64bit syscall uses
* mmap_base, the compat syscall uses mmap_compat_base.
*/
/* (3) Compute the mmap base address for 32-bit compat mode */
arch_pick_mmap_base(&mm->mmap_compat_base, &mm->mmap_compat_legacy_base,
arch_rnd(mmap32_rnd_bits), task_size_32bit());
#endif
}
|→
// 用户地址空间的最大值:2^47-0x1000 = 0x7FFFFFFFF000 // 约128T
#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
/* This decides where the kernel will search for a free chunk of vm
* space during mmap's.
*/
#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? \
0xc0000000 : 0xFFFFe000)
#define TASK_SIZE_LOW (test_thread_flag(TIF_ADDR32) ? \
IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
#define TASK_SIZE (test_thread_flag(TIF_ADDR32) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)
#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)
#define STACK_TOP TASK_SIZE_LOW
#define STACK_TOP_MAX TASK_SIZE_MAX
|→
// 根据定义的随机bit,来计算随机偏移多少个page
static unsigned long arch_rnd(unsigned int rndbits)
{
if (!(current->flags & PF_RANDOMIZE))
return 0;
return (get_random_long() & ((1UL << rndbits) - 1)) << PAGE_SHIFT;
}
unsigned long task_size_64bit(int full_addr_space)
{
return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
}
|→
static void arch_pick_mmap_base(unsigned long *base, unsigned long *legacy_base,
unsigned long random_factor, unsigned long task_size)
{
/* (2.1) 传统layout模式下,mmap的基址:
PAGE_ALIGN(task_size / 3) + rnd // 用户空间的1/3处,再加上随机偏移
*/
*legacy_base = mmap_legacy_base(random_factor, task_size);
if (mmap_is_legacy())
*base = *legacy_base;
else
/* (2.2) 现代layout模式下,mmap的基址:
PAGE_ALIGN(task_size - stask_gap - rnd) // 用户空间顶端减去堆栈和堆栈随机偏移,再减去随机偏移
*/
*base = mmap_base(random_factor, task_size);
}
||→
static unsigned long mmap_legacy_base(unsigned long rnd,
unsigned long task_size)
{
return __TASK_UNMAPPED_BASE(task_size) + rnd;
}
#define __TASK_UNMAPPED_BASE(task_size) (PAGE_ALIGN(task_size / 3))
||→
static unsigned long mmap_base(unsigned long rnd, unsigned long task_size)
{
// 堆栈的最大值
unsigned long gap = rlimit(RLIMIT_STACK);
// 堆栈的最大随机偏移 + 1M
unsigned long pad = stack_maxrandom_size(task_size) + stack_guard_gap;
unsigned long gap_min, gap_max;
/* Values close to RLIM_INFINITY can overflow. */
if (gap + pad > gap)
gap += pad;
/*
* Top of mmap area (just below the process stack).
* Leave an at least ~128 MB hole with possible stack randomization.
*/
// 最小不小于128M,最大不大于用户空间的5/6
gap_min = SIZE_128M;
gap_max = (task_size / 6) * 5;
if (gap < gap_min)
gap = gap_min;
else if (gap > gap_max)
gap = gap_max;
return PAGE_ALIGN(task_size - gap - rnd);
}
static unsigned long stack_maxrandom_size(unsigned long task_size)
{
unsigned long max = 0;
if (current->flags & PF_RANDOMIZE) {
max = (-1UL) & __STACK_RND_MASK(task_size == task_size_32bit());
max <<= PAGE_SHIFT;
}
return max;
}
/* 1GB for 64bit, 8MB for 32bit */
// 堆栈的最大随机偏移:64bit下16G,32bit下8M
#define __STACK_RND_MASK(is32bit) ((is32bit) ? 0x7ff : 0x3fffff)
/* enforced gap between the expanding stack and other mappings. */
// 1M 的stack gap空间
unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT;
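Putting mmap_legacy_base() and mmap_base() together, here is a simplified user-space recomputation of the two base addresses. Assumptions: 4-level paging (47-bit) task size, randomization treated as zero, an 8 MB RLIMIT_STACK, and the stack-randomization padding ignored; this is a sketch of the arithmetic, not the kernel code path:

#include <stdio.h>

#define PAGE_SIZE     4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
    unsigned long task_size = (1UL << 47) - PAGE_SIZE; /* DEFAULT_MAP_WINDOW */
    unsigned long rnd = 0;                             /* pretend ASLR is off */
    unsigned long gap = (8UL << 20) + (1UL << 20);     /* RLIMIT_STACK + stack_guard_gap */

    if (gap < (128UL << 20))                           /* gap_min = SIZE_128M */
        gap = 128UL << 20;

    unsigned long legacy_base = PAGE_ALIGN(task_size / 3) + rnd;
    unsigned long modern_base = PAGE_ALIGN(task_size - gap - rnd);

    printf("legacy mmap_base: %#lx (about 1/3 of task_size, grows up)\n", legacy_base);
    printf("modern mmap_base: %#lx (just below the stack area, grows down)\n", modern_base);
    return 0;
}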
2.2.2 get_unmapped_area()
get_unmapped_area() looks for a free range of the requested size in the current process's user address space and hands it to the new vma (a small user-space demo follows the listing).
unsigned long
get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
unsigned long (*get_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
unsigned long error = arch_mmap_check(addr, len, flags);
if (error)
return error;
/* Careful about overflows.. */
if (len > TASK_SIZE)
return -ENOMEM;
/* (1.1) 默认,使用本进程的mm->get_unmapped_area函数 */
get_area = current->mm->get_unmapped_area;
if (file) {
/* (1.2) 文件内存映射,且文件有自己的get_unmapped_area,则使用file->f_op->get_unmapped_area */
if (file->f_op->get_unmapped_area)
get_area = file->f_op->get_unmapped_area;
} else if (flags & MAP_SHARED) {
/*
* mmap_region() will call shmem_zero_setup() to create a file,
* so use shmem's get_unmapped_area in case it can be huge.
* do_mmap_pgoff() will clear pgoff, so match alignment.
*/
pgoff = 0;
/* (1.3) 匿名共享内存映射,使用shmem_get_unmapped_area函数 */
get_area = shmem_get_unmapped_area;
}
/* (2) 实际的获取线性区域 */
addr = get_area(file, addr, len, pgoff, flags);
if (IS_ERR_VALUE(addr))
return addr;
if (addr > TASK_SIZE - len)
return -ENOMEM;
if (offset_in_page(addr))
return -EINVAL;
error = security_mmap_addr(addr);
return error ? error : addr;
}
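The hint handling above (try the caller's addr first, otherwise search) is easy to observe from user space: pass a non-MAP_FIXED hint and compare it with what mmap() actually returns. A minimal sketch; the hint value is arbitrary:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* The hint is only a suggestion: if the range is busy or unsuitable,
     * get_unmapped_area() picks another address. */
    void *hint = (void *)0x200000000000UL;   /* arbitrary illustrative hint */
    void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    printf("hint     : %p\n", hint);
    printf("returned : %p\n", p);

    munmap(p, 4096);
    return 0;
}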
2.2.2.1 arch_get_unmapped_area()
Process address space, legacy/classic layout:
Drawback of the classic layout: on x86_32 the virtual address space runs from 0 to 0xc0000000, so every user process has 3 GB available. TASK_UNMAPPED_BASE usually starts at 0x40000000 (i.e. 1 GB), which means the heap has only about 1 GB to grow into before it reaches the mmap area, which in this layout expands bottom-up.
In the legacy layout mmap allocation goes from low to high, from mm->mmap_base up to task_size. The default path is current->mm->get_unmapped_area -> arch_get_unmapped_area():
arch/x86/kernel/sys_x86_64.c:
unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct vm_unmapped_area_info info;
unsigned long begin, end;
/* (2.1) 地址和长度不能大于 47 bits */
addr = mpx_unmapped_area_check(addr, len, flags);
if (IS_ERR_VALUE(addr))
return addr;
/* (2.2) 如果是固定映射,则返回原地址 */
if (flags & MAP_FIXED)
return addr;
/* (2.3) 获取mmap区域的开始、结束地址:
begin :mm->mmap_base
end :task_size
*/
find_start_end(addr, flags, &begin, &end);
/* (2.4) 超长出错返回 */
if (len > end)
return -ENOMEM;
/* (2.5) 按照用户给出的原始addr和len查看这里是否有空洞 */
if (addr) {
addr = PAGE_ALIGN(addr);
vma = find_vma(mm, addr);
/* 这里容易有个误解:
开始以为find_vma()的作用是找到一个vma满足:vma->vm_start <= addr < vma->vm_end
实际find_vma()的作用是找到一个地址最小的vma满足: addr < vma->vm_end
*/
/* 在vma红黑树之外找到空间:
1、如果addr的值在vma红黑树之上:!vma ,且有足够的空间:end - len >= addr
2、如果addr的值在vma红黑树之下,且有足够的空间:addr + len <= vm_start_gap(vma)
*/
if (end - len >= addr &&
(!vma || addr + len <= vm_start_gap(vma)))
return addr;
}
info.flags = 0;
info.length = len;
info.low_limit = begin;
info.high_limit = end;
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
info.align_mask = get_align_mask();
info.align_offset += get_align_bits();
}
/* (2.6) 否则只能废弃掉用户指定的地址,根据长度重新给他找一个合适的地址
优先在vma红黑树的空洞中找,其次在空白位置找
*/
return vm_unmapped_area(&info);
}
|→
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
/* 查找最小的VMA,满足addr < vma->vm_end */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
struct rb_node *rb_node;
struct vm_area_struct *vma;
/* Check the cache first. */
/* (2.5.1) 查找vma cache,是否有vma的区域能包含addr地址 */
vma = vmacache_find(mm, addr);
if (likely(vma))
return vma;
/* (2.5.2) 查找vma红黑树,是否有vma的区域能包含addr地址 */
rb_node = mm->mm_rb.rb_node;
while (rb_node) {
struct vm_area_struct *tmp;
tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
if (tmp->vm_end > addr) {
vma = tmp;
if (tmp->vm_start <= addr)
break;
rb_node = rb_node->rb_left;
} else
rb_node = rb_node->rb_right;
}
/* (2.5.3) 利用查找到的vma来更新vma cache */
if (vma)
vmacache_update(addr, vma);
return vma;
}
|→
/*
* Search for an unmapped address range.
*
* We are looking for a range that:
* - does not intersect with any VMA; // 不与任何VMA相交;
* - is contained within the [low_limit, high_limit) interval; // 包含在[low_limit,high_limit)间隔内;
* - is at least the desired size. // 至少是所需的大小。
* - satisfies (begin_addr & align_mask) == (align_offset & align_mask) // 满足如下条件
*/
static inline unsigned long
vm_unmapped_area(struct vm_unmapped_area_info *info)
{
/* (2.6.1) 从高往低查找 */
if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
return unmapped_area_topdown(info);
/* (2.6.2) 默认从低往高查找 */
else
return unmapped_area(info);
}
||→
unsigned long unmapped_area(struct vm_unmapped_area_info *info)
{
/*
* We implement the search by looking for an rbtree node that
* immediately follows a suitable gap. That is,
* 我们查找红黑树,找到一个合适的洞。需要满足以下条件:
* - gap_start = vma->vm_prev->vm_end <= info->high_limit - length;
* - gap_end = vma->vm_start >= info->low_limit + length;
* - gap_end - gap_start >= length
*/
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long length, low_limit, high_limit, gap_start, gap_end;
/* Adjust search length to account for worst case alignment overhead */
/* (2.6.2.1) 长度加上mask开销 */
length = info->length + info->align_mask;
if (length < info->length)
return -ENOMEM;
/* Adjust search limits by the desired length */
/* (2.6.2.2) 计算high_limit,gap_start<=high_limit */
if (info->high_limit < length)
return -ENOMEM;
high_limit = info->high_limit - length;
/* (2.6.2.3) 计算low_limit,gap_end>=low_limit */
if (info->low_limit > high_limit)
return -ENOMEM;
low_limit = info->low_limit + length;
/* Check if rbtree root looks promising */
if (RB_EMPTY_ROOT(&mm->mm_rb))
goto check_highest;
vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb);
/*
* rb_subtree_gap的定义:
* Largest free memory gap in bytes to the left of this VMA.
* 此VMA左侧的最大可用内存空白(以字节为单位)。
* Either between this VMA and vma->vm_prev, or between one of the
* VMAs below us in the VMA rbtree and its ->vm_prev. This helps
* get_unmapped_area find a free area of the right size.
* 在此VMA和vma-> vm_prev之间,或在VMA rbtree中我们下面的VMA之一与其-> vm_prev之间。 这有助于get_unmapped_area找到合适大小的空闲区域。
*/
if (vma->rb_subtree_gap < length)
goto check_highest;
/* (2.6.2.4) 查找红黑树根节点的左子树中是否有符合要求的空洞。
有个疑问:
根节点的右子树不需要搜索了吗?还是根节点没有右子树?
*/
while (true) {
/* Visit left subtree if it looks promising */
/* (2.6.2.4.1) 一直往左找,找到最左边有合适大小的节点
因为最左边的地址最小
*/
gap_end = vm_start_gap(vma);
if (gap_end >= low_limit && vma->vm_rb.rb_left) {
struct vm_area_struct *left =
rb_entry(vma->vm_rb.rb_left,
struct vm_area_struct, vm_rb);
if (left->rb_subtree_gap >= length) {
vma = left;
continue;
}
}
gap_start = vma->vm_prev ? vm_end_gap(vma->vm_prev) : 0;
check_current:
/* Check if current node has a suitable gap */
if (gap_start > high_limit)
return -ENOMEM;
/* (2.6.2.4.2) 如果已找到合适的洞,则跳出循环 */
if (gap_end >= low_limit &&
gap_end > gap_start && gap_end - gap_start >= length)
goto found;
/* Visit right subtree if it looks promising */
/* (2.6.2.4.3) 如果左子树查找失败,从当前vm的右子树查找 */
if (vma->vm_rb.rb_right) {
struct vm_area_struct *right =
rb_entry(vma->vm_rb.rb_right,
struct vm_area_struct, vm_rb);
if (right->rb_subtree_gap >= length) {
vma = right;
continue;
}
}
/* Go back up the rbtree to find next candidate node */
/* (2.6.2.4.4) 如果左右子树都搜寻失败,向回搜寻父节点 */
while (true) {
struct rb_node *prev = &vma->vm_rb;
if (!rb_parent(prev))
goto check_highest;
vma = rb_entry(rb_parent(prev),
struct vm_area_struct, vm_rb);
if (prev == vma->vm_rb.rb_left) {
gap_start = vm_end_gap(vma->vm_prev);
gap_end = vm_start_gap(vma);
goto check_current;
}
}
}
/* (2.6.2.5) 如果红黑树中没有合适的空洞,从highest空间查找是否有合适的
highest空间是还没有vma分配的空白空间
但是优先查找已分配vma之间的空洞
*/
check_highest:
/* Check highest gap, which does not precede any rbtree node */
gap_start = mm->highest_vm_end;
gap_end = ULONG_MAX; /* Only for VM_BUG_ON below */
if (gap_start > high_limit)
return -ENOMEM;
/* (2.6.2.6) 搜索到了合适的空间,返回开始地址 */
found:
/* We found a suitable gap. Clip it with the original low_limit. */
if (gap_start < info->low_limit)
gap_start = info->low_limit;
/* Adjust gap address to the desired alignment */
gap_start += (info->align_offset - gap_start) & info->align_mask;
VM_BUG_ON(gap_start + info->length > info->high_limit);
VM_BUG_ON(gap_start + info->length > gap_end);
return gap_start;
}
2.2.2.2 arch_get_unmapped_area_topdown()
The new virtual address space layout:
The difference from the classic layout is that a fixed limit is placed on the maximum stack size. Since the stack is bounded, the region for memory mappings can start immediately below the end of the stack, and the mmap area expands top-down. Because the heap still lives in the lower part of the virtual address space and grows upward, the mmap area and the heap can expand toward each other until the remaining virtual address space is exhausted.
In the modern layout mmap allocation goes from high to low, via current->mm->get_unmapped_area -> arch_get_unmapped_area_topdown():
unsigned long
arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
const unsigned long len, const unsigned long pgoff,
const unsigned long flags)
{
struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long addr = addr0;
struct vm_unmapped_area_info info;
addr = mpx_unmapped_area_check(addr, len, flags);
if (IS_ERR_VALUE(addr))
return addr;
/* requested length too big for entire address space */
if (len > TASK_SIZE)
return -ENOMEM;
/* No address checking. See comment at mmap_address_hint_valid() */
if (flags & MAP_FIXED)
return addr;
/* for MAP_32BIT mappings we force the legacy mmap base */
if (!in_compat_syscall() && (flags & MAP_32BIT))
goto bottomup;
/* requesting a specific address */
if (addr) {
addr &= PAGE_MASK;
if (!mmap_address_hint_valid(addr, len))
goto get_unmapped_area;
vma = find_vma(mm, addr);
if (!vma || addr + len <= vm_start_gap(vma))
return addr;
}
get_unmapped_area:
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = get_mmap_base(0);
/*
* If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
* in the full address space.
*
* !in_compat_syscall() check to avoid high addresses for x32.
*/
if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
if (filp) {
info.align_mask = get_align_mask();
info.align_offset += get_align_bits();
}
addr = vm_unmapped_area(&info);
if (!(addr & ~PAGE_MASK))
return addr;
VM_BUG_ON(addr != -ENOMEM);
bottomup:
/*
* A failed mmap() very likely causes application failure,
* so fall back to the bottom-up function here. This scenario
* can happen with large stack limits and large mmap()
* allocations.
*/
return arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
}
↓
unmapped_area_topdown()
unmapped_area_topdown() follows the same logic as unmapped_area(), except that it prefers high addresses, so it searches the right subtree first.
2.2.2.3 file->f_op->get_unmapped_area
If the file system provides its own get_unmapped_area, that one is called instead. Taking ext4 as an example, its file_operations set it to thp_get_unmapped_area:
const struct file_operations ext4_file_operations = {
.llseek = ext4_llseek,
.read_iter = ext4_file_read_iter,
.write_iter = ext4_file_write_iter,
.unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext4_compat_ioctl,
#endif
.mmap = ext4_file_mmap,
.mmap_supported_flags = MAP_SYNC,
.open = ext4_file_open,
.release = ext4_release_file,
.fsync = ext4_sync_file,
.get_unmapped_area = thp_get_unmapped_area,
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
.fallocate = ext4_fallocate,
};
2.2.3 mmap_region()
unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
int error;
struct rb_node **rb_link, *rb_parent;
unsigned long charged = 0;
/* Check against address space limit. */
/* (1) 判断地址空间大小是否已经超标
总的空间:mm->total_vm + npages > rlimit(RLIMIT_AS) >> PAGE_SHIFT
数据空间:mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT
*/
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
unsigned long nr_pages;
/*
* MAP_FIXED may remove pages of mappings that intersects with
* requested mapping. Account for the pages it would unmap.
*/
/* (1.1) 固定映射指定地址的情况下,地址空间可能和已有的VMA重叠,其他情况下不会重叠
需要先unmap移除掉和新地址交错的vma地址
所以可以先减去这部分空间,再判断大小是否超标
*/
nr_pages = count_vma_pages_range(mm, addr, addr + len);
if (!may_expand_vm(mm, vm_flags,
(len >> PAGE_SHIFT) - nr_pages))
return -ENOMEM;
}
/* Clear old maps */
/* (2) 如果新地址和旧的vma有覆盖的情况:
把覆盖的地址范围的vma分割出来,先释放掉
*/
while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
&rb_parent)) {
if (do_munmap(mm, addr, len, uf))
return -ENOMEM;
}
/*
* Private writable mapping: check memory availability
* 私有可写映射:检查内存可用性
*/
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
if (security_vm_enough_memory_mm(mm, charged))
return -ENOMEM;
vm_flags |= VM_ACCOUNT;
}
/*
* Can we just expand an old mapping?
*/
/* (3) 尝试和临近的vma进行merge */
vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
if (vma)
goto out;
/*
* Determine the object being mapped and call the appropriate
* specific mapper. the address has already been validated, but
* not unmapped, but the maps are removed from the list.
*/
/* (4.1) 分配新的vma结构体 */
vma = vm_area_alloc(mm);
if (!vma) {
error = -ENOMEM;
goto unacct_error;
}
/* (4.2) 结构体相关成员 */
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
/* (4.3) 文件内存映射 */
if (file) {
if (vm_flags & VM_DENYWRITE) {
error = deny_write_access(file);
if (error)
goto free_vma;
}
if (vm_flags & VM_SHARED) {
error = mapping_map_writable(file->f_mapping);
if (error)
goto allow_write_and_free_vma;
}
/* ->mmap() can change vma->vm_file, but must guarantee that
* vma_link() below can deny write-access if VM_DENYWRITE is set
* and map writably if VM_SHARED is set. This usually means the
* new file must not have been exposed to user-space, yet.
*/
/* (4.3.1) 给vma->vm_file赋值 */
vma->vm_file = get_file(file);
/* (4.3.2) 调用file->f_op->mmap,给vma->vm_ops赋值
例如ext4:vma->vm_ops = &ext4_file_vm_ops;
*/
error = call_mmap(file, vma);
if (error)
goto unmap_and_free_vma;
/* Can addr have changed??
*
* Answer: Yes, several device drivers can do it in their
* f_op->mmap method. -DaveM
* Bug: If addr is changed, prev, rb_link, rb_parent should
* be updated for vma_link()
*/
WARN_ON_ONCE(addr != vma->vm_start);
addr = vma->vm_start;
vm_flags = vma->vm_flags;
/* (4.4) 匿名共享内存映射 */
} else if (vm_flags & VM_SHARED) {
/* (4.4.1) 赋值:
vma->vm_file = shmem_kernel_file_setup("dev/zero", size, vma->vm_flags);
vma->vm_ops = &shmem_vm_ops;
*/
error = shmem_zero_setup(vma);
if (error)
goto free_vma;
}
/* (4.4) 匿名私有内存映射呢?
vma->vm_file = NULL?
vma->vm_ops = NULL?
*/
/* (4.6) 将新的vma插入 */
vma_link(mm, vma, prev, rb_link, rb_parent);
/* Once vma denies write, undo our temporary denial count */
if (file) {
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
if (vm_flags & VM_DENYWRITE)
allow_write_access(file);
}
file = vma->vm_file;
out:
perf_event_mmap(vma);
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm)))
mm->locked_vm += (len >> PAGE_SHIFT);
else
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
}
if (file)
uprobe_mmap(vma);
/*
* New (or expanded) vma always get soft dirty status.
* Otherwise user-space soft-dirty page tracker won't
* be able to distinguish situation when vma area unmapped,
* then new mapped in-place (which must be aimed as
* a completely new data area).
*/
vma->vm_flags |= VM_SOFTDIRTY;
vma_set_page_prot(vma);
return addr;
unmap_and_free_vma:
vma_fput(vma);
vma->vm_file = NULL;
/* Undo any partial mapping done by a device driver. */
unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
charged = 0;
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
allow_write_and_free_vma:
if (vm_flags & VM_DENYWRITE)
allow_write_access(file);
free_vma:
vm_area_free(vma);
unacct_error:
if (charged)
vm_unacct_memory(charged);
return error;
}
↓
vma_link()
2.2.3.1 vma_link()
static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node **rb_link,
struct rb_node *rb_parent)
{
struct address_space *mapping = NULL;
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
i_mmap_lock_write(mapping);
}
/* (4.6.1) 将新的vma插入到vma红黑树和vma链表中,并且更新树上的各种参数 */
__vma_link(mm, vma, prev, rb_link, rb_parent);
/* (4.6.2) 将vma插入到文件的file->f_mapping->i_mmap缓存树中 */
__vma_link_file(vma);
if (mapping)
i_mmap_unlock_write(mapping);
mm->map_count++;
validate_mm(mm);
}
After mmap(), the relationship between the user virtual addresses and the file's page cache looks like this:
f_op->mmap() does not immediately build the mapping between the user addresses described by the vma and the page cache; it only installs vma->vm_ops. When the user later touches an address inside the vma, a page fault occurs and vma->vm_ops->fault() (the old ->nopage()) builds the mapping. A minimal illustrative sketch of this pattern follows.
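To make that concrete, here is a minimal, purely illustrative driver-style sketch (all names are hypothetical, error handling and locking are trimmed): the mmap file operation only installs vm_ops, and the .fault callback supplies the page when the address is first touched.

#include <linux/fs.h>
#include <linux/mm.h>

/* Illustrative sketch only; not a complete driver. */
static int demo_fault(struct vm_fault *vmf)
{
    struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

    if (!page)
        return VM_FAULT_OOM;
    vmf->page = page;      /* the fault core inserts it into the page table */
    return 0;
}

static const struct vm_operations_struct demo_vm_ops = {
    .fault = demo_fault,
};

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
    /* No pages and no mmu entries yet: just record how to fault them in. */
    vma->vm_ops = &demo_vm_ops;
    return 0;
}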
2.2.4 do_munmap()
/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
* work. This now handles partial unmappings.
* Jeremy Fitzhardinge <jeremy@goop.org>
*/
int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
struct list_head *uf)
{
unsigned long end;
struct vm_area_struct *vma, *prev, *last;
if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
return -EINVAL;
len = PAGE_ALIGN(len);
if (len == 0)
return -EINVAL;
/* Find the first overlapping VMA */
/* (1) 找到第一个可能重叠的VMA */
vma = find_vma(mm, start);
if (!vma)
return 0;
prev = vma->vm_prev;
/* we have start < vma->vm_end */
/* if it doesn't overlap, we have nothing.. */
/* (2) 如果地址没有重叠,直接返回 */
end = start + len;
if (vma->vm_start >= end)
return 0;
/* (3) 如果有unmap区域和vma有重叠,先尝试把unmap区域切分成独立的小块vma,再unmap掉 */
/*
* If we need to split any vma, do it now to save pain later.
*
* Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
* unmapped vm_area_struct will remain in use: so lower split_vma
* places tmp vma above, and higher split_vma places tmp vma below.
*/
/* (3.1) 如果start和vma重叠,切一刀 */
if (start > vma->vm_start) {
int error;
/*
* Make sure that map_count on return from munmap() will
* not exceed its limit; but let map_count go just above
* its limit temporarily, to help free resources as expected.
*/
if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
return -ENOMEM;
error = __split_vma(mm, vma, start, 0);
if (error)
return error;
prev = vma;
}
/* Does it split the last one? */
/* (3.2) 如果end和vma冲切,切一刀 */
last = find_vma(mm, end);
if (last && end > last->vm_start) {
int error = __split_vma(mm, last, end, 1);
if (error)
return error;
}
vma = prev ? prev->vm_next : mm->mmap;
if (unlikely(uf)) {
/*
* If userfaultfd_unmap_prep returns an error the vmas
* will remain splitted, but userland will get a
* highly unexpected error anyway. This is no
* different than the case where the first of the two
* __split_vma fails, but we don't undo the first
* split, despite we could. This is unlikely enough
* failure that it's not worth optimizing it for.
*/
int error = userfaultfd_unmap_prep(vma, start, end, uf);
if (error)
return error;
}
/*
* unlock any mlock()ed ranges before detaching vmas
*/
/* (4) 移除目标vma上的相关lock */
if (mm->locked_vm) {
struct vm_area_struct *tmp = vma;
while (tmp && tmp->vm_start < end) {
if (tmp->vm_flags & VM_LOCKED) {
mm->locked_vm -= vma_pages(tmp);
munlock_vma_pages_all(tmp);
}
tmp = tmp->vm_next;
}
}
/* (5) 移除目标vma */
/*
* Remove the vma's, and unmap the actual pages
*/
/* (5.1) 从vma红黑树中移除vma */
detach_vmas_to_be_unmapped(mm, vma, prev, end);
/* (5.2) 释放掉vma空间对应的mmu映射表以及内存 */
unmap_region(mm, vma, prev, start, end);
/* (5.3) arch相关的vma释放 */
arch_unmap(mm, vma, start, end);
/* (5.4) 移除掉vma的其他信息,最后释放掉vma结构体 */
/* Fix up all other VM information */
remove_vma_list(mm, vma);
return 0;
}
|→
static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *vma, struct vm_area_struct *prev,
unsigned long start, unsigned long end)
{
struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
struct mmu_gather tlb;
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end);
free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
next ? next->vm_start : USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, start, end);
}
||→
free_pgtables()
2.2.5 mm_populate()
static inline void mm_populate(unsigned long addr, unsigned long len)
{
/* Ignore errors */
(void) __mm_populate(addr, len, 1);
}
↓
int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
{
struct mm_struct *mm = current->mm;
unsigned long end, nstart, nend;
struct vm_area_struct *vma = NULL;
int locked = 0;
long ret = 0;
end = start + len;
for (nstart = start; nstart < end; nstart = nend) {
/*
* We want to fault in pages for [nstart; end) address range.
* Find first corresponding VMA.
*/
/* (1) 根据目标地址,查找vma */
if (!locked) {
locked = 1;
down_read(&mm->mmap_sem);
vma = find_vma(mm, nstart);
} else if (nstart >= vma->vm_end)
vma = vma->vm_next;
if (!vma || vma->vm_start >= end)
break;
/*
* Set [nstart; nend) to intersection of desired address
* range with the first VMA. Also, skip undesirable VMA types.
*/
/* (2) 计算目标地址和vma的重叠部分 */
nend = min(end, vma->vm_end);
if (vma->vm_flags & (VM_IO | VM_PFNMAP))
continue;
if (nstart < vma->vm_start)
nstart = vma->vm_start;
/*
* Now fault in a range of pages. populate_vma_page_range()
* double checks the vma flags, so that it won't mlock pages
* if the vma was already munlocked.
*/
/* (3) 对重叠部分进行page内存填充 */
ret = populate_vma_page_range(vma, nstart, nend, &locked);
if (ret < 0) {
if (ignore_errors) {
ret = 0;
continue; /* continue at next VMA */
}
break;
}
nend = nstart + ret * PAGE_SIZE;
ret = 0;
}
if (locked)
up_read(&mm->mmap_sem);
return ret; /* 0 or negative error code */
}
↓
long populate_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *nonblocking)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long nr_pages = (end - start) / PAGE_SIZE;
int gup_flags;
VM_BUG_ON(start & ~PAGE_MASK);
VM_BUG_ON(end & ~PAGE_MASK);
VM_BUG_ON_VMA(start < vma->vm_start, vma);
VM_BUG_ON_VMA(end > vma->vm_end, vma);
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
if (vma->vm_flags & VM_LOCKONFAULT)
gup_flags &= ~FOLL_POPULATE;
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
* and we would not want to dirty them for nothing.
*/
if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
gup_flags |= FOLL_WRITE;
/*
* We want mlock to succeed for regions that have any permissions
* other than PROT_NONE.
*/
if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
gup_flags |= FOLL_FORCE;
/*
* We made sure addr is within a VMA, so the following will
* not result in a stack expansion that recurses back here.
*/
return __get_user_pages(current, mm, start, nr_pages, gup_flags,
NULL, NULL, nonblocking);
}
↓
/**
* __get_user_pages() - pin user pages in memory // 将用户页面固定在内存中
* @tsk: task_struct of target task
* @mm: mm_struct of target mm
* @start: starting user address
* @nr_pages: number of pages from start to pin
* @gup_flags: flags modifying pin behaviour
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long. Or NULL, if caller
* only intends to ensure the pages are faulted in.
* @vmas: array of pointers to vmas corresponding to each page.
* Or NULL if the caller does not require them.
* @nonblocking: whether waiting for disk IO or mmap_sem contention
*
* Returns number of pages pinned. This may be fewer than the number
* requested. If nr_pages is 0 or negative, returns 0. If no pages
* were pinned, returns -errno. Each page returned must be released
* with a put_page() call when it is finished with. vmas will only
* remain valid while mmap_sem is held.
* 返回固定的页数。这可能少于请求的数量。如果nr_pages为0或负数,则返回0。如果没有固定页面,则返回-errno。完成后,返回的每个页面都必须使用put_page()调用释放。只有保留mmap_sem时,vmas才保持有效。
*
* Must be called with mmap_sem held. It may be released. See below.
* 必须在保持mmap_sem的情况下调用。它可能会被释放。见下文。
*
* __get_user_pages walks a process's page tables and takes a reference to
* each struct page that each user address corresponds to at a given
* instant. That is, it takes the page that would be accessed if a user
* thread accesses the given user virtual address at that instant.
* __get_user_pages遍历进程的页表,并引用给定瞬间每个用户地址所对应的每个struct页。也就是说,如果用户线程在该时刻访问给定的用户虚拟地址,则它将占用要访问的页面。
*
* This does not guarantee that the page exists in the user mappings when
* __get_user_pages returns, and there may even be a completely different
* page there in some cases (eg. if mmapped pagecache has been invalidated
* and subsequently re faulted). However it does guarantee that the page
* won't be freed completely. And mostly callers simply care that the page
* contains data that was valid *at some point in time*. Typically, an IO
* or similar operation cannot guarantee anything stronger anyway because
* locks can't be held over the syscall boundary.
* 这不能保证在__get_user_pages返回时该页面存在于用户映射中,并且在某些情况下甚至可能存在一个完全不同的页面(例如,如果映射的页面缓存已失效并随后发生故障)。但是,它可以确保不会完全释放该页面。而且大多数调用者只是在乎页面是否包含在某个时间点有效的数据。通常,由于无法在系统调用边界上保持锁,因此IO或类似操作无论如何都无法保证更强大。
*
* If @gup_flags & FOLL_WRITE == 0, the page must not be written to. If
* the page is written to, set_page_dirty (or set_page_dirty_lock, as
* appropriate) must be called after the page is finished with, and
* before put_page is called.
* 如果@gup_flags和FOLL_WRITE == 0,则不得写入该页面。如果要写入页面,则必须在页面完成之后且在调用put_page之前调用set_page_dirty(或适当的set_page_dirty_lock)。
*
* If @nonblocking != NULL, __get_user_pages will not wait for disk IO
* or mmap_sem contention, and if waiting is needed to pin all pages,
* *@nonblocking will be set to 0. Further, if @gup_flags does not
* include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
* this case.
* 如果@nonblocking!= NULL,则__get_user_pages将不等待磁盘IO或mmap_sem争用,并且如果需要等待固定所有页面,则* @ nonblocking将设置为0。此外,如果@gup_flags不包含FOLL_NOWAIT,则mmap_sem在这种情况下将通过up_read()释放。
*
* A caller using such a combination of @nonblocking and @gup_flags
* must therefore hold the mmap_sem for reading only, and recognize
* when it's been released. Otherwise, it must be held for either
* reading or writing and will not be released.
* 因此,使用@nonblocking和@gup_flags的组合的调用者必须将mmap_sem保留为只读,并识别它何时被释放。否则,必须保留它以进行读取或书写,并且不会被释放。
*
* In most cases, get_user_pages or get_user_pages_fast should be used
* instead of __get_user_pages. __get_user_pages should be used only if
* you need some special @gup_flags.
* 在大多数情况下,应使用get_user_pages或get_user_pages_fast而不是__get_user_pages。 __get_user_pages仅在需要一些特殊的@gup_flags时使用。
*/
static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *nonblocking)
{
long i = 0;
unsigned int page_mask;
struct vm_area_struct *vma = NULL;
if (!nr_pages)
return 0;
VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
/*
* If FOLL_FORCE is set then do not force a full fault as the hinting
* fault information is unrelated to the reference behaviour of a task
* using the address space
*/
if (!(gup_flags & FOLL_FORCE))
gup_flags |= FOLL_NUMA;
do {
struct page *page;
unsigned int foll_flags = gup_flags;
unsigned int page_increm;
/* first iteration or cross vma bound */
/* (3.1) 地址跨区域的处理 */
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
int ret;
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
if (ret)
return i ? : ret;
page_mask = 0;
goto next_page;
}
if (!vma || check_vma_flags(vma, gup_flags))
return i ? : -EFAULT;
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
gup_flags, nonblocking);
continue;
}
}
retry:
/*
* If we have a pending SIGKILL, don't keep faulting pages and
* potentially allocating memory.
*/
if (unlikely(fatal_signal_pending(current)))
return i ? i : -ERESTARTSYS;
cond_resched();
/* (3.2) 逐个查询vma中地址对应的page是否已经分配 */
page = follow_page_mask(vma, start, foll_flags, &page_mask);
if (!page) {
int ret;
/* (3.3) 如果page没有分配,则使用缺页处理来分配page */
ret = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
switch (ret) {
case 0:
goto retry;
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
return i ? i : ret;
case -EBUSY:
return i;
case -ENOENT:
goto next_page;
}
BUG();
} else if (PTR_ERR(page) == -EEXIST) {
/*
* Proper page table entry exists, but no corresponding
* struct page.
*/
goto next_page;
} else if (IS_ERR(page)) {
return i ? i : PTR_ERR(page);
}
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
flush_dcache_page(page);
page_mask = 0;
}
next_page:
if (vmas) {
vmas[i] = vma;
page_mask = 0;
}
page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
if (page_increm > nr_pages)
page_increm = nr_pages;
i += page_increm;
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
} while (nr_pages);
return i;
}
↓
2.3 Page Fault Handling
2.3.1 do_page_fault()
The page-fault handling path is: do_page_fault() -> __do_page_fault() -> do_user_addr_fault() -> handle_mm_fault().
The core function is handle_mm_fault(), the same one used by mm_populate(); only the timing differs.
mm_populate() allocates the physical memory and reads in the contents right after mmap() has set up the vma, when the mapping was requested with VM_LOCKED (or MAP_POPULATE).
do_page_fault() allocates the physical memory and reads in the contents only when the memory is actually accessed.
2.3.2 faultin_page()
Several reasons a fault can occur: the page was never present (first touch), copy-on-write (COW), and swap.
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
unsigned long address, unsigned int *flags, int *nonblocking)
{
unsigned int fault_flags = 0;
int ret;
/* mlock all present pages, but do not fault in new pages */
if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
return -ENOENT;
if (*flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
if (*flags & FOLL_REMOTE)
fault_flags |= FAULT_FLAG_REMOTE;
if (nonblocking)
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
if (*flags & FOLL_TRIED) {
VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
fault_flags |= FAULT_FLAG_TRIED;
}
ret = handle_mm_fault(vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
int err = vm_fault_to_errno(ret, *flags);
if (err)
return err;
BUG();
}
if (tsk) {
if (ret & VM_FAULT_MAJOR)
tsk->maj_flt++;
else
tsk->min_flt++;
}
if (ret & VM_FAULT_RETRY) {
if (nonblocking)
*nonblocking = 0;
return -EBUSY;
}
/*
* The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
* necessary, even if maybe_mkwrite decided not to set pte_write. We
* can thus safely do subsequent page lookups as if they were reads.
* But only do so when looping for pte_write is futile: in some cases
* userspace may also be wanting to write to the gotten user page,
* which a read fault here might prevent (a readonly page might get
* reCOWed by userspace write).
*/
if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
*flags |= FOLL_COW;
return 0;
}
↓
int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
{
int ret;
__set_current_state(TASK_RUNNING);
count_vm_event(PGFAULT);
count_memcg_event_mm(vma->vm_mm, PGFAULT);
/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
flags & FAULT_FLAG_REMOTE))
return VM_FAULT_SIGSEGV;
/*
* Enable the memcg OOM handling for faults triggered in user
* space. Kernel faults are handled more gracefully.
*/
if (flags & FAULT_FLAG_USER)
mem_cgroup_oom_enable();
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
ret = __handle_mm_fault(vma, address, flags);
if (flags & FAULT_FLAG_USER) {
mem_cgroup_oom_disable();
/*
* The task may have entered a memcg OOM situation but
* if the allocation error was handled gracefully (no
* VM_FAULT_OOM), there is no need to kill anything.
* Just clean up the OOM state peacefully.
*/
if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
mem_cgroup_oom_synchronize(false);
}
return ret;
}
↓
static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
{
struct vm_fault vmf = {
.vma = vma,
.address = address & PAGE_MASK,
.flags = flags,
.pgoff = linear_page_index(vma, address),
.gfp_mask = __get_fault_gfp_mask(vma),
};
unsigned int dirty = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
p4d_t *p4d;
int ret;
/* (3.3.1) 查找addr对应的pgd和p4d */
pgd = pgd_offset(mm, address);
p4d = p4d_alloc(mm, pgd, address);
if (!p4d)
return VM_FAULT_OOM;
/* (3.3.2) 查找addr对应的pud,如果还没分配空间 */
vmf.pud = pud_alloc(mm, p4d, address);
if (!vmf.pud)
return VM_FAULT_OOM;
if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
pud_t orig_pud = *vmf.pud;
barrier();
if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
/* NUMA case for anonymous PUDs would go here */
if (dirty && !pud_write(orig_pud)) {
ret = wp_huge_pud(&vmf, orig_pud);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
huge_pud_set_accessed(&vmf, orig_pud);
return 0;
}
}
}
/* (3.3.3) 查找addr对应的pmd,如果还没分配空间 */
vmf.pmd = pmd_alloc(mm, vmf.pud, address);
if (!vmf.pmd)
return VM_FAULT_OOM;
if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
pmd_t orig_pmd = *vmf.pmd;
barrier();
if (unlikely(is_swap_pmd(orig_pmd))) {
VM_BUG_ON(thp_migration_supported() &&
!is_pmd_migration_entry(orig_pmd));
if (is_pmd_migration_entry(orig_pmd))
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf, orig_pmd);
if (dirty && !pmd_write(orig_pmd)) {
ret = wp_huge_pmd(&vmf, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
huge_pmd_set_accessed(&vmf, orig_pmd);
return 0;
}
}
}
/* (3.3.4) 处理PTE的缺页 */
return handle_pte_fault(&vmf);
}
↓
static int handle_pte_fault(struct vm_fault *vmf)
{
pte_t entry;
/* (3.3.4.1) pmd为空,说明pte肯定还没有分配 */
if (unlikely(pmd_none(*vmf->pmd))) {
/*
* Leave __pte_alloc() until later: because vm_ops->fault may
* want to allocate huge page, and if we expose page table
* for an instant, it will be difficult to retract from
* concurrent faults and from rmap lookups.
*/
vmf->pte = NULL;
/* (3.3.4.2) pmd不为空,说明:
1、pte可能分配了,
2、也可能没分配只是临近的PTE分配连带创建了pmd
*/
} else {
/* See comment in pte_alloc_one_map() */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
* A regular pmd is established and it can't morph into a huge
* pmd from under us anymore at this point because we hold the
* mmap_sem read mode and khugepaged takes it in write mode.
* So now it's safe to run pte_offset_map().
*/
vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
vmf->orig_pte = *vmf->pte;
/*
* some architectures can have larger ptes than wordsize,
* e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
* CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
* accesses. The code below just needs a consistent view
* for the ifs and we later double check anyway with the
* ptl lock held. So here a barrier will do.
*/
barrier();
/* pte还没分配 */
if (pte_none(vmf->orig_pte)) {
pte_unmap(vmf->pte);
vmf->pte = NULL;
}
}
/* (3.3.4.3) 如果PTE都没有创建过,那是新的缺页异常 */
if (!vmf->pte) {
if (vma_is_anonymous(vmf->vma))
/* (3.3.4.3.1) 匿名映射 */
return do_anonymous_page(vmf);
else
/* (3.3.4.3.2) 文件映射 */
return do_fault(vmf);
}
/* (3.3.4.4) 如果PTE存在,但是page不存在,page是被swap出去了 */
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
if (vmf->flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
return do_wp_page(vmf);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
vmf->flags & FAULT_FLAG_WRITE)) {
update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
} else {
/*
* This is needed only for protection faults but the arch code
* is not yet telling us if this is a protection fault or not.
* This still avoids useless tlb flushes for .text page faults
* with threads.
*/
if (vmf->flags & FAULT_FLAG_WRITE)
flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
}
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
return 0;
}
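handle_pte_fault() is therefore the dispatcher: a missing PTE goes to do_anonymous_page() or do_fault(), a swapped-out PTE goes to do_swap_page(), and a write through a read-only PTE ends up in do_wp_page(). Before diving into each handler, here is a minimal user-space sketch (not kernel code; the file path is just an example) that exercises the first two paths and watches them through the fault counters reported by getrusage():

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/resource.h>

static void report(const char *tag)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%-10s minflt=%ld majflt=%ld\n", tag, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    report("start");

    /* anonymous mapping: the first write goes through do_anonymous_page() */
    char *anon = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (anon == MAP_FAILED)
        return 1;
    anon[0] = 1;
    report("anon write");

    /* file mapping: the first read goes through do_fault()/do_read_fault() */
    int fd = open("/etc/hostname", O_RDONLY);
    char *file = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (fd < 0 || file == MAP_FAILED)
        return 1;
    volatile char c = file[0];
    (void)c;
    report("file read");
    return 0;
}

Each touched page shows up as one additional fault; whether the file read is a minor or a major fault depends on whether the page is already in the page cache (see filemap_fault() below).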
|→
2.3.2.1 do_anonymous_page()
static int do_anonymous_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct mem_cgroup *memcg;
struct page *page;
int ret = 0;
pte_t entry;
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
/*
* Use pte_alloc() instead of pte_alloc_map(). We can't run
* pte_offset_map() on pmds where a huge pmd might be created
* from a different thread.
*
* pte_alloc_map() is safe to use under down_write(mmap_sem) or when
* parallel threads are excluded by other means.
*
* Here we only have down_read(mmap_sem).
*/
/* (3.3.4.3.1.1) allocate the pte table */
if (pte_alloc(vma->vm_mm, vmf->pmd, vmf->address))
return VM_FAULT_OOM;
/* See the comment in pte_alloc_one_map() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;
/* Use the zero-page for reads */
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
vma->vm_page_prot));
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
if (!pte_none(*vmf->pte))
goto unlock;
ret = check_stable_address_space(vma->vm_mm);
if (ret)
goto unlock;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return handle_userfault(vmf, VM_UFFD_MISSING);
}
goto setpte;
}
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
/* (3.3.4.3.1.2) allocate the physical page */
page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
if (!page)
goto oom;
if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
goto oom_free_page;
/*
* The memory barrier inside __SetPageUptodate makes sure that
* preceeding stores to the page contents become visible before
* the set_pte_at() write.
*/
__SetPageUptodate(page);
/* (3.3.4.3.1.3) build the pte entry value */
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (!pte_none(*vmf->pte))
goto release;
ret = check_stable_address_space(vma->vm_mm);
if (ret)
goto release;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
return handle_userfault(vmf, VM_UFFD_MISSING);
}
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
/* (3.3.4.3.1.4) set the pte */
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
release:
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
goto unlock;
oom_free_page:
put_page(page);
oom:
return VM_FAULT_OOM;
}
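One detail worth calling out is the zero-page branch: a read fault on an untouched anonymous page allocates nothing, it only maps the shared zero page read-only. The private page is allocated later, when the page is first written, and that second fault goes through do_wp_page(). A small user-space sketch of this behaviour, assuming nothing beyond mmap() and getrusage():

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minflt(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    long a = minflt();
    volatile char c = p[0];          /* read: the shared zero page is mapped read-only */
    long b = minflt();
    p[0] = 42;                       /* write: faults again, now a private page is allocated */
    long d = minflt();
    (void)c;

    printf("read fault: %ld, write fault: %ld\n", b - a, d - b);
    return 0;
}

Typically both counts come out as 1: the same virtual page faults twice, once for the read and once for the write.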
2.3.2.2 do_fault()
static int do_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *vm_mm = vma->vm_mm;
int ret;
/*
* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
*/
if (!vma->vm_ops->fault) {
/*
* If we find a migration pmd entry or a none pmd entry, which
* should never happen, return SIGBUS
*/
if (unlikely(!pmd_present(*vmf->pmd)))
ret = VM_FAULT_SIGBUS;
else {
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
vmf->pmd,
vmf->address,
&vmf->ptl);
/*
* Make sure this is not a temporary clearing of pte
* by holding ptl and checking again. A R/M/W update
* of pte involves: take ptl, clearing the pte so that
* we don't have concurrent modification by hardware
* followed by an update.
*/
if (unlikely(pte_none(*vmf->pte)))
ret = VM_FAULT_SIGBUS;
else
ret = VM_FAULT_NOPAGE;
pte_unmap_unlock(vmf->pte, vmf->ptl);
}
} else if (!(vmf->flags & FAULT_FLAG_WRITE))
/* (1) handle a missing page of a regular file (read fault) */
ret = do_read_fault(vmf);
else if (!(vma->vm_flags & VM_SHARED))
/* (2) copy-on-write handling */
ret = do_cow_fault(vmf);
else
/* (3) shared file mapping handling */
ret = do_shared_fault(vmf);
/* preallocated pagetable is unused: free it */
if (vmf->prealloc_pte) {
pte_free(vm_mm, vmf->prealloc_pte);
vmf->prealloc_pte = NULL;
}
return ret;
}
2.3.2.3 do_read_fault()
static int do_read_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
int ret = 0;
/*
* Let's call ->map_pages() first and use ->fault() as fallback
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
/* (1.1) First try to prepare several pages in one go, via vm_ops->map_pages() */
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
ret = do_fault_around(vmf);
if (ret)
return ret;
}
/* (1.2) If that fails, fall back to preparing a single page, via vm_ops->fault() */
ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
ret |= finish_fault(vmf);
unlock_page(vmf->page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
put_page(vmf->page);
return ret;
}
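The fault-around optimisation at (1.1) is worth a quick detour before we descend into __do_fault(): it is controlled by the fault_around_bytes tunable (64KB by default, i.e. 16 pages with 4KB pages, exported in debugfs), and map_pages() only installs PTEs for neighbouring pages that are already in the page cache. A hedged user-space sketch (the library path is just an example, and the exact counts depend on what happens to be cached):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minflt(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    int fd = open("/usr/lib64/libc-2.17.so", O_RDONLY);   /* any hot file works */
    if (fd < 0)
        return 1;
    char *p = mmap(NULL, 16 * 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    long before = minflt();
    volatile char c = p[0];            /* one real fault ...                       */
    long first = minflt();
    for (int i = 1; i < 16; i++)
        c = p[(long)i * 4096];         /* ... usually already mapped by fault-around */
    long rest = minflt();
    (void)c;

    printf("first touch: %ld fault(s), next 15 pages: %ld fault(s)\n",
           first - before, rest - first);
    return 0;
}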
↓
static int __do_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
int ret;
/*
* Preallocate pte before we take page_lock because this might lead to
* deadlocks for memcg reclaim which waits for pages under writeback:
*                              lock_page(A)
*                              SetPageWriteback(A)
*                              unlock_page(A)
* lock_page(B)
*                              lock_page(B)
* pte_alloc_pne
*   shrink_page_list
*     wait_on_page_writeback(A)
*                              SetPageWriteback(B)
*                              unlock_page(B)
*                              # flush A, B to clear the writeback
*/
if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
vmf->address);
if (!vmf->prealloc_pte)
return VM_FAULT_OOM;
smp_wmb(); /* See comment in __pte_alloc() */
}
/* (1.2.1) Call the concrete vm_ops->fault() implementation of this vma */
ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
VM_FAULT_DONE_COW)))
return ret;
if (unlikely(PageHWPoison(vmf->page))) {
if (ret & VM_FAULT_LOCKED)
unlock_page(vmf->page);
put_page(vmf->page);
vmf->page = NULL;
return VM_FAULT_HWPOISON;
}
if (unlikely(!(ret & VM_FAULT_LOCKED)))
lock_page(vmf->page);
else
VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
return ret;
}
Taking ext4 as an example, its vm_ops is implemented as follows:
static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = ext4_filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
};
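ext4 is only one implementer of this interface: any file system or character driver that supports mmap() fills in the same hooks. As a rough, hypothetical sketch (demo_vm_fault, demo_vm_ops, demo_buf and DEMO_NPAGES are invented names, not from any real driver; the buffer is assumed to be a kmalloc'ed, page-aligned area set up in the driver's ->mmap(); the signature matches the kernel version quoted here), a driver's .fault usually just translates vmf->pgoff into one of its own pages:

#define DEMO_NPAGES 16
static char *demo_buf;   /* assumed: kmalloc'ed buffer of DEMO_NPAGES pages */

static int demo_vm_fault(struct vm_fault *vmf)
{
    struct page *page;

    if (vmf->pgoff >= DEMO_NPAGES)
        return VM_FAULT_SIGBUS;

    /* translate the faulting file offset into one of our own pages */
    page = virt_to_page(demo_buf + (vmf->pgoff << PAGE_SHIFT));
    get_page(page);
    vmf->page = page;    /* the core mm installs the PTE in finish_fault() */
    return 0;
}

static const struct vm_operations_struct demo_vm_ops = {
    .fault = demo_vm_fault,
};

The contract is exactly what __do_fault() relies on above: return the page with a reference held (and optionally VM_FAULT_LOCKED), and let the core build the PTE. ext4 does the same thing, except that its page comes out of the page cache, as its handler shows next.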
Let's briefly analyse its single-page fault handler ext4_filemap_fault():
int ext4_filemap_fault(struct vm_fault *vmf)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
int err;
down_read(&EXT4_I(inode)->i_mmap_sem);
err = filemap_fault(vmf);
up_read(&EXT4_I(inode)->i_mmap_sem);
return err;
}
↓
int filemap_fault(struct vm_fault *vmf)
{
int error;
struct file *file = vmf->vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
pgoff_t max_off;
struct page *page;
int ret = 0;
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(offset >= max_off))
return VM_FAULT_SIGBUS;
/*
* Do we have something in the page cache already?
*/
/* (1.2.1.1) Try to find the page for this offset in the file's page cache */
page = find_get_page(mapping, offset);
if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
/*
* We found the page, so try async readahead before
* waiting for the lock.
*/
do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
} else if (!page) {
/* No page in the page cache at all */
/* If the page cache has no page for this offset, read the file into the cache first */
do_sync_mmap_readahead(vmf->vma, ra, file, offset);
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
retry_find:
/* then look up the page again */
page = find_get_page(mapping, offset);
if (!page)
goto no_cached_page;
}
/* (1.2.1.2) lock the page */
if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
put_page(page);
return ret | VM_FAULT_RETRY;
}
/* Did it get truncated? */
if (unlikely(page->mapping != mapping)) {
unlock_page(page);
put_page(page);
goto retry_find;
}
VM_BUG_ON_PAGE(page->index != offset, page);
/*
* We have a locked page in the page cache, now we need to check
* that it's up-to-date. If not, it is going to be due to an error.
*/
if (unlikely(!PageUptodate(page)))
goto page_not_uptodate;
/*
* Found the page and have a reference on it.
* We must recheck i_size under page lock.
*/
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(offset >= max_off)) {
unlock_page(page);
put_page(page);
return VM_FAULT_SIGBUS;
}
/* (1.2.1.3) Return the page, now correctly filled with the file contents at this offset */
vmf->page = page;
return ret | VM_FAULT_LOCKED;
no_cached_page:
/*
* We're only likely to ever get here if MADV_RANDOM is in
* effect.
*/
error = page_cache_read(file, offset, vmf->gfp_mask);
/*
* The page we want has now been added to the page cache.
* In the unlikely event that someone removed it in the
* meantime, we'll just come back here and read it again.
*/
if (error >= 0)
goto retry_find;
/*
* An error return from page_cache_read can result if the
* system is low on memory, or a problem occurs while trying
* to schedule I/O.
*/
if (error == -ENOMEM)
return VM_FAULT_OOM;
return VM_FAULT_SIGBUS;
page_not_uptodate:
/*
* Umm, take care of errors if the page isn't up-to-date.
* Try to re-read it _once_. We do this synchronously,
* because there really aren't any performance issues here
* and we need to check for errors.
*/
ClearPageError(page);
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
if (!PageUptodate(page))
error = -EIO;
}
put_page(page);
if (!error || error == AOP_TRUNCATED_PAGE)
goto retry_find;
/* Things didn't work out. Return zero to tell the mm layer so. */
shrink_readahead_size_eio(file, ra);
return VM_FAULT_SIGBUS;
}
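filemap_fault() is also where the minor/major distinction comes from: if the page is already in the page cache the fault is a cheap minor fault, otherwise the data has to be read from disk and the fault is accounted as PGMAJFAULT / VM_FAULT_MAJOR. A hedged user-space sketch (POSIX_FADV_DONTNEED is only a hint, so the major fault is not guaranteed; the scratch file name is made up):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;
    char buf[4096] = { 'x' };

    int fd = open("fault-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, buf, sizeof(buf)) != sizeof(buf))
        return 1;
    fsync(fd);                                    /* make the page clean ...            */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* ... then ask to drop it from cache */

    char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    volatile char c = p[0];                       /* likely VM_FAULT_MAJOR: read from disk */
    (void)c;

    getrusage(RUSAGE_SELF, &ru);
    printf("majflt=%ld minflt=%ld\n", ru.ru_majflt, ru.ru_minflt);
    return 0;
}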
2.3.2.4 do_cow_fault()
static int do_cow_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
int ret;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
/* (2.1) allocate a new page: cow_page */
vmf->cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
if (!vmf->cow_page)
return VM_FAULT_OOM;
if (mem_cgroup_try_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL,
&vmf->memcg, false)) {
put_page(vmf->cow_page);
return VM_FAULT_OOM;
}
/* (2.2) bring in the original page's contents */
ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
if (ret & VM_FAULT_DONE_COW)
return ret;
/* (2.3) copy the page's contents into cow_page */
copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
__SetPageUptodate(vmf->cow_page);
/* (2.4) commit the PTE */
ret |= finish_fault(vmf);
unlock_page(vmf->page);
put_page(vmf->page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
return ret;
uncharge_out:
mem_cgroup_cancel_charge(vmf->cow_page, vmf->memcg, false);
put_page(vmf->cow_page);
return ret;
}
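In user-space terms, do_cow_fault() is what gives MAP_PRIVATE file mappings their semantics: the faulting write lands in a private anonymous copy (cow_page) and the underlying file is left untouched. A small sketch, assuming an arbitrary scratch file:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("cow-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, "hello", 5) != 5)
        return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    p[0] = 'H';                      /* write fault on a private file page -> do_cow_fault() */

    char buf[6] = { 0 };
    pread(fd, buf, 5, 0);
    printf("mapping sees: %.5s, file still holds: %s\n", p, buf);
    return 0;
}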
2.3.2.5 do_shared_fault()
static int do_shared_fault(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
int ret, tmp;
/* (3.1) bring the shared page's contents up to date */
ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
/*
* Check if the backing address space wants to know that the page is
* about to become writable
*/
if (vma->vm_ops->page_mkwrite) {
unlock_page(vmf->page);
tmp = do_page_mkwrite(vmf);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(vmf->page);
return tmp;
}
}
ret |= finish_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
VM_FAULT_RETRY))) {
unlock_page(vmf->page);
put_page(vmf->page);
return ret;
}
fault_dirty_shared_page(vma, vmf->page);
return ret;
}
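The shared path is the mirror image of the COW path: the write dirties the shared page cache page itself (page_mkwrite() gives the file system a chance to reserve space or start a journal transaction first), so every other mapper and every read() sees the change, and writeback eventually pushes it to disk. A sketch with an assumed scratch file:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("shared-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, "hello", 5) != 5)
        return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    p[0] = 'H';                      /* write fault -> do_shared_fault(), page_mkwrite() */
    msync(p, 4096, MS_SYNC);         /* force the now-dirty page cache page back to disk */

    char buf[6] = { 0 };
    pread(fd, buf, 5, 0);
    printf("file now holds: %s\n", buf);
    return 0;
}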
2.3.2.6 do_swap_page()
int do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache = NULL;
struct mem_cgroup *memcg;
struct vma_swap_readahead swap_ra;
swp_entry_t entry;
pte_t pte;
int locked;
int exclusive = 0;
int ret = 0;
bool vma_readahead = swap_use_vma_readahead();
if (vma_readahead) {
page = swap_readahead_detect(vmf, &swap_ra);
swapcache = page;
}
if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
if (page)
put_page(page);
goto out;
}
/* (3.3.4.4.1) read the swap entry out of the PTE */
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
vmf->address);
} else if (is_device_private_entry(entry)) {
/*
* For un-addressable device memory we call the pgmap
* fault handler callback. The callback must migrate
* the page back to some CPU accessible page.
*/
ret = device_private_entry_fault(vma, vmf->address, entry,
vmf->flags, vmf->pmd);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
} else {
print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
ret = VM_FAULT_SIGBUS;
}
goto out;
}
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
if (!page) {
page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
vmf->address);
swapcache = page;
}
/* (3.3.4.4.2) allocate a new page and read the swapped-out contents back into it */
if (!page) {
struct swap_info_struct *si = swp_swap_info(entry);
if (si->flags & SWP_SYNCHRONOUS_IO &&
__swap_count(si, entry) == 1) {
/* skip swapcache */
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
if (page) {
__SetPageLocked(page);
__SetPageSwapBacked(page);
set_page_private(page, entry.val);
lru_cache_add_anon(page);
swap_readpage(page, true);
}
} else {
if (vma_readahead)
page = do_swap_page_readahead(entry,
GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
else
page = swapin_readahead(entry,
GFP_HIGHUSER_MOVABLE, vma, vmf->address);
swapcache = page;
}
if (!page) {
/*
* Back out if somebody else faulted in this pte
* while we released the pte lock.
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto unlock;
}
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
} else if (PageHWPoison(page)) {
/*
* hwpoisoned dirty swapcache pages are kept for killing
* owner processes (which may be unknown at hwpoison time)
*/
ret = VM_FAULT_HWPOISON;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
swapcache = page;
goto out_release;
}
locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
if (!locked) {
ret |= VM_FAULT_RETRY;
goto out_release;
}
/*
* Make sure try_to_free_swap or reuse_swap_page or swapoff did not
* release the swapcache from under us. The page pin, and pte_same
* test below, are not enough to exclude that. Even if it is still
* swapcache, we need to check that the page's swap has not changed.
*/
if (unlikely((!PageSwapCache(page) ||
page_private(page) != entry.val)) && swapcache)
goto out_page;
page = ksm_might_need_to_copy(page, vma, vmf->address);
if (unlikely(!page)) {
ret = VM_FAULT_OOM;
page = swapcache;
goto out_page;
}
if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
&memcg, false)) {
ret = VM_FAULT_OOM;
goto out_page;
}
/*
* Back out if somebody else already faulted in this pte.
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
goto out_nomap;
if (unlikely(!PageUptodate(page))) {
ret = VM_FAULT_SIGBUS;
goto out_nomap;
}
/*
* The page isn't present yet, go ahead with the fault.
*
* Be careful about the sequence of operations here.
* To get its accounting right, reuse_swap_page() must be called
* while the page is counted on swap but not yet in mapcount i.e.
* before page_add_anon_rmap() and swap_free(); try_to_free_swap()
* must be called after the swap_free(), or it will never succeed.
*/
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
vmf->orig_pte = pte;
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
activate_page(page);
}
swap_free(entry);
if (mem_cgroup_swap_full(page) ||
(vma->vm_flags & VM_LOCKED) || PageMlocked(page))
try_to_free_swap(page);
unlock_page(page);
if (page != swapcache && swapcache) {
/*
* Hold the lock to avoid the swap entry to be reused
* until we take the PT lock for the pte_same() check
* (to avoid false positives from pte_same). For
* further safety release the lock after the swap_free
* so that the swap count won't change under a
* parallel locked swapcache.
*/
unlock_page(swapcache);
put_page(swapcache);
}
if (vmf->flags & FAULT_FLAG_WRITE) {
ret |= do_wp_page(vmf);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
goto out;
}
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
return ret;
out_nomap:
mem_cgroup_cancel_charge(page, memcg, false);
pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
unlock_page(page);
out_release:
put_page(page);
if (page != swapcache && swapcache) {
unlock_page(swapcache);
put_page(swapcache);
}
return ret;
}
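do_swap_page() normally only runs under memory pressure, which makes it awkward to demonstrate. On newer kernels (v5.4 and later, i.e. newer than the code quoted above) MADV_PAGEOUT lets a process push one of its own pages to swap explicitly, and touching the page afterwards then takes exactly this swap-in path. A hedged sketch, assuming swap or zram is enabled:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21               /* only honoured on v5.4+ kernels */
#endif

int main(void)
{
    struct rusage ru;
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 0xab, 4096);                    /* make the page anonymous and dirty */
    if (madvise(p, 4096, MADV_PAGEOUT))       /* ask the kernel to swap it out */
        perror("madvise(MADV_PAGEOUT)");

    volatile unsigned char c = p[100];        /* touch again: swap-in via do_swap_page() */
    getrusage(RUSAGE_SELF, &ru);
    printf("byte=0x%x majflt=%ld\n", c, ru.ru_majflt);
    return 0;
}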
This article is from 博客园 (cnblogs), author: pwl999. When reposting, please credit the original link: https://www.cnblogs.com/pwl999/p/15535001.html