Linux mem 1.1 Creating a user-space process address space: execve() in detail

1. How it works

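On Linux, a new process is usually created in two steps: fork() first copies the parent's process address space, and execve() then loads a new executable (exe) file and builds a new code segment, data segment, bss, heap, stack and mmap region.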
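The fork-then-execve pattern can be sketched in user space as follows. This is a minimal illustration, not code from the kernel; the helper name `spawn_and_wait` is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run `path` in a child process and return its exit
 * status, or -1 if the child could not be created or did not exit normally. */
int spawn_and_wait(const char *path, char *const argv[])
{
    assert(path != NULL && argv != NULL);

    pid_t pid = fork();          /* the child starts as a copy of the parent */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* In the child: replace the copied address space with a new image. */
        execve(path, argv, environ);
        _exit(127);              /* only reached if execve() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with `/bin/sh` and the arguments `sh -c "exit 7"` should return 7, since the child's whole address space is rebuilt from `/bin/sh` before it runs.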

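The new address space is not built by execve() alone, though. execve() and the dynamic linker ld build it together:

  • 1. execve() maps the exe file and ld (/lib/x86_64-linux-gnu/ld-2.27.so) into the process address space, then jumps to ld's entry point and hands control to ld.
  • 2. ld loads the shared libraries that the exe depends on, such as libc, and performs the dynamic linking.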


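This article only covers what execve() itself loads. The questions that matter are about the process address space: how many regions it has, where each one starts and how long it is, and which parts of the file each one maps.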

1.1 Fixed-address mapping

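When every region starts at a fixed address, the layout and its relationship to the file look like this:

[Figure: address-space layout with fixed base addresses]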

1.2 Randomized address mapping (ASLR)

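To harden the system and make vulnerabilities harder to exploit, the base address of each region in the process address space can be randomized. This is called ASLR (Address Space Layout Randomization). By placing the key regions of a process at random addresses, ASLR stops an attacker from reliably jumping to a known location in memory to reuse existing functions. Modern operating systems generally enable it to defend against attacks that rely on known addresses, such as return-to-libc.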

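Address space randomization has three levels:

0   off
1   partial randomization: code&data, stack, mmap and vdso are randomized
2   full randomization: level 1 plus heap randomization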

The ASLR level can be read and changed through a /proc file:

$ cat /proc/sys/kernel/randomize_va_space
2
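The same value can be read from a program. The sketch below assumes only that the /proc file shown above exists; the function name `read_aslr_level` is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: returns the ASLR level (0, 1 or 2),
 * or -1 if the /proc file cannot be read or parsed. */
int read_aslr_level(void)
{
    FILE *f = fopen("/proc/sys/kernel/randomize_va_space", "r");
    int level;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &level) != 1)
        level = -1;
    fclose(f);
    return level;
}
```

Changing the level means writing a new value to the same file, which normally requires root.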
  • exe format

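An ELF file has one of four types (ET_REL, ET_EXEC, ET_DYN, ET_CORE). A traditional exe file is ET_EXEC.

With ASLR on, the code&data of the exe file can be randomized too, so the exe has to be built as position-independent code. An exe built this way has type ET_DYN.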

The -pie option builds an exe file of type ET_DYN:

[Figure: building an exe with -pie]
Notes on the PIE/PIC-related gcc options:

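  • -fPIC and -fpic are compile-time options that generate position-independent code (PIC). Both make the code use relative addressing once it is loaded, and every access to a fixed address goes through the global offset table (GOT). The main difference between the two is whether the GOT size is limited: -fPIC places no limit on it, so -fPIC is the safer choice when in doubt.
  • -fPIE and -fpie differ from each other in the same way as -fPIC and -fpic. Otherwise they are much like -fPIC/-fpic; the difference is that -fPIC is meant for building shared libraries, while -fPIE is meant for building executables. Put plainly, -fPIE generates position-independent code for an executable.
  • -fPIE is a compiler option: the .o files compiled from .c or .cpp sources with it are position-independent object files. -pie is a linker option: with it, the linker links .o files compiled with -fPIE into a position-independent executable.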

On Ubuntu 18.04 there is no need to add these options by hand; gcc enables -pie by default.
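To see which of the two types an exe has, the e_type field can be read directly from its ELF header. The following is a small sketch; the function name `elf_type_of` is hypothetical and only the header layout from <elf.h> is relied on.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns e_type (ET_EXEC, ET_DYN, ...) of the ELF
 * file at `path`, or -1 if it cannot be read or is not an ELF file. */
int elf_type_of(const char *path)
{
    unsigned char hdr[18];      /* e_ident[16] followed by the 2-byte e_type */
    FILE *f = fopen(path, "rb");
    size_t n;

    if (f == NULL)
        return -1;
    n = fread(hdr, 1, sizeof(hdr), f);
    fclose(f);
    if (n != sizeof(hdr) || memcmp(hdr, ELFMAG, SELFMAG) != 0)
        return -1;

    /* e_type is stored in the byte order named by e_ident[EI_DATA]. */
    if (hdr[EI_DATA] == ELFDATA2MSB)
        return (hdr[16] << 8) | hdr[17];
    return (hdr[17] << 8) | hdr[16];
}
```

A program built with `gcc -fPIE -pie` should report ET_DYN for its own file, and one built with `gcc -no-pie` should report ET_EXEC; `readelf -h` prints the same field as "Type".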

  • ASLR layout

With ASLR on, the user-space layout of a process looks like this:

[Figure: user-space layout with ASLR enabled]

The base address of each key region now has a random offset added:

Region      Base address                Random offset               Where it is set
code&data   2/3 * DEFAULT_MAP_WINDOW    mmap64_rnd_bits (pages)     load_elf_binary() → arch_mmap_rnd()
heap        above the bss region        0x02000000                  load_elf_binary() → arch_randomize_brk()
stack       DEFAULT_MAP_WINDOW          0 - 0x3fffff (pages)        load_elf_binary() → setup_arg_pages() → randomize_stack_top()
mmap        below the stack region      mmap64_rnd_bits (pages)     load_elf_binary() → setup_new_exec() → arch_pick_mmap_layout()

1.3 File mapping

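Some mappings in the process address space are backed by a file: code and data, and the code and data of shared libraries mapped with mmap.

Others hold temporary data and are anonymous mappings: bss, heap and stack, and the bss of shared libraries mapped with mmap.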

  • bss

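The BSS (Block Started by Symbol) segment is the memory region that holds global and static variables that are uninitialized or initialized to 0.

How does execve() identify the bss when it loads the exe file?

If a PT_LOAD segment has MemSiz greater than FileSiz, the extra space that has no file content behind it is the bss. For example: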

# readelf -l test

Elf file type is DYN (Shared object file)
Entry point 0x610
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align

  LOAD           0x0000000000000da0 0x0000000000200da0 0x0000000000200da0
                 0x0000000000000280 0x00000000000002e8  RW     0x200000

 Section to Segment mapping:
  Segment Sections...

   03     .init_array .fini_array .dynamic .got .data .bss   

The bss of this segment is 0x00000000000002e8 - 0x0000000000000280 = 0x68 bytes long.

This matches the .bss section header:

# readelf -S test
There are 29 section headers, starting at offset 0x1ac8:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align

  [24] .bss              NOBITS           0000000000201020  00001020
       0000000000000068  0000000000000000  WA       0     0     32
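The arithmetic can be checked against both readelf outputs: the bss starts where the file-backed part of the segment ends (VirtAddr + FileSiz) and ends at VirtAddr + MemSiz. A small sketch, with the hypothetical names `bss_range` and `bss_of_segment`:

```c
#include <elf.h>
#include <stdint.h>

struct bss_range {
    uint64_t start;   /* first address not backed by file content */
    uint64_t end;     /* one past the last address of the segment */
};

/* The bss of a PT_LOAD segment is the part that exists in memory
 * (p_memsz) but not in the file (p_filesz). */
struct bss_range bss_of_segment(const Elf64_Phdr *ph)
{
    struct bss_range r;

    r.start = ph->p_vaddr + ph->p_filesz;
    r.end = ph->p_vaddr + ph->p_memsz;
    return r;
}
```

For the segment above, 0x200da0 + 0x280 = 0x201020 and 0x200da0 + 0x2e8 = 0x201088, which agrees with the .bss section address 0x201020 and size 0x68.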

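When execve() meets such a segment while loading the exe file, it gives the bss its own anonymous vma.

However, bss usually directly follows data in the address space, so a small bss is simply placed in the space left over at the end of the data vma: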

$ cat /proc/2915/maps 
563591f20000-563591f21000 r-xp 00000000 08:01 5249799                    /home/pwl/hook/test/test
563592120000-563592121000 r--p 00000000 08:01 5249799                    /home/pwl/hook/test/test
563592121000-563592122000 rw-p 00001000 08:01 5249799                    /home/pwl/hook/test/test // bss sits in the space left in the data vma
563592f53000-563592f55000 rw-p 00000000 00:00 0                          [heap]

Only when the bss is too large to fit in what is left of the data vma is a separate anonymous vma created:

$ cat /proc/2932/maps 
5623db1e5000-5623db1e6000 r-xp 00000000 08:01 5249799                    /home/pwl/hook/test/test
5623db3e5000-5623db3e6000 r--p 00000000 08:01 5249799                    /home/pwl/hook/test/test
5623db3e6000-5623db3e7000 rw-p 00001000 08:01 5249799                    /home/pwl/hook/test/test // part of the bss sits in the data vma
5623db3e7000-5623db3f0000 rw-p 00000000 00:00 0 	// the rest of the bss gets its own anonymous vma
5623db787000-5623db789000 rw-p 00000000 00:00 0                          [heap]
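Whether a separate anonymous vma appears comes down to page alignment: both ends of the bss are rounded up to a page boundary, and only a non-empty range between them needs anonymous pages. The sketch below models this. It assumes 4 KiB pages, the names `PAGE_SZ`, `page_align_up` and `anon_bss_pages` are hypothetical, and the 0x9000-byte bss in the usage note is an illustrative value, not one taken from the second test program.

```c
#include <stdint.h>

#define PAGE_SZ 4096ULL   /* assumption: 4 KiB pages */

static uint64_t page_align_up(uint64_t addr)
{
    return (addr + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
}

/* Number of anonymous pages needed for a bss occupying
 * [bss_start, bss_end). The part of the bss below the first page
 * boundary lives in the tail of the file-backed data page. */
uint64_t anon_bss_pages(uint64_t bss_start, uint64_t bss_end)
{
    uint64_t start = page_align_up(bss_start);
    uint64_t end = page_align_up(bss_end);

    return end > start ? (end - start) / PAGE_SZ : 0;
}
```

For the first program, `anon_bss_pages(0x201020, 0x201088)` is 0, so no extra vma appears. A bss of 0x9000 bytes starting at the same address would need 9 anonymous pages, the same size as the anonymous vma 5623db3e7000-5623db3f0000 in the second listing.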
  • data split

When execve() loads an exe file, the file is usually loaded as two PT_LOAD segments:

1. The first segment contains:

   02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .plt.got .text .fini .rodata .eh_frame_hdr .eh_frame 

It is normally loaded as a single read-only, executable vma:

5623db1e5000-5623db1e6000 r-xp 00000000 08:01 5249799                    /home/pwl/hook/test/test

2. The second segment contains:

   03     .init_array .fini_array .dynamic .got .data .bss 

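Leaving aside .bss, which is handled specially and may get its own anonymous vma, the rest ought to become a single read-write vma. In practice it shows up as two vmas, one read-only and one read-write: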

5623db3e5000-5623db3e6000 r--p 00000000 08:01 5249799                    /home/pwl/hook/test/test	// read-only
5623db3e6000-5623db3e7000 rw-p 00001000 08:01 5249799                    /home/pwl/hook/test/test	// read-write

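Why? After execve() has finished, ld calls mprotect() during its own startup to make the first part of this data read-only, protecting data that should no longer be written (the mechanism commonly known as RELRO). This splits the single vma in two:

1. .init_array .fini_array .dynamic .got, read-only.
2. .data, read-write.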

  • layout mmap

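Putting these special cases together, the final layout maps onto the file like this:

[Figure: mapping between the address-space layout and the file]

It corresponds to the observed maps output as follows:

[Figure: the layout matched against the /proc maps output]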

1.4 stack

Before handing control to ld, execve() has already built the contents of the user stack, as shown here:

[Figure: contents of the initial user stack]
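On Linux the initial stack generally carries argc, the argv and envp pointer arrays and an auxiliary vector, and part of that vector can be read back with glibc's getauxval(). The sketch below prints a few entries; the function name `dump_exec_auxv` is hypothetical, and the choice of entries is only an illustration.

```c
#include <elf.h>
#include <stdio.h>
#include <sys/auxv.h>

/* Print a few auxiliary-vector entries that the kernel placed on the
 * stack for this process. Returns how many of them were non-zero. */
int dump_exec_auxv(void)
{
    static const struct {
        unsigned long type;
        const char *name;
    } entries[] = {
        { AT_PHDR,   "AT_PHDR   (program headers of the exe)" },
        { AT_PHNUM,  "AT_PHNUM  (number of program headers)" },
        { AT_ENTRY,  "AT_ENTRY  (entry point of the exe)" },
        { AT_BASE,   "AT_BASE   (load address of the interpreter)" },
        { AT_PAGESZ, "AT_PAGESZ (page size)" },
    };
    int nonzero = 0;

    for (size_t i = 0; i < sizeof(entries) / sizeof(entries[0]); i++) {
        unsigned long value = getauxval(entries[i].type);

        printf("%-45s %#lx\n", entries[i].name, value);
        if (value != 0)
            nonzero++;
    }
    return nonzero;
}
```

AT_BASE is typically 0 for a statically linked program, since no interpreter is loaded in that case.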

2. Code walkthrough

2.1 execve()

SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}

↓

int do_execve(struct filename *filename,
	const char __user *const __user *__argv,
	const char __user *const __user *__envp)
{
	struct user_arg_ptr argv = { .ptr.native = __argv };
	struct user_arg_ptr envp = { .ptr.native = __envp };
	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
}

↓

static int do_execveat_common(int fd, struct filename *filename,
			      struct user_arg_ptr argv,
			      struct user_arg_ptr envp,
			      int flags)
{
	char *pathbuf = NULL;
	struct linux_binprm *bprm;
	struct file *file;
	struct files_struct *displaced;
	int retval;

	if (IS_ERR(filename))
		return PTR_ERR(filename);

	/*
	 * We move the actual failure in case of RLIMIT_NPROC excess from
	 * set*uid() to execve() because too many poorly written programs
	 * don't check setuid() return code.  Here we additionally recheck
	 * whether NPROC limit is still exceeded.
	 */
    /* (1) Check whether the current user has exceeded the process-count limit */
	if ((current->flags & PF_NPROC_EXCEEDED) &&
	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
		retval = -EAGAIN;
		goto out_ret;
	}

	/* We're below the limit (still or again), so we don't want to make
	 * further execve() calls fail. */
	current->flags &= ~PF_NPROC_EXCEEDED;

    /* (2) Make a private copy of the current process's file table */
	retval = unshare_files(&displaced);
	if (retval)
		goto out_ret;

    /* (3) Allocate bprm
            bprm = binary parameter
            It holds the parameters used while loading the binary
     */
	retval = -ENOMEM;
	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
	if (!bprm)
		goto out_files;

    /* (4) Prepare the credentials for the new program */
	retval = prepare_bprm_creds(bprm);
	if (retval)
		goto out_free;

    /* (5) Check whether it is safe to exec the program */
	check_unsafe_exec(bprm);
	current->in_execve = 1;

    /* (6) Open the named file to get the file handle bprm->file,
            and deny writes to it via deny_write_access()
     */
	file = do_open_execat(fd, filename, flags);
	retval = PTR_ERR(file);
	if (IS_ERR(file))
		goto out_unmark;

	sched_exec();

	bprm->file = file;

    /* (6.1) Work out bprm->filename and bprm->interp */
	if (fd == AT_FDCWD || filename->name[0] == '/') {
		bprm->filename = filename->name;
	} else {
		if (filename->name[0] == '\0')
			pathbuf = kasprintf(GFP_KERNEL, "/dev/fd/%d", fd);
		else
			pathbuf = kasprintf(GFP_KERNEL, "/dev/fd/%d/%s",
					    fd, filename->name);
		if (!pathbuf) {
			retval = -ENOMEM;
			goto out_unmark;
		}
		/*
		 * Record that a name derived from an O_CLOEXEC fd will be
		 * inaccessible after exec. Relies on having exclusive access to
		 * current->files (due to unshare_files above).
		 */
		if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
			bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
		bprm->filename = pathbuf;
	}
	bprm->interp = bprm->filename;

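    /* (7) Create the argument area (one page; at this point it is a temporary mapping that cannot be accessed directly):
            create a temporary mm: bprm->mm
            create a vma for the argument area, one page at the very top of user space:
                vma->vm_end = STACK_TOP_MAX;
                vma->vm_start = vma->vm_end - PAGE_SIZE;
                bprm->p = vma->vm_end - sizeof(void *);     // points at the top of the vma
            Note: only the vma is allocated here, no physical memory. The mm is not even tsk->mm,
                so touching the address cannot trigger a page fault to allocate memory;
                other functions have to allocate the physical pages for this area explicitly
     */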
	retval = bprm_mm_init(bprm);
	if (retval)
		goto out_unmark;

    /* (8.1) Count the strings in argv */
	bprm->argc = count(argv, MAX_ARG_STRINGS);
	if ((retval = bprm->argc) < 0)
		goto out;

    /* (8.2) Count the strings in envp */
	bprm->envc = count(envp, MAX_ARG_STRINGS);
	if ((retval = bprm->envc) < 0)
		goto out;

    /* (8.3) Read the first BINPRM_BUF_SIZE bytes of the file being executed (128 or 256, depending on the kernel version) into bprm->buf[] */
	retval = prepare_binprm(bprm);
	if (retval < 0)
		goto out;

    /* (8.4) Copy bprm->filename to the very top of the argument area.
            The physical memory for the argument area is actually allocated inside this function
     */
	retval = copy_strings_kernel(1, &bprm->filename, bprm);
	if (retval < 0)
		goto out;

	bprm->exec = bprm->p;
    /* (8.4) Copy envp into the argument area, filling it from high addresses to low */
	retval = copy_strings(bprm->envc, envp, bprm);
	if (retval < 0)
		goto out;

    /* (8.5) Copy argv into the argument area, filling it from high addresses to low */
	retval = copy_strings(bprm->argc, argv, bprm);
	if (retval < 0)
		goto out;

    /* (9) Try each binary format handler in turn to parse and run the file */
	retval = exec_binprm(bprm);
	if (retval < 0)
		goto out;

	/* execve succeeded */
    /* (10) Success: clean up */
	current->fs->in_exec = 0;
	current->in_execve = 0;
	membarrier_execve(current);
	acct_update_integrals(current);
	task_numa_free(current, false);
	free_bprm(bprm);
	kfree(pathbuf);
	putname(filename);
	if (displaced)
		put_files_struct(displaced);
	return retval;

out:
	if (bprm->mm) {
		acct_arg_size(bprm, 0);
		mmput(bprm->mm);
	}

out_unmark:
	current->fs->in_exec = 0;
	current->in_execve = 0;

out_free:
	free_bprm(bprm);
	kfree(pathbuf);

out_files:
	if (displaced)
		reset_files_struct(displaced);
out_ret:
	putname(filename);
	return retval;
}

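As the code shows, most of do_execveat_common() is spent preparing bprm, the structure that carries the load parameters, and working out the values it needs. It also creates a one-page argument-area vma that holds filename + envp + argv.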

2.1.1 bprm_mm_init()

bprm_mm_init() creates the mm and the vma for the virtual addresses of the argument area:

static int bprm_mm_init(struct linux_binprm *bprm)
{
	int err;
	struct mm_struct *mm = NULL;

    /* (7.1) Allocate a temporary mm structure */
	bprm->mm = mm = mm_alloc();
	err = -ENOMEM;
	if (!mm)
		goto err;

    /* (7.2) Initialize the mm further */
	err = __bprm_mm_init(bprm);
	if (err)
		goto err;

	return 0;

err:
	if (mm) {
		bprm->mm = NULL;
		mmdrop(mm);
	}

	return err;
}

↓

static int __bprm_mm_init(struct linux_binprm *bprm)
{
	int err;
	struct vm_area_struct *vma = NULL;
	struct mm_struct *mm = bprm->mm;

    /* (7.2.1) Allocate a new vma */
	bprm->vma = vma = vm_area_alloc(mm);
	if (!vma)
		return -ENOMEM;

	if (down_write_killable(&mm->mmap_sem)) {
		err = -EINTR;
		goto err_free;
	}

	/*
	 * Place the stack at the largest stack address the architecture
	 * supports. Later, we'll move this to an appropriate place. We don't
	 * use STACK_TOP because that can depend on attributes which aren't
	 * configured yet.
	 */
	BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
    /* (7.2.2) Fill in the vma:
            it sits at the highest user-space address and is one page long
     */
	vma->vm_end = STACK_TOP_MAX;
	vma->vm_start = vma->vm_end - PAGE_SIZE;
	vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);

    /* (7.2.3) Insert the vma into the mm's red-black tree */
	err = insert_vm_struct(mm, vma);
	if (err)
		goto err;

	mm->stack_vm = mm->total_vm = 1;
	arch_bprm_mm_init(mm, vma);
	up_write(&mm->mmap_sem);

    /* (7.2.4) Point bprm->p at the top of the argument area */
	bprm->p = vma->vm_end - sizeof(void *);
	return 0;
err:
	up_write(&mm->mmap_sem);
err_free:
	bprm->vma = NULL;
	vm_area_free(vma);
	return err;
}
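The three assignments can be worked through with concrete numbers. The sketch below assumes an x86_64 system with a 47-bit user address space, where STACK_TOP_MAX would be 0x7ffffffff000, and 4 KiB pages; both constants and all names in it are assumptions made for the illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE     0x1000ULL          /* assumption: 4 KiB pages */
#define SKETCH_STACK_TOP_MAX 0x7ffffffff000ULL  /* assumption: x86_64, 47-bit user space */

struct arg_area {
    uint64_t vm_start;
    uint64_t vm_end;
    uint64_t p;
};

/* Mirrors the address arithmetic of steps (7.2.2) and (7.2.4). */
struct arg_area initial_arg_area(void)
{
    struct arg_area a;

    a.vm_end = SKETCH_STACK_TOP_MAX;
    a.vm_start = a.vm_end - SKETCH_PAGE_SIZE;
    a.p = a.vm_end - sizeof(void *);
    return a;
}
```

Under these assumptions the vma covers 0x7fffffffe000-0x7ffffffff000, and on a 64-bit build p starts at 0x7fffffffeff8, one pointer below the top.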

2.1.2 copy_strings()

Physical pages for the argument area are only allocated in copy_strings_kernel() or copy_strings():

static int copy_strings(int argc, struct user_arg_ptr argv,
			struct linux_binprm *bprm)
{
	struct page *kmapped_page = NULL;
	char *kaddr = NULL;
	unsigned long kpos = 0;
	int ret;

    /* (8.4.1) Copy the user-space strings one at a time */
	while (argc-- > 0) {
		const char __user *str;
		int len;
		unsigned long pos;

		ret = -EFAULT;
        /* (8.4.2) Get the user-space address of the string to copy */
		str = get_user_arg_ptr(argv, argc);
		if (IS_ERR(str))
			goto out;

        /* (8.4.3) Get the length of the string to copy */
		len = strnlen_user(str, MAX_ARG_STRLEN);
		if (!len)
			goto out;

		ret = -E2BIG;
		if (!valid_arg_len(bprm, len))
			goto out;

		/* We're going to work our way backwords. */
        /* (8.4.4) Work out where in the argument area the string goes */
		pos = bprm->p;
		str += len;
		bprm->p -= len;

        /* (8.4.5) Copy the string backwards in chunks that never cross a page boundary */
		while (len > 0) {
			int offset, bytes_to_copy;

			if (fatal_signal_pending(current)) {
				ret = -ERESTARTNOHAND;
				goto out;
			}
			cond_resched();

			offset = pos % PAGE_SIZE;
			if (offset == 0)
				offset = PAGE_SIZE;

			bytes_to_copy = offset;
			if (bytes_to_copy > len)
				bytes_to_copy = len;

			offset -= bytes_to_copy;
			pos -= bytes_to_copy;
			str -= bytes_to_copy;
			len -= bytes_to_copy;

			if (!kmapped_page || kpos != (pos & PAGE_MASK)) {
				struct page *page;

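                /* (8.4.6) Get the physical page behind this part of the argument area, allocating one if it does not exist yet.
                        The allocation itself happens in __get_user_pages_locked(); see the mmap() article for the details
                 */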
				page = get_arg_page(bprm, pos, 1);
				if (!page) {
					ret = -E2BIG;
					goto out;
				}

				if (kmapped_page) {
					flush_kernel_dcache_page(kmapped_page);
					kunmap(kmapped_page);
					put_arg_page(kmapped_page);
				}
				kmapped_page = page;

                /* (8.4.7) Map the argument-area page into kernel virtual address space so that it can be accessed */
				kaddr = kmap(kmapped_page);
				kpos = pos & PAGE_MASK;
				flush_arg_page(bprm, kpos, kmapped_page);
			}

            /* (8.4.8) Copy the user-space string into its slot in the argument area */
			if (copy_from_user(kaddr+offset, str, bytes_to_copy)) {
				ret = -EFAULT;
				goto out;
			}
		}
	}
	ret = 0;
out:
	if (kmapped_page) {
		flush_kernel_dcache_page(kmapped_page);
		kunmap(kmapped_page);
		put_arg_page(kmapped_page);
	}
	return ret;
}
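The downward packing is easier to see without the page handling. The following user-space model copies the strings from the last to the first into the top of a flat buffer, the way copy_strings() moves bprm->p downward. It is a simplified sketch: the name `pack_strings` is hypothetical, and the model starts at the very top of the buffer rather than one pointer below it.

```c
#include <stddef.h>
#include <string.h>

/* Copy argv[argc-1] ... argv[0], each with its terminating NUL, to the
 * top of `area`, working downward. Returns the offset of argv[0] inside
 * `area`, or (size_t)-1 if the strings do not fit. */
size_t pack_strings(char *area, size_t size, int argc, char *const argv[])
{
    size_t p = size;                 /* plays the role of bprm->p */

    while (argc-- > 0) {
        size_t len = strlen(argv[argc]) + 1;

        if (len > p)
            return (size_t)-1;
        p -= len;                    /* move down, then fill upward */
        memcpy(area + p, argv[argc], len);
    }
    return p;
}
```

Packing "a" and "bc" into a 16-byte buffer leaves "bc\0" at offset 13 and "a\0" at offset 11, so argv[0] ends up at the lowest address and the strings follow one another in order.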

2.1.3 security_bprm_check()

In the LSM hook security_bprm_check(), the command-line arguments argv can be read back:

sys_execve() → do_execveat_common() → exec_binprm() → search_binary_handler() → security_bprm_check()

↓

static int get_argv_from_bprm(struct linux_binprm *bprm)
{
    unsigned long offset, pos;
    char *kaddr;
    struct page *page;
    char *argv;
    int i = 0;
    int argc;
    int count = 0;

    if (!bprm)
        return 0;

    /* A PAGE_SIZE array is too large for the kernel stack, so allocate it. */
    argv = kzalloc(PAGE_SIZE, GFP_KERNEL);
    if (!argv)
        return 0;

    argc = bprm->argc;

    /* argv was copied last, so bprm->p points at argv[0]; the strings follow back to back. */
    pos = bprm->p;
    do {
        offset = pos & ~PAGE_MASK;
        page = get_arg_page(bprm, pos, 0);
        if (!page)
            break;
        kaddr = kmap_atomic(page);

        /* i is kept across pages so that a string spanning two pages stays whole. */
        for (; offset < PAGE_SIZE && count < argc; offset++, pos++) {
            if (kaddr[offset] == '\0') {
                count++;
                printk("argv is %s\n", argv);
                memset(argv, 0, PAGE_SIZE);
                i = 0;
                continue;
            }
            /* Keep the last byte as the terminating NUL; longer strings are truncated. */
            if (i < PAGE_SIZE - 1)
                argv[i++] = kaddr[offset];
        }

        kunmap_atomic(kaddr);
        put_arg_page(page);
    } while (offset == PAGE_SIZE && count < argc);

    kfree(argv);
    return 0;
}
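The same NUL-separated layout is visible from user space, because /proc/self/cmdline exposes the argument strings of the running process. The sketch below counts them; the function name `count_cmdline_args` is hypothetical.

```c
#include <stdio.h>

/* Count the NUL-terminated strings in /proc/self/cmdline.
 * Returns -1 if the file cannot be opened. */
int count_cmdline_args(void)
{
    FILE *f = fopen("/proc/self/cmdline", "rb");
    int c;
    int n = 0;

    if (f == NULL)
        return -1;
    while ((c = fgetc(f)) != EOF) {
        if (c == '\0')
            n++;
    }
    fclose(f);
    return n;
}
```

For a process started normally, the result should equal argc, since each argument is stored with its terminating NUL.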

2.2 load_elf_binary()

Once bprm and the argument area are ready, the rest of execve is handed to a binfmt loader. The most common binfmt on Linux is the one for ELF:

static struct linux_binfmt elf_format = {
	.module		= THIS_MODULE,
	.load_binary	= load_elf_binary,
	.load_shlib	= load_elf_library,
	.core_dump	= elf_core_dump,
	.min_coredump	= ELF_EXEC_PAGESIZE,
};
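What an ELF loader has to find in the file can be previewed from user space: the program header table, the PT_INTERP path and the PT_LOAD segments. The sketch below walks the table of a 64-bit ELF file. The function name `scan_phdrs` is hypothetical, and it assumes the file uses the host's byte order.

```c
#include <elf.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the number of PT_LOAD headers in the 64-bit ELF file at `path`,
 * or -1 on error. If a PT_INTERP header is present and fits, its path is
 * copied into `interp`; otherwise `interp` is left empty. */
int scan_phdrs(const char *path, char *interp, size_t interp_len)
{
    Elf64_Ehdr eh;
    int loads = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (interp_len > 0)
        interp[0] = '\0';

    if (pread(fd, &eh, sizeof(eh), 0) != (ssize_t)sizeof(eh) ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64) {
        close(fd);
        return -1;
    }

    for (int i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        off_t off = (off_t)eh.e_phoff + (off_t)i * eh.e_phentsize;

        if (pread(fd, &ph, sizeof(ph), off) != (ssize_t)sizeof(ph)) {
            close(fd);
            return -1;
        }
        if (ph.p_type == PT_LOAD) {
            loads++;
        } else if (ph.p_type == PT_INTERP &&
                   ph.p_filesz > 0 && ph.p_filesz <= interp_len) {
            if (pread(fd, interp, ph.p_filesz, (off_t)ph.p_offset) !=
                (ssize_t)ph.p_filesz) {
                close(fd);
                return -1;
            }
            interp[ph.p_filesz - 1] = '\0';   /* the path must be NUL-terminated */
        }
    }
    close(fd);
    return loads;
}
```

Run on a dynamically linked exe, it should report at least one PT_LOAD segment and an interpreter path such as /lib64/ld-linux-x86-64.so.2.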

Its core is load_elf_binary(), which parses the ELF file, loads the PT_LOAD segments into memory and sets up the address-space mappings:

do_execveat_common() → exec_binprm() → search_binary_handler() → fmt->load_binary()

↓

static int load_elf_binary(struct linux_binprm *bprm)
{
	struct file *interpreter = NULL; /* to shut gcc up */
 	unsigned long load_addr = 0, load_bias = 0;
	int load_addr_set = 0;
	char * elf_interpreter = NULL;
	unsigned long error;
	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
	unsigned long elf_bss, elf_brk;
	int bss_prot = 0;
	int retval, i;
	unsigned long elf_entry;
	unsigned long interp_load_addr = 0;
	unsigned long start_code, end_code, start_data, end_data;
	unsigned long reloc_func_desc __maybe_unused = 0;
	int executable_stack = EXSTACK_DEFAULT;
	struct pt_regs *regs = current_pt_regs();
	struct {
		struct elfhdr elf_ex;
		struct elfhdr interp_elf_ex;
	} *loc;
	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
	loff_t pos;

	loc = kmalloc(sizeof(*loc), GFP_KERNEL);
	if (!loc) {
		retval = -ENOMEM;
		goto out_ret;
	}
	
	/* Get the exec-header */
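    /* (1.1) Copy the exe file's ELF header into loc->elf_ex.
            bprm->buf[] was filled in advance with the first BINPRM_BUF_SIZE bytes of the exe file (128 or 256, depending on the kernel version)
     */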
	loc->elf_ex = *((struct elfhdr *)bprm->buf);

	retval = -ENOEXEC;
	/* First of all, some simple consistency checks */
    /* (1.2) Check the magic number in the ELF header */
	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
		goto out;

    /* (1.3) Check that the file type is an executable (ET_EXEC) or a shared object (ET_DYN) */
	if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
		goto out;
    /* (1.4) Check that the file's architecture matches the running system */
	if (!elf_check_arch(&loc->elf_ex))
		goto out;
	if (elf_check_fdpic(&loc->elf_ex))
		goto out;
	if (!bprm->file->f_op->mmap)
		goto out;

    /* (2.1) Using the information in the exe's ELF header, read the program header table into elf_phdata */
	elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
	if (!elf_phdata)
		goto out;

	elf_ppnt = elf_phdata;
	elf_bss = 0;
	elf_brk = 0;

	start_code = ~0UL;
	end_code = 0;
	start_data = 0;
	end_data = 0;

    /* (2.2) Walk the exe's program header table to find the PT_INTERP segment.
            `readelf -l xxx` shows the interpreter path, usually "/lib64/ld-linux-x86-64.so.2".
            Then read the interpreter file's ELF header into loc->interp_elf_ex
     */
	for (i = 0; i < loc->elf_ex.e_phnum; i++) {

        /* (2.2.1) Found the PT_INTERP segment */
		if (elf_ppnt->p_type == PT_INTERP) {
			/* This is the program interpreter used for
			 * shared libraries - for now assume that this
			 * is an a.out format binary
			 */
			retval = -ENOEXEC;
			if (elf_ppnt->p_filesz > PATH_MAX || 
			    elf_ppnt->p_filesz < 2)
				goto out_free_ph;

			retval = -ENOMEM;
			elf_interpreter = kmalloc(elf_ppnt->p_filesz,
						  GFP_KERNEL);
			if (!elf_interpreter)
				goto out_free_ph;

			pos = elf_ppnt->p_offset;
            /* (2.2.2) Read the contents of the PT_INTERP segment,
                    i.e. the interpreter path: "/lib64/ld-linux-x86-64.so.2"
             */
			retval = kernel_read(bprm->file, elf_interpreter,
					     elf_ppnt->p_filesz, &pos);
			if (retval != elf_ppnt->p_filesz) {
				if (retval >= 0)
					retval = -EIO;
				goto out_free_interp;
			}
			/* make sure path is NULL terminated */
			retval = -ENOEXEC;
			if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0')
				goto out_free_interp;

            /* (2.2.3) Open the interpreter file to get a file handle */
			interpreter = open_exec(elf_interpreter);
			retval = PTR_ERR(interpreter);
			if (IS_ERR(interpreter))
				goto out_free_interp;

			/*
			 * If the binary is not readable then enforce
			 * mm->dumpable = 0 regardless of the interpreter's
			 * permissions.
			 */
			would_dump(bprm, interpreter);

			/* Get the exec headers */
			pos = 0;
             /* (2.2.4) Read the interpreter file's ELF header into interp_elf_ex */
			retval = kernel_read(interpreter, &loc->interp_elf_ex,
					     sizeof(loc->interp_elf_ex), &pos);
			if (retval != sizeof(loc->interp_elf_ex)) {
				if (retval >= 0)
					retval = -EIO;
				goto out_free_dentry;
			}

			break;
		}
		elf_ppnt++;
	}

	elf_ppnt = elf_phdata;
    /* (2.3) Walk the exe's program header table again:
            find the PT_GNU_STACK segment and decide whether the stack must be executable;
            find PT_LOPROC..PT_HIPROC segments and let the arch code validate them
     */
	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++)
		switch (elf_ppnt->p_type) {
		case PT_GNU_STACK:
			if (elf_ppnt->p_flags & PF_X)
				executable_stack = EXSTACK_ENABLE_X;
			else
				executable_stack = EXSTACK_DISABLE_X;
			break;

		case PT_LOPROC ... PT_HIPROC:
			retval = arch_elf_pt_proc(&loc->elf_ex, elf_ppnt,
						  bprm->file, false,
						  &arch_state);
			if (retval)
				goto out_free_dentry;
			break;
		}

	/* Some simple consistency checks for the interpreter */
    /* (2.4) Run some simple consistency checks on the interpreter file "/lib64/ld-linux-x86-64.so.2" */
	if (elf_interpreter) {
		retval = -ELIBBAD;
		/* Not an ELF interpreter */
        /* (2.4.1) Check the magic number in the ELF header */
		if (memcmp(loc->interp_elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
			goto out_free_dentry;
		/* Verify the interpreter has a valid arch */
        /* (2.4.2) Check that the file's architecture matches the running system */
		if (!elf_check_arch(&loc->interp_elf_ex) ||
		    elf_check_fdpic(&loc->interp_elf_ex))
			goto out_free_dentry;

		/* Load the interpreter program headers */
        /* (2.4.3) Read the program header table into interp_elf_phdata */
		interp_elf_phdata = load_elf_phdrs(&loc->interp_elf_ex,
						   interpreter);
		if (!interp_elf_phdata)
			goto out_free_dentry;

		/* Pass PT_LOPROC..PT_HIPROC headers to arch code */
		elf_ppnt = interp_elf_phdata;
        /* (2.4.4) Walk the interpreter's program header table and validate PT_LOPROC..PT_HIPROC segments */
		for (i = 0; i < loc->interp_elf_ex.e_phnum; i++, elf_ppnt++)
			switch (elf_ppnt->p_type) {
			case PT_LOPROC ... PT_HIPROC:
				retval = arch_elf_pt_proc(&loc->interp_elf_ex,
							  elf_ppnt, interpreter,
							  true, &arch_state);
				if (retval)
					goto out_free_dentry;
				break;
			}
	}

	/*
	 * Allow arch code to reject the ELF at this point, whilst it's
	 * still possible to return an error to the code that invoked
	 * the exec syscall.
	 */
    /* (2.5) Last point at which arch code can reject the ELF and still return an error to the caller of the exec syscall */
	retval = arch_check_elf(&loc->elf_ex,
				!!interpreter, &loc->interp_elf_ex,
				&arch_state);
	if (retval)
		goto out_free_dentry;

	/* Flush all traces of the currently running executable */
    /* (3) Release the resources of the process's old exe file and set up those of the new one */
	retval = flush_old_exec(bprm);
	if (retval)
		goto out_free_dentry;

	/* Do this immediately, since STACK_TOP as used in setup_arg_pages
	   may depend on the personality.  */
    /* (4) Set up current->personality */
	SET_PERSONALITY2(loc->elf_ex, &arch_state);
    /* (4.1) Readable mappings are also made executable (READ_IMPLIES_EXEC) */
	if (elf_read_implies_exec(loc->elf_ex, executable_stack))
		current->personality |= READ_IMPLIES_EXEC;
    /* (4.2) Address randomization */
	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
		current->flags |= PF_RANDOMIZE;

    /* (5) Set mm->mmap_base and related fields of the current process */
	setup_new_exec(bprm);
    /* (6) Install the credentials: task->real_cred/cred = bprm->cred */
	install_exec_creds(bprm);

	/* Do this so that we can load the interpreter, if need be.  We will
	   change some of these later */
    /* (7) Set up the address-space mapping of the stack region (anonymous mapping) */
	retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
				 executable_stack);
	if (retval < 0)
		goto out_free_dentry;
	
	current->mm->start_stack = bprm->p;

	/* Now we do a little grungy work by mmapping the ELF image into
	   the correct location in memory. */
    /* (8) Now map the ELF file itself: load every PT_LOAD segment at its proper place in memory (file mapping) */
	for(i = 0, elf_ppnt = elf_phdata;
	    i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
		int elf_prot = 0, elf_flags;
		unsigned long k, vaddr;
		unsigned long total_size = 0;

		if (elf_ppnt->p_type != PT_LOAD)
			continue;

        /* (8.1) If an earlier PT_LOAD segment had a bss (p_memsz > p_filesz), create an anonymous vma for it */
		if (unlikely (elf_brk > elf_bss)) {
			unsigned long nbyte;
	            
			/* There was a PT_LOAD segment with p_memsz > p_filesz
			   before this one. Map anonymous pages, if needed,
			   and clear the area.  */
			retval = set_brk(elf_bss + load_bias,
					 elf_brk + load_bias,
					 bss_prot);
			if (retval)
				goto out_free_dentry;
			nbyte = ELF_PAGEOFFSET(elf_bss);
			if (nbyte) {
				nbyte = ELF_MIN_ALIGN - nbyte;
				if (nbyte > elf_brk - elf_bss)
					nbyte = elf_brk - elf_bss;
				if (clear_user((void __user *)elf_bss +
							load_bias, nbyte)) {
					/*
					 * This bss-zeroing can fail if the ELF
					 * file specifies odd protections. So
					 * we don't check the return value
					 */
				}
			}
		}

        /* (8.2) Work out the protection flags of the mapping */
		if (elf_ppnt->p_flags & PF_R)
			elf_prot |= PROT_READ;
		if (elf_ppnt->p_flags & PF_W)
			elf_prot |= PROT_WRITE;
		if (elf_ppnt->p_flags & PF_X)
			elf_prot |= PROT_EXEC;

		elf_flags = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE;

		vaddr = elf_ppnt->p_vaddr;
		/*
		 * If we are loading ET_EXEC or we have already performed
		 * the ET_DYN load_addr calculations, proceed normally.
		 */
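        /* (8.3) An exe file is really one of two types: ET_EXEC or ET_DYN.
                ET_EXEC is the traditional exe file
                ET_DYN is position-independent code (PIE), built that way to support code randomization
         */
        /* (8.3.1) An ET_EXEC file, or any segment of an ET_DYN file after the first */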
		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
			elf_flags |= MAP_FIXED;
        /* (8.3.2) The first segment of an ET_DYN file:
                compute the random offset load_bias and the total length total_size that has to be mapped
         */
		} else if (loc->elf_ex.e_type == ET_DYN) {
			/*
			 * This logic is run once for the first LOAD Program
			 * Header for ET_DYN binaries to calculate the
			 * randomization (load_bias) for all the LOAD
			 * Program Headers, and to calculate the entire
			 * size of the ELF mapping (total_size). (Note that
			 * load_addr_set is set to true later once the
			 * initial mapping is performed.)
			 *
			 * There are effectively two types of ET_DYN
			 * binaries: programs (i.e. PIE: ET_DYN with INTERP)
			 * and loaders (ET_DYN without INTERP, since they
			 * _are_ the ELF interpreter). The loaders must
			 * be loaded away from programs since the program
			 * may otherwise collide with the loader (especially
			 * for ET_EXEC which does not have a randomized
			 * position). For example to handle invocations of
			 * "./ld.so someprog" to test out a new version of
			 * the loader, the subsequent program that the
			 * loader loads must avoid the loader itself, so
			 * they cannot share the same load range. Sufficient
			 * room for the brk must be allocated with the
			 * loader as well, since brk must be available with
			 * the loader.
			 *
			 * Therefore, programs are loaded offset from
			 * ELF_ET_DYN_BASE and loaders are loaded into the
			 * independently randomized mmap region (0 load_bias
			 * without MAP_FIXED).
			 */
			if (elf_interpreter) {
                /* (8.3.3) Compute the randomized base address of the exe: (2/3 * DEFAULT_MAP_WINDOW) + a random page offset of mmap64_rnd_bits bits */
				load_bias = ELF_ET_DYN_BASE;
				if (current->flags & PF_RANDOMIZE)
					load_bias += arch_mmap_rnd();
				elf_flags |= MAP_FIXED;
			} else
				load_bias = 0;

			/*
			 * Since load_bias is used for all subsequent loading
			 * calculations, we must lower it by the first vaddr
			 * so that the remaining calculations based on the
			 * ELF vaddrs will be correctly offset. The result
			 * is then page aligned.
			 */
			load_bias = ELF_PAGESTART(load_bias - vaddr);

            /* (8.3.4) Compute the total length of all PT_LOAD segments in the ET_DYN file */
			total_size = total_mapping_size(elf_phdata,
							loc->elf_ex.e_phnum);
			if (!total_size) {
				retval = -EINVAL;
				goto out_free_dentry;
			}
		}

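        /* (8.4) The most important step: mmap.
                Create the vma mapping from the segment's file offset and virtual address.
                There is one trick here: the first pass over an ET_DYN file has computed total_size, so total_size bytes are mmapped first and everything outside the current segment is then unmapped again, which guarantees one contiguous range for all the segments
         */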
		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
				elf_prot, elf_flags, total_size);
		if (BAD_ADDR(error)) {
			retval = IS_ERR((void *)error) ?
				PTR_ERR((void*)error) : -EINVAL;
			goto out_free_dentry;
		}

        /* (8.5) Address calculations for the first PT_LOAD segment */
		if (!load_addr_set) {
			load_addr_set = 1;
			load_addr = (elf_ppnt->p_vaddr - elf_ppnt->p_offset);
			if (loc->elf_ex.e_type == ET_DYN) {
				load_bias += error -
				             ELF_PAGESTART(load_bias + vaddr);
				load_addr += load_bias;
				reloc_func_desc = load_bias;
			}
		}

        /* (8.6.1) The lowest segment start address becomes start_code,
                 the highest segment start address becomes start_data
         */
		k = elf_ppnt->p_vaddr;
		if (k < start_code)
			start_code = k;
		if (start_data < k)
			start_data = k;

		/*
		 * Check to see if the section's size will overflow the
		 * allowed task size. Note that p_filesz must always be
		 * <= p_memsz so it is only necessary to check p_memsz.
		 */
		if (BAD_ADDR(k) || elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
		    elf_ppnt->p_memsz > TASK_SIZE ||
		    TASK_SIZE - elf_ppnt->p_memsz < k) {
			/* set_brk can never work. Avoid overflows. */
			retval = -EINVAL;
			goto out_free_dentry;
		}

        /* (8.6.2) If a PT_LOAD segment's elf_ppnt->p_filesz and elf_ppnt->p_memsz differ,
                    the region between p_filesz and p_memsz is a bss
         */
		k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;

		if (k > elf_bss)
			elf_bss = k;
		if ((elf_ppnt->p_flags & PF_X) && end_code < k)
			end_code = k;
		if (end_data < k)
			end_data = k;
		k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
		if (k > elf_brk) {
			bss_prot = elf_prot;
			elf_brk = k;
		}
	}

    /* (8.6.3) Add the random load offset to each of the pointers */
	loc->elf_ex.e_entry += load_bias;
	elf_bss += load_bias;
	elf_brk += load_bias;
	start_code += load_bias;
	end_code += load_bias;
	start_data += load_bias;
	end_data += load_bias;

	/* Calling set_brk effectively mmaps the pages that we need
	 * for the bss and break sections.  We must do this before
	 * mapping in the interpreter, to make sure it doesn't wind
	 * up getting placed where the bss needs to go.
	 */
    /* (8.7) Create the anonymous vma for the bss */
	retval = set_brk(elf_bss, elf_brk, bss_prot);
	if (retval)
		goto out_free_dentry;
	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
		retval = -EFAULT; /* Nobody gets to see this, but.. */
		goto out_free_dentry;
	}

    /* (8.8) If there is an interpreter file "/lib64/ld-linux-x86-64.so.2",
            map it into the process address space as well
     */
	if (elf_interpreter) {
		unsigned long interp_map_addr = 0;

		elf_entry = load_elf_interp(&loc->interp_elf_ex,
					    interpreter,
					    &interp_map_addr,
					    load_bias, interp_elf_phdata);
		if (!IS_ERR((void *)elf_entry)) {
			/*
			 * load_elf_interp() returns relocation
			 * adjustment
			 */
			interp_load_addr = elf_entry;
            /* (8.8.1) The interpreter is loaded; compute its entry address */
			elf_entry += loc->interp_elf_ex.e_entry;
		}
		if (BAD_ADDR(elf_entry)) {
			retval = IS_ERR((void *)elf_entry) ?
					(int)elf_entry : -EINVAL;
			goto out_free_dentry;
		}
		reloc_func_desc = interp_load_addr;

		allow_write_access(interpreter);
		fput(interpreter);
		kfree(elf_interpreter);
	} else {
		elf_entry = loc->elf_ex.e_entry;
		if (BAD_ADDR(elf_entry)) {
			retval = -EINVAL;
			goto out_free_dentry;
		}
	}

	kfree(interp_elf_phdata);
	kfree(elf_phdata);

	set_binfmt(&elf_format);

#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
	retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
	if (retval < 0)
		goto out;
#endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */

    /* (8.9) Build the ELF start-up data on the stack (argc, argv, envp and the auxiliary vector);
            the interpreter reads it when it starts
     */
	retval = create_elf_tables(bprm, &loc->elf_ex,
			  load_addr, interp_load_addr);
	if (retval < 0)
		goto out;
	/* N.B. passed_fileno might not be initialized? */
	current->mm->end_code = end_code;
	current->mm->start_code = start_code;
	current->mm->start_data = start_data;
	current->mm->end_data = end_data;
	current->mm->start_stack = bprm->p;

	/* (8.10) If ASLR requires heap randomization (randomize_va_space > 1),
	 *        pick a random heap start current->mm->start_brk above the
	 *        current brk; the random range is 32M (0x02000000)
	 */
	if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) {
		/*
		 * For architectures with ELF randomization, when executing
		 * a loader directly (i.e. no interpreter listed in ELF
		 * headers), move the brk area out of the mmap region
		 * (since it grows up, and may collide early with the stack
		 * growing down), and into the unused ELF_ET_DYN_BASE region.
		 */
		if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
		    loc->elf_ex.e_type == ET_DYN && !interpreter)
			current->mm->brk = current->mm->start_brk =
				ELF_ET_DYN_BASE;

		current->mm->brk = current->mm->start_brk =
			arch_randomize_brk(current->mm);
#ifdef compat_brk_randomized
		current->brk_randomized = 1;
#endif
	}

	/* (8.11) If the personality has MMAP_PAGE_ZERO set, create a
	 *        zero-filled anonymous mapping (VMA) at address 0
	 */
	if (current->personality & MMAP_PAGE_ZERO) {
		/* Why this, you ask???  Well SVr4 maps page 0 as read-only,
		   and some applications "depend" upon this behavior.
		   Since we do not have the power to recompile these, we
		   emulate the SVr4 behavior. Sigh. */
		error = vm_mmap(NULL, 0, PAGE_SIZE, PROT_READ | PROT_EXEC,
				MAP_FIXED | MAP_PRIVATE, 0);
	}

#ifdef ELF_PLAT_INIT
	/*
	 * The ABI may specify that certain registers be set up in special
	 * ways (on i386 %edx is the address of a DT_FINI function, for
	 * example.  In addition, it may also specify (eg, PowerPC64 ELF)
	 * that the e_entry field is the address of the function descriptor
	 * for the startup routine, rather than the address of the startup
	 * routine itself.  This macro performs whatever initialization to
	 * the regs structure is required as well as any relocations to the
	 * function descriptor entries when executing dynamically links apps.
	 */
	ELF_PLAT_INIT(regs, reloc_func_desc);
#endif

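	/* (8.12) Set the user-mode registers: ip = elf_entry (the interpreter's
	 *        entry point, or the exe's own entry if there is no
	 *        interpreter), sp = bprm->p. When the execve system call
	 *        returns to user space, execution starts at that entry.
	 */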
	start_thread(regs, elf_entry, bprm->p);
	retval = 0;
out:
	kfree(loc);
out_ret:
	return retval;

	/* error cleanup */
out_free_dentry:
	kfree(interp_elf_phdata);
	allow_write_access(interpreter);
	if (interpreter)
		fput(interpreter);
out_free_interp:
	kfree(elf_interpreter);
out_free_ph:
	kfree(elf_phdata);
	goto out;
}

2.2.1 flush_old_exec()

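flush_old_exec() tears down the resources of the old exe in the current process and replaces them with those of the new exe. The key point is that the old user address space is released and the task switches to the address space prepared for the new exe.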

int flush_old_exec(struct linux_binprm * bprm)
{
	int retval;

	/*
	 * Make sure we have a private signal table and that
	 * we are unassociated from the previous thread group.
	 */
	/* (3.1) Detach from the previous thread group and get a private signal table */
	retval = de_thread(current);
	if (retval)
		goto out;

	/*
	 * Must be called _before_ exec_mmap() as bprm->mm is
	 * not visibile until then. This also enables the update
	 * to be lockless.
	 */
    /* (3.2) bprm->mm->exe_file = bprm->file */
	set_mm_exe_file(bprm->mm, bprm->file);

	would_dump(bprm, bprm->file);

	/*
	 * Release all of the old mmap stuff
	 */
	acct_arg_size(bprm, 0);
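	/* (3.3) This is the key step:
	 *       exec_mmap() switches the task to the new address space, so
	 *       bprm->mm becomes the real mm (tsk->mm = bprm->mm), and drops
	 *       the old user address space mappings.
	 *       From now on the new exe's address space is accessed through
	 *       normal addresses plus page faults, although so far only one
	 *       page is mapped: the "argument area"
	 */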
	retval = exec_mmap(bprm->mm);
	if (retval)
		goto out;

	/*
	 * After clearing bprm->mm (to mark that current is using the
	 * prepared mm now), we have nothing left of the original
	 * process. If anything from here on returns an error, the check
	 * in search_binary_handler() will SEGV current.
	 */
	bprm->mm = NULL;

	set_fs(USER_DS);
	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
					PF_NOFREEZE | PF_NO_SETAFFINITY);
	flush_thread();
	current->personality &= ~bprm->per_clear;

	/*
	 * We have to apply CLOEXEC before we change whether the process is
	 * dumpable (in setup_new_exec) to avoid a race with a process in userspace
	 * trying to access the should-be-closed file descriptors of a process
	 * undergoing exec(2).
	 */
	do_close_on_exec(current->files);
	return 0;

out:
	return retval;
}

2.2.2 setup_new_exec()

Sets up properties of the current process such as mm->mmap_base.

void setup_new_exec(struct linux_binprm * bprm)
{
	/*
	 * Once here, prepare_binrpm() will not be called any more, so
	 * the final state of setuid/setgid/fscaps can be merged into the
	 * secureexec flag.
	 */
	bprm->secureexec |= bprm->cap_elevated;

	if (bprm->secureexec) {
		/* Make sure parent cannot signal privileged process. */
		current->pdeath_signal = 0;

		/*
		 * For secureexec, reset the stack limit to sane default to
		 * avoid bad behavior from the prior rlimits. This has to
		 * happen before arch_pick_mmap_layout(), which examines
		 * RLIMIT_STACK, but after the point of no return to avoid
		 * needing to clean up the change on failure.
		 */
		if (current->signal->rlim[RLIMIT_STACK].rlim_cur > _STK_LIM)
			current->signal->rlim[RLIMIT_STACK].rlim_cur = _STK_LIM;
	}

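	/* (5.1) Compute the base address of the mmap region: tsk->mm->mmap_base.
	 *       Currently this base sits below the stack region, and
	 *       mappings grow from high addresses to low addresses
	 */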
	arch_pick_mmap_layout(current->mm);

	current->sas_ss_sp = current->sas_ss_size = 0;

	/*
	 * Figure out dumpability. Note that this checking only of current
	 * is wrong, but userspace depends on it. This should be testing
	 * bprm->secureexec instead.
	 */
	if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
	    !(uid_eq(current_euid(), current_uid()) &&
	      gid_eq(current_egid(), current_gid())))
		set_dumpable(current->mm, suid_dumpable);
	else
		set_dumpable(current->mm, SUID_DUMP_USER);

	arch_setup_new_exec();
	perf_event_exec();
	/* (5.2) Set tsk->comm to the basename of bprm->filename */
	__set_task_comm(current, kbasename(bprm->filename), true);

	/* Set the new mm task size. We have to do that late because it may
	 * depend on TIF_32BIT which is only updated in flush_thread() on
	 * some architectures like powerpc
	 */
	current->mm->task_size = TASK_SIZE;

	/* An exec changes our domain. We are no longer part of the thread
	   group */
	WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
	flush_signal_handlers(current, 0);
}

2.2.3 setup_arg_pages()

The main job of setup_arg_pages() is to set up the address range of the stack region.

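The stack region is set up in several steps:

  • 1. Initial state: one page, the "argument area", is already mapped at the very top of user space.
  • 2. Random offset: if ASLR is enabled, the start address of the stack is randomized.
  • 3. Initial expansion: less than one page of stack is clearly not enough, so the stack is expanded by 128k, which gives 4k+128k when the argument area occupies a single page.

If the stack has to grow further while the process is running, the page-fault path takes care of it:

do_page_fault() -> __do_page_fault() -> do_user_addr_fault() -> expand_stack(), handle_mm_fault()

setup_arg_pages() is annotated in detail below: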

int setup_arg_pages(struct linux_binprm *bprm,
		    unsigned long stack_top,
		    int executable_stack)
{
	unsigned long ret;
	unsigned long stack_shift;
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma = bprm->vma;
	struct vm_area_struct *prev = NULL;
	unsigned long vm_flags;
	unsigned long stack_base;
	unsigned long stack_size;
	unsigned long stack_expand;
	unsigned long rlim_stack;

#ifdef CONFIG_STACK_GROWSUP
	/* Limit stack size */
	stack_base = rlimit_max(RLIMIT_STACK);
	if (stack_base > STACK_SIZE_MAX)
		stack_base = STACK_SIZE_MAX;

	/* Add space for stack randomization. */
	stack_base += (STACK_RND_MASK << PAGE_SHIFT);

	/* Make sure we didn't let the argument array grow too large. */
	if (vma->vm_end - vma->vm_start > stack_base)
		return -ENOMEM;

	stack_base = PAGE_ALIGN(stack_top - stack_base);

	stack_shift = vma->vm_start - stack_base;
	mm->arg_start = bprm->p - stack_shift;
	bprm->p = vma->vm_end - stack_shift;
#else
	/* (7.1) From the randomized stack top, compute how far the
	 *       already-mapped first page has to be shifted
	 */
	stack_top = arch_align_stack(stack_top);
	stack_top = PAGE_ALIGN(stack_top);

	if (unlikely(stack_top < mmap_min_addr) ||
	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
		return -ENOMEM;

	stack_shift = vma->vm_end - stack_top;

	bprm->p -= stack_shift;
	mm->arg_start = bprm->p;
#endif

	if (bprm->loader)
		bprm->loader -= stack_shift;
	bprm->exec -= stack_shift;

	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;

	vm_flags = VM_STACK_FLAGS;

	/*
	 * Adjust stack execute permissions; explicitly enable for
	 * EXSTACK_ENABLE_X, disable for EXSTACK_DISABLE_X and leave alone
	 * (arch default) otherwise.
	 */
	/* (7.2) Recompute the VMA flags according to the executable-stack attribute */
	if (unlikely(executable_stack == EXSTACK_ENABLE_X))
		vm_flags |= VM_EXEC;
	else if (executable_stack == EXSTACK_DISABLE_X)
		vm_flags &= ~VM_EXEC;
	vm_flags |= mm->def_flags;
	vm_flags |= VM_STACK_INCOMPLETE_SETUP;

	/* (7.3) Apply the new flags to the already-created "argument area" VMA */
	ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
			vm_flags);
	if (ret)
		goto out_unlock;
	BUG_ON(prev != vma);

	/* Move stack pages down in memory. */
	if (stack_shift) {
		ret = shift_arg_pages(vma, stack_shift);
		if (ret)
			goto out_unlock;
	}

	/* mprotect_fixup is overkill to remove the temporary stack flags */
	vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;

	stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
	stack_size = vma->vm_end - vma->vm_start;
	/*
	 * Align this down to a page boundary as expand_stack
	 * will align it up.
	 */
	/* (7.4) Expand the stack to its initial size: current size + 128k
	 *       (4k + 128k for a one-page argument area)
	 */
	rlim_stack = rlimit(RLIMIT_STACK) & PAGE_MASK;
#ifdef CONFIG_STACK_GROWSUP
	if (stack_size + stack_expand > rlim_stack)
		stack_base = vma->vm_start + rlim_stack;
	else
		stack_base = vma->vm_end + stack_expand;
#else
	if (stack_size + stack_expand > rlim_stack)
		stack_base = vma->vm_end - rlim_stack;
	else
		stack_base = vma->vm_start - stack_expand;
#endif
	current->mm->start_stack = bprm->p;
	ret = expand_stack(vma, stack_base);
	if (ret)
		ret = -EFAULT;

out_unlock:
	up_write(&mm->mmap_sem);
	return ret;
}

2.2.4 elf_map()

static unsigned long elf_map(struct file *filep, unsigned long addr,
		struct elf_phdr *eppnt, int prot, int type,
		unsigned long total_size)
{
	unsigned long map_addr;
	unsigned long size = eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr);
	unsigned long off = eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr);
	addr = ELF_PAGESTART(addr);
	size = ELF_PAGEALIGN(size);

	/* mmap() will return -EINVAL if given a zero size, but a
	 * segment with zero filesize is perfectly valid */
	if (!size)
		return addr;

	/*
	* total_size is the size of the ELF (interpreter) image.
	* The _first_ mmap needs to know the full size, otherwise
	* randomization might put this image into an overlapping
	* position with the ELF binary image. (since size < total_size)
	* So we first map the 'big' image - and unmap the remainder at
	* the end. (which unmap is needed for ELF images with holes.)
	*/
	if (total_size) {
		total_size = ELF_PAGEALIGN(total_size);
		/* (8.4.1) First map a VMA of total_size bytes */
		map_addr = vm_mmap(filep, addr, total_size, prot, type, off);
		if (!BAD_ADDR(map_addr))
			/* (8.4.2) Then unmap the part the current segment does not use yet */
			vm_munmap(map_addr+size, total_size-size);
	} else
		map_addr = vm_mmap(filep, addr, size, prot, type, off);

	return(map_addr);
}

2.2.5 load_elf_interp()

This function is very similar to load_elf_binary() itself; it loads the interpreter file "/lib64/ld-linux-x86-64.so.2" into the process address space.

It is not analyzed again here.

2.2.6 create_elf_tables()

static int
create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
		unsigned long load_addr, unsigned long interp_load_addr)
{
	unsigned long p = bprm->p;
	int argc = bprm->argc;
	int envc = bprm->envc;
	elf_addr_t __user *sp;
	elf_addr_t __user *u_platform;
	elf_addr_t __user *u_base_platform;
	elf_addr_t __user *u_rand_bytes;
	const char *k_platform = ELF_PLATFORM;
	const char *k_base_platform = ELF_BASE_PLATFORM;
	unsigned char k_rand_bytes[16];
	int items;
	elf_addr_t *elf_info;
	int ei_index = 0;
	const struct cred *cred = current_cred();
	struct vm_area_struct *vma;

	/*
	 * In some cases (e.g. Hyper-Threading), we want to avoid L1
	 * evictions by the processes running on the same package. One
	 * thing we can do is to shuffle the initial stack for them.
	 */

	p = arch_align_stack(p);

	/*
	 * If this architecture has a platform capability string, copy it
	 * to userspace.  In some cases (Sparc), this info is impossible
	 * for userspace to get any other way, in others (i386) it is
	 * merely difficult.
	 */
	/* (8.9.1) Push ELF_PLATFORM onto the user stack */
	u_platform = NULL;
	if (k_platform) {
		size_t len = strlen(k_platform) + 1;

		u_platform = (elf_addr_t __user *)STACK_ALLOC(p, len);
		if (__copy_to_user(u_platform, k_platform, len))
			return -EFAULT;
	}

	/*
	 * If this architecture has a "base" platform capability
	 * string, copy it to userspace.
	 */
	/* (8.9.2) Push ELF_BASE_PLATFORM onto the user stack */
	u_base_platform = NULL;
	if (k_base_platform) {
		size_t len = strlen(k_base_platform) + 1;

		u_base_platform = (elf_addr_t __user *)STACK_ALLOC(p, len);
		if (__copy_to_user(u_base_platform, k_base_platform, len))
			return -EFAULT;
	}

	/*
	 * Generate 16 random bytes for userspace PRNG seeding.
	 */
	/* (8.9.3) Push 16 random bytes onto the user stack for userspace
	 *         PRNG seeding (exposed below as AT_RANDOM)
	 */
	get_random_bytes(k_rand_bytes, sizeof(k_rand_bytes));
	u_rand_bytes = (elf_addr_t __user *)
		       STACK_ALLOC(p, sizeof(k_rand_bytes));
	if (__copy_to_user(u_rand_bytes, k_rand_bytes, sizeof(k_rand_bytes)))
		return -EFAULT;

	/* Create the ELF interpreter info */
	/* (8.9.4) Save the ELF interpreter info (auxiliary vector) in mm->saved_auxv[] */
	elf_info = (elf_addr_t *)current->mm->saved_auxv;
	/* update AT_VECTOR_SIZE_BASE if the number of NEW_AUX_ENT() changes */
#define NEW_AUX_ENT(id, val) \
	do { \
		elf_info[ei_index++] = id; \
		elf_info[ei_index++] = val; \
	} while (0)

#ifdef ARCH_DLINFO
	/* 
	 * ARCH_DLINFO must come first so PPC can do its special alignment of
	 * AUXV.
	 * update AT_VECTOR_SIZE_ARCH if the number of NEW_AUX_ENT() in
	 * ARCH_DLINFO changes
	 */
	ARCH_DLINFO;
#endif
	NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
	NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE);
	NEW_AUX_ENT(AT_CLKTCK, CLOCKS_PER_SEC);
	NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
	NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr));
	NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
	NEW_AUX_ENT(AT_BASE, interp_load_addr);
	NEW_AUX_ENT(AT_FLAGS, 0);
	NEW_AUX_ENT(AT_ENTRY, exec->e_entry);
	NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid));
	NEW_AUX_ENT(AT_EUID, from_kuid_munged(cred->user_ns, cred->euid));
	NEW_AUX_ENT(AT_GID, from_kgid_munged(cred->user_ns, cred->gid));
	NEW_AUX_ENT(AT_EGID, from_kgid_munged(cred->user_ns, cred->egid));
	NEW_AUX_ENT(AT_SECURE, bprm->secureexec);
	NEW_AUX_ENT(AT_RANDOM, (elf_addr_t)(unsigned long)u_rand_bytes);
#ifdef ELF_HWCAP2
	NEW_AUX_ENT(AT_HWCAP2, ELF_HWCAP2);
#endif
	NEW_AUX_ENT(AT_EXECFN, bprm->exec);
	if (k_platform) {
		NEW_AUX_ENT(AT_PLATFORM,
			    (elf_addr_t)(unsigned long)u_platform);
	}
	if (k_base_platform) {
		NEW_AUX_ENT(AT_BASE_PLATFORM,
			    (elf_addr_t)(unsigned long)u_base_platform);
	}
	if (bprm->interp_flags & BINPRM_FLAGS_EXECFD) {
		NEW_AUX_ENT(AT_EXECFD, bprm->interp_data);
	}
#undef NEW_AUX_ENT
	/* AT_NULL is zero; clear the rest too */
	memset(&elf_info[ei_index], 0,
	       sizeof current->mm->saved_auxv - ei_index * sizeof elf_info[0]);

	/* And advance past the AT_NULL entry.  */
	ei_index += 2;

	/* (8.9.5) Reserve room on the stack for the ELF interpreter info */
	sp = STACK_ADD(p, ei_index);

	/* (8.9.6) Reserve room on the stack for argc and the argv and envp
	 *         pointers, and update the stack pointer bprm->p
	 */
	items = (argc + 1) + (envc + 1) + 1;
	bprm->p = STACK_ROUND(sp, items);

	/* Point sp at the lowest address on the stack */
#ifdef CONFIG_STACK_GROWSUP
	sp = (elf_addr_t __user *)bprm->p - items - ei_index;
	bprm->exec = (unsigned long)sp; /* XXX: PARISC HACK */
#else
	sp = (elf_addr_t __user *)bprm->p;
#endif


	/*
	 * Grow the stack manually; some architectures have a limit on how
	 * far ahead a user-space access may be in order to grow the stack.
	 */
	vma = find_extend_vma(current->mm, bprm->p);
	if (!vma)
		return -EFAULT;

	/* Now, let's put argc (and argv, envp if appropriate) on the stack */
	if (__put_user(argc, sp++))
		return -EFAULT;

	/* Populate list of argv pointers back to argv strings. */
	/* (8.9.7) Copy the argv pointers onto the stack;
	 *         as a by-product this computes mm->arg_end
	 */
	p = current->mm->arg_end = current->mm->arg_start;
	while (argc-- > 0) {
		size_t len;
		if (__put_user((elf_addr_t)p, sp++))
			return -EFAULT;
		len = strnlen_user((void __user *)p, MAX_ARG_STRLEN);
		if (!len || len > MAX_ARG_STRLEN)
			return -EINVAL;
		p += len;
	}
	if (__put_user(0, sp++))
		return -EFAULT;
	current->mm->arg_end = p;

	/* Populate list of envp pointers back to envp strings. */
	/* (8.9.8) Copy the envp pointers onto the stack;
	 *         as a by-product this computes mm->env_end
	 */
	current->mm->env_end = current->mm->env_start = p;
	while (envc-- > 0) {
		size_t len;
		if (__put_user((elf_addr_t)p, sp++))
			return -EFAULT;
		len = strnlen_user((void __user *)p, MAX_ARG_STRLEN);
		if (!len || len > MAX_ARG_STRLEN)
			return -EINVAL;
		p += len;
	}
	if (__put_user(0, sp++))
		return -EFAULT;
	current->mm->env_end = p;

	/* Put the elf_info on the stack in the right place.  */
	/* (8.9.9) Copy the ELF interpreter info onto the stack */
	if (copy_to_user(sp, elf_info, ei_index * sizeof(elf_addr_t)))
		return -EFAULT;
	return 0;
}

2.2.7 start_thread()

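start_thread() sets the user-mode registers: ip = elf_entry (the interpreter's entry point when an interpreter is present), sp = bprm->p. When the execve system call returns, execution therefore jumps to that entry point.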

struct pt_regs *regs = current_pt_regs();

void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	start_thread_common(regs, new_ip, new_sp,
			    __USER_CS, __USER_DS, 0);
}

↓

static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
		    unsigned long new_sp,
		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
	WARN_ON_ONCE(regs != current_pt_regs());

	if (static_cpu_has(X86_BUG_NULL_SEG)) {
		/* Loading zero below won't clear the base. */
		loadsegment(fs, __USER_DS);
		load_gs_index(__USER_DS);
	}

	loadsegment(fs, 0);
	loadsegment(es, _ds);
	loadsegment(ds, _ds);
	load_gs_index(0);

	/* (8.12.1) Set the user-mode ip and sp registers to the new values */
	regs->ip		= new_ip;
	regs->sp		= new_sp;
	regs->cs		= _cs;
	regs->ss		= _ss;
	regs->flags		= X86_EFLAGS_IF;
	force_iret();
}

posted @ 2020-10-26 14:44  pwl999