Reposted: mmap
https://www.jianshu.com/p/0ce91e10d026
https://gewu.pcwanli.com/front/article/21147.html
https://blog.csdn.net/m0_53157173/article/details/127578558
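1. mmap basics
mmap is a way of memory-mapping a file: a file or another object is mapped into a process's address space, establishing a one-to-one correspondence between the file's addresses on disk and a range of virtual addresses in the process's virtual address space.
Once this mapping is in place, the process can read and write that memory through ordinary pointers, and (for a shared mapping) the system automatically writes dirty pages back to the corresponding file on disk, so the file is modified without calling system calls such as read or write. Conversely, changes made to this region on the kernel side are directly visible in user space, which is what allows different processes to share a file. As shown in the figure below:
As the figure above shows, a process's virtual address space consists of multiple virtual memory areas. A virtual memory area is a homogeneous interval of the process's virtual address space, that is, a contiguous address range with the same properties. The text segment, initialized data segment, BSS segment, heap, stack, and memory mappings shown in the figure are each an independent virtual memory area. The address space that serves memory mappings lies in the unused region between the heap and the stack.
The Linux kernel uses the vm_area_struct structure to represent one independent virtual memory area. Because virtual memory areas of different kinds differ in function and internal mechanism, a single process uses multiple vm_area_struct structures to represent its different types of areas. The vm_area_struct structures are linked together in a linked list or a tree so that they can be looked up quickly. As shown in the figure below: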
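The virtual memory areas described here can also be observed from user space. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper name count_vmas is made up for this illustration. Each line of /proc/self/maps corresponds to one area of the calling process.

```c
#include <stdio.h>
#include <string.h>

/*
 * Count the entries in /proc/self/maps. Each complete line describes one
 * virtual memory area of the calling process. When print is non-zero the
 * lines are also echoed to stdout. Returns -1 if the file cannot be opened.
 */
int count_vmas(int print)
{
    char line[1024];
    int count = 0;
    FILE *fp = fopen("/proc/self/maps", "r");

    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        if (print)
            fputs(line, stdout);
        /* A very long line arrives in several chunks; count it once. */
        if (strchr(line, '\n'))
            count++;
    }
    fclose(fp);
    return count;
}
```

Calling count_vmas(1) prints the address range, permissions, and backing file of every area, which makes it easy to spot the heap, the stack, and the mapped libraries.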
mmap API
#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);
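- addr: the requested start address; usually set to NULL for portability
- length: the size of the region mapped into the process address space
- prot: access protection: PROT_EXEC, PROT_READ, PROT_WRITE, PROT_NONE
- flags: mapping flags, such as shared mapping or private mapping
- fd: the file descriptor; set to -1 for an anonymous mapping
- offset: for a file mapping, the offset into the file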
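To make the parameters concrete, here is a small sketch of a shared file mapping. The function name mmap_write_then_read and the idea of verifying the result with pread() are choices made for this illustration, not part of the original article. It writes into the file through a pointer and then checks that an ordinary read sees the same bytes.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Store msg into the file at path through a MAP_SHARED mapping, then read
 * the file back with pread(). Returns 0 if the bytes match, -1 otherwise.
 */
int mmap_write_then_read(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    char buf[128];
    int rc = -1;
    int fd;
    char *p;

    if (len == 0 || len > sizeof(buf))
        return -1;
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* The file must be at least as long as the range we touch. */
    if (ftruncate(fd, (off_t)len) < 0)
        goto out_close;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out_close;

    memcpy(p, msg, len);                 /* plain memory store, no write() */
    if (msync(p, len, MS_SYNC) == 0 &&   /* flush dirty pages to the file */
        pread(fd, buf, len, 0) == (ssize_t)len &&
        memcmp(buf, msg, len) == 0)
        rc = 0;

    munmap(p, len);
out_close:
    close(fd);
    return rc;
}
```

A call such as mmap_write_then_read("/tmp/demo.txt", "hello mmap") is expected to return 0 on a writable file system.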
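flags values
- MAP_SHARED: creates a shared mapping. Multiple processes can map the same file this way, and modifications are propagated to the file on disk.
- MAP_PRIVATE: creates a private copy-on-write mapping. Multiple processes can privately map the same file; modifications are not propagated to the file on disk.
- MAP_ANONYMOUS: creates an anonymous mapping, that is, a mapping not backed by any file.
- MAP_FIXED: creates the mapping exactly at addr; the call fails if the specified address cannot be used. addr must be page-aligned. If the requested range overlaps existing VMAs, the overlapping part is destroyed first.
- MAP_POPULATE: prefaults the page tables for the mapping; for a file mapping, the file contents are read ahead into the mapped region. Private mappings are supported only since Linux 2.6.23.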
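The difference between MAP_SHARED and MAP_PRIVATE can be checked with a short experiment. This is a sketch only; the helper name byte_after_mapped_store is hypothetical. It stores one byte through a mapping and then reports what a regular read of the file returns.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a file containing "A", store byte c at offset 0 through a mapping
 * created with map_flags (MAP_SHARED or MAP_PRIVATE), and return the byte
 * that pread() sees in the file afterwards, or -1 on error.
 */
int byte_after_mapped_store(const char *path, int map_flags, char c)
{
    long page = sysconf(_SC_PAGESIZE);
    char seen;
    int result = -1;
    int fd;
    char *p;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "A", 1) != 1)
        goto out;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, map_flags, fd, 0);
    if (p == MAP_FAILED)
        goto out;
    p[0] = c;                     /* shared: hits the file; private: hits a copy */
    munmap(p, (size_t)page);

    if (pread(fd, &seen, 1, 0) == 1)
        result = (unsigned char)seen;
out:
    close(fd);
    return result;
}
```

With MAP_SHARED the store reaches the file, so the function returns the new byte; with MAP_PRIVATE the store lands in a copy-on-write page and the file still holds 'A'.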
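Four types of mappings
Depending on the combination of flags (shared or private, anonymous or file-backed), mappings fall into the following four types:
- Private anonymous: typically used for memory allocation (large blocks)
- Private file: typically used for loading dynamic libraries
- Shared anonymous: typically used for shared memory between processes; by default it is backed by the special device file /dev/zero
- Shared file: typically used for memory-mapped I/O and inter-process communication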
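The anonymous variants can be contrasted with a fork-based sketch. The function name value_from_child is an assumption of this example. A child process writes into an anonymous page, and the parent reports what it sees once the child has exited.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one anonymous int with map_flags, let a child process store value
 * into it, and return what the parent reads afterwards (-1 on error).
 */
int value_from_child(int map_flags, int value)
{
    int *cell;
    int seen;
    int status;
    pid_t pid;

    cell = mmap(NULL, sizeof(*cell), PROT_READ | PROT_WRITE, map_flags, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof(*cell));
        return -1;
    }
    if (pid == 0) {               /* child: write and leave immediately */
        *cell = value;
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0) {
        munmap(cell, sizeof(*cell));
        return -1;
    }
    seen = *cell;                 /* parent: observe the page after the child */
    munmap(cell, sizeof(*cell));
    return seen;
}
```

With MAP_SHARED | MAP_ANONYMOUS the parent sees the child's value, which is the shared-memory use case; with MAP_PRIVATE | MAP_ANONYMOUS the child writes to its own copy and the parent still reads 0.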
/dev/mem
https://blog.csdn.net/skyflying2012/article/details/47611399
https://www.jianshu.com/p/d4681baf0288
https://blog.csdn.net/yetaibing1990/article/details/88089811
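How mmap memory mapping works
- When user space calls mmap, the system looks for a contiguous range of virtual addresses that satisfies the request, then creates a new vma and inserts it into the linked list and red-black tree of the process's mm.
- The kernel-side mmap is then called to establish the mapping between the file blocks or device physical addresses and the vma's virtual addresses.
- For a disk file, unless special flags are set, this step only establishes the mapping; no memory is actually allocated.
- For a device file, the mapping from device physical addresses to virtual addresses is set up directly through remap_pfn_range.
- (For a disk file mapping) when the process accesses the mapped address range, a page fault is triggered and the data is copied from disk into physical memory. From then on user space can read and write this kernel-managed physical memory directly, avoiding the copy between user space and kernel space.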
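The lazy allocation described in these steps can be made visible by counting page faults. The following is a rough sketch, using an anonymous mapping rather than a file to keep it short; the function name faults_while_touching is made up for this example. It relies on getrusage() reporting minor faults, and the exact numbers depend on the system (transparent huge pages or a sandbox may change them).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Map npages fresh anonymous pages, touch each one once, and return how
 * many minor page faults the process incurred while touching them.
 * Returns -1 on error.
 */
long faults_while_touching(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = npages * (size_t)page;
    struct rusage before, after;
    volatile char *p;
    long result = -1;
    size_t i;

    if (npages == 0)
        return -1;
    /* mmap only reserves virtual addresses here; no physical page yet. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (getrusage(RUSAGE_SELF, &before) == 0) {
        for (i = 0; i < npages; i++)
            p[i * (size_t)page] = 1;   /* first touch triggers a page fault */
        if (getrusage(RUSAGE_SELF, &after) == 0)
            result = after.ru_minflt - before.ru_minflt;
    }
    munmap((void *)p, len);
    return result;
}
```

On a typical Linux system the returned count is expected to be close to npages, showing that physical pages are supplied at first access and not at mmap time.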
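Kernel code walkthrough
When mmap is called from user space, execution first enters kernel space through the system call. Note that the offset is converted to units of pages here.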
// arch/x86/kernel/sys_x86_64.c
SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
                unsigned long, prot, unsigned long, flags,
                unsigned long, fd, unsigned long, off)
{
    long error;

    error = -EINVAL;
    if (off & ~PAGE_MASK)
        goto out;

    error = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
out:
    return error;
}
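The off & ~PAGE_MASK check in this entry point has a consequence that is easy to observe from user space: an offset that is not a multiple of the page size is rejected with EINVAL. A minimal sketch follows; the helper name try_map_at_offset is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to map one page of the file at path, starting at the given file
 * offset. Returns 0 on success, or the errno value of the failing call.
 */
int try_map_at_offset(const char *path, off_t offset)
{
    long page = sysconf(_SC_PAGESIZE);
    int err = 0;
    void *p;
    int fd;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return errno;
    /* Make the file two pages long so that offsets 0 and page are valid. */
    if (ftruncate(fd, (off_t)(2 * page)) < 0) {
        err = errno;
        close(fd);
        return err;
    }

    p = mmap(NULL, (size_t)page, PROT_READ, MAP_PRIVATE, fd, offset);
    if (p == MAP_FAILED)
        err = errno;
    else
        munmap(p, (size_t)page);

    close(fd);
    return err;
}
```

Offsets 0 and one page size are expected to succeed, while an offset of 1 should return EINVAL.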
Next, look at the sys_mmap_pgoff system call: if the mapping is not anonymous, it obtains the file structure from fd.
// mm/mmap.c
SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
                unsigned long, prot, unsigned long, flags,
                unsigned long, fd, unsigned long, pgoff)
{
    struct file *file = NULL;
    unsigned long retval;

    if (!(flags & MAP_ANONYMOUS)) {
        // ...
        file = fget(fd);
        // ...
    }
    // ...
    retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
    return retval;
}
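Next comes vm_mmap_pgoff. It mainly protects the process address space with a semaphore and then, depending on the value of populate, prefaults the page tables; for a file mapping this also reads ahead on the file.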
// mm/util.c
unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
                            unsigned long len, unsigned long prot,
                            unsigned long flag, unsigned long pgoff)
{
    unsigned long ret;
    struct mm_struct *mm = current->mm;
    unsigned long populate;
    LIST_HEAD(uf);

    ret = security_mmap_file(file, prot, flag);
    if (!ret) {
        if (down_write_killable(&mm->mmap_sem))
            return -EINTR;
        ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
                            &populate, &uf);
        up_write(&mm->mmap_sem);
        userfaultfd_unmap_complete(mm, &uf);
        if (populate)
            mm_populate(ret, populate);
    }
    return ret;
}
do_mmap_pgoff simply calls do_mmap:
// include/linux/mm.h
static inline unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
                                          unsigned long len, unsigned long prot,
                                          unsigned long flags, unsigned long pgoff,
                                          unsigned long *populate,
                                          struct list_head *uf)
{
    return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf);
}
Now look at the implementation of do_mmap:
// mm/mmap.c
unsigned long do_mmap(struct file *file, unsigned long addr,
                      unsigned long len, unsigned long prot,
                      unsigned long flags, vm_flags_t vm_flags,
                      unsigned long pgoff, unsigned long *populate,
                      struct list_head *uf)
{
    struct mm_struct *mm = current->mm;
    // ...
    len = PAGE_ALIGN(len);
    // ...
    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    // ...
    addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
    if (!IS_ERR_VALUE(addr) &&
        ((vm_flags & VM_LOCKED) ||
         (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
        *populate = len;
    return addr;
}
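This function mainly page-aligns the mapping length, checks and processes the prot attributes and the flags, and sets vm_flags. get_unmapped_area validates the requested address or automatically picks an available virtual address range. mmap_region is then called. After it returns, populate is set according to the flags passed by the caller: if VM_LOCKED is set (for example because MAP_LOCKED was requested), or if MAP_POPULATE is set without MAP_NONBLOCK, the prefault operation mentioned earlier is performed.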
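The populate condition is compact enough to restate in user space. The sketch below mirrors the expression at the end of do_mmap; the function name wants_populate and the boolean parameter vm_locked (standing in for vm_flags & VM_LOCKED, which user space cannot see) are assumptions of this example.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/mman.h>

/*
 * User-space restatement of the decision at the end of do_mmap():
 * prefault when the area is locked, or when MAP_POPULATE is requested
 * and MAP_NONBLOCK is not. vm_locked stands in for (vm_flags & VM_LOCKED).
 */
bool wants_populate(bool vm_locked, int flags)
{
    return vm_locked ||
           (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE;
}
```

Masking with both flags and comparing against MAP_POPULATE alone is what makes MAP_NONBLOCK cancel the prefault: the comparison only holds when MAP_POPULATE is set and MAP_NONBLOCK is clear.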
Then continue with mmap_region:
// mm/mmap.c
unsigned long mmap_region(struct file *file, unsigned long addr,
                          unsigned long len, vm_flags_t vm_flags,
                          unsigned long pgoff, struct list_head *uf)
{
    // ...
    vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
                    NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
    if (vma)    // can be merged with an existing mapping
        goto out;

    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    vma->vm_mm = mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags;
    vma->vm_page_prot = vm_get_page_prot(vm_flags);
    vma->vm_pgoff = pgoff;
    INIT_LIST_HEAD(&vma->anon_vma_chain);

    if (file) {
        // ...
        vma->vm_file = get_file(file);
        error = call_mmap(file, vma);    // call the file's mmap method
        // ...
    } else if (vm_flags & VM_SHARED) {
        error = shmem_zero_setup(vma);
    }
    // ...
    return addr;
    // ...
}
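This function first performs some address space checks, then uses vma_merge to see whether the new mapping can be merged with an existing one, and otherwise allocates and initializes a vma. For a file mapping it calls call_mmap; for a shared anonymous mapping it calls shmem_zero_setup, which performs the /dev/zero file related setup.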
call_mmap simply invokes the mmap operation of the file's file_operations.
// include/linux/fs.h
static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
{
    return file->f_op->mmap(file, vma);
}
For a file in an ordinary file system, taking ext4 as an example, the mmap method mainly sets vma->vm_ops to ext4_file_vm_ops.
// fs/ext4/file.c
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
    // ...
    vma->vm_ops = &ext4_file_vm_ops;
    // ...
    return 0;
}

static const struct vm_operations_struct ext4_file_vm_ops = {
    .fault        = ext4_filemap_fault,
    .map_pages    = filemap_map_pages,
    .page_mkwrite = ext4_page_mkwrite,
};
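Later, when the address range of this vma is accessed, the corresponding operation is invoked. For example, the page-fault path calls ext4_filemap_fault, which in turn calls filemap_fault.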
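Test program code: catbro666/mmap-driver-demo
In the driver's mmap method, the core function is remap_pfn_range, which establishes the mapping from actual physical addresses to the vma's virtual addresses. Its parameters are: first, the user-space vma to map into; second, the start address of the mapping; third, the physical page frame number of the kernel memory; fourth, the size of the mapped region; fifth, the page protection flags for the mapping.
Note:
1. The mapped address must be page-aligned.
2. If user space needs to see correct, up-to-date data from the driver in real time, the memory must be allocated with dma_alloc_coherent.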
#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/gfp.h>          /* alloc_pages */
#include <linux/sched.h>        /* current */
#include <linux/miscdevice.h>   /* misc_register, misc_deregister */
#include <linux/uaccess.h>      /* copy_from_user, copy_to_user */

#define DEMO_NAME  "demo_dev"
#define PAGE_ORDER 2
#define MAX_SIZE   (PAGE_SIZE << PAGE_ORDER)

static struct device *mydemodrv_device;
static struct page *page;
static char *device_buffer;

static int demodrv_open(struct inode *inode, struct file *file)
{
    struct mm_struct *mm = current->mm;
    int major = MAJOR(inode->i_rdev);
    int minor = MINOR(inode->i_rdev);

    /* Dump the caller's address space layout for reference. */
    printk("%s: major=%d, minor=%d\n", __func__, major, minor);
    printk("client: %s (%d)\n", current->comm, current->pid);
    printk("code section: [0x%lx 0x%lx]\n", mm->start_code, mm->end_code);
    printk("data section: [0x%lx 0x%lx]\n", mm->start_data, mm->end_data);
    printk("brk section: s: 0x%lx, c: 0x%lx\n", mm->start_brk, mm->brk);
    printk("mmap section: s: 0x%lx\n", mm->mmap_base);
    printk("stack section: s: 0x%lx\n", mm->start_stack);
    printk("arg section: [0x%lx 0x%lx]\n", mm->arg_start, mm->arg_end);
    printk("env section: [0x%lx 0x%lx]\n", mm->env_start, mm->env_end);
    return 0;
}

static int demodrv_release(struct inode *inode, struct file *file)
{
    return 0;
}

static ssize_t demodrv_read(struct file *file, char __user *buf,
                            size_t count, loff_t *ppos)
{
    size_t need_read;
    size_t actual_read;
    unsigned long ret;

    /* The buffer spans MAX_SIZE bytes; at or past its end there is nothing to read. */
    if (*ppos < 0 || *ppos >= MAX_SIZE) {
        dev_warn(mydemodrv_device, "no space for read\n");
        return 0;
    }
    need_read = min_t(size_t, count, MAX_SIZE - *ppos);
    if (need_read == 0)
        return 0;

    /* copy_to_user returns the number of bytes that could NOT be copied. */
    ret = copy_to_user(buf, device_buffer + *ppos, need_read);
    if (ret == need_read)
        return -EFAULT;

    actual_read = need_read - ret;
    *ppos += actual_read;
    printk("%s actual_read=%zu, pos=%lld\n", __func__, actual_read, *ppos);
    return actual_read;
}

static ssize_t demodrv_write(struct file *file, const char __user *buf,
                             size_t count, loff_t *ppos)
{
    size_t need_write;
    size_t actual_written;
    unsigned long ret;

    if (*ppos < 0 || *ppos >= MAX_SIZE) {
        dev_warn(mydemodrv_device, "no space for write\n");
        return -ENOSPC;
    }
    need_write = min_t(size_t, count, MAX_SIZE - *ppos);
    if (need_write == 0)
        return 0;

    /* copy_from_user returns the number of bytes that could NOT be copied. */
    ret = copy_from_user(device_buffer + *ppos, buf, need_write);
    if (ret == need_write)
        return -EFAULT;

    actual_written = need_write - ret;
    *ppos += actual_written;
    printk("%s actual_written=%zu, pos=%lld\n", __func__, actual_written, *ppos);
    return actual_written;
}

static int demodev_mmap(struct file *file, struct vm_area_struct *vma)
{
    unsigned long size;
    unsigned long pfn_start;
    void *virt_start;
    int ret;

    /*
     * Reject offsets at or beyond the end of the buffer. Without this
     * check the unsigned subtraction below would wrap around.
     */
    if (vma->vm_pgoff >= (1UL << PAGE_ORDER)) {
        printk("%s: offset 0x%lx too large, max size is 0x%lx\n",
               __func__, vma->vm_pgoff << PAGE_SHIFT, MAX_SIZE);
        return -EINVAL;
    }

    pfn_start = page_to_pfn(page) + vma->vm_pgoff;
    virt_start = page_address(page) + (vma->vm_pgoff << PAGE_SHIFT);

    /* The mapped size must not exceed the physical pages actually allocated. */
    size = min(((1UL << PAGE_ORDER) - vma->vm_pgoff) << PAGE_SHIFT,
               vma->vm_end - vma->vm_start);

    printk("phys_start: 0x%lx, offset: 0x%lx, vma_size: 0x%lx, map size:0x%lx\n",
           pfn_start << PAGE_SHIFT, vma->vm_pgoff << PAGE_SHIFT,
           vma->vm_end - vma->vm_start, size);

    /* The caller, vm_mmap_pgoff, already holds mm->mmap_sem, so no locking here. */
    ret = remap_pfn_range(vma, vma->vm_start, pfn_start, size,
                          vma->vm_page_prot);
    if (ret)
        printk("remap_pfn_range failed, vm_start: 0x%lx\n", vma->vm_start);
    else
        printk("map kernel 0x%px to user 0x%lx, size: 0x%lx\n",
               virt_start, vma->vm_start, size);

    return ret;
}

static loff_t demodev_llseek(struct file *file, loff_t offset, int whence)
{
    loff_t pos;

    switch (whence) {
    case SEEK_SET:
        pos = offset;
        break;
    case SEEK_CUR:
        pos = file->f_pos + offset;
        break;
    case SEEK_END:
        pos = MAX_SIZE + offset;
        break;
    default:
        return -EINVAL;
    }
    if (pos < 0 || pos > MAX_SIZE)
        return -EINVAL;

    file->f_pos = pos;
    return pos;
}

static const struct file_operations demodrv_fops = {
    .owner   = THIS_MODULE,
    .open    = demodrv_open,
    .release = demodrv_release,
    .read    = demodrv_read,
    .write   = demodrv_write,
    .mmap    = demodev_mmap,
    .llseek  = demodev_llseek,
};

static struct miscdevice mydemodrv_misc_device = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = DEMO_NAME,
    .fops  = &demodrv_fops,
};

static int __init demo_dev_init(void)
{
    int ret;

    /* Allocate the buffer before the device becomes visible to user space. */
    page = alloc_pages(GFP_KERNEL, PAGE_ORDER);
    if (!page) {
        printk("alloc_pages failed\n");
        return -ENOMEM;
    }
    device_buffer = page_address(page);

    ret = misc_register(&mydemodrv_misc_device);
    if (ret) {
        printk("failed to register misc device\n");
        __free_pages(page, PAGE_ORDER);
        return ret;
    }
    mydemodrv_device = mydemodrv_misc_device.this_device;
    printk("succeeded register misc device: %s\n", DEMO_NAME);
    printk("device_buffer physical address: %lx, virtual address: %px\n",
           page_to_pfn(page) << PAGE_SHIFT, device_buffer);
    return 0;
}

static void __exit demo_dev_exit(void)
{
    printk("removing device\n");
    /* Remove the device node first so nobody can reach the freed buffer. */
    misc_deregister(&mydemodrv_misc_device);
    __free_pages(page, PAGE_ORDER);
}

module_init(demo_dev_init);
module_exit(demo_dev_exit);

MODULE_AUTHOR("catbro666");
MODULE_LICENSE("GPL v2");
MODULE_DESCRIPTION("mmap test module");
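A user-space counterpart for exercising such a driver might look like the sketch below. It is not the test program from catbro666/mmap-driver-demo; the function name check_write_visible_in_mapping is invented here, and the device path /dev/demo_dev is assumed from DEMO_NAME. The function writes a message with write() and then checks that the same bytes are visible through a shared mapping of the first page.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write msg at the start of the file or device at path using write(),
 * then map the first page MAP_SHARED and check that the mapping shows
 * the same bytes. Returns 0 on success, -1 on any failure.
 */
int check_write_visible_in_mapping(const char *path, const char *msg)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = strlen(msg);
    int rc = -1;
    char *p;
    int fd;

    if (len == 0 || len > (size_t)page)
        return -1;
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (write(fd, msg, len) != (ssize_t)len)
        goto out;

    /* Offset 0 selects the first page of the driver's buffer. */
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out;
    if (memcmp(p, msg, len) == 0)
        rc = 0;
    munmap(p, (size_t)page);
out:
    close(fd);
    return rc;
}
```

With the module loaded, check_write_visible_in_mapping("/dev/demo_dev", "hello") would go through the driver's write and mmap handlers; the same function also works on an existing regular file, which is convenient for trying it without the module.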