Linux Memory Management (9) - Page Reclaim
References: Professional Linux Kernel Architecture (《深入理解Linux内核架构》) and the blog post Linux中的内存回收 [一] - 知乎 (zhihu.com)
Kernel code: v6.8-rc2
Memory is often a scarce resource in a computer system. When the system runs low on memory, or runs out entirely, part of it must be reclaimed for the system to keep running.
To reclaim memory, we first need to know what state the pages in the system are in. Pages fall into two broad classes: file-backed and anonymous. The former are usually called the file cache, and they are comparatively easy to reclaim: a clean page, i.e. one that has never been written, can simply be dropped, while a dirty page must be written back to its file before being dropped. Reclaim therefore naturally starts with this class. Anonymous pages, by contrast, must be swapped out to the swap area before they can be reclaimed.
Before reclaiming, two questions remain: 1. Which pages should be reclaimed? If we pick pages that are accessed frequently, they will have to be brought right back after being reclaimed, defeating the purpose; the system would spend its time swapping pages out and faulting them back in, and this thrashing must be avoided. 2. How is memory organized in the swap area: how are pages written to it, and how are they found again?
For the first question, temporal locality tells us to evict the pages that have been used least recently, which is the LRU (Least Recently Used) algorithm. The kernel maintains two kinds of lists, active and inactive, to separate hot pages from cold ones; membership changes dynamically. Because file-backed and anonymous pages have different reclaim priorities, each kind gets its own pair of lists, and together with the list of unevictable pages the kernel currently has 5 kinds of LRU lists. They are 5 kinds rather than 5 lists because every node has its own set of 5.
enum lru_list {
        LRU_INACTIVE_ANON = LRU_BASE,
        LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
        LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
        LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};
These lists are per-node:
typedef struct pglist_data {
        ...
        /*
         * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
         *
         * Use mem_cgroup_lruvec() to look up lruvecs.
         */
        struct lruvec           __lruvec;
        ...
} pg_data_t;
So when does a page get added to an LRU list? The LRU lists exist for page reclaim, so only pages that have been allocated can appear on them. A likely place to add a page, then, is at allocation time, and that is indeed what happens.
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
        ...
        folio_add_lru_vma(folio, vma);
        ...
}
When an anonymous page is allocated, folio_add_lru_vma is called, which in turn calls folio_add_lru.
void folio_add_lru(struct folio *folio)
{
        struct folio_batch *fbatch;

        VM_BUG_ON_FOLIO(folio_test_active(folio) &&
                        folio_test_unevictable(folio), folio);
        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /* see the comment in lru_gen_add_folio() */
        if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
            lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
                /* set the folio's active flag */
                folio_set_active(folio);
        /* take a reference on the folio */
        folio_get(folio);
        local_lock(&cpu_fbatches.lock);
        /* get the per-cpu lru cache */
        fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
        /* add the folio to the per-cpu lru cache, or flush the
         * batch to the lru lists if it is full */
        folio_batch_add_and_move(fbatch, folio, lru_add_fn);
        local_unlock(&cpu_fbatches.lock);
}
The LRU lists are shared, and adding to them requires taking a lock; under allocation- or reclaim-heavy workloads, contention on that lock would be severe. The kernel therefore optimizes with a per-cpu LRU cache: a page destined for an LRU list is first placed in the per-cpu cache, and once the cache fills up its whole contents are moved onto the LRU lists in one go, greatly reducing how often the LRU lock must be taken.
static void folio_batch_add_and_move(struct folio_batch *fbatch,
                struct folio *folio, move_fn_t move_fn)
{
        if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
            !lru_cache_disabled())
                return;
        folio_batch_move_lru(fbatch, move_fn);
}
A folio is first added to the fbatch; if the batch still has room (and the folio is not large and the LRU cache is not disabled), the function returns right away. Otherwise the whole fbatch is flushed onto the LRU lists.
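To make the batching idea concrete, here is a minimal userspace sketch of the same pattern (all names here are invented for illustration; the real kernel structures are folio_batch and cpu_fbatches). The point is that the shared lock is taken once per batch of 15 pages instead of once per page; 15 also happens to be the capacity of a folio_batch.

#include <stdio.h>
#include <pthread.h>

#define BATCH_SIZE 15   /* a folio_batch also holds 15 entries */

/* hypothetical stand-ins for a page and the shared LRU list */
struct page { struct page *next; };
static struct page *lru_head;
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* per-thread batch playing the role of cpu_fbatches.lru_add */
struct batch {
        struct page *pages[BATCH_SIZE];
        int count;
};

/* drain: take the lru lock once and splice in the whole batch */
static void batch_drain(struct batch *b)
{
        pthread_mutex_lock(&lru_lock);
        for (int i = 0; i < b->count; i++) {
                b->pages[i]->next = lru_head;
                lru_head = b->pages[i];
        }
        pthread_mutex_unlock(&lru_lock);
        b->count = 0;
}

/* add: no lock in the common case, one acquisition per 15 pages */
static void batch_add(struct batch *b, struct page *page)
{
        b->pages[b->count++] = page;
        if (b->count == BATCH_SIZE)
                batch_drain(b);
}

int main(void)
{
        static struct page pages[100];
        struct batch b = { .count = 0 };

        for (int i = 0; i < 100; i++)
                batch_add(&b, &pages[i]);
        batch_drain(&b);        /* flush the leftovers */

        int n = 0;
        for (struct page *p = lru_head; p; p = p->next)
                n++;
        printf("%d pages on the lru list\n", n);        /* 100 */
        return 0;
}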
static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
        int i;
        struct lruvec *lruvec = NULL;
        unsigned long flags = 0;

        for (i = 0; i < folio_batch_count(fbatch); i++) {
                struct folio *folio = fbatch->folios[i];

                /* block memcg migration while the folio moves between lru */
                if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
                        continue;

                /* look up and lock the lruvec */
                lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
                /* add the folio to the lruvec */
                move_fn(lruvec, folio);

                /* set the folio's lru flag */
                folio_set_lru(folio);
        }

        if (lruvec)
                unlock_page_lruvec_irqrestore(lruvec, flags);
        folios_put(fbatch->folios, folio_batch_count(fbatch));
        folio_batch_reinit(fbatch);
}
This function walks the fbatch and adds the folios to the LRU one by one. Where does the lruvec come from?
static inline struct lruvec *folio_lruvec_relock_irqsave(struct folio *folio,
                struct lruvec *locked_lruvec, unsigned long *flags)
{
        if (locked_lruvec) {
                if (folio_matches_lruvec(folio, locked_lruvec))
                        return locked_lruvec;

                unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
        }

        return folio_lruvec_lock_irqsave(folio, flags);
}
On the first loop iteration locked_lruvec is NULL, so folio_lruvec_lock_irqsave is called.
static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
                unsigned long *flagsp)
{
        struct pglist_data *pgdat = folio_pgdat(folio);

        spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
        return &pgdat->__lruvec;
}
Ignoring memory cgroups, looking up the lruvec simply means taking __lruvec from the node the folio belongs to. This also confirms that the lruvec really is per-node.
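The relock helper matters more than it looks here: with memory cgroups enabled, each memcg has its own lruvec and thus its own lru_lock, so consecutive folios in a batch may or may not share a lock, and folio_lruvec_relock_irqsave only drops and re-takes the lock when the lruvec changes. Below is a small userspace sketch of the same pattern, under the assumption that bucket numbers stand in for lruvecs (all names invented):

#include <stdio.h>
#include <pthread.h>

#define NR_BUCKETS 2

/* hypothetical per-bucket locks standing in for per-lruvec lru_lock */
static pthread_mutex_t locks[NR_BUCKETS] = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

struct item { int bucket; };

/* the relock pattern: only drop/re-take the lock when the next item
 * belongs to a different bucket, like folio_lruvec_relock_irqsave() */
static void process_all(struct item *items, int n)
{
        int locked = -1;        /* no lock held yet */

        for (int i = 0; i < n; i++) {
                if (items[i].bucket != locked) {
                        if (locked >= 0)
                                pthread_mutex_unlock(&locks[locked]);
                        locked = items[i].bucket;
                        pthread_mutex_lock(&locks[locked]);
                }
                printf("item %d processed under lock %d\n", i, locked);
        }
        if (locked >= 0)
                pthread_mutex_unlock(&locks[locked]);
}

int main(void)
{
        struct item items[] = { {0}, {0}, {1}, {1}, {0} };
        process_all(items, 5);  /* takes a lock 3 times, not 5 */
        return 0;
}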
Back in folio_batch_move_lru, let's see how a folio actually gets onto the LRU. The move_fn passed in is lru_add_fn.
static void lru_add_fn(struct lruvec *lruvec, struct folio *folio)
{
        int was_unevictable = folio_test_clear_unevictable(folio);
        long nr_pages = folio_nr_pages(folio);

        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /*
         * Is an smp_mb__after_atomic() still required here, before
         * folio_evictable() tests the mlocked flag, to rule out the possibility
         * of stranding an evictable folio on an unevictable LRU?  I think
         * not, because __munlock_folio() only clears the mlocked flag
         * while the LRU lock is held.
         *
         * (That is not true of __page_cache_release(), and not necessarily
         * true of release_pages(): but those only clear the mlocked flag after
         * folio_put_testzero() has excluded any other users of the folio.)
         */
        if (folio_evictable(folio)) {
                if (was_unevictable)
                        __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
        } else {
                /* clear the active flag */
                folio_clear_active(folio);
                folio_set_unevictable(folio);
                /*
                 * folio->mlock_count = !!folio_test_mlocked(folio)?
                 * But that leaves __mlock_folio() in doubt whether another
                 * actor has already counted the mlock or not.  Err on the
                 * safe side, underestimate, let page reclaim fix it, rather
                 * than leaving a page on the unevictable LRU indefinitely.
                 */
                folio->mlock_count = 0;
                if (!was_unevictable)
                        __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
        }

        /* add the folio to the lruvec */
        lruvec_add_folio(lruvec, folio);
        trace_mm_lru_insertion(folio);
}
lruvec_add_folio finally inserts the folio into the lruvec.
static __always_inline
void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
        /* decide from the flags which lru list the folio belongs to */
        enum lru_list lru = folio_lru_list(folio);

        /* CONFIG_LRU_GEN is usually disabled, so this returns false */
        if (lru_gen_add_folio(lruvec, folio, false))
                return;

        /* update the lruvec statistics */
        update_lru_size(lruvec, lru, folio_zonenum(folio),
                        folio_nr_pages(folio));
        if (lru != LRU_UNEVICTABLE)
                /* link folio->lru into lruvec->lists[lru] */
                list_add(&folio->lru, &lruvec->lists[lru]);
}
static __always_inline enum lru_list folio_lru_list(struct folio *folio)
{
        enum lru_list lru;

        VM_BUG_ON_FOLIO(folio_test_active(folio) &&
                        folio_test_unevictable(folio), folio);

        if (folio_test_unevictable(folio))
                return LRU_UNEVICTABLE;

        lru = folio_is_file_lru(folio) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
        if (folio_test_active(folio))
                lru += LRU_ACTIVE;

        return lru;
}
If the active flag has not been explicitly set, the page is placed on an inactive list.
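The mapping from flags to list index is easy to mimic. Here is a small standalone sketch (pick_lru and struct flags are invented names; the enum values mirror the kernel's layout, where LRU_ACTIVE = 1 and LRU_FILE = 2):

#include <stdio.h>
#include <stdbool.h>

/* same layout as the kernel's enum lru_list */
enum lru_list {
        LRU_INACTIVE_ANON,
        LRU_ACTIVE_ANON,
        LRU_INACTIVE_FILE,
        LRU_ACTIVE_FILE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};

/* hypothetical flag bits standing in for PG_active etc. */
struct flags { bool file, active, unevictable; };

/* mirrors the selection logic of folio_lru_list() */
static enum lru_list pick_lru(struct flags f)
{
        if (f.unevictable)
                return LRU_UNEVICTABLE;
        enum lru_list lru = f.file ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
        if (f.active)
                lru += 1;       /* LRU_ACTIVE */
        return lru;
}

int main(void)
{
        /* a fresh anonymous page: neither file-backed nor active */
        struct flags anon = { .file = false, .active = false };
        printf("new anon page -> list %d (LRU_INACTIVE_ANON)\n",
               pick_lru(anon));
        return 0;
}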
The lruvec contains the array of LRU lists.
struct lruvec {
        /* lists holds the five lru lists described above */
        struct list_head                lists[NR_LRU_LISTS];
        ...
};
So far we know that when an anonymous page is allocated, it is placed on the matching list of the lruvec of the node the folio belongs to, and the code shows that a page first enters the LRU on an inactive list.
Staying on an inactive list is a "dangerous" position for a page: it may be reclaimed at any moment. What does it take for a page to avoid being reclaimed? This is the question of how pages move between the LRU lists. The page flags involved are PG_referenced and PG_active. Assuming both flags start out clear, the state transitions are:
inactive,unreferenced -> inactive,referenced    // first access
inactive,referenced   -> active,unreferenced    // second access
active,unreferenced   -> active,referenced      // third access
Page reclaim targets pages that are inactive and unreferenced. The function implementing transitions in this (promoting) direction is folio_mark_accessed.
/*
 * Mark a page as having seen activity.
 *
 * inactive,unreferenced        ->      inactive,referenced
 * inactive,referenced          ->      active,unreferenced
 * active,unreferenced          ->      active,referenced
 *
 * When a newly allocated page is not yet visible, so safe for non-atomic ops,
 * __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
 */
void folio_mark_accessed(struct folio *folio)
{
        if (lru_gen_enabled()) {
                folio_inc_refs(folio);
                return;
        }

        if (!folio_test_referenced(folio)) {
                folio_set_referenced(folio);
        } else if (folio_test_unevictable(folio)) {
                /*
                 * Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
                 * this list is never rotated or maintained, so marking an
                 * unevictable page accessed has no effect.
                 */
        } else if (!folio_test_active(folio)) {
                /*
                 * If the folio is on the LRU, queue it for activation via
                 * cpu_fbatches.activate. Otherwise, assume the folio is in a
                 * folio_batch, mark it active and it'll be moved to the active
                 * LRU on the next drain.
                 */
                if (folio_test_lru(folio))
                        /*
                         * queue the folio in the per-cpu activate batch;
                         * folio_batch_add_and_move() drains the batch to
                         * the lru lists once it fills up, so the batch
                         * cannot overflow
                         */
                        folio_activate(folio);
                else
                        __lru_cache_activate_folio(folio);
                folio_clear_referenced(folio);
                workingset_activation(folio);
        }
        if (folio_test_idle(folio))
                folio_clear_idle(folio);
}
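Stripped of the LRU bookkeeping, the promotion logic of folio_mark_accessed is a two-bit state machine. A minimal userspace model (mark_accessed and page_state are invented names) reproduces the three transitions listed above:

#include <stdio.h>
#include <stdbool.h>

/* hypothetical per-page state mirroring PG_referenced / PG_active */
struct page_state { bool referenced, active; };

/* simplified model of folio_mark_accessed(): the first access sets
 * referenced, a second access while referenced promotes to active
 * and clears referenced again */
static void mark_accessed(struct page_state *p)
{
        if (!p->referenced) {
                p->referenced = true;
        } else if (!p->active) {
                p->active = true;
                p->referenced = false;  /* active,unreferenced */
        }
        /* active && referenced: nothing left to promote */
}

int main(void)
{
        struct page_state p = { false, false };

        for (int access = 1; access <= 3; access++) {
                mark_accessed(&p);
                printf("access %d: %s,%s\n", access,
                       p.active ? "active" : "inactive",
                       p.referenced ? "referenced" : "unreferenced");
        }
        return 0;
}

Running it prints exactly the three transitions from the table above, ending in active,referenced.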
Two related APIs are folio_referenced and folio_check_references.
/**
 * folio_referenced() - Test if the folio was referenced.
 * @folio: The folio to test.
 * @is_locked: Caller holds lock on the folio.
 * @memcg: target memory cgroup
 * @vm_flags: A combination of all the vma->vm_flags which referenced the folio.
 *
 * Quick test_and_clear_referenced for all mappings of a folio,
 *
 * Return: The number of mappings which referenced the folio. Return -1 if
 * the function bailed out due to rmap lock contention.
 */
int folio_referenced(struct folio *folio, int is_locked,
                     struct mem_cgroup *memcg, unsigned long *vm_flags)
{
        int we_locked = 0;
        struct folio_referenced_arg pra = {
                .mapcount = folio_mapcount(folio),
                .memcg = memcg,
        };
        struct rmap_walk_control rwc = {
                /* folio_referenced_one does the actual referenced test */
                .rmap_one = folio_referenced_one,
                .arg = (void *)&pra,
                .anon_lock = folio_lock_anon_vma_read,
                .try_lock = true,
                .invalid_vma = invalid_folio_referenced_vma,
        };

        *vm_flags = 0;
        if (!pra.mapcount)
                return 0;

        if (!folio_raw_mapping(folio))
                return 0;

        if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
                we_locked = folio_trylock(folio);
                if (!we_locked)
                        return 1;
        }

        /* use the reverse mapping to find every pte that maps this folio */
        rmap_walk(folio, &rwc);
        *vm_flags = pra.vm_flags;

        if (we_locked)
                folio_unlock(folio);

        return rwc.contended ? -1 : pra.referenced;
}
folio_referenced walks the reverse mapping to find all the PTEs that map the folio, then uses folio_referenced_one to decide whether the folio has been referenced.
Let's look at folio_referenced_one:
static bool folio_referenced_one(struct folio *folio,
                struct vm_area_struct *vma, unsigned long address, void *arg)
{
        struct folio_referenced_arg *pra = arg;
        DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
        int referenced = 0;
        unsigned long start = address, ptes = 0;

        while (page_vma_mapped_walk(&pvmw)) {
                address = pvmw.address;

                if (vma->vm_flags & VM_LOCKED) {
                        if (!folio_test_large(folio) || !pvmw.pte) {
                                /* Restore the mlock which got missed */
                                mlock_vma_folio(folio, vma);
                                page_vma_mapped_walk_done(&pvmw);
                                pra->vm_flags |= VM_LOCKED;
                                return false; /* To break the loop */
                        }
                        /*
                         * For large folio fully mapped to VMA, will
                         * be handled after the pvmw loop.
                         *
                         * For large folio cross VMA boundaries, it's
                         * expected to be picked by page reclaim. But
                         * should skip reference of pages which are in
                         * the range of VM_LOCKED vma. As page reclaim
                         * should just count the reference of pages out
                         * the range of VM_LOCKED vma.
                         */
                        ptes++;
                        pra->mapcount--;
                        continue;
                }

                if (pvmw.pte) {
                        if (lru_gen_enabled() &&
                            pte_young(ptep_get(pvmw.pte))) {
                                lru_gen_look_around(&pvmw);
                                referenced++;
                        }

                        /* test whether the pte was accessed, and if so
                         * clear its young (accessed) bit */
                        if (ptep_clear_flush_young_notify(vma, address,
                                                pvmw.pte))
                                referenced++;
                } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
                        if (pmdp_clear_flush_young_notify(vma, address,
                                                pvmw.pmd))
                                referenced++;
                } else {
                        /* unexpected pmd-mapped folio? */
                        WARN_ON_ONCE(1);
                }

                pra->mapcount--;
        }

        if ((vma->vm_flags & VM_LOCKED) &&
                        folio_test_large(folio) &&
                        folio_within_vma(folio, vma)) {
                unsigned long s_align, e_align;

                s_align = ALIGN_DOWN(start, PMD_SIZE);
                e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);

                /* folio doesn't cross page table boundary and fully mapped */
                if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
                        /* Restore the mlock which got missed */
                        mlock_vma_folio(folio, vma);
                        pra->vm_flags |= VM_LOCKED;
                        return false; /* To break the loop */
                }
        }

        if (referenced)
                folio_clear_idle(folio);
        if (folio_test_clear_young(folio))
                referenced++;

        if (referenced) {
                pra->referenced++;
                pra->vm_flags |= vma->vm_flags & ~VM_LOCKED;
        }

        if (!pra->mapcount)
                return false; /* To break the loop */

        return true;
}
Overall, folio_referenced returns the number of mappings through which the folio was recently referenced, clearing each PTE's accessed (young) bit along the way.
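The essence of that walk is "test and clear the young bit across all mappings, and count the hits". A toy model (struct pte and page_referenced here are invented names, standing in for the rmap walk plus ptep_clear_flush_young_notify) shows why the clearing matters: a second scan with no intervening access reports zero references, which is exactly the aging signal reclaim relies on.

#include <stdio.h>
#include <stdbool.h>

#define NR_MAPPINGS 4

/* hypothetical model: one "young" bit per pte mapping the page */
struct pte { bool young; };

/* mirrors the aggregation done by folio_referenced(): count the
 * mappings whose young bit was set, clearing each bit as we go */
static int page_referenced(struct pte ptes[], int n)
{
        int referenced = 0;

        for (int i = 0; i < n; i++) {
                if (ptes[i].young) {    /* test-and-clear */
                        ptes[i].young = false;
                        referenced++;
                }
        }
        return referenced;
}

int main(void)
{
        struct pte ptes[NR_MAPPINGS] = { {true}, {false}, {true}, {false} };

        printf("first scan:  %d references\n",
               page_referenced(ptes, NR_MAPPINGS));     /* 2 */
        printf("second scan: %d references\n",
               page_referenced(ptes, NR_MAPPINGS));     /* 0 */
        return 0;
}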
Now let's look at folio_check_references.
static enum folio_references folio_check_references(struct folio *folio,
                                                  struct scan_control *sc)
{
        int referenced_ptes, referenced_folio;
        unsigned long vm_flags;

        /* count the ptes through which this folio was referenced */
        referenced_ptes = folio_referenced(folio, 1, sc->target_mem_cgroup,
                                           &vm_flags);
        /* test and clear the folio's referenced flag */
        referenced_folio = folio_test_clear_referenced(folio);

        /*
         * The supposedly reclaimable folio was found to be in a VM_LOCKED vma.
         * Let the folio, now marked Mlocked, be moved to the unevictable list.
         */
        if (vm_flags & VM_LOCKED)
                return FOLIOREF_ACTIVATE;

        /* rmap lock contention: rotate */
        if (referenced_ptes == -1)
                return FOLIOREF_KEEP;

        if (referenced_ptes) {
                /*
                 * All mapped folios start out with page table
                 * references from the instantiating fault, so we need
                 * to look twice if a mapped file/anon folio is used more
                 * than once.
                 *
                 * Mark it and spare it for another trip around the
                 * inactive list. Another page table reference will
                 * lead to its activation.
                 *
                 * Note: the mark is set for activated folios as well
                 * so that recently deactivated but used folios are
                 * quickly recovered.
                 */
                folio_set_referenced(folio);

                /* referenced more than once: promote to the active list */
                if (referenced_folio || referenced_ptes > 1)
                        return FOLIOREF_ACTIVATE;

                /*
                 * Activate file-backed executable folios after first usage.
                 * (an executable file page goes to the active list after a
                 * single reference)
                 */
                if ((vm_flags & VM_EXEC) && folio_is_file_lru(folio))
                        return FOLIOREF_ACTIVATE;

                return FOLIOREF_KEEP;
        }

        /* Reclaim if clean, defer dirty folios to writeback */
        if (referenced_folio && folio_is_file_lru(folio))
                return FOLIOREF_RECLAIM_CLEAN;

        return FOLIOREF_RECLAIM;
}
folio_check_references decides whether the current folio should be moved to the active list, kept on the inactive list, or reclaimed.
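Setting aside the mlock and lock-contention special cases, the decision reduces to a small pure function. The sketch below is a condensed model of that logic, not the kernel code (check_references and its parameters are invented); the caller in the kernel, shrink_folio_list(), acts on the returned value, activating, keeping, or reclaiming the folio accordingly.

#include <stdio.h>
#include <stdbool.h>

enum folio_references {
        REF_RECLAIM, REF_RECLAIM_CLEAN, REF_KEEP, REF_ACTIVATE
};

/* condensed model of folio_check_references() for ordinary pages:
 * pte_refs  = result of the rmap scan,
 * had_flag  = the PG_referenced software flag (tested and cleared),
 * exec_file = executable file-backed page,
 * file_lru  = page sits on a file lru list */
static enum folio_references check_references(int pte_refs, bool had_flag,
                                              bool exec_file, bool file_lru)
{
        if (pte_refs) {
                if (had_flag || pte_refs > 1)
                        return REF_ACTIVATE;    /* used more than once */
                if (exec_file)
                        return REF_ACTIVATE;    /* executables: first use */
                return REF_KEEP;        /* one more lap, now marked */
        }
        if (had_flag && file_lru)
                return REF_RECLAIM_CLEAN;       /* reclaim only if clean */
        return REF_RECLAIM;
}

int main(void)
{
        printf("%d\n", check_references(2, false, false, false)); /* ACTIVATE */
        printf("%d\n", check_references(1, false, false, false)); /* KEEP */
        printf("%d\n", check_references(0, true,  false, true));  /* RECLAIM_CLEAN */
        return 0;
}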
In the next post we'll look at how the kernel actually performs page reclaim.