Linux memory management (9) - Page reclaim

References: Professional Linux Kernel Architecture (《深入理解Linux内核架构》) and the Zhihu blog post "Linux中的内存回收 [一]" (zhihu.com).

Kernel source: v6.8-rc2

Memory is often a scarce resource in a computer system. When memory runs low or is exhausted, part of it must be reclaimed for the system to keep running.

To reclaim memory we first need to know what state the pages in the system are in. Pages fall into two broad categories: file-backed mappings and anonymous mappings. The former are usually called the page cache, and they are comparatively easy to reclaim: a clean page, i.e. one that has not been written to, can simply be dropped, while a dirty page has to be written back to its file before it is discarded. So reclaim can start with these pages. Anonymous pages, on the other hand, must first be swapped out to the swap area before they can be reclaimed.

Before reclaiming memory there are two further questions: 1. Which pages should be reclaimed? If we pick pages that are accessed frequently, they will have to be brought back right after they are reclaimed, which defeats the purpose; the system would spend its time swapping pages out and faulting them back in, and that must be avoided. 2. How is the swap area organized: how are pages written to swap, and how are they found again later?

For the first question, temporal locality suggests evicting the pages that have gone unused for the longest time, which is the LRU (Least Recently Used) algorithm. The kernel keeps active and inactive lists to separate active pages from inactive ones, and pages move between them dynamically. Because file-backed and anonymous pages have different reclaim priorities the lists are split further, and together with unevictable pages the kernel currently has 5 kinds of LRU lists. They are 5 kinds rather than 5 lists because every node has its own set of 5.

enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,
    LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    LRU_UNEVICTABLE,
    NR_LRU_LISTS
};
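
The arithmetic with LRU_BASE, LRU_FILE and LRU_ACTIVE is just an offset encoding of the file/anon and active/inactive dimensions. For reference, include/linux/mmzone.h defines the offsets and a pair of helpers that decode an lru_list value (copied here; minor details may differ between kernel versions):

#define LRU_BASE 0
#define LRU_ACTIVE 1
#define LRU_FILE 2

static inline bool is_file_lru(enum lru_list lru)
{
    return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
}

static inline bool is_active_lru(enum lru_list lru)
{
    return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
}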

These lists are per node.

typedef struct pglist_data {
...
    /*
     * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
     *
     * Use mem_cgroup_lruvec() to look up lruvecs.
     */
    struct lruvec        __lruvec;
...
} pg_data_t;

So when does a page get added to an LRU list? The LRU lists exist for page reclaim, so only allocated pages can appear on them. A natural guess is that a page is added when it is allocated, and that is indeed the case.

static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
...
    folio_add_lru_vma(folio, vma);
...
}

When an anonymous page is allocated, folio_add_lru_vma() is called, which in turn calls folio_add_lru().
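
folio_add_lru_vma() itself is tiny; roughly (mm/swap.c, exact form may vary slightly between versions) it diverts pages faulted into a VM_LOCKED vma to the mlock path and hands everything else to folio_add_lru():

void folio_add_lru_vma(struct folio *folio, struct vm_area_struct *vma)
{
        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        if (unlikely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED))
                mlock_new_folio(folio);
        else
                folio_add_lru(folio);
}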

void folio_add_lru(struct folio *folio)
{
        struct folio_batch *fbatch;

        VM_BUG_ON_FOLIO(folio_test_active(folio) &&
                        folio_test_unevictable(folio), folio);
        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /* see the comment in lru_gen_add_folio() */
        if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
            lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
                // set the folio's active flag
                folio_set_active(folio);
        // take an extra reference on the folio
        folio_get(folio);
        local_lock(&cpu_fbatches.lock);
        // get this CPU's lru_add folio batch (the per-CPU lru cache)
        fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
        // add the folio to the per-CPU lru cache; the batch is flushed to the LRU lists when it fills up
        folio_batch_add_and_move(fbatch, folio, lru_add_fn);
        local_unlock(&cpu_fbatches.lock);
}

The LRU lists are shared, and adding to them requires taking the lru lock, so under allocation- or reclaim-heavy workloads the lock contention would be severe. The kernel therefore optimizes with a per-CPU lru cache: a folio headed for the LRU is first placed into this per-CPU cache, and once the cache is full all of its entries are moved to the LRU lists in one go, which greatly reduces how often the lru lock has to be taken.
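
The per-CPU cache is the cpu_fbatches structure in mm/swap.c. Each member is a folio_batch, a small fixed-size array of folio pointers (PAGEVEC_SIZE entries), one batch per kind of deferred LRU operation. A simplified sketch (the exact member list may differ between versions):

struct cpu_fbatches {
        local_lock_t lock;
        struct folio_batch lru_add;
        struct folio_batch lru_deactivate_file;
        struct folio_batch lru_deactivate;
        struct folio_batch lru_lazyfree;
#ifdef CONFIG_SMP
        struct folio_batch activate;
#endif
};
static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
        .lock = INIT_LOCAL_LOCK(lock),
};

folio_batch_add_and_move() implements the "add to the batch, drain when full" policy: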

static void folio_batch_add_and_move(struct folio_batch *fbatch,
                struct folio *folio, move_fn_t move_fn)
{
        if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
            !lru_cache_disabled())
                return;
        folio_batch_move_lru(fbatch, move_fn);
}

As you can see, the folio is first added to the fbatch: if there is still room, the folio is not large, and the lru cache has not been disabled, the function simply returns; otherwise the whole fbatch is drained onto the LRU lists.
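
The return-value convention matters here: folio_batch_add() returns the number of slots still free, so a zero return means the batch has just become full. From include/linux/pagevec.h (slightly trimmed):

static inline unsigned folio_batch_space(struct folio_batch *fbatch)
{
        return PAGEVEC_SIZE - fbatch->nr;
}

/* Returns the number of slots still available after adding the folio. */
static inline unsigned folio_batch_add(struct folio_batch *fbatch,
                struct folio *folio)
{
        fbatch->folios[fbatch->nr++] = folio;
        return folio_batch_space(fbatch);
}

When the batch does fill up (or batching is disabled, or the folio is large), folio_batch_move_lru() drains it: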

static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
        int i;
        struct lruvec *lruvec = NULL;
        unsigned long flags = 0;

        for (i = 0; i < folio_batch_count(fbatch); i++) {
                struct folio *folio = fbatch->folios[i];

                /* block memcg migration while the folio moves between lru */
                if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
                        continue;
                // look up (and lock) the lruvec this folio belongs to
                lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
                // add the folio to the lruvec
                move_fn(lruvec, folio);
                // set the folio's lru flag
                folio_set_lru(folio);
        }

        if (lruvec)
                unlock_page_lruvec_irqrestore(lruvec, flags);
        folios_put(fbatch->folios, folio_batch_count(fbatch));
        folio_batch_reinit(fbatch);
}

This function moves the folios in the fbatch onto the LRU one by one. Where does the lruvec come from?

static inline struct lruvec *folio_lruvec_relock_irqsave(struct folio *folio,
                struct lruvec *locked_lruvec, unsigned long *flags)
{
        if (locked_lruvec) {
                if (folio_matches_lruvec(folio, locked_lruvec))
                        return locked_lruvec;

                unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
        }

        return folio_lruvec_lock_irqsave(folio, flags);
}

On the first loop iteration locked_lruvec is NULL, so folio_lruvec_lock_irqsave() is called.

static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
                unsigned long *flagsp)
{
        struct pglist_data *pgdat = folio_pgdat(folio);

        spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
        return &pgdat->__lruvec;
}

Ignoring memory cgroups, getting the lruvec simply means taking __lruvec from the node the folio belongs to, which also shows that the lruvec really is per node.

Back in folio_batch_move_lru(), let's see how a folio is added to the LRU. The move_fn passed in is lru_add_fn.

static void lru_add_fn(struct lruvec *lruvec, struct folio *folio)
{
        int was_unevictable = folio_test_clear_unevictable(folio);
        long nr_pages = folio_nr_pages(folio);

        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /*
         * Is an smp_mb__after_atomic() still required here, before
         * folio_evictable() tests the mlocked flag, to rule out the possibility
         * of stranding an evictable folio on an unevictable LRU?  I think
         * not, because __munlock_folio() only clears the mlocked flag
         * while the LRU lock is held.
         *
         * (That is not true of __page_cache_release(), and not necessarily
         * true of release_pages(): but those only clear the mlocked flag after
         * folio_put_testzero() has excluded any other users of the folio.)
         */
        if (folio_evictable(folio)) {
                if (was_unevictable)
                        __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
        } else {
                // clear the active flag
                folio_clear_active(folio);
                folio_set_unevictable(folio);
                /*
                 * folio->mlock_count = !!folio_test_mlocked(folio)?
                 * But that leaves __mlock_folio() in doubt whether another
                 * actor has already counted the mlock or not. Err on the
                 * safe side, underestimate, let page reclaim fix it, rather
                 * than leaving a page on the unevictable LRU indefinitely.
                 */
                folio->mlock_count = 0;
                if (!was_unevictable)
                        __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
        }

        // add the folio to the lruvec
        lruvec_add_folio(lruvec, folio);
        trace_mm_lru_insertion(folio);
}

lruvec_add_folio() finally puts the folio onto the lruvec.

static __always_inline
void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
        // determine which lru list the folio belongs to from its flags
        enum lru_list lru = folio_lru_list(folio);

        // with CONFIG_LRU_GEN disabled (the common case) this returns false
        if (lru_gen_add_folio(lruvec, folio, false))
                return;

        // update the lruvec statistics
        update_lru_size(lruvec, lru, folio_zonenum(folio),
                        folio_nr_pages(folio));
        if (lru != LRU_UNEVICTABLE)
                // link folio->lru into lruvec->lists[lru]
                list_add(&folio->lru, &lruvec->lists[lru]);
}

static __always_inline enum lru_list folio_lru_list(struct folio *folio)
{
    enum lru_list lru;

    VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);

    if (folio_test_unevictable(folio))
        return LRU_UNEVICTABLE;

    lru = folio_is_file_lru(folio) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
    if (folio_test_active(folio))
        lru += LRU_ACTIVE;

    return lru;
}

If the active flag has not been explicitly set, the page is placed on an inactive list.

The lruvec contains an array of LRU lists.

struct lruvec {
        // lists[] holds the 5 LRU lists described above
        struct list_head lists[NR_LRU_LISTS];
...
};

So we now know that when an anonymous page is allocated it is placed on the corresponding list of the lruvec belonging to the folio's node, and from the code we saw that a page first enters the LRU on an inactive list.

Sitting on the inactive list is "dangerous" for a page: it may be reclaimed at any moment. What has to happen for a page to escape reclaim? That is the question of how pages move between the LRU lists. The relevant page flags are PG_referenced and PG_active. Assuming both start out as 0, the state transitions are:

inactive,unreferenced    ->    inactive,referenced    // first access
inactive,referenced      ->    active,unreferenced    // second access
active,unreferenced      ->    active,referenced      // third access

Page reclaim picks pages that are inactive and unreferenced. The function that drives transitions in the direction shown above is folio_mark_accessed().

/*
 * Mark a page as having seen activity.
 *
 * inactive,unreferenced        ->      inactive,referenced
 * inactive,referenced          ->      active,unreferenced
 * active,unreferenced          ->      active,referenced
 *
 * When a newly allocated page is not yet visible, so safe for non-atomic ops,
 * __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
 */
void folio_mark_accessed(struct folio *folio)
{
        if (lru_gen_enabled()) {
                folio_inc_refs(folio);
                return;
        }

        if (!folio_test_referenced(folio)) {
                folio_set_referenced(folio);
        } else if (folio_test_unevictable(folio)) {
                /*
                 * Unevictable pages are on the "LRU_UNEVICTABLE" list. But,
                 * this list is never rotated or maintained, so marking an
                 * unevictable page accessed has no effect.
                 */
        } else if (!folio_test_active(folio)) {
                /*
                 * If the folio is on the LRU, queue it for activation via
                 * cpu_fbatches.activate. Otherwise, assume the folio is in a
                 * folio_batch, mark it active and it'll be moved to the active
                 * LRU on the next drain.
                 */
                if (folio_test_lru(folio))
                        // queue the folio for activation via the per-CPU lru
                        // cache; if it is already queued there, nothing happens.
                        // This looks suspicious to me: if the fbatch is full and
                        // never drained, wouldn't it overflow?
                        folio_activate(folio);
                else
                        __lru_cache_activate_folio(folio);
                folio_clear_referenced(folio);
                workingset_activation(folio);
        }
        if (folio_test_idle(folio))
                folio_clear_idle(folio);
}
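
As for the overflow worry in the comment above: folio_activate() (mm/swap.c; sketched roughly below for the SMP case, details may vary by version) funnels the folio through the same folio_batch_add_and_move() helper as lru_add, so the activate batch is likewise drained once it fills up:

void folio_activate(struct folio *folio)
{
        if (folio_test_lru(folio) && !folio_test_active(folio) &&
            !folio_test_unevictable(folio)) {
                struct folio_batch *fbatch;

                folio_get(folio);
                local_lock(&cpu_fbatches.lock);
                fbatch = this_cpu_ptr(&cpu_fbatches.activate);
                folio_batch_add_and_move(fbatch, folio, folio_activate_fn);
                local_unlock(&cpu_fbatches.lock);
        }
}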

Related APIs include folio_referenced() and folio_check_references().

/**
 * folio_referenced() - Test if the folio was referenced.
 * @folio: The folio to test.
 * @is_locked: Caller holds lock on the folio.
 * @memcg: target memory cgroup
 * @vm_flags: A combination of all the vma->vm_flags which referenced the folio.
 *
 * Quick test_and_clear_referenced for all mappings of a folio,
 *
 * Return: The number of mappings which referenced the folio. Return -1 if
 * the function bailed out due to rmap lock contention.
 */
int folio_referenced(struct folio *folio, int is_locked,
             struct mem_cgroup *memcg, unsigned long *vm_flags)
{
    int we_locked = 0;
    struct folio_referenced_arg pra = {
        .mapcount = folio_mapcount(folio),
        .memcg = memcg,
    };
    struct rmap_walk_control rwc = {
        // folio_referenced_one is the key function that ultimately decides
        // whether the folio has been referenced
        .rmap_one = folio_referenced_one,
        .arg = (void *)&pra,
        .anon_lock = folio_lock_anon_vma_read,
        .try_lock = true,
        .invalid_vma = invalid_folio_referenced_vma,
    };

    *vm_flags = 0;
    if (!pra.mapcount)
        return 0;

    if (!folio_raw_mapping(folio))
        return 0;

    if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
        we_locked = folio_trylock(folio);
        if (!we_locked)
            return 1;
    }

    // walk the reverse mapping to visit every pte that maps this folio
    rmap_walk(folio, &rwc);

    *vm_flags = pra.vm_flags;
    if (we_locked)
        folio_unlock(folio);

    return rwc.contended ? -1 : pra.referenced;
}

folio_referenced() uses the reverse mapping to find every pte that maps the folio, and then uses folio_referenced_one() to decide whether the folio has been referenced.
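
The pra argument threaded through the walk is a folio_referenced_arg (defined in mm/rmap.c), which accumulates the result:

struct folio_referenced_arg {
    int mapcount;
    int referenced;
    unsigned long vm_flags;
    struct mem_cgroup *memcg;
};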

Let's look at folio_referenced_one():

static bool folio_referenced_one(struct folio *folio,
        struct vm_area_struct *vma, unsigned long address, void *arg)
{
    struct folio_referenced_arg *pra = arg;
    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
    int referenced = 0;
    unsigned long start = address, ptes = 0;

    while (page_vma_mapped_walk(&pvmw)) {
        address = pvmw.address;

        if (vma->vm_flags & VM_LOCKED) {
            if (!folio_test_large(folio) || !pvmw.pte) {
                /* Restore the mlock which got missed */
                mlock_vma_folio(folio, vma);
                page_vma_mapped_walk_done(&pvmw);
                pra->vm_flags |= VM_LOCKED;
                return false; /* To break the loop */
            }
            /*
             * For large folio fully mapped to VMA, will
             * be handled after the pvmw loop.
             *
             * For large folio cross VMA boundaries, it's
             * expected to be picked  by page reclaim. But
             * should skip reference of pages which are in
             * the range of VM_LOCKED vma. As page reclaim
             * should just count the reference of pages out
             * the range of VM_LOCKED vma.
             */
            ptes++;
            pra->mapcount--;
            continue;
        }

        if (pvmw.pte) {
            if (lru_gen_enabled() &&
                pte_young(ptep_get(pvmw.pte))) {
                lru_gen_look_around(&pvmw);
                referenced++;
            }
            // was the accessed (young) bit set in this pte? if so, clear it
            if (ptep_clear_flush_young_notify(vma, address,
                        pvmw.pte))
                referenced++;
        } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
            if (pmdp_clear_flush_young_notify(vma, address,
                        pvmw.pmd))
                referenced++;
        } else {
            /* unexpected pmd-mapped folio? */
            WARN_ON_ONCE(1);
        }

        pra->mapcount--;
    }

    if ((vma->vm_flags & VM_LOCKED) &&
            folio_test_large(folio) &&
            folio_within_vma(folio, vma)) {
        unsigned long s_align, e_align;

        s_align = ALIGN_DOWN(start, PMD_SIZE);
        e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);

        /* folio doesn't cross page table boundary and fully mapped */
        if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
            /* Restore the mlock which got missed */
            mlock_vma_folio(folio, vma);
            pra->vm_flags |= VM_LOCKED;
            return false; /* To break the loop */
        }
    }

    if (referenced)
        folio_clear_idle(folio);
    if (folio_test_clear_young(folio))
        referenced++;

    if (referenced) {
        pra->referenced++;
        pra->vm_flags |= vma->vm_flags & ~VM_LOCKED;
    }

    if (!pra->mapcount)
        return false; /* To break the loop */

    return true;
}

In short, folio_referenced() returns how many mappings had referenced the folio and clears the accessed (reference) bits in the corresponding ptes.

Now let's look at folio_check_references().

static enum folio_references folio_check_references(struct folio *folio,
                          struct scan_control *sc)
{
    int referenced_ptes, referenced_folio;
    unsigned long vm_flags;
    // count how many mappings referenced this folio
    referenced_ptes = folio_referenced(folio, 1, sc->target_mem_cgroup,
                       &vm_flags);
    // test and clear the folio's referenced flag (PG_referenced)
    referenced_folio = folio_test_clear_referenced(folio);

    /*
     * The supposedly reclaimable folio was found to be in a VM_LOCKED vma.
     * Let the folio, now marked Mlocked, be moved to the unevictable list.
     */
    if (vm_flags & VM_LOCKED)
        return FOLIOREF_ACTIVATE;

    /* rmap lock contention: rotate */
    if (referenced_ptes == -1)
        return FOLIOREF_KEEP;

    if (referenced_ptes) {
        /*
         * All mapped folios start out with page table
         * references from the instantiating fault, so we need
         * to look twice if a mapped file/anon folio is used more
         * than once.
         *
         * Mark it and spare it for another trip around the
         * inactive list. Another page table reference will
         * lead to its activation.
         *
         * Note: the mark is set for activated folios as well
         * so that recently deactivated but used folios are
         * quickly recovered.
         */
        folio_set_referenced(folio);

        // referenced more than once? promote it to the active list
        if (referenced_folio || referenced_ptes > 1)
            return FOLIOREF_ACTIVATE;

        /*
         * Activate file-backed executable folios after first usage.
         */
        // executable file pages are activated after a single reference
        if ((vm_flags & VM_EXEC) && folio_is_file_lru(folio))
            return FOLIOREF_ACTIVATE;

        return FOLIOREF_KEEP;
    }

    /* Reclaim if clean, defer dirty folios to writeback */
    if (referenced_folio && folio_is_file_lru(folio))
        return FOLIOREF_RECLAIM_CLEAN;

    return FOLIOREF_RECLAIM;
}

folio_check_references() is responsible for deciding whether the current folio should be moved to the active list, kept, or reclaimed.
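
The possible verdicts are the folio_references enum in mm/vmscan.c:

enum folio_references {
    FOLIOREF_RECLAIM,
    FOLIOREF_RECLAIM_CLEAN,
    FOLIOREF_KEEP,
    FOLIOREF_ACTIVATE,
};

FOLIOREF_ACTIVATE promotes the folio to the active list, FOLIOREF_KEEP leaves it on the inactive list for another pass, and the two RECLAIM values hand it to reclaim (clean file pages can be dropped right away, dirty ones must go through writeback first).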

In the next article we will look at how the kernel actually carries out page reclaim.
