A Detailed Explanation of the Slab Mechanism

http://blog.csdn.net/u010246947/article/details/10133101

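Plenty of articles cover slab, but they tend to either explain the principles with a pile of diagrams and little connection to the source, which makes them hard to follow, or annotate the code line by line, which is even harder to follow. After a week of intensive study, I am writing up a detailed summary of slab here, aiming to cover it from principle to source code so that it can be understood thoroughly and put to use right away.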

1. Overview of the slab allocator:

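With the buddy system we can obtain physically contiguous memory in units of pages, that is, in multiples of 4 KB. But what if we need to allocate and free small contiguous pieces of physical memory frequently, say a few dozen or a few hundred bytes at a time? The buddy system is a poor fit for that, which is where slab comes in.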

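Suppose I need 100 bytes of physically contiguous memory. The kernel's slab allocator hands me a contiguous memory unit of a matching size, which is 128 bytes (not exactly 100 bytes, but the size class that fits the request: with the size classes discussed below, 100 bytes maps to 128 bytes, 30 bytes to 32 bytes, and 60 bytes to 64 bytes). This memory still comes from physical pages obtained from the buddy system. When I no longer need it I should free it, but freeing does not return it to the buddy system; it goes back to the slab allocator, so the next request can be served without going to the buddy system again. This is also why the slab allocator tends to hand the most recently freed ("hot") memory to the next requester, which is more efficient.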
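The "most recently freed memory is handed out first" behaviour can be illustrated with a toy LIFO stack in user space. This is only a sketch: the names free_stack, toy_free, toy_alloc and STACK_LIMIT are made up for illustration and are not kernel identifiers, and a real allocator does far more (refilling from slabs, flushing when full).

```c
#include <assert.h>
#include <stddef.h>

#define STACK_LIMIT 8   /* hypothetical capacity of the toy cache */

/* Toy model of a cache of freed objects: a plain LIFO stack of pointers. */
struct free_stack {
        size_t avail;                 /* number of cached free objects */
        void *entry[STACK_LIMIT];     /* entry[avail - 1] was freed most recently */
};

/* "Free" an object: push it onto the stack. Returns 0 on success, -1 if full. */
static int toy_free(struct free_stack *fs, void *obj)
{
        assert(fs != NULL && obj != NULL);
        if (fs->avail == STACK_LIMIT)
                return -1;            /* a real allocator would flush objects back here */
        fs->entry[fs->avail++] = obj;
        return 0;
}

/* "Allocate" an object: pop the most recently freed one, or NULL if empty. */
static void *toy_alloc(struct free_stack *fs)
{
        assert(fs != NULL);
        if (fs->avail == 0)
                return NULL;          /* a real allocator would refill from a slab here */
        return fs->entry[--fs->avail];
}
```

Freeing a and then b, and allocating twice, returns b first and a second: the object freed last is the one most likely to still be in the CPU cache.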

2. Creating a slab:

2.1 What creating a slab means:

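The example above requested 100 bytes of contiguous physical memory and noted that 128 bytes were actually allocated. In other words, the slab allocator sorts requests into size classes. This is one of its basic principles: allocate memory of the size class that matches the requested size.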

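It also implies that the kernel must already hold slab memory units for each of these sizes, that is, such caches have already been created; otherwise, how could a request be matched to the right size? To see this, look at the implementation of kmalloc: kmalloc -> __do_kmalloc, where __do_kmalloc reads as follows: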

static __always_inline void *__do_kmalloc(size_t size, gfp_t flags, void *caller)
{
        struct kmem_cache *cachep;
        void *ret;

        /* Find a general cache of a suitable size */
        cachep = __find_general_cachep(size, flags);
        if (unlikely(ZERO_OR_NULL_PTR(cachep)))
                return cachep;
        ret = __cache_alloc(cachep, flags, caller);

        trace_kmalloc((unsigned long) caller, ret,
                      size, cachep->buffer_size, flags);

        return ret;
}

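The key line is the call to __find_general_cachep: kmalloc is asked for size bytes of physical memory, and it first has to find the cache that matches this size. What a cache is will be explained shortly; first look at __find_general_cachep: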

static inline struct kmem_cache *__find_general_cachep(size_t size, gfp_t gfpflags)
{
        struct cache_sizes *csizep = malloc_sizes;

#if DEBUG
        /* This happens if someone tries to call
         * kmem_cache_create(), or __kmalloc(), before
         * the generic caches are initialized.
         */
        BUG_ON(malloc_sizes[INDEX_AC].cs_cachep == NULL);
#endif
        if (!size)
                return ZERO_SIZE_PTR;

        /* The only part of this function that matters: find the first
         * cache_sizes entry that is large enough */
        while (size > csizep->cs_size)
                csizep++;

        /*
         * Really subtle: The last entry with cs->cs_size==ULONG_MAX
         * has cs_{dma,}cachep==NULL. Thus no special case
         * for large kmalloc calls required.
         */
#ifdef CONFIG_ZONE_DMA
        if (unlikely(gfpflags & GFP_DMA))
                return csizep->cs_dmacachep;
#endif
        return csizep->cs_cachep;
}

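As the comment marks, the while loop is the only part of this function that matters. csizep starts at the global array malloc_sizes and is advanced as long as size is larger than the cs_size member of the current entry. So now we need to look at what malloc_sizes is: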

struct cache_sizes malloc_sizes[] = {
#define CACHE(x) { .cs_size = (x) },
#include <linux/kmalloc_sizes.h>
        CACHE(ULONG_MAX)
#undef CACHE
};

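Look at linux/kmalloc_sizes.h. The file is too long to list here; it consists entirely of CACHE(x) macro invocations. With the configuration used here (L1_CACHE_BYTES is 32 and KMALLOC_MAX_SIZE is 4194304), it expands to CACHE(32), CACHE(64), CACHE(96), CACHE(128), CACHE(192), CACHE(256), CACHE(512), CACHE(1024), CACHE(2048), CACHE(4096), CACHE(8192), CACHE(16384), CACHE(32768), CACHE(65536), CACHE(131072), CACHE(262144), CACHE(524288), CACHE(1048576), CACHE(2097152) and CACHE(4194304). Together with the trailing CACHE(ULONG_MAX) in the array definition above (0xffffffff on a 32-bit system), that makes 21 entries in total. Now look at the structure type struct cache_sizes, which on ARM effectively has two members: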

struct cache_sizes {
        size_t                  cs_size;
        struct kmem_cache       *cs_cachep;
#ifdef CONFIG_ZONE_DMA
        struct kmem_cache       *cs_dmacachep;
#endif
};

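Apart from x86, most architectures have no requirement that DMA memory lie in the first 16 MB of physical memory, so many configurations, including the ARM one discussed here, do not define CONFIG_ZONE_DMA. The structure therefore effectively has just the two members cs_size and cs_cachep. The picture is now fairly clear: the global array malloc_sizes has 21 entries, each with cs_size defined, running from 32 to 4194304 plus 0xffffffff, and each with cs_cachep initially NULL. These values are the slab allocator's size classes.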

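Back in __find_general_cachep, things are now clear: starting from entry 0 of malloc_sizes, whenever the requested length is larger than the entry's cs_size, move on to the next entry, and stop at the first entry whose cs_size is at least the requested length. Taking the 100-byte request again: the 96-byte class is still smaller than the request, while the 128-byte class is large enough. That is why a 100-byte request actually gets a 128-byte memory unit.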
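The lookup can be reproduced in user space with a plain table walk. The sketch below assumes the size classes listed in the text (L1_CACHE_BYTES of 32, 4 MB maximum); size_classes and find_size_class are illustrative names, not kernel symbols, and the function returns the class size rather than a cache pointer.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Size classes from the text, terminated by a ULONG_MAX sentinel
 * in the same way as malloc_sizes[]. */
static const unsigned long size_classes[] = {
        32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192, 16384,
        32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304,
        ULONG_MAX
};

/* Return the class size that a request of `size` bytes is served from.
 * ULONG_MAX means the request is larger than every real class. */
static unsigned long find_size_class(unsigned long size)
{
        const unsigned long *p = size_classes;

        assert(size > 0);       /* the kernel returns ZERO_SIZE_PTR for size 0 */
        while (size > *p)       /* same test as in __find_general_cachep */
                p++;
        return *p;
}
```

The sentinel is what lets the loop terminate without a bounds check: every request is at most ULONG_MAX, so the walk always stops inside the table.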

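Back in __do_kmalloc, the next call is __cache_alloc, which allocates a memory unit of the size class determined above for the caller. This shows that the kernel is able to hand out memory units of that size.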

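Why is the kernel able to create such memory units? The slab allocator cannot hand out memory by size class from the very start. It first has to create something like a "rule", and only then can it allocate memory of the corresponding size according to that "rule". Look at the definition of struct cache_sizes above: its member cs_cachep has the type struct kmem_cache *. This structure is the cache mentioned earlier. Every slab allocation of a given size goes through its cache; put differently, each cache serves slab allocations of one size.

This is a good point to look at the slab allocation interfaces, kmalloc and kmem_cache_alloc. kmalloc takes the easy parameter: you pass the length you want, the slab allocator works out which size class it belongs to, and the kmem_cache of that class performs the allocation. kmem_cache_alloc is more "professional": instead of a length it takes a kmem_cache pointer, which directly names the size class to allocate from. A quick look at the two call chains shows that both soon reach __cache_alloc; only the steps before it differ.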

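Suppose a kernel module wants to allocate a structure of its own that is 111 bytes long, and it wants objects sized for 111 bytes rather than 128-byte units. It then has to create such a "rule" in the slab allocator, stating that allocations made under this "rule" yield objects for 111-byte requests. The way to create the "rule" is to call kmem_cache_create. (As the source below shows, the size is still rounded up for alignment, so the stored object size can be slightly larger than 111.)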

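Likewise, for the slab allocator to offer the 20 default size classes from 32 to 4194304 bytes, 20 such "rules" must have been created. This happens during initialization, in kmem_cache_init. Do not dwell on kmem_cache_init yet; parts of it are easier to understand once the principles of the slab allocator are clear. First look at kmem_cache_create: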

 

2.2 The slab creation process:

Reading the member definitions of struct kmem_cache at this point would not tell us much, so go straight to the function source:

struct kmem_cache *
kmem_cache_create (const char *name, size_t size, size_t align,
        unsigned long flags, void (*ctor)(void *))
{
        size_t left_over, slab_size, ralign;
        struct kmem_cache *cachep = NULL, *pc;
        gfp_t gfp;

        /*
         * Sanity checks... these are all serious usage bugs.
         */
        /* Parameter checks: name must not be NULL; must not be called from
         * interrupt context (this function may sleep); size must be at least
         * BYTES_PER_WORD (4 bytes, the CPU word size here) and at most
         * KMALLOC_MAX_SIZE (1 << 22 = 4 MB here) */
        if (!name || in_interrupt() || (size < BYTES_PER_WORD) ||
            size > KMALLOC_MAX_SIZE) {
                printk(KERN_ERR "%s: Early error in slab %s\n", __func__,
                                name);
                BUG();
        }

        /*
         * We use cache_chain_mutex to ensure a consistent view of
         * cpu_online_mask as well.  Please see cpuup_callback
         */
        if (slab_is_available()) {
                get_online_cpus();
                mutex_lock(&cache_chain_mutex);
        }

        /* Sanity checks on the existing caches; can be skipped on a first reading */
        list_for_each_entry(pc, &cache_chain, next) {
                char tmp;
                int res;

                /*
                 * This happens when the module gets unloaded and doesn't
                 * destroy its slab cache and no-one else reuses the vmalloc
                 * area of the module.  Print a warning.
                 */
                res = probe_kernel_address(pc->name, tmp);
                if (res) {
                        printk(KERN_ERR
                               "SLAB: cache with size %d has lost its name\n",
                               pc->buffer_size);
                        continue;
                }

                if (!strcmp(pc->name, name)) {
                        printk(KERN_ERR
                               "kmem_cache_create: duplicate cache %s\n", name);
                        dump_stack();
                        goto oops;
                }
        }

#if DEBUG
        WARN_ON(strchr(name, ' '));     /* It confuses parsers */
#if FORCED_DEBUG
        /*
         * Enable redzoning and last user accounting, except for caches with
         * large objects, if the increased size would increase the object size
         * above the next power of two: caches with object sizes just above a
         * power of two have a significant amount of internal fragmentation.
         */
        if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
                                                2 * sizeof(unsigned long long)))
                flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
        if (!(flags & SLAB_DESTROY_BY_RCU))
                flags |= SLAB_POISON;
#endif
        if (flags & SLAB_DESTROY_BY_RCU)
                BUG_ON(flags & SLAB_POISON);
#endif
        /*
         * Always checks flags, a caller might be expecting debug support which
         * isn't available.
         */
        BUG_ON(flags & ~CREATE_MASK);

        /*
         * Check that size is in terms of words.  This is needed to avoid
         * unaligned accesses for some archs when redzoning is used, and makes
         * sure any on-slab bufctl's are also correctly aligned.
         */
        /* What follows is a series of alignment adjustments */
        if (size & (BYTES_PER_WORD - 1)) {
                size += (BYTES_PER_WORD - 1);
                size &= ~(BYTES_PER_WORD - 1);
        }

        /* calculate the final buffer alignment: */

        /* 1) arch recommendation: can be overridden for debug */
        if (flags & SLAB_HWCACHE_ALIGN) {
                /*
                 * Default alignment: as specified by the arch code.  Except if
                 * an object is really small, then squeeze multiple objects into
                 * one cacheline.
                 */
                ralign = cache_line_size();
                while (size <= ralign / 2)
                        ralign /= 2;
        } else {
                ralign = BYTES_PER_WORD;
        }

        /*
         * Redzoning and user store require word alignment or possibly larger.
         * Note this will be overridden by architecture or caller mandated
         * alignment if either is greater than BYTES_PER_WORD.
         */
        if (flags & SLAB_STORE_USER)
                ralign = BYTES_PER_WORD;

        if (flags & SLAB_RED_ZONE) {
                ralign = REDZONE_ALIGN;
                /* If redzoning, ensure that the second redzone is suitably
                 * aligned, by adjusting the object size accordingly. */
                size += REDZONE_ALIGN - 1;
                size &= ~(REDZONE_ALIGN - 1);
        }

        /* 2) arch mandated alignment */
        if (ralign < ARCH_SLAB_MINALIGN) {
                ralign = ARCH_SLAB_MINALIGN;
        }
        /* 3) caller mandated alignment */
        if (ralign < align) {
                ralign = align;
        }
        /* disable debug if necessary */
        if (ralign > __alignof__(unsigned long long))
                flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
        /*
         * 4) Store it.
         */
        align = ralign;

        if (slab_is_available())
                gfp = GFP_KERNEL;
        else
                gfp = GFP_NOWAIT;

        /* Get cache's description obj. */
        /* Allocate a new kmem_cache instance from the cache_cache cache */
        cachep = kmem_cache_zalloc(&cache_cache, gfp);
        if (!cachep)
                goto oops;

#if DEBUG
        cachep->obj_size = size;

        /*
         * Both debugging options require word-alignment which is calculated
         * into align above.
         */
        if (flags & SLAB_RED_ZONE) {
                /* add space for red zone words */
                cachep->obj_offset += sizeof(unsigned long long);
                size += 2 * sizeof(unsigned long long);
        }
        if (flags & SLAB_STORE_USER) {
                /* user store requires one word storage behind the end of
                 * the real object. But if the second red zone needs to be
                 * aligned to 64 bits, we must allow that much space.
                 */
                if (flags & SLAB_RED_ZONE)
                        size += REDZONE_ALIGN;
                else
                        size += BYTES_PER_WORD;
        }
#if FORCED_DEBUG && defined(CONFIG_DEBUG_PAGEALLOC)
        if (size >= malloc_sizes[INDEX_L3 + 1].cs_size
            && cachep->obj_size > cache_line_size() && size < PAGE_SIZE) {
                cachep->obj_offset += PAGE_SIZE - size;
                size = PAGE_SIZE;
        }
#endif
#endif

        /*
         * Determine if the slab management is 'on' or 'off' slab.
         * (bootstrapping cannot cope with offslab caches so don't do
         * it too early on.)
         */
        /* Decide how the slab management object is stored: on-slab or
         * off-slab. Normally the off-slab layout is used once the object size
         * is >= PAGE_SIZE >> 3 (512 bytes here). Early initialization always
         * uses on-slab (kmem_cache_init clears slab_early_init after creating
         * its first two general caches) */
        if ((size >= (PAGE_SIZE >> 3)) && !slab_early_init)
                /*
                 * Size is large, assume best to place the slab management obj
                 * off-slab (should allow better packing of objs).
                 */
                flags |= CFLGS_OFF_SLAB;
        size = ALIGN(size, align);
        /* Compute the fragment size, the number of pages per slab, and the
         * number of objects per slab */
        left_over = calculate_slab_order(cachep, size, align, flags);
        if (!cachep->num) {
                printk(KERN_ERR
                       "kmem_cache_create: couldn't create cache %s.\n", name);
                kmem_cache_free(&cache_cache, cachep);
                cachep = NULL;
                goto oops;
        }
        /* Size of the slab management object: the struct slab plus the
         * kmem_bufctl_t array */
        slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t)
                          + sizeof(struct slab), align);
        /*
         * If the slab has been placed off-slab, and we have enough space then
         * move it on-slab. This is at the expense of any extra colouring.
         */
        /* If this is an off-slab cache and the fragment is at least as large
         * as the slab management object, move the management object into the
         * slab, turning the cache into an on-slab one */
        if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
                /* Clear the off-slab flag */
                flags &= ~CFLGS_OFF_SLAB;
                /* The fragment shrinks accordingly */
                left_over -= slab_size;
        }
        /* For a cache that really is off-slab, the management object needs
         * no alignment; restore its unaligned size */
        if (flags & CFLGS_OFF_SLAB) {
                /* really off slab. No need for manual alignment */
                slab_size =
                    cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab);
#ifdef CONFIG_PAGE_POISONING
                /* If we're going to use the generic kernel_map_pages()
                 * poisoning, then it's going to smash the contents of
                 * the redzone and userword anyhow, so switch them off.
                 */
                if (size % PAGE_SIZE == 0 && flags & SLAB_POISON)
                        flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
#endif
        }
        /* Colour unit: the cache line size (32 bytes here) */
        cachep->colour_off = cache_line_size();
        /* Offset must be a multiple of the alignment. */
        if (cachep->colour_off < align)
                cachep->colour_off = align;
        /* Number of colour units that fit in the fragment area */
        cachep->colour = left_over / cachep->colour_off;
        /* Size of the management object */
        cachep->slab_size = slab_size;
        cachep->flags = flags;
        cachep->gfpflags = 0;
        /* On ARM the following if can be ignored, since DMA zones are not a concern */
        if (CONFIG_ZONE_DMA_FLAG && (flags & SLAB_CACHE_DMA))
                cachep->gfpflags |= GFP_DMA;
        /* Size of a slab object */
        cachep->buffer_size = size;
        /* Reciprocal of the object size, used to compute an object's index
         * within a slab; see obj_to_index() */
        cachep->reciprocal_buffer_size = reciprocal_value(size);
        /* Off-slab: pick the general cache from which slab management objects
         * will be allocated and store it in slabp_cache; for an on-slab cache
         * this pointer stays NULL */
        if (flags & CFLGS_OFF_SLAB) {
                cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
                /*
                 * This is a possibility for one of the malloc_sizes caches.
                 * But since we go off slab only for object size greater than
                 * PAGE_SIZE/8, and malloc_sizes gets created in ascending order,
                 * this should not happen at all.
                 * But leave a BUG_ON for some lucky dude.
                 */
                BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));
        }
        /* Constructor and name of the cache */
        cachep->ctor = ctor;
        cachep->name = name;

        /* Set up the per-CPU local cache and the three slab lists */
        if (setup_cpu_cache(cachep, gfp)) {
                __kmem_cache_destroy(cachep);
                cachep = NULL;
                goto oops;
        }

        /* cache setup completed, link it into the list */
        list_add(&cachep->next, &cache_chain);
oops:
        if (!cachep && (flags & SLAB_PANIC))
                panic("kmem_cache_create(): failed to create slab `%s'\n",
                      name);
        if (slab_is_available()) {
                mutex_unlock(&cache_chain_mutex);
                put_online_cpus();
        }
        return cachep;
}

---------------------------------------------------------------------------------------------------------------------------------

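Everything before "if (slab_is_available()) gfp = GFP_KERNEL;" can be skipped on a first reading: it checks the execution context and the parameters (note that this function may sleep, so it must never be called from interrupt context) and then works through the alignment rules. Now look at this part: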

if (slab_is_available())
        gfp = GFP_KERNEL;
else
        gfp = GFP_NOWAIT;

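Here the value of gfp is chosen according to whether slab initialization has completed. gfp is nothing new: it specifies where and how memory is obtained from the buddy system. Once slab initialization has completed, gfp is GFP_KERNEL, which is why the function may sleep; before that, gfp is GFP_NOWAIT, which means it will not sleep.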
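The alignment steps that come before this point in kmem_cache_create were skipped above, but the two central ones are easy to try out in user space. The sketch below is a simplification under stated assumptions: WORD_BYTES and CACHE_LINE are fixed at 4 and 32 bytes, hw_align stands in for the SLAB_HWCACHE_ALIGN flag, and the debug flags and ARCH_SLAB_MINALIGN are left out. round_to_word and pick_align are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

#define WORD_BYTES 4u    /* assumption: 32-bit machine word */
#define CACHE_LINE 32u   /* assumption: 32-byte cache line */

/* Round size up to a multiple of the machine word, as the size check does. */
static size_t round_to_word(size_t size)
{
        if (size & (WORD_BYTES - 1)) {
                size += WORD_BYTES - 1;
                size &= ~(size_t)(WORD_BYTES - 1);
        }
        return size;
}

/* Pick the alignment: with hw_align set, start from the cache line and halve
 * it while the object still fits in half of it; then honour the caller's
 * requested alignment if that is larger. */
static size_t pick_align(size_t size, int hw_align, size_t caller_align)
{
        size_t ralign = WORD_BYTES;

        assert(size > 0);
        if (hw_align) {
                ralign = CACHE_LINE;
                while (size <= ralign / 2)
                        ralign /= 2;
        }
        if (ralign < caller_align)
                ralign = caller_align;
        return ralign;
}
```

Under these assumptions the 111-byte structure from section 2.1 is rounded up to 112 bytes, and a 12-byte object with hardware cache alignment gets a 16-byte alignment so that two objects share one cache line.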

---------------------------------------------------------------------------------------------------------------------------------

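Next a kmem_cache structure is obtained by calling kmem_cache_zalloc. The only difference from kmem_cache_alloc is that it zeroes the allocated area, by adding the __GFP_ZERO flag to the gfp argument of kmem_cache_alloc; otherwise the two are the same.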

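From the analysis in section 2.1 we know that obtaining memory of some length from the slab allocator requires a "rule" for it. Here we need memory the size of a kmem_cache structure, so a "rule" of that length is needed too. Indeed, this "rule" is also created in the initialization function kmem_cache_init, and the result is the global variable cache_cache. So whenever memory the size of a kmem_cache structure is needed, it is allocated through cache_cache, a kmem_cache that has already been set up.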

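The global cache_cache is not a good example for understanding slab creation, though, for reasons that will become clear later. To understand slab, keep following kmem_cache_create. The next step decides where the slab management object is stored: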

if ((size >= (PAGE_SIZE >> 3)) && !slab_early_init)
        /*
         * Size is large, assume best to place the slab management obj
         * off-slab (should allow better packing of objs).
         */
        flags |= CFLGS_OFF_SLAB;

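This introduces the two ways of storing the slab management object: on-slab and off-slab. Put simply, on-slab means the slab management data and the memory actually handed out both live in the memory region obtained for the slab, while off-slab means the management data gets a separately allocated region of its own, apart from the region that holds the objects. The slab management data consists of the slab structure and the object descriptors, which are described in detail later. The if means: once early slab initialization is over, if the object size of the "rule" being created is greater than or equal to PAGE_SIZE >> 3 (512 bytes with 4 KB pages), the off-slab layout is used; otherwise the on-slab layout is used. During early initialization (while slab_early_init is set) the on-slab layout is always used.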

---------------------------------------------------------------------------------------------------------------------------------

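Next comes left_over = calculate_slab_order(cachep, size, align, flags);. For the object size size of the "rule" being created, this computes how many physical pages each slab should consist of, how many objects of that size it holds, and how much fragmentation it has (the fragment is the part of the slab's memory that is left over after the objects, and the on-slab management data if any, and cannot be used):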

static size_t calculate_slab_order(struct kmem_cache *cachep,
                        size_t size, size_t align, unsigned long flags)
{
        unsigned long offslab_limit;
        size_t left_over = 0;
        int gfporder;

        for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
                unsigned int num;
                size_t remainder;

                /* Compute the number of objects per slab and the wasted space.
                 * Arguments:
                 *   gfporder:  the slab is 2^gfporder pages
                 *   size:      object size (buffer_size)
                 *   align:     object alignment
                 *   flags:     off-slab or on-slab
                 *   remainder: space wasted in the slab (the fragment)
                 *   num:       number of objects in the slab */
                cache_estimate(gfporder, size, align, flags, &remainder, &num);
                if (!num)
                        continue;
                if (flags & CFLGS_OFF_SLAB) {
                        /*
                         * Max number of objs-per-slab for caches which
                         * use off-slab slabs. Needed to avoid a possible
                         * looping condition in cache_grow().
                         */
                        offslab_limit = size - sizeof(struct slab);
                        offslab_limit /= sizeof(kmem_bufctl_t);
                        if (num > offslab_limit)
                                break;
                }
                /* Found something acceptable - save it away */
                /* Record the object count, the page order (normally at most
                 * 2^1 = 2 pages) and the wasted space */
                cachep->num = num;
                cachep->gfporder = gfporder;
                left_over = remainder;
                /*
                 * A VFS-reclaimable slab tends to have most allocations
                 * as GFP_NOFS and we really don't want to have to be allocating
                 * higher-order pages when we are unable to shrink dcache.
                 */
                /* SLAB_RECLAIM_ACCOUNT marks the pages of this slab as
                 * reclaimable: they are counted when the kernel checks whether
                 * enough pages are available for user space, and
                 * kmem_freepages() can release the page frames given to the
                 * slab. Being reclaimable, the cache skips the fragmentation
                 * checks below */
                if (flags & SLAB_RECLAIM_ACCOUNT)
                        break;
                /*
                 * Large number of objects is good, but very large slabs are
                 * currently bad for the gfp()s.
                 */
                /* slab_break_gfp_order is 1 after initialization here, so a
                 * slab is normally at most 2^1 = 2 pages */
                if (gfporder >= slab_break_gfp_order)
                        break;
                /*
                 * Acceptable internal fragmentation?
                 */
                /* The slab is at least 8 times the size of the fragment, so
                 * page utilization is good enough; accept this order */
                if (left_over * 8 <= (PAGE_SIZE << gfporder))
                        break;
        }
        /* Return the fragment size */
        return left_over;
}

static void cache_estimate(unsigned long gfporder, size_t buffer_size,
                           size_t align, int flags, size_t *left_over,
                           unsigned int *num)
{
        int nr_objs;
        size_t mgmt_size;
        /* The slab is 2^gfporder pages */
        size_t slab_size = PAGE_SIZE << gfporder;

        /*
         * The slab management structure can be either off the slab or
         * on it. For the latter case, the memory allocated for a
         * slab is used for:
         *
         * - The struct slab
         * - One kmem_bufctl_t for each object
         * - Padding to respect alignment of @align
         * - @buffer_size bytes for each object
         *
         * If the slab management structure is off the slab, then the
         * alignment will already be calculated into the size. Because
         * the slabs are all pages aligned, the objects will be at the
         * correct alignment when allocated.
         */
        /* Off-slab: the management object takes no space in the slab, so the
         * object count is simply the slab size divided by the object size */
        if (flags & CFLGS_OFF_SLAB) {
                /* No management object inside the slab; all of it holds objects */
                mgmt_size = 0;
                /* Object count = slab size / object size */
                nr_objs = slab_size / buffer_size;
                /* The object count must not exceed the limit */
                if (nr_objs > SLAB_LIMIT)
                        nr_objs = SLAB_LIMIT;
        } else {
                /* On-slab: subtract the space taken by the slab management
                 * object, which consists of the struct slab and one object
                 * descriptor per object */
                /*
                 * Ignore padding for the initial guess. The padding
                 * is at most @align-1 bytes, and @buffer_size is at
                 * least @align. In the worst case, this result will
                 * be one greater than the number of objects that fit
                 * into the memory allocation when taking the padding
                 * into account.
                 */
                /* On-slab, the management object sits together with the
                 * objects, so the slab pages hold: one struct slab, one
                 * kmem_bufctl_t array (with as many entries as there are
                 * objects), and the objects themselves. The object count is
                 * therefore (slab size - sizeof(struct slab)) divided by
                 * (object size + sizeof(kmem_bufctl_t)) */
                nr_objs = (slab_size - sizeof(struct slab)) /
                          (buffer_size + sizeof(kmem_bufctl_t));
                /*
                 * This calculated number will be either the right
                 * amount, or one greater than what we want.
                 */
                /* If the aligned total exceeds the slab size, drop one object */
                if (slab_mgmt_size(nr_objs, align) + nr_objs*buffer_size
                       > slab_size)
                        nr_objs--;
                /* The object count must not exceed the limit */
                if (nr_objs > SLAB_LIMIT)
                        nr_objs = SLAB_LIMIT;
                /* Aligned total size of the slab management object */
                mgmt_size = slab_mgmt_size(nr_objs, align);
        }
        /* Final number of objects in the slab */
        *num = nr_objs;
        /* With the management size known (0 off-slab, computed above
         * on-slab), the wasted space of the slab follows */
        *left_over = slab_size - nr_objs*buffer_size - mgmt_size;
}

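By calling cache_estimate in a for loop, calculate_slab_order determines, for the object size size of the "rule" being created, how many physical pages each slab has, how many objects each slab holds, and how large each slab's fragment is. For small objects a slab is at most 2 physical pages (order slab_break_gfp_order); an object too large to fit makes the loop continue to higher orders, because the loop only stops once num is non-zero. The number of objects of size size that fit depends on whether the cache is off-slab or on-slab: off-slab, the slab management object takes no space in the slab's pages; on-slab, it does. The slab management object consists of the slab structure plus one object descriptor per object.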

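Summary: after calculate_slab_order returns, we know how many physical pages such a slab requests from the buddy system (cachep->gfporder; at most 2 pages for small objects), how many objects of the desired size it holds (cachep->num), and how large each slab's fragment is (the variable left_over).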
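The arithmetic in cache_estimate can be checked with a small user-space model. This is a sketch under explicit assumptions: 4 KB pages, 28 bytes as a stand-in for sizeof(struct slab) and 4 bytes for sizeof(kmem_bufctl_t) (the real values depend on the kernel build), and no SLAB_LIMIT clamp. The names estimate, mgmt_bytes and align_up are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES     4096u  /* assumption: 4 KB pages */
#define SLAB_HDR_BYTES 28u    /* assumption: stand-in for sizeof(struct slab) */
#define BUFCTL_BYTES   4u     /* assumption: stand-in for sizeof(kmem_bufctl_t) */

/* Round x up to a multiple of a (a must be a power of two). */
static size_t align_up(size_t x, size_t a)
{
        return (x + a - 1) & ~(a - 1);
}

/* Aligned size of the on-slab management area for nr_objs objects. */
static size_t mgmt_bytes(size_t nr_objs, size_t align)
{
        return align_up(SLAB_HDR_BYTES + nr_objs * BUFCTL_BYTES, align);
}

/* Model of cache_estimate(): objects per slab and leftover bytes. */
static void estimate(unsigned order, size_t obj_size, size_t align,
                     int off_slab, size_t *left_over, size_t *num)
{
        size_t slab_bytes = (size_t)PAGE_BYTES << order;
        size_t nr, mgmt = 0;

        assert(obj_size > 0 && align > 0 && (align & (align - 1)) == 0);
        if (off_slab) {
                /* management data lives elsewhere: every byte can hold objects */
                nr = slab_bytes / obj_size;
        } else {
                /* first guess ignores padding, so it may be one too many */
                nr = (slab_bytes - SLAB_HDR_BYTES) / (obj_size + BUFCTL_BYTES);
                if (mgmt_bytes(nr, align) + nr * obj_size > slab_bytes)
                        nr--;
                mgmt = mgmt_bytes(nr, align);
        }
        *num = nr;
        *left_over = slab_bytes - nr * obj_size - mgmt;
}
```

With these assumed sizes, an on-slab cache of 128-byte objects aligned to 32 bytes fits 30 objects in one page and leaves 96 bytes over, which passes the "left_over * 8 <= slab size" test, so order 0 is accepted. An off-slab cache of 512-byte objects fits exactly 8 per page with nothing left over.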

---------------------------------------------------------------------------------------------------------------------------------

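The next part may convert an off-slab cache into an on-slab one, depending on the fragment size. It needs no special attention. This typically happens when, given the values of size and the alignment unit align, going on-slab saves a good deal of space, that is, the fragment shrinks considerably. The result is the management object size slab_size for the on-slab or off-slab case and the fragment size left_over (the source is not repeated here).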

Next, several attributes of the slab for this "rule" are set:

/* Colour unit: the cache line size (32 bytes here) */
cachep->colour_off = cache_line_size();
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < align)
        cachep->colour_off = align;
/* Number of colour units that fit in the fragment area */
cachep->colour = left_over / cachep->colour_off;
/* Size of the management object */
cachep->slab_size = slab_size;
cachep->flags = flags;
cachep->gfpflags = 0;
/* On ARM the following if can be ignored, since DMA zones are not a concern */
if (CONFIG_ZONE_DMA_FLAG && (flags & SLAB_CACHE_DMA))
        cachep->gfpflags |= GFP_DMA;
/* Size of a slab object */
cachep->buffer_size = size;
/* Reciprocal of the object size, used to compute an object's index
 * within a slab; see obj_to_index() */
cachep->reciprocal_buffer_size = reciprocal_value(size);
/* Off-slab: pick the general cache from which slab management objects will
 * be allocated and store it in slabp_cache; for an on-slab cache this
 * pointer stays NULL */
if (flags & CFLGS_OFF_SLAB) {
        cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
        /*
         * This is a possibility for one of the malloc_sizes caches.
         * But since we go off slab only for object size greater than
         * PAGE_SIZE/8, and malloc_sizes gets created in ascending order,
         * this should not happen at all.
         * But leave a BUG_ON for some lucky dude.
         */
        BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));
}
/* Constructor and name of the cache */
cachep->ctor = ctor;
cachep->name = name;

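The first few assignments concern colouring, which will be described later. Personally I do not think colouring needs much attention: it is worth knowing its principle and how it operates, but in practice it has drawbacks and makes slab management very complicated. Later Linux versions largely moved to the SLUB allocator in place of SLAB and its colouring mechanism.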
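To make the two colouring fields concrete: colour is how many distinct starting offsets fit into the fragment, and colour_off is the step between them. The sketch below models the usual round-robin use of these values when new slabs are created. That rotation logic is not part of the excerpt above, so treat it as an assumption about how the fields are consumed; colour_state, colour_init and next_colour_offset are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Toy per-cache colouring state. */
struct colour_state {
        size_t colour;       /* number of distinct colours: left_over / colour_off */
        size_t colour_off;   /* size of one colour step in bytes */
        size_t colour_next;  /* colour to use for the next new slab */
};

/* Derive the colour count from the fragment size, as kmem_cache_create does. */
static void colour_init(struct colour_state *cs, size_t left_over,
                        size_t colour_off)
{
        assert(cs != NULL && colour_off > 0);
        cs->colour = left_over / colour_off;
        cs->colour_off = colour_off;
        cs->colour_next = 0;
}

/* Return the byte offset for the next new slab and advance the colour,
 * wrapping around once every colour has been used. */
static size_t next_colour_offset(struct colour_state *cs)
{
        size_t offset = cs->colour_next * cs->colour_off;

        cs->colour_next++;
        if (cs->colour_next >= cs->colour)
                cs->colour_next = 0;
        return offset;
}
```

With a 96-byte fragment and a 32-byte colour unit there are 3 colours, so successive slabs would start their objects at offsets 0, 32, 64 and then 0 again. With no fragment at all the offset is always 0 and colouring has no effect.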

---------------------------------------------------------------------------------------------------------------------------------

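After that come the settings for the management object size slab_size, the slab flags, the object size buffer_size (and its reciprocal), the buddy-system flags gfpflags, the constructor ctor, and the slab name name.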

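Pay attention to where the management object of an off-slab slab lives. As already noted, it is not stored in the slab's own pages but in a separately allocated piece of memory. For an off-slab cache, the source sets slabp_cache to the general cache whose size class fits slab_size; the management object of each slab is then allocated from that cache. For an on-slab cache this member can be ignored; it stays NULL.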

---------------------------------------------------------------------------------------------------------------------------------

Next comes a key step: creating the local cache and the three slab lists for this cache, in the function setup_cpu_cache:

Now is a good time to look at struct kmem_cache, shown below:

struct kmem_cache {
        struct array_cache *array[NR_CPUS];
/* 2) Cache tunables. Protected by cache_chain_mutex */
        unsigned int batchcount;
        unsigned int limit;
        unsigned int shared;
        unsigned int buffer_size;
        u32 reciprocal_buffer_size;
/* 3) touched by every alloc & free from the backend */
        unsigned int flags;             /* constant flags */
        unsigned int num;               /* # of objs per slab */
/* 4) cache_grow/shrink */
        /* order of pgs per slab (2^n) */
        unsigned int gfporder;
        /* force GFP flags, e.g. GFP_DMA */
        gfp_t gfpflags;
        size_t colour;                  /* cache colouring range */
        unsigned int colour_off;        /* colour offset */
        struct kmem_cache *slabp_cache;
        unsigned int slab_size;
        unsigned int dflags;            /* dynamic flags */
        /* constructor func */
        void (*ctor)(void *obj);
/* 5) cache creation/removal */
        const char *name;
        struct list_head next;
/* 6) statistics */
#ifdef CONFIG_DEBUG_SLAB
        unsigned long num_active;
        unsigned long num_allocations;
        unsigned long high_mark;
        unsigned long grown;
        unsigned long reaped;
        unsigned long errors;
        unsigned long max_freeable;
        unsigned long node_allocs;
        unsigned long node_frees;
        unsigned long node_overflow;
        atomic_t allochit;
        atomic_t allocmiss;
        atomic_t freehit;
        atomic_t freemiss;
        /*
         * If debugging is enabled, then the allocator can add additional
         * fields and/or padding to every object. buffer_size contains the total
         * object size including these internal fields, the following two
         * variables contain the offset to the user object and its size.
         */
        int obj_offset;
        int obj_size;
#endif /* CONFIG_DEBUG_SLAB */
        /*
         * We put nodelists[] at the end of kmem_cache, because we want to size
         * this array to nr_node_ids slots instead of MAX_NUMNODES
         * (see kmem_cache_init())
         * We still use [MAX_NUMNODES] and not [1] or [0] because cache_cache
         * is statically defined, so we reserve the max number of nodes.
         */
        struct kmem_list3 *nodelists[MAX_NUMNODES];
        /*
         * Do not add fields after nodelists[]
         */
};

The members that have not been set so far are:

struct array_cache *array[NR_CPUS];
unsigned int batchcount;        /* number of objects moved in or out in one batch */
unsigned int limit;             /* maximum number of free objects kept in the local cache */
unsigned int shared;            /* whether a shared CPU cache exists */
unsigned int dflags;            /* dynamic flags (can be ignored for now) */
struct kmem_list3 *nodelists[MAX_NUMNODES];

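batchcount and limit relate to the actual allocation of memory, shared only matters with multiple CPUs, and dflags can be ignored for now. The members to focus on are array and nodelists, which are at the heart of how the requested memory is handed out: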

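When allocation actually starts, each CPU takes the objects it needs from its own entry in the array member of struct kmem_cache. If that entry holds no objects (they are used up, or this is the first use; on first use it is always empty), memory has to come from the buddy system. This goes through the three slab lists reached via the nodelists member (the name "slab three lists" comes from another article; it refers to the lists of completely free, partially used, and full slabs): physical pages are obtained from the buddy system, and the addresses of the objects in them are then passed to array.

Note that struct kmem_cache has one array entry per CPU (NR_CPUS is the maximum number of CPUs). The reason for this design is that if every allocation went through the three slab lists, multiple CPUs would contend for the slab spinlock, which would hurt performance. Notice how similar the array mechanism is to the hot/cold page mechanism of the buddy system. The details of how array and the three slab lists allocate memory are discussed later; for now a rough idea is enough. Continue with the function setup_cpu_cache: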

static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)

{

    /*此时初始化已经完毕(start_kernel的mm_init后调用kmem_cache_init_late),      直接使能local cache*/

         if (g_cpucache_up == FULL)

                   return enable_cpucache(cachep, gfp);

    /*这说明当前还处于初始化阶段,

      g_cpucache_up记录general cache初始化的进度,比如PARTIAL_AC表示struct array_cache所在的cache已经创建,

      PARTIAL_L3表示struct kmem_list3所在的cache已经创建,                                                      注意创建这两个cache的先后顺序。在初始化阶段只需配置主cpu的local cache和slab三链

     若g_cpucache_up为NONE,说明sizeof(struct array)大小的cache还没有创建,      初始化阶段创建sizeof(struct array)大小的cache时进入这个流程,此时struct arraycache_init所在的general cache还未创建,只能使用静态分配的全局变量initarray_generic表示的local cache*/

         if (g_cpucache_up == NONE) {

                   /*

                    * Note: the first kmem_cache_create must create the cache

                    * that's used by kmalloc(24), otherwise the creation of

                    * further caches will BUG().

                    */

                   cachep->array[smp_processor_id()] = &initarray_generic.cache;

                   /*

                    * If the cache that's used by kmalloc(sizeof(kmem_list3)) is

                    * the first cache, then we need to set up all its list3s,

                    * otherwise the creation of further caches will BUG().

                    */

                   /*创建struct kmem_list3所在的cache是在struct array_cache所在cache之后,  所以此时struct kmem_list3所在的cache也一定没有创建,也需要使用全局变量initkmem_list3*/

                   set_up_list3s(cachep, SIZE_AC);

 

        /*执行到这struct array_cache所在的cache创建完毕,如果struct kmem_list3和struct array_cache位于同一个general cache中,不会再重复创建了(不过显然不可能),       g_cpucache_up表示的进度更进一步*/

                   if (INDEX_AC == INDEX_L3)

                            g_cpucache_up = PARTIAL_L3;

                   else

                            g_cpucache_up = PARTIAL_AC;

         }

         /* Reached when g_cpucache_up is at least PARTIAL_AC: the general cache for struct arraycache_init exists, so kmalloc can be used */

    else {

                   cachep->array[smp_processor_id()] =

                            kmalloc(sizeof(struct arraycache_init), gfp);

                   /* The cache for struct kmem_list3 is not ready yet, so keep using the static three slab lists */

                   if (g_cpucache_up == PARTIAL_AC) {

            set_up_list3s(cachep, SIZE_L3);

                            g_cpucache_up = PARTIAL_L3;

                   }

                   /* Both caches exist; no static variables are needed */

        else {

                            int node;

                            for_each_online_node(node) {

                                     /* Allocate the struct kmem_list3 object with kmalloc_node */

                                     cachep->nodelists[node] =

                                         kmalloc_node(sizeof(struct kmem_list3),

                                                        gfp, node);

                                     BUG_ON(!cachep->nodelists[node]);

                                     /* Initialize the three slab lists */

                                     kmem_list3_init(cachep->nodelists[node]);

                            }

                   }

         }

         cachep->nodelists[numa_node_id()]->next_reap =

                            jiffies + REAPTIMEOUT_LIST3 +

                            ((unsigned long)cachep) % REAPTIMEOUT_LIST3;

         cpu_cache_get(cachep)->avail = 0;

         cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;

         cpu_cache_get(cachep)->batchcount = 1;

         cpu_cache_get(cachep)->touched = 0;

         cachep->batchcount = 1;

         cachep->limit = BOOT_CPUCACHE_ENTRIES;

         return 0;

}

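Start with the first test, "if (g_cpucache_up == FULL)". It involves the progress of slab initialization. The static variable g_cpucache_up, defined in slab.c, records how far slab initialization has progressed, in these stages: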

NONE: the very beginning;

PARTIAL_AC: after the cache of size sizeof(struct arraycache_init) has been created;

PARTIAL_L3: after the cache of size sizeof(struct kmem_list3) has been created;

EARLY: at the end of kmem_cache_init;

FULL: when start_kernel calls kmem_cache_init_late;

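First, the FULL case. Once the state is FULL, enable_cpucache is called. It derives the limit from the size of the objects to be allocated, then calls do_tune_cpucache and alloc_kmemlist to create the cache's array and three slab lists. I have analyzed both functions but do not consider them a priority, because at this stage they can simply use kmalloc to obtain space for the array and for the struct kmem_list3. The more interesting case is what happens before slab initialization completes: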

We already know that g_cpucache_up == NONE means the kernel has not yet created the cache of size sizeof(struct arraycache_init). So the first branch, if (g_cpucache_up == NONE), is the path taken while that cache is being created:

cachep->array[smp_processor_id()] = &initarray_generic.cache;

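Notice that the cache's array is assigned the address of a global variable. Why? At this point the kernel has never created a cache of size sizeof(struct arraycache_init), so the first array cannot be allocated with kmalloc and a static variable has to stand in for it. Here is that variable, initarray_generic: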

static struct arraycache_init initarray_generic = { {0, BOOT_CPUCACHE_ENTRIES, 1, 0} };

Its type is:

struct arraycache_init {

         struct array_cache cache;

         void *entries[BOOT_CPUCACHE_ENTRIES];

};

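BOOT_CPUCACHE_ENTRIES is 1, so the cache member (of type struct array_cache) of initarray_generic is initialized to {0, 1, 1, 0}, that is avail = 0, limit = 1, batchcount = 1, touched = 0, and its single entries slot is NULL. This serves as the local cache of the first cache, the one of size sizeof(struct arraycache_init). Right after that, set_up_list3s does the same for the three slab lists using the global initkmem_list3. Then g_cpucache_up advances to PARTIAL_AC, because INDEX_AC and INDEX_L3 are not expected to be equal (the code still handles the case where they are). For reference, this is what INDEX_AC and INDEX_L3 are: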

#define INDEX_AC index_of(sizeof(struct arraycache_init))

#define INDEX_L3 index_of(sizeof(struct kmem_list3))

static __always_inline int index_of(const size_t size)

{

         extern void __bad_size(void);

         if (__builtin_constant_p(size)) {

                   int i = 0;

#define CACHE(x) \

         if (size <=x) \

                   return i; \

         else \

                   i++;

#include <linux/kmalloc_sizes.h>

#undef CACHE

                  __bad_size();

         } else

                   __bad_size();

         return 0;

}

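This reuses the CACHE(x) declarations in include/linux/kmalloc_sizes.h with a different definition of the macro. Expanded, it becomes a chain of if/else tests, one per size class (20 classes in the configuration discussed here). It determines which size class sizeof(struct arraycache_init) and sizeof(struct kmem_list3) fall into. During initialization, kmem_cache_init deliberately creates the general caches for these two sizes first. Back in setup_cpu_cache, the test compares the class indexes of the two structures; they normally differ, so g_cpucache_up is set to PARTIAL_AC at this point;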
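The lookup that index_of performs at compile time can be pictured as a simple table scan. The following is a minimal user-space sketch, not the kernel code: the table holds only a subset of size classes assumed for 4 KiB pages, and size_class_index is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed subset of the size classes listed in kmalloc_sizes.h. */
static const size_t size_classes[] = { 32, 64, 96, 128, 192, 256, 512, 1024 };
#define NUM_CLASSES (sizeof(size_classes) / sizeof(size_classes[0]))

/* Return the index of the first class that can hold `size`, or -1 if none. */
static int size_class_index(size_t size)
{
    /* The kernel handles size 0 separately (ZERO_SIZE_PTR). */
    assert(size > 0);

    for (size_t i = 0; i < NUM_CLASSES; i++) {
        if (size <= size_classes[i])
            return (int)i;
    }
    return -1; /* larger than every class in this toy table */
}
```

With this table, a 24-byte structure lands in class 0 (32 bytes) and a 56-byte structure in class 1 (64 bytes), which is the kind of difference the INDEX_AC == INDEX_L3 test checks for.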

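So by now the kernel has a cache for objects the size of struct arraycache_init. From here on, when a new cache is created, its array member no longer needs a static variable and can simply be allocated with kmalloc. In fact, initialization immediately goes on to create the cache for struct kmem_list3, which takes the else branch of this function and allocates its array member with kmalloc;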

In that else branch, g_cpucache_up == PARTIAL_AC means the cache for struct arraycache_init exists but the cache for sizeof(struct kmem_list3) does not, so the nodelists member still has to come from the static variable via set_up_list3s; g_cpucache_up then advances to PARTIAL_L3. If g_cpucache_up is PARTIAL_L3 or later (for example EARLY, once kmem_cache_init has finished), the kmem_list3 cache already exists and nodelists is allocated directly with kmalloc_node;
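The branches of setup_cpu_cache can be summarized as a small decision table: for each bootstrap stage, where does the local cache come from, and where do the three slab lists come from? The sketch below is a user-space illustration only; enum source, struct sources, and bootstrap_sources are hypothetical names introduced here.

```c
#include <assert.h>

/* Bootstrap stages, in the order listed in the text. */
enum cpucache_state { NONE, PARTIAL_AC, PARTIAL_L3, EARLY, FULL };

/* Where a piece of bookkeeping is taken from (hypothetical labels). */
enum source { FROM_STATIC, FROM_KMALLOC };

struct sources {
    enum source array_cache; /* the per-CPU local cache */
    enum source list3;       /* the three slab lists */
};

/* Mirror the branches of setup_cpu_cache() for the stages before FULL. */
static struct sources bootstrap_sources(enum cpucache_state state)
{
    struct sources s;

    /* FULL is handled by enable_cpucache() instead. */
    assert(state != FULL);

    /* Only the very first cache borrows the static array. */
    s.array_cache = (state == NONE) ? FROM_STATIC : FROM_KMALLOC;
    /* The static lists are needed until the kmem_list3 cache exists. */
    s.list3 = (state <= PARTIAL_AC) ? FROM_STATIC : FROM_KMALLOC;
    return s;
}
```

The only asymmetric stage is PARTIAL_AC: the array can already be allocated with kmalloc while the lists still come from the static variable.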

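Finally, note the initialization at the bottom, which only runs while slab initialization is incomplete, i.e. while g_cpucache_up is not yet FULL. The next_reap member of nodelists can be ignored for now. The array's avail member is the number of objects currently available in the local cache; it starts at 0, which shows that although a cache of this size has been created, no physical pages have been requested from the buddy system yet. limit is the maximum number of objects the local cache may hold, here 1 (BOOT_CPUCACHE_ENTRIES). batchcount is the number of objects moved in or out in one batch; objects are transferred to and from the local cache in units of batchcount, as will become obvious later. touched records whether the local cache was used recently and can be ignored for now. Last come the cache-wide limit and batchcount;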

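At this point you should have a first understanding of how the slab allocator works. A cache must be created for each required object size; afterwards, memory of that size can be obtained with kmalloc/kmem_cache_alloc. Creating the cache does not allocate memory or create any slab; that happens when kmalloc/kmem_cache_alloc is called. Caches for different sizes differ in the number of pages per slab, the number of objects per slab, and the amount of leftover space. Slab management data is stored either on-slab or off-slab: on-slab, it sits in the same memory as the objects; off-slab, it is allocated separately. The management data consists of the struct slab and one object descriptor per object;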
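To make "objects per slab" and "leftover space" concrete, here is a rough user-space sketch of the arithmetic for the on-slab case. It is modelled loosely on what cache_estimate computes, but the header and descriptor sizes are assumptions chosen for illustration, and struct estimate and estimate_on_slab are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

/* Assumed sizes, for illustration only; real values depend on the build. */
#define SLAB_HEADER_SIZE 32u /* stand-in for sizeof(struct slab) */
#define BUFCTL_SIZE       4u /* stand-in for sizeof(kmem_bufctl_t) */

struct estimate {
    size_t num;       /* objects per slab */
    size_t mgmt_size; /* aligned size of header plus descriptors */
    size_t left_over; /* unused bytes, usable for colouring */
};

/* Estimate the layout of a slab whose management data is stored on-slab. */
static struct estimate estimate_on_slab(size_t slab_bytes, size_t obj_size,
                                        size_t align)
{
    struct estimate e;

    assert(obj_size > 0 && align > 0 && slab_bytes > SLAB_HEADER_SIZE);

    /* First guess ignores the alignment padding of the management area. */
    e.num = (slab_bytes - SLAB_HEADER_SIZE) / (obj_size + BUFCTL_SIZE);
    e.mgmt_size = ALIGN_UP(SLAB_HEADER_SIZE + e.num * BUFCTL_SIZE, align);

    /* Padding may push the total past the slab; drop one object if so. */
    if (e.num > 0 && e.mgmt_size + e.num * obj_size > slab_bytes) {
        e.num--;
        e.mgmt_size = ALIGN_UP(SLAB_HEADER_SIZE + e.num * BUFCTL_SIZE, align);
    }
    e.left_over = slab_bytes - e.mgmt_size - e.num * obj_size;
    return e;
}
```

Under these assumed sizes, a 4096-byte slab of 128-byte objects with 32-byte alignment holds 30 objects, uses 160 bytes of management data, and leaves 96 bytes over, which would allow three 32-byte colour steps.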

 

2.3. The slab allocation mechanism:

Both kmalloc and kmem_cache_alloc end up calling __cache_alloc, the common entry point for handing an object to the caller:

static __always_inline void *

__cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)

{

         unsigned long save_flags;

         void *objp;

         flags &= gfp_allowed_mask;

         lockdep_trace_alloc(flags);

         if (slab_should_failslab(cachep, flags))

                   return NULL;

         /* Debug checks before allocating */

         cache_alloc_debugcheck_before(cachep, flags);

         local_irq_save(save_flags);

         /* The actual allocation */

         objp = __do_cache_alloc(cachep, flags);

         local_irq_restore(save_flags);

         /* Debug checks after allocating */

         objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);

         /* No-op unless kmemleak is enabled */

         kmemleak_alloc_recursive(objp, obj_size(cachep), 1, cachep->flags,

                                      flags);

         prefetchw(objp);

         if (likely(objp))

                   kmemcheck_slab_alloc(cachep, flags, objp, obj_size(cachep));

         if (unlikely((flags & __GFP_ZERO) && objp))

                   memset(objp, 0, obj_size(cachep));

         return objp;

}

Only __do_cache_alloc matters here. It calls ____cache_alloc, which is the key function:

static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)

{

         void *objp;

         struct array_cache *ac;

         check_irq_off();

         /* Get this CPU's local cache descriptor (array_cache) for the cache */

         ac = cpu_cache_get(cachep);

         /* If the local cache has free objects, take one; otherwise refill it (cache_alloc_refill) */

         if (likely(ac->avail)) {

                   /* No-op in normal builds */

                   STATS_INC_ALLOCHIT(cachep);

                   /* avail is decremented below, so the object handed out is the hottest one: the most recently freed and most likely still in the CPU cache */

                   ac->touched = 1;

                   /* ac points to the struct array_cache; its entry[] array, which follows the
                      header directly, holds the pointers to the cached free objects.
                      As the next line shows, objects are handed out starting from the last one */

                   objp = ac->entry[--ac->avail];

         }

    else {

                   /* No-op in normal builds */

                   STATS_INC_ALLOCMISS(cachep);

                   /* Refill the local cache with new objects */

                   objp = cache_alloc_refill(cachep, flags);

         }

         /*

          * To avoid a false negative, if an object that is in one of the

          * per-CPU caches is leaked, we need to make sure kmemleak doesn't

          * treat the array pointers as a reference to the object.

          */

         /* No-op unless kmemleak is enabled */

         kmemleak_erase(&ac->entry[ac->avail]);

         return objp;

}

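First the function fetches the cache's array entry for the current CPU, the so-called local cache. Objects are always handed out from array. If array is empty, objects are moved into it from the cache's three slab lists; if the three lists have nothing free either, they obtain physical pages from the buddy system and then feed array. That is the principle of slab allocation;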

static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)

{

         return cachep->array[smp_processor_id()];

}

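The if that follows tests whether the array's avail member is 0, i.e. whether the local cache holds any free objects. Recall from the cache-creation section that setup_cpu_cache leaves avail at 0, whether or not slab initialization had finished. Together these confirm that the first kmalloc/kmem_cache_alloc from a cache triggers the actual allocation of physical memory, through the else branch of ____cache_alloc;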

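If avail is nonzero, the local cache has free objects: the one at the top of the entry array (the most recently added) is returned to the caller of kmalloc/kmem_cache_alloc, avail is decremented (the -- operator), and the array's touched flag is set;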
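The entry array therefore behaves like a stack. The sketch below shows that behaviour in user space; it is not the kernel structure. LOCAL_LIMIT, struct local_cache, local_cache_put, and local_cache_get are hypothetical names, and the put side (the free path) is included only to show why the last object in is the first one out; the kernel's free path is not covered in this section.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_LIMIT 4 /* assumed capacity, standing in for ac->limit */

struct local_cache {
    unsigned int avail;       /* number of cached free objects */
    void *entry[LOCAL_LIMIT]; /* pointers to those objects */
};

/* Free path (assumed): push the object on top. Returns 0 if the cache is full. */
static int local_cache_put(struct local_cache *ac, void *obj)
{
    assert(ac != NULL);
    if (ac->avail == LOCAL_LIMIT)
        return 0;
    ac->entry[ac->avail++] = obj;
    return 1;
}

/* Allocation path: pop the most recently added object, or NULL if empty. */
static void *local_cache_get(struct local_cache *ac)
{
    assert(ac != NULL);
    if (ac->avail == 0)
        return NULL; /* the kernel would call cache_alloc_refill() here */
    return ac->entry[--ac->avail];
}
```

Handing out the top of the stack is what makes the allocator prefer "hot" objects: the one freed last is the one most likely to still be in the CPU cache.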

If avail is 0, the local cache has no free objects, and they must be obtained as described above by calling cache_alloc_refill:

static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)

{

         int batchcount;

         struct kmem_list3 *l3;

         struct array_cache *ac;

         int node;

 

retry:

         check_irq_off();

         node = numa_node_id();

         /* This CPU's local cache */

         ac = cpu_cache_get(cachep);

         /* About to refill the local cache; start from batchcount, the number of objects moved per batch */

         batchcount = ac->batchcount;

         if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {

                   /*

                    * If there was little recent activity on this cache, then

                    * perform only a partial refill.  Otherwise we could generate

                    * refill bouncing.

                    */

                   batchcount = BATCHREFILL_LIMIT;

         }

         /* Get this cache's three slab lists for the local memory node */

         l3 = cachep->nodelists[node];

 

         BUG_ON(ac->avail > 0 || !l3);

         spin_lock(&l3->list_lock);

 

         /* See if we can refill from the shared array */

         /* If there is a shared local cache, refill from it first.
            Only relevant on multi-CPU systems, where it is shared among CPUs */

         if (l3->shared && transfer_objects(ac, l3->shared, batchcount))

                   goto alloc_done;

 

         /* Allocate from the slab lists in this node's kmem_list3 */

         while (batchcount > 0) {

                   struct list_head *entry;

                   struct slab *slabp;

                   /* Get slab alloc is to come from. */

                   /* Try the partial list first; entry points to its first node */

                   entry = l3->slabs_partial.next;

                   /* No partial slabs left, so look at the fully free ones */

                   if (entry == &l3->slabs_partial) {

                            /* Mark the free list as touched before using it */

                            l3->free_touched = 1;

                            /* entry points to the first node of the free list */

                            entry = l3->slabs_free.next;

                            /* No free slabs either; the cache must grow */

                            if (entry == &l3->slabs_free)

                                     goto must_grow;

                   }

                   /* At least one of the free or partial lists is non-empty; get the slab descriptor */

                   slabp = list_entry(entry, struct slab, list);

                   /* The next two calls are no-ops in normal builds */

                   check_slabp(cachep, slabp);

                   check_spinlock_acquired(cachep);

 

                   /*

                    * The slab was either on partial or free list so

                    * there must be at least one object available for

                    * allocation.

                    */

                   BUG_ON(slabp->inuse >= cachep->num);

                   /* While the slab still has free objects, move up to batchcount of them */

                   while (slabp->inuse < cachep->num && batchcount--) {

                            /* The next three macros are normally no-ops */

                            STATS_INC_ALLOCED(cachep);

                            STATS_INC_ACTIVE(cachep);

                            STATS_SET_HIGH(cachep);

                            /* Take a free object from the slab and store its virtual address in the local cache.
                               Each call to slab_get_obj increments slabp->inuse and updates the index of the first free object */

                            ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,

                                                                     node);

                   }

                   /* Normally a no-op */

                   check_slabp(cachep, slabp);

 

                   /* move slabp to correct slabp list: */

                   /* Remove the slab from the list it was on (partial or free) */

                   list_del(&slabp->list);

       

                   /* Re-add it to the list that matches its new state:
                      if it has no free objects left it goes on the full list,
                      otherwise it goes on the partial list */

                   if (slabp->free == BUFCTL_END)

                            list_add(&slabp->list, &l3->slabs_full);

                   else

                            list_add(&slabp->list, &l3->slabs_partial);

         }

 

must_grow:

         /* avail objects were moved from the slab lists into the local cache above; update the lists' free-object count */

         l3->free_objects -= ac->avail;

alloc_done:

         spin_unlock(&l3->list_lock);

 

         if (unlikely(!ac->avail)) {

                   int x;

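                   /* Call cache_grow to add a new slab to the cache.
                      Arguments: cache pointer, flags, memory node, virtual address of the pages
                      (NULL means no pages have been allocated yet; non-NULL means pages already exist and can be used for the slab).
                      Returns 1 on success, 0 on failure */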

                   x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);

 

                   /* cache_grow can reenable interrupts, then ac could change. */

                   /* cache_grow may have enabled interrupts, so the local cache pointer may have changed; fetch it again */

                   ac = cpu_cache_get(cachep);

 

                   /* No new slab could be added and the local cache is still empty: no more objects can be allocated */

                  if (!x && ac->avail == 0)   /* no objects in sight? abort */

                            return NULL;

       

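                   /* Two cases reach this point.
                      1) avail is nonzero, whether or not growing succeeded: something else (for example an interrupt) refilled the local cache, so no retry is needed.
                      2) avail is 0 and growing succeeded: go back to retry and fill the local cache from the new slab */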

                   if (!ac->avail)             /* objects refilled by interrupt? */

                            goto retry;

         }

 

         /* The local cache was refilled; set the recently-used flag touched */

         ac->touched = 1;

         /* Return the virtual address of the last free object in the local cache */

         return ac->entry[--ac->avail];

}

First note "batchcount = ac->batchcount;": as mentioned earlier, this is the number of objects moved in or out per batch;

Then "l3 = cachep->nodelists[node];" fetches the cache's three slab lists;

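The shared handling that follows can be skipped for now by considering a single CPU only (once the slab principle is clear, you will find that the shared mechanism is quite effective on multi-CPU systems);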

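Next is the while loop, which checks the partial list and then the free list for a slab with free objects; if one exists, objects are taken from it and moved into array. Consider first the case where there is none, which jumps to the label must_grow: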

l3->free_objects -= ac->avail;

avail is 0 here, so nothing is subtracted. What matters is what follows alloc_done:

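Normally avail is still 0, so the branch if (unlikely(!ac->avail)) is taken. Since the three slab lists have no free objects either, physical memory must come from the buddy system by calling cache_grow. These are the two conditions for creating a new slab described in many articles and in ULK (Understanding the Linux Kernel): 1. array has no free objects; 2. the three slab lists have no free objects either: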

static int cache_grow(struct kmem_cache *cachep,

                   gfp_t flags, int nodeid, void *objp)

{

         struct slab *slabp;

         size_t offset;

         gfp_t local_flags;

         struct kmem_list3 *l3;

         /*

          * Be lazy and only check for valid flags here,  keeping it out of the

          * critical path in kmem_cache_alloc().

          */

         BUG_ON(flags & GFP_SLAB_BUG_MASK);

         local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);

         /* Take the l3 list lock to change the colour_next on this node */

         check_irq_off();

         /* Get this cache's three slab lists for the memory node */

         l3 = cachep->nodelists[nodeid];

         spin_lock(&l3->list_lock);

         /* Get colour for the slab, and cal the next value. */

         /* Colour index for the slab about to be created */

         offset = l3->colour_next;

         /* Advance the colour index for the next slab */

         l3->colour_next++;

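         /* The colour index must stay below the number of colours;
            when it reaches that number it wraps to 0.
            This is the colour cycling analysed earlier. If a slab wastes very little space, the cycle wraps very quickly */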

         if (l3->colour_next >= cachep->colour)

                   l3->colour_next = 0;

         spin_unlock(&l3->list_lock);

         /* Convert the colour index into a byte offset for this cache */

         offset *= cachep->colour_off;

         if (local_flags & __GFP_WAIT)

                   local_irq_enable();

         /*

          * The test for missing atomic flag is performed here, rather than

          * the more obvious place, simply to reduce the critical path length

          * in kmem_cache_alloc(). If a caller is seriously mis-behaving they

          * will eventually be caught here (where it matters).

          */

         kmem_flagcheck(cachep, flags);

         /*

          * Get mem for the objs.  Attempt to allocate a physical page from

          * 'nodeid'.

          */

         /* Get pages from the buddy system; objp is their virtual address */

         if (!objp)

                   objp = kmem_getpages(cachep, local_flags, nodeid);

         if (!objp)

                   goto failed;

         /* Get slab management. */

         /* Get a new slab descriptor */

         slabp = alloc_slabmgmt(cachep, objp, offset,

                            local_flags & ~GFP_CONSTRAINT_MASK, nodeid);

         if (!slabp)

                   goto opps1;

         /* Store cachep in the page's lru.next field and slabp in its lru.prev field.
            This maps each page back to its cache and slab, so both descriptors can be found quickly from a page */

         slab_map_pages(cachep, slabp, objp);

         /* Initialize the objects of the slab and their descriptors */

         cache_init_objs(cachep, slabp);

         if (local_flags & __GFP_WAIT)

                   local_irq_disable();

         check_irq_off();

         spin_lock(&l3->list_lock);

         /* Make slab active. */

         /* Append the slab descriptor to the tail of the cache's free-slab list */

         list_add_tail(&slabp->list, &(l3->slabs_free));

         STATS_INC_GROWN(cachep);

         /* Update the cache's free-object counter */

         l3->free_objects += cachep->num;

         spin_unlock(&l3->list_lock);

         return 1;

opps1:

         kmem_freepages(cachep, objp);

failed:

         if (local_flags & __GFP_WAIT)

                   local_irq_disable();

         return 0;

}

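First it gets the three-list pointer l3 and handles colouring. Colouring is covered separately later; it does not affect the understanding of slab allocation, so ignore it for now;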
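Even before colouring is covered in detail, the bookkeeping in cache_grow is small enough to sketch: each new slab takes the current colour index, the index advances and wraps, and the index is multiplied by the colour step to get a byte offset. This is a user-space illustration; struct colour_state and next_colour_offset are hypothetical names that mirror the colour_next, colour, and colour_off fields.

```c
#include <assert.h>
#include <stddef.h>

struct colour_state {
    unsigned int colour_next; /* colour index for the next slab */
    unsigned int colour;      /* number of distinct colours */
    size_t colour_off;        /* bytes per colour step */
};

/* Return the byte offset for a new slab and advance the colour index. */
static size_t next_colour_offset(struct colour_state *cs)
{
    size_t offset;

    assert(cs != NULL);
    offset = cs->colour_next;

    cs->colour_next++;
    if (cs->colour_next >= cs->colour)
        cs->colour_next = 0; /* wrap around to the first colour */

    return offset * cs->colour_off;
}
```

With three colours and a 32-byte step, successive slabs start at offsets 0, 32, 64, and then 0 again.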

Then it obtains physical pages from the buddy system by calling kmem_getpages:

static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)

{

         struct page *page;

         int nr_pages;

         int i;

 

#ifndef CONFIG_MMU

         /*

          * Nommu uses slab's for process anonymous memory allocations, and thus

          * requires __GFP_COMP to properly refcount higher order allocations

          */

         flags |= __GFP_COMP;

#endif

 

         /* If SLAB_RECLAIM_ACCOUNT is set, __GFP_RECLAIMABLE is added to the flags: pages given to this slab are accounted as reclaimable */

         flags |= cachep->gfpflags;

         if (cachep->flags & SLAB_RECLAIM_ACCOUNT)

                   flags |= __GFP_RECLAIMABLE;

   

         /* Get pages from the buddy system; the count is determined by cachep->gfporder (2^gfporder pages) */

         page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);

         if (!page)

                   return NULL;

 

         /* Number of pages obtained (2^cachep->gfporder) */

         nr_pages = (1 << cachep->gfporder);

   

         /* Account the pages in vmstat as reclaimable or unreclaimable slab pages */

         if (cachep->flags & SLAB_RECLAIM_ACCOUNT)

                   add_zone_page_state(page_zone(page),

                            NR_SLAB_RECLAIMABLE, nr_pages);

         else

                   add_zone_page_state(page_zone(page),

                            NR_SLAB_UNRECLAIMABLE, nr_pages);

 

         /* Mark each of these pages as a slab page */

         for (i = 0; i < nr_pages; i++)

                   __SetPageSlab(page + i);

 

         /* kmemcheck_enabled is normally 0, so this block is skipped */

         if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {

                   kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);

 

                   if (cachep->ctor)

                            kmemcheck_mark_uninitialized_pages(page, nr_pages);

                   else

                            kmemcheck_mark_unallocated_pages(page, nr_pages);

         }

 

         /* Return the virtual address of the pages */

         return page_address(page);

}

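For now just look at what this function does. Mainly it calls alloc_pages_exact_node to request pages from the buddy system. The number of pages is determined by the cache's gfporder member, which was computed when the cache was created (for small objects usually no more than 2 pages). It then sets some page state; note the reclaimable accounting, which matters again when the pages are released. Finally it returns the kernel virtual address at which those pages are mapped. The variable objp now holds the virtual address of the pages obtained from the buddy system;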

Next a slab descriptor is obtained through alloc_slabmgmt, which is also important:

static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,

                                        int colour_off, gfp_t local_flags,

                                        int nodeid)

{

         struct slab *slabp;

 

         if (OFF_SLAB(cachep)) {

                   /* Slab management obj is off-slab. */

                   slabp = kmem_cache_alloc_node(cachep->slabp_cache,

                                                     local_flags, nodeid);

                   /*

                    * If the first object in the slab is leaked (it's allocated

                    * but no one has a reference to it), we want to make sure

                    * kmemleak does not treat the ->s_mem pointer as a reference

                    * to the object. Otherwise we will not report the leak.

                    */

                   kmemleak_scan_area(slabp, offsetof(struct slab, list),

                                        sizeof(struct list_head), local_flags);

                   if (!slabp)

                            return NULL;

         }

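         /* On-slab: the slab descriptor sits at the start of the slab's own memory, i.e. at the virtual start address plus the colour offset.
            The offset is then advanced past the management data */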

    else {

                   slabp = objp + colour_off;

                   colour_off += cachep->slab_size;

         }

         slabp->inuse = 0;

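         /* Offset of the first object within the slab.
            For on-slab management, colouroff covers the colour area plus the management data;
            for off-slab management, colouroff covers only the colour area */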

         slabp->colouroff = colour_off;

         /* Virtual address of the first object; for on-slab the offset already includes the management data */

         slabp->s_mem = objp + colour_off;

         slabp->nodeid = nodeid;

         /* The first free object has index 0, the first element of the kmem_bufctl_t array */

         slabp->free = 0;

         return slabp;

}

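This confirms what was described earlier. For an off-slab slab, the management data (the slab descriptor plus the per-object descriptors, which are really just indexes) is allocated from a separate cache sized for it; the object descriptors live in that same allocation and need no request of their own. For an on-slab slab, the management data lives inside the slab's own memory: slabp sits right after the colour area. That is why, for on-slab only, the offset is advanced by cachep->slab_size (slab_size is the size of struct slab plus all object descriptors), while off-slab needs no such adjustment;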

This is easy to get confused about, so here it is in more detail:

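For off-slab, the cache's slabp_cache member was already set when the cache was created by kmem_cache_create: it points to the general cache of the size class that slab_size falls into. For both off-slab and on-slab, slab_size is the size of struct slab plus all object descriptors. In the off-slab case, the if branch of alloc_slabmgmt allocates a block from that cache to hold the management data of the slab being created. This is what off-slab means: the management data is allocated elsewhere, as illustrated here: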

 

 

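[Figure: off-slab layout. The slab management data (struct slab + one descriptor per object) is stored separately from the slab memory (colour offset + the objects).]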

 

 

 

On-slab keeps everything together:

 

 

[Figure: on-slab layout. The slab memory holds, in order: colour offset, struct slab, the per-object descriptors, then the objects.]

 

 

 

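Note that slabp points to the slab management data in both cases, but its location differs. The descriptor's colouroff member is set to colour_off, whose value differs between the two: for on-slab it additionally includes the size of struct slab and all object descriptors. Consequently s_mem, the virtual address of the first object, differs too. objp points to the slab memory in both cases, but off-slab only skips the colour offset while on-slab skips a further slab_size bytes;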

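Finally, the slab's inuse member is the number of objects currently in use, initially 0; free is the index of the first free object, initially 0; nodeid identifies the memory node;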
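The difference between the two layouts comes down to where the first object starts. The following user-space sketch reduces alloc_slabmgmt to that arithmetic, working with byte offsets from the start of the slab's pages rather than real pointers; first_object_offset and mgmt_offset_on_slab are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the first object (s_mem) from the start of the slab's pages. */
static size_t first_object_offset(size_t colour_offset, size_t mgmt_size,
                                  int off_slab)
{
    assert(off_slab == 0 || off_slab == 1);

    if (off_slab)
        return colour_offset;         /* management data lives elsewhere */
    return colour_offset + mgmt_size; /* skip header and descriptors */
}

/* Offset of the management data for an on-slab slab: slabp = objp + colour_off. */
static size_t mgmt_offset_on_slab(size_t colour_offset)
{
    return colour_offset;
}
```

For example, with a 64-byte colour offset and 160 bytes of management data, the first object sits at offset 224 on-slab and at offset 64 off-slab.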

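Next comes slab_map_pages, which stores the cache descriptor cachep in the page's lru.next field and the slab descriptor slabp in its lru.prev field. In essence it maps each page back to its slab and cache, so both descriptors can be located quickly from a page. It needs little attention for now;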
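The reverse mapping can be pictured with toy types: while a page belongs to a slab, its lru list pointers are not used for list linkage, so they can carry the two back-pointers. The sketch below is user-space code with placeholder structures; struct page, struct kmem_cache, and struct slab here are stand-ins, and map_page, page_cache, and page_slab are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct page { struct list_head lru; };   /* toy page descriptor */
struct kmem_cache { const char *name; }; /* placeholder cache descriptor */
struct slab { void *s_mem; };            /* placeholder slab descriptor */

/* Record which cache and slab a page belongs to. */
static void map_page(struct page *pg, struct kmem_cache *c, struct slab *s)
{
    assert(pg != NULL && c != NULL && s != NULL);
    pg->lru.next = (struct list_head *)c; /* page -> cache */
    pg->lru.prev = (struct list_head *)s; /* page -> slab */
}

/* Recover the cache descriptor from a page. */
static struct kmem_cache *page_cache(struct page *pg)
{
    return (struct kmem_cache *)pg->lru.next;
}

/* Recover the slab descriptor from a page. */
static struct slab *page_slab(struct page *pg)
{
    return (struct slab *)pg->lru.prev;
}
```

This is what makes freeing cheap: given only an object's address, the page can be found, and from the page both descriptors.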

Next is cache_init_objs, which initializes the slab's objects and their descriptors:

static void cache_init_objs(struct kmem_cache *cachep,

                                struct slab *slabp)

{

         int i;

 

         for (i = 0; i < cachep->num; i++) {

                   void *objp = index_to_obj(cachep, slabp, i);

#if DEBUG

                   /* need to poison the objs? */

                   if (cachep->flags & SLAB_POISON)

                            poison_obj(cachep, objp, POISON_FREE);

                   if (cachep->flags & SLAB_STORE_USER)

                            *dbg_userword(cachep, objp) = NULL;

 

                   if (cachep->flags & SLAB_RED_ZONE) {

                            *dbg_redzone1(cachep, objp) = RED_INACTIVE;

                            *dbg_redzone2(cachep, objp) = RED_INACTIVE;

                   }

                   /*

                    * Constructors are not allowed to allocate memory from the same

                    * cache which they are a constructor for.  Otherwise, deadlock.

                    * They must also be threaded.

                    */

                   if (cachep->ctor && !(cachep->flags & SLAB_POISON))

                            cachep->ctor(objp + obj_offset(cachep));

 

                   if (cachep->flags & SLAB_RED_ZONE) {

                            if (*dbg_redzone2(cachep, objp) != RED_INACTIVE)

                                     slab_error(cachep, "constructor overwrote the"

                                                  " end of an object");

                            if (*dbg_redzone1(cachep, objp) != RED_INACTIVE)

                                     slab_error(cachep, "constructor overwrote the"

                                                  " start of an object");

                   }

                   if ((cachep->buffer_size % PAGE_SIZE) == 0 &&

                                OFF_SLAB(cachep) && cachep->flags & SLAB_POISON)

                            kernel_map_pages(virt_to_page(objp),

                                                cachep->buffer_size / PAGE_SIZE, 0);

#else

                   if (cachep->ctor)

                            cachep->ctor(objp);

#endif

                   /* Initially every object is free, so just chain the descriptors in array order */

                   slab_bufctl(slabp)[i] = i + 1;

         }

 

         /* The last descriptor is set to BUFCTL_END */

         slab_bufctl(slabp)[i - 1] = BUFCTL_END;

}

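The loop locates each object objp by index and calls the cache's constructor ctor on it, although most caches are created with ctor set to NULL, so no per-object initialization is needed. The important line is "slab_bufctl(slabp)[i] = i + 1", which initializes the object descriptors to 1, 2, 3, ..., BUFCTL_END: each descriptor holds the index of the next free object;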

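At this point the slab is fully initialized: its memory is allocated and all attributes are set. It is now linked onto the cache's three slab lists with "list_add_tail(&slabp->list, &(l3->slabs_free));", and the lists' free_objects counter is updated. Returning 1 means the slab was created and cache_grow succeeded.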

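Back in cache_alloc_refill, the cache's array must be fetched again, because the steps above may have enabled interrupts and the local cache pointer may have changed. Then comes the test if (!ac->avail): if something else filled the array in the meantime, the function goes straight to the final return. In most cases avail is still 0, so control jumps back to retry. This time the three slab lists are certain to contain a free slab, so the while loop executes "ac->entry[ac->avail++] = slab_get_obj(cachep, slabp, node);", which moves objects into array: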

static void *slab_get_obj(struct kmem_cache *cachep, struct slab *slabp,

                                     int nodeid)

{

         /* Get a free object; free is the index of the first free object in this slab */

         void *objp = index_to_obj(cachep, slabp, slabp->free);

         kmem_bufctl_t next;

 

         /* Update the in-use count */

         slabp->inuse++;

         /* Index of the next free object */

         next = slab_bufctl(slabp)[slabp->free];

#if DEBUG

         slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;

         WARN_ON(slabp->nodeid != nodeid);

#endif

 

         /* free now points to the next free object */

         slabp->free = next;

 

         return objp;

}

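The function itself is straightforward: it takes one object out of this slab and updates the slab's first-free index and in-use count. Remember that it is called inside the while loop, so objects are moved out in a batch: up to batchcount objects go to array (bounded by the number of objects in the slab, which is normally not exceeded);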
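The descriptor array is in effect a singly linked free list stored as indexes. The sketch below reproduces the initialization from cache_init_objs and the take step from slab_get_obj on a toy slab in user space; NUM_OBJS, the BUFCTL_END value, struct toy_slab, and the function names are assumptions made for illustration.

```c
#include <assert.h>

#define NUM_OBJS 4
#define BUFCTL_END 0xffffffffu /* assumed end-of-chain marker */

struct toy_slab {
    unsigned int inuse;            /* objects handed out so far */
    unsigned int free;             /* index of the first free object */
    unsigned int bufctl[NUM_OBJS]; /* bufctl[i] = index of the next free object */
};

/* Chain all objects in array order: 0 -> 1 -> 2 -> 3 -> END. */
static void toy_slab_init(struct toy_slab *s)
{
    for (unsigned int i = 0; i < NUM_OBJS; i++)
        s->bufctl[i] = i + 1;
    s->bufctl[NUM_OBJS - 1] = BUFCTL_END;
    s->inuse = 0;
    s->free = 0;
}

/* Take the first free object and return its index. */
static unsigned int toy_slab_get(struct toy_slab *s)
{
    unsigned int idx = s->free;

    /* The caller must only ask while inuse < NUM_OBJS. */
    assert(idx != BUFCTL_END);
    s->inuse++;
    s->free = s->bufctl[idx]; /* advance to the next free object */
    return idx;
}
```

Once the last object is taken, free becomes BUFCTL_END, which is the condition cache_alloc_refill tests to decide that the slab belongs on the full list.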

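In the end this slab has gone from fully free to having batchcount fewer free objects. Depending on whether any free objects remain, it is taken off the free list and put on the partial list or the full list.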

That is the process and principle of slab allocation!
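The whole path (local cache, then partial and free slabs, then growing) can be condensed into a toy model. The following user-space sketch only tracks object ids and per-slab counters; it has no locking, no NUMA nodes, no shared cache, and no freeing, and all names and constants in it are hypothetical.

```c
#include <assert.h>

#define OBJS_PER_SLAB 4
#define MAX_SLABS     8 /* pretend the buddy system runs dry after this */
#define BATCHCOUNT    2
#define LIMIT         4

struct toy_cache {
    int entry[LIMIT];     /* local cache: object ids, used as a stack */
    int avail;            /* number of ids in entry[] */
    int inuse[MAX_SLABS]; /* objects taken from each slab */
    int nr_slabs;         /* slabs obtained so far */
};

/* Pick a slab with free objects: partial slabs first, then fully free ones. */
static int find_slab(const struct toy_cache *c)
{
    for (int i = 0; i < c->nr_slabs; i++)
        if (c->inuse[i] > 0 && c->inuse[i] < OBJS_PER_SLAB)
            return i;
    for (int i = 0; i < c->nr_slabs; i++)
        if (c->inuse[i] == 0)
            return i;
    return -1;
}

/* Stand-in for cache_grow(): add one empty slab. Returns 0 on failure. */
static int grow(struct toy_cache *c)
{
    if (c->nr_slabs == MAX_SLABS)
        return 0;
    c->inuse[c->nr_slabs++] = 0;
    return 1;
}

/* Stand-in for cache_alloc_refill(): move up to BATCHCOUNT objects. */
static int refill(struct toy_cache *c)
{
    assert(c->avail == 0 && BATCHCOUNT <= LIMIT);
    for (;;) {
        int batch = BATCHCOUNT;
        int s;

        while (batch > 0 && (s = find_slab(c)) >= 0) {
            while (c->inuse[s] < OBJS_PER_SLAB && batch > 0) {
                int id = s * OBJS_PER_SLAB + c->inuse[s];

                c->inuse[s]++;
                c->entry[c->avail++] = id;
                batch--;
            }
        }
        if (c->avail > 0)
            return 1;
        if (!grow(c))
            return 0; /* nothing left to allocate */
        /* a new empty slab exists now; retry the refill */
    }
}

/* Stand-in for ____cache_alloc(): returns an object id, or -1 on failure. */
static int toy_alloc(struct toy_cache *c)
{
    if (c->avail == 0 && !refill(c))
        return -1;
    return c->entry[--c->avail];
}
```

Starting from a zeroed toy_cache, the first call finds nothing anywhere, grows one slab, moves two objects into the local cache, and returns the one on top; the second call is served from the local cache without touching the slabs at all.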

 

2.4. Slab initialization:

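Initialization is described last because, without understanding how slab works, how it is used, and what a cache is, reading the initialization code would waste a lot of time and effort;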

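By now you can probably guess roughly how slab is initialized. start_kernel's mm_init calls kmem_cache_init to initialize slab. Its job is to create the caches for the various sizes, above all those for struct arraycache_init and struct kmem_list3, so that every cache created later can conveniently allocate these structures with kmalloc. Note that the very first cache, the one for objects of type struct kmem_cache, is likewise implemented with a global variable (cache_cache);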

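Also note two terms: general caches and dedicated caches. General caches are the ones created during initialization: the 20 size classes from 32 to 4194304 bytes, which include the classes used for array and kmem_list3 (the cache for kmem_cache itself is the static cache_cache). Any module can use them through kmalloc, which finds the matching cache from the requested size, so the caller does not create a cache and only passes the length. A dedicated cache is one that the user creates for its own object size by calling kmem_cache_create; there is nothing special about it;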

void __init kmem_cache_init(void)

{

         size_t left_over;

         struct cache_sizes *sizes;

         struct cache_names *names;

         int i;

         int order;

         int node;

 

         if (num_possible_nodes() == 1)

                   use_alien_caches = 0;

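         /* Initialize the three lists of every static kmem_list3.
            The static global cache_cache manages the kmem_cache descriptors of all caches;
            in other words, bootstrap sets up a cache whose slabs hold every cache's struct kmem_cache */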

         for (i = 0; i < NUM_INIT_LISTS; i++) {

                   kmem_list3_init(&initkmem_list3[i]);

                   if (i < MAX_NUMNODES)

                            cache_cache.nodelists[i] = NULL;

         }

         /* Attach initkmem_list3[0] to cache_cache; CACHE_CACHE is index 0 */

         set_up_list3s(&cache_cache, CACHE_CACHE);

 

         /*

          * Fragmentation resistance on low memory - only use bigger

          * page orders on machines with more than 32MB of memory.

          */

         /* totalram_pages is the total number of physical pages in the system.
            Only with more than 32 MB of memory may slabs use higher page orders */

         if (totalram_pages > (32 << 20) >> PAGE_SHIFT)

                   slab_break_gfp_order = BREAK_GFP_ORDER_HI;

 

         /* Bootstrap is tricky, because several objects are allocated

          * from caches that do not exist yet:

          * 1) initialize the cache_cache cache: it contains the struct

          *    kmem_cache structures of all caches, except cache_cache itself:

          *    cache_cache is statically allocated.

          *    Initially an __init data area is used for the head array and the

          *    kmem_list3 structures, it's replaced with a kmalloc allocated

          *    array at the end of the bootstrap.

          * 2) Create the first kmalloc cache.

          *    The struct kmem_cache for the new cache is allocated normally.

          *    An __init data area is used for the head array.

          * 3) Create the remaining kmalloc caches, with minimally sized

          *    head arrays.

          * 4) Replace the __init data head arrays for cache_cache and the first

          *    kmalloc cache with kmalloc allocated arrays.

          * 5) Replace the __init data for kmem_list3 for cache_cache and

          *    the other cache's with kmalloc allocated memory.

          * 6) Resize the head arrays of the kmalloc caches to their final sizes.

          */

 

         node = numa_node_id();

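         /* Step 1: set up cache_cache, the cache that holds struct kmem_cache objects, and link it on cache_chain.
            Only the data structures are initialized here; no objects exist until they are allocated.
            The global cache_chain is the head of the kernel's list of slab caches */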

         /* 1) create the cache_cache */

         /* Initialize cache_chain, the global list of all slab caches */

         INIT_LIST_HEAD(&cache_chain);

         /* Add cache_cache to cache_chain */

         list_add(&cache_cache.next, &cache_chain);

 

         /* The basic colouring unit is the cache line size (L1_CACHE_BYTES; 32 bytes on the platform discussed here) */

         cache_cache.colour_off = cache_line_size();

         /* Set up cache_cache's local cache; kmalloc is not usable here either, so use the statically allocated initarray_cache */

         cache_cache.array[smp_processor_id()] = &initarray_cache.cache;

         /* Set up the slab lists from the static initkmem_list3 */

         cache_cache.nodelists[node] = &initkmem_list3[CACHE_CACHE + node];

 

         /*

          * struct kmem_cache size depends on nr_node_ids, which

          * can be less than MAX_NUMNODES.

          */

         /* buffer_size is the size of an object within a slab. Compute the real size of struct kmem_cache:
            the offset of nodelists (everything before it) plus one struct kmem_list3 pointer per node.
            nr_node_ids is the number of memory nodes (1 on UMA) */

         cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +

                                      nr_node_ids * sizeof(struct kmem_list3 *);

#if DEBUG

         cache_cache.obj_size = cache_cache.buffer_size;

#endif

         /* Align buffer_size to the cache line size */

         cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,

                                               cache_line_size());

         /* Reciprocal of the object size, used to compute an object's index within its slab */

         cache_cache.reciprocal_buffer_size =

                   reciprocal_value(cache_cache.buffer_size);

 

         /* Find the smallest order for which a slab holds at least one cache_cache object;
            a nonzero num means the object fits, so stop. cache_line_size() is the alignment */

         for (order = 0; order < MAX_ORDER; order++) {

                   cache_estimate(order, cache_cache.buffer_size,

                            cache_line_size(), 0, &left_over, &cache_cache.num);

                   if (cache_cache.num)

                            break;

         }

         BUG_ON(!cache_cache.num);

         /* gfporder: each slab spans 2^gfporder pages */

         cache_cache.gfporder = order;

         /* Number of colours, in units of colour_off */

         cache_cache.colour = left_over / cache_cache.colour_off;

         /* Size of the slab management data */

         cache_cache.slab_size = ALIGN(cache_cache.num * sizeof(kmem_bufctl_t) +

                                           sizeof(struct slab), cache_line_size());

 

         /* 2+3) create the kmalloc caches */

         /* malloc_sizes holds the sizes of the general caches */

         sizes = malloc_sizes;

         /* cache_names holds the names of the general caches */

         names = cache_names;

 

         /*

          * Initialize the caches that provide memory for the array cache and the

          * kmem_list3 structures first.  Without this, further allocations will

          * bug.

          */

         /* First create the general caches used by struct array_cache and struct kmem_list3; the rest of initialization depends on them.
            INDEX_AC is the index, among the kmalloc sizes, of the class that fits struct arraycache_init;
            the cache of that class is created for local caches to use */

         sizes[INDEX_AC].cs_cachep = kmem_cache_create(names[INDEX_AC].name,

                                               sizes[INDEX_AC].cs_size,

                                               ARCH_KMALLOC_MINALIGN,

                                               ARCH_KMALLOC_FLAGS|SLAB_PANIC,

                                               NULL);

    /* If struct kmem_list3 and struct arraycache_init map to different kmalloc size indexes,

      i.e. they fall into different size classes, create a separate cache for struct kmem_list3; otherwise they share one cache */

         if (INDEX_AC != INDEX_L3) {

                   sizes[INDEX_L3].cs_cachep =

                            kmem_cache_create(names[INDEX_L3].name,

                                     sizes[INDEX_L3].cs_size,

                                     ARCH_KMALLOC_MINALIGN,

                                     ARCH_KMALLOC_FLAGS|SLAB_PANIC,

                                     NULL);

         }

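    /* Once the two general caches above exist, the slab early-init phase is over. Before this point, off-slab slabs (management data kept outside the slab) must not be created */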

         slab_early_init = 0;

   

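    /* Loop over the size classes and create the general cache for each one (20 classes, from 32 to 4194304 bytes, on the author's platform). Each class can have two caches, one regular and one DMA; the author's ARM configuration has no DMA cache */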

         while (sizes->cs_size != ULONG_MAX) {

                   /*

                    * For performance, all the general caches are L1 aligned.

                    * This should be particularly beneficial on SMP boxes, as it

                    * eliminates "false sharing".

                    * Note for systems short on memory removing the alignment will

                    * allow tighter packing of the smaller caches.

                    */

                   if (!sizes->cs_cachep) {

                            sizes->cs_cachep = kmem_cache_create(names->name,

                                               sizes->cs_size,

                                               ARCH_KMALLOC_MINALIGN,

                                               ARCH_KMALLOC_FLAGS|SLAB_PANIC,

                                               NULL);

                   }

#ifdef CONFIG_ZONE_DMA

                   sizes->cs_dmacachep = kmem_cache_create(

                                               names->name_dma,

                                               sizes->cs_size,

                                               ARCH_KMALLOC_MINALIGN,

                                               ARCH_KMALLOC_FLAGS|SLAB_CACHE_DMA|

                                                        SLAB_PANIC,

                                               NULL);

#endif

                   sizes++;

                   names++;

         }

         /* 4) Replace the bootstrap head arrays */

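    /* Step 4: replace the statically allocated globals with kmalloc'ed objects. So far two static local caches are in use: the local cache of cache_cache

      points to initarray_cache.cache, and the local cache of malloc_sizes[INDEX_AC].cs_cachep points to

      initarray_generic.cache (see setup_cpu_cache). Both are replaced here */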

         {

                   struct array_cache *ptr;

        /* Allocate space for the local cache used by cache_cache */

                   ptr = kmalloc(sizeof(struct arraycache_init), GFP_NOWAIT);

 

                   BUG_ON(cpu_cache_get(&cache_cache) != &initarray_cache.cache);

        /* Copy the original local cache of cache_cache, i.e. initarray_cache, to the new location */

                   memcpy(ptr, cpu_cache_get(&cache_cache),

                          sizeof(struct arraycache_init));

                   /*

                    * Do not assume that spinlocks can be initialized via memcpy:

                    */

                   spin_lock_init(&ptr->lock);

        /* Point the local cache of cache_cache at the new location */

                   cache_cache.array[smp_processor_id()] = ptr;

       

       

        /* Allocate space for the local cache used by malloc_sizes[INDEX_AC].cs_cachep */

                   ptr = kmalloc(sizeof(struct arraycache_init), GFP_NOWAIT);

 

                   BUG_ON(cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep)

                          != &initarray_generic.cache);

        /* Copy the original local cache to the newly allocated location */

                   memcpy(ptr, cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep),

                          sizeof(struct arraycache_init));

                   /*

                    * Do not assume that spinlocks can be initialized via memcpy:

                    */

                   spin_lock_init(&ptr->lock);

 

                   malloc_sizes[INDEX_AC].cs_cachep->array[smp_processor_id()] =

                       ptr;

         }

         /* 5) Replace the bootstrap kmem_list3's */

    /* Step 5: as in step 4, replace the statically allocated three slab lists (struct kmem_list3) with kmalloc'ed memory */

         {

                   int nid;

       

                   for_each_online_node(nid) {

            /* Copy the three slab lists of the struct kmem_cache cache (cache_cache) */

                            init_list(&cache_cache, &initkmem_list3[CACHE_CACHE + nid], nid);

            /* Copy the three slab lists of the cache that holds struct arraycache_init */

                            init_list(malloc_sizes[INDEX_AC].cs_cachep,

                                       &initkmem_list3[SIZE_AC + nid], nid);

            /* Copy the three slab lists of the cache that holds struct kmem_list3 */

                            if (INDEX_AC != INDEX_L3) {

                                     init_list(malloc_sizes[INDEX_L3].cs_cachep,

                                                 &initkmem_list3[SIZE_L3 + nid], nid);

                            }

                   }

         }

         g_cpucache_up = EARLY;

}
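
The object count, leftover space, colour count and management size computed at the start of kmem_cache_init can be checked by hand. The following is a minimal user-space sketch of that arithmetic for the on-slab case only. It is not the kernel's cache_estimate: every sketch_ name is hypothetical, and the page size, struct slab size, kmem_bufctl_t size and maximum order are assumed values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */
#define SKETCH_SLAB_HDR    28u  /* hypothetical sizeof(struct slab) */
#define SKETCH_BUFCTL       4u  /* hypothetical sizeof(kmem_bufctl_t) */
#define SKETCH_MAX_ORDER   11u  /* assumed upper bound on the page order */

/* Round x up to a multiple of a (a must be non-zero). */
static size_t sketch_align(size_t x, size_t a)
{
    assert(a != 0);
    return (x + a - 1) / a * a;
}

/* Bytes of on-slab management data needed for nr_objs objects. */
static size_t sketch_mgmt_size(size_t nr_objs, size_t align)
{
    return sketch_align(SKETCH_SLAB_HDR + nr_objs * SKETCH_BUFCTL, align);
}

/* How many objects fit in a slab of 2^order pages, and how much is left over. */
static void sketch_estimate(unsigned int order, size_t obj_size, size_t align,
                            size_t *left_over, size_t *num)
{
    size_t slab_bytes = (size_t)SKETCH_PAGE_SIZE << order;
    /* First guess: ignore the alignment padding of the management area. */
    size_t n = (slab_bytes - SKETCH_SLAB_HDR) / (obj_size + SKETCH_BUFCTL);

    /* The padding may push the total past the slab; drop one object if so. */
    if (n > 0 && sketch_mgmt_size(n, align) + n * obj_size > slab_bytes)
        n--;

    *num = n;
    *left_over = slab_bytes - n * obj_size - sketch_mgmt_size(n, align);
}

/* Smallest order at which at least one object fits, or -1 if none does. */
static int sketch_first_order(size_t obj_size, size_t align)
{
    for (unsigned int order = 0; order < SKETCH_MAX_ORDER; order++) {
        size_t left_over, num;

        sketch_estimate(order, obj_size, align, &left_over, &num);
        if (num)
            return (int)order;
    }
    return -1;
}
```

With these assumed sizes, a 128-byte object at order 0 gives 30 objects per slab and 96 bytes left over. Taking colour_off as 32, that is 96 / 32 = 3 colours, and the management area is ALIGN(30 * 4 + 28, 32) = 160 bytes. An 8192-byte object does not fit until order 2 in this on-slab-only sketch.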

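Now let's look back at slab initialization as a whole. Grouping the numbered steps in the source comments, it falls into three phases: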

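1. Initialize the global variable cache_cache to create the first cache. This supplies the "rule" for creating every other cache. Note that all caches hang off the list cache_chain, and cache_cache is the first node on that list. Once a cache whose object size is that of struct kmem_cache exists, kmem_cache structures themselves can be allocated from the slab allocator, which lays the groundwork for creating caches with other "rules".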

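2. Next, create the 20 caches with object sizes from 32 to 4194304 bytes, including the ones that hold struct arraycache_init and struct kmem_list3. These are the so-called general caches. Watch how the global variable g_cpucache_up, which records initialization progress, changes during this phase: NONE -> PARTIAL_AC -> PARTIAL_L3, as described in detail earlier.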

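3. Use kmalloc to allocate replacements for the structures that global variables have been standing in for so far: the struct arraycache_init instances (initarray_cache and initarray_generic) and the struct kmem_list3 instances (initkmem_list3). At this point slab initialization is essentially done, and other modules can obtain physical memory of the right size simply by calling kmalloc. The progress variable g_cpucache_up is set to EARLY.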
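
The "replace the bootstrap data" pattern of the third phase can be sketched in user space as follows. This is only an illustration under stated assumptions, not kernel code: the sketch_ types are hypothetical stand-ins for struct kmem_cache and struct array_cache, malloc stands in for kmalloc, a plain int stands in for the spinlock, and only one CPU is modelled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BOOT_ENTRIES 1

/* Hypothetical stand-in for struct array_cache / struct arraycache_init. */
struct sketch_array_cache {
    unsigned int avail;               /* number of cached object pointers */
    unsigned int limit;               /* capacity of entry[] */
    int lock;                         /* stand-in for spinlock_t */
    void *entry[SKETCH_BOOT_ENTRIES];
};

/* Hypothetical stand-in for struct kmem_cache, reduced to one CPU. */
struct sketch_cache {
    struct sketch_array_cache *array;
};

/* Static bootstrap storage, used before the allocator can serve itself. */
static struct sketch_array_cache sketch_boot_array = {
    0, SKETCH_BOOT_ENTRIES, 0, { NULL }
};

/* Move the cache's local array from static storage to the heap. */
static int sketch_replace_bootstrap(struct sketch_cache *c)
{
    struct sketch_array_cache *ptr = malloc(sizeof(*ptr));

    if (!ptr)
        return -1;
    /* Copy the current contents so cached objects are not lost. */
    memcpy(ptr, c->array, sizeof(*ptr));
    /* Re-initialise the lock instead of trusting the copied bytes. */
    ptr->lock = 0;
    /* From now on the cache uses the heap copy. */
    c->array = ptr;
    return 0;
}
```

A caller would start with a struct sketch_cache whose array points at sketch_boot_array and call sketch_replace_bootstrap once allocation works. The order mirrors the listing: allocate, copy, re-initialise the lock, then redirect the pointer.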

Later in start_kernel, the function kmem_cache_init_late is called. It sets the progress variable g_cpucache_up to FULL, which fully completes slab initialization.
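
The progress values can be pictured as a small one-way state machine. The sketch below is a hypothetical user-space model, not the kernel's definition: the SK_ names only mirror the values named in the text, the stages are walked in the order the text lists them, and the "ready" check encodes the text's statement that other code can rely on kmalloc once EARLY is reached.

```c
#include <assert.h>
#include <stddef.h>

/* Names mirror the values of g_cpucache_up mentioned in the text. */
enum sketch_cpucache_up {
    SK_NONE,
    SK_PARTIAL_AC,
    SK_PARTIAL_L3,
    SK_EARLY,
    SK_FULL
};

/* Advance to the next stage; returns 0 on success, -1 if already FULL. */
static int sketch_advance(enum sketch_cpucache_up *state)
{
    assert(state != NULL);
    if (*state == SK_FULL)
        return -1;
    *state = (enum sketch_cpucache_up)(*state + 1);
    return 0;
}

/* Per the text: general users may call kmalloc once EARLY is reached. */
static int sketch_general_use_ready(enum sketch_cpucache_up state)
{
    return state >= SK_EARLY;
}
```

Starting from SK_NONE, three calls to sketch_advance reach SK_EARLY (the end of kmem_cache_init in the text), and a fourth reaches SK_FULL (the effect attributed to kmem_cache_init_late).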

 

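To sum up: the slab source is on the harder side, but it matters, because a huge number of kernel modules use it, and understanding it well goes a long way toward understanding the kernel. What slab really delivers is, first, a mechanism for allocating small chunks of memory, and second, the fact that memory freed by its users is not returned to the buddy system but stays inside the slab allocator. That lets code make efficient use of the hardware cache, which is one of its key properties.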
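
The "freed memory stays inside the allocator" idea can be shown with a tiny last-in, first-out object cache. This is a minimal user-space sketch under assumptions, not the kernel's local cache: the sketch_ names are hypothetical, malloc and free stand in for getting pages from and returning them to the buddy system, and there is no locking or per-CPU handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_LIMIT 4  /* assumed capacity of the local cache */

/* Hypothetical per-cache stack of recently freed objects. */
struct sketch_obj_cache {
    size_t obj_size;            /* size of every object in this cache */
    unsigned int avail;         /* number of valid pointers in entry[] */
    void *entry[SKETCH_LIMIT];  /* freed objects, most recent on top */
};

/* Allocate: prefer the most recently freed ("hot") object. */
static void *sketch_alloc(struct sketch_obj_cache *c)
{
    if (c->avail > 0)
        return c->entry[--c->avail];
    /* Nothing cached: fall back to the backing allocator. */
    return malloc(c->obj_size);
}

/* Free: keep the object in the cache unless the cache is full. */
static void sketch_free(struct sketch_obj_cache *c, void *obj)
{
    if (c->avail < SKETCH_LIMIT)
        c->entry[c->avail++] = obj;
    else
        free(obj);  /* cache full: hand the memory back */
}
```

If objects a and then b are freed into an empty cache, the next two allocations return b and then a, so the most recently used memory is handed out first.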

posted @ 2018-11-30 03:11  rebeca8