Python Source Code Analysis: The Underlying Implementation of list (PyListObject)

This article is based on Python 3.10.4.

Introduction

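Arrays are a fundamental concept in programming: we gather multiple elements that share some property into an array, and we can add elements to it or remove them. C already has arrays, and most other programming languages implement the concept as well.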

PyListObject is the object that implements this array concept in Python; it is quite similar to vector in the C++ STL.

PyListObject

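PyListObject supports operations such as appending, inserting, and deleting elements. As noted above, an array is a collection of elements that share some property. What a PyListObject stores are pointers to PyObject, and every object in Python is built on PyObject. This is where a Python list differs from a C array: as C programmers know, a C array can only hold elements of a single type, such as int, double, or float, whereas a Python list can hold objects of any type.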

Next, let's look at the definition of PyListObject:

[Include/cpython/listobject.h]
typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;

PyObject_VAR_HEAD: the header fields shared by all variable-size objects in Python. Its ob_size field stores the length of the object.

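ob_item: the storage for the list's elements, an array of PyObject pointers. As the comment shows, list[0] is ob_item[0], so the n-th element of the list (counting from 1) is ob_item[n-1].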

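allocated: the capacity of the array, that is, the number of element slots actually allocated in memory. It is tied to how Python manages the list's memory, which is covered next.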

Memory management

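As described above, a PyListObject has an ob_size field that stores the length of the list and an allocated field that stores its capacity. What is the difference between the two, and why does Python implement it this way?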

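This comes down to how Python manages memory for list objects. As C programmers know, a C array needs its space reserved in advance; an int[n], for example, cannot hold more than n elements. A Python list, by contrast, has no fixed length and grows on demand. If Python requested space for exactly one more element every time the reserved space ran out, every append would trigger an allocation, which is expensive. So Python keeps an allocated field that records how many slots have already been reserved. When space runs out, it requests several extra slots at once, which avoids the cost of allocating on every append.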
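The effect of over-allocation can be observed from Python itself. The following is a minimal sketch, assuming CPython, where sys.getsizeof on a list reflects the allocated capacity rather than the length. The helper name count_reallocations is made up for this illustration.

```python
import sys


def count_reallocations(n_appends):
    """Append n_appends items and count how often the list's byte size changes."""
    items = []
    last_size = sys.getsizeof(items)
    changes = 0
    for i in range(n_appends):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:
            # The reported size only moves when the capacity was changed.
            changes += 1
            last_size = size
    return changes


changes = count_reallocations(1000)
print(f"1000 appends changed the allocation {changes} times")
```

The exact count depends on the interpreter version, but it should be far smaller than the number of appends, since most appends land in slots that were already reserved.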

So, as the comment above states, a list object satisfies the following invariants:

0 <= ob_size <= allocated
len(list) == ob_size
ob_item == NULL implies ob_size == allocated == 0
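  1. The length of the list is at least 0 and at most the number of allocated slots.
  2. The length of the list equals ob_size; in other words, ob_size is the list length.
  3. When ob_item is NULL, both the length and the allocated capacity are 0.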

So how does Python decide how much space to request each time?

In the CPython implementation, the list_resize function manages the space actually allocated for a list object.

[Objects/listobject.c]
/* Ensure ob_item has room for at least newsize elements, and set
 * ob_size to newsize.  If newsize > ob_size on entry, the content
 * of the new slots at exit is undefined heap trash; it's the caller's
 * responsibility to overwrite them with sane values.
 * The number of allocated elements may grow, shrink, or stay the same.
 * Failure is impossible if newsize <= self.allocated on entry, although
 * that partly relies on an assumption that the system realloc() never
 * fails when passed a number of bytes <= the number of bytes last
 * allocated (the C standard doesn't guarantee this, but it's hard to
 * imagine a realloc implementation where it wouldn't be true).
 * Note that self->ob_item may change, and even if newsize is less
 * than ob_size on entry.
 */
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    PyObject **items;
    size_t new_allocated, num_allocated_bytes;
    Py_ssize_t allocated = self->allocated;

    /* Bypass realloc() when a previous overallocation is large enough
       to accommodate the newsize.  If the newsize falls lower than half
       the allocated size, then proceed with the realloc() to shrink the list.
    */
    if (allocated >= newsize && newsize >= (allocated >> 1)) {
        assert(self->ob_item != NULL || newsize == 0);
        Py_SET_SIZE(self, newsize);
        return 0;
    }

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * Add padding to make the allocated size multiple of 4.
     * The growth pattern is:  0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...
     * Note: new_allocated won't overflow because the largest possible value
     *       is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
     */
    new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
    /* Do not overallocate if the new size is closer to overallocated size
     * than to the old size.
     */
    if (newsize - Py_SIZE(self) > (Py_ssize_t)(new_allocated - newsize))
        new_allocated = ((size_t)newsize + 3) & ~(size_t)3;

    if (newsize == 0)
        new_allocated = 0;
    num_allocated_bytes = new_allocated * sizeof(PyObject *);
    items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
    if (items == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    self->ob_item = items;
    Py_SET_SIZE(self, newsize);
    self->allocated = new_allocated;
    return 0;
}

From this we can see how Python manages list memory:

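  1. When the new length is less than or equal to allocated and greater than or equal to half of allocated, no reallocation takes place; ob_size is updated and the function returns.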

  2. Otherwise, a new allocated value is computed from a fixed formula and the list's memory is resized to match.

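The formula, where x is the new size:

  1. First compute a candidate capacity: (x + floor(x / 8) + 6) & ~3
  2. If the number of elements being added (x minus the current length) is greater than the spare room that candidate would leave (candidate minus x), skip the over-allocation and use (x + 3) & ~3 instead, which rounds x up to a multiple of 4.
  3. If the new size is 0, the capacity is set to 0.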
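The steps above can be sketched in Python to reproduce the growth pattern quoted in the source comment. This is a model of the arithmetic only, not of the actual memory handling; list_resize_model and growth_pattern are names invented for this illustration.

```python
def list_resize_model(size, allocated, newsize):
    """Return the capacity after resizing a list of `size` items to `newsize`.

    `size` plays the role of ob_size and `allocated` the current capacity.
    """
    # Fast path: capacity is sufficient and at least half of it stays in use.
    if allocated >= newsize and newsize >= (allocated >> 1):
        return allocated
    # Mild over-allocation, with the result aligned to a multiple of 4.
    new_allocated = (newsize + (newsize >> 3) + 6) & ~3
    # Large jump in size: drop the over-allocation, round newsize up to 4.
    if newsize - size > new_allocated - newsize:
        new_allocated = (newsize + 3) & ~3
    if newsize == 0:
        new_allocated = 0
    return new_allocated


def growth_pattern(n_appends):
    """Simulate repeated appends and record each distinct capacity."""
    size = allocated = 0
    pattern = [0]
    for _ in range(n_appends):
        new_allocated = list_resize_model(size, allocated, size + 1)
        if new_allocated != allocated:
            pattern.append(new_allocated)
        allocated = new_allocated
        size += 1
    return pattern


print(growth_pattern(70))  # [0, 4, 8, 16, 24, 32, 40, 52, 64, 76]
```

Resizing an empty list straight to 100 elements hits step 2 of the formula: the candidate capacity is 116, the jump of 100 exceeds the 16 spare slots, and the model returns exactly 100.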

Creating a list

CPython provides one function for creating a list object.

[Include/listobject.h]
PyAPI_FUNC(PyObject *) PyList_New(Py_ssize_t size);

Let's look at the creation logic in detail.

PyObject *
PyList_New(Py_ssize_t size)
{
    if (size < 0) {
        PyErr_BadInternalCall();
        return NULL;
    }

    struct _Py_list_state *state = get_list_state();
    PyListObject *op;
#ifdef Py_DEBUG
    // PyList_New() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree) {
        state->numfree--;
        op = state->free_list[state->numfree];
        _Py_NewReference((PyObject *)op);
    }
    else {
        op = PyObject_GC_New(PyListObject, &PyList_Type);
        if (op == NULL) {
            return NULL;
        }
    }
    if (size <= 0) {
        op->ob_item = NULL;
    }
    else {
        op->ob_item = (PyObject **) PyMem_Calloc(size, sizeof(PyObject *));
        if (op->ob_item == NULL) {
            Py_DECREF(op);
            return PyErr_NoMemory();
        }
    }
    Py_SET_SIZE(op, size);
    op->allocated = size;
    _PyObject_GC_TRACK(op);
    return (PyObject *) op;
}

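The function takes a size argument, so the initial number of elements can be fixed when the list is created. Its main job is to allocate space for size element pointers (zero-filled by PyMem_Calloc, or no array at all when size is 0) and to initialize the fields of the PyListObject, setting both ob_size and allocated to size.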
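To make the field initialization concrete, here is a toy Python model of what PyList_New sets up, together with a check of the invariants listed earlier. ListModel is a hypothetical class for illustration; it ignores the free list, reference counting, and GC tracking.

```python
class ListModel:
    """Toy stand-in for PyListObject: only ob_item, ob_size and allocated."""

    def __init__(self, size):
        if size < 0:
            # PyList_New calls PyErr_BadInternalCall, which sets SystemError.
            raise SystemError("bad argument to internal function")
        # size == 0 leaves ob_item as NULL; otherwise every slot starts as NULL.
        self.ob_item = None if size == 0 else [None] * size
        self.ob_size = size
        self.allocated = size

    def check_invariants(self):
        """Return True if the documented invariants hold for this model."""
        if not 0 <= self.ob_size <= self.allocated:
            return False
        if self.ob_item is None and not (self.ob_size == self.allocated == 0):
            return False
        return True


empty = ListModel(0)
three = ListModel(3)
print(empty.ob_item, three.ob_size, three.allocated)  # None 3 3
print(empty.check_invariants(), three.check_invariants())  # True True
```

Note that a freshly created list has no spare capacity: allocated equals size, and over-allocation only starts once list_resize is called.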

Free list management

In the list creation source above, you can see the familiar Python object cache, the free list.

    struct _Py_list_state *state = get_list_state();
    PyListObject *op;
#ifdef Py_DEBUG
    // PyList_New() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree) {
        state->numfree--;
        op = state->free_list[state->numfree];
        _Py_NewReference((PyObject *)op);
    }

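free_list holds the list objects that Python has cached. When creating a list object, Python first checks whether the free list contains a usable object and, if so, reuses it directly. Otherwise it has to allocate memory and create a new PyListObject.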

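By default, the free list for list objects holds up to 80 entries. The size is a compile-time constant, PyList_MAXFREELIST, so changing it means changing the definition and recompiling Python.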

[Include/internal/pycore_interp.h]
#ifndef PyList_MAXFREELIST
#  define PyList_MAXFREELIST 80
#endif

struct _Py_list_state {
    PyListObject *free_list[PyList_MAXFREELIST];
    int numfree;
};

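We have seen how the free list is defined and used, but when are entries added to it? They are added when a list object is destroyed. As with several other built-in types, Python trades a small amount of memory for better overall performance: when a list is destroyed, its PyListObject is placed on the free list so that the next creation can reuse it and skip allocating a new object. Only the PyListObject structure itself is cached; the ob_item array is freed.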

Here is the function that runs when a list object is destroyed:

[Objects/listobject.c]
static void
list_dealloc(PyListObject *op)
{
    Py_ssize_t i;
    PyObject_GC_UnTrack(op);
    Py_TRASHCAN_BEGIN(op, list_dealloc)
    if (op->ob_item != NULL) {
        /* Do it backwards, for Christian Tismer.
           There's a simple test case where somehow this reduces
           thrashing when a *very* large list is created and
           immediately deleted. */
        i = Py_SIZE(op);
        while (--i >= 0) {
            Py_XDECREF(op->ob_item[i]);
        }
        PyMem_Free(op->ob_item);
    }
    struct _Py_list_state *state = get_list_state();
#ifdef Py_DEBUG
    // list_dealloc() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree < PyList_MAXFREELIST && PyList_CheckExact(op)) {
        state->free_list[state->numfree++] = op;
    }
    else {
        Py_TYPE(op)->tp_free((PyObject *)op);
    }
    Py_TRASHCAN_END
}

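As the code shows, destroying a list object happens in two steps.

  1. Release every element of the list (drop one reference to each) and free the ob_item array.
  2. Depending on the state of the free list, either put the PyListObject on the free list or free its memory.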
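The interplay between creation and destruction can be sketched as a small Python model. This is an illustration under simplifying assumptions, not CPython's code: Shell stands in for an empty PyListObject structure, and FreeListModel with its new and dealloc methods are invented names.

```python
MAX_FREELIST = 80  # mirrors PyList_MAXFREELIST


class Shell:
    """Stand-in for a PyListObject structure whose ob_item was freed."""


class FreeListModel:
    def __init__(self, capacity=MAX_FREELIST):
        self.capacity = capacity
        self.free_list = []          # cached shells, used as a stack
        self.fresh_allocations = 0   # how often a brand-new shell was needed

    def new(self):
        """Like PyList_New: reuse a cached shell if there is one."""
        if self.free_list:
            return self.free_list.pop()
        self.fresh_allocations += 1
        return Shell()

    def dealloc(self, shell):
        """Like list_dealloc: cache the shell unless the free list is full."""
        if len(self.free_list) < self.capacity:
            self.free_list.append(shell)
        # Otherwise the shell is simply dropped (tp_free in the C code).


pool = FreeListModel(capacity=2)
a, b, c = pool.new(), pool.new(), pool.new()
for shell in (a, b, c):
    pool.dealloc(shell)          # the third one does not fit and is dropped
reused = pool.new()
print(reused is b, pool.fresh_allocations)  # True 3
```

The most recently cached shell is handed out first, which matches the stack-like use of numfree in the C code.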
posted @ 2022-09-05 15:28  红雨520