Python Source Code Analysis: The Underlying Implementation of the list Object (PyListObject)

Introduction
PyListObject
Memory Management
Creating a list
Free List Management

This article is based on Python 3.10.4.

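Introduction

The array is one of the most important concepts in programming: multiple elements that share some property are gathered into one collection, and elements can be added to or removed from it. C already has arrays, and most other programming languages provide some form of array as well.

PyListObject is the object that implements the array in Python. It is quite similar to vector in the C++ STL.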

PyListObject

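PyListObject supports operations such as appending, inserting, and deleting elements. As mentioned above, an array is a collection of elements that share some property. What a PyListObject stores are pointers to PyObject, and every object in Python is based on PyObject. This is where a Python list differs from a C array: as anyone familiar with C knows, a C array can only hold elements of a single type, such as int, double, or float, whereas a Python list can hold objects of any type.

Next, let's look at the definition of PyListObject: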

[Include/cpython/listobject.h]
typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;

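PyObject_VAR_HEAD: the header shared by all variable-length objects in Python. Its ob_size field stores the length of the object.

ob_item: the storage of the list object. As the comment says, list[0] is ob_item[0], so the n-th element of the list (counting from 1) is ob_item[n-1].

allocated: the capacity of the array, that is, the number of element slots actually allocated in memory. This is tied to how Python manages the list's memory, which is covered next.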
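
To see ob_size and allocated side by side from Python, the capacity can be inferred from the size of the object. The following is a minimal sketch that assumes CPython, where list.__sizeof__() is the base struct size plus one pointer-sized slot per allocated element. allocated_slots is a hypothetical helper name used for illustration, not a CPython API.

```python
import struct

POINTER_SIZE = struct.calcsize("P")  # size of a PyObject* on this platform


def allocated_slots(lst):
    """Estimate the 'allocated' field of a list (CPython-specific assumption)."""
    # Assumption: list.__sizeof__() == list.__basicsize__ + allocated * sizeof(void*)
    return (lst.__sizeof__() - list.__basicsize__) // POINTER_SIZE


items = []
for value in range(10):
    items.append(value)
    # len(items) corresponds to ob_size; the estimate is the capacity behind it
    print(len(items), allocated_slots(items))
```

The exact capacities printed depend on the interpreter version, so no expected output is given here.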

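Memory Management

As described above, PyListObject has an ob_size field that stores the length of the list and an allocated field that stores its capacity. What is the difference between the two, and why does Python implement it this way?

This comes down to how Python manages memory for list objects. As anyone familiar with C knows, a C array must have its space reserved in advance: an array declared as int a[n] cannot hold more than n elements. A Python list, by contrast, has no length fixed at creation time. If a list allocated exactly as much space as it currently uses, every added element would require another allocation, which is expensive. So Python introduces the allocated field, which records how many slots have already been allocated. When space runs out, it allocates some extra slots in one go, avoiding the cost of a reallocation on every added element.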
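
The effect of over-allocation can be observed from Python. The sketch below assumes CPython, where sys.getsizeof reports a size that includes the allocated slots, so a change in that size between two appends indicates that ob_item was reallocated.

```python
import sys

items = []
sizes = []
for value in range(1000):
    items.append(value)
    sizes.append(sys.getsizeof(items))

# Each change in the reported size corresponds to one reallocation of ob_item.
# The first append always allocates, hence the "+ 1" below.
reallocations = sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur != prev)
print(f"{len(items)} appends, {reallocations + 1} reallocations observed")
```

The number of reallocations should come out far smaller than the number of appends, which is the point of keeping spare capacity.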

So, as the comment above states, a list object satisfies the following invariants:

0 <= ob_size <= allocated
len(list) == ob_size
ob_item == NULL implies ob_size == allocated == 0

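The length of the list is at least 0 and at most the allocated capacity.
The length of the list equals ob_size; that is, ob_size stores the list length.
When ob_item is NULL, the list length and the allocated capacity are both 0.

So how does Python decide how much space to allocate each time?

In the CPython implementation, the list_resize function manages the space actually allocated for a list object.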

[Objects/listobject.c]
/* Ensure ob_item has room for at least newsize elements, and set
 * ob_size to newsize.  If newsize > ob_size on entry, the content
 * of the new slots at exit is undefined heap trash; it's the caller's
 * responsibility to overwrite them with sane values.
 * The number of allocated elements may grow, shrink, or stay the same.
 * Failure is impossible if newsize <= self.allocated on entry, although
 * that partly relies on an assumption that the system realloc() never
 * fails when passed a number of bytes <= the number of bytes last
 * allocated (the C standard doesn't guarantee this, but it's hard to
 * imagine a realloc implementation where it wouldn't be true).
 * Note that self->ob_item may change, and even if newsize is less
 * than ob_size on entry.
 */
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    PyObject **items;
    size_t new_allocated, num_allocated_bytes;
    Py_ssize_t allocated = self->allocated;

    /* Bypass realloc() when a previous overallocation is large enough
       to accommodate the newsize.  If the newsize falls lower than half
       the allocated size, then proceed with the realloc() to shrink the list.
    */
    if (allocated >= newsize && newsize >= (allocated >> 1)) {
        assert(self->ob_item != NULL || newsize == 0);
        Py_SET_SIZE(self, newsize);
        return 0;
    }

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * Add padding to make the allocated size multiple of 4.
     * The growth pattern is:  0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...
     * Note: new_allocated won't overflow because the largest possible value
     *       is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
     */
    new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
    /* Do not overallocate if the new size is closer to overallocated size
     * than to the old size.
     */
    if (newsize - Py_SIZE(self) > (Py_ssize_t)(new_allocated - newsize))
        new_allocated = ((size_t)newsize + 3) & ~(size_t)3;

    if (newsize == 0)
        new_allocated = 0;
    num_allocated_bytes = new_allocated * sizeof(PyObject *);
    items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
    if (items == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    self->ob_item = items;
    Py_SET_SIZE(self, newsize);
    self->allocated = new_allocated;
    return 0;
}

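From this we can see how Python manages list memory:

When the new length is less than or equal to allocated and greater than or equal to half of allocated, no reallocation takes place; ob_size is updated directly and the function returns.
Otherwise, a new allocated value is computed with a fixed formula and the list's memory is resized accordingly.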
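
The first rule can be written as a small predicate. This is a model of the condition at the top of list_resize, not CPython code; can_skip_realloc is a hypothetical name.

```python
def can_skip_realloc(allocated, newsize):
    """Model of the fast-path test in list_resize: True means no realloc."""
    return allocated >= newsize and newsize >= (allocated >> 1)


# With 8 allocated slots, any new length from 4 to 8 keeps the buffer as it is.
for newsize in (9, 8, 5, 4, 3):
    print(newsize, can_skip_realloc(8, newsize))
```

Growing past the capacity (9) and shrinking below half of it (3) both fall through to the reallocation path.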

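The formula, where x is the new length (newsize):

First compute a preliminary value: (x + floor(x / 8) + 6) & ~3
If the number of newly added elements (x - ob_size) is greater than the spare capacity the first step would leave (the preliminary value minus x), use (x + 3) & ~3 instead.
If the new size is 0, the allocated capacity is set to 0.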
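
The three steps above can be sketched as a pure Python model of list_resize as shown for 3.10. new_capacity and growth_pattern are hypothetical names; the model only tracks sizes and performs no memory allocation.

```python
def new_capacity(old_size, allocated, newsize):
    """Model of list_resize (CPython 3.10): return the resulting capacity."""
    if allocated >= newsize and newsize >= (allocated >> 1):
        return allocated  # fast path, no reallocation
    new_allocated = (newsize + (newsize >> 3) + 6) & ~3
    if newsize - old_size > new_allocated - newsize:
        new_allocated = (newsize + 3) & ~3
    if newsize == 0:
        new_allocated = 0
    return new_allocated


def growth_pattern(appends):
    """Capacities seen while appending one element at a time to an empty list."""
    size, allocated = 0, 0
    pattern = [allocated]
    for _ in range(appends):
        capacity = new_capacity(size, allocated, size + 1)
        if capacity != allocated:
            pattern.append(capacity)
        size, allocated = size + 1, capacity
    return pattern


print(growth_pattern(70))  # [0, 4, 8, 16, 24, 32, 40, 52, 64, 76]
```

The printed sequence matches the growth pattern quoted in the source comment. The second step matters when many elements arrive at once: new_capacity(0, 0, 10) yields 12 rather than the preliminary 16.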

Creating a list

CPython provides the following function for creating a list.

[Include/listobject.h]
PyAPI_FUNC(PyObject *) PyList_New(Py_ssize_t size);

Let's look at the creation logic in detail.

PyObject *
PyList_New(Py_ssize_t size)
{
    if (size < 0) {
        PyErr_BadInternalCall();
        return NULL;
    }

    struct _Py_list_state *state = get_list_state();
    PyListObject *op;
#ifdef Py_DEBUG
    // PyList_New() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree) {
        state->numfree--;
        op = state->free_list[state->numfree];
        _Py_NewReference((PyObject *)op);
    }
    else {
        op = PyObject_GC_New(PyListObject, &PyList_Type);
        if (op == NULL) {
            return NULL;
        }
    }
    if (size <= 0) {
        op->ob_item = NULL;
    }
    else {
        op->ob_item = (PyObject **) PyMem_Calloc(size, sizeof(PyObject *));
        if (op->ob_item == NULL) {
            Py_DECREF(op);
            return PyErr_NoMemory();
        }
    }
    Py_SET_SIZE(op, size);
    op->allocated = size;
    _PyObject_GC_TRACK(op);
    return (PyObject *) op;
}

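This function takes a size parameter, so the initial number of elements can be specified when the list is created. The main logic is to allocate space according to size and initialize the fields of the PyListObject. Note that PyMem_Calloc zero-fills the array, so every slot starts out as NULL, and allocated is set to exactly size with no over-allocation.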
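
PyList_New can be exercised from Python through ctypes. The sketch below assumes CPython with ctypes.pythonapi available; make_list is a hypothetical helper. Because the new list starts with NULL slots, every slot is filled with PyList_SetItem before the list is used, and since PyList_SetItem steals a reference, each item's reference count is incremented first.

```python
import ctypes

api = ctypes.pythonapi  # assumption: running on CPython

api.PyList_New.argtypes = [ctypes.c_ssize_t]
api.PyList_New.restype = ctypes.py_object
api.PyList_SetItem.argtypes = [ctypes.py_object, ctypes.c_ssize_t, ctypes.py_object]
api.PyList_SetItem.restype = ctypes.c_int
api.Py_IncRef.argtypes = [ctypes.py_object]
api.Py_IncRef.restype = None


def make_list(values):
    """Build a list through PyList_New, then fill every slot."""
    result = api.PyList_New(len(values))  # slots start out as NULL
    for index, value in enumerate(values):
        api.Py_IncRef(value)  # PyList_SetItem steals one reference
        if api.PyList_SetItem(result, index, value) != 0:
            raise RuntimeError("PyList_SetItem failed")
    return result


built = make_list(["a", "b", "c"])
print(built, len(built))
```

In real extension code this would be written in C; the ctypes route is used here only to keep the example in Python.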

Free List Management

In the list creation source above, we can see Python's familiar free list (a pool of cached objects).

    struct _Py_list_state *state = get_list_state();
    PyListObject *op;
#ifdef Py_DEBUG
    // PyList_New() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree) {
        state->numfree--;
        op = state->free_list[state->numfree];
        _Py_NewReference((PyObject *)op);
    }

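free_list holds the list objects that Python has cached. When a list object is created, Python first checks whether the free list contains an available object and, if so, reuses it directly. Otherwise, it has to allocate fresh memory (through PyObject_GC_New) and create a new PyListObject.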
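
The reuse can sometimes be observed from Python by comparing object addresses. This is only an experiment under the assumption of a CPython build that has this free list; the result is not guaranteed, since an equal address may also come from the memory allocator reusing a block. address_is_reused is a hypothetical name.

```python
def address_is_reused():
    first = []
    address = id(first)
    del first    # list_dealloc may park the object in free_list
    second = []  # PyList_New may take it back out
    return id(second) == address


print(address_is_reused())
```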

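By default, the free list for list objects holds at most 80 entries. This limit is a compile-time constant in the source (PyList_MAXFREELIST), so changing it requires rebuilding CPython.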

[Include/internal/pycore_interp.h]
#ifndef PyList_MAXFREELIST
#  define PyList_MAXFREELIST 80
#endif

struct _Py_list_state {
    PyListObject *free_list[PyList_MAXFREELIST];
    int numfree;
};

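The above covers how the free list is defined and used, but when are entries put into it? That happens when a list object is destroyed. As with other object types, Python trades a small amount of memory for better overall performance: when a list object is destroyed, its PyListObject is placed in the free list so that it can be reused the next time one is needed, skipping the allocation step.

Let's look at the function that destroys a list object: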

[Objects/listobject.c]
static void
list_dealloc(PyListObject *op)
{
    Py_ssize_t i;
    PyObject_GC_UnTrack(op);
    Py_TRASHCAN_BEGIN(op, list_dealloc)
    if (op->ob_item != NULL) {
        /* Do it backwards, for Christian Tismer.
           There's a simple test case where somehow this reduces
           thrashing when a *very* large list is created and
           immediately deleted. */
        i = Py_SIZE(op);
        while (--i >= 0) {
            Py_XDECREF(op->ob_item[i]);
        }
        PyMem_Free(op->ob_item);
    }
    struct _Py_list_state *state = get_list_state();
#ifdef Py_DEBUG
    // list_dealloc() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree < PyList_MAXFREELIST && PyList_CheckExact(op)) {
        state->free_list[state->numfree++] = op;
    }
    else {
        Py_TYPE(op)->tp_free((PyObject *)op);
    }
    Py_TRASHCAN_END
}

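As we can see, destroying a list object happens in two steps.

Release every element in the list (decrement its reference count) and free the ob_item array.
Depending on the state of the free list, either put the object into the free list or free its memory. The object is cached only if the free list is not full and the object is exactly a list, not an instance of a subclass.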
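
The interplay between PyList_New and list_dealloc can be sketched as a bounded pool. This is a simplified model, not CPython code: ListShell and FreeListModel are hypothetical names, and the PyList_CheckExact test is left out because the model has only one kind of object.

```python
MAX_FREE = 80  # mirrors PyList_MAXFREELIST


class ListShell:
    """Stand-in for a PyListObject struct without its ob_item array."""


class FreeListModel:
    def __init__(self):
        self.free_list = []  # plays the role of state->free_list
        self.fresh_allocations = 0

    def new(self):
        if self.free_list:               # state->numfree != 0
            return self.free_list.pop()  # reuse the most recently freed shell
        self.fresh_allocations += 1      # PyObject_GC_New path
        return ListShell()

    def dealloc(self, shell):
        if len(self.free_list) < MAX_FREE:
            self.free_list.append(shell)  # cached for later reuse
        # otherwise the shell is simply dropped (tp_free path)


pool = FreeListModel()
shells = [pool.new() for _ in range(100)]
for shell in shells:
    pool.dealloc(shell)
print(len(pool.free_list), pool.fresh_allocations)  # 80 100
```

Of the 100 released shells only 80 are kept, and the next call to new() returns a cached shell without a fresh allocation.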

Posted 2022-09-05 15:28 by 红雨520
