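Python Source Code Analysis: The Underlying Implementation of the list Object (PyListObject)
This article is based on Python 3.10.4.
Introduction
Arrays are a fundamental concept in programming: we gather multiple elements that share some property into one collection, and we can add elements to it or remove them. C already has arrays, and nearly every other programming language implements the concept as well.
PyListObject is the object that implements arrays in Python. It is quite similar to vector in the C++ STL.
PyListObject
PyListObject supports operations such as appending, inserting, and deleting elements. As noted above, an array is a collection of elements that share some property. What a PyListObject stores are pointers to PyObject, and every object in Python is built on PyObject. This is where a Python list differs from a C array: as C programmers know, a C array holds elements of one type only, such as int, double, or float, whereas a Python list can hold objects of any type.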
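As a small illustration of the point above, the sketch below builds a list holding objects of several unrelated types. The variable names (`inner`, `mixed`, `kinds`) are made up for this example. Because the list stores references rather than copies, a change made through `inner` is visible through `mixed` as well.

```python
# A list can hold objects of any type, because each slot is just a
# reference (a PyObject pointer at the C level) to some object.
inner = [4, 5]
mixed = [1, 2.5, "three", None, inner]

# Collect the type name of every element.
kinds = [type(x).__name__ for x in mixed]

# The last slot refers to the very same object as `inner`, not a copy.
same_object = mixed[-1] is inner

# Mutating `inner` is therefore visible through the list.
inner.append(6)
print(kinds, same_object, mixed[-1])
```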
Next, let's look at the definition of PyListObject:
[Include/cpython/listobject.h]
typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements. list[0] is ob_item[0], etc. */
    PyObject **ob_item;
    /* ob_item contains space for 'allocated' elements. The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;
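PyObject_VAR_HEAD: the header shared by all variable-length objects in Python. Its ob_size field stores the object's length, which for a list is the number of elements.
ob_item: the list's storage, an array of pointers to the elements. As the comment shows, list[0] is ob_item[0], so the n-th element (counting from 1) is ob_item[n-1].
allocated: the capacity of the array, that is, the number of slots actually allocated in memory. This is tied to Python's memory management for lists, covered next.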
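Memory management
As described above, PyListObject has an ob_size field that stores the list's length and an allocated field that stores its capacity. What is the difference between the two, and why does Python implement it this way?
This comes down to how Python manages memory for list objects. As C programmers know, a C array's space has to be reserved up front: an int array declared with n elements cannot hold more than n elements. A Python list, by contrast, has no fixed length. If Python allocated exactly as much space as the list currently needs, then once that space was used up, every added element would require another allocation, which is expensive. So Python keeps an allocated field that records how many slots have already been allocated. When space runs out, it allocates room for several extra elements at once, avoiding the cost of reallocating on every append.
As the comment quoted earlier states, a list object satisfies the following invariants: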
0 <= ob_size <= allocated
len(list) == ob_size
ob_item == NULL implies ob_size == allocated == 0
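- The list's length is at least 0 and at most the number of allocated slots.
- The list's length equals ob_size; in other words, ob_size is the list's length.
- When ob_item is NULL, the list's length and allocated are both 0.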
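These invariants can be observed on a live list. The sketch below is an assumption-heavy illustration, not an official API: it uses ctypes to overlay a hypothetical `ListHeader` structure on the object's memory. It assumes a standard CPython build (not a debug or free-threaded build), where `id()` returns the object's address and the header consists of a reference count and a type pointer followed by ob_size. The helper name `peek` is also made up for this example.

```python
import ctypes

class ListHeader(ctypes.Structure):
    # Hypothetical mirror of PyListObject for a standard CPython build.
    # Field order follows the struct definition quoted above.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),   # from PyObject_VAR_HEAD
        ("ob_type", ctypes.c_void_p),      # from PyObject_VAR_HEAD
        ("ob_size", ctypes.c_ssize_t),     # from PyObject_VAR_HEAD
        ("ob_item", ctypes.c_void_p),      # PyObject **ob_item
        ("allocated", ctypes.c_ssize_t),   # Py_ssize_t allocated
    ]

def peek(lst):
    """Return (ob_size, allocated) read directly from the list's C struct."""
    header = ListHeader.from_address(id(lst))  # CPython: id() is the address
    return header.ob_size, header.allocated

items = []
for i in range(5):
    items.append(i)

ob_size, allocated = peek(items)
print("ob_size:", ob_size, "allocated:", allocated)
```

Reading memory this way is only suitable for experiments. The layout is an implementation detail and may differ between builds and versions.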
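So how does Python decide how much space to allocate each time?
In the CPython implementation, the list_resize function manages the space actually allocated for a list object.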
[Objects/listobject.c]
/* Ensure ob_item has room for at least newsize elements, and set
 * ob_size to newsize. If newsize > ob_size on entry, the content
 * of the new slots at exit is undefined heap trash; it's the caller's
 * responsibility to overwrite them with sane values.
 * The number of allocated elements may grow, shrink, or stay the same.
 * Failure is impossible if newsize <= self.allocated on entry, although
 * that partly relies on an assumption that the system realloc() never
 * fails when passed a number of bytes <= the number of bytes last
 * allocated (the C standard doesn't guarantee this, but it's hard to
 * imagine a realloc implementation where it wouldn't be true).
 * Note that self->ob_item may change, and even if newsize is less
 * than ob_size on entry.
 */
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    PyObject **items;
    size_t new_allocated, num_allocated_bytes;
    Py_ssize_t allocated = self->allocated;
    /* Bypass realloc() when a previous overallocation is large enough
       to accommodate the newsize. If the newsize falls lower than half
       the allocated size, then proceed with the realloc() to shrink the list.
    */
    if (allocated >= newsize && newsize >= (allocated >> 1)) {
        assert(self->ob_item != NULL || newsize == 0);
        Py_SET_SIZE(self, newsize);
        return 0;
    }
    /* This over-allocates proportional to the list size, making room
     * for additional growth. The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * Add padding to make the allocated size multiple of 4.
     * The growth pattern is: 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...
     * Note: new_allocated won't overflow because the largest possible value
     * is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
     */
    new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
    /* Do not overallocate if the new size is closer to overallocated size
     * than to the old size.
     */
    if (newsize - Py_SIZE(self) > (Py_ssize_t)(new_allocated - newsize))
        new_allocated = ((size_t)newsize + 3) & ~(size_t)3;
    if (newsize == 0)
        new_allocated = 0;
    num_allocated_bytes = new_allocated * sizeof(PyObject *);
    items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
    if (items == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    self->ob_item = items;
    Py_SET_SIZE(self, newsize);
    self->allocated = new_allocated;
    return 0;
}
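From this we can see how Python manages list memory:
- If the new length is less than or equal to allocated and greater than or equal to half of allocated, no reallocation takes place: ob_size is updated and the function returns.
- Otherwise, a new allocated value is computed by a fixed formula and the list's memory is resized accordingly.
The formula, where x is the new length (newsize):
- First compute (x + floor(x / 8) + 6) & ~3.
- If the number of newly added elements (x minus the current ob_size) is greater than the spare room that first value would leave (the first value minus x), use (x + 3) & ~3 instead, which is x rounded up to a multiple of 4.
- If the new length is 0, allocated is set to 0.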
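The rules above can be sketched as a small pure-Python model. This is a simulation written for illustration, not CPython code: the function names `list_resize_model` and `growth_pattern` are made up, and the model tracks only the two integers ob_size and allocated, ignoring the actual memory allocation. Replaying a sequence of single-element appends reproduces the growth pattern quoted in the source comment.

```python
def list_resize_model(size, allocated, newsize):
    """Model of list_resize: return the new (ob_size, allocated) pair."""
    # Fast path: enough room, and not less than half of it in use.
    if allocated >= newsize and newsize >= (allocated >> 1):
        return newsize, allocated
    # Mild proportional over-allocation, rounded down to a multiple of 4.
    new_allocated = (newsize + (newsize >> 3) + 6) & ~3
    # Large jump in size: skip over-allocation, round up to a multiple of 4.
    if newsize - size > new_allocated - newsize:
        new_allocated = (newsize + 3) & ~3
    if newsize == 0:
        new_allocated = 0
    return newsize, new_allocated

def growth_pattern(appends):
    """Record each distinct capacity seen while appending one item at a time."""
    size = allocated = 0
    seen = [0]
    for _ in range(appends):
        size, allocated = list_resize_model(size, allocated, size + 1)
        if allocated != seen[-1]:
            seen.append(allocated)
    return seen

pattern = growth_pattern(76)
print(pattern)
# Growing an empty list to 10 elements in one step hits the second rule.
print(list_resize_model(0, 0, 10))
```

In this model a single append never triggers the second rule, because the first formula always leaves at least three spare slots. The second rule only matters when the size jumps by many elements in one call.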
Creating a list
CPython provides a C API function for creating a list.
[Include/listobject.h]
PyAPI_FUNC(PyObject *) PyList_New(Py_ssize_t size);
Let's look at the creation logic in detail.
PyObject *
PyList_New(Py_ssize_t size)
{
    if (size < 0) {
        PyErr_BadInternalCall();
        return NULL;
    }
    struct _Py_list_state *state = get_list_state();
    PyListObject *op;
#ifdef Py_DEBUG
    // PyList_New() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree) {
        state->numfree--;
        op = state->free_list[state->numfree];
        _Py_NewReference((PyObject *)op);
    }
    else {
        op = PyObject_GC_New(PyListObject, &PyList_Type);
        if (op == NULL) {
            return NULL;
        }
    }
    if (size <= 0) {
        op->ob_item = NULL;
    }
    else {
        op->ob_item = (PyObject **) PyMem_Calloc(size, sizeof(PyObject *));
        if (op->ob_item == NULL) {
            Py_DECREF(op);
            return PyErr_NoMemory();
        }
    }
    Py_SET_SIZE(op, size);
    op->allocated = size;
    _PyObject_GC_TRACK(op);
    return (PyObject *) op;
}
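This function takes a size parameter, so the initial number of elements can be specified when the list is created. The main logic is to allocate storage according to size and to initialize the fields of the PyListObject. When size is greater than 0, the slots are zero-filled by PyMem_Calloc, so they are NULL until the caller fills them in.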
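For experimentation, PyList_New can also be reached from Python through ctypes. The sketch below is only an illustration and assumes CPython, where `ctypes.pythonapi` exposes the interpreter's C API. It deliberately calls the function with size 0 only: a non-zero size would return a list whose slots are still NULL, which is safe to handle only from C code that fills them before the list is used.

```python
import ctypes

# Bind the C function PyList_New(Py_ssize_t size) -> PyObject *.
PyList_New = ctypes.pythonapi.PyList_New
PyList_New.argtypes = [ctypes.c_ssize_t]
PyList_New.restype = ctypes.py_object  # ctypes takes over the new reference

# size == 0: ob_item is NULL, and ob_size and allocated are both 0.
empty = PyList_New(0)
print(type(empty), empty)

# The result is an ordinary list and grows through the usual methods.
empty.append("first")
print(empty)
```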
Cache (free list) management
In the list-creation code above, we can see a familiar Python construct: the object cache, or free list.
    struct _Py_list_state *state = get_list_state();
    PyListObject *op;
#ifdef Py_DEBUG
    // PyList_New() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree) {
        state->numfree--;
        op = state->free_list[state->numfree];
        _Py_NewReference((PyObject *)op);
    }
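free_list holds the list objects that Python has cached. When creating a list, Python first checks whether the cache contains a reusable object and, if so, uses it directly. Otherwise it has to request memory through PyObject_GC_New and create a new PyListObject.
By default the cache holds at most 80 list objects. This limit is a compile-time constant (PyList_MAXFREELIST), so changing it means modifying the build and recompiling.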
[Include/internal/pycore_interp.h]
#ifndef PyList_MAXFREELIST
#  define PyList_MAXFREELIST 80
#endif
struct _Py_list_state {
    PyListObject *free_list[PyList_MAXFREELIST];
    int numfree;
};
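We have seen how the cache is defined and used, but when are objects put into it? That happens when a list object is destroyed. As it does for several other built-in types, Python trades a small amount of memory for better overall performance: when a list is destroyed, its PyListObject is placed in the cache so that the next creation can reuse it and skip allocating a new object.
Here is the function that runs when a list object is destroyed: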
[Objects/listobject.c]
static void
list_dealloc(PyListObject *op)
{
    Py_ssize_t i;
    PyObject_GC_UnTrack(op);
    Py_TRASHCAN_BEGIN(op, list_dealloc)
    if (op->ob_item != NULL) {
        /* Do it backwards, for Christian Tismer.
           There's a simple test case where somehow this reduces
           thrashing when a *very* large list is created and
           immediately deleted. */
        i = Py_SIZE(op);
        while (--i >= 0) {
            Py_XDECREF(op->ob_item[i]);
        }
        PyMem_Free(op->ob_item);
    }
    struct _Py_list_state *state = get_list_state();
#ifdef Py_DEBUG
    // list_dealloc() must not be called after _PyList_Fini()
    assert(state->numfree != -1);
#endif
    if (state->numfree < PyList_MAXFREELIST && PyList_CheckExact(op)) {
        state->free_list[state->numfree++] = op;
    }
    else {
        Py_TYPE(op)->tp_free((PyObject *)op);
    }
    Py_TRASHCAN_END
}
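As the code shows, destroying a list object takes two steps.
- Release every element in the list (decrementing each reference count) and free the ob_item array.
- Depending on the state of the cache, either put the PyListObject into the cache or free its memory. The object is cached only if the cache is not full and the object is exactly of type list, not a subclass. Only the PyListObject struct itself is cached, since ob_item has already been freed.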
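The interplay between creation and destruction can be sketched with a small pure-Python model. This is a simplified simulation, not CPython code: the names `ListShell` and `FreeListModel` and the `fresh_allocations` counter are made up for illustration, and a Python list stands in for the fixed-size free_list array together with numfree.

```python
MAX_FREE = 80  # mirrors the default PyList_MAXFREELIST


class ListShell:
    """Stand-in for a bare PyListObject struct (no element storage)."""


class FreeListModel:
    def __init__(self, max_free=MAX_FREE):
        self.max_free = max_free
        self.free_list = []          # plays the role of free_list + numfree
        self.fresh_allocations = 0   # how often we had to allocate anew

    def new(self):
        """Model of the object-allocation part of PyList_New."""
        if self.free_list:
            # Reuse the most recently cached shell.
            return self.free_list.pop()
        self.fresh_allocations += 1
        return ListShell()

    def dealloc(self, shell):
        """Model of the final step of list_dealloc."""
        if len(self.free_list) < self.max_free:
            self.free_list.append(shell)
        # Otherwise the shell is simply dropped (stands in for tp_free).


model = FreeListModel(max_free=2)
a, b, c = model.new(), model.new(), model.new()
for shell in (a, b, c):
    model.dealloc(shell)   # a and b are cached, c does not fit

d = model.new()            # served from the cache, no fresh allocation
print(d is b, model.fresh_allocations, len(model.free_list))
```

The cache behaves like a stack: the shell cached last is the first to be handed out again, which matches the numfree-- followed by free_list[numfree] sequence in PyList_New.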