Memory management and garbage collection in Python (with common methods of the gc (Garbage Collection) module)
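Python is a dynamic scripting language, so some of its behavior differs from languages such as C++. Python connects variables to the contents of actual storage through objects and references. A few points are worth knowing about its memory management and garbage collection: the memory pool mechanism, introduced to reduce the cost of frequently requesting and releasing memory, and the mark-and-sweep and generational collection mechanisms used for garbage collection. Each is introduced briefly below.
Python memory management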
    _____   ______   ______       ________
   [ int ] [ dict ] [ list ] ... [ string ]       Python core         |
+3 | <----- Object-specific memory -----> | <-- Non-object memory --> |
    _______________________________       |                           |
   [   Python's object allocator   ]      |                           |
+2 | ####### Object memory ####### | <------ Internal buffers ------> |
    ______________________________________________________________    |
   [          Python's raw memory allocator (PyMem_ API)          ]   |
+1 | <----- Python memory (under PyMem manager's control) ------> |   |
    __________________________________________________________________
   [    Underlying general-purpose allocator (ex: C library malloc)   ]
 0 | <------ Virtual memory allocated for the python process -------> |
   =========================================================================
    _______________________________________________________________________
   [                OS-specific Virtual Memory Manager (VMM)               ]
-1 | <--- Kernel dynamic storage allocation & management (page-based) ---> |
    __________________________________   __________________________________
   [                                  ] [                                  ]
-2 | <-- Physical memory: ROM/RAM --> | | <-- Secondary storage (swap) --> |
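As the diagram shows, Python's memory management is organized into four main layers (0 to +3). The layer marked +3 faces the user directly and provides the object-specific allocators behind types such as int, list, string and dict. Layer 0 is the underlying general-purpose allocator (for example the C library's malloc), which sits on top of the operating system's memory management interface; Python does not interfere with it. Layer +1 is maintained by Python: it is a further wrapper around layer 0 (the PyMem_ API) and exists for portability across systems. Layer +2 is also maintained by Python and handles the creation and destruction of objects, including bookkeeping such as reference counts.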
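Python's memory pool mechanism
To make small objects cheaper to manage, Python uses a memory pool mechanism. Requests of at most 256 bytes (512 bytes since CPython 3.3) are served directly from the memory pool. The pool has its own internal structure; its smallest unit is the block, whose size is a multiple of 8 bytes. For some immutable objects, such as small integers, there is also an object cache. For example: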
a = 1
b = 3
c = 1
print(hex(id(a)))  # e.g. 0x23c2168 (addresses differ between runs)
print(hex(id(b)))  # e.g. 0x23c2138
print(hex(id(c)))  # e.g. 0x23c2168, the same address as a
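Although a and c were not bound with a = c, they still point to the same location. In CPython the immutable constant object 1 lives at a fixed location (small integers are cached), and every variable assigned the value 1 is a reference to it. The same holds for short strings: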
a = 'haha'
b = 'hahaha'
c = 'haha'
print(hex(id(a)))  # e.g. 0x7fc94388bab0 (addresses differ between runs)
print(hex(id(b)))  # e.g. 0x7fc94388b870
print(hex(id(c)))  # e.g. 0x7fc94388bab0, the same address as a
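The sharing shown above can also be checked with the is operator instead of comparing addresses by eye. The following is a minimal sketch that assumes CPython; the variable names are made up for illustration, and the values are built at run time so that the compiler cannot simply merge equal constants.

```python
import sys

# Build the values at run time so the compiler cannot merge equal constants.
small_a = int("100")
small_b = int("100")
small_same = small_a is small_b   # CPython caches small integers, so both names share one object

part = "ha"
s1 = "".join([part, part])        # "haha", built at run time
s2 = "".join([part, part])
equal_value = s1 == s2            # the two strings have the same contents

# sys.intern returns the single shared copy of a string with this value.
interned_same = sys.intern(s1) is sys.intern(s2)

print(small_same, equal_value, interned_same)
```

Whether two equal strings created at run time share one object without an explicit sys.intern call is an implementation detail, so code should compare values with == and not rely on identity.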
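Python's garbage collection strategies: mark-and-sweep and generational collection
Because Python is built from objects and references, a variable can be assigned directly without declaring its type or reserving space in advance. Assignment simply makes a reference point to an existing object, much like sticking a label on an item. For example: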
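ref1 = [1, 2, 3, 4]
ref2 = [3, 4, 5]
ref3 = ref1                             # ref3 is a second label on the same list
print(ref3)                             # [1, 2, 3, 4]
print(hex(id(ref3)) == hex(id(ref1)))   # True
ref3[1] = 4                             # mutate the shared list through ref3
print(ref3)                             # [1, 4, 3, 4]
print(ref1)                             # [1, 4, 3, 4]
ref3 = 6                                # rebind ref3; the list itself is untouched
print(ref3)                             # 6
print(ref1)                             # [1, 4, 3, 4]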
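As shown, plain assignment only sticks one more label on an object that already exists. Because both names point to the same object, a modification made through one name is visible through the other.
Operations that increase an object's reference count:
- The object is created: a = 1
- The object is referenced by another name: b = a
- The object is passed as an argument: fun(a)
- The object is stored as a container element: c = [a,a,a]
Operations that decrease an object's reference count:
- A name bound to the object is explicitly deleted: del a
- A reference is rebound to a new object: if a = b is followed by a = c, the object that b refers to loses one reference
- The container holding the object is destroyed
- A local variable goes out of scope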
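Some of the operations listed above can be observed with sys.getrefcount. This is a small sketch, assuming CPython; the Payload class and the variable names are hypothetical. The absolute numbers depend on the interpreter (the call itself may hold a temporary reference), so only the differences from the baseline are meaningful.

```python
import sys

class Payload:
    """Hypothetical class, used only so the object is not a cached constant."""

obj = Payload()
base = sys.getrefcount(obj)             # baseline; may include a temporary reference held by the call

alias = obj                             # one more name bound to the object
after_alias = sys.getrefcount(obj)

container = [obj, obj, obj]             # three more references held by the list
after_container = sys.getrefcount(obj)

del container                           # destroying the list releases its three references
del alias                               # deleting the name releases one more
after_del = sys.getrefcount(obj)

print(after_alias - base, after_container - base, after_del - base)
```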
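Under this model, when should an abandoned object be reclaimed? The most direct idea is to reclaim an object as soon as no label is attached to it, that is, when its reference count drops to zero. This strategy is called reference counting. It runs into trouble in one special case, the reference cycle, where waiting for the count to reach zero never works. Suppose we create two objects A and B with A.next = B and B.pre = A, which Python allows because attributes can be added to class instances at runtime. A and B then each hold at least one reference from the other, even though neither may have any external reference, meaning neither is in use any more. Reference counting alone can never destroy these two objects. Conceptually, this is the case mark-and-sweep covers: the collector marks everything that is still reachable from outside and sweeps away what remains unmarked, which includes such isolated cycles.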
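The A/B cycle described above can be reproduced in a few lines. This is a sketch under stated assumptions: the Node class is hypothetical, a weak reference is used as a probe to see whether an object is still alive, and automatic collection is switched off for the duration so that only the explicit gc.collect() call can reclaim the cycle.

```python
import gc
import weakref

class Node:
    """Hypothetical node type used only for this illustration."""

gc.disable()                   # keep automatic collection out of the experiment
gc.collect()                   # start from a clean state

a = Node()
b = Node()
a.next = b                     # a refers to b
b.pre = a                      # b refers back to a: a reference cycle

probe = weakref.ref(a)         # a weak reference does not keep a alive

del a, b                       # no external references remain
alive_after_del = probe() is not None       # the cycle keeps both objects alive

unreachable = gc.collect()     # the cycle collector finds and frees them
alive_after_collect = probe() is not None

gc.enable()                    # restore automatic collection
print(alive_after_del, alive_after_collect, unreachable)
```

The value returned by gc.collect() is the number of unreachable objects found; it varies between Python versions, so the sketch does not rely on an exact figure.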
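Checking for reclaimable objects after every single operation would be too expensive, so the check must not run too often, yet garbage still has to be reclaimed promptly. Python uses generational collection for this. The underlying observation is that objects that have existed only briefly tend to die soon, while old objects rarely die; this is called the weak generational hypothesis. It is intuitive: long-lived objects are often global variables, while short-lived ones are mostly local or temporary variables. New objects form generation 0. Whenever allocations exceed deallocations by a certain threshold, generation 0 is checked and cleaned, and the objects that survive move into generation 1. After generation 0 has been collected a certain number of times, generation 1 is collected as well, and its survivors move into generation 2, which is handled the same way. This is how generational collection works.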
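The trigger logic described above can be sketched as a toy model. This is not CPython's implementation: the class name GenerationTriggerModel is made up, deallocations are ignored, and nothing is actually freed. It only shows how three counters and three thresholds decide which generation gets collected.

```python
class GenerationTriggerModel:
    """Toy model of the generational trigger logic; it does not free anything."""

    def __init__(self, thresholds=(700, 10, 10)):
        self.thresholds = thresholds
        self.counts = [0, 0, 0]   # one counter per generation
        self.log = []             # which generation each collection targeted

    def allocate(self):
        # Each allocation bumps the youngest counter (deallocations ignored here).
        self.counts[0] += 1
        if self.counts[0] > self.thresholds[0]:
            self._collect()

    def _collect(self):
        # Pick the oldest generation whose counter exceeds its threshold.
        gen = 0
        for i in (2, 1):
            if self.counts[i] > self.thresholds[i]:
                gen = i
                break
        self.log.append(gen)
        # Collecting a generation also collects all younger ones.
        for i in range(gen + 1):
            self.counts[i] = 0
        # Survivors move up, so the next older generation's counter grows.
        if gen < 2:
            self.counts[gen + 1] += 1


model = GenerationTriggerModel(thresholds=(3, 2, 2))  # tiny thresholds for the demo
for _ in range(16):
    model.allocate()
print(model.log)      # [0, 0, 0, 1]
print(model.counts)   # [0, 0, 1]
```

With these small thresholds, every fourth allocation triggers a collection of generation 0, and the fourth such collection is promoted to a generation 1 collection.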
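A brief description of the gc module and its main methods
The gc module uses the generational strategy to reclaim the reference cycles that reference counting cannot free. Its common methods are described below: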
gc.enable()
Enable automatic garbage collection.
gc.disable()
Disable automatic garbage collection.
gc.isenabled()
Returns true if automatic collection is enabled.
gc.collect(generation=2)
With no arguments, run a full collection. The optional argument generation may be an integer specifying which generation to collect (from 0 to 2). A ValueError is raised if the generation number is invalid. The number of unreachable objects found is returned.
The free lists maintained for a number of built-in types are cleared whenever a full collection or collection of the highest generation (2) is run. Not all items in some free lists may be freed due to the particular implementation, in particular float.
gc.set_debug(flags)
Set the garbage collection debugging flags. Debugging information will be written to sys.stderr. See below for a list of debugging flags which can be combined using bit operations to control debugging.
gc.get_debug()
Return the debugging flags currently set.
gc.get_objects()
Returns a list of all objects tracked by the collector, excluding the list returned.
gc.get_stats()
Return a list of three per-generation dictionaries containing collection statistics since interpreter start. The number of keys may change in the future, but currently each dictionary will contain the following items:
collections is the number of times this generation was collected;
collected is the total number of objects collected inside this generation;
uncollectable is the total number of objects which were found to be uncollectable (and were therefore moved to the garbage list) inside this generation.
New in version 3.4.
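As a quick illustration of the statistics described above, the following sketch reads the three per-generation dictionaries and prints the documented keys. The numbers depend entirely on what the interpreter has done so far, so no particular output is implied.

```python
import gc

stats = gc.get_stats()   # one dictionary per generation, youngest first

for generation, entry in enumerate(stats):
    # Only the three documented keys are read; others may exist in some versions.
    print(
        "generation", generation,
        "collections:", entry["collections"],
        "collected:", entry["collected"],
        "uncollectable:", entry["uncollectable"],
    )
```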
gc.set_threshold(threshold0[, threshold1[, threshold2]])
Set the garbage collection thresholds (the collection frequency). Setting threshold0 to zero disables collection.
The GC classifies objects into three generations depending on how many collection sweeps they have survived. New objects are placed in the youngest generation (generation 0). If an object survives a collection it is moved into the next older generation. Since generation 2 is the oldest generation, objects in that generation remain there after a collection. In order to decide when to run, the collector keeps track of the number of object allocations and deallocations since the last collection. When the number of allocations minus the number of deallocations exceeds threshold0, collection starts. Initially only generation 0 is examined. If generation 0 has been examined more than threshold1 times since generation 1 has been examined, then generation 1 is examined as well. Similarly, threshold2 controls the number of collections of generation 1 before collecting generation 2.
gc.get_count()
Return the current collection counts as a tuple of (count0, count1, count2).
gc.get_threshold()
Return the current collection thresholds as a tuple of (threshold0, threshold1, threshold2).
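The threshold and count functions above are typically used together when tuning collection frequency. The sketch below changes the thresholds temporarily and then restores them; the values 1000 and 15 are arbitrary choices for illustration, and the defaults reported by gc.get_threshold() differ between Python versions.

```python
import gc

original = gc.get_threshold()        # tuple of (threshold0, threshold1, threshold2)
counts_before = gc.get_count()       # tuple of (count0, count1, count2)

gc.set_threshold(1000, 15)           # collect generation 0 less often; threshold2 left unchanged
changed = gc.get_threshold()

gc.set_threshold(*original)          # restore the previous settings
restored = gc.get_threshold()

print(original, changed, restored, counts_before)
```

Raising threshold0 trades memory for fewer collection pauses, which can help allocation-heavy programs; whether it actually helps should be measured, for example with gc.get_stats().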
The flags below can be passed to gc.set_debug():
gc.DEBUG_STATS
Print statistics during collection. This information can be useful when tuning the collection frequency.
gc.DEBUG_COLLECTABLE
Print information on collectable objects found.
gc.DEBUG_UNCOLLECTABLE
Print information of uncollectable objects found (objects which are not reachable but cannot be freed by the collector). These objects will be added to the garbage list.
Changed in version 3.2: Also print the contents of the garbage list at interpreter shutdown, if it isn’t empty.
gc.DEBUG_SAVEALL
When set, all unreachable objects found will be appended to garbage rather than being freed. This can be useful for debugging a leaking program.
gc.DEBUG_LEAK
The debugging flags necessary for the collector to print information about a leaking program (equal to DEBUG_COLLECTABLE | DEBUG_UNCOLLECTABLE | DEBUG_SAVEALL).
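To show how these flags are used in practice, here is a small sketch built around gc.DEBUG_SAVEALL. The Leaky class is hypothetical. With the flag set, the unreachable object found by the collector is kept in gc.garbage, where it can be inspected, and is not freed.

```python
import gc

class Leaky:
    """Hypothetical class that refers to itself, forming a one-object cycle."""

gc.collect()                          # clear out any garbage that already exists
old_flags = gc.get_debug()
gc.set_debug(gc.DEBUG_SAVEALL)        # keep unreachable objects instead of freeing them

item = Leaky()
item.self_ref = item                  # the object now references itself
del item                              # unreachable, but its reference count is not zero

gc.collect()
saved = [o for o in gc.garbage if isinstance(o, Leaky)]

gc.set_debug(old_flags)               # restore the previous debugging flags
gc.garbage.clear()                    # drop the collector's references to the saved objects

# In Python 3, DEBUG_LEAK is the combination of the three flags listed above.
leak_flags = gc.DEBUG_COLLECTABLE | gc.DEBUG_UNCOLLECTABLE | gc.DEBUG_SAVEALL
print(len(saved), gc.DEBUG_LEAK == leak_flags)
```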
Garbage collection by gc starts when the program exits, when a threshold is reached, and when gc.collect() is called explicitly.
2018-04-13 00:11:42