[Repost] Garbage collection in Python: things you need to know
Link to original article: https://rushter.com/blog/python-garbage-collector/
This article describes garbage collection (GC) in Python 3.6.
Usually, you do not need to worry about memory management. When objects are no longer needed, Python automatically reclaims memory from them. However, understanding how GC works can help you write better Python programs.
Memory management
Unlike many other languages, Python does not necessarily release memory back to the operating system. Instead, it has a dedicated object allocator for small objects (smaller than or equal to 512 bytes), which keeps some chunks of already allocated memory for future use. The amount of memory that Python holds depends on the usage patterns. In some cases, allocated memory is never released while the process is running.
Therefore, if a long-running Python process takes more memory over time, it does not necessarily mean that you have memory leaks. If you are interested in Python's memory model, consider reading my article on memory management.
Garbage collection algorithms
Standard CPython's garbage collector has two components: the reference counting collector and the generational garbage collector, known as the gc module.
The reference counting algorithm is incredibly efficient and straightforward, but it cannot detect reference cycles. That is why Python has a supplemental algorithm called generational cyclic GC, which specifically deals with reference cycles.
Reference counting is fundamental to Python and can't be disabled, whereas the cyclic GC is optional and can be triggered manually.
Reference counting
Reference counting is a simple technique in which objects are deallocated when there is no reference to them in a program.
Every variable in Python is a reference (a pointer) to an object and not the actual value itself. For example, an assignment statement just adds a new reference to the object on the right-hand side.
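A minimal sketch of this point (the variable names are chosen only for illustration): assigning one name to another copies the reference, not the list, so a change made through one name is visible through the other.

```python
# Two names bound to the same list object; no copy is made on assignment.
a = [1, 2]
b = a            # b now refers to the same object as a

b.append(3)      # mutating through b is visible through a

same_object = a is b            # identity check: both names point to one object
same_address = id(a) == id(b)   # in CPython, id() is the object's memory address

print(a, same_object, same_address)
```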
To keep track of references, every object (even an integer) has an extra field called the reference count, which is increased or decreased when a pointer to the object is copied or deleted. See the Objects, Types and Reference Counts section for a detailed explanation.
Examples where the reference count increases:
- assignment operator
- argument passing
- appending an object to a list (object's reference count will be increased).
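The three cases above can be observed with sys.getrefcount. This is only a sketch: the helper refs_seen_inside is a made-up name, and the absolute numbers depend on the CPython version, so it is safer to compare values before and after an operation than to rely on a specific count.

```python
import sys

def refs_seen_inside(arg):
    # Hypothetical helper: reports the count while `arg` is bound as a parameter.
    return sys.getrefcount(arg)

obj = []
baseline = sys.getrefcount(obj)

alias = obj                            # assignment adds one reference
after_assignment = sys.getrefcount(obj)

container = []
container.append(obj)                  # the list stores one more reference
after_append = sys.getrefcount(obj)

inside_call = refs_seen_inside(obj)    # exact value depends on the CPython version

print(baseline, after_assignment, after_append, inside_call)
```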
If the reference count reaches zero, CPython automatically calls the object-specific deallocation function. If an object contains references to other objects, then their reference counts are decremented too. Thus other objects may be deallocated in turn. For example, when a list is deleted, the reference count of each of its items is decreased.
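The cascade can be watched with weak references, which observe an object without keeping it alive. The sketch below assumes CPython, where deallocation happens as soon as the count reaches zero; the Item class is a made-up placeholder.

```python
import weakref

class Item:
    """Hypothetical class, used only so weak references can observe its lifetime."""

items = [Item(), Item()]
# A weak reference returns None once its target has been deallocated.
watchers = [weakref.ref(item) for item in items]

alive_before = [w() is not None for w in watchers]

del items   # the list's count reaches zero, so each item's count is decremented too

alive_after = [w() is not None for w in watchers]

print(alive_before, alive_after)
```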
Variables declared outside of functions, classes, and blocks are called globals. Usually, such variables live until the end of the Python process. Thus, the reference count of objects referred to by global variables normally does not drop to 0 while the program runs.
Variables defined inside a block (e.g., in a function or class) have a local scope (i.e., they are local to that block). When the Python interpreter exits the block, it destroys all references created inside it.
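A small sketch of the local-scope case, assuming CPython's immediate deallocation. Resource and make_and_drop are hypothetical names; the function returns only a weak reference, so nothing keeps the object alive once the function's locals are destroyed.

```python
import weakref

class Resource:
    """Hypothetical class; plain object() instances cannot be weakly referenced."""

def make_and_drop():
    local = Resource()           # referenced only by the local name
    return weakref.ref(local)    # a weak reference does not keep it alive

watcher = make_and_drop()

# The local reference was destroyed when the function returned.
freed_after_return = watcher() is None

print(freed_after_return)
```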
You can always check the current number of references using the sys.getrefcount function.
Here is a simple example:
    import sys

    foo = []

    # 2 references, 1 from the foo var and 1 from getrefcount
    print(sys.getrefcount(foo))

    def bar(a):
        # 4 references
        # from the foo var, function argument, getrefcount and Python's function stack
        print(sys.getrefcount(a))

    bar(foo)
    # 2 references, the function scope is destroyed
    print(sys.getrefcount(foo))
The main reason why CPython uses reference counting is historical. There is a lot of debate nowadays about the weaknesses of this technique. Some people claim that modern garbage collection algorithms can be more efficient without reference counting at all. The reference counting algorithm has a number of issues, such as circular references, thread locking, and memory and performance overhead.
The main advantage of this approach is that objects can be destroyed immediately after they are no longer needed.
Generational garbage collector
Why do we need an additional garbage collector when we have reference counting?
Unfortunately, classical reference counting has a fundamental problem: it cannot detect reference cycles. A reference cycle occurs when one or more objects reference each other.
Here are two examples: the lst object points to itself, and object_1 and object_2 point to each other. The reference count of such objects is always at least 1.
To get a better idea you can play with a simple Python example:
    import ctypes
    import gc

    # We are using ctypes to access our unreachable objects by memory address.
    class PyObject(ctypes.Structure):
        # The reference count is a Py_ssize_t, which c_ssize_t matches on every platform
        _fields_ = [("refcnt", ctypes.c_ssize_t)]


    gc.disable()  # Disable generational gc

    lst = []
    lst.append(lst)

    # Store address of the list
    lst_address = id(lst)

    # Destroy the lst reference
    del lst

    object_1 = {}
    object_2 = {}
    object_1['obj2'] = object_2
    object_2['obj1'] = object_1

    obj_address = id(object_1)

    # Destroy references
    del object_1, object_2

    # Uncomment if you want to manually run garbage collection process
    # gc.collect()

    # Check the reference count
    print(PyObject.from_address(obj_address).refcnt)  # output is 1
    print(PyObject.from_address(lst_address).refcnt)  # output is 1
In the example above, the del statement removes the references to our objects (i.e., decreases the reference count by 1). After Python executes the del statement, our objects are no longer accessible from Python code. However, they are still sitting in memory, because they still reference each other and the reference count of each object is 1. You can visually explore such relations using the objgraph module.
To resolve this issue, an additional cycle-detecting algorithm was introduced in Python 2.0. The gc module is responsible for this and exists only to deal with this problem.
Reference cycles can only occur in container objects (i.e., in objects which can contain other objects), such as lists, dictionaries, classes, and tuples. The GC does not track immutable types, with the exception of tuples. Tuples and dictionaries containing only immutable objects can also be untracked, depending on certain conditions. Thus, the reference counting technique handles all non-circular references.
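Whether the GC currently tracks an object can be checked with gc.is_tracked. The sketch below separates the cases that are stable from those that depend on the conditions mentioned above, which have changed between CPython versions, so no expected values are given for the second group.

```python
import gc

# Atomic values are never tracked; containers that can hold containers are.
tracked = {
    "int": gc.is_tracked(42),
    "str": gc.is_tracked("text"),
    "list": gc.is_tracked([]),
    "dict holding a list": gc.is_tracked({"key": []}),
}

# Results for these vary between CPython versions.
version_dependent = {
    "tuple of ints": gc.is_tracked((1, 2)),
    "dict of ints": gc.is_tracked({"key": 1}),
}

print(tracked)
print(version_dependent)
```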
When does the generational GC trigger
Unlike reference counting, the cyclic GC does not work in real time; it runs periodically. To reduce the frequency of GC calls and pauses, CPython uses various heuristics.
The GC classifies container objects into three generations. Every new object starts in the first generation. If an object survives a garbage collection round, it moves to the older (higher) generation. Lower generations are collected more often than higher ones. Because most newly created objects die young, this improves GC performance and reduces the GC pause time.
In order to decide when to run, each generation has an individual counter and threshold. The counter stores the number of object allocations minus deallocations since the last collection. Every time you allocate a new container object, CPython checks whether the counter of the first generation exceeds the threshold value. If so, Python initiates the collection process.
If two or more generations currently exceed their thresholds, the GC chooses the oldest one. That is because collecting an older generation also collects all previous (younger) generations. To reduce performance degradation for long-living objects, the third generation has additional requirements in order to be chosen.
The standard threshold values are set to (700, 10, 10) respectively, but you can always check them using the gc.get_threshold function.
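A short sketch of inspecting and adjusting these settings with the gc module. The value 1000 is an arbitrary choice for illustration, and the defaults printed on your interpreter may differ from the ones quoted above, since they are version dependent.

```python
import gc

original = gc.get_threshold()   # tuple of three thresholds
counts = gc.get_count()         # current per-generation counters

print("thresholds:", original)
print("counters:", counts)

# Raise the first threshold so that generation 0 collections run less often.
gc.set_threshold(1000, original[1], original[2])
raised = gc.get_threshold()

# Restore the previous settings so the rest of the program is unaffected.
gc.set_threshold(*original)
restored = gc.get_threshold()
```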
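Python uses a separate linked list to keep track of live objects. Rather than calling it the "active list", Python's internal C code calls it Generation Zero. Every time you create an object or any other value, Python adds it to the Generation Zero linked list. Note that this is not a real list and cannot be accessed directly from your code; the linked list is entirely internal to the Python runtime.
The Python interpreter keeps track of the number of newly created objects and the number of objects released because their reference count dropped to zero. In theory, these two numbers should stay equal, because every object a program creates should eventually be released.
By default, gc.get_threshold() returns (700, 10, 10).
Of course, this is not what happens in practice. Because of reference cycles, and because your program uses some objects that live longer than others, the difference between the number of allocated objects and the number of released objects gradually grows. Once this difference exceeds a threshold (700 by default), Python's collector kicks in: it runs the Generation Zero algorithm described above, frees the "floating garbage", and moves the surviving objects to the Generation One list.
This follows the weak generational hypothesis: young objects are more likely to die, while old objects usually live longer.
A new object is placed in generation 0. If it survives a collection of generation 0, it is moved to generation 1 (it gets promoted). If an object in generation 1 survives a collection of generation 1, it is moved to generation 2.
After generation 0 has been collected 10 times, generation 1 is collected once; after generation 1 has been collected 10 times, generation 2 is collected once, which scans the objects of generations 0, 1 and 2 together. The two 10s here are the second and third threshold values returned above.
In this way, the objects your code uses for a long time, the live objects your code keeps accessing, move from the Generation Zero list to Generation One and then to Generation Two. With different threshold settings, Python can process these objects at different intervals. Python processes Generation Zero most frequently, then Generation One, and Generation Two least often.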
How to find reference cycles
It is hard to explain the reference cycle detection algorithm in a few paragraphs. Basically, the GC iterates over each container object and, working on a copy of the reference counts, temporarily removes all references to the container objects it references. After a full iteration, all objects whose remaining count is zero, and which are not referenced by an object with a non-zero count, are unreachable from Python code and thus can be collected.
To fully understand the cycle-finding algorithm, I recommend reading the original proposal from Neil Schemenauer and the collect function in CPython's source code. The Quora answers and The Garbage Collector blog post can also be helpful.
Note that the problem with finalizers described in the original proposal was fixed in Python 3.4. You can read about it in PEP 442.
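Python makes a copy of each object's reference count. Take two objects a and b that reference each other, and let gc_ref_a denote the copied count of a and gc_ref_b the copied count of b. Python then traverses all the container objects, which here are only a and b. When it visits a, a points to b, so gc_ref_b is decremented by 1; likewise, when it visits b, gc_ref_a is decremented by 1. Both values end up at 0. Finally, objects whose value is not 0 are kept, and objects whose value is 0 are garbage collected.

    # Scenario A
    a = [1, 3]
    b = [2, 4]
    a.append(b)
    b.append(a)
    del a
    del b

    # Scenario B
    a = [1, 3]
    b = [2, 4]
    a.append(b)
    b.append(a)
    del a

In the mark-and-sweep algorithm there are two camps: the root list (root objects) and the unreachable list.
- In Scenario A, before the del statements run, a and b each have a reference count of 2 (creation + append = 2). After the del statements, each count drops by 1, and a and b are left stuck in a reference cycle. This is where the mark-and-sweep algorithm steps in. It picks one end, a, and starts taking the cycle apart: starting from a, which holds a reference to b, it decrements b's count by 1; it then follows that reference to b, which holds a reference to a, and decrements a's count by 1 as well. That removes the cycle between the two objects. Both counts are now 0, so a and b are moved to the unreachable list and freed.
- In Scenario B, once the cycle is removed, b's count is still 1, but a's count is 0. At this point a is already on the unreachable list and has been sentenced to death, yet b is still on the root list. Freeing a now would be a miscarriage of justice, because b would be left holding a reference to a dead object. So the objects on the root list are checked for the references they hold: b refers to a, so on appeal a is found not guilty and is moved back to the root list.
The reason for splitting the objects into two lists is the following: the unreachable list may currently contain objects that are referenced, directly or indirectly, by objects on the root list, and such objects must not be collected. Whenever such an object is found during marking, it is moved from the unreachable list to the root list. Once marking is complete, everything left on the unreachable list is genuine garbage, and the collection that follows only needs to deal with the unreachable list.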
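The two scenarios can be replayed with a toy model in plain Python. This is only a sketch of the idea described above, not CPython's implementation: objects are represented by names, find_unreachable is a made-up function, and the reference counts are supplied by hand.

```python
def find_unreachable(refcounts, references):
    """Toy model of the cycle-detection pass.

    refcounts:  name -> real reference count of each tracked container
    references: name -> names of the tracked containers it refers to
    """
    # Step 1: work on a copy so the real reference counts stay untouched.
    gc_refs = dict(refcounts)

    # Step 2: subtract every reference that comes from another tracked container.
    for targets in references.values():
        for target in targets:
            gc_refs[target] -= 1

    # Step 3: a positive count means a reference from outside: the root list.
    roots = [name for name, count in gc_refs.items() if count > 0]

    # Step 4: everything reachable from a root is rescued from the unreachable list.
    reachable = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        stack.extend(references[name])

    return sorted(set(refcounts) - reachable)


# Both lists hold a reference to each other.
references = {"a": ["b"], "b": ["a"]}

# Scenario A: both names deleted, each list is referenced only by the other one.
scenario_a = find_unreachable({"a": 1, "b": 1}, references)

# Scenario B: only the name a deleted, list b is still referenced by the name b.
scenario_b = find_unreachable({"a": 1, "b": 2}, references)

print(scenario_a)
print(scenario_b)
```

In Scenario A both objects end up on the unreachable list; in Scenario B the root b rescues a, so nothing is collected.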
Performance tips
Cycles can easily happen in real life. Typically you encounter them in graphs, linked lists, or other structures in which you need to keep track of relations between objects. If your program has an intensive workload and requires low latency, you should avoid reference cycles as much as possible.
To avoid circular references in your code, you can use weak references, which are implemented in the weakref module. Unlike a usual reference, a weakref.ref doesn't increase the reference count, and calling it returns None if the object has been destroyed.
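A minimal sketch of this pattern, assuming a tree in which children point back to their parent. The Node class is hypothetical: the parent holds strong references to its children, while each child holds only a weak reference to the parent, so no cycle is formed.

```python
import gc
import weakref

class Node:
    """Hypothetical tree node: strong references down, weak reference up."""

    def __init__(self, parent=None):
        self.children = []
        # weakref.ref does not increase the parent's reference count.
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children.append(self)

    @property
    def parent(self):
        # Calling the weak reference returns the parent, or None once it is gone.
        return self._parent() if self._parent is not None else None


was_enabled = gc.isenabled()
gc.disable()                 # rule out any help from the cyclic collector
try:
    root = Node()
    child = Node(root)
    parent_alive = child.parent is root
    del root                 # no cycle exists, so reference counting frees the parent
    parent_gone = child.parent is None
finally:
    if was_enabled:
        gc.enable()

print(parent_alive, parent_gone)
```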
In some cases, it is useful to disable the GC and run it manually. Automatic collection can be disabled by calling gc.disable(). To run the collection process manually, use gc.collect().
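One way this might look in practice: disable automatic collection around a latency-sensitive section and collect at a moment of your choosing. The loop below only simulates a workload that creates cycles; the count of 1000 is arbitrary.

```python
import gc

was_enabled = gc.isenabled()
gc.disable()                     # stop automatic collections
try:
    gc.collect()                 # start from a clean state
    for _ in range(1000):
        cycle = []
        cycle.append(cycle)      # each list references itself
    del cycle
    # With automatic collection off, the cycles pile up until we collect by hand.
    found = gc.collect()         # returns the number of unreachable objects found
finally:
    if was_enabled:
        gc.enable()              # restore the previous state

print(found)
```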
How to find and debug reference cycles
Debugging reference cycles can be very frustrating, especially when you use a lot of third-party libraries.
The standard gc module provides a lot of useful helpers for debugging. If you set the debugging flags to DEBUG_SAVEALL, all unreachable objects found will be appended to the gc.garbage list.
    import gc

    gc.set_debug(gc.DEBUG_SAVEALL)

    print(gc.get_count())
    lst = []
    lst.append(lst)
    list_id = id(lst)
    del lst
    gc.collect()
    for item in gc.garbage:
        print(item)
        assert list_id == id(item)
Once you have identified a problematic spot in your code, you can visually explore the relations between objects using objgraph.
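objgraph is a third-party package. When it is not installed, the standard gc module can give a rough, textual view of the same relations through gc.get_referrers and gc.get_referents. The sketch below uses made-up objects (target and holder) to show both directions.

```python
import gc

target = []
holder = {"payload": target}     # a container that keeps target alive

# gc.get_referrers lists tracked containers that directly refer to the object.
referrers = gc.get_referrers(target)
held_by_holder = any(ref is holder for ref in referrers)

# gc.get_referents goes the other way: objects that holder directly refers to.
referents = gc.get_referents(holder)
holder_points_to_target = any(obj is target for obj in referents)

print(held_by_holder, holder_points_to_target)
```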
Conclusion
Most garbage collection is done by the reference counting algorithm, which we cannot tune at all. So, be aware of implementation specifics, but don't worry about potential GC problems prematurely.
Hopefully, you have learned something new. If you have any questions left, I will be glad to answer them.
The quoted parts refer to the following link:
https://www.jianshu.com/p/1e375fb40506