Python 3: Dealing with High buffer/cache Usage Caused by os.walk()

I. Background

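os.walk() is probably the most commonly recommended function for traversing directories in Python today. I had written a Python script that collects the third-party components used on a system. During testing it used os.walk() on a subset of directories, and it passed testing across our whole fleet. After I changed it to traverse the root directory, however, the business side reported that the script was using so much memory that it triggered memory alerts.

Intuitively, traversing directories without opening any files should amount to loading a directory tree, which could not account for a rise of tens of GB of memory. But the rise coincided with the script's run time, and memory dropped again after the script was killed, so the rise was almost certainly related to the script.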

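After repeated testing and observation, I arrived at the following two observations:

With Python's os.walk and the system tree command, buffer/cache usage rises noticeably as soon as there are many files.
With the find command, buffer/cache usage also rises noticeably when /proc contains many files.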
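
The two observations above can be reproduced by sampling /proc/meminfo before and after a traversal. The following is a minimal sketch, not the measurement method used in this post: the helper names parse_meminfo, cache_snapshot and cache_delta are hypothetical, and it assumes that the buff/cache column of free corresponds roughly to Buffers + Cached + SReclaimable.

```python
def parse_meminfo(text):
    """Return {field: value} for the 'Name:   value [kB]' lines of /proc/meminfo."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def cache_snapshot(path="/proc/meminfo"):
    """Read the fields assumed here to make up buff/cache (values in kB)."""
    with open(path) as fh:
        info = parse_meminfo(fh.read())
    return {key: info.get(key, 0) for key in ("Buffers", "Cached", "SReclaimable")}


def cache_delta(before, after):
    """Per-field difference between two snapshots (after minus before)."""
    return {key: after[key] - before[key] for key in before}
```

To use it, take before = cache_snapshot(), run the traversal under test (for example, exhaust os.walk on some directory), take after = cache_snapshot(), and print cache_delta(before, after). Other activity on the machine also changes these counters, so the numbers are only indicative.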

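I reported these findings to one of our senior engineers. After analyzing them, he gave the underlying cause of both observations:

When traversing directories, Python's os.walk and the system tree command load each file's stat information in addition to the directory tree, so a large number of files takes up a lot of buffer/cache.
When traversing other directories, find loads only the directory tree and not the stat information, so it does not noticeably consume buffer/cache. When traversing /proc, however, it does load stat information, so a large number of files under /proc also makes buffer/cache rise. Strictly speaking, only a find that matches purely on file name avoids stat; a find that filters on conditions such as date still has to load stat information.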

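In addition, buffer/cache growth caused by directory traversal can be cleared with the following command (run as root; writing 2 frees reclaimable dentries and inodes):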

sync; echo 2 > /proc/sys/vm/drop_caches

Reference: https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/

II. Solution

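So the two factors that decide whether buffer/cache rises are now clear: the number of files, and whether file stat information is loaded.

Back to our original goal of collecting all third-party components: that requires traversing the whole disk, so the number of files cannot be limited, and the only option left is to avoid loading file stat information. At that point the only apparent way to avoid stat was the plain find command, but that is not a native Python approach and is fairly limiting. Later the same engineer read the find source code and, modeled on it, wrote a function that does not load stat information: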

import os
from ctypes import (CDLL, POINTER, Structure, c_char, c_char_p, c_int,
                    c_long, c_ubyte, c_ulong, c_ushort)
from ctypes.util import find_library


class c_dir(Structure):
    """Opaque type for a directory stream; corresponds to the C type DIR"""
    pass


c_dir_p = POINTER(c_dir)


class c_dirent(Structure):
    """Directory entry; corresponds to struct dirent"""
    # Layout assumed to match glibc's struct dirent on Linux; other C
    # libraries and platforms order or size these fields differently.
    _fields_ = (
        ('d_ino', c_ulong),  # inode number (ino_t)
        ('d_off', c_long),  # offset to the next dirent (off_t)
        ('d_reclen', c_ushort),  # length of this record
        ('d_type', c_ubyte),  # type of file; not supported by all file system types
        ('d_name', c_char * 256)  # NUL-terminated filename
    )


c_dirent_p = POINTER(c_dirent)
c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p
# readdir_r is deprecated in glibc; plain readdir is sufficient here because
# each directory stream is only ever read by the code that opened it
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p
closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

DT_FIFO = 1
DT_CHR = 2
DT_DIR = 4
DT_BLK = 6
DT_REG = 8
DT_LNK = 10
DT_SOCK = 12
DT_WHT = 14


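def listdir(path):
    """
    A generator yielding (name, d_type) for each entry of the directory
    passed in, without calling stat on any of them
    """
    # opendir takes a C string, so a str path must be encoded to bytes first
    dir_p = opendir(os.fsencode(path))
    if not dir_p:
        # NULL: the directory could not be opened (missing, not permitted, ...)
        return
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = p.contents.d_name
            # d_name is bytes, so compare against bytes literals
            if name not in (b".", b".."):
                yield os.fsdecode(name), p.contents.d_type
    finally:
        closedir(dir_p)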


def _traversal_path(cur, res_array, follow_link=False):
    """Append the path of every non-directory entry under cur to res_array"""
    if os.path.abspath(cur) in dir_white_list:
        return
    for cn, ct in listdir(cur):
        child = os.path.join(cur, cn)
        # d_type holds a single value, not a bit mask, so compare with ==
        # (a mask test such as ct & DT_DIR would treat DT_BLK as a directory)
        if ct == DT_DIR:
            is_dir = True
        elif ct == DT_LNK:
            # only a followed link costs a stat; link cycles are not detected
            is_dir = follow_link and os.path.isdir(child)
        elif ct == 0:
            # DT_UNKNOWN: the file system reported no type, so fall back to stat
            is_dir = os.path.isdir(child) and (
                follow_link or not os.path.islink(child))
        else:
            is_dir = False
        if is_dir:
            _traversal_path(child, res_array, follow_link)
        else:
            res_array.append(child)


def traversal_path(path, follow_link=False):
    """Return the paths of all non-directory entries under path"""
    files = []
    if not os.path.lexists(path):
        return files
    if not os.path.isdir(path):
        # a single file: nothing to traverse
        files.append(path)
        return files
    _traversal_path(path, files, follow_link)
    return files


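# Directories to skip (called a "white list" here)
# Since stat information is no longer loaded, even traversing /proc is not expected to raise buffer/cache
# But these are all system directories, and the /proc file system in particular is complicated, so we simply skip them to save time and trouble
dir_white_list = ["/proc", "/sys", "/dev", "/boot"]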

if __name__ == '__main__':
    for f in traversal_path("/"):
        print(f)
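
For comparison, here is a sketch that stays within the standard library. It assumes Python 3.6 or later, where os.scandir can be used as a context manager and returns DirEntry objects; according to the Python documentation, DirEntry.is_dir(follow_symlinks=False) usually needs no extra system call, because the entry type comes from the directory entry itself. Whether this keeps buffer/cache as flat as the ctypes version above was not measured in this post. The names iter_files and SKIP_DIRS are hypothetical.

```python
import os

# Same directories as dir_white_list above (hypothetical name for this sketch)
SKIP_DIRS = frozenset({"/proc", "/sys", "/dev", "/boot"})


def iter_files(root, skip_dirs=SKIP_DIRS):
    """Yield the path of every non-directory entry under root.

    Symbolic links are never followed; they are reported as plain entries.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        if os.path.abspath(current) in skip_dirs:
            continue
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # Answered from the directory entry's type when the file
                    # system provides it; a stat is only a fallback.
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    else:
                        yield entry.path
        except OSError:
            # Unreadable or vanished directory: skip what is left of it
            continue
```

Because it is a generator, for f in iter_files("/"): print(f) prints paths as they are found instead of building the whole list in memory first.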

Posted on 2020-06-23 19:48 by 诸子流