pandas - speeding up apply

apply is one of the most frequently used methods in pandas, but it is fairly slow. This post walks through a few ways to speed it up.

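Data preparation

import time
import numpy as np
import pandas as pd

# One million rows, five integer columns with values in 0..10
df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)),
                  columns=('a','b','c','d','e'))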

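Timing apply

if __name__ == '__main__':
    def func(a,b,c,d,e):
        # Row-wise rule: the branch taken depends on the value of e (0..10)
        if e == 10:
            return c*d
        elif (e < 10) and (e>=5):
            return c+d
        elif e < 5:
            return a+b

    # process_time() returns CPU time since the process started,
    # so the cost of a step is the difference between two readings
    start = time.process_time()
    df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
    print(time.process_time() - start)      # 19s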

Elapsed: about 19 s
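
When timing snippets like the one above, note that `time.process_time()` reports CPU time accumulated since the process started, so an interval is the difference between two readings. Below is a minimal sketch of a reusable timer, run on a smaller frame. The `timed` helper, the seed, and the 10,000-row size are assumptions made for illustration and are not part of the original benchmark.

```python
import time

import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    # Same three-branch rule as in the post (e is always in 0..10)
    if e == 10:
        return c * d
    elif 5 <= e < 10:
        return c + d
    else:
        return a + b


def timed(label, fn):
    """Hypothetical helper: run fn() once, return (result, elapsed CPU seconds)."""
    start = time.process_time()
    result = fn()
    elapsed = time.process_time() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(0, 11, size=(10_000, 5)), columns=list("abcde"))

new_col, elapsed = timed(
    "apply",
    lambda: small.apply(
        lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1
    ),
)
```

Wrapping each variant in the same helper keeps the measurements comparable. For wall-clock time rather than CPU time, `time.perf_counter()` can be swapped in.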

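Speeding up with swifter

if __name__ == '__main__':
    ### swifter speedup
    import swifter  # importing swifter registers the .swifter accessor
    start = time.process_time()
    df['new'] = df.swifter.apply(lambda x : func(x['a'],x['b'],x['c'],x['d'],x['e']),axis=1)
    print(time.process_time() - start)      # 7s    note: starting the threads takes about 7s; the actual execution takes only about 1s

Elapsed: about 7 s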

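Vectorization

The best way to work with pandas and numpy is vectorization, that is, operating on whole columns as vectors. It beats any method, function, or assorted acceleration trick.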

if __name__ == '__main__':
    ### vectorization
    # vector operations are faster than apply
    start = time.process_time()
    df['new'] = df['c'] * df['d']  # default case: e == 10
    mask = df['e'] < 10
    df.loc[mask, 'new'] = df['c'] + df['d']  # overwrite rows where e < 10
    mask = df['e'] < 5
    df.loc[mask, 'new'] = df['a'] + df['b']  # overwrite rows where e < 5
    print(time.process_time() - start)      # 1.2s

Elapsed: about 1.2 s
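
The same three-branch rule can also be written as a single expression. The sketch below uses `numpy.select`, which is not used in the original post and is shown here only as one possible alternative to the successive masked assignments. It runs on a small frame (the size and seed are assumptions) and compares the result against the row-wise `apply`.

```python
import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    # Reference implementation: same rule as the apply version
    if e == 10:
        return c * d
    elif 5 <= e < 10:
        return c + d
    else:
        return a + b


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 11, size=(1000, 5)), columns=list("abcde"))

e = df["e"].to_numpy()
# Conditions are checked in order; the first one that is True picks the value
conditions = [e == 10, e >= 5, e < 5]
choices = [
    (df["c"] * df["d"]).to_numpy(),
    (df["c"] + df["d"]).to_numpy(),
    (df["a"] + df["b"]).to_numpy(),
]
df["new"] = np.select(conditions, choices, default=0)  # default is never used here

# Cross-check against the slow row-wise version
expected = df.apply(
    lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1
)
matches = bool((df["new"] == expected).all())
print(matches)
```

Listing every branch explicitly keeps the conditions next to their values, which may be easier to audit than a chain of overwrites where the order of the masks matters.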

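The real point of this post is not the apply method itself. Just remember one thing: treating pandas and numpy data as vectors is the fastest approach.

The reference also describes other, even faster methods, but I could not get them to work in my experiments, so I left them out. Feel free to try them.

Full code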

import time
import pandas as pd
import numpy as np


if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)),
                      columns=('a','b','c','d','e'))

    def func(a,b,c,d,e):
        if e == 10:
            return c*d
        elif (e < 10) and (e>=5):
            return c+d
        elif e < 5:
            return a+b

    # time.clock() was removed in Python 3.8; use time.process_time() instead
    start = time.process_time()
    df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
    print(time.process_time() - start)      # 19s

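    ### swifter speedup
    import swifter  # importing swifter registers the .swifter accessor
    start = time.process_time()
    df['new'] = df.swifter.apply(lambda x : func(x['a'],x['b'],x['c'],x['d'],x['e']),axis=1)
    print(time.process_time() - start)      # 7s    note: starting the threads takes about 7s; the actual execution takes only about 1s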

    ### vectorization
    # vector operations are faster than apply
    start = time.process_time()
    df['new'] = df['c'] * df['d']  # default case: e == 10
    mask = df['e'] < 10
    df.loc[mask, 'new'] = df['c'] + df['d']
    mask = df['e'] < 5
    df.loc[mask, 'new'] = df['a'] + df['b']
    print(time.process_time() - start)      # 1.2s

    ### type conversion + vectorization
    df = df.astype(np.int16)  # converted before timing starts
    start = time.process_time()
    df['new'] = df['c'] * df['d']  # default case: e == 10
    mask = df['e'] < 10
    df.loc[mask, 'new'] = df['c'] + df['d']
    mask = df['e'] < 5
    df.loc[mask, 'new'] = df['a'] + df['b']
    print(time.process_time() - start)      # 1.3s

    ### values
    start = time.process_time()
    df = df.astype(np.int16)  # here the conversion is included in the timing
    df['new'] = df['c'].values * df['d'].values  # default case: e == 10
    mask = df['e'].values < 10
    df.loc[mask, 'new'] = df['c'] + df['d']
    mask = df['e'].values < 5
    df.loc[mask, 'new'] = df['a'] + df['b']
    print(time.process_time() - start)          # 1s
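
The last two variants in the listing combine a smaller dtype with raw NumPy arrays. The sketch below, on a small frame, illustrates what each step does: `astype(np.int16)` shrinks the memory held per column, and `.to_numpy()` (the currently recommended counterpart of `.values`) lets nested `np.where` calls run without any index alignment. The frame size, the seed, and the use of `np.where` are assumptions made for illustration; no timings are claimed for it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 11, size=(1000, 5), dtype=np.int64), columns=list("abcde")
)

# Memory held by the data columns before and after the conversion
bytes_int64 = int(df.memory_usage(index=False).sum())
df16 = df.astype(np.int16)  # values 0..10 and products up to 100 fit in int16
bytes_int16 = int(df16.memory_usage(index=False).sum())
print(bytes_int64, bytes_int16)

# Pull the columns out as plain NumPy arrays and stay in NumPy throughout
a, b, c, d, e = (df16[col].to_numpy() for col in "abcde")
df16["new"] = np.where(e == 10, c * d, np.where(e >= 5, c + d, a + b))
```

Because the values here never exceed 100, `int16` is safe. With larger inputs the product `c * d` could overflow a narrow dtype, so the conversion should be chosen with the value range in mind.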

References:

https://mp.weixin.qq.com/s/cfoToYjcXXV5NJfwUr_1wA  Pandas中Apply函数加速百倍的技巧 (Tricks for speeding up the Apply function in Pandas a hundredfold)

Posted on 2021-08-30 18:00 by 努力的孔子
