apply is one of the most commonly used pandas methods, but it is relatively slow. This post covers a few ways to speed it up.
Data preparation
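import numpy as np
import pandas as pd

# one million rows, five integer columns with values in 0..10
df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)), columns=('a', 'b', 'c', 'd', 'e'))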
Benchmarking apply
import time

if __name__ == '__main__':
    def func(a, b, c, d, e):
        if e == 10:
            return c * d
        elif (e < 10) and (e >= 5):
            return c + d
        elif e < 5:
            return a + b

    # process_time() is cumulative, so record a start value and subtract it
    start = time.process_time()
    df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
    print(time.process_time() - start)    # about 19 s
Elapsed time: about 19 s
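The figures in this post are readings of time.process_time(), which counts CPU time used by the current process only. As a minimal sketch (the sleep and the loop are stand-ins chosen for illustration, not part of the original benchmark), the snippet below shows how that differs from wall-clock time measured with time.perf_counter(). The difference matters whenever the code under test waits or hands work to other processes.

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU time of this process only

time.sleep(0.2)                               # waiting: uses almost no CPU
total = sum(i * i for i in range(200_000))    # computing: uses CPU

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print(f"wall: {wall_elapsed:.2f} s, cpu: {cpu_elapsed:.2f} s")
```

The sleep shows up in the wall-clock reading but not in the CPU reading, so the two timers can disagree noticeably for the same piece of code.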
Speeding up with swifter
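if __name__ == '__main__':
    ### swifter speedup
    import swifter    # importing it registers the .swifter accessor on DataFrame

    start = time.process_time()
    df['new'] = df.swifter.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
    print(time.process_time() - start)    # about 7 s; note that starting the threads takes about 7 s, while the actual execution takes only about 1 s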
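Elapsed time: about 7 s
Vectorization
The best way to work with pandas and NumPy is vectorization, that is, operating on whole columns as vectors. It beats any method, function, or assorted speedup trick.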
if __name__ == '__main__':
    ### vectorization
    # vector operations are faster than apply
    start = time.process_time()
    df['new'] = df['c'] * df['d']    # default case: e == 10
    mask = df['e'] < 10
    df.loc[mask, 'new'] = df['c'] + df['d']    # overwrite rows with e < 10
    mask = df['e'] < 5
    df.loc[mask, 'new'] = df['a'] + df['b']    # overwrite rows with e < 5
    print(time.process_time() - start)    # about 1.2 s
Elapsed time: about 1.2 s
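The masked assignments above can also be written as a single expression. The following is a sketch of one alternative, not the method used in this post: it uses numpy.select, and the helper name vectorized_new and the 1,000-row sample are assumptions made for illustration. It checks the vectorized result against the row-wise func on the small sample.

```python
import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    # Row-wise reference logic, same branches as func in this post.
    if e == 10:
        return c * d
    elif 5 <= e < 10:
        return c + d
    else:
        return a + b


def vectorized_new(df):
    # Hypothetical helper: conditions are checked in order, first match wins.
    conditions = [(df["e"] == 10).to_numpy(), (df["e"] >= 5).to_numpy()]
    choices = [(df["c"] * df["d"]).to_numpy(), (df["c"] + df["d"]).to_numpy()]
    # Rows matching no condition (e < 5) fall back to a + b.
    fallback = (df["a"] + df["b"]).to_numpy()
    return np.select(conditions, choices, default=fallback)


# Small sample so the row-wise comparison stays quick.
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.integers(0, 11, size=(1000, 5)), columns=list("abcde"))

expected = [func(*row) for row in sample.itertuples(index=False)]
sample["new"] = vectorized_new(sample)
print((sample["new"] == expected).all())
```

Listing the conditions in priority order keeps the branch logic in one place, which may be easier to read than a sequence of overwrites. Whether it is faster than the masked version would need to be measured on the full data.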
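The real point of this post is not the apply method itself. Just remember one thing: treating pandas and NumPy data as vectors is the fastest approach.
The reference below describes other, even faster methods, but I could not get them to work in my own experiments, so I left them out. Feel free to try them.
Full code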
import time
import pandas as pd
import numpy as np

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)), columns=('a', 'b', 'c', 'd', 'e'))

    def func(a, b, c, d, e):
        if e == 10:
            return c * d
        elif (e < 10) and (e >= 5):
            return c + d
        elif e < 5:
            return a + b

    # time.clock() was removed in Python 3.8; process_time() is used instead.
    # process_time() is cumulative, so each section subtracts its own start value.
    start = time.process_time()
    df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
    print(time.process_time() - start)    # about 19 s

    ### swifter speedup
    import swifter
    start = time.process_time()
    df['new'] = df.swifter.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
    print(time.process_time() - start)    # about 7 s; starting the threads takes about 7 s, the actual execution only about 1 s

    ### vectorization
    # vector operations are faster than apply
    start = time.process_time()
    df['new'] = df['c'] * df['d']    # default case: e == 10
    mask = df['e'] < 10
    df.loc[mask, 'new'] = df['c'] + df['d']
    mask = df['e'] < 5
    df.loc[mask, 'new'] = df['a'] + df['b']
    print(time.process_time() - start)    # about 1.2 s

    ### type conversion + vectorization
    df = df.astype(np.int16)    # converted before the timer starts
    start = time.process_time()
    df['new'] = df['c'] * df['d']    # default case: e == 10
    mask = df['e'] < 10
    df.loc[mask, 'new'] = df['c'] + df['d']
    mask = df['e'] < 5
    df.loc[mask, 'new'] = df['a'] + df['b']
    print(time.process_time() - start)    # about 1.3 s

    ### .values (operate on the underlying NumPy arrays)
    start = time.process_time()
    df = df.astype(np.int16)
    df['new'] = df['c'].values * df['d'].values    # default case: e == 10
    mask = df['e'].values < 10
    df.loc[mask, 'new'] = df['c'] + df['d']
    mask = df['e'].values < 5
    df.loc[mask, 'new'] = df['a'] + df['b']
    print(time.process_time() - start)    # about 1 s
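In the timings above, the type-conversion step did not make the vectorized version faster (about 1.3 s versus 1.2 s). What the conversion does change is the memory footprint. The snippet below is a small sketch of that effect; the row count and the names wide and narrow are assumptions chosen for illustration, and the input dtype is set to int64 explicitly so the comparison does not depend on the platform default.

```python
import numpy as np
import pandas as pd

n_rows = 100_000  # smaller than the 1,000,000 rows used above, for a quick check
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.integers(0, 11, size=(n_rows, 5), dtype=np.int64),
    columns=list("abcde"),
)
# Values are in 0..10 and the largest result (10 * 10) is 100, so int16 is safe here.
narrow = wide.astype(np.int16)

# Bytes used by the column data, excluding the index.
wide_bytes = int(wide.memory_usage(index=False).sum())
narrow_bytes = int(narrow.memory_usage(index=False).sum())
print(wide_bytes, narrow_bytes)
```

Each int64 value takes 8 bytes and each int16 value takes 2, so the converted frame needs a quarter of the memory for the same data.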
References:
https://mp.weixin.qq.com/s/cfoToYjcXXV5NJfwUr_1wA Tips for speeding up the pandas apply function a hundredfold