For gigabyte-scale files, combining the work with joblib makes processing much faster.

pip install joblib 

joblib is a parallel-computing package that makes code run more efficiently.

A Parallel object creates a pool of workers so that each item of the list is executed in a separate process.

 

from math import sqrt

# serial baseline for comparison
data = [sqrt(i) for i in range(10)]
print(data)

 

By default, Parallel uses a process-based backend.

n_jobs sets the number of parallel workers (processes).

 

from math import sqrt
from joblib import Parallel, delayed

data = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))
print(data)

 

The delayed function

delayed creates a (function, args, kwargs) tuple. The core idea is to write the work as a generator expression.
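A minimal sketch of what delayed actually does: calling the wrapped function does not execute it, it just packages the call so Parallel can run it later.

```python
from math import sqrt
from joblib import delayed

# delayed(func)(*args, **kwargs) does not call func; it returns
# a (function, args, kwargs) tuple that Parallel executes later
func, args, kwargs = delayed(sqrt)(9)
print(func, args, kwargs)
```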

Multithreading: pass backend='threading'

from math import sqrt
from joblib import Parallel, delayed

data = Parallel(n_jobs=2, backend="threading")(delayed(sqrt)(i) for i in range(10))
print(data)

 

from math import sqrt
from joblib import Parallel, delayed
from sklearn.utils import parallel_backend

with parallel_backend('threading'):
    print(Parallel(n_jobs=10)(delayed(sqrt)(i) for i in range(10)))

with parallel_backend('multiprocessing'):
    print(Parallel(n_jobs=10)(delayed(sqrt)(i) for i in range(10)))
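Note that joblib itself also ships a parallel_backend context manager, so the same pattern works without importing from sklearn. A minimal sketch:

```python
from math import sqrt
from joblib import Parallel, delayed, parallel_backend

# joblib's own parallel_backend selects the backend for every
# Parallel call inside the with-block
with parallel_backend('threading'):
    result = Parallel(n_jobs=4)(delayed(sqrt)(i) for i in range(10))
print(result)
```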


Case study:

import glob

import pandas as pd
from joblib import Parallel, delayed


def grouptest(df):
    # partial aggregate for one chunk ('次数' is the count column,
    # name kept from the original data)
    return df.groupby('test1')['次数'].sum()


ce = glob.glob(r'E:\`````')  # glob pattern truncated in the original

l = []
for e in ce:
    chunks = pd.read_csv(e, chunksize=20000, iterator=True)
    # aggregate the chunks of one file in parallel, then combine
    df1 = pd.concat(Parallel(n_jobs=4)(delayed(grouptest)(chunk) for chunk in chunks))
    l.append(df1)

# merge the per-file partial sums into the final result
result = pd.concat(l).groupby(level=0).sum()
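The same chunked read-then-aggregate pattern can be demonstrated end to end on synthetic data; the temporary files, chunk size, and values below are made up for illustration.

```python
import os
import tempfile

import pandas as pd
from joblib import Parallel, delayed


def group_sum(chunk):
    # partial aggregate for one chunk
    return chunk.groupby('test1')['次数'].sum()


# build two small CSV files with synthetic data
tmp = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmp, f'part{i}.csv')
    pd.DataFrame({'test1': ['a', 'b'] * 3, '次数': range(6)}).to_csv(p, index=False)
    paths.append(p)

partials = []
for path in paths:
    chunks = pd.read_csv(path, chunksize=3)
    # aggregate each file's chunks in parallel
    partials.append(pd.concat(Parallel(n_jobs=2)(delayed(group_sum)(c) for c in chunks)))

# combine the per-file partial sums
result = pd.concat(partials).groupby(level=0).sum()
print(result)
```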

posted @ 2021-08-04 16:52  万物细雨