For very large (GB-scale) files, combining chunked processing with joblib's parallelism is much faster.
pip install joblib
joblib is a parallel-computing package that improves code execution speed.
A Parallel object creates a pool of workers so that each item of the input list is executed in a separate process.
from math import sqrt
# sequential baseline
data = [sqrt(i) for i in range(10)]
print(data)
By default, Parallel uses multiple processes.
n_jobs sets the number of parallel jobs (worker processes); n_jobs=-1 uses all available cores.
from joblib import Parallel, delayed
data = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))
print(data)
The delayed function
delayed wraps a function call into the tuple (function, args, kwargs) without executing it. The core idea is to write the calls as a generator expression, which Parallel then consumes and dispatches to the workers.
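You can see the captured tuple directly; a minimal sketch:

```python
from math import sqrt
from joblib import delayed

# delayed(sqrt)(9) records the call instead of executing it:
# it returns the tuple (function, args, kwargs)
task = delayed(sqrt)(9)
print(task)
```

Because nothing has run yet, Parallel is free to ship these tuples to whichever worker becomes available.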
To use threads instead of processes, pass backend='threading':
from joblib import Parallel, delayed
data = Parallel(n_jobs=2, backend="threading")(delayed(sqrt)(i) for i in range(10))
print(data)
The backend can also be selected for a whole block with the parallel_backend context manager (here imported from sklearn.utils):
from sklearn.utils import parallel_backend
with parallel_backend('threading'):
    print(Parallel(n_jobs=10)(delayed(sqrt)(i) for i in range(10)))
with parallel_backend('multiprocessing'):
    print(Parallel(n_jobs=10)(delayed(sqrt)(i) for i in range(10)))
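Note that joblib itself exposes the same context manager, so scikit-learn is not required for this; a minimal sketch:

```python
from math import sqrt
from joblib import Parallel, delayed, parallel_backend

# joblib's own context manager applies the chosen backend to
# every Parallel call inside the with-block
with parallel_backend('threading'):
    result = Parallel(n_jobs=4)(delayed(sqrt)(i) for i in range(10))
print(result)
```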
Example: chunked, parallel group-by aggregation over a set of large CSV files.

import glob
import pandas as pd
from joblib import Parallel, delayed

def grouptest(df):
    # partial aggregation for one chunk
    return df.groupby('test1')['次数'].sum()

ce = glob.glob(r'E:\...')  # path truncated in the original

l = []
for e in ce:
    # chunksize=20000 with iterator=True yields the file in 20000-row chunks
    reader = pd.read_csv(e, chunksize=20000, iterator=True)
    df1 = pd.concat(Parallel(n_jobs=4)(delayed(grouptest)(chunk) for chunk in reader))
    l.append(df1)
# combine the per-chunk partial sums into the final result
print(pd.concat(l).groupby(level=0).sum())
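The same map-reduce pattern can be exercised end to end on synthetic data. The file paths and the column names 'key' and 'count' below are placeholders standing in for the real files and for 'test1'/'次数' above:

```python
import os
import tempfile

import pandas as pd
from joblib import Parallel, delayed

def grouptest(df):
    # map step: partial aggregation for one chunk
    return df.groupby('key')['count'].sum()

# build two small synthetic CSV files in a temp directory
tmp = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmp, f'part{i}.csv')
    pd.DataFrame({'key': ['a', 'b'] * 50, 'count': [1] * 100}).to_csv(p, index=False)
    paths.append(p)

partials = []
for p in paths:
    reader = pd.read_csv(p, chunksize=30, iterator=True)
    # aggregate each 30-row chunk in parallel, then stack the partial sums
    partials.append(pd.concat(Parallel(n_jobs=2)(delayed(grouptest)(c) for c in reader)))

# reduce step: sum the partial results by group key
total = pd.concat(partials).groupby(level=0).sum()
print(total)
```

Each file holds 50 'a' rows and 50 'b' rows with count 1, so the final totals are 100 per key regardless of how the chunks were split.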
