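How to quickly load data from MongoDB into numpy and pandas
MongoDB data is often too large to fit into memory all at once for analysis. If you store every document as a Python dict and collect them in a list, memory fills up very quickly. Fortunately, we have numpy and pandas.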
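import numpy
import pymongo

c = pymongo.MongoClient()
collection = c.mydb.collection
# count_documents({}) replaces the older collection.count()
num = collection.count_documents({})
# Preallocate one float array per field
arrays = [numpy.zeros(num) for i in range(5)]

for i, record in enumerate(collection.find()):
    for x in range(5):
        # Parenthesize x + 1 so the fields read are x1 through x5
        arrays[x][i] = record["x%i" % (x + 1)]

for array in arrays:  # prove that we did something...
    print(numpy.mean(array))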
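When the code above runs over a large amount of data, most of the time goes into iterating the pymongo cursor. Monary, a library written in C, performs this conversion directly and is more efficient.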
from monary import Monary
import numpy

with Monary("127.0.0.1") as monary:
    arrays = monary.query(
        "mydb",                           # database name
        "collection",                     # collection name
        {},                               # query spec
        ["x1", "x2", "x3", "x4", "x5"],   # field names (in Mongo record)
        ["float64"] * 5                   # Monary field types (see below)
    )

for array in arrays:  # prove that we did something...
    print(numpy.mean(array))
And what about converting to pandas?
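One possible approach, as a rough sketch rather than a definitive recipe: since each field is already a separate array, the arrays can be handed to pandas as columns. The helper name `arrays_to_frame` is hypothetical, and the sample arrays below are made-up stand-ins for the result of a query. The sketch assumes the arrays may be numpy masked arrays (which is how missing values would typically show up) and turns masked entries into NaN.

```python
import numpy as np
import pandas as pd


def arrays_to_frame(arrays, columns):
    """Build a DataFrame with one column per array (hypothetical helper)."""
    if len(arrays) != len(columns):
        raise ValueError("need exactly one column name per array")
    # np.ma.filled returns plain ndarrays unchanged and replaces
    # masked entries of masked arrays with NaN.
    data = {name: np.ma.filled(arr, np.nan) for name, arr in zip(columns, arrays)}
    return pd.DataFrame(data, columns=columns)


# Stand-in data (an assumption for illustration, not real query output)
sample = [
    np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False]),
    np.array([4.0, 5.0, 6.0]),
]
df = arrays_to_frame(sample, ["x1", "x2"])
print(df.mean())
```

Building the frame from a dict of columns avoids stacking the arrays into an intermediate matrix first, which matters when the data is already close to the memory limit.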