15. Reading in Chunks
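When working with large files, it is often better to read only a small piece of the file, or to iterate over the file in smaller chunks, than to load everything at once.
Before doing so, it helps to adjust the pandas display settings so that output is more compact: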
In [45]: pd.options.display.max_rows = 10
With this setting, pandas displays at most 10 rows of any result, even when the file is large.
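If the compact display is only wanted temporarily, one possible alternative is `pd.option_context`, which applies an option inside a `with` block and restores the previous value afterwards. The sketch below is illustrative only: the 100-row frame `df` is a hypothetical stand-in and is not the ex6.csv file used in this section.

```python
import pandas as pd

# Hypothetical demo frame with 100 rows (not the ex6.csv file from the text).
df = pd.DataFrame({"n": range(100)})

default_max_rows = pd.get_option("display.max_rows")

# Inside the block, long frames are elided down to 10 displayed rows.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    truncated_text = repr(df)

# On exit, the previous setting is restored automatically.
restored = pd.get_option("display.max_rows")

print(truncated_text)
```

Setting `pd.options.display.max_rows` directly, as in the session above, changes the option for the rest of the session instead.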
In [46]: result = pd.read_csv('d:/ex6.csv')
In [47]: result
Out[47]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
... ... ... ... ... ..
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0
[10000 rows x 5 columns]
To read only the first n rows from the top of the file, pass the nrows parameter:
In [48]: result = pd.read_csv('d:/ex6.csv', nrows=5)
In [49]: result
Out[49]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
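For readers without the ex6.csv file, the effect of nrows can be reproduced on an in-memory CSV. This is a minimal sketch under stated assumptions: the text `csv_text` and its columns `a` and `b` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file: a header plus 1,000 data rows.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

# nrows=5 parses the header and only the first five data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=5)

print(head.shape)  # (5, 2)
```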
To read the file in pieces instead, pass chunksize, the number of rows in each chunk:
In [50]: chunker = pd.read_csv('d:/ex6.csv', chunksize=1000)
In [51]: chunker
Out[51]: <pandas.io.parsers.TextFileReader at 0x2417d6cfb38>
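The reader returned by read_csv with chunksize does not load any rows until it is asked for them. The sketch below shows two ways of pulling chunks from it: `get_chunk()` for a single chunk and plain iteration for the rest. It assumes pandas 1.2 or later, where the reader can also be used as a context manager; the 10-row `csv_text` is hypothetical sample data.

```python
import io

import pandas as pd

# Hypothetical sample data: a header plus 10 data rows.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# Using the reader in a with-block (assumes pandas >= 1.2) closes it when done.
with pd.read_csv(io.StringIO(csv_text), chunksize=4) as reader:
    first = reader.get_chunk()          # the next chunk: 4 rows
    rest = [chunk for chunk in reader]  # the remaining chunks: 4 rows, then 2

sizes = [len(first)] + [len(chunk) for chunk in rest]
print(sizes)  # [4, 4, 2]
```

Each chunk is an ordinary DataFrame, and the row index continues from one chunk to the next rather than restarting at 0.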
The TextFileReader object above is iterable. For example, we can iterate over it and aggregate the 'key' column to count how many times each value occurs:
In [52]: total = pd.Series([], dtype='int64')
In [53]: for piece in chunker:
    ...:     total = total.add(piece['key'].value_counts(), fill_value=0)
    ...: total = total.sort_values(ascending=False)
In [54]: total
Out[54]:
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
...
5 157.0
2 152.0
0 151.0
9 150.0
1 146.0
Length: 36, dtype: float64
In [55]: total[:10]
Out[55]:
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
M 338.0
J 337.0
F 335.0
K 334.0
H 330.0
dtype: float64
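The chunked count can be checked against a single full read on data small enough to fit in memory. The following is a minimal sketch, assuming a hypothetical one-column CSV built in memory (`csv_text`, 300 rows) in place of ex6.csv; the chunk size of 70 is arbitrary and chosen so that the last chunk is shorter than the others.

```python
import io

import pandas as pd

# Hypothetical data: a single 'key' column with 300 rows.
keys = ["A", "B", "C", "A", "B", "A"] * 50
csv_text = "key\n" + "\n".join(keys)

# Accumulate per-chunk counts; fill_value=0 covers keys missing from one side.
total = pd.Series([], dtype="int64")
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=70):
    total = total.add(piece["key"].value_counts(), fill_value=0)

# The additions produce floats, so convert back to integers after sorting.
total = total.sort_values(ascending=False).astype("int64")

# Reference result from reading the whole file in one go.
full = pd.read_csv(io.StringIO(csv_text))["key"].value_counts()

print(total.to_dict() == full.to_dict())  # True
```

Because only one chunk is held in memory at a time, the same loop applies to files that are too large to load whole; the running `total` stays as small as the number of distinct keys.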