Numpy基础学习笔记2
3.数组处理数据
Numpy数组可以代替循环,进行矢量化的运算,通常会比纯python的方式快一两个数量级。
3.1 将条件逻辑表述为数组运算
np.where函数是x if condition else y
的矢量化版本。
In [15]: yarr = np.array([2.1,2.2,2.3,2.4,2.5])
In [16]: cond = np.array([True,False,True,True,False])
In [17]: xarr = np.array([1.1,1.2,1.3,1.4,1.5])
In [18]: np.where(cond,xarr,yarr) # 判断cond条件,真zarr,假yarr
Out[18]: array([ 1.1, 2.2, 1.3, 1.4, 2.5])
另一个例子,希望将一组随机数,正数替换为2,负数替换为-2
In [19]: arr = np.random.randn(4,4)
In [20]: arr
Out[20]:
array([[ 1.18242592, 0.34138367, 0.36648288, 0.87214939],
[ 0.67129526, 0.2410077 , 0.37928273, -0.43982009],
[ 0.47559093, -0.050917 , -0.10229582, 1.58122926],
[ 0.83486166, -1.27310522, 0.17164926, 0.77951888]])
In [21]: np.where(arr > 0,2,-2)
Out[21]:
array([[ 2, 2, 2, 2],
[ 2, 2, 2, -2],
[ 2, -2, -2, 2],
[ 2, -2, 2, 2]])
In [22]: np.where(arr > 0,2,arr) # 负数还是arr
Out[22]:
array([[ 2. , 2. , 2. , 2. ],
[ 2. , 2. , 2. , -0.43982009],
[ 2. , -0.050917 , -0.10229582, 2. ],
[ 2. , -1.27310522, 2. , 2. ]])
3.2 数学和统计方法
这些方法一般可以作为实例方法调用,也可以当做Numpy函数使用。
In [23]: arr = np.random.randn(5,4)
In [24]: arr.mean()
Out[24]: -0.024836906150552153
In [25]: np.mean(arr)
Out[25]: -0.024836906150552153
基本数组统计方法如下:
In [26]: arr
Out[26]:
array([[-0.03065448, 0.91344557, -0.77812406, -1.608862 ],
[ 1.58463814, 0.98126805, 1.06389757, -1.17451329],
[ 1.48408281, 0.02386196, -0.80217916, 0.29413806],
[ 0.11536984, 1.73736452, 0.93596778, 0.26898712],
[-2.05527855, 0.49837502, -2.56571303, -1.38280997]])
In [27]: arr.sum()
Out[27]: -0.49673812301104303
In [28]: arr.sum(axis=0)
Out[28]: array([ 1.09815775, 4.15431511, -2.14615091, -3.60306008])
In [29]: arr.sum(axis=1)
Out[29]: array([-1.50419497, 2.45529046, 0.99990367, 3.05768925, -5.50542653]
)
In [30]: arr.mean()
Out[30]: -0.024836906150552153
In [31]: arr.mean(axis=1)
Out[31]: array([-0.37604874, 0.61382262, 0.24997592, 0.76442231, -1.37635663]
)
In [32]: arr.std()
Out[32]: 1.2223549632355621
In [33]: arr.var()
Out[33]: 1.4941516561466126
In [34]: arr.min()
Out[34]: -2.565713031578829
In [35]: arr.max()
Out[35]: 1.7373645152425918
In [36]: arr.argmin()
Out[36]: 18
In [37]: arr.cumsum()
Out[37]:
array([-0.03065448, 0.88279109, 0.10466703, -1.50419497, 0.08044316,
1.06171121, 2.12560878, 0.95109549, 2.4351783 , 2.45904026,
1.6568611 , 1.95099916, 2.066369 , 3.80373352, 4.73970129,
5.00868841, 2.95340986, 3.45178488, 0.88607184, -0.49673812])
In [38]: arr.cumprod()
Out[38]:
array([ -3.06544789e-02, -2.80011979e-02, 2.17884059e-02,
-3.50545383e-02, -5.55487582e-02, -5.45082216e-02,
-5.79911645e-02, 6.81113935e-02, 1.01082948e-01,
2.41203713e-03, -1.93488591e-03, -5.69123592e-04,
-6.56596961e-05, -1.14074826e-04, -1.06770361e-04,
-2.87198518e-05, 5.90272954e-05, 2.94177294e-05,
-7.54774516e-05, 1.04370972e-04])
3.3 用于布尔型数组的方法
布尔值是True和False,同时也是1和0。我们可以使用sum来统计True值得计数。
In [39]: arr
Out[39]:
array([[-0.03065448, 0.91344557, -0.77812406, -1.608862 ],
[ 1.58463814, 0.98126805, 1.06389757, -1.17451329],
[ 1.48408281, 0.02386196, -0.80217916, 0.29413806],
[ 0.11536984, 1.73736452, 0.93596778, 0.26898712],
[-2.05527855, 0.49837502, -2.56571303, -1.38280997]])
In [40]: (arr>0).sum()
Out[40]: 12
In [41]: arr>0
Out[41]:
array([[False, True, False, False],
[ True, True, True, False],
[ True, True, False, True],
[ True, True, True, True],
[False, True, False, False]], dtype=bool)
还有ang和all两个方法,可以用于布尔型数组,也可以用于非布尔型。在用于非布尔型数组时,所有非0元素都被当做True。
In [46]: bools = arr > 0 #将arr>0这个bool型数组赋值
In [47]: bools
Out[47]:
array([[False, True, False, False],
[ True, True, True, False],
[ True, True, False, True],
[ True, True, True, True],
[False, True, False, False]], dtype=bool)
In [48]: bools.any()
Out[48]: True
In [49]: bools.all()
Out[49]: False
In [50]: arr.any() #非0值将当成True处理。
Out[50]: True
3.4 排序
Numpy数组可以通过sort方法就地排序。
In [51]: arr
Out[51]:
array([[-0.03065448, 0.91344557, -0.77812406, -1.608862 ],
[ 1.58463814, 0.98126805, 1.06389757, -1.17451329],
[ 1.48408281, 0.02386196, -0.80217916, 0.29413806],
[ 0.11536984, 1.73736452, 0.93596778, 0.26898712],
[-2.05527855, 0.49837502, -2.56571303, -1.38280997]])
In [52]: arr.sort()
In [53]: arr
Out[53]:
array([[-1.608862 , -0.77812406, -0.03065448, 0.91344557],
[-1.17451329, 0.98126805, 1.06389757, 1.58463814],
[-0.80217916, 0.02386196, 0.29413806, 1.48408281],
[ 0.11536984, 0.26898712, 0.93596778, 1.73736452],
[-2.56571303, -2.05527855, -1.38280997, 0.49837502]])
In [54]: arr.sort(axis=0)
In [55]: arr
Out[55]:
array([[-2.56571303, -2.05527855, -1.38280997, 0.49837502],
[-1.608862 , -0.77812406, -0.03065448, 0.91344557],
[-1.17451329, 0.02386196, 0.29413806, 1.48408281],
[-0.80217916, 0.26898712, 0.93596778, 1.58463814],
[ 0.11536984, 0.98126805, 1.06389757, 1.73736452]])
In [56]: arr.sort(1)
In [57]: arr
Out[57]:
array([[-2.56571303, -2.05527855, -1.38280997, 0.49837502],
[-1.608862 , -0.77812406, -0.03065448, 0.91344557],
[-1.17451329, 0.02386196, 0.29413806, 1.48408281],
[-0.80217916, 0.26898712, 0.93596778, 1.58463814],
[ 0.11536984, 0.98126805, 1.06389757, 1.73736452]])
举个例子,求一个数组百分之5的分位数。
In [62]: arr = np.random.randn(1000)
In [63]: arr.sort()
In [64]: arr[int(0.05 * len(arr))]
Out[64]: -1.6307748333138019
In [67]: arr[50]
Out[67]: -1.6307748333138019
3.5 唯一化(去重)以及数组的集合运算
np.unique
方法为数组去重,并排序。
In [68]: names = np.array(["Bob","Joe","Will","Bob","Will","Joe","Joe"])
In [69]: np.unique(names)
Out[69]:
array(['Bob', 'Joe', 'Will'],
dtype='<U4')
# 该方法类似于纯python中的如下:
In [70]: sorted(set(names))
Out[70]: ['Bob', 'Joe', 'Will']
其他集合运算:
In [71]: x = np.arange(1,101)
In [72]: y = np.arange(51,151)
In [73]: x
Out[73]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78,
79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99, 100])
In [74]: y
Out[74]:
array([ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102,
103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128,
129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141,
142, 143, 144, 145, 146, 147, 148, 149, 150])
In [75]: np.intersect1d(x,y)
Out[75]:
array([ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100])
In [77]: np.union1d(x,y)
Out[77]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78,
79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
144, 145, 146, 147, 148, 149, 150])
In [78]: np.in1d(x,y)
Out[78]:
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True, True], dt
ype=bool)
In [79]: np.setdiff1d(x,y)
Out[79]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])
In [80]: np.setxor1d(x,y)
Out[80]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 101, 102,
103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128,
129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141,
142, 143, 144, 145, 146, 147, 148, 149, 150])
4.文件处理
Numpy可以读写文本数据或二进制数据。后续有pandas来处理文本,因此本部分简单介绍。
4.1 以二进制方式保存和读取numpy数组
单个数组,保存时会自动添加后缀名.npy
In [86]: arr = np.arange(10)
In [88]: np.save("some_array", arr)
In [90]: np.load("some_array.npy")
Out[90]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
多个数组,可以使用压缩方式存储,后缀名.npz
In [91]: arr
Out[91]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [92]: arr2 = np.arange(20)
In [93]: np.savez("array_archive.npz",a=arr,b=arr2)
In [94]: arch = np.load("array_archive.npz")
In [95]: arch
Out[95]: <numpy.lib.npyio.NpzFile at 0x7084f98>
In [96]: arch['b']
Out[96]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
4.2 存取文本文件
使用np.savetxt
和np.loadtxt
两个方法来实现。后面会主要介绍pandas中的read_csv和read_table函数,这里不详细介绍。
In [99]: arr = np.random.randn(5,5)
In [102]: np.savetxt("arr.txt",arr,delimiter=",")
In [103]: np.loadtxt("arr.txt",delimiter=",")
Out[103]:
array([[ 0.45439906, -0.11067033, 1.67561654, 0.14142381, 0.1016269 ],
[-1.09070259, 0.41627682, -0.81896911, -0.14980666, -1.06391152],
[-0.88333647, 0.28268258, 0.69605952, 0.36348569, -0.53223699],
[-0.50561387, -0.65916355, 1.40181374, 1.17810701, 1.31155551],
[ 0.060254 , -1.02915195, -0.59382843, 0.49100178, -0.9541697 ]])
In [104]: arr
Out[104]:
array([[ 0.45439906, -0.11067033, 1.67561654, 0.14142381, 0.1016269 ],
[-1.09070259, 0.41627682, -0.81896911, -0.14980666, -1.06391152],
[-0.88333647, 0.28268258, 0.69605952, 0.36348569, -0.53223699],
[-0.50561387, -0.65916355, 1.40181374, 1.17810701, 1.31155551],
[ 0.060254 , -1.02915195, -0.59382843, 0.49100178, -0.9541697 ]])
待续。。。