NumPy

See the Wikipedia article on NumPy.

NumPy

Type: module


Provides

  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation

How to use the documentation


Documentation is available in two forms: docstrings provided
with the code, and a loose standing reference guide, available from
the NumPy homepage http://www.scipy.org.

We recommend exploring the docstrings using
IPython http://ipython.scipy.org, an advanced Python shell with
TAB-completion and introspection capabilities.

For some objects, np.info(obj) may provide additional help (it prints information about a function, class, or module). This is
particularly true if you see the line "Help on ufunc object:" at the top
of the help() page. Ufuncs are implemented in C, not Python, for speed.
The native Python help() does not know how to view their help, but our
np.info() function does.

To search for documents containing a keyword, do::

import numpy as np
np.lookfor('keyword')

General-purpose documents like a glossary and help on the basic concepts
of numpy are available under the doc sub-module::

from numpy import doc
help(doc)

Available subpackages
---------------------
doc
    Topical documentation on broadcasting, indexing, etc.
lib
    Basic functions used by several sub-packages.
random
    Core Random Tools
linalg
    Core Linear Algebra Tools
fft
    Core FFT routines
polynomial
    Polynomial tools
testing
    NumPy testing tools
f2py
    Fortran to Python Interface Generator.
distutils
    Enhancements to distutils with support for
    Fortran compilers support and more.

Utilities
---------
test
    Run numpy unittests
show_config
    Show numpy build configuration
dual
    Overwrite certain functions with high-performance Scipy tools
matlib
    Make everything matrices.
__version__
    NumPy version string

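A few examples follow:

import numpy as np
from numpy import doc  # needed so that the name `doc` is defined below
help(doc)

help(doc.creation)

# IPython-only shorthand for help(doc.basics)
doc.basics?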

help(np.lib)
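
The docstring above mentions that np.info() can show help for ufuncs. As a minimal sketch, np.info() also accepts an output argument, so the text can be captured in a buffer instead of printed; the names buf and text are only illustrative.

```python
import io
import numpy as np

# Capture the help text for a ufunc (np.sin is implemented in C).
buf = io.StringIO()
np.info(np.sin, output=buf)
text = buf.getvalue()

# Show only the beginning of the captured documentation.
print(text[:200])
```

Capturing the text this way is handy when the help needs to be searched or logged rather than read interactively.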

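ndarray overview

Translated from the Quickstart tutorial.
NumPy's main object is the homogeneous multidimensional array. In NumPy, dimensions are called axes, and the number of axes is the rank.

For example, the coordinates of a point in 3D space, [1, 2, 1], form an array of rank 1: it has one axis, and that axis has a length of 3. The array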

[[ 1., 0., 0.],
 [ 0., 1., 2.]]

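has rank 2 (it is 2-dimensional). Its first dimension (axis) has a length of 2, and its second has a length of 3.

NumPy's array class is called ndarray. Its most important attributes are:

  • ndarray.ndim: the number of axes (dimensions) of the array.
  • ndarray.shape: the dimensions of the array, a tuple of integers giving the length of each dimension.
    For a matrix with n rows and m columns, shape is (n, m).
  • ndarray.size: the total number of elements of the array, equal to the product of the elements of shape.
  • ndarray.dtype: an object describing the type of the elements in the array.
  • ndarray.itemsize: the size in bytes of each element of the array. For example, an array of elements of type float64 has itemsize 8 (=64/8), while one of type int32 has itemsize 4 (=32/8).
  • ndarray.data: the buffer containing the actual elements of the array. Normally, we won't need to use this attribute because we will access the elements in an array using indexing facilities.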

Some examples:

z = np.array([[ 0,  1,  2,  3,  4],
              [ 5,  6,  7,  8,  9],
              [10, 11, 12, 13, 14]])
t = np.array([z, 2 * z + 1])
t
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]],

       [[ 1,  3,  5,  7,  9],
        [11, 13, 15, 17, 19],
        [21, 23, 25, 27, 29]]])
print('z.ndim = ', z.ndim)
print('t.ndim = ', t.ndim)
z.ndim =  2
t.ndim =  3
print('z.shape = ',z.shape)
print('t.shape = ',t.shape)
z.shape =  (3, 5)
t.shape =  (2, 3, 5)
print('z.size = ',z.size)
print('t.size = ',t.size)
z.size =  15
t.size =  30
t.dtype.name
'int32'
t.itemsize
4
type(t)
numpy.ndarray
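
The relationships between the attributes listed above can be checked directly. This is a small sketch; the dtype is fixed to int32 here so that the numbers do not depend on the platform, and nbytes is an extra ndarray attribute not mentioned in the list.

```python
import numpy as np

# Same values as z above, with an explicit 32-bit integer dtype.
z = np.arange(15, dtype=np.int32).reshape(3, 5)

assert z.ndim == len(z.shape)            # number of axes
assert z.size == np.prod(z.shape)        # total element count
assert z.nbytes == z.size * z.itemsize   # total bytes in the data buffer

print(z.ndim, z.shape, z.size, z.itemsize, z.nbytes)  # 2 (3, 5) 15 4 60
```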

ndarray indexing

z
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
z[0]  # elements of the first row
array([0, 1, 2, 3, 4])
z[0, 2] # third element of the first row
2
t[0]
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
t[0][2]
array([10, 11, 12, 13, 14])
t[0, 2]
array([10, 11, 12, 13, 14])
t[0, 2, 3]
13
t[0, :2, 2:4]
array([[2, 3],
       [7, 8]])

For lists

e = [1, 2, 3, 4]
p = [e, e]
p[0][0]
1
p[0,0]  # lists do not support this form of indexing
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-300-d527d1725556> in <module>()
----> 1 p[0,0]  # lists do not support this form of indexing


TypeError: list indices must be integers or slices, not tuple

ndarray supports vectorized operations

Operations applied to every element

z
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
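z.sum()  # sum of all elements
105
z.sum(axis = 0)    # sum along axis 0, i.e. column-wise sums; the result resembles a row vector
array([15, 18, 21, 24, 27])
z.sum(axis = 1)   # sum along axis 1, i.e. row-wise sums; the result resembles a column vector
array([10, 35, 60])
z.std()  # standard deviation of all elements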
4.3204937989385739
z.std(axis = 0)
array([ 4.0824829,  4.0824829,  4.0824829,  4.0824829,  4.0824829])
z.cumsum()  # cumulative sum of all elements
array([  0,   1,   3,   6,  10,  15,  21,  28,  36,  45,  55,  66,  78,
        91, 105], dtype=int32)
z * 2   # like scalar multiplication of a matrix
array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])
z ** 2  
array([[  0,   1,   4,   9,  16],
       [ 25,  36,  49,  64,  81],
       [100, 121, 144, 169, 196]], dtype=int32)
np.sqrt(z)
array([[ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ],
       [ 2.23606798,  2.44948974,  2.64575131,  2.82842712,  3.        ],
       [ 3.16227766,  3.31662479,  3.46410162,  3.60555128,  3.74165739]])
y = np.arange(10)  # like Python's range, but returns an array
y
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a = np.array([1, 2, 3, 6])
b = np.linspace(0, 2, 4)  # creates an array of 4 evenly spaced points from 0 to 2, endpoints included
c = a - b
c
array([ 1.        ,  1.33333333,  1.66666667,  4.        ])
# universal functions
a = np.linspace(-np.pi, np.pi, 100) 
b = np.sin(a)
c = np.cos(a)
b = np.array([1,2,3,4])
a = np.array([4,5,6,7])
print('a + b = ', a + b)
print('a - b = ', a - b)
print('a * b = ', a * b)
print('a / b = ', a / b)
print('a // b = ', a // b)
print('a % b = ', a % b)
a + b =  [ 5  7  9 11]
a - b =  [3 3 3 3]
a * b =  [ 4 10 18 28]
a / b =  [ 4.    2.5   2.    1.75]
a // b =  [4 2 2 1]
a % b =  [0 1 0 3]

For non-numeric arrays

a = np.array(list('python'))
a
array(['p', 'y', 't', 'h', 'o', 'n'],
      dtype='<U1')
b = np.array(list('numpy'))
b
array(['n', 'u', 'm', 'p', 'y'],
      dtype='<U1')
a + b
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-153-f96fb8f649b6> in <module>()
----> 1 a + b


TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
list(a) + list(b)
['p', 'y', 't', 'h', 'o', 'n', 'n', 'u', 'm', 'p', 'y']
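
When element-wise string concatenation is what is actually wanted (rather than joining two lists), np.char.add is one option. This is a sketch with two equal-length arrays chosen for illustration; the arrays a and b above have different lengths (6 and 5), so they could not be combined element by element in any case.

```python
import numpy as np

# Two string arrays of the same shape (illustrative values).
left = np.array(list('abc'))
right = np.array(list('xyz'))

# Concatenate the strings pairwise, element by element.
joined = np.char.add(left, right)
print(joined)  # ['ax' 'by' 'cz']
```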

Linear algebra

from numpy.random import rand
from numpy.linalg import solve, inv
a = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])
a.transpose()
array([[ 1. ,  3. ,  5. ],
       [ 2. ,  4. ,  9. ],
       [ 3. ,  6.7,  5. ]])
inv(a)
array([[-2.27683616,  0.96045198,  0.07909605],
       [ 1.04519774, -0.56497175,  0.1299435 ],
       [ 0.39548023,  0.05649718, -0.11299435]])
b =  np.array([3, 2, 1])
solve(a, b)  # solve the equation ax = b
array([-4.83050847,  2.13559322,  1.18644068])
c = rand(3, 3)  # create a 3x3 random matrix
c
array([[ 0.98539238,  0.62602057,  0.63592577],
       [ 0.84697864,  0.86223698,  0.20982139],
       [ 0.15532627,  0.53992238,  0.65312854]])
np.dot(a, c)  # matrix multiplication
array([[  3.14532847,   3.97026167,   3.01495417],
       [  7.38477771,   8.94448958,   7.1230241 ],
       [ 13.32640097,  13.58984759,   8.33366406]])
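
A quick way to sanity-check the results of solve and inv is to substitute them back in. The sketch below reuses the matrix a and vector b from above and uses the @ operator, which is equivalent to np.dot for 2-D arrays; np.allclose is used because floating-point results are rarely exact.

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])
b = np.array([3, 2, 1])

x = np.linalg.solve(a, b)     # solution of a x = b
a_inv = np.linalg.inv(a)      # inverse of a

# Substitute back: a @ x should reproduce b, and a_inv @ a the identity.
print(np.allclose(a @ x, b))              # True
print(np.allclose(a_inv @ a, np.eye(3)))  # True
```

For solving a single system, solve is generally preferable to computing the inverse and multiplying.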

Array creation

See np.doc.creation?
There are 5 general mechanisms for creating arrays:

  1. Conversion from other Python structures (e.g., lists, tuples)
  2. Intrinsic numpy array creation objects (e.g., arange, ones, zeros,
    etc.)
  3. Reading arrays from disk, either from standard or custom formats
  4. Creating arrays from raw bytes through the use of strings or buffers
  5. Use of special library functions (e.g., random)
import numpy as np
x = np.array([2,3,1,0])
x1 = np.array([[1,2.0],[0,0],(1+1j,3.)]) # note mix of tuple and lists, and types
x2 = np.array([[ 1.+0.j, 2.+0.j], [ 0.+0.j, 0.+0.j], [ 1.+1.j, 3.+0.j]])

y = np.zeros((2, 3))
y1 = np.ones((2,3))
y2 = np.arange(10)
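y3 = np.arange(2, 10, dtype=float)  # the np.float alias was removed from NumPy; the built-in float is equivalent
y4 = np.arange(2, 10, 0.2)
y5 = np.linspace(1., 4., 6)  # 6 evenly spaced points from 1 to 4, endpoints included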

z = np.indices((3,3))

r = [x, x1, x2, y, y1, y2, y3, y4, y5, z]
s = 'x, x1, x2, y, y1, y2, y3, y4, y5, z'.split(', ')

for name, value in zip(s, r):
    print('%s =  ' % name)
    print('')
    print(value)
    print(75 * '=')
x =  

[2 3 1 0]
===========================================================================
x1 =  

[[ 1.+0.j  2.+0.j]
 [ 0.+0.j  0.+0.j]
 [ 1.+1.j  3.+0.j]]
===========================================================================
x2 =  

[[ 1.+0.j  2.+0.j]
 [ 0.+0.j  0.+0.j]
 [ 1.+1.j  3.+0.j]]
===========================================================================
y =  

[[ 0.  0.  0.]
 [ 0.  0.  0.]]
===========================================================================
y1 =  

[[ 1.  1.  1.]
 [ 1.  1.  1.]]
===========================================================================
y2 =  

[0 1 2 3 4 5 6 7 8 9]
===========================================================================
y3 =  

[ 2.  3.  4.  5.  6.  7.  8.  9.]
===========================================================================
y4 =  

[ 2.   2.2  2.4  2.6  2.8  3.   3.2  3.4  3.6  3.8  4.   4.2  4.4  4.6  4.8
  5.   5.2  5.4  5.6  5.8  6.   6.2  6.4  6.6  6.8  7.   7.2  7.4  7.6  7.8
  8.   8.2  8.4  8.6  8.8  9.   9.2  9.4  9.6  9.8]
===========================================================================
y5 =  

[ 1.   1.6  2.2  2.8  3.4  4. ]
===========================================================================
z =  

[[[0 0 0]
  [1 1 1]
  [2 2 2]]

 [[0 1 2]
  [0 1 2]
  [0 1 2]]]
===========================================================================
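
The listing above exercises mechanisms 1 and 2. The following sketch touches on mechanisms 3 to 5; an in-memory buffer stands in for a file on disk, and the variable names and sample values are illustrative only.

```python
import io
import numpy as np

# 3. Round-trip through NumPy's .npy format (BytesIO stands in for a disk file).
buf = io.BytesIO()
np.save(buf, np.arange(6).reshape(2, 3))
buf.seek(0)
loaded = np.load(buf)

# 4. Interpret raw bytes as an array of unsigned 8-bit integers.
raw = np.frombuffer(b'\x01\x02\x03\x04', dtype=np.uint8)

# 5. Special library function: a seeded random number generator.
rng = np.random.default_rng(seed=0)
sample = rng.random((2, 3))

print(loaded)
print(raw)           # [1 2 3 4]
print(sample.shape)  # (2, 3)
```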

Tip: the order parameter

order is the order in which elements are stored in memory. C means C-like (row-major) and F means Fortran-like (column-major).

g = np.ones((2,3,4), dtype = 'i', order = 'C')  # np.zeros() takes the same arguments
g
array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int32)
# Another array can be passed in; the result is an all-ones array with the same `shape`
h = np.ones_like(g, dtype = 'float16', order = 'C')  # np.zeros_like() works the same way
h
array([[[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]],

       [[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]]], dtype=float16)
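
The order argument changes only how the elements are laid out in memory, not the values or the shape. One way to see this is through the strides attribute (the number of bytes to step along each axis); the small 2x3 example below is a sketch with an explicit int32 dtype so the byte counts are predictable.

```python
import numpy as np

c = np.ones((2, 3), dtype=np.int32, order='C')  # row-major
f = np.ones((2, 3), dtype=np.int32, order='F')  # column-major

print(c.strides)  # (12, 4): the next row is 12 bytes away, the next column 4
print(f.strides)  # (4, 8): the next row is 4 bytes away, the next column 8
print(c.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # True True

# Same shape and values; only the memory layout differs.
print(np.array_equal(c, f))  # True
```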

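Notes:

  1. The composition, length, and size of an array are homogeneous within every dimension.
  2. Only one data type (numpy.dtype) is allowed for the whole array.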
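
A consequence of the single-dtype rule is that mixed inputs are converted to one common type when the array is created. A minimal sketch, with illustrative values:

```python
import numpy as np

# An integer, a float, and a bool are all promoted to float64.
mixed_numbers = np.array([1, 2.5, True])
print(mixed_numbers.dtype)  # float64

# Mixing a number with a string yields a Unicode string array.
mixed_text = np.array([1, 'a'])
print(mixed_text.dtype.kind)  # U
```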

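NumPy dtype objects

dtype   Description              Example
t       Bit field                t4 (4 bits)
b       Boolean                  b (True or False)
i       Integer                  i8 (64 bits)
u       Unsigned integer         u8 (64 bits)
f       Floating point           f8 (64 bits)
c       Complex floating point   c16 (128 bits)
O       Object                   O (pointer to an object)
S, a    String                   S24 (24 characters)
U       Unicode                  U24 (24 Unicode characters)
V       Other                    V12 (12-byte data block)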
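
Most of the codes in the table can be passed to np.dtype as strings. The sketch below prints the resulting name and item size for a selection of them. Note one assumption made here: the Boolean type is spelled 'b1', because a bare 'b' passed to np.dtype is interpreted as a signed byte (int8); the bit-field code 't' is left out.

```python
import numpy as np

# Type codes from the table, written as dtype strings.
codes = ['b1', 'i8', 'u8', 'f8', 'c16', 'O', 'S24', 'U24', 'V12']

for code in codes:
    dt = np.dtype(code)
    # name is NumPy's canonical name; itemsize is the size in bytes.
    print(code, dt.name, dt.itemsize)
```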

Structured arrays

Structured arrays let us use a different NumPy data type for each column, at least.

np.info(np.dtype)
 dtype()

dtype(obj, align=False, copy=False)

Create a data type object.

A numpy array is homogeneous, and contains elements described by a
dtype object. A dtype object can be constructed from different
combinations of fundamental numeric types.

Parameters
----------
obj
    Object to be converted to a data type object.
align : bool, optional
    Add padding to the fields to match what a C compiler would output
    for a similar C-struct. Can be ``True`` only if `obj` is a dictionary
    or a comma-separated string. If a struct dtype is being created,
    this also sets a sticky alignment flag ``isalignedstruct``.
copy : bool, optional
    Make a new copy of the data-type object. If ``False``, the result
    may just be a reference to a built-in data-type object.

See also
--------
result_type

Examples
--------
Using array-scalar type:

>>> np.dtype(np.int16)
dtype('int16')

Structured type, one field name 'f1', containing int16:

>>> np.dtype([('f1', np.int16)])
dtype([('f1', '<i2')])

Structured type, one field named 'f1', in itself containing a structured
type with one field:

>>> np.dtype([('f1', [('f1', np.int16)])])
dtype([('f1', [('f1', '<i2')])])

Structured type, two fields: the first field contains an unsigned int, the
second an int32:

>>> np.dtype([('f1', np.uint), ('f2', np.int32)])
dtype([('f1', '<u4'), ('f2', '<i4')])

Using array-protocol type strings:

>>> np.dtype([('a','f8'),('b','S10')])
dtype([('a', '<f8'), ('b', '|S10')])

Using comma-separated field formats.  The shape is (2,3):

>>> np.dtype("i4, (2,3)f8")
dtype([('f0', '<i4'), ('f1', '<f8', (2, 3))])

Using tuples.  ``int`` is a fixed type, 3 the field's shape.  ``void``
is a flexible type, here of size 10:

>>> np.dtype([('hello',(np.int,3)),('world',np.void,10)])
dtype([('hello', '<i4', 3), ('world', '|V10')])

Subdivide ``int16`` into 2 ``int8``'s, called x and y.  0 and 1 are
the offsets in bytes:

>>> np.dtype((np.int16, {'x':(np.int8,0), 'y':(np.int8,1)}))
dtype(('<i2', [('x', '|i1'), ('y', '|i1')]))

Using dictionaries.  Two fields named 'gender' and 'age':

>>> np.dtype({'names':['gender','age'], 'formats':['S1',np.uint8]})
dtype([('gender', '|S1'), ('age', '|u1')])

Offsets in bytes, here 0 and 25:

>>> np.dtype({'surname':('S25',0),'age':(np.uint8,25)})
dtype([('surname', '|S25'), ('age', '|u1')])


Methods:

  newbyteorder  --  newbyteorder(new_order='S')
dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
               ('Height', 'f'), ('Children/Pets', 'i4', 2)])
s = np.array([('Smith', 45, 1.83, (0, 1)),
              ('Jones', 53, 1.72, (2, 2))], dtype=dt)
s
array([(b'Smith', 45,  1.83000004, [0, 1]),
       (b'Jones', 53,  1.72000003, [2, 2])],
      dtype=[('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'), ('Children/Pets', '<i4', (2,))])
s['Name']
array([b'Smith', b'Jones'],
      dtype='|S10')
s['Age']
array([45, 53])
s["Height"].mean()
1.7750001
s[1]
(b'Jones', 53,  1.72000003, [2, 2])
s[1]['Age']
53
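
Because each field behaves like an ordinary array, fields can be used for filtering and sorting. This sketch uses a reduced version of the dtype above (without the Children/Pets field); the names person and people are illustrative.

```python
import numpy as np

person = np.dtype([('Name', 'S10'), ('Age', 'i4'), ('Height', 'f4')])
people = np.array([(b'Smith', 45, 1.83),
                   (b'Jones', 53, 1.72)], dtype=person)

# Boolean mask built from one field selects whole records.
older = people[people['Age'] > 50]
print(older['Name'])  # [b'Jones']

# Sort the records by a named field.
by_height = np.sort(people, order='Height')
print(by_height['Name'])  # [b'Jones' b'Smith']
```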

Code vectorization

r = np.array([[1,2,3],[2,3,4],[3,4,5],[4,5,6]])
s = np.array([[2,3,4],[3,4,5],[4,5,6],[6,7,8]])

Simple mathematical operations

r + s    
array([[ 3,  5,  7],
       [ 5,  7,  9],
       [ 7,  9, 11],
       [10, 12, 14]])
r * s
array([[ 2,  6, 12],
       [ 6, 12, 20],
       [12, 20, 30],
       [24, 35, 48]])
r % s
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]], dtype=int32)
s // r
array([[2, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]], dtype=int32)
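
To make clear what vectorization replaces, the sketch below computes r + s twice: once with explicit Python loops and once with the vectorized expression, then checks that both agree. The name looped is illustrative.

```python
import numpy as np

r = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]])
s = np.array([[2, 3, 4], [3, 4, 5], [4, 5, 6], [6, 7, 8]])

# Explicit element-by-element loop.
looped = np.empty_like(r)
for i in range(r.shape[0]):
    for j in range(r.shape[1]):
        looped[i, j] = r[i, j] + s[i, j]

# Vectorized equivalent: one expression, no Python-level loop.
vectorized = r + s

print(np.array_equal(looped, vectorized))  # True
```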

Broadcasting support

For more, see http://www.cnblogs.com/lyon2014/p/4696989.html

r
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]])
2 * r + 3
array([[ 5,  7,  9],
       [ 7,  9, 11],
       [ 9, 11, 13],
       [11, 13, 15]])
f = np.array([9,8,7])
f
array([9, 8, 7])
r + f
array([[10, 10, 10],
       [11, 11, 11],
       [12, 12, 12],
       [13, 13, 13]])
# r.T is the transpose, the same as r.transpose()
np.shape(r.T)
(3, 4)
def f(x):
    return 3 * x + 5
f(r.T)
array([[ 8, 11, 14, 17],
       [11, 14, 17, 20],
       [14, 17, 20, 23]])
np.sin(r)
array([[ 0.84147098,  0.90929743,  0.14112001],
       [ 0.90929743,  0.14112001, -0.7568025 ],
       [ 0.14112001, -0.7568025 , -0.95892427],
       [-0.7568025 , -0.95892427, -0.2794155 ]])
np.sin(np.pi)
1.2246467991473532e-16
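
Broadcasting compares shapes starting from the last axis, so a length-3 vector combines with a (4, 3) array while a length-4 vector does not unless it is given an extra axis. A minimal sketch with illustrative arrays:

```python
import numpy as np

r = np.arange(12).reshape(4, 3)
row = np.array([9, 8, 7])      # shape (3,): matches the last axis of r
col = np.array([1, 2, 3, 4])   # shape (4,): does not match the last axis

print((r + row).shape)  # (4, 3)

try:
    r + col
except ValueError as err:
    print('shape mismatch:', err)

# Turning col into a (4, 1) column makes the shapes compatible.
print((r + col[:, np.newaxis]).shape)  # (4, 3)
```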

ufunc

http://docs.scipy.org/doc/numpy/reference/ufuncs.html
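
Functions such as np.add and np.sin are instances of np.ufunc, and binary ufuncs also carry methods such as reduce, accumulate, and outer. A short sketch with illustrative values:

```python
import numpy as np

print(isinstance(np.add, np.ufunc))  # True

values = np.array([1, 2, 3, 4])
print(np.add.reduce(values))      # 10, same as values.sum()
print(np.add.accumulate(values))  # [ 1  3  6 10], same as values.cumsum()

# outer applies the operation to every pair of elements.
table = np.multiply.outer([1, 2], [3, 4])
print(table)
```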

Memory Layout

x = np.random.standard_normal((5, 10000000))
y = 2 * x + 3  # linear equation y = a * x + b
C = np.array((x, y), order='C')
F = np.array((x, y), order='F')
x = 0.0; y = 0.0  # memory clean-up
C[:2].round(2)
array([[[ 0.67,  0.29,  1.54, ...,  0.07,  2.64, -0.65],
        [ 0.4 , -0.63,  1.43, ...,  1.11,  0.93, -0.52],
        [-0.41,  2.23, -1.16, ..., -1.66,  0.07,  0.21],
        [ 1.46,  1.22,  0.2 , ..., -0.56,  2.36, -1.65],
        [-0.39,  1.73, -0.24, ..., -1.45,  0.43, -0.41]],

       [[ 4.34,  3.58,  6.08, ...,  3.15,  8.28,  1.69],
        [ 3.79,  1.73,  5.86, ...,  5.22,  4.87,  1.97],
        [ 2.17,  7.46,  0.67, ..., -0.32,  3.15,  3.42],
        [ 5.93,  5.44,  3.4 , ...,  1.89,  7.72, -0.3 ],
        [ 2.22,  6.46,  2.51, ...,  0.1 ,  3.85,  2.18]]])
%timeit C.sum()
135 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit F.sum()
134 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

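When summing all elements of the array, the two memory layouts show no significant difference. The following cases, however, differ markedly.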

%timeit C[0].sum(axis=0)
128 ms ± 894 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit C[0].sum(axis=1)
66.5 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit F.sum(axis=0)
1.06 s ± 48.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit F.sum(axis=1)
2.12 s ± 35.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
F = 0.0; C = 0.0  # memory clean-up

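From the results above:
Operating on a few large vectors performs better than operating on many small vectors.
The elements of the few large vectors are stored in adjacent memory locations, which explains the relative performance advantage.
However, compared with the C-like variant, the operations as a whole are much slower.

Choosing a suitable memory layout can speed up code execution considerably, in some cases by two orders of magnitude or more.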
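
The comparison can be repeated at a smaller scale without IPython's %timeit. The sketch below uses the standard timeit module and a much smaller array than the one above, so the absolute numbers are not comparable with the timings shown; the relative differences also depend on the machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((2, 5, 100_000))

C = np.ascontiguousarray(data)  # row-major (C-like) layout
F = np.asfortranarray(data)     # column-major (Fortran-like) layout

repeats = 20
for name, arr in (('C', C), ('F', F)):
    for axis in (0, 1, 2):
        # Time the same reduction on each layout and axis.
        seconds = timeit.timeit(lambda: arr.sum(axis=axis), number=repeats)
        print(f'{name}.sum(axis={axis}): {seconds / repeats * 1e3:.3f} ms')
```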

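Conclusion:

  1. The basic data types (integers, floats, strings) provide the primitive data types.
  2. The standard data structures (tuples, lists, dicts, sets) provide a variety of operations on collections of data.
  3. Arrays (the numpy.ndarray class) provide vectorized operations, which make code more concise, convenient, and performant.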

Posted 2017-09-10 12:32 by xinet