数据分析三剑客

一.Numpy

1.使用np.array()创建数组

import numpy as np

np.array([1, 2, 3, 4, 5]) # 创建一个一维数组
np.array([[1,2,3,4], [5,6,7,8]]) # 创建一个二维数组

# 注意:
(1)numpy默认ndarray的所有元素的类型是相同的
(2)如果传进来的列表中包含不同的类型, 则统一为同一类型, 优先级: str > float > int

2.使用np的函数创建数组

import numpy as np


# np.ones(shape, dtype=None, order='C')
np.ones(shape=(5, 6), dtype=int) # 创建一个元素数值为1, 形状为5行5列的二维数组

# np.zeros(shape, dtype=None, order='C')
np.zeros((5, 5)) # 创建一个元素数值为0, 形状为5行5列的二维数组

# np.full(shape, fill_value, dtype=None, order='C')
np.full((5,5),fill_value=999) # 创建一个元素数值为999, 形状为五行五列的二维数组

# np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
np.linspace(1,100,num=50) # 创建一个元素数值范围在1到100, 元素个数为50的一维数组

# np.arange([start, ]stop, [step, ]dtype=None)
np.arange(0, 100, 2) #创建一个 元素数值范围为0到100,元素数值步长为2 的一维数组

# np.random.randint(low, high=None, size=None, dtype='l')
np.random.seed(1) # 使用random.seed()方法固定随机性
np.random.randint(0, 100, size=(5, 6)) # 生成一个5行6列的二维数组, 每一个元素的值是0到100的随机数

# np.random.randn(d0, d1, ..., dn)
np.random.seed(1) # 固定随机性
np.random.randn(4, 5, 6) # 标准正太分布

# np.random.random(size=None)
np.random.seed(1) # 固定随机性
np.random.random(size=(3, 3)) # 生成0到1的随机数, 左闭右开

3.ndarray的属性

4个必须记住的参数:
- ndim: 维度
- shape: 形状(各维度的长度)
- size: 总长度
- dtype: 元素类型

# 使用matplotlib.pyplot获取一个numpy数组,数据来源于一张图片
import matplotlib.pyplot as plt
img_arr = plt.imread('./wnh.jpg') # 读取图片数据,返回一个三维数组
plt.imshow(img_arr) # 显示该图片


img_arr.shape	# 数组的形状
img_arr.size	# 数组的总长度
img_arr.dtype 	# 数组元素的类型
img_arr.ndim 	# 数组的维度(几维数组)

4.ndarray的基本操作

首先创建一个数组, 如下所示:

import numpy as np

np.random.seed(1)
arr = np.random.randint(0, 100, size=(5, 5)) # 创建一个元素范围是0到100,形状是5行5列的随机数组
arr

##执行结果如下:
array([[37, 12, 72,  9, 75],
       [ 5, 79, 64, 16,  1],
       [76, 71,  6, 25, 50],
       [20, 18, 84, 11, 28],
       [29, 14, 50, 68, 87]])

(1)索引

# 根据索引查看元素. 注意:索引从0开始计数.
arr[0][0]


##执行结果如下:
37

# 根据索引修改元素数值
arr[0][0] = 666
arr


##执行结果如下:
array([[666,  12,  72,   9,  75],
       [  5,  79,  64,  16,   1],
       [ 76,  71,   6,  25,  50],
       [ 20,  18,  84,  11,  28],
       [ 29,  14,  50,  68,  87]])

(2)切片

# 查看当前数组
arr


##执行结果如下:
array([[666,  12,  72,   9,  75],
       [  5,  79,  64,  16,   1],
       [ 76,  71,   6,  25,  50],
       [ 20,  18,  84,  11,  28],
       [ 29,  14,  50,  68,  87]])

# 获取二维数组前两行
arr[0:2]


##执行结果如下:
array([[666,  12,  72,   9,  75],
       [  5,  79,  64,  16,   1]])

# 获取二位数组前两列
arr[:,0:2]


##执行结果如下:
array([[666,  12],
       [  5,  79],
       [ 76,  71],
       [ 20,  18],
       [ 29,  14]])

注意: 逗号左边是行切片, 右边是列切片

# 获取二维数组前两行和前两列的数据
arr[0:2, 0:2]


##执行结果如下:
array([[666,  12],
       [  5,  79]])

# 将数组的行 倒序排列
arr[::-1,::]


##执行结果如下:
array([[ 29,  14,  50,  68,  87],
       [ 20,  18,  84,  11,  28],
       [ 76,  71,   6,  25,  50],
       [  5,  79,  64,  16,   1],
       [666,  12,  72,   9,  75]])

# 将数组的列 倒序排列
arr[::, ::-1]


##执行结果如下:
array([[ 75,   9,  72,  12, 666],
       [  1,  16,  64,  79,   5],
       [ 50,  25,   6,  71,  76],
       [ 28,  11,  84,  18,  20],
       [ 87,  68,  50,  14,  29]])

# 将数组的行和列 都进行倒序排列
arr[::-1, ::-1]


##执行结果如下:
array([[87, 68, 50, 14, 29],
       [28, 11, 84, 18, 20],
       [50, 25,  6, 71, 76],
       [ 1, 16, 64, 79,  5],
       [75,  9, 72, 12, 37]])

(3)变形

将一维数组变形成多维数组

import numpy as np
arr_1 = np.arange(0, 100, 4) # 创建一个元素范围为0到100,元素步长为4的一维数组

arr_1.reshape((-1, 5)) # 将一维数组变形成多维数组

##执行结果如下:
array([[ 0,  4,  8, 12, 16],
       [20, 24, 28, 32, 36],
       [40, 44, 48, 52, 56],
       [60, 64, 68, 72, 76],
       [80, 84, 88, 92, 96]])

将多维数组变形成一维数组

import numpy as np
np.random.seed(1) # 固定随机性
arr_2 = np.random.randint(0, 100, size=(5, 5)) # 创建一个5行5列的随机数组

arr_2 = arr_2.reshape((25,)) # 将多维数组变形成一维数组
arr_2

##执行结果如下:
array([37, 12, 72,  9, 75,  5, 79, 64, 16,  1, 76, 71,  6, 25, 50, 20, 18, 84, 11, 28, 29, 14, 50, 68, 87])

(4)级联

语法: np.concatenate((arr1, arr2), axis=1)

参数:

arr1, arr2表示要级联的数组
axis=1表示横向级联, axis=0表示纵向级联

横向级联

import numpy as np
np.random.seed(1) # 固定随机性
arr = np.random.randint(0, 100, size=(5, 5)) # 创建一个5行5列的随机数组
arr


##执行结果如下:
array([[37, 12, 72,  9, 75],
       [ 5, 79, 64, 16,  1],
       [76, 71,  6, 25, 50],
       [20, 18, 84, 11, 28],
       [29, 14, 50, 68, 87]])

np.concatenate((arr, arr), axis=1) # 将数组arr进行横向级联


##执行结果如下:
array([[37, 12, 72,  9, 75, 37, 12, 72,  9, 75],
       [ 5, 79, 64, 16,  1,  5, 79, 64, 16,  1],
       [76, 71,  6, 25, 50, 76, 71,  6, 25, 50],
       [20, 18, 84, 11, 28, 20, 18, 84, 11, 28],
       [29, 14, 50, 68, 87, 29, 14, 50, 68, 87]])

纵向级联

np.random.seed(6) # 固定随机性
arr1 = np.random.randint(0, 100, size=(4, 5)) # 创建一个4行5列的随机数组
arr1


##执行结果如下:
array([[10, 73, 99, 84, 79],
       [80, 62, 25,  1, 75],
       [77, 57, 26, 33, 68],
       [33,  8,  2, 76, 84]])

np.concatenate((arr, arr1), axis=0) # 将数组arr和arr1进行纵向级联


##执行结果如下:
array([[37, 12, 72,  9, 75],
       [ 5, 79, 64, 16,  1],
       [76, 71,  6, 25, 50],
       [20, 18, 84, 11, 28],
       [29, 14, 50, 68, 87],
       [10, 73, 99, 84, 79],
       [80, 62, 25,  1, 75],
       [77, 57, 26, 33, 68],
       [33,  8,  2, 76, 84]])

级联需要注意的点:
- 级联的参数是列表: 一定要加中括号或小括号
- 维度必须相同
- 形状相符: 在维度保持一致的前提下, 如果进行横向(axis=1)级联, 必须保证进行级联的数组的行数保持一致. 如果进行纵向(axis=0)级联, 必须保证进行级联的数组的列数保持一致
- 可以通过axis参数改变级联的方向

(5)切分

与级联类似, 三个函数完成切分工作:
- np.split(arr, 行/列号, 轴)
  - 注意: 参数2是一个列表类型
- np.vsplit()
- np.hsplit()

(6)副本

import numpy as np
np.random.seed(1) # 固定随机性
arr = np.random.randint(0, 100, size=(5, 5)) # 创建一个5行5列的随机数组
arr


##执行结果如下:
array([[37, 12, 72,  9, 75],
       [ 5, 79, 64, 16,  1],
       [76, 71,  6, 25, 50],
       [20, 18, 84, 11, 28],
       [29, 14, 50, 68, 87]])

a = arr.copy() # 可以使用copy()函数创建副本,这样做不会对原数据造成影响
a[2][2] = 666 # 修改副本的数据不会对原数据造成影响
arr


##执行结果如下:
array([[37, 12, 72,  9, 75],
       [ 5, 79, 64, 16,  1],
       [76, 71,  6, 25, 50],
       [20, 18, 84, 11, 28],
       [29, 14, 50, 68, 87]])

a


##执行结果如下:
array([[ 37,  12,  72,   9,  75],
       [  5,  79,  64,  16,   1],
       [ 76,  71, 666,  25,  50],
       [ 20,  18,  84,  11,  28],
       [ 29,  14,  50,  68,  87]])

5.ndarray的聚合操作

(1)求和`np.sum()`

import numpy as np
np.random.seed(1) # 固定随机性
arr = np.random.randint(0, 100, size=(5, 5)) # 创建一个5行5列的随机数组
arr


##执行结果如下:
array([[37, 12, 72,  9, 75],
       [ 5, 79, 64, 16,  1],
       [76, 71,  6, 25, 50],
       [20, 18, 84, 11, 28],
       [29, 14, 50, 68, 87]])

axis=1 对所有行求和

arr.sum(axis=1) # axis=1 对所有行求和


##执行结果如下:
array([205, 186, 228, 161, 248])

axis=0 对所有列求和

arr.sum(axis=0) # axis=0 对所有列求和


##执行结果如下:
array([167, 215, 276, 129, 241])

(2)最大最小值: `np.max()`和`np.min()`

arr.max(axis=1) # 求每一行的最大值

arr.max(axis=0) # 求每一列的最大值

arr.min(axis=1) # 求每一行的最小值

arr.min(axis=0) # 求每一列的最小值

(3)平均值: `np.mean()`

arr.mean(axis=1) # 求每一行的平均值

arr.mean(axis=0) # 求每一列的平均值

(4)其他聚合操作

Function Name	NaN-safe Version	Description
np.sum	np.nansum	Compute sum of elements
np.prod	np.nanprod	Compute product of elements
np.mean	np.nanmean	Compute mean of elements
np.std	np.nanstd	Compute standard deviation
np.var	np.nanvar	Compute variance
np.min	np.nanmin	Find minimum value
np.max	np.nanmax	Find maximum value
np.argmin	np.nanargmin	Find index of minimum value
np.argmax	np.nanargmax	Find index of maximum value
np.median	np.nanmedian	Compute median of elements
np.percentile	np.nanpercentile	Compute rank-based statistics of elements
np.any	N/A	Evaluate whether any elements are true
np.all	N/A	Evaluate whether all elements are true
np.power		幂运算

6.ndarray的排序

快速排序

np.sort()与ndarray.sort()都可以, 但有区别:

np.sort()不改变输入
ndarray.sort()本地处理, 不占用空间, 但改变输入

np.sort(arr, axis=1) # 对每一行进行升序排序

np.sort(arr, axis=0) # 对每一列进行升序排序

二.Pandas

1.Series

Series是一种类似于一维数组的对象, 由下面两个部分组成:

values: 一组数据(ndarray类型)
index: 相关的数据索引标签

(1)Series的创建

a.由列表或numpy数组创建

# 导入pandas三剑客:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

# 使用列表创建Series
s = Series(data=[1, 2, 3], index=['a', 'b', 'c'])
s


##执行结果:
a    1
b    2
c    3
dtype: int64

# 使用numpy创建Series
s1 = Series(data=np.random.randint(0, 100, size=(3,)))
s1


##执行结果:
0    32
1    25
2    30
dtype: int32

s.index # 查看索引
s.values # 查看值

b. 由字典创建: 不能使用index, 但是依然存在默认索引

# 使用字典创建Series
dic = {
    'name': '王力宏',
    'gender': '男',
}
s2 = Series(dic)
s2


##执行结果:
gender      男
name      王力宏
dtype: object

小练习: 使用多种方法创建以下Series，命名为s1

语文 150
数学 150
英语 150
理综 300

dic = {
    '语文': 150,
    '数学': 150,
    '英语': 150,
    '理综': 300,
}

s1 = Series(dic)
s1

##执行结果:
数学    150
理综    300
英语    150
语文    150
dtype: int64

s2 = Series(data=[150, 150, 150, 300], index=['语文', '数学', '英语', '理综'])
s2


##执行结果:
语文    150
数学    150
英语    150
理综    300
dtype: int64

(2)Series的索引和切片

显示索引:

使用index中的元素作为索引值
使用s.loc[] (推荐)
- 注意: loc中括号里面放置的一定是显示索引
- 注意: 此时是闭区间

s2['语文']

##执行结果:
150

s2.loc['理综']

##执行结果:
300

隐式索引:

使用整数作为索引值
使用.iloc (推荐)
- 注意: iloc的中括号里面必须放置隐式索引
- 注意: 此时是半开区间

s2[1]

##执行结果:
150

s2.iloc[1]

##执行结果:
150

索引切片:

dic = {
    'math':100,
    'English':88
}
s3 = Series(data=dic)
s3

##执行结果:
English     88
math       100
dtype: int64

s3.iloc[0:1]

##执行结果:
English    88
dtype: int64

(3)Series的基本概念

可以把Series看成是一个定长的有序字典
向Series增加一行: 相当于给字典增加一组键值对

dic = {
    'math':100,
    'English':88
}
s3 = Series(data=dic)

s3['Chinese'] = 100
s3


##执行结果:
English     88
math       100
Chinese    100
dtype: int64

可以通过shape, size, index, values等得到series的属性

s3.shape

##执行结果:
(3,)

s3.size

##执行结果:
3

s3.index

##执行结果:
Index(['English', 'math', 'Chinese'], dtype='object')

s3.values

##执行结果:
array([ 88, 100, 100], dtype=int64)

可以使用s.head(), s.tail()分别查看前n个和后n个值

s3.head(2) # 查看前2个值

##执行结果:
English     88
math       100
dtype: int64

s3.tail(2) # 查看后2个值

##执行结果:
math       100
Chinese    100
dtype: int64

对Series元素进行去重: unique()

s = Series([1,1,2,2,3,3,3,3,3,4,5,6,5,5,3,5,1])
s.unique()


##执行结果:
array([1, 2, 3, 4, 5, 6], dtype=int64)

使得两个Series进行相加

s1 = Series(data=[1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])
s2 = Series(data=[1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])
s = s1 + s2
s

##执行结果:
A     2
B     4
C     6
D     8
E    10
dtype: int64

当索引没有对应的值时, 可能出现缺失数据显示NaN(not a number)的情况

s1 = Series(data=[1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])
s2 = Series(data=[1, 2, 3, 4, 5], index=['A', 'B', 'C', 'F', 'E'])
s = s1 + s2
s

##执行结果:
A     2.0
B     4.0
C     6.0
D     NaN
E    10.0
F     NaN
dtype: float64

可以使用pd.isnull(), pd.notnull(), 或s.isnull(), s.notnull()函数检测缺失的数据

s.notnull()

##执行结果:
A     True
B     True
C     True
D    False
E     True
F    False
dtype: bool

s[['A', 'B', 'C']] # 显示索引查看值

##执行结果:
A    2.0
B    4.0
C    6.0
dtype: float64

s[[0, 1]] # 隐式索引查看值

##执行结果:
A    2.0
B    4.0
dtype: float64

s[[True, False, True, True, True, False]] # True表示显示该值,False表示不显示

##执行结果:
A     2.0
C     6.0
D     NaN
E    10.0
dtype: float64

s[s.notnull()] # 这个语法表示: 显示s中所有非空的值

##执行结果:
s[s.notnull()] # 这个语法表示: 显示s中所有非空的值
s[s.notnull()] # 这个语法表示: 显示s中所有非空的值
A     2.0
B     4.0
C     6.0
E    10.0
dtype: float64

(4)Series的运算

+, -, *, /

略

add(), sub(), mul(), div()

add()的语法是这样的:
- s1.add(s2, fill_value=0)

s1 = Series(data=[1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])
s2 = Series(data=[1, 2, 3, 4, 5], index=['A', 'B', 'C', 'F', 'E'])

s1.add(s2)

##执行结果:
A     2.0
B     4.0
C     6.0
D     NaN
E    10.0
F     NaN
dtype: float64

Series之间的运算

在运算中自动对齐不同索引的数据
如果索引不对应, 则自动补NaN

2.DataFrame

DataFrame是一个表格型的数据结构. DataFrame由按照一定顺序排列的多列数据组成. 设计初衷是将Series的使用场景从一维拓展到多维. DataFrame既有行索引, 也有列索引.

行索引: index
列索引: columns
值: values

(1)DataFrame的创建

最常用的方法是传递一个字典来创建. DataFrame以字典的键作为每一列的名称, 以字典的值(一个数组)作为每一列.

此外, DataFrame会自动加上每一行的索引.

使用字典创建DataFrame后, 则columns参数将不可被使用.

同Series一样, 若传入的列与字典的键不匹配, 则相应的值为NaN.

使用ndarray创建DataFrame

df = DataFrame(data=np.random.randint(60, 120, size=(5, 5)), index=['a', 'b', 'c', 'd', 'e'])
df

DataFrame属性: values, columns, index, shape

df.values

df.columns

df.index

df.shape

小练习:

dic = {
    '张三': [130, 130, 130, 230],
    '李四': [140, 140, 140, 240],
    '王五': [150, 150, 150, 250],
}
df = DataFrame(data=dic, index=['语文', '数学', '英语', '理综'])
df

(2)DataFrame的索引

1.对列进行索引

通过类似字典的方式: df['q']
通过属性的方式: df.q

可以将DataFrame的列获取为一个Series. 返回的Series拥有原DataFrame相同的索引, 且name属性也已经设置好了, 就是相应的列名.

2.对行进行索引

使用 .loc[] 加index来进行行索引
使用 .iloc[] 加整数来进行行索引

同样返回一个Series, index为原来的columns

3.对元素索引的方法

使用列索引
使用行索引(iloc[3, 1] or loc['C', 'q'])
- 注意: 行索引在前, 列索引在后

4.切片

注意: 使用中括号时
- 索引表示的是列索引
- 切片表示的是行切片

(3)DataFrame的运算

(1)DataFrame之间的运算

同Series一样:

在运算中自动对齐不同索引的数据
如果索引不对应, 则不NaN

略

posted @ 2019-03-28 19:21 咕噜噜~ 阅读(862) 评论(0) 编辑收藏举报

刷新页面返回顶部

咕噜噜~

守正笃实久久为功

数据分析三剑客

一.Numpy

1.使用np.array()创建数组

2.使用np的函数创建数组

3.ndarray的属性

4.ndarray的基本操作

(1)索引

(2)切片

(3)变形

(4)级联

横向级联

纵向级联

(5)切分

(6)副本

5.ndarray的聚合操作

(1)求和`np.sum()`

(2)最大最小值: `np.max()`和`np.min()`

(3)平均值: `np.mean()`

(4)其他聚合操作

6.ndarray的排序

快速排序

二.Pandas

1.Series

(1)Series的创建

a.由列表或numpy数组创建

b. 由字典创建: 不能使用index, 但是依然存在默认索引

(2)Series的索引和切片

(3)Series的基本概念

(4)Series的运算

2.DataFrame

(1)DataFrame的创建

(2)DataFrame的索引

(3)DataFrame的运算

公告

咕噜噜~

守正笃实 久久为功

数据分析三剑客

一.Numpy

1.使用np.array()创建数组

2.使用np的函数创建数组

3.ndarray的属性

4.ndarray的基本操作

(1)索引

(2)切片

(3)变形

(4)级联

横向级联

纵向级联

(5)切分

(6)副本

5.ndarray的聚合操作

(1)求和np.sum()

(2)最大最小值: np.max()和np.min()

(3)平均值: np.mean()

(4)其他聚合操作

6.ndarray的排序

快速排序

二.Pandas

1.Series

(1)Series的创建

a.由列表或numpy数组创建

b. 由字典创建: 不能使用index, 但是依然存在默认索引

(2)Series的索引和切片

(3)Series的基本概念

(4)Series的运算

2.DataFrame

(1)DataFrame的创建

(2)DataFrame的索引

(3)DataFrame的运算

公告

守正笃实久久为功

(1)求和`np.sum()`

(2)最大最小值: `np.max()`和`np.min()`

(3)平均值: `np.mean()`