Common Libraries for Data Analysis (numpy, pandas, matplotlib, scipy)

Overview

numpy

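numpy (Numerical Python) is an open-source numerical computing library for Python, used mainly for array and matrix computation. Its core is implemented in C, so array operations run far faster than equivalent pure-Python loops. numpy provides two notable data types:

1) ndarray (N-dimensional array; this is the one to focus on)

2) matrix (a 2-D matrix type; NumPy's own documentation now recommends using ndarray instead)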

scipy

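scipy is an algorithm library and mathematical toolkit built on numpy. Its modules cover optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation solvers, and other computations common in science and engineering.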

pandas

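pandas is built on numpy and adds a large set of data manipulation features, including statistics, grouping, sorting, and pivot tables. It can cover most of what Excel is typically used for.

pandas has two important data types:

1) Series (one-dimensional sequence)

2) DataFrame (two-dimensional table)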

matplotlib

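Matplotlib is a Python plotting library designed to produce powerful visualizations in a simple way. A few lines of code are enough to generate line plots, histograms, power spectra, bar charts, error bar charts, scatter plots, and more. It is one of the core libraries to learn when studying Python.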

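NumPy Quick Start

Creating an ndarray

There are three common ways:

1) Convert from a basic Python object.

2) Generate with a built-in function.

3) Load data from disk.

# Convert from a basic Python object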
import numpy as np

li = [1.,28,10]

nd = np.array(li)
print(nd)

print(type(nd))
print(nd.dtype)


[ 1. 28. 10.]
<class 'numpy.ndarray'>
float64


# Generate with built-in functions

nd0 = np.zeros((3,4))
nd1 = np.ones((3,4))
nd2 = np.random.randint(1,100,(2,3))
nd3 = np.tile(8,(2,3))

print(nd0)
print(nd1)
print(nd2)
print(nd3)
print(nd3.dtype)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[[93 73 49]
 [ 2 75 94]]
[[8 8 8]
 [8 8 8]]
int32

# Load data from disk

nd4 = np.loadtxt("./data/datingTestSet.txt",skiprows=1,usecols=[0,1,2])   # note: loadtxt returns float64 by default, even for columns stored as integers in the source file
print(nd4)

[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
 [1.4488000e+04 7.1534690e+00 1.6739040e+00]
 [2.6052000e+04 1.4418710e+00 8.0512400e-01]
 [2.6575000e+04 1.0650102e+01 8.6662700e-01]
 [4.8111000e+04 9.1345280e+00 7.2804500e-01]
 [4.3757000e+04 7.8826010e+00 1.3324460e+00]]
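The data file used above is not included here, so the following is a minimal sketch of the same float64 conversion using an in-memory text buffer. The column names and values are made up for illustration; only `np.loadtxt` and its `skiprows`, `usecols`, and `dtype` parameters come from NumPy itself.

```python
import io
import numpy as np

# Hypothetical stand-in for a data file: one header row, then three integer columns.
text = "id miles games\n1 2 3\n4 5 6\n"

# Default behaviour: every value is parsed as float64, even though the text holds integers.
as_float = np.loadtxt(io.StringIO(text), skiprows=1, usecols=[0, 1, 2])
print(as_float.dtype)

# Passing dtype explicitly keeps the integers as integers.
as_int = np.loadtxt(io.StringIO(text), skiprows=1, usecols=[0, 1, 2], dtype=np.int64)
print(as_int.dtype)
```

If a file mixes integer and fractional columns, as the sample output above suggests, leaving the default float64 is usually the simpler choice.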

ndarray attributes

An ndarray object has the following commonly used attributes:

ndim, shape, dtype, size, itemsize

x = np.array([[1, 2, 3],
              [3, 4, 5]])

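# Number of dimensions (display() is available in IPython/Jupyter; use print() in a plain script)
display(x.ndim)

# Shape of the array (the length along each dimension), returned as a tuple.
display(x.shape)

# Element type of the array.
display(x.dtype)

# Total number of elements in the array.
display(x.size)

# Size of a single element, in bytes.
display(x.itemsize)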


2
(2, 3)
dtype('int32')
6
4
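The `int32` and `4` shown above depend on the platform's default integer type. As a small sketch of how these attributes relate to each other, the example below fixes the dtype explicitly (an assumption made here so the numbers are the same everywhere) and checks that the total memory of the element data is `size * itemsize`.

```python
import numpy as np

# dtype is fixed explicitly so the result does not depend on the platform default.
x = np.array([[1, 2, 3],
              [3, 4, 5]], dtype=np.int64)

print(x.ndim, x.shape, x.size)  # dimensions, shape tuple, element count
print(x.itemsize)               # bytes per element
print(x.nbytes)                 # total bytes used by the element data

total = x.size * x.itemsize     # should match x.nbytes
```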

Characteristics of ndarray

All elements share one data type,
operations are vectorized

nd5 = np.array([[1,2,3],
                [4,5,6]])

# Scalar operations are applied to every element
nd6 = np.ones((2,3))/2
#          [[0.5 0.5 0.5]
#           [0.5 0.5 0.5]]

print(nd5 * 2)
print(nd5 ** 2)

# Vectorized operations work element by element, with no explicit loop
print(nd5 + nd6)
print(nd5 * nd6)


[[ 2  4  6]
 [ 8 10 12]]
[[ 1  4  9]
 [16 25 36]]
[[1.5 2.5 3.5]
 [4.5 5.5 6.5]]
[[0.5 1.  1.5]
 [2.  2.5 3. ]]
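The two characteristics can be seen directly in a short sketch. The variable names are illustrative only: mixed inputs are converted to one common dtype, and a vectorized expression gives the same result as an explicit Python loop.

```python
import numpy as np

# One dtype per array: mixed inputs are converted to a common type.
mixed_numbers = np.array([1, 2.5, 3])   # ints are promoted to float64
mixed_text = np.array([1, "a"])         # everything becomes a string
print(mixed_numbers.dtype, mixed_text.dtype)

# Vectorized addition versus an explicit element-by-element loop.
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
looped = np.array([ai + bi for ai, bi in zip(a, b)])
vectorized = a + b
print(vectorized)
```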

Accessing an ndarray

Indexing,
slicing

import numpy as np
nd7 = np.random.randint(1,100,(5,6))

# Indexing
display(nd7)

## List-style access
display(nd7[0])
display(nd7[0][2])

## Matrix-style access
display(nd7[2,1])

## Integer-array indexing
display(nd7[[1,3]])

## Boolean-array indexing
display(nd7[[True,False,True,False,True]])


'''
array([[10, 82, 37, 63,  6, 77],
       [55, 22,  1, 83, 80, 58],
       [11, 59, 26, 98, 70, 11],
       [42, 94, 57, 65, 46, 89],
       [78, 87, 23, 19, 79, 38]])
array([10, 82, 37, 63,  6, 77])
37
59
array([[55, 22,  1, 83, 80, 58],
       [42, 94, 57, 65, 46, 89]])
array([[10, 82, 37, 63,  6, 77],
       [11, 59, 26, 98, 70, 11],
       [78, 87, 23, 19, 79, 38]])
'''
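In practice the boolean array is rarely typed by hand; it usually comes from a comparison. A minimal sketch with made-up values:

```python
import numpy as np

scores = np.array([[10, 82, 37],
                   [55, 22,  1]])

# A comparison produces a boolean array of the same shape.
mask = scores > 30

# Indexing with that mask returns the matching elements as a 1-D array.
selected = scores[mask]

# A 1-D mask built from one column selects whole rows.
rows = scores[scores[:, 0] > 30]

print(selected)
print(rows)
```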

# Slicing
print(nd7)

## Rows 0 up to, but not including, row 2
print(nd7[0:2])

## Rows 0 up to, but not including, row 2, with all columns
print(nd7[0:2,:])

## Row 1
print(nd7[1,:])

## Column 2
print(nd7[:,2])

## All rows, every column except the last
print(nd7[:,:-1])

'''
[[10 82 37 63  6 77]
 [55 22  1 83 80 58]
 [11 59 26 98 70 11]
 [42 94 57 65 46 89]
 [78 87 23 19 79 38]]
[[10 82 37 63  6 77]
 [55 22  1 83 80 58]]
[[10 82 37 63  6 77]
 [55 22  1 83 80 58]]
[55 22  1 83 80 58]
[37  1 26 57 23]
[[10 82 37 63  6]
 [55 22  1 83 80]
 [11 59 26 98 70]
 [42 94 57 65 46]
 [78 87 23 19 79]]
'''
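One detail worth knowing about slices: a basic slice is a view onto the original data, while indexing with an integer array produces a copy. The sketch below uses illustrative names and `np.shares_memory` to show the difference.

```python
import numpy as np

base = np.arange(12).reshape(3, 4)

# A basic slice is a view: writing through it changes the original array.
view = base[0:2, :]
view[0, 0] = 100

# Integer-array indexing returns a copy: writing to it leaves the original alone.
copied = base[[0, 1]]
copied[0, 1] = -1

print(base[0])                          # first element is now 100, second is still 1
print(np.shares_memory(base, view))     # the slice shares the original buffer
print(np.shares_memory(base, copied))   # the fancy-indexed result does not
```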

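Flattening an ndarray

np.ravel/ravel

flatten

We can flatten an array by calling ravel or flatten.
The difference is that ravel returns a view of the original array whenever possible (and copies only when it must), whereas flatten always returns a copy.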

x = np.arange(12).reshape(3, 2, 2)
print(x)

print("***"*10)

y = np.ravel(x)
y[0] = 1000
print(x, y)


'''
[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]
******************************
[[[1000    1]
  [   2    3]]

 [[   4    5]
  [   6    7]]

 [[   8    9]
  [  10   11]]] [1000    1    2    3    4    5    6    7    8    9   10   11]
'''

x = np.arange(12).reshape(3, 2, 2)
print(x)

print("***"*10)

#y = np.ravel(x)
#y[0] = 1000
#print(x, y)

y = x.flatten()
y[0] = 1000
print(x, y)

[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]
******************************
[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]] [1000    1    2    3    4    5    6    7    8    9   10   11]
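The view-versus-copy behaviour can also be checked without modifying any data. The following sketch uses `np.shares_memory`; the transposed case is included to show a situation where ravel has to fall back to a copy because the elements are not contiguous in the requested order.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

flat_view = x.ravel()              # contiguous input: ravel can return a view
flat_from_transpose = x.T.ravel()  # transposed input: ravel has to copy
flat_copy = x.flatten()            # flatten always copies

print(np.shares_memory(x, flat_view))
print(np.shares_memory(x, flat_from_transpose))
print(np.shares_memory(x, flat_copy))
```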

Statistical functions

nd8 = np.random.randint(1,100,(2,3))
print(nd8)
print("max",nd8.max())
print("mean",nd8.mean())
print("sum",nd8.sum())
print("variance",nd8.var())

[[60 40 65]
 [48 32 43]]
max 65
mean 48.0
sum 288
variance 129.66666666666666
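The calls above reduce the whole array to a single number. The same methods accept an `axis` argument to reduce along one dimension only. A short sketch, reusing the values from the sample output above as fixed data:

```python
import numpy as np

nd = np.array([[60, 40, 65],
               [48, 32, 43]])

col_max = nd.max(axis=0)    # one value per column
row_mean = nd.mean(axis=1)  # one value per row
col_sum = nd.sum(axis=0)    # column totals

print(col_max)
print(row_mean)
print(col_sum)
```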

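pandas Quick Start

pandas provides two commonly used data types:

Series

A Series is similar to a one-dimensional NumPy array; think of it as a one-dimensional array with labels.

DataFrame

A DataFrame is a two-dimensional data type. It can be understood as Excel-like tabular data made up of multiple columns, where each column may have a different type.

Because a DataFrame is two-dimensional, it has both a row index and a column index.

Common ways to create a Series:

a list or other iterable,
an ndarray,
a dict,
a scalar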

# Create Series objects; in the printed output the first column is the label (index)
import pandas as pd
import numpy as np

# From a list
s1 = pd.Series([1212, 2, 3, 4])

# From an iterable
s2 = pd.Series(range(10))

# From an ndarray
s3 = pd.Series(np.array([1, 2, 3, 4]))

# From a dict. The keys become the index and the values become the Series values.
s4 = pd.Series({"a":"xy", "b":"34234", "c":"3243"})

# From a scalar; the default index starts at 0.
s5 = pd.Series(33)

# The index parameter sets the index explicitly when creating a Series.
s6 = pd.Series(33, index=["k", "x", "y"])

print(s1)
print(s2)
print(s3)
print(s4)
print(s5)
print(s6)

'''
0    1212
1       2
2       3
3       4
dtype: int64
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
0    1
1    2
2    3
3    4
dtype: int32
a       xy
b    34234
c     3243
dtype: object
0    33
dtype: int64
k    33
x    33
y    33
dtype: int64
'''

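Series features

In use, a Series behaves much like a NumPy array:

It supports broadcasting and vectorized operations.

When several Series are combined, they are first aligned by index. Where an index label has no match, the result is NaN (missing value).

It supports indexing and slicing.

It supports selecting elements with integer arrays and boolean arrays.

Notes:

We can use isnull and notnull, available both as pandas functions and as Series methods, to test whether data is missing.

Besides operators, we can use the arithmetic methods of the Series object (such as add), which let us specify a fill value for missing entries.

Although some NumPy functions also work on a Series, Series and ndarray handle NaN differently: the NumPy computation yields NaN, whereas the Series computation skips NaN.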

s = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])

# Vectorized operation
#print(s * s2)

# Scalar operation
print(s * 5)

# NumPy functions such as mean and sum also work on a Series.
# s has dtype int64, but mean returns a float
print(np.mean(s))

s = pd.Series([1, 2, 3], index=[1, 2, 3])
s2 = pd.Series([4, 5, 6], index=[2, 3, 4])
#print(s)
#print(s2)
# Unlike ndarray arithmetic, Series arithmetic aligns by label; labels without a match produce NaN.
#print(s + s2)

# To avoid NaN, use the Series arithmetic methods instead of the operators.
#print(s.add(s2, fill_value=100))

# Test for missing values.
#s = pd.Series([1, 2, 3, float("NaN"), np.nan])
#print(s.isnull())

# Test for non-missing values.
#print(pd.notnull(s))

# Functions such as np.mean and np.sum behave differently on an ndarray and on a Series:
#   the NumPy computation yields NaN, whereas the Series computation skips NaN
# a = np.array([1, 2, 3, 4, np.nan])
# s = pd.Series([1, 2, 3, 4, np.nan])
# print(np.mean(a))
# print(np.mean(s))


'''
0     5
1    10
2    15
dtype: int64
2.0
'''
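Most of the statements in the block above are commented out, so their results are not shown. The sketch below runs the same alignment, fill value, and NaN examples with the same data. The Series mean is computed with the `mean` method, and `np.nanmean` is included as the NumPy function that skips NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=[1, 2, 3])
s2 = pd.Series([4, 5, 6], index=[2, 3, 4])

# Labels 1 and 4 exist on one side only, so the sum there is NaN.
aligned = s + s2
print(aligned)

# With fill_value, the missing side is treated as 100 instead.
filled = s.add(s2, fill_value=100)
print(filled)

# NaN handling: ndarray propagates NaN, Series skips it.
a = np.array([1, 2, 3, 4, np.nan])
sn = pd.Series([1, 2, 3, 4, np.nan])
print(np.mean(a))     # nan
print(sn.mean())      # mean of the four non-missing values
print(np.nanmean(a))  # NumPy variant that ignores NaN
```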

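Indexing

Label indexing and positional indexing

If a Series has a non-numeric index, accessing elements with [index] accepts either a label or a position, which can cause confusion (recent pandas releases deprecate the positional fallback). To be explicit, we can use:

loc accesses by label only.

iloc accesses by position only.

This makes element access more targeted.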

s = pd.Series([1, 2, 3], index=list("abc"))
print(s.loc["b"])
print(s.iloc[0])

2
1
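The distinction matters most when the labels are themselves integers that do not match the positions. A small sketch with made-up data:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions.
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element

print(by_label, by_position)
```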

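Slicing

# A Series has both label and positional indexing, and the two behave differently when slicing.
# A positional slice excludes the end value; a label slice includes it.

s = pd.Series([1, 2, 3, 4], index=list("abcd"))
# Slice by position
print(s.iloc[0:3])
# Slice by label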
print(s.loc["a":"d"])

'''
a    1
b    2
c    3
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64
'''

CRUD on a Series

CRUD operations on the index-value pairs of a Series:

get a value,
modify a value,
add an index-value pair,
delete an index-value pair

s = pd.Series([1, 2, 3, 4], index=list("abcd"))

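# Get a value by label or by position (or by an array of either)
#print(s.loc["a"])

# Modify a value
s.loc["a"] = 3000


# Add a value, just like assigning to a dict key
s["new_key"] = "123123sdfsadf"

# Delete a value, again like a dict
#del s["a"]


# Delete values with the drop method.
# inplace=True modifies the Series in place; drop then returns None instead of the modified result.
#print(s.drop("d", inplace=True))

# A list of labels can be passed to delete several values at once.
print(s.drop(["b", "c"], inplace=True))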

print(s)

'''
None
a                   3000
d                      4
new_key    123123sdfsadf
dtype: object
'''
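The `None` at the top of the output comes from `inplace=True`. Without that argument, drop leaves the original Series untouched and returns a new one, as this short sketch shows:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=list("abcd"))

# Without inplace=True, drop returns a new Series and leaves s unchanged.
trimmed = s.drop(["b", "c"])

print(trimmed)
print(s)
```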

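Creating a DataFrame

We can create a DataFrame object in the following ways:

From a two-dimensional structure (list of lists, ndarray, another DataFrame, etc.).

From a dict whose keys are column names and whose values are one-dimensional structures (list, ndarray, Series, etc.).

Notes:

If row and column indexes are not specified explicitly, integer indexes starting at 0 are generated automatically.

We can specify them with the index and columns parameters when creating the DataFrame.
head and tail return the first / last N rows.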

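# Create a DataFrame from a two-dimensional structure
df1 = pd.DataFrame(np.random.rand(3, 5))

# Create a DataFrame from a dict. Each key-value pair is one column: the key is the column label, the value holds the column data.
df2 = pd.DataFrame({"北京":[100, 200, 125], 
                    "天津":[109, 203, 123], 
                    "上海":[39, 90, 300]})
print(df1)
print(df2)

# Show the first (last) N rows
print(df2.head(2))
print(df2.tail(2))

# Create a DataFrame with explicit row and column indexes.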
df3 = pd.DataFrame(np.random.rand(3, 5), 
                   index=["地区1", "地区2", "地区3"], 
                   columns=["北京", "上海","广州", "深圳","武汉"])
print(df3)


'''
          0         1         2         3         4
0  0.410662  0.101513  0.587158  0.978215  0.429340
1  0.712213  0.388142  0.216256  0.249963  0.154190
2  0.327874  0.819344  0.909206  0.032725  0.373376
    北京   天津   上海
0  100  109   39
1  200  203   90
2  125  123  300
    北京   天津  上海
0  100  109  39
1  200  203  90
    北京   天津   上海
1  200  203   90
2  125  123  300
           北京        上海        广州        深圳        武汉
地区1  0.822757  0.122820  0.159488  0.252913  0.238214
地区2  0.914401  0.033803  0.867537  0.593349  0.729981
地区3  0.514004  0.867152  0.846361  0.854198  0.181037
'''
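Two of the input types listed earlier, a list of lists and a dict of Series, do not appear in the example above. The sketch below covers both, using made-up labels. With Series values, the rows are aligned by the Series index, so labels present in only one Series produce NaN.

```python
import pandas as pd

# From a list of lists, naming the columns explicitly.
from_rows = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
print(from_rows)

# From a dict of Series: rows are aligned on the union of the Series indexes.
from_series = pd.DataFrame({
    "a": pd.Series([1, 2], index=["x", "y"]),
    "b": pd.Series([3, 4], index=["y", "z"]),
})
print(from_series)
```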

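Sorting

Sorting by index

Series and DataFrame objects can be sorted by index with the sort_index method. For a DataFrame, the axis parameter selects which axis to sort (row index or column index), and the ascending parameter selects ascending or descending order.

Sorting by value

Series and DataFrame objects can be sorted by value with the sort_values method.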

df = pd.DataFrame([[1, 4],
                   [3, 2]], 
                  index=[2 ,1], columns=list("cb"))
print(df)
print(df.sort_values("b"))

'''
   c  b
2  1  4
1  3  2
   c  b
1  3  2
2  1  4
'''
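The example above shows only sort_values. As a sketch of the index sorting described earlier, the same DataFrame can be sorted by row labels, by column labels, and by value in descending order:

```python
import pandas as pd

df = pd.DataFrame([[1, 4],
                   [3, 2]],
                  index=[2, 1], columns=list("cb"))

by_row_label = df.sort_index()                  # sort rows by index label
by_col_label = df.sort_index(axis=1)            # sort columns by column label
descending = df.sort_values("b", ascending=False)

print(by_row_label)
print(by_col_label)
print(descending)
```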

Statistics

df = pd.DataFrame(np.random.rand(5, 5),
                  columns=list("abcde"), 
                  index=list("hijkl"))

print(df)
print(df.info())
print(df.describe())

'''
          a         b         c         d         e
h  0.522830  0.627477  0.382033  0.736568  0.026439
i  0.728560  0.196987  0.578368  0.491608  0.913291
j  0.727968  0.176615  0.725070  0.726866  0.480751
k  0.942187  0.963218  0.322355  0.024683  0.874741
l  0.078274  0.615003  0.901357  0.858762  0.617324
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, h to l
Data columns (total 5 columns):
a    5 non-null float64
b    5 non-null float64
c    5 non-null float64
d    5 non-null float64
e    5 non-null float64
dtypes: float64(5)
memory usage: 240.0+ bytes
None
              a         b         c         d         e
count  5.000000  5.000000  5.000000  5.000000  5.000000
mean   0.599964  0.515860  0.581837  0.567698  0.582509
std    0.327165  0.331355  0.239726  0.331370  0.359025
min    0.078274  0.176615  0.322355  0.024683  0.026439
25%    0.522830  0.196987  0.382033  0.491608  0.480751
50%    0.727968  0.615003  0.578368  0.726866  0.617324
75%    0.728560  0.627477  0.725070  0.736568  0.874741
max    0.942187  0.963218  0.901357  0.858762  0.913291
'''
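Because the data above is random, the figures change on every run. The following sketch uses small fixed columns (made up for illustration) so the summary values can be checked by hand, and shows that describe returns an ordinary DataFrame whose cells can be read with loc.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [10, 20, 30, 40]})

summary = df.describe()   # count, mean, std, min, quartiles, max per column
col_means = df.mean()     # a Series with one mean per column

print(summary)
print(summary.loc["mean", "a"])
print(col_means["b"])
```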

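Basic matplotlib usage

# 2D plot
import matplotlib.pyplot as plt
import numpy as np

# IPython/Jupyter magic: render figures in a separate Qt window
%matplotlib qt
# Use a font with Chinese glyphs so the Chinese labels below render correctly
plt.rcParams["font.family"] = "SimHei"
plt.rcParams["axes.unicode_minus"] = False

x = np.linspace(-10, 10, 200)
# print(x,type(x),x.shape)

y1 = 2 * x + 10
y2 = x ** 2

# Create the figure
plt.figure()
# Draw the straight line (label text: "straight line")
plt.plot(x, y1,"g-",label="直线")
# Draw the parabola (label text: "parabola"); the format string sets only the color, so it does not conflict with linestyle
plt.plot(x, y2, "r", linewidth = 1.0, linestyle = '--',label="抛物线")

plt.xlabel('x轴')
plt.ylabel('y轴')
plt.legend()
#plt.show()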
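The `%matplotlib qt` line only works inside IPython or Jupyter. As a rough sketch of the same 2D plot in a plain script, the version below selects the non-interactive Agg backend and renders the figure to PNG bytes in memory. The English labels and the in-memory buffer are choices made for this example, not part of the original.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window or display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 200)

fig, ax = plt.subplots()
ax.plot(x, 2 * x + 10, "g-", label="line")     # solid green straight line
ax.plot(x, x ** 2, "r--", label="parabola")    # dashed red parabola
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render to PNG in memory; pass a file name to savefig instead to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```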


# 3D surface plot
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D


%matplotlib qt
#%matplotlib inline

# Create the figure
fig = plt.figure()
# Add 3D axes to the figure (on recent matplotlib versions, Axes3D(fig) no longer attaches the axes automatically)
ax = fig.add_subplot(projection='3d')

# Define x and y
x = np.arange(-4, 4, 0.2)
y = np.arange(-4, 4, 0.2)

# Build the grid
X, Y = np.meshgrid(x, y)

# Distance of each grid point from the origin
R = np.sqrt(X ** 2 + Y ** 2)
# Height along the z axis
Z = np.sin(R)

# Draw the 3D surface
ax.plot_surface(X, Y, Z, rstride = 1, cstride = 1, cmap = 'rainbow',alpha=0.8)
# Draw the projection of the surface onto the bottom plane
# ax.contour(X, Y, Z, zdir = 'z', offset = -2, cmap = 'rainbow')
# Set the z-axis range
ax.set_zlim(-2, 2)

plt.show()


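# 3D scatter plot
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

%matplotlib qt

fig = plt.figure()
# Add 3D axes to the figure (on recent matplotlib versions, Axes3D(fig) no longer attaches the axes automatically)
ax = fig.add_subplot(projection='3d')

x = np.random.randint(0, 500,100)
y = np.random.randint(0, 500,100)
z = np.random.randint(-200,200,100)
# Color each point by the angle arctan2(x, y)
y3 = np.arctan2(x,y)
ax.scatter(x, y, z,c=y3, marker='.', s=1500)
plt.show()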

Posted 2019-07-11 09:41 by Erio
