pandas百题笔记
Pandas百题笔记
1.导入 Pandas:
import pandas as pd
2.查看 Pandas 版本信息:
print(pd.__version__) ==>1.0.1
Pandas 的数据结构:Pandas 主要有 Series(一维数组),DataFrame(二维数组),Panel(三维数组),Panel4D(四维数组),PanelND(更多维数组)等数据结构。其中 Series 和 DataFrame 应用的最为广泛。
#Series 是一维带标签的数组,它可以包含任何数据类型。包括整数,字符串,浮点数,Python 对象等。Series 可以通过标签来定位。 #DataFrame 是二维的带标签的数据结构。我们可以通过标签来定位数据。这是 NumPy 所没有的。
创建 Series 数据类型
创建 Series 语法:s = pd.Series(data, index=index),可以通过多种方式进行创建,以下介绍了 3 个常用方法。
3.从列表创建 Series:
arr = [0, 1, 2, 3, 4] s1 = pd.Series(arr) # 如果不指定索引,则默认从 0 开始 s1 ==> 0 0 1 1 2 2 3 3 4 4 dtype: int64
4.从 Ndarray 创建 Series:
import numpy as np n = np.random.randn(5) # 创建一个随机 Ndarray 数组 index = ['a', 'b', 'c', 'd', 'e'] s2 = pd.Series(n, index=index) s2 ==> a -0.766282 b 0.134975 c 0.175090 d 0.298047 e 0.171916 dtype: float64
5.从字典创建 Series:
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5} # 定义示例字典 s3 = pd.Series(d) s3 ==> a 1 b 2 c 3 d 4 e 5 dtype: int64
Series 基本操作
6.修改 Series 索引:
print(s1) # 以 s1 为例 s1.index = ['A', 'B', 'C', 'D', 'E'] # 修改后的索引 s1 ==> 0 0 1 1 2 2 3 3 4 4 dtype: int64 A 0 B 1 C 2 D 3 E 4 dtype: int64
7.Series 纵向拼接:
s4 = s3.append(s1) # 将 s1 拼接到 s3 s4 ==> a 1 b 2 c 3 d 4 e 5 A 0 B 1 C 2 D 3 E 4 dtype: int64
8.Series 按指定索引删除元素:
print(s4) s4 = s4.drop('e') # 删除索引为 e 的值 s4 ==> a 1 b 2 c 3 d 4 e 5 A 0 B 1 C 2 D 3 E 4 dtype: int64 a 1 b 2 c 3 d 4 A 0 B 1 C 2 D 3 E 4 dtype: int64
9.Series 修改指定索引元素:
s4['A'] = 6 # 修改索引为 A 的值 = 6 s4 ==> a 1 b 2 c 3 d 4 A 6 B 1 C 2 D 3 E 4 dtype: int64
10.Series 按指定索引查找元素:
s4['B'] ==> 1
11.Series 切片操作:
例如对s4的前 3 个数据访问
s4[:3] ==> a 1 b 2 c 3 dtype: int64
Series 运算
12.Series 加法运算:
Series 的加法运算是按照索引计算,如果索引不同则填充为 NaN(空值)。
s4.add(s3) ==> A NaN B NaN C NaN D NaN E NaN a 2.0 b 4.0 c 6.0 d 8.0 e NaN dtype: float64
13.Series 减法运算:
Series的减法运算是按照索引对应计算,如果不同则填充为 NaN(空值)。
s4.sub(s3) ==> A NaN B NaN C NaN D NaN E NaN a 0.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
14.Series 乘法运算:
Series 的乘法运算是按照索引对应计算,如果索引不同则填充为 NaN(空值)。
s4.mul(s3) ==> A NaN B NaN C NaN D NaN E NaN a 1.0 b 4.0 c 9.0 d 16.0 e NaN dtype: float64
15.Series 除法运算:
Series 的除法运算是按照索引对应计算,如果索引不同则填充为 NaN(空值)。
s4.div(s3) ==> A NaN B NaN C NaN D NaN E NaN a 1.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
16.Series 求中位数:
s4.median() ==> 3.0
17.Series 求和:
s4.sum() ==> 26
18.Series 求最大值:
s4.max() ==> 6
19.Series 求最小值:
s4.min() ==> 1
创建 DataFrame 数据类型
与 Sereis 不同,DataFrame 可以存在多列数据。一般情况下,DataFrame 也更加常用。
20.通过 NumPy 数组创建 DataFrame:
dates = pd.date_range('today', periods=6) # 定义时间序列作为 index num_arr = np.random.randn(6, 4) # 传入 numpy 随机数组 columns = ['A', 'B', 'C', 'D'] # 将列表作为列名 df1 = pd.DataFrame(num_arr, index=dates, columns=columns) df1 ==> A B C D 2020-07-05 13:58:34.723797 -0.820141 0.205872 -0.928024 -1.828410 2020-07-06 13:58:34.723797 0.750014 -0.340494 1.190786 -0.204266 2020-07-07 13:58:34.723797 -2.062106 -1.520711 1.414341 1.057326 2020-07-08 13:58:34.723797 -0.821653 0.564271 -1.274913 2.340385 2020-07-09 13:58:34.723797 -1.936687 0.447897 -0.108420 0.133166 2020-07-10 13:58:34.723797 0.707222 -1.251812 -0.235982 0.340147
21.通过字典数组创建 DataFrame:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'], 'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3], 'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']} labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] df2 = pd.DataFrame(data, index=labels) df2 ==> animal age visits priority a cat 2.5 1 yes b cat 3.0 3 yes c snake 0.5 2 no d dog NaN 3 yes e dog 5.0 2 no f cat 2.0 3 no g snake 4.5 1 no h cat NaN 1 yes i dog 7.0 2 no j dog 3.0 1 no #字典中的键值直接变为列名
22.查看 DataFrame 的数据类型:
df2.dtypes ==> animal object age float64 visits int64 priority object dtype: object
DataFrame 基本操作
23.预览 DataFrame 的前 5 行数据:
此方法对快速了解陌生数据集结构十分有用。
df2.head() # 默认为显示 5 行,可根据需要在括号中填入希望预览的行数 ==> animal age visits priority a cat 2.5 1 yes b cat 3.0 3 yes c snake 0.5 2 no d dog NaN 3 yes e dog 5.0 2 no
24.查看 DataFrame 的后 3 行数据:
df2.tail(3) ==> animal age visits priority h cat NaN 1 yes i dog 7.0 2 no j dog 3.0 1 no
25.查看 DataFrame 的索引:
df2.index ==> Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
26.查看 DataFrame 的列名:
df2.columns ==>Index(['animal', 'age', 'visits', 'priority'], dtype='object')
27.查看 DataFrame 的数值:
df2.values ==> array([['cat', 2.5, 1, 'yes'], ['cat', 3.0, 3, 'yes'], ['snake', 0.5, 2, 'no'], ['dog', nan, 3, 'yes'], ['dog', 5.0, 2, 'no'], ['cat', 2.0, 3, 'no'], ['snake', 4.5, 1, 'no'], ['cat', nan, 1, 'yes'], ['dog', 7.0, 2, 'no'], ['dog', 3.0, 1, 'no']], dtype=object)
28.查看 DataFrame 的统计数据:
df2.describe() ==> age visits count 8.000000 10.000000 mean 3.437500 1.900000 std 2.007797 0.875595 min 0.500000 1.000000 25% 2.375000 1.000000 50% 3.000000 2.000000 75% 4.625000 2.750000 max 7.000000 3.000000
29.DataFrame 转置操作:
df2.T ==> a b c d e f g h i j animal cat cat snake dog dog cat snake cat dog dog age 2.5 3 0.5 NaN 5 2 4.5 NaN 7 3 visits 1 3 2 3 2 3 1 1 2 1 priority yes yes no yes no no no yes no no
30.对 DataFrame 进行按列排序:
df2.sort_values(by='age') # 按 age 升序排列 ==> animal age visits priority c snake 0.5 2 no f cat 2.0 3 no a cat 2.5 1 yes b cat 3.0 3 yes j dog 3.0 1 no g snake 4.5 1 no e dog 5.0 2 no i dog 7.0 2 no d dog NaN 3 yes h cat NaN 1 yes
31.对 DataFrame 数据切片:
df2[1:3] ==> animal age visits priority b cat 3.0 3 yes c snake 0.5 2 no
32.对 DataFrame 通过标签查询(单列):
df2['age'] ==> a 2.5 b 3.0 c 0.5 d NaN e 5.0 f 2.0 g 4.5 h NaN i 7.0 j 3.0 Name: age, dtype: float64 df2.age # 等价于 df2['age']
33.对 DataFrame 通过标签查询(多列):
df2[['age', 'animal']] # 传入一个列名组成的列表 ==> age animal a 2.5 cat b 3.0 cat c 0.5 snake d NaN dog e 5.0 dog f 2.0 cat g 4.5 snake h NaN cat i 7.0 dog j 3.0 dog
34.对 DataFrame 通过位置查询:
df2.iloc[1:3] # 查询 2,3 行 ==> animal age visits priority b cat 3.0 3 yes c snake 0.5 2 no
35.DataFrame 副本拷贝:
生成 DataFrame 副本,方便数据集被多个不同流程使用
df3 = df2.copy() df3 ==> animal age visits priority a cat 2.5 1 yes b cat 3.0 3 yes c snake 0.5 2 no d dog NaN 3 yes e dog 5.0 2 no f cat 2.0 3 no g snake 4.5 1 no h cat NaN 1 yes i dog 7.0 2 no j dog 3.0 1 no
36.判断 DataFrame 元素是否为空:
df3.isnull() # 如果为空则返回为 True ==> animal age visits priority a False False False False b False False False False c False False False False d False True False False e False False False False f False False False False g False False False False h False True False False i False False False False j False False False False
37.添加列数据:
num = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df3.index) df3['No.'] = num # 添加以 'No.' 为列名的新数据列 df3 ==> animal age visits priority No. a cat 2.5 1 yes 0 b cat 3.0 3 yes 1 c snake 0.5 2 no 2 d dog NaN 3 yes 3 e dog 5.0 2 no 4 f cat 2.0 3 no 5 g snake 4.5 1 no 6 h cat NaN 1 yes 7 i dog 7.0 2 no 8 j dog 3.0 1 no 9
38.根据 DataFrame 的下标值进行更改:
修改第 2 行与第 2 列对应的值 3.0 → 2.0
df3.iat[1, 1] = 2 # 索引序号从 0 开始,这里为 1, 1 df3 ==> animal age visits priority No. a cat 2.5 1 yes 0 b cat 2.0 3 yes 1 c snake 0.5 2 no 2 d dog NaN 3 yes 3 e dog 5.0 2 no 4 f cat 2.0 3 no 5 g snake 4.5 1 no 6 h cat NaN 1 yes 7 i dog 7.0 2 no 8 j dog 3.0 1 no 9
39.根据 DataFrame 的标签对数据进行修改:
df3.loc['f', 'age'] = 1.5 df3 ==> animal age visits priority No. a cat 2.5 1 yes 0 b cat 2.0 3 yes 1 c snake 0.5 2 no 2 d dog NaN 3 yes 3 e dog 5.0 2 no 4 f cat 1.5 3 no 5 g snake 4.5 1 no 6 h cat NaN 1 yes 7 i dog 7.0 2 no 8 j dog 3.0 1 no 9
40.DataFrame 求平均值操作:
df3.mean() ==> age 3.25 visits 1.90 No. 4.50 dtype: float64
41.对 DataFrame 中任意列做求和操作:
df3['visits'].sum() ==> 19
字符串操作
42.将字符串转化为小写字母:
string = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']) print(string) string.str.lower() ==> 0 A 1 B 2 C 3 Aaba 4 Baca 5 NaN 6 CABA 7 dog 8 cat dtype: object 0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: object
43.将字符串转化为大写字母:
string.str.upper() ==> 0 A 1 B 2 C 3 AABA 4 BACA 5 NaN 6 CABA 7 DOG 8 CAT dtype: object
DataFrame 缺失值操作
44.对缺失值进行填充:
df4 = df3.copy() print(df4) df4.fillna(value=3) ==> animal age visits priority No. a cat 2.5 1 yes 0 b cat 2.0 3 yes 1 c snake 0.5 2 no 2 d dog NaN 3 yes 3 e dog 5.0 2 no 4 f cat 1.5 3 no 5 g snake 4.5 1 no 6 h cat NaN 1 yes 7 i dog 7.0 2 no 8 j dog 3.0 1 no 9 animal age visits priority No. a cat 2.5 1 yes 0 b cat 2.0 3 yes 1 c snake 0.5 2 no 2 d dog 3.0 3 yes 3 e dog 5.0 2 no 4 f cat 1.5 3 no 5 g snake 4.5 1 no 6 h cat 3.0 1 yes 7 i dog 7.0 2 no 8 j dog 3.0 1 no 9
45.删除存在缺失值的行:
df5 = df3.copy() print(df5) df5.dropna(how='any') # 任何存在 NaN 的行都将被删除 ==> animal age visits priority No. a cat 2.5 1 yes 0 b cat 2.0 3 yes 1 c snake 0.5 2 no 2 d dog NaN 3 yes 3 e dog 5.0 2 no 4 f cat 1.5 3 no 5 g snake 4.5 1 no 6 h cat NaN 1 yes 7 i dog 7.0 2 no 8 j dog 3.0 1 no 9 animal age visits priority No. a cat 2.5 1 yes 0 b cat 2.0 3 yes 1 c snake 0.5 2 no 2 e dog 5.0 2 no 4 f cat 1.5 3 no 5 g snake 4.5 1 no 6 i dog 7.0 2 no 8 j dog 3.0 1 no 9
46.DataFrame 按指定列对齐:
left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]}) right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]}) print(left) print(right)
按照 key 列对齐连接,只存在 foo2 相同,所以最后变成一行
pd.merge(left, right, on='key') ==> key one 0 foo1 1 1 foo2 2 key two 0 foo2 4 1 foo3 5 key one two 0 foo2 2 4
DataFrame 文件操作
47.CSV 文件写入:
df3.to_csv('animal.csv') print("写入成功.") ==> 写入成功.
48.CSV 文件读取:
df_animal = pd.read_csv('animal.csv') df_animal ==> Unnamed: 0 animal age visits priority No. 0 a cat 2.5 1 yes 0 1 b cat 2.0 3 yes 1 2 c snake 0.5 2 no 2 3 d dog NaN 3 yes 3 4 e dog 5.0 2 no 4 5 f cat 1.5 3 no 5 6 g snake 4.5 1 no 6 7 h cat NaN 1 yes 7 8 i dog 7.0 2 no 8 9 j dog 3.0 1 no 9
49.Excel 写入操作:
df3.to_excel('animal.xlsx', sheet_name='Sheet1') print("写入成功.") ==> 写入成功.
50.Excel 读取操作:
pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA']) ==> Unnamed: 0 animal age visits priority No. 0 a cat 2.5 1 yes 0 1 b cat 2.0 3 yes 1 2 c snake 0.5 2 no 2 3 d dog NaN 3 yes 3 4 e dog 5.0 2 no 4 5 f cat 1.5 3 no 5 6 g snake 4.5 1 no 6 7 h cat NaN 1 yes 7 8 i dog 7.0 2 no 8 9 j dog 3.0 1 no 9
进阶部分
时间序列索引
51.建立一个以 2018 年每一天为索引,值为随机数的 Series:
dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D') s = pd.Series(np.random.rand(len(dti)), index=dti) s ==> 2018-01-01 0.441330 2018-01-02 0.182571 2018-01-03 0.141348 2018-01-04 0.604700 2018-01-05 0.300351 ... 2018-12-27 0.499318 2018-12-28 0.530867 2018-12-29 0.183895 2018-12-30 0.163899 2018-12-31 0.173812 Freq: D, Length: 365, dtype: float64
52.统计s 中每一个周三对应值的和:
周一从 0 开始
s[s.index.weekday == 2].sum() ==> 22.592391213957054
53.统计s中每个月值的平均值:
s.resample('M').mean() ==> 2018-01-31 0.441100 2018-02-28 0.506476 2018-03-31 0.501672 2018-04-30 0.510073 2018-05-31 0.416773 2018-06-30 0.525039 2018-07-31 0.433221 2018-08-31 0.472530 2018-09-30 0.388529 2018-10-31 0.550011 2018-11-30 0.486513 2018-12-31 0.443012 Freq: M, dtype: float64
54.将 Series 中的时间进行转换(秒转分钟):
s = pd.date_range('today', periods=100, freq='S') ts = pd.Series(np.random.randint(0, 500, len(s)), index=s) ts.resample('Min').sum() ==> 2020-07-05 14:48:00 15836 2020-07-05 14:49:00 9298 Freq: T, dtype: int64
55.UTC 世界时间标准:
s = pd.date_range('today', periods=1, freq='D') # 获取当前时间 ts = pd.Series(np.random.randn(len(s)), s) # 随机数值 ts_utc = ts.tz_localize('UTC') # 转换为 UTC 时间 ts_utc ==> 2020-07-05 14:48:38.609382+00:00 -0.348899 Freq: D, dtype: float64
56.转换为上海所在时区:
ts_utc.tz_convert('Asia/Shanghai') ==> 2020-07-05 22:48:38.609382+08:00 -0.348899 Freq: D, dtype: float64
57.不同时间表示方式的转换:
rng = pd.date_range('1/1/2018', periods=5, freq='M') ts = pd.Series(np.random.randn(len(rng)), index=rng) print(ts) ps = ts.to_period() print(ps) ps.to_timestamp() ==> 2018-01-31 0.621688 2018-02-28 -1.937715 2018-03-31 0.081314 2018-04-30 -1.308769 2018-05-31 -0.075345 Freq: M, dtype: float64 2018-01 0.621688 2018-02 -1.937715 2018-03 0.081314 2018-04 -1.308769 2018-05 -0.075345 Freq: M, dtype: float64 2018-01-01 0.621688 2018-02-01 -1.937715 2018-03-01 0.081314 2018-04-01 -1.308769 2018-05-01 -0.075345 Freq: MS, dtype: float64
Series 多重索引
58.创建多重索引 Series:
构建一个 letters = ['A', 'B', 'C'] 和 numbers = list(range(10))为索引,值为随机数的多重索引 Series。
letters = ['A', 'B', 'C'] numbers = list(range(10)) mi = pd.MultiIndex.from_product([letters, numbers]) # 设置多重索引 s = pd.Series(np.random.rand(30), index=mi) # 随机数 s ==> A 0 0.698046 1 0.380276 2 0.873395 3 0.628864 4 0.528025 5 0.677856 6 0.194495 7 0.164484 8 0.018238 9 0.747468 B 0 0.623616 1 0.560504 2 0.731296 3 0.760307 4 0.807663 5 0.347980 6 0.005892 7 0.807262 8 0.650353 9 0.803976 C 0 0.387503 1 0.943305 2 0.215817 3 0.128086 4 0.252103 5 0.048908 6 0.779633 7 0.825234 8 0.624257 9 0.263373 dtype: float64
59.多重索引 Series 查询:
查询索引为 1,3,6 的值
s.loc[:, [1, 3, 6]] ==> A 1 0.380276 3 0.628864 6 0.194495 B 1 0.560504 3 0.760307 6 0.005892 C 1 0.943305 3 0.128086 6 0.779633 dtype: float64
60.多重索引 Series 切片:
s.loc[pd.IndexSlice[:'B', 5:]] ==> A 5 0.677856 6 0.194495 7 0.164484 8 0.018238 9 0.747468 B 5 0.347980 6 0.005892 7 0.807262 8 0.650353 9 0.803976 dtype: float64
DataFrame 多重索引
61.根据多重索引创建 DataFrame:
创建一个以 letters = ['A', 'B'] 和 numbers = list(range(6))为索引,值为随机数据的多重索引 DataFrame。
frame = pd.DataFrame(np.arange(12).reshape(6, 2), index=[list('AAABBB'), list('123123')], columns=['hello', 'shiyanlou']) frame ==> hello shiyanlou A 1 0 1 2 2 3 3 4 5 B 1 6 7 2 8 9 3 10 11
62.多重索引设置列名称:
frame.index.names = ['first', 'second'] frame ==> hello shiyanlou first second A 1 0 1 2 2 3 3 4 5 B 1 6 7 2 8 9 3 10 11
63.DataFrame 多重索引分组求和:
frame.groupby('first').sum() ==> hello shiyanlou first A 6 9 B 24 27
64.DataFrame 行列名称转换:
print(frame) frame.stack() ==> hello shiyanlou first second A 1 0 1 2 2 3 3 4 5 B 1 6 7 2 8 9 3 10 11 first second A 1 hello 0 shiyanlou 1 2 hello 2 shiyanlou 3 3 hello 4 shiyanlou 5 B 1 hello 6 shiyanlou 7 2 hello 8 shiyanlou 9 3 hello 10 shiyanlou 11 dtype: int64
65.DataFrame 索引转换:
print(frame) frame.unstack() ==> hello shiyanlou first second A 1 0 1 2 2 3 3 4 5 B 1 6 7 2 8 9 3 10 11 hello shiyanlou second 1 2 3 1 2 3 first A 0 2 4 1 3 5 B 6 8 10 7 9 11
66.DataFrame 条件查找:
示例数据
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'], 'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3], 'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']} labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] df = pd.DataFrame(data, index=labels)
查找 age 大于 3 的全部信息
df[df['age'] > 3] ==> animal age visits priority e dog 5.0 2 no g snake 4.5 1 no i dog 7.0 2 no
67.根据行列索引切片:
df.iloc[2:4, 1:3] ==> age visits c 0.5 2 d NaN 3
68.DataFrame 多重条件查询:
查找 age<3 且为 cat 的全部数据。
df = pd.DataFrame(data, index=labels) df[(df['animal'] == 'cat') & (df['age'] < 3)] ==> animal age visits priority a cat 2.5 1 yes f cat 2.0 3 no
69.DataFrame 按关键字查询:
df3[df3['animal'].isin(['cat', 'dog'])] ==> animal age visits priority No. a cat 2.5 1 yes 0 b cat 2.0 3 yes 1 d dog NaN 3 yes 3 e dog 5.0 2 no 4 f cat 1.5 3 no 5 h cat NaN 1 yes 7 i dog 7.0 2 no 8 j dog 3.0 1 no 9
70.DataFrame 按标签及列名查询:
df.loc[df2.index[[3, 4, 8]], ['animal', 'age']] ==> animal age d dog NaN e dog 5.0 i dog 7.0
71.DataFrame 多条件排序:
按照 age 降序,visits 升序排列
df.sort_values(by=['age', 'visits'], ascending=[False, True]) ==> animal age visits priority i dog 7.0 2 no e dog 5.0 2 no g snake 4.5 1 no j dog 3.0 1 no b cat 3.0 3 yes a cat 2.5 1 yes f cat 2.0 3 no c snake 0.5 2 no h cat NaN 1 yes d dog NaN 3 yes
72.DataFrame 多值替换:
将 priority 列的 yes 值替换为 True,no 值替换为 False。
df['priority'].map({'yes': True, 'no': False}) ==> a True b True c False d True e False f False g False h True i False j False Name: priority, dtype: bool
73.DataFrame 分组求和:
df4.groupby('animal').sum() ==> age visits No. animal cat 6.0 8 13 dog 15.0 8 24 snake 5.0 3 8
74.使用列表拼接多个 DataFrame:
temp_df1 = pd.DataFrame(np.random.randn(5, 4)) # 生成由随机数组成的 DataFrame 1 temp_df2 = pd.DataFrame(np.random.randn(5, 4)) # 生成由随机数组成的 DataFrame 2 temp_df3 = pd.DataFrame(np.random.randn(5, 4)) # 生成由随机数组成的 DataFrame 3 print(temp_df1) print(temp_df2) print(temp_df3) pieces = [temp_df1, temp_df2, temp_df3] pd.concat(pieces) ==> 0 1 2 3 0 1.061349 0.927805 -0.270724 0.232218 1 -2.049875 -0.896899 -0.738298 0.547709 2 0.084709 -1.801844 0.610220 -1.304246 3 1.384591 0.872657 -0.829547 -0.332316 4 -0.255004 2.177881 0.615079 0.767592 0 1 2 3 0 0.009016 1.181569 -1.403829 -0.745604 1 -0.270313 -0.258377 -1.067346 1.465726 2 -1.619676 -0.324374 -0.433600 0.211323 3 0.163223 0.144191 0.717129 -0.555298 4 -0.718321 1.688866 -0.607994 1.731248 0 1 2 3 0 -1.178622 0.415409 0.496004 1.368869 1 0.724433 -0.262059 0.514689 -1.666051 2 -0.325606 0.013015 1.010961 2.075196 3 2.212960 -0.132432 -1.603347 -1.182487 4 0.102536 1.384535 0.411434 -0.175592 0 1 2 3 0 1.061349 0.927805 -0.270724 0.232218 1 -2.049875 -0.896899 -0.738298 0.547709 2 0.084709 -1.801844 0.610220 -1.304246 3 1.384591 0.872657 -0.829547 -0.332316 4 -0.255004 2.177881 0.615079 0.767592 0 0.009016 1.181569 -1.403829 -0.745604 1 -0.270313 -0.258377 -1.067346 1.465726 2 -1.619676 -0.324374 -0.433600 0.211323 3 0.163223 0.144191 0.717129 -0.555298 4 -0.718321 1.688866 -0.607994 1.731248 0 -1.178622 0.415409 0.496004 1.368869 1 0.724433 -0.262059 0.514689 -1.666051 2 -0.325606 0.013015 1.010961 2.075196 3 2.212960 -0.132432 -1.603347 -1.182487 4 0.102536 1.384535 0.411434 -0.175592
75.找出 DataFrame 表中和最小的列:
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij')) print(df) df.sum().idxmin() # idxmax(), idxmin() 为 Series 函数返回最大最小值的索引值 ==> a b c d e f g \ 0 0.931149 0.641776 0.758608 0.630512 0.170375 0.211306 0.973363 1 0.730186 0.682949 0.554609 0.356089 0.399012 0.939087 0.908047 2 0.261405 0.434525 0.490395 0.368307 0.832568 0.571115 0.936016 3 0.161993 0.132176 0.852158 0.140710 0.165902 0.564976 0.656718 4 0.810233 0.385639 0.127849 0.166585 0.302643 0.947498 0.164274 h i j 0 0.223378 0.115285 0.161207 1 0.765946 0.206518 0.951096 2 0.891956 0.430530 0.045640 3 0.955571 0.962989 0.123037 4 0.391810 0.696404 0.561719 'd'
76.DataFrame 中每个元素减去每一行的平均值:
df = pd.DataFrame(np.random.random(size=(5, 3))) print(df) df.sub(df.mean(axis=1), axis=0) ==> 0 1 2 0 0.028539 0.555065 0.166588 1 0.781335 0.086089 0.616780 2 0.022462 0.047383 0.476410 3 0.796853 0.850955 0.765398 4 0.208298 0.858031 0.264920 0 1 2 0 -0.221525 0.305001 -0.083476 1 0.286600 -0.408646 0.122046 2 -0.159623 -0.134702 0.294325 3 -0.007549 0.046553 -0.039004 4 -0.235452 0.414281 -0.178830
77.DataFrame 分组,并得到每一组中最大三个数之和:
df = pd.DataFrame({'A': list('aaabbcaabcccbbc'), 'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]}) print(df) df.groupby('A')['B'].nlargest(3).sum(level=0) ==> A B 0 a 12 1 a 345 2 a 3 3 b 1 4 b 45 5 c 14 6 a 4 7 a 52 8 b 54 9 c 23 10 c 235 11 c 21 12 b 57 13 b 3 14 c 87 A a 409 b 156 c 345 Name: B, dtype: int64
透视表
当分析庞大的数据时,为了更好的发掘数据特征之间的关系,且不破坏原数据,就可以利用透视表 pivot_table 进行操作。
78.透视表的创建:
新建表将 A, B, C 列作为索引进行聚合。
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3, 'B': ['A', 'B', 'C'] * 4, 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, 'D': np.random.randn(12), 'E': np.random.randn(12)}) print(df) print(pd.pivot_table(df, index=['A', 'B'])) ==> A B C D E 0 one A foo -2.718717 1.749056 1 one B foo -0.710776 0.442023 2 two C foo -0.824951 2.244523 3 three A bar 0.300916 1.709200 4 one B bar -2.590790 0.292709 5 one C bar 0.908543 -0.598258 6 two A foo -0.521278 0.204491 7 three B foo -3.302320 -1.762640 8 one C foo -1.311013 -0.722187 9 one A bar 0.785471 -0.231635 10 two B bar -1.758329 -0.031603 11 three C bar 1.236829 1.235032 D E A B one A -0.966623 0.758711 B -1.650783 0.367366 C -0.201235 -0.660222 three A 0.300916 1.709200 B -3.302320 -1.762640 C 1.236829 1.235032 two A -0.521278 0.204491 B -1.758329 -0.031603 C -0.824951 2.244523
79.透视表按指定行进行聚合:
将该 DataFrame 的 D 列聚合,按照 A,B 列为索引进行聚合,聚合的方式为默认求均值。
pd.pivot_table(df, values=['D'], index=['A', 'B']) ==> D A B one A -0.966623 B -1.650783 C -0.201235 three A 0.300916 B -3.302320 C 1.236829 two A -0.521278 B -1.758329 C -0.824951
80.透视表聚合方式定义:
上一题中 D 列聚合时,采用默认求均值的方法,若想使用更多的方式可以在 aggfunc 中实现。
pd.pivot_table(df, values=['D'], index=['A', 'B'], aggfunc=[np.sum, len]) ==> sum len D D A B one A -1.933246 2.0 B -3.301567 2.0 C -0.402470 2.0 three A 0.300916 1.0 B -3.302320 1.0 C 1.236829 1.0 two A -0.521278 1.0 B -1.758329 1.0 C -0.824951 1.0
81.透视表利用额外列进行辅助分割:
D 列按照 A,B 列进行聚合时,若关心 C 列对 D 列的影响,可以加入 columns 值进行分析。
pd.pivot_table(df, values=['D'], index=['A', 'B'], columns=['C'], aggfunc=np.sum) ==> D C bar foo A B one A 0.785471 -2.718717 B -2.590790 -0.710776 C 0.908543 -1.311013 three A 0.300916 NaN B NaN -3.302320 C 1.236829 NaN two A NaN -0.521278 B -1.758329 NaN C NaN -0.824951
82.透视表的缺省值处理:
在透视表中由于不同的聚合方式,相应缺少的组合将为缺省值,可以加入 fill_value 对缺省值处理。
pd.pivot_table(df, values=['D'], index=['A', 'B'], columns=['C'], aggfunc=np.sum, fill_value=0) ==> D C bar foo A B one A 0.785471 -2.718717 B -2.590790 -0.710776 C 0.908543 -1.311013 three A 0.300916 0.000000 B 0.000000 -3.302320 C 1.236829 0.000000 two A 0.000000 -0.521278 B -1.758329 0.000000 C 0.000000 -0.824951
绝对类型¶
在数据的形式上主要包括数量型和性质型,数量型表示着数据可数范围可变,而性质型表示范围已经确定不可改变,绝对型数据就是性质型数据的一种。
83.绝对型数据定义:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": [ 'a', 'b', 'b', 'a', 'a', 'e']}) df["grade"] = df["raw_grade"].astype("category") df ==> id raw_grade grade 0 1 a a 1 2 b b 2 3 b b 3 4 a a 4 5 a a 5 6 e e
84.对绝对型数据重命名:
df["grade"].cat.categories = ["very good", "good", "very bad"] df ==> id raw_grade grade 0 1 a very good 1 2 b good 2 3 b good 3 4 a very good 4 5 a very good 5 6 e very bad
85.重新排列绝对型数据并补充相应的缺省值:
df["grade"] = df["grade"].cat.set_categories( ["very bad", "bad", "medium", "good", "very good"]) df ==> id raw_grade grade 0 1 a very good 1 2 b good 2 3 b good 3 4 a very good 4 5 a very good 5 6 e very bad
86.对绝对型数据进行排序:
df.sort_values(by="grade") ==> id raw_grade grade 5 6 e very bad 1 2 b good 2 3 b good 0 1 a very good 3 4 a very good 4 5 a very good
87.对绝对型数据进行分组:
df.groupby("grade").size() ==> grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64
数据清洗
常常我们得到的数据是不符合我们最终处理的数据要求,包括许多缺省值以及坏的数据,需要我们对数据进行清洗。
88.缺失值拟合:
在FilghtNumber中有数值缺失,其中数值为按 10 增长,补充相应的缺省值使得数据完整,并让数据为 int 类型。
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 'Budapest_PaRis', 'Brussels_londOn'], 'FlightNumber': [10045, np.nan, 10065, np.nan, 10085], 'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]], 'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"']}) df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int) df ==> From_To FlightNumber RecentDelays Airline 0 LoNDon_paris 10045 [23, 47] KLM(!) 1 MAdrid_miLAN 10055 [] <Air France> (12) 2 londON_StockhOlm 10065 [24, 43, 87] (British Airways. ) 3 Budapest_PaRis 10075 [13] 12. Air France 4 Brussels_londOn 10085 [67, 32] "Swiss Air"
89.数据列拆分:
其中From_to应该为两独立的两列From和To,将From_to依照_拆分为独立两列建立为一个新表。
temp = df.From_To.str.split('_', expand=True) temp.columns = ['From', 'To'] temp ==> From To 0 LoNDon paris 1 MAdrid miLAN 2 londON StockhOlm 3 Budapest PaRis 4 Brussels londOn
90.字符标准化:
其中注意到地点的名字都不规范(如:londON应该为London)需要对数据进行标准化处理。
temp['From'] = temp['From'].str.capitalize() temp['To'] = temp['To'].str.capitalize()
91.删除坏数据加入整理好的数据:
将最开始的 From_to 列删除,加入整理好的 From 和 to 列。
df = df.drop('From_To', axis=1) df = df.join(temp) print(df) ==> FlightNumber RecentDelays Airline From To 0 10045 [23, 47] KLM(!) London Paris 1 10055 [] <Air France> (12) Madrid Milan 2 10065 [24, 43, 87] (British Airways. ) London Stockholm 3 10075 [13] 12. Air France Budapest Paris 4 10085 [67, 32] "Swiss Air" Brussels London
92.去除多余字符:
如同 airline 列中许多数据有许多其他字符,会对后期的数据分析有较大影响,需要对这类数据进行修正。
df['Airline'] = df['Airline'].str.extract( '([a-zA-Z\s]+)', expand=False).str.strip() df ==> FlightNumber Airline From To delay_1 delay_2 \ 0 10045 KLM London Paris 23.0 47.0 1 10055 Air France Madrid Milan NaN NaN 2 10065 British Airways London Stockholm 24.0 43.0 3 10075 Air France Budapest Paris 13.0 NaN 4 10085 Swiss Air Brussels London 67.0 32.0 delay_3 0 NaN 1 NaN 2 87.0 3 NaN 4 NaN
93.格式规范:
在 RecentDelays 中记录的方式为列表类型,由于其长度不一,这会为后期数据分析造成很大麻烦。这里将 RecentDelays 的列表拆开,取出列表中的相同位置元素作为一列,若为空值即用 NaN 代替。
delays = df['RecentDelays'].apply(pd.Series) delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)] df = df.drop('RecentDelays', axis=1).join(delays) df ==> FlightNumber Airline From To delay_1 delay_2 delay_3 0 10045 KLM London Paris 23.0 47.0 NaN 1 10055 Air France Madrid Milan NaN NaN NaN 2 10065 British Airways London Stockholm 24.0 43.0 87.0 3 10075 Air France Budapest Paris 13.0 NaN NaN 4 10085 Swiss Air Brussels London 67.0 32.0 NaN
数据预处理
94.信息区间划分:
班级一部分同学的数学成绩表,如下图所示
df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'], 'grades':[58,83,79,65,93,45,61,88]})
但我们更加关心的是该同学是否及格,将该数学成绩按照是否>60来进行划分。
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella', 'Frank', 'Grace', 'Jenny'], 'grades': [58, 83, 79, 65, 93, 45, 61, 88]}) def choice(x): if x > 60: return 1 else: return 0 df.grades = pd.Series(map(lambda x: choice(x), df.grades)) df ==> name grades 0 Alice 0 1 Bob 1 2 Candy 1 3 Dany 1 4 Ella 1 5 Frank 0 6 Grace 1 7 Jenny 1
95.数据去重:
一个列为A的 DataFrame 数据,如下图所示
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
尝试将 A 列中连续重复的数据清除。
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]}) df.loc[df['A'].shift() != df['A']] ==> A 0 1 1 2 3 3 4 4 5 5 8 6 9 7
96.数据归一化:
有时候,DataFrame 中不同列之间的数据差距太大,需要对其进行归一化处理。 其中,Max-Min 归一化是简单而常见的一种方式,公式如下:
def normalization(df): numerator = df.sub(df.min()) denominator = (df.max()).sub(df.min()) Y = numerator.div(denominator) return Y df = pd.DataFrame(np.random.random(size=(5, 3))) print(df) normalization(df) ==> 0 1 2 0 0.923325 0.925392 0.203170 1 0.770389 0.050410 0.605788 2 0.146447 0.542584 0.056240 3 0.161917 0.841527 0.547914 4 0.948175 0.814426 0.980268 0 1 2 0 0.969004 1.000000 0.159009 1 0.778247 0.000000 0.594731 2 0.000000 0.562496 0.000000 3 0.019297 0.904153 0.532098 4 1.000000 0.873179 1.000000
Pandas 绘图操作
为了更好的了解数据包含的信息,最直观的方法就是将其绘制成图。
97.Series 可视化:
%matplotlib inline ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100)) ts = ts.cumsum() ts.plot()
==>输出图像:
98.DataFrame 折线图:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index, columns=['A', 'B', 'C', 'D']) df = df.cumsum() df.plot()
==>输出图像:
99.DataFrame 散点图:
df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]}) df = df.cumsum() df.plot.scatter("xs", "ys", color='red', marker="*")
==>输出图像:
100.DataFrame 柱形图:
df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52], "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9], "month": range(12) }) ax = df.plot.bar("month", "revenue", color="yellow") df.plot("month", "advertising", secondary_y=True, ax=ax)
==>输出图像: