[pandas] Beginner Tutorial 2: Ten Getting-Started Questions

1. What kind of data does pandas handle?

Key points:

  • Import the package, i.e. import pandas as pd

  • A data table is stored as a DataFrame

  • Each column of a DataFrame is a Series

  • You work with a DataFrame or Series by applying methods to it

pandas is suited to tabular data. In pandas, a table is called a DataFrame.

1.1. Importing the pandas module

  • pd is the industry-standard abbreviation for pandas.
# first example

import pandas as pd

1.2. Using a DataFrame for tabular data

  • When building a DataFrame from a dictionary, the keys become the column names and the values become the column values.
  • Different columns can hold values of different types.
df = pd.DataFrame(
     {
         "Name": [
             "Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth",
         ],
         "Age": [22, 35, 58],
         "Sex": ["male", "male", "female"],
     }
)
df

1.3. Each column of a DataFrame is a Series

df['Age']
0    22
1    35
2    58
Name: Age, dtype: int64
  • Selecting a single column of a DataFrame returns a Series
  • Use DataFrame['column name'] to select a column

A Series can also be created directly:

ages = pd.Series([22, 35, 58], name="Age")
ages
0    22
1    35
2    58
Name: Age, dtype: int64

A Series has no column label, but it does have row labels.

1.4. DataFrames and Series have callable methods

A few examples:

df["Age"].max()
58
ages.max()
58
df.describe()
Age
count 3.000000
mean 38.333333
std 18.230012
min 22.000000
25% 28.500000
50% 35.000000
75% 46.500000
max 58.000000
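The method calls above can be reproduced end to end on a small inline DataFrame; this is a self-contained sketch using the same values as this section:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# a method applied to a single column (a Series)
oldest = df["Age"].max()          # 58

# describe() summarizes every numeric column at once
summary = df.describe()
mean_age = summary.loc["mean", "Age"]
```

Note that describe() skips non-numeric columns like Name and Sex by default.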

2. How do I read and write tabular data?

  • pandas supports many data formats: CSV, Excel, SQL, JSON, Parquet, ...

  • Reading: read_*

  • Writing: to_*

import pandas as pd
titanic_data_path = "../data/titanic.csv"
titanic_df = pd.read_csv(titanic_data_path)

# usage of df.head() / df.tail()
# titanic_df.head(8)
titanic_df.tail(8)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.500 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.050 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.125 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.450 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.750 NaN Q

A DataFrame has many methods, as well as many attributes that can be accessed directly.

titanic_df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
titanic_df.to_excel('ddd.xlsx', sheet_name = 'sheet1', index=False)
titanic_df.info  # note: without parentheses this returns the bound method, not the summary — call titanic_df.info() to print it
<bound method DataFrame.info of      PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ...    ...   
886                              Montvila, Rev. Juozas    male  27.0      0   
887                       Graham, Miss. Margaret Edith  female  19.0      0   
888           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1   
889                              Behr, Mr. Karl Howell    male  26.0      0   
890                                Dooley, Mr. Patrick    male  32.0      0   

     Parch            Ticket     Fare Cabin Embarked  
0        0         A/5 21171   7.2500   NaN        S  
1        0          PC 17599  71.2833   C85        C  
2        0  STON/O2. 3101282   7.9250   NaN        S  
3        0            113803  53.1000  C123        S  
4        0            373450   8.0500   NaN        S  
..     ...               ...      ...   ...      ...  
886      0            211536  13.0000   NaN        S  
887      0            112053  30.0000   B42        S  
888      2        W./C. 6607  23.4500   NaN        S  
889      0            111369  30.0000  C148        C  
890      0            370376   7.7500   NaN        Q  

[891 rows x 12 columns]>
titanic_df.__class__
pandas.core.frame.DataFrame 
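The read_*/to_* pairing can be shown without any files on disk. Here io.StringIO (standard library) stands in for a CSV file — an assumption for the example; read_csv accepts both paths and file-like objects:

```python
import io

import pandas as pd

csv_text = "PassengerId,Age\n1,22.0\n2,38.0\n3,\n"

# read_*: parse the CSV into a DataFrame
df = pd.read_csv(io.StringIO(csv_text))
# the missing Age value forces that column to float64 (NaN is a float)

# to_*: serialize the DataFrame back out, then read it again
round_trip = pd.read_csv(io.StringIO(df.to_csv(index=False)))
```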

3. How do I select a subset of a table?

pandas can:

  • select or filter specific rows/columns

  • filter data by condition

  • slice, select, extract ...

Key points:

  • Use square brackets [] to select a subset

  • Inside the brackets, you can put:

    • a single row/column label

    • a list of row/column labels

    • a slice

    • a condition

  • With loc, use row/column labels (label names)

  • With iloc, use row/column positions (integer indices, end-exclusive)

  • With loc/iloc you can also assign new values into the DataFrame.

3.1 Using square brackets []

import pandas as pd

titanic = pd.read_csv('data/titanic.csv')
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# a single column
ages = titanic['Age']
ages.head(),ages.shape, ages.dtype,type(ages)
(0    22.0
 1    38.0
 2    26.0
 3    35.0
 4    35.0
 Name: Age, dtype: float64,
 (891,),
 dtype('float64'),
 pandas.core.series.Series)

Each column of a DataFrame is a Series.

# multiple columns
age_sex = titanic[['Age','Sex']] # the inner brackets form a list
age_sex.head(), age_sex.shape, age_sex.columns, type(age_sex)
(    Age     Sex
 0  22.0    male
 1  38.0  female
 2  26.0  female
 3  35.0  female
 4  35.0    male,
 (891, 2),
 Index(['Age', 'Sex'], dtype='object'),
 pandas.core.frame.DataFrame)

Selecting multiple columns gives a DataFrame.

# filtering rows
above_35 = titanic[titanic['Age'] > 35 ]
above_35.head(), above_35.shape
(    PassengerId  Survived  Pclass  \
 1             2         1       1   
 6             7         0       1   
 11           12         1       1   
 13           14         0       3   
 15           16         1       2   

                                                  Name     Sex   Age  SibSp  \
 1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
 6                             McCarthy, Mr. Timothy J    male  54.0      0   
 11                           Bonnell, Miss. Elizabeth  female  58.0      0   
 13                        Andersson, Mr. Anders Johan    male  39.0      1   
 15                   Hewlett, Mrs. (Mary D Kingcome)   female  55.0      0   

     Parch    Ticket     Fare Cabin Embarked  
 1       0  PC 17599  71.2833   C85        C  
 6       0     17463  51.8625   E46        S  
 11      0    113783  26.5500  C103        S  
 13      5    347082  31.2750   NaN        S  
 15      0    248706  16.0000   NaN        S  ,
 (217, 12))
type(titanic['Age'] > 35) # the expression inside the brackets is itself a Series — a boolean mask aligned to the index
pandas.core.series.Series
condition = titanic['Age'] > 35
condition
0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
# membership test
titanic[titanic['Pclass'].isin([2,3])].head() # keep rows whose Pclass is 2 or 3
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
titanic[(titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)].head()
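The loc/iloc bullets from the key points are not demonstrated above. A minimal sketch, with invented data, shows the label vs. position distinction and assignment:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [22, 38, 26, 35], "Sex": ["male", "female", "female", "male"]}
)

# loc is label-based; slices INCLUDE the end label
by_label = df.loc[1:2, "Age"]      # rows labeled 1 and 2

# iloc is position-based; slices EXCLUDE the end position
by_position = df.iloc[1:3, 0]      # positions 1 and 2, first column

# loc/iloc can also be used to assign new values
df.loc[df["Age"] > 30, "Sex"] = "unknown"
```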

4. How do I create plots such as scatter plots?

pandas uses Matplotlib to draw all kinds of charts:

  • scatter plots

  • bar charts

  • box plots

  • and more

Key points:

  • The plot.* family of methods works on both Series and DataFrames
  • By default, each column is drawn as its own series
  • pandas plots are Matplotlib objects

Data

import pandas as pd
import matplotlib.pyplot as plt
data_source = "./data/air_quality_no2.csv"
air_quality = pd.read_csv(data_source, index_col=0, parse_dates=True) 
air_quality.head()
  • index_col makes the first column the index;
  • parse_dates controls whether dates are parsed into timestamps

4.1. Plotting the whole table with defaults

# plot
# all series
air_quality.plot()
<AxesSubplot:xlabel='datetime'>

4.2. Plotting a specific column

# a specific series
air_quality['station_paris'].plot()
<AxesSubplot:xlabel='datetime'>

4.3. Scatter plot

# scatter plot
air_quality.plot.scatter(x='station_london', y= 'station_paris',alpha=0.5)
<AxesSubplot:xlabel='station_london', ylabel='station_paris'>

4.4. All plot kinds

for i in dir(air_quality.plot):
    if not i.startswith("_"):
        print(i)
# the list-comprehension form below reads more nicely
# ----------------------------------------------------------       
[
    method_name 
    for method_name in dir(air_quality.plot) 
    if not method_name.startswith("_")
]
area
bar
barh
box
density
hexbin
hist
kde
line
pie
scatter





['area',
 'bar',
 'barh',
 'box',
 'density',
 'hexbin',
 'hist',
 'kde',
 'line',
 'pie',
 'scatter']
# box plot
air_quality.plot.box()
<AxesSubplot:>

4.5. Subplots

# separate subplots
axs = air_quality.plot.area(figsize=(12,4), subplots=True) 
type(axs)
numpy.ndarray
# 1. Create an empty Matplotlib Figure and Axes
fig, axs = plt.subplots(figsize= (12,4)) 
# 2. Use pandas to put the area plot on the prepared Figure/Axes
air_quality.plot.area(ax=axs)     
# 3. Do any matplotlib customization you like   
axs.set_ylabel("NO$_2$ concentration")
# 4. Save the Figure/Axes using the existing matplotlib method.
fig.savefig('4.1-no2_concentrations.png') 
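The four numbered steps above can be condensed into one self-contained sketch; the Agg backend and the inline stand-in data are assumptions for the example, so it runs without a display or the CSV file:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# invented stand-in data for air_quality
df = pd.DataFrame({"station_a": [20.0, 35.0, 28.0], "station_b": [18.0, 30.0, 25.0]})

fig, ax = plt.subplots(figsize=(12, 4))   # 1. empty Figure/Axes
df.plot.area(ax=ax)                        # 2. pandas draws onto the prepared Axes
ax.set_ylabel("NO$_2$ concentration")      # 3. any Matplotlib customization
fig.savefig("no2_sketch.png")              # 4. save via the Matplotlib method
```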

5. How do I derive new columns from existing ones?

  • No loops are needed

  • Computations operate on whole columns

For example, adding a new column to the original table.

Key points:

  • Create a new column with DataFrame['new column name']
  • No row-by-row loop is needed when adding a column
  • Rows and columns can be renamed via a dictionary or a function

Data

import pandas as pd
data_source = "data/air_quality_no2.csv"
air_quality = pd.read_csv(data_source, index_col=0, parse_dates=True)
air_quality.head()
station_antwerp station_paris station_london
datetime
--- --- --- ---
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN

5.1 Adding a column

  • No need to loop over every row
air_quality['london_mg_per_cubic'] = air_quality['station_london'] * 1.882
air_quality.head()
station_antwerp station_paris station_london london_mg_per_cubic
datetime
--- --- --- --- ---
2019-05-07 02:00:00 NaN NaN 23.0 43.286
2019-05-07 03:00:00 50.5 25.0 19.0 35.758
2019-05-07 04:00:00 45.0 27.7 19.0 35.758
2019-05-07 05:00:00 NaN 50.4 16.0 30.112
2019-05-07 06:00:00 NaN 61.9 NaN NaN
air_quality["ratio_paris_antwerp"] = (
    air_quality["station_paris"] / air_quality["station_antwerp"]) # invalid operations simply yield NaN
air_quality.head()
station_antwerp station_paris station_london london_mg_per_cubic ratio_paris_antwerp
datetime
--- --- --- --- --- ---
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN

Mathematical and logical operations are element-wise.
For more complex cases, the apply() function can be used.
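A sketch of both styles — plain element-wise arithmetic and apply() for logic that operators cannot express. The classification function here is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"station_london": [23.0, 19.0, 16.0]})

# element-wise arithmetic needs no loop
df["london_mg_per_cubic"] = df["station_london"] * 1.882

# apply() maps a function over every element of a column
def quality_band(value):            # hypothetical classification rule
    return "high" if value > 20 else "low"

df["band"] = df["station_london"].apply(quality_band)
```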

5.2 Renaming

  • via a dictionary
  • via a function
# dictionary
air_quality_renamed = air_quality.rename(
    columns={
        "station_antwerp": "BETR801",
        "station_paris": "FR04014",
        "station_london": "London Westminster",
    }
)
air_quality_renamed.head()
BETR801 FR04014 London Westminster london_mg_per_cubic ratio_paris_antwerp
datetime
--- --- --- --- --- ---
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN
air_quality_renamed = air_quality_renamed.rename(columns=str.lower)
air_quality_renamed.head()
betr801 fr04014 london westminster london_mg_per_cubic ratio_paris_antwerp
datetime
--- --- --- --- --- ---
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN
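Both rename flavors shown above — a dictionary mapping and a function — in one runnable sketch on a one-row frame:

```python
import pandas as pd

df = pd.DataFrame({"station_antwerp": [50.5], "station_paris": [25.0]})

# 1. dictionary: old name -> new name
renamed = df.rename(
    columns={"station_antwerp": "BETR801", "station_paris": "FR04014"}
)

# 2. function: applied to every column name
lowered = renamed.rename(columns=str.lower)
```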

6. How do I compute summary statistics?

  • Basic statistics are easy to compute, and apply to:

    • the whole dataset

    • rolling windows

    • grouped categories

  • You can also "split-apply-combine" (divide and conquer)

Key points:

  • Aggregations can be computed over whole rows and columns

  • groupby implements the split-apply-combine pattern

  • value_counts is a shortcut for counting occurrences of each category value

6.1 Aggregate statistics

import pandas as pd

data_source = "data/titanic.csv"
titanic = pd.read_csv(data_source)
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
titanic['Age'].mean()
29.69911764705882
titanic[['Age' ,'Fare']].median()
Age     28.0000
Fare    14.4542
dtype: float64
titanic[['Age' ,'Fare']].describe()
Age Fare
count 714.000000 891.000000
mean 29.699118 32.204208
std 14.526497 49.693429
min 0.420000 0.000000
25% 20.125000 7.910400
50% 28.000000 14.454200
75% 38.000000 31.000000
max 80.000000 512.329200
# If the defaults don't fit, use agg().

titanic.agg({
    "Age": ["min", "max","median","skew"],
    "Fare": ["min", "max", "median", "mean"],
})
Age Fare
min 0.420000 0.000000
max 80.000000 512.329200
median 28.000000 14.454200
skew 0.389108 NaN
mean NaN 32.204208

6.2. Grouped statistics by category

titanic[['Age' ,'Sex']].groupby('Sex').mean()
Age
Sex
--- ---
female 27.915709
male 30.726645

Computing statistics for each category within a column is a common task. The groupby method does exactly this.

This is the so-called:

  • split-apply-combine pattern
    • Split: partition the data into groups
    • Apply: compute on each group independently
    • Combine: merge the results into a single data structure

In pandas, the apply and combine steps are executed together.

If no columns are specified, as in the example above, the statistic is computed for every numeric column.

titanic.groupby("Sex").mean()
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
--- --- --- --- --- --- --- ---
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893

Grouped data also supports selecting columns with square brackets.

titanic.groupby("Sex")["Age"].mean()
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

[figure: 06_groupby_select_detail.svg — selecting a column from a grouped DataFrame]

For columns whose numeric values encode categories — like Pclass here, where 1, 2, 3 are cabin classes — computing a mean is meaningless. For such cases pandas provides the Categorical data type.

titanic.groupby(["Sex","Pclass"])["Fare"].mean() 
Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64

Statistics over combinations of several grouping columns can be computed in a single call.

6.3. Counting by category

When different values in a column represent different categories, use value_counts() to count them.

titanic["Pclass"].value_counts() # count passengers in each of the 3 classes
3    491
1    216
2    184
Name: Pclass, dtype: int64
titanic.groupby('Pclass')["Pclass"].count() # group by Pclass, then count() on Pclass
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64
titanic.groupby('Pclass')["Pclass"].size() # what is the difference from the line above?
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64

Differences between the methods:

  • size() includes missing values; it is the number of rows.
  • count() ignores missing values.
  • value_counts() has a dropna parameter that controls whether NaN is included.
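The three counting methods side by side on a tiny frame with one group containing missing ages (the data is invented for the sketch):

```python
import pandas as pd

df = pd.DataFrame(
    {"Pclass": [3, 1, 3, 3, 1], "Age": [22.0, None, 26.0, None, 35.0]}
)

per_class = df.groupby("Pclass")["Age"]

sizes = per_class.size()     # rows per group, missing Age included
counts = per_class.count()   # only non-missing Age values per group
freq = df["Pclass"].value_counts()  # shortcut: counts per class, most frequent first
```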

7. How do I reshape a table?

  • pivot() reshapes from long to wide

  • melt() reshapes from wide to long

Key points:

  • sort_values sorts by one or more columns

  • pivot reshapes the data; pivot tables also support aggregation

  • pivot (long to wide format) is the inverse of melt (wide to long format)

import pandas as pd 

data1 = "data/titanic.csv"
data2 = "data/air_quality_long.csv"

titanic = pd.read_csv(data1)
air_quality = pd.read_csv(data2, index_col= "date.utc", parse_dates=True)

titanic.head(), air_quality.head()
(   PassengerId  Survived  Pclass  \
 0            1         0       3   
 1            2         1       1   
 2            3         1       3   
 3            4         1       1   
 4            5         0       3   

                                                 Name     Sex   Age  SibSp  \
 0                            Braund, Mr. Owen Harris    male  22.0      1   
 1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
 2                             Heikkinen, Miss. Laina  female  26.0      0   
 3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
 4                           Allen, Mr. William Henry    male  35.0      0   

    Parch            Ticket     Fare Cabin Embarked  
 0      0         A/5 21171   7.2500   NaN        S  
 1      0          PC 17599  71.2833   C85        C  
 2      0  STON/O2. 3101282   7.9250   NaN        S  
 3      0            113803  53.1000  C123        S  
 4      0            373450   8.0500   NaN        S  ,
                                 city country location parameter  value   unit
 date.utc                                                                     
 2019-06-18 06:00:00+00:00  Antwerpen      BE  BETR801      pm25   18.0  µg/m³
 2019-06-17 08:00:00+00:00  Antwerpen      BE  BETR801      pm25    6.5  µg/m³
 2019-06-17 07:00:00+00:00  Antwerpen      BE  BETR801      pm25   18.5  µg/m³
 2019-06-17 06:00:00+00:00  Antwerpen      BE  BETR801      pm25   16.0  µg/m³
 2019-06-17 05:00:00+00:00  Antwerpen      BE  BETR801      pm25    7.5  µg/m³)

7.1. Sorting rows

  • sort_values()
  • sort_index()
titanic.sort_values(by='Age').head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
803 804 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C
755 756 1 2 Hamalainen, Master. Viljo male 0.67 1 1 250649 14.5000 NaN S
644 645 1 3 Baclini, Miss. Eugenie female 0.75 2 1 2666 19.2583 NaN C
469 470 1 3 Baclini, Miss. Helene Barbara female 0.75 2 1 2666 19.2583 NaN C
78 79 1 2 Caldwell, Master. Alden Gates male 0.83 0 2 248738 29.0000 NaN S
titanic.sort_values(by=['Pclass','Age'],ascending=False).head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q
280 281 0 3 Duane, Mr. Frank male 65.0 0 0 336439 7.7500 NaN Q
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
326 327 0 3 Nysveen, Mr. Johan Hansen male 61.0 0 0 345364 6.2375 NaN S

7.2. Long to wide

  • Long data: variables are not fully separated into their own columns; at least one variable's values repeat heavily and can be grouped into a few categories. The table is tall overall: few variables, many observations.
  • Wide data: every variable is separated into its own column; values do not repeat cyclically and cannot be grouped. The table is wide overall: many variables, few observations.
  • pivot()
    • parameters
      • columns: which column's values become the new columns
      • values: which column supplies the cell values
no2 = air_quality[air_quality['parameter']=='no2']
no2_subset = no2.sort_index().groupby(['location']).head(2)
no2_subset
no2_subset.pivot(columns='location',values="value")
city country parameter value unit
location BETR801 FR04014 London Westminster BETR801 FR04014 London Westminster BETR801 FR04014 London Westminster BETR801 FR04014 London Westminster BETR801 FR04014 London Westminster
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
date.utc
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
2019-04-09 01:00:00+00:00 Antwerpen Paris NaN BE FR NaN no2 no2 NaN 22.5 24.4 NaN µg/m³ µg/m³ NaN
2019-04-09 02:00:00+00:00 Antwerpen Paris London BE FR GB no2 no2 no2 53.5 27.4 67.0 µg/m³ µg/m³ µg/m³
2019-04-09 03:00:00+00:00 NaN NaN London NaN NaN GB NaN NaN no2 NaN NaN 67.0 NaN NaN µg/m³
no2.pivot(columns='location',values='value').plot()
<AxesSubplot:xlabel='date.utc'>
  • pivot_table()
    • use it when values need to be aggregated
air_quality.pivot_table(
    values='value', index='location', columns='parameter',aggfunc="mean"
)
parameter no2 pm25
location
--- --- ---
BETR801 26.950920 23.169492
FR04014 29.374284 NaN
London Westminster 29.740050 13.443568
air_quality.pivot_table(
    values='value', 
    index='location', 
    columns='parameter',
    aggfunc="mean",
    margins=True,
    margins_name= "合计"
)
parameter no2 pm25 合计
location
--- --- --- ---
BETR801 26.950920 23.169492 24.982353
FR04014 29.374284 NaN 29.374284
London Westminster 29.740050 13.443568 21.491708
合计 29.430316 14.386849 24.222743
air_quality.groupby(['parameter','location']).mean()
value
parameter location
--- --- ---
no2 BETR801 26.950920
FR04014 29.374284
London Westminster 29.740050
pm25 BETR801 23.169492
London Westminster 13.443568

7.3. Wide to long

no2_pivoted = no2.pivot(columns='location',values='value').reset_index()
no2_pivoted
location date.utc BETR801 FR04014 London Westminster
0 2019-04-09 01:00:00+00:00 22.5 24.4 NaN
1 2019-04-09 02:00:00+00:00 53.5 27.4 67.0
2 2019-04-09 03:00:00+00:00 54.5 34.2 67.0
3 2019-04-09 04:00:00+00:00 34.5 48.5 41.0
4 2019-04-09 05:00:00+00:00 46.5 59.5 41.0
... ... ... ... ...
1700 2019-06-20 20:00:00+00:00 NaN 21.4 NaN
1701 2019-06-20 21:00:00+00:00 NaN 24.9 NaN
1702 2019-06-20 22:00:00+00:00 NaN 26.5 NaN
1703 2019-06-20 23:00:00+00:00 NaN 21.8 NaN
1704 2019-06-21 00:00:00+00:00 NaN 20.0 NaN

1705 rows × 4 columns

no_2 = no2_pivoted.melt(id_vars='date.utc')
no_2.head()
date.utc location value
0 2019-04-09 01:00:00+00:00 BETR801 22.5
1 2019-04-09 02:00:00+00:00 BETR801 53.5
2 2019-04-09 03:00:00+00:00 BETR801 54.5
3 2019-04-09 04:00:00+00:00 BETR801 34.5
4 2019-04-09 05:00:00+00:00 BETR801 46.5
no_2 = no2_pivoted.melt(
    id_vars='date.utc',
    value_vars=["BETR801", "FR04014", "London Westminster"],
    value_name="NO_2",
    var_name="id_location",
)
no_2.head()
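The pivot/melt inverse relationship demonstrated above can be condensed into one self-contained round trip; the tiny frame is invented for the sketch:

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "date": ["d1", "d1", "d2", "d2"],
        "location": ["A", "B", "A", "B"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# long -> wide: one column per location
wide = long_df.pivot(index="date", columns="location", values="value")

# wide -> long: melt undoes the pivot
back = wide.reset_index().melt(id_vars="date", value_name="value")
```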

8. How do I combine multiple tables?

  • concat() stacks tables along an axis

  • merge() joins tables on common keys

Key points:

  • concat(): concatenate multiple tables row-wise or column-wise

  • merge(): use this for database-style joins

8.1 concat()

import pandas as pd

data1 = "data/air_quality_no2_long.csv"
data2 = "data/air_quality_pm25_long.csv"

air_quality_no2 = pd.read_csv(data1)
air_quality_pm25 = pd.read_csv(data2)
air_quality = pd.concat([air_quality_no2,air_quality_pm25], axis=0)
air_quality.head()
city country date.utc location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³
print('Shape of the ``air_quality_pm25`` table: ', air_quality_pm25.shape)
Shape of the ``air_quality_pm25`` table:  (1110, 7)
print('Shape of the ``air_quality_no2`` table: ', air_quality_no2.shape)
Shape of the ``air_quality_no2`` table:  (2068, 7)
print('Shape of the resulting ``air_quality`` table: ', air_quality.shape)
Shape of the resulting ``air_quality`` table:  (3178, 7)
air_quality = air_quality.sort_values('date.utc')
air_quality.head()
city country date.utc location parameter value unit
1109 London GB 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0 µg/m³
2067 London GB 2019-05-07 01:00:00+00:00 London Westminster no2 23.0 µg/m³
1098 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 no2 50.5 µg/m³
1003 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 25.0 µg/m³
100 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5 µg/m³
air_quality_ = pd.concat(
    [air_quality_pm25, air_quality_no2], 
    keys=["PM25", "NO2"]
)
air_quality_.head()
city country date.utc location parameter value unit
PM25 0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 µg/m³
1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 µg/m³
2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 µg/m³
3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 µg/m³
4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 µg/m³

8.2 Joining tables with a common identifier

stations_coord = pd.read_csv("data/air_quality_stations.csv")
stations_coord.head()
location coordinates.latitude coordinates.longitude
0 BELAL01 51.23619 4.38522
1 BELHB23 51.17030 4.34100
2 BELLD01 51.10998 5.00486
3 BELLD02 51.12038 5.02155
4 BELR833 51.32766 4.36226
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality.head()
city country date.utc location parameter value unit coordinates.latitude coordinates.longitude
0 London GB 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0 µg/m³ 51.49467 -0.13193
1 London GB 2019-05-07 01:00:00+00:00 London Westminster no2 23.0 µg/m³ 51.49467 -0.13193
2 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 no2 50.5 µg/m³ 51.20966 4.43182
3 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 25.0 µg/m³ 48.83724 2.39390
4 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 25.0 µg/m³ 48.83722 2.39390
air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")
air_quality_parameters.head()
id description name
0 bc Black Carbon BC
1 co Carbon Monoxide CO
2 no2 Nitrogen Dioxide NO2
3 o3 Ozone O3
4 pm10 Particulate matter less than 10 micrometers in... PM10
air_quality = pd.merge(air_quality, air_quality_parameters,
                       how='left', left_on='parameter', right_on='id')
air_quality.head()
city country date.utc location parameter value unit coordinates.latitude coordinates.longitude id description name
0 London GB 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0 µg/m³ 51.49467 -0.13193 pm25 Particulate matter less than 2.5 micrometers i... PM2.5
1 London GB 2019-05-07 01:00:00+00:00 London Westminster no2 23.0 µg/m³ 51.49467 -0.13193 no2 Nitrogen Dioxide NO2
2 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 no2 50.5 µg/m³ 51.20966 4.43182 no2 Nitrogen Dioxide NO2
3 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 25.0 µg/m³ 48.83724 2.39390 no2 Nitrogen Dioxide NO2
4 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 25.0 µg/m³ 48.83722 2.39390 no2 Nitrogen Dioxide NO2
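What how="left" does when a key has no match, in a minimal sketch (the frames and the unmatched key "XX999" are invented for the example):

```python
import pandas as pd

measurements = pd.DataFrame(
    {"location": ["BETR801", "FR04014", "XX999"], "value": [50.5, 25.0, 9.9]}
)
stations = pd.DataFrame(
    {"location": ["BETR801", "FR04014"], "latitude": [51.20966, 48.83724]}
)

# left join: every row of the left table is kept;
# keys without a match in the right table get NaN coordinates
merged = pd.merge(measurements, stations, how="left", on="location")
```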

9. How do I handle time series?

pandas handles time series data very well.

Key points

  • Valid date strings can be converted to datetime objects with the to_datetime function, or as part of a read_* function.

  • Datetime objects in pandas support calculations, logical operations, and convenient date-related properties through the dt accessor.

  • A DatetimeIndex carries these date-related properties and supports convenient slicing.

  • Resampling is a powerful way to change the frequency of a time series.

Data:

import pandas as pd
import matplotlib.pyplot as plt
air_quality = pd.read_csv("data/air_quality_no2_long.csv")
air_quality = air_quality.rename(columns={'date.utc': "datetime"})
air_quality.head()
city country datetime location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³
air_quality.city.unique()
array(['Paris', 'Antwerpen', 'London'], dtype=object)

9.1. Using pandas datetime properties

The datetime column is plain text; convert it to datetime objects.

air_quality['datetime'] = pd.to_datetime(air_quality['datetime'])
air_quality['datetime'].dtype
datetime64[ns, UTC]

When reading the file, the parse_dates= parameter converts the textual dates into timestamps, which can then be handled as datetime objects.

aq = pd.read_csv("data/air_quality_no2_long.csv", parse_dates=["date.utc"])
aq['date.utc'].dtypes
aq = aq.rename(columns={'date.utc': "datetime"})  # after renaming, assign back to the DataFrame
aq['datetime']
0      2019-06-21 00:00:00+00:00
1      2019-06-20 23:00:00+00:00
2      2019-06-20 22:00:00+00:00
3      2019-06-20 21:00:00+00:00
4      2019-06-20 20:00:00+00:00
                  ...           
2063   2019-05-07 06:00:00+00:00
2064   2019-05-07 04:00:00+00:00
2065   2019-05-07 03:00:00+00:00
2066   2019-05-07 02:00:00+00:00
2067   2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]

Once converted to timestamps, time-based operations become available.

For example, the start and end times:

air_quality["datetime"].min(), air_quality["datetime"].max()
(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
 Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))
air_quality["datetime"].max() - air_quality["datetime"].min()
Timedelta('44 days 23:00:00')

The result is a pandas.Timedelta object, similar to datetime.timedelta from the standard Python library, which defines a duration.

Remember the split-apply-combine pattern that groupby provided in the statistics tutorial? Here we want a given statistic (e.g. the mean NO_2) for each weekday and each measurement location. To group on weekdays we use the weekday property of the pandas Timestamp (Monday=0, Sunday=6), which is also reachable through the dt accessor. Grouping on both location and weekday splits the mean calculation over each of these combinations.

air_quality.groupby(
    [air_quality['datetime'].dt.weekday, 'location']
)['value'].mean()
datetime  location          
0         BETR801               27.875000
          FR04014               24.856250
          London Westminster    23.969697
1         BETR801               22.214286
          FR04014               30.999359
          London Westminster    24.885714
2         BETR801               21.125000
          FR04014               29.165753
          London Westminster    23.460432
3         BETR801               27.500000
          FR04014               28.600690
          London Westminster    24.780142
4         BETR801               28.400000
          FR04014               31.617986
          London Westminster    26.446809
5         BETR801               33.500000
          FR04014               25.266154
          London Westminster    24.977612
6         BETR801               21.896552
          FR04014               23.274306
          London Westminster    24.859155
Name: value, dtype: float64

Similarly, to compute a given statistic (e.g. the mean NO_2) for each hour of the day, we can again use the split-apply-combine approach, this time with the hour property of the pandas Timestamp, also reachable through the dt accessor.

# plot the chart
fig, axs = plt.subplots(figsize=(12, 4))

air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
    kind='bar', rot=0, ax=axs
)

plt.xlabel("Hour of the day")
plt.ylabel("$NO_2 (µg/m^3)$")
Text(0, 0.5, '$NO_2 (µg/m^3)$')

9.2. Datetime as index

In the reshaping tutorial (section 7), pivot() was used to reshape the table so that each measurement location becomes a separate column:

no_2 = air_quality.pivot(index='datetime',columns='location',values='value')
no_2.head()
location BETR801 FR04014 London Westminster
datetime
--- --- --- ---
2019-05-07 01:00:00+00:00 50.5 25.0 23.0
2019-05-07 02:00:00+00:00 45.0 27.7 19.0
2019-05-07 03:00:00+00:00 NaN 50.4 19.0
2019-05-07 04:00:00+00:00 NaN 61.9 16.0
2019-05-07 05:00:00+00:00 NaN 72.4 NaN

By pivoting, the datetime information became the index of the table. In general, a column can be set as the index with the set_index function.

Working with a datetime index (a DatetimeIndex) provides powerful functionality. For example, we do not need the dt accessor to get time series properties; they are available directly on the index:

no_2.index.year, no_2.index.weekday
(Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
             ...
             2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
            dtype='int64', name='datetime', length=1033),
 Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             ...
             3, 3, 3, 3, 3, 3, 3, 3, 3, 4],
            dtype='int64', name='datetime', length=1033))

Other advantages include convenient slicing by time period and changing the time resolution. For example:

no_2["2019-05-20":"2019-05-21"].plot();

Providing a string that parses as a datetime selects a specific subset of the data on a DatetimeIndex.
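Partial-string slicing in a self-contained form; the hourly range and values are invented for the sketch:

```python
import pandas as pd

# three days of hourly data
idx = pd.date_range("2019-05-19", periods=72, freq="h")
ts = pd.Series(range(72), index=idx)

# date-like strings slice the DatetimeIndex;
# both endpoint days are included in full
window = ts["2019-05-20":"2019-05-21"]
```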

9.3. Resampling to a different time frequency

monthly_max = no_2.resample("M").max()
monthly_max
location BETR801 FR04014 London Westminster
datetime
--- --- --- ---
2019-05-31 00:00:00+00:00 74.5 97.0 97.0
2019-06-30 00:00:00+00:00 52.5 84.7 52.0

The resample() method is similar to a groupby operation:

  • it groups by time, using a string that defines the target frequency (e.g. M, 5H, ...)

  • it requires an aggregation function, such as mean, max, ...

Once defined, the frequency of the time series is available via the freq attribute:

monthly_max.index.freq
<MonthEnd>
no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));
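The "time-based groupby + aggregation" reading of resample() in a minimal sketch (ten invented daily values, aggregated into two-day bins):

```python
import pandas as pd

idx = pd.date_range("2019-05-01", periods=10, freq="D")
ts = pd.Series([1.0] * 5 + [3.0] * 5, index=idx)

# group into two-day bins, then aggregate each bin with mean()
two_day_mean = ts.resample("2D").mean()
```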

10. How do I manipulate textual data?

pandas is equally capable with string data.

Key points:

  • String methods are available through the str accessor.

  • String methods operate element-wise and can be used for conditional indexing.

  • The replace method is a convenient way to convert values according to a given dictionary.

Data:

import pandas as pd

titanic = pd.read_csv("data/titanic.csv")
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Worked examples

Q1: Make all name characters lowercase.

titanic["Name"].str.lower()
0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

To make every string in the Name column lowercase, select the Name column (see the tutorial on subsetting data), add the str accessor, and apply the lower method. Each string is converted element-wise.

Just as datetime objects have a dt accessor (see the time series tutorial), a number of specialized string methods are available through the str accessor. Their names generally match the equivalent built-in string methods for single elements, but they are applied element-wise to every value in the column (remember element-wise computations?).

Q2: Create a new column Surname containing the passenger's surname, by extracting the part before the comma.

titanic["Name"].str.split(",")
0                             [Braund,  Mr. Owen Harris]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                              [Heikkinen,  Miss. Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...                        
886                             [Montvila,  Rev. Juozas]
887                      [Graham,  Miss. Margaret Edith]
888          [Johnston,  Miss. Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object

With the Series.str.split() method, each value is returned as a list of two elements: the part before the comma and the part after it.

titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
titanic['Surname']
0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
886     Montvila
887       Graham
888     Johnston
889         Behr
890       Dooley
Name: Surname, Length: 891, dtype: object

Since we are only interested in the first part, the surname (element 0), we can use the str accessor again and apply Series.str.get() to extract it. In fact, these string functions can be chained to combine multiple operations in one go!

Q3: Extract the passenger data for the Countess aboard the Titanic.

titanic["Name"].str.contains("Countess")
0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Name, Length: 891, dtype: bool
titanic[titanic['Name'].str.contains('Countess')]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Surname
759 760 1 1 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.0 0 0 110152 86.5 B77 S Rothes

The string method Series.str.contains() checks each value in the Name column and returns True (Countess is part of the name) or False (it is not). This output can be used to sub-select the data via the conditional (boolean) indexing introduced in the subsetting tutorial. Since there was only one Countess on the Titanic, we get back a single row.

The Series.str.extract() method accepts regular expressions.
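A small sketch of Series.str.extract() with a regular expression; the pattern here is an assumption for the example, capturing the title between the comma and the period:

```python
import pandas as pd

names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Montvila, Rev. Juozas",
    ]
)

# one capture group -> one extracted value per element;
# expand=False returns a Series instead of a one-column DataFrame
titles = names.str.extract(r", (\w+)\.", expand=False)
```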

Q4: Which passenger on the Titanic has the longest name?

titanic['Name'].str.len()
0      23
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name, Length: 891, dtype: int64

To find the longest name, we first need the length of each entry in the Name column. Using the pandas string method Series.str.len(), the length is computed for each name element-wise.

titanic['Name'].str.len().idxmax()
307

Next we need the position in the table, ideally the index label, of the row with the greatest name length. The idxmax() method does exactly that. It is not a string method; it operates on the integer lengths, so str is not used.

titanic.loc[titanic['Name'].str.len().idxmax(),'Name']
'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'

With the row label (307) and the column name (Name), we can select the value using the loc operator, introduced in the subsetting tutorial.

Q5: In the Sex column, replace the value "male" with "M" and "female" with "F".

titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
titanic['Sex_short']
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Sex_short, Length: 891, dtype: object

Although replace() is not a string method, it provides a convenient way to convert values using a mapping or lookup table. It requires a dictionary defining the {from: to} mapping.

posted @ 2022-05-27 22:14  王大桃zzZ