
A Concise pandas Tutorial

I. Introduction to pandas

pandas is an open-source Python library built specifically for data analysis; today many professionals who analyze data with Python use it as their basic tool. pandas is designed and developed on top of NumPy, which underlies a large number of Python scientific-computing libraries, and this foundation gives pandas high computational performance.

pandas is a rich library; this article covers:

  • Getting to know pandas
  • The two data structures
  • Applying functions
  • Merging, grouping, splitting, reshaping
  • Handling missing and duplicate values
  • Filtering data
  • File handling
  • A worked example

II. Getting to Know pandas

Let's start with an example. Suppose we have data like this:

data = [
    {"name": "apple", "price": 6, "quantity": 100},
    {"name": "pear", "price": 8.5, "quantity": 230},
    {"name": "banana", "price": 6, "quantity": 150},
]

Suppose we want to add 1 to each price and then compute the average price.

Implemented with plain Python:

data = [
    {"name": "apple", "price": 6, "quantity": 100},
    {"name": "pear", "price": 8.5, "quantity": 230},
    {"name": "banana", "price": 6, "quantity": 150},
]

# price + 1
# price average
total_price = 0
for item in data:
    item["price"] = item.get("price", 0) + 1
    total_price += item["price"]
avr = total_price / len(data)
print(avr)

Implemented with pandas:

data = [
    {"name": "apple", "price": 6, "quantity": 100},
    {"name": "pear", "price": 8.5, "quantity": 230},
    {"name": "banana", "price": 6, "quantity": 150},
]
import pandas as pd

df = pd.DataFrame(data)
print(df)
# price + 1
df["price"] = df["price"] + 1
# price average
avr_price = df["price"].mean()
print(avr_price)

The plain-Python version needs at least one explicit loop over the data, which gets expensive as the dataset grows. For large datasets pandas is the more sensible choice, and it also offers a very rich API for data processing.

III. The Two Data Structures

pandas revolves around two data structures: Series and DataFrame.

(1) Series

A labeled one-dimensional array that can hold integers, floats, strings, Python objects, and so on. Its characteristics:

  • behaves like a one-dimensional array
  • consists of data plus an index (index on the left, data on the right; the index is created automatically)

1. Creation

import pandas as pd

# create from a list
ser_obj = pd.Series([1, 2, 3, 4, 5])
print(ser_obj)
print(type(ser_obj))

# create from a dict
ser_obj1 = pd.Series({"a": 1, "b": 2, "c": 3})
print(ser_obj1)
print(type(ser_obj1))

0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'pandas.core.series.Series'>

a    1
b    2
c    3
dtype: int64
<class 'pandas.core.series.Series'>

2. Operations

  • Index operations
print(ser_obj.index)  # RangeIndex(start=0, stop=5, step=1)
print(ser_obj1.index)  # Index(['a', 'b', 'c'], dtype='object')

Iterating over the index yields each label; indexing and slicing also work.
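As a quick sketch of label-based and positional access (toy data, not from the examples above):

```python
import pandas as pd

ser = pd.Series({"a": 1, "b": 2, "c": 3})

print(ser.loc["b"])    # label-based access
print(ser.iloc[0:2])   # positional slice: first two elements

# iterating over the index yields each label
for label in ser.index:
    print(label, ser[label])
```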

  • Value operations
print(ser_obj.values) # [1 2 3 4 5]
print(ser_obj1.values) # [1 2 3]
  • Previewing data with head
print(ser_obj.head()) # head(2) previews the first 2 rows

0    1
1    2
2    3
3    4
4    5
dtype: int64

(2) DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns can have different types; it can be thought of as a collection of Series. Its characteristics:

  • similar to a multidimensional array / tabular data
  • each column can have a different type
  • it has both a column index and a row index

1. Creation

import pandas as pd
import numpy as np

# build a DataFrame from an ndarray
df = pd.DataFrame(np.arange(12).reshape(3, 4))
print(df)
print(type(df))

# build a DataFrame from a dict
# the default index is 0, 1, 2;
# an index can be given explicitly: pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a","b","c"])
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})
print(df)
print(type(df))

# build a DataFrame from Series objects
data = {"one": pd.Series([1, 2, 3], index=["a", "b", "c"]), "two": pd.Series([4, 5, 6], index=["a", "b", "c"])}
df = pd.DataFrame(data)
print(df)
print(type(df))

Output:

   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
<class 'pandas.core.frame.DataFrame'>

   one  two
0    1    4
1    2    5
2    3    6
<class 'pandas.core.frame.DataFrame'>

   one  two
a    1    4
b    2    5
c    3    6
<class 'pandas.core.frame.DataFrame'>

2. Operations

  • Row operations
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})
print(df)

# get a row
print(df[0:1])

# append a row (df.append was removed in pandas 2.0; use pd.concat there)
df = df.append({"one": 4, "two": 7}, ignore_index=True)
print(df)

# delete a row
df.drop([0], axis=0, inplace=True)  # the first argument specifies the index labels
print(df)

  • Column operations
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})
print(df)

# get a column
print(df["one"])

# add a column
df["three"] = [7, 8, 9]

# delete a column
df.drop(["three"], axis=1, inplace=True)

  • Block operations
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})
print(df)

# get a block
df1 = df.loc[0, ["one"]]
print(df1)
print(type(df1))

# modify a single value
df.loc[0, ["one"]] = 10  # 0 is the row index, "one" is the column label
print(df)

IV. Applying Functions

  • apply
  • applymap
  • sort_index、sort_values
  • count、describe、min、max、sum、mean

(1) apply

apply applies a function along rows or columns.

import pandas as pd

data = [
    {"name": "apple", "price": 6, "quantity": 100},
    {"name": "pear", "price": 8.5, "quantity": 230},
    {"name": "banana", "price": 6, "quantity": 150},
]

df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
"""
     name  price  quantity
a   apple    6.0       100
b    pear    8.5       230
c  banana    6.0       150
"""


# Example 1
def add_price_quantity(row):
    row["price"] = row["price"] + 1
    row["quantity"] = row["quantity"] + 100
    return row


# apply
df1 = df.apply(add_price_quantity, axis=1)
print(df1)
"""
     name  price  quantity
a   apple    7.0       200
b    pear    9.5       330
c  banana    7.0       250
"""

# Example 2
df2 = df["price"].apply(lambda x: x + 10)
print(df2)
"""
a    16.0
b    18.5
c    16.0
Name: price, dtype: float64
"""

(2) applymap

applymap applies a function to every individual cell (in pandas 2.1+ this method has been renamed DataFrame.map).

import pandas as pd

data = [
    {"name": "apple", "price": 6, "quantity": 100},
    {"name": "pear", "price": 8.5, "quantity": 230},
    {"name": "banana", "price": 6, "quantity": 150},
]

df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)


def handle_cell(cell):
    if isinstance(cell, str):
        cell = cell + "_fruit"
    return cell


# applymap
df1 = df.applymap(handle_cell)
print(df1)
"""
           name  price  quantity
a   apple_fruit    6.0       100
b    pear_fruit    8.5       230
c  banana_fruit    6.0       150
"""

(3) sort_index and sort_values

  • sort_index sorts by index
  • sort_values(by="label") sorts by the values under the given label(s)

For both, the axis parameter selects the axis to sort along.
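For instance, axis=1 sorts the column labels rather than the row index (a minimal sketch with made-up columns):

```python
import pandas as pd

df = pd.DataFrame({"b": [3, 1], "a": [2, 4]})

# axis=0 (default) sorts the row index; axis=1 sorts the column labels
print(df.sort_index(axis=1))
```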

import pandas as pd

data = [
    {"name": "apple", "price": 6, "quantity": 100},
    {"name": "pear", "price": 8.5, "quantity": 230},
    {"name": "banana", "price": 6, "quantity": 150},
]

df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
"""
     name  price  quantity
a   apple    6.0       100
b    pear    8.5       230
c  banana    6.0       150
"""

df1 = df.sort_index(ascending=False, axis=0)  # index in descending order
print(df1)
"""
     name  price  quantity
c  banana    6.0       150
b    pear    8.5       230
a   apple    6.0       100
"""

df2 = df.sort_values(by=["price"])
print(df2)
"""
     name  price  quantity
a   apple    6.0       100
c  banana    6.0       150
b    pear    8.5       230
"""

(4) count, describe, min, max, sum, mean

Commonly used statistical methods:

  • count: number of non-null values
  • describe: summary statistics for a Series or for each DataFrame column
  • min, max: minimum and maximum
  • sum: sum of the values
  • mean: average of the values
import pandas as pd

data = [
    {"name": "apple", "price": 6, "quantity": 100},
    {"name": "pear", "price": 8.5, "quantity": 230},
    {"name": "banana", "price": 6, "quantity": 150},
]

df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)

print(df.count())
print(df.describe())
print(df["price"].min())
print(df["price"].max())
print(df["price"].sum())
print(df["price"].mean())

Output:

# df
     name  price  quantity
a   apple    6.0       100
b    pear    8.5       230
c  banana    6.0       150

# count
name        3
price       3
quantity    3
dtype: int64

# describe
          price    quantity
count  3.000000    3.000000
mean   6.833333  160.000000
std    1.443376   65.574385
min    6.000000  100.000000
25%    6.000000  125.000000
50%    6.000000  150.000000
75%    7.250000  190.000000
max    8.500000  230.000000

# price min / max / sum / mean
6.0
8.5
20.5
6.833333333333333

V. Merging, Grouping, Splitting, Reshaping

(1) Merging

pandas has several built-in methods for combining datasets, such as:

  • pd.concat stacks multiple objects along an axis
  • pd.merge joins multiple datasets on one or more keys
  • df.append appends a Series or DataFrame vertically (deprecated; removed in pandas 2.0)

1. pd.concat

  • Concatenating Series
    >>> s1 = pd.Series(['a', 'b'])
    >>> s2 = pd.Series(['c', 'd'])
    >>> pd.concat([s1, s2])
    0    a
    1    b
    0    c
    1    d
    dtype: object

    Clear the existing index and reset it in the result
    by setting the ``ignore_index`` option to ``True``.

    >>> pd.concat([s1, s2], ignore_index=True)
    0    a
    1    b
    2    c
    3    d
    dtype: object
  • Concatenating DataFrames
    # row-wise concatenation
    Combine two ``DataFrame`` objects with identical columns.

    >>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
    ...                    columns=['letter', 'number'])
    >>> df1
      letter  number
    0      a       1
    1      b       2
    >>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
    ...                    columns=['letter', 'number'])
    >>> df2
      letter  number
    0      c       3
    1      d       4
    >>> pd.concat([df1, df2])
      letter  number
    0      a       1
    1      b       2
    0      c       3
    1      d       4
    
    # column-wise concatenation
    >>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
    ...                    columns=['animal', 'name'])
    >>> pd.concat([df1, df4], axis=1)
      letter  number  animal    name
    0      a       1    bird   polly
    1      b       2  monkey  george

2. pd.merge

def merge(
    left: DataFrame | Series,
    right: DataFrame | Series,
    how: str = "inner",
    on: IndexLabel | None = None,
    left_on: IndexLabel | None = None,
    right_on: IndexLabel | None = None,
    left_index: bool = False,
    right_index: bool = False,
    sort: bool = False,
    suffixes: Suffixes = ("_x", "_y"),
    copy: bool = True,
    indicator: bool = False,
    validate: str | None = None,
) -> DataFrame:

left, right: the objects to join

how: the join type ("inner", "left", "right", "outer", "cross")

on: the column label(s) to join on; if not given, the columns common to both objects are used as the keys

left_on: name of the join key on the left side

right_on: name of the join key on the right side

left_index: use the left object's row index as its join key

right_index: use the right object's row index as its join key

sort: sort the merged data; setting it to False improves performance

suffixes: tuple of strings appended to column names that the left and right DataFrames share

copy: copy the data into the new structure; setting it to False can improve performance

indicator: show which source each merged row came from
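A small sketch of how and indicator together (toy frames, not the ones used below):

```python
import pandas as pd

df1 = pd.DataFrame({"letter1": ["a", "b"], "number": [1, 2]})
df2 = pd.DataFrame({"letter2": ["c", "d"], "number": [3, 2]})

# how="left" keeps every row of the left frame;
# indicator=True adds a _merge column naming each row's origin
df3 = pd.merge(df1, df2, on="number", how="left", indicator=True)
print(df3)
```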

  • Default behavior
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter1', 'number'])

print(df1)
"""
  letter1  number
0       a       1
1       b       2
"""
df2 = pd.DataFrame([['c', 3], ['d', 2]],
                   columns=['letter2', 'number'])
print(df2)
"""
  letter2  number
0       c       3
1       d       2
"""

df3 = pd.merge(df1, df2)  # the key can be given explicitly: pd.merge(df1, df2, on="number")
print(df3)
"""
  letter1  number letter2
0       b       2       d
"""

  • left_on and right_on
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter1', 'number1'])

print(df1)
"""
  letter1  number1
0       a       1
1       b       2
"""
df2 = pd.DataFrame([['c', 3], ['d', 2]],
                   columns=['letter2', 'number2'])
print(df2)
"""
  letter2  number2
0       c       3
1       d       2
"""

df3 = pd.merge(df1, df2, left_on="number1", right_on="number2")
print(df3)
"""
  letter1  number1 letter2  number2
0       b        2       d        2
"""

  • left_index and right_index
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter1', 'number'])

print(df1)
"""
  letter1  number
0       a       1
1       b       2
"""
df2 = pd.DataFrame([['c', 3], ['d', 2]],
                   columns=['letter2', 'number'])
print(df2)
"""
  letter2  number
0       c       3
1       d       2
"""

df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print(df3)
"""
  letter1  number_x letter2  number_y
0       a         1       c         3
1       b         2       d         2
"""

3. df.append

  • Parameters
    def append(
        self,
        other,
        ignore_index: bool = False,
        verify_integrity: bool = False,
        sort: bool = False,
    ) -> DataFrame:
  • Example
        Examples
        --------
        >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
        >>> df
           A  B
        x  1  2
        y  3  4
        >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])
        >>> df.append(df2)
           A  B
        x  1  2
        y  3  4
        x  5  6
        y  7  8

        With `ignore_index` set to True:

        >>> df.append(df2, ignore_index=True)
           A  B
        0  1  2
        1  3  4
        2  5  6
        3  7  8

        The following, while not recommended methods for generating DataFrames,
        show two ways to generate a DataFrame from multiple data sources.

        Less efficient:

        >>> df = pd.DataFrame(columns=['A'])
        >>> for i in range(5):
        ...     df = df.append({'A': i}, ignore_index=True)
        >>> df
           A
        0  0
        1  1
        2  2
        3  3
        4  4

        More efficient:

        >>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
        ...           ignore_index=True)
           A
        0  0
        1  1
        2  2
        3  3
        4  4
        

(2) Grouping

  • Using statistical methods
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by="a").sum()
       b      c
a
a  13.0   13.0
b  12.3  123.0
  • apply
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby("Animal", group_keys=True).apply(lambda x: x)
          Animal  Max Speed
Animal
Falcon 0  Falcon      380.0
       1  Falcon      370.0
Parrot 2  Parrot       24.0
       3  Parrot       26.0

>>> df.groupby("Animal", group_keys=False).apply(lambda x: x)
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
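Beyond sum and apply, a grouped object can compute several statistics at once with agg. A sketch reusing the Animal frame above:

```python
import pandas as pd

df = pd.DataFrame({"Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
                   "Max Speed": [380.0, 370.0, 24.0, 26.0]})

# several statistics per group in one call
res = df.groupby("Animal")["Max Speed"].agg(["mean", "max", "min"])
print(res)
```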

(3) Splitting

Splitting is done with slicing, i.e. taking blocks of the data:

import pandas as pd

data = [
    {"name": "apple", "price": 6, "quantity": 100},
    {"name": "pear", "price": 8.5, "quantity": 230},
    {"name": "banana", "price": 6, "quantity": 150},
]

df = pd.DataFrame(data)
print(df)

df1 = df[0:1][["name", "price"]]
print(df1)  # row 0 only
"""
    name  price
0  apple    6.0
"""

df2 = df[1:][["name", "price"]]
print(df2)  # rows from 1 onward
"""
     name  price
1    pear    8.5
2  banana    6.0
"""

(4) Reshaping

1. Reshaping with a hierarchical index (stack/unstack)

import pandas as pd

data = [
    {"日期": "2022-01-12", "最高气温": 15, "最低气温": 8},
    {"日期": "2022-02-12", "最高气温": 12, "最低气温": 7},
    {"日期": "2022-02-15", "最高气温": 11, "最低气温": 8},
    {"日期": "2022-03-09", "最高气温": 15, "最低气温": 13},
    {"日期": "2022-03-13", "最高气温": 19, "最低气温": 15},
    {"日期": "2022-06-18", "最高气温": 28, "最低气温": 22},
]

df = pd.DataFrame(data)

# pivot columns into rows, producing a Series with a hierarchical index
df1 = df.stack()
print(df1)
print(type(df1))
"""
0  日期      2022-01-12
   最高气温            15
   最低气温             8
1  日期      2022-02-12
   最高气温            12
   最低气温             7
2  日期      2022-02-15
   最高气温            11
   最低气温             8
3  日期      2022-03-09
   最高气温            15
   最低气温            13
4  日期      2022-03-13
   最高气温            19
   最低气温            15
5  日期      2022-06-18
   最高气温            28
   最低气温            22
dtype: object
<class 'pandas.core.series.Series'>
"""

df2 = df1.unstack()
print(df2)
print(type(df2))
"""
           日期 最高气温 最低气温
0  2022-01-12   15    8
1  2022-02-12   12    7
2  2022-02-15   11    8
3  2022-03-09   15   13
4  2022-03-13   19   15
5  2022-06-18   28   22
<class 'pandas.core.frame.DataFrame'>
"""

2. Pivot tables

  • Parameters
def pivot_table(
    data: DataFrame,
    values=None,
    index=None,
    columns=None,
    aggfunc: AggFuncType = "mean",
    fill_value=None,
    margins: bool = False,
    dropna: bool = True,
    margins_name: str = "All",
    observed: bool = False,
    sort: bool = True,
) -> DataFrame:

The index parameter defines the grouping; the grouped values are then aggregated, by default with the mean.

import pandas as pd

data = [
    {"日期": "2022-01-12", "最高气温": 15, "最低气温": 8, "天气": "晴", "风向": "东北风"},
    {"日期": "2022-02-12", "最高气温": 12, "最低气温": 7, "天气": "晴", "风向": "东北风"},
    {"日期": "2022-02-15", "最高气温": 11, "最低气温": 8, "天气": "多云", "风向": "西南风"},
    {"日期": "2022-03-09", "最高气温": 15, "最低气温": 13, "天气": "晴", "风向": "西北风"},
    {"日期": "2022-03-13", "最高气温": 19, "最低气温": 15, "天气": "小雨", "风向": "北风"},
    {"日期": "2022-06-18", "最高气温": 28, "最低气温": 22, "天气": "小雨", "风向": "西南风"},
]

df = pd.DataFrame(data)
print(df)
"""
           日期  最高气温  最低气温  天气   风向
0  2022-01-12    15     8   晴  东北风
1  2022-02-12    12     7   晴  东北风
2  2022-02-15    11     8  多云  西南风
3  2022-03-09    15    13   晴  西北风
4  2022-03-13    19    15  小雨   北风
5  2022-06-18    28    22  小雨  西南风
"""

df = df.set_index("日期")
print(df)
"""
            最高气温  最低气温  天气   风向
日期                             
2022-01-12    15     8   晴  东北风
2022-02-12    12     7   晴  东北风
2022-02-15    11     8  多云  西南风
2022-03-09    15    13   晴  西北风
2022-03-13    19    15  小雨   北风
2022-06-18    28    22  小雨  西南风
"""

df1 = pd.pivot_table(df, values=["最高气温", "最低气温"], index=["天气"])
print(df1)
"""
         最低气温  最高气温
天气                 
多云   8.000000  11.0
小雨  18.500000  23.5
晴    9.333333  14.0
"""

df2 = pd.pivot_table(df, values=["最高气温", "最低气温"], index=["天气"], columns=["风向"])
print(df2)
"""
   最低气温                    最高气温                  
风向  东北风    北风   西北风   西南风   东北风    北风   西北风   西南风
天气                                               
多云  NaN   NaN   NaN   8.0   NaN   NaN   NaN  11.0
小雨  NaN  15.0   NaN  22.0   NaN  19.0   NaN  28.0
晴   7.5   NaN  13.0   NaN  13.5   NaN  15.0   NaN
"""

VI. Handling Missing and Duplicate Values

(1) Missing values

When the data contains NaN (missing) values, they can be replaced with other values using the df.fillna() method.

1. Checking for missing values

import pandas as pd

data = [
    {"日期": "2022-01-12", "最高气温": 15, "最低气温": 8, "天气": "晴", "风向": "东北风"},
    {"日期": "2022-02-12", "最高气温": 12, "最低气温": None, "天气": "晴", "风向": "东北风"},
    {"日期": "2022-02-15", "最高气温": 11, "最低气温": 8, "天气": "多云", "风向": "西南风"},
    {"日期": "2022-03-09", "最高气温": None, "最低气温": 13, "天气": "晴", "风向": "西北风"},
    {"日期": "2022-03-13", "最高气温": 19, "最低气温": 15, "天气": "小雨", "风向": "北风"},
    {"日期": "2022-06-18", "最高气温": 28, "最低气温": 22, "天气": "小雨", "风向": "西南风"},
]

df = pd.DataFrame(data)
res = df.isna()  # DataFrame.isnull is an alias for DataFrame.isna
print(res)
"""
      日期   最高气温   最低气温     天气     风向
0  False  False  False  False  False
1  False  False   True  False  False
2  False  False  False  False  False
3  False   True  False  False  False
4  False  False  False  False  False
5  False  False  False  False  False
"""

2. df.dropna(axis=0): drop rows

df.dropna(axis=0)

3. df.dropna(axis=1): drop columns

df.dropna(axis=1)
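A minimal runnable sketch of both axes (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3], "b": [4, 5, 6]})

print(df.dropna(axis=0))  # drops the row containing NaN, keeping rows 0 and 2
print(df.dropna(axis=1))  # drops the column containing NaN, keeping only "b"
```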

4. df.fillna("missing"): fill missing values

  • Parameters
    def fillna(  # type: ignore[override]
        self,
        value: Hashable | Mapping | Series | DataFrame = None,
        method: FillnaOptions | None = None,
        axis: Axis | None = None,
        inplace: bool = False,
        limit: int | None = None,
        downcast: dict | None = None,
    ) -> DataFrame | None:
        return super().fillna(
            value=value,
            method=method,
            axis=axis,
            inplace=inplace,
            limit=limit,
            downcast=downcast,
        )

The method parameter selects the fill strategy:

df.fillna("missing") # replace NaN with a string
df.fillna(method="pad") # fill with the previous value (forward fill)
df.fillna(method="bfill", limit=1) # fill with the next value; limit caps how many NaNs per column may be filled
df.fillna(df.mean(numeric_only=True)) # fill with a statistic
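The same strategies on a small Series, as a runnable sketch (note that in pandas 2.x the method= keyword is deprecated in favor of the ffill()/bfill() methods):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None])

print(s.fillna(0))         # constant fill
print(s.ffill())           # forward fill, same effect as method="pad"
print(s.bfill(limit=1))    # backward fill; fills at most one consecutive NaN
print(s.fillna(s.mean()))  # fill with a statistic (mean of 1.0 and 3.0 is 2.0)
```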

(2) Duplicate values

1. Detecting duplicate rows (df.duplicated)

The method returns a boolean Series that flags whether each row is a duplicate:

    def duplicated(
        self,
        subset: Hashable | Sequence[Hashable] | None = None,
        keep: Literal["first", "last", False] = "first",
    ) -> Series:
        Examples
        --------
        Consider dataset containing ramen rating.

        >>> df = pd.DataFrame({
        ...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
        ...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
        ...     'rating': [4, 4, 3.5, 15, 5]
        ... })
        >>> df
            brand style  rating
        0  Yum Yum   cup     4.0
        1  Yum Yum   cup     4.0
        2  Indomie   cup     3.5
        3  Indomie  pack    15.0
        4  Indomie  pack     5.0

        By default, for each set of duplicated values, the first occurrence
        is set on False and all others on True.

        >>> df.duplicated()
        0    False
        1     True
        2    False
        3    False
        4    False
        dtype: bool

        By using 'last', the last occurrence of each set of duplicated values
        is set on False and all others on True.

        >>> df.duplicated(keep='last')
        0     True
        1    False
        2    False
        3    False
        4    False
        dtype: bool

        By setting ``keep`` on False, all duplicates are True.

        >>> df.duplicated(keep=False)
        0     True
        1     True
        2    False
        3    False
        4    False
        dtype: bool

        To find duplicates on specific column(s), use ``subset``.

        >>> df.duplicated(subset=['brand'])
        0    False
        1     True
        2    False
        3     True
        4     True
        dtype: bool

2. Removing duplicate rows (df.drop_duplicates)

  • Parameters
    def drop_duplicates(
        self,
        subset: Hashable | Sequence[Hashable] | None = None,
        keep: Literal["first", "last", False] = "first",
        inplace: bool = False,
        ignore_index: bool = False,
    ) -> DataFrame | None:
        Examples
        --------
        Consider dataset containing ramen rating.

        >>> df = pd.DataFrame({
        ...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
        ...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
        ...     'rating': [4, 4, 3.5, 15, 5]
        ... })
        >>> df
            brand style  rating
        0  Yum Yum   cup     4.0
        1  Yum Yum   cup     4.0
        2  Indomie   cup     3.5
        3  Indomie  pack    15.0
        4  Indomie  pack     5.0

        By default, it removes duplicate rows based on all columns.

        >>> df.drop_duplicates()
            brand style  rating
        0  Yum Yum   cup     4.0
        2  Indomie   cup     3.5
        3  Indomie  pack    15.0
        4  Indomie  pack     5.0

        To remove duplicates on specific column(s), use ``subset``.

        >>> df.drop_duplicates(subset=['brand'])
            brand style  rating
        0  Yum Yum   cup     4.0
        2  Indomie   cup     3.5

        To remove duplicates and keep last occurrences, use ``keep``.

        >>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
            brand style  rating
        1  Yum Yum   cup     4.0
        2  Indomie   cup     3.5
        4  Indomie  pack     5.0

VII. Filtering Data

  • column-label filtering
  • logical-operator filtering
  • regex filtering
  • regex extraction
  • others

(1) Column-label filtering

  • Data
import pandas as pd

data = [
    {"日期": "2022-01-12", "最高气温": 15, "最低气温": 8, "天气": "晴", "风向": "东北风"},
    {"日期": "2022-02-12", "最高气温": 12, "最低气温": None, "天气": "晴", "风向": "东北风"},
    {"日期": "2022-02-15", "最高气温": 11, "最低气温": 8, "天气": "多云", "风向": "西南风"},
    {"日期": "2022-03-09", "最高气温": None, "最低气温": 13, "天气": "晴", "风向": "西北风"},
    {"日期": "2022-03-13", "最高气温": 19, "最低气温": 15, "天气": "小雨", "风向": "北风"},
    {"日期": "2022-06-18", "最高气温": 28, "最低气温": 22, "天气": "小雨", "风向": "西南风"},
]

df = pd.DataFrame(data)
print(df)
  • Filtering
df1 = df[df["最高气温"] > 15]
print(df1)

(2) Logical-operator filtering

df1 = df[(df["最高气温"] > 15) & (df["最低气温"] > 15)]
df2 = df[(df["最高气温"] > 15) | (df["最低气温"] > 15)]
df3 = df[df["最高气温"] != 15]
df4 = df[~(df["最高气温"] >= 15)]

(3) Regex filtering

df1 = df[df["最高气温"].astype(str).str.contains(r'\d+', regex=True)]

(4) Regex extraction

The extract method can pull pieces of text out of a Series:

import pandas as pd

data = [
    {"日期": "2022-01-12", "最高气温": 15, "最低气温": 8, "天气": "晴", "风向": "东北风"},
    {"日期": "2022-02-12", "最高气温": 12, "最低气温": None, "天气": "晴", "风向": "东北风"},
    {"日期": "2022-02-15", "最高气温": 11, "最低气温": 8, "天气": "多云", "风向": "西南风"},
    {"日期": "2022-03-09", "最高气温": None, "最低气温": 13, "天气": "晴", "风向": "西北风"},
    {"日期": "2022-03-13", "最高气温": 19, "最低气温": 15, "天气": "小雨", "风向": "北风"},
    {"日期": "2022-06-18", "最高气温": 28, "最低气温": 22, "天气": "小雨", "风向": "西南风"},
]

df = pd.DataFrame(data)

df1 = df["日期"].str.extract(r'(\d+)')
print(df1)
"""
      0
0  2022
1  2022
2  2022
3  2022
4  2022
5  2022
"""

(5) Others

The string methods on a Series can do much more:

print(type(df["日期"].str)) # <class 'pandas.core.strings.accessor.StringMethods'>

Among the methods this class provides:

  • extract
  • extractall
  • find
  • findall
  • match
  • replace
  • split

...
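A few of these in action, as a sketch on date strings like those used above:

```python
import pandas as pd

s = pd.Series(["2022-01-12", "2022-02-15"])

print(s.str.split("-"))         # each element becomes a list of parts
print(s.str.replace("-", "/"))  # element-wise replacement
print(s.str.match(r"\d{4}"))    # does each element start with four digits?
```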

VIII. File Handling

(1) Excel files

1. Writing (df.to_excel)

        Examples
        --------

        Create, write to and save a workbook:

        >>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
        ...                    index=['row 1', 'row 2'],
        ...                    columns=['col 1', 'col 2'])
        >>> df1.to_excel("output.xlsx")  # doctest: +SKIP

        To specify the sheet name:

        >>> df1.to_excel("output.xlsx",
        ...              sheet_name='Sheet_name_1')  # doctest: +SKIP

        If you wish to write to more than one sheet in the workbook, it is
        necessary to specify an ExcelWriter object:

        >>> df2 = df1.copy()
        >>> with pd.ExcelWriter('output.xlsx') as writer:  # doctest: +SKIP
        ...     df1.to_excel(writer, sheet_name='Sheet_name_1')
        ...     df2.to_excel(writer, sheet_name='Sheet_name_2')

        ExcelWriter can also be used to append to an existing Excel file:

        >>> with pd.ExcelWriter('output.xlsx',
        ...                     mode='a') as writer:  # doctest: +SKIP
        ...     df.to_excel(writer, sheet_name='Sheet_name_3')

        To set the library that is used to write the Excel file,
        you can pass the `engine` keyword (the default engine is
        automatically chosen depending on the file extension):

        >>> df1.to_excel('output1.xlsx', engine='xlsxwriter')  # doctest: +SKIP
        

2. Reading (pd.read_excel)

pd.read_excel("output.xlsx")

(2) CSV files

1. Writing (df.to_csv)

        Examples
        --------
        >>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
        ...                    'mask': ['red', 'purple'],
        ...                    'weapon': ['sai', 'bo staff']})
        >>> df.to_csv(index=False)
        'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

        Create 'out.zip' containing 'out.csv'

        >>> compression_opts = dict(method='zip',
        ...                         archive_name='out.csv')  # doctest: +SKIP
        >>> df.to_csv('out.zip', index=False,
        ...           compression=compression_opts)  # doctest: +SKIP

        To write a csv file to a new folder or nested folder you will first
        need to create it using either Pathlib or os:

        >>> from pathlib import Path  # doctest: +SKIP
        >>> filepath = Path('folder/subfolder/out.csv')  # doctest: +SKIP
        >>> filepath.parent.mkdir(parents=True, exist_ok=True)  # doctest: +SKIP
        >>> df.to_csv(filepath)  # doctest: +SKIP

        >>> import os  # doctest: +SKIP
        >>> os.makedirs('folder/subfolder', exist_ok=True)  # doctest: +SKIP
        >>> df.to_csv('folder/subfolder/out.csv')  # doctest: +SKIP
        

2. Reading (pd.read_csv)

pd.read_csv("out.csv")

IX. Worked Example

Data: Gasoline And Diesel Prices In Romania Notebook

import pandas as pd
import numpy as np

df1 = pd.read_csv("files/gasolinePricesInRomania.csv")
df2 = pd.read_csv("files/dieselPricesInRomania.csv")

print(df1.head())
print(df1.count())
print(df2.head())
print(df2.count())

df = pd.merge(df1, df2, on="date")
print(df["date"].unique())


def handle_date(row):
    # keep only the year: the last "-"-separated field of the date string
    date = row.rsplit(maxsplit=1, sep="-")[-1]
    return date


df['date'] = df['date'].map(handle_date)

# yearly sum, mean, max and min of the fuel prices

res = df.groupby(by="date").agg(["sum", "mean", "max", "min"])

# named-aggregation form
res1 = df.groupby(by="date").agg(
    s1=pd.NamedAgg(column='gasoline price per liter in RON', aggfunc='sum'),
    s2=pd.NamedAgg(column='diesel price per liter in RON', aggfunc='sum'),
    m1=pd.NamedAgg(column='gasoline price per liter in RON', aggfunc='mean'),
    m2=pd.NamedAgg(column='diesel price per liter in RON', aggfunc='mean')
)
print(res1)

            s1        s2        m1        m2
date
2020   452.223   448.315  4.567909  4.528434
2021  2069.067  2046.488  5.731488  5.668942
2022  2563.922  2820.015  7.431658  8.173957
2023    44.844    53.439  6.406286  7.634143

posted @ iveBoy