A Concise pandas Tutorial
I. Introduction to pandas
pandas is an open-source Python library built for data analysis, and it has become a core tool for most professionals who analyze data with Python. It is designed on top of NumPy, the foundation of much of the Python scientific-computing stack, and inherits NumPy's strong computational performance.
pandas covers a lot of ground; this article includes:
- Getting started with pandas
- The two core data structures
- Applying functions
- Merging, grouping, splitting, and reshaping
- Handling missing and duplicate values
- Filtering data
- File handling
- A hands-on case study
II. Getting started with pandas
Let's start with an example. Suppose we have the following data:
data = [
{"name": "apple", "price": 6, "quantity": 100},
{"name": "pear", "price": 8.5, "quantity": 230},
{"name": "banana", "price": 6, "quantity": 150},
]
Suppose we want to add 1 to each price and then compute the average price.
With plain Python:
data = [
{"name": "apple", "price": 6, "quantity": 100},
{"name": "pear", "price": 8.5, "quantity": 230},
{"name": "banana", "price": 6, "quantity": 150},
]
# price + 1
# price average
total_price = 0
for item in data:
    item["price"] = item.get("price", 0) + 1
    total_price += item["price"]
avg = total_price / len(data)
print(avg)
With pandas:
data = [
{"name": "apple", "price": 6, "quantity": 100},
{"name": "pear", "price": 8.5, "quantity": 230},
{"name": "banana", "price": 6, "quantity": 150},
]
import pandas as pd
df = pd.DataFrame(data)
print(df)
# price + 1
df["price"] = df["price"] + 1
# price average
avr_price = df["price"].mean()
print(avr_price)
The plain-Python version has to loop over the data explicitly, which becomes expensive as the dataset grows. For larger datasets pandas is the better choice: its operations are vectorized, and it offers a rich API for data processing.
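To make the difference concrete, here is a minimal timing sketch (the one-million-row size is an arbitrary choice, and the exact numbers will vary by machine):
import time
import pandas as pd

data = [{"name": "item", "price": 6.0, "quantity": 100}] * 1_000_000

# plain-Python loop
start = time.perf_counter()
total = 0.0
for item in data:
    total += item["price"] + 1
loop_avg = total / len(data)
loop_time = time.perf_counter() - start

# vectorized pandas version
df = pd.DataFrame(data)
start = time.perf_counter()
vec_avg = (df["price"] + 1).mean()
vec_time = time.perf_counter() - start

print(loop_avg, vec_avg)  # same result
print(f"loop: {loop_time:.3f}s, pandas: {vec_time:.3f}s")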
III. The two core data structures
pandas revolves around two data structures: Series and DataFrame.
(1) Series
A Series is a labeled one-dimensional array that can hold integers, floats, strings, Python objects, and other types. Its key traits:
- behaves like a one-dimensional array
- consists of data plus an index (index on the left, data on the right; created automatically if not supplied)
1. Creation
import pandas as pd
# create from a list
ser_obj = pd.Series([1, 2, 3, 4, 5])
print(ser_obj)
print(type(ser_obj))
# create from a dict
ser_obj1 = pd.Series({"a": 1, "b": 2, "c": 3})
print(ser_obj1)
print(type(ser_obj1))
0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'pandas.core.series.Series'>
a    1
b    2
c    3
dtype: int64
<class 'pandas.core.series.Series'>
2. Operations
- Index access
print(ser_obj.index) # RangeIndex(start=0, stop=5, step=1)
print(ser_obj1.index) # Index(['a', 'b', 'c'], dtype='object')
Iterating over the index yields each label; label-based indexing and slicing work too, as sketched below.
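A short sketch using the two Series defined above (standard Series indexing):
# iterate over the index labels
for label in ser_obj1.index:
    print(label, ser_obj1[label])
# label-based indexing and slicing
print(ser_obj1["a"])      # single value: 1
print(ser_obj1["a":"b"])  # label slice, end-inclusive
print(ser_obj[1:3])       # positional slice, end-exclusive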
- Value access
print(ser_obj.values) # [1 2 3 4 5]
print(ser_obj1.values) # [1 2 3]
- Previewing data with head
print(ser_obj.head())  # head(2) previews the first 2 rows
0 1
1 2
2 3
3 4
4 5
dtype: int64
(2) DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types; you can think of it as a collection of Series sharing the same index. Its key traits:
- behaves like a multi-dimensional array / table
- each column can hold a different type
- it has both a row index and a column index
1. Creation
import pandas as pd
import numpy as np
# build a DataFrame from an ndarray
df = pd.DataFrame(np.arange(12).reshape(3, 4))
print(df)
print(type(df))
# build a DataFrame from a dict
# the default row index is 0, 1, 2;
# an explicit index can be given: pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a","b","c"])
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})
print(df)
print(type(df))
# build a DataFrame from Series objects
data = {"one": pd.Series([1, 2, 3], index=["a", "b", "c"]), "two": pd.Series([4, 5, 6], index=["a", "b", "c"])}
df = pd.DataFrame(data)
print(df)
print(type(df))
Output:
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
<class 'pandas.core.frame.DataFrame'>
   one  two
0    1    4
1    2    5
2    3    6
<class 'pandas.core.frame.DataFrame'>
   one  two
a    1    4
b    2    5
c    3    6
<class 'pandas.core.frame.DataFrame'>
2. Operations
- Row operations
import pandas as pd
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})
print(df)
# select rows by slice
print(df[0:1])
# append a row (DataFrame.append was removed in pandas 2.0; use pd.concat there)
df = df.append({"one": 4, "two": 7}, ignore_index=True)
print(df)
# drop a row
df.drop([0], axis=0, inplace=True)  # the first argument gives the index label(s)
print(df)
- Column operations
import pandas as pd
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})
print(df)
# select a column
print(df["one"])
# add a column
df["three"] = [7, 8, 9]
# drop a column
df.drop(["three"], axis=1, inplace=True)
- Block operations
import pandas as pd
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})
print(df)
# select a block
df1 = df.loc[0, ["one"]]
print(df1)
print(type(df1))
# modify a single value
df.loc[0, ["one"]] = 10  # 0 is the row label, "one" the column label
print(df)
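loc selects by label; its positional counterpart iloc does the same with integer positions. A brief sketch on the same df:
# positional selection with iloc
print(df.iloc[0, 0])         # first row, first column
print(df.iloc[0:2, [0, 1]])  # first two rows, both columns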
IV. Applying functions
- apply
- applymap
- sort_index、sort_values
- count、describe、min、max、sum、mean
(1) apply
apply maps a function over rows or columns:
import pandas as pd
data = [
{"name": "apple", "price": 6, "quantity": 100},
{"name": "pear", "price": 8.5, "quantity": 230},
{"name": "banana", "price": 6, "quantity": 150},
]
df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
"""
name price quantity
a apple 6.0 100
b pear 8.5 230
c banana 6.0 150
"""
# Example 1
def add_price_quantity(row):
    row["price"] = row["price"] + 1
    row["quantity"] = row["quantity"] + 100
    return row
# apply
df1 = df.apply(add_price_quantity, axis=1)
print(df1)
"""
name price quantity
a apple 7.0 200
b pear 9.5 330
c banana 7.0 250
"""
# Example 2
df2 = df["price"].apply(lambda x: x + 10)
print(df2)
"""
a 16.0
b 18.5
c 16.0
Name: price, dtype: float64
"""
(2) applymap
applymap maps a function over every element (in pandas 2.1+ this method has been renamed to DataFrame.map):
import pandas as pd
data = [
{"name": "apple", "price": 6, "quantity": 100},
{"name": "pear", "price": 8.5, "quantity": 230},
{"name": "banana", "price": 6, "quantity": 150},
]
df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
def handle_cell(cell):
    if isinstance(cell, str):
        cell = cell + "_fruit"
    return cell
# applymap
df1 = df.applymap(handle_cell)
print(df1)
"""
name price quantity
a apple_fruit 6.0 100
b pear_fruit 8.5 230
c banana_fruit 6.0 150
"""
(3) sort_index and sort_values
- sort_index sorts by index
- sort_values(by="label") sorts by the values under the given label(s)
Both accept an axis parameter to choose the axis.
import pandas as pd
data = [
{"name": "apple", "price": 6, "quantity": 100},
{"name": "pear", "price": 8.5, "quantity": 230},
{"name": "banana", "price": 6, "quantity": 150},
]
df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
"""
name price quantity
a apple 6.0 100
b pear 8.5 230
c banana 6.0 150
"""
df1 = df.sort_index(ascending=False, axis=0)  # sort the index in descending order
print(df1)
"""
name price quantity
c banana 6.0 150
b pear 8.5 230
a apple 6.0 100
"""
df2 = df.sort_values(by=["price"])
print(df2)
"""
name price quantity
a apple 6.0 100
c banana 6.0 150
b pear 8.5 230
"""
(4) count, describe, min, max, sum, mean
Common statistical methods:
- count: number of non-null values
- describe: summary statistics for a Series or for each DataFrame column
- min, max: smallest and largest values
- sum: sum of the values
- mean: average of the values
import pandas as pd
data = [
{"name": "apple", "price": 6, "quantity": 100},
{"name": "pear", "price": 8.5, "quantity": 230},
{"name": "banana", "price": 6, "quantity": 150},
]
df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
print(df.count())
print(df.describe())
print(df["price"].min())
print(df["price"].max())
print(df["price"].sum())
print(df["price"].mean())
Output:
     name  price  quantity
a   apple    6.0       100
b    pear    8.5       230
c  banana    6.0       150

name        3
price       3
quantity    3
dtype: int64

          price    quantity
count  3.000000    3.000000
mean   6.833333  160.000000
std    1.443376   65.574385
min    6.000000  100.000000
25%    6.000000  125.000000
50%    6.000000  150.000000
75%    7.250000  190.000000
max    8.500000  230.000000

6.0
8.5
20.5
6.833333333333333
V. Merging, grouping, splitting, and reshaping
(1) Merging
pandas has several built-in ways to combine datasets, for example:
- pd.concat: stack multiple objects along an axis
- pd.merge: join datasets on one or more keys
- df.append: append rows of a Series/DataFrame (removed in pandas 2.0)
1. pd.concat
- Concatenating Series
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0 a
1 b
0 c
1 d
dtype: object
Clear the existing index and reset it in the result
by setting the ``ignore_index`` option to ``True``.
>>> pd.concat([s1, s2], ignore_index=True)
0 a
1 b
2 c
3 d
dtype: object
- Concatenating DataFrames
# row-wise concatenation
Combine two ``DataFrame`` objects with identical columns.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df1
letter number
0 a 1
1 b 2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
... columns=['letter', 'number'])
>>> df2
letter number
0 c 3
1 d 4
>>> pd.concat([df1, df2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4
# column-wise concatenation
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
... columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
2. pd.merge
def merge(
    left: DataFrame | Series,
    right: DataFrame | Series,
    how: str = "inner",
    on: IndexLabel | None = None,
    left_on: IndexLabel | None = None,
    right_on: IndexLabel | None = None,
    left_index: bool = False,
    right_index: bool = False,
    sort: bool = False,
    suffixes: Suffixes = ("_x", "_y"),
    copy: bool = True,
    indicator: bool = False,
    validate: str | None = None,
) -> DataFrame:
left, right: the objects to join
how: join type (inner, outer, left, right)
on: column label(s) to join on; if omitted, the columns common to both objects are used
left_on: join key(s) in the left object
right_on: join key(s) in the right object
left_index: use the left object's row index as its join key
right_index: use the right object's row index as its join key
sort: sort the merged data; False can improve performance
suffixes: tuple of strings appended to column names that appear in both DataFrames
copy: copy the data into the result; False can improve performance
indicator: add a column showing which source each merged row came from
- Default behavior
import pandas as pd
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter1', 'number'])
print(df1)
"""
letter1 number
0 a 1
1 b 2
"""
df2 = pd.DataFrame([['c', 3], ['d', 2]],
columns=['letter2', 'number'])
print(df2)
"""
letter2 number
0 c 3
1 d 2
"""
df3 = pd.merge(df1, df2)  # the key can also be given explicitly: pd.merge(df1, df2, on="number")
print(df3)
"""
letter1 number letter2
0 b 2 d
"""
- left_on and right_on
import pandas as pd
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter1', 'number1'])
print(df1)
"""
letter1 number1
0 a 1
1 b 2
"""
df2 = pd.DataFrame([['c', 3], ['d', 2]],
columns=['letter2', 'number2'])
print(df2)
"""
letter2 number2
0 c 3
1 d 2
"""
df3 = pd.merge(df1, df2, left_on="number1", right_on="number2")
print(df3)
"""
letter1 number1 letter2 number2
0 b 2 d 2
"""
- left_index and right_index
import pandas as pd
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter1', 'number'])
print(df1)
"""
letter1 number
0 a 1
1 b 2
"""
df2 = pd.DataFrame([['c', 3], ['d', 2]],
columns=['letter2', 'number'])
print(df2)
"""
letter2 number
0 c 3
1 d 2
"""
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print(df3)
"""
letter1 number_x letter2 number_y
0 a 1 c 3
1 b 2 d 2
"""
3. df.append
- Parameters (note: append was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat in newer versions)
def append(
    self,
    other,
    ignore_index: bool = False,
    verify_integrity: bool = False,
    sort: bool = False,
) -> DataFrame:
- Examples
Examples
--------
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
>>> df
A B
x 1 2
y 3 4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])
>>> df.append(df2)
A B
x 1 2
y 3 4
x 5 6
y 7 8
With `ignore_index` set to True:
>>> df.append(df2, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
The following, while not recommended methods for generating DataFrames,
show two ways to generate a DataFrame from multiple data sources.
Less efficient:
>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
... df = df.append({'A': i}, ignore_index=True)
>>> df
A
0 0
1 1
2 2
3 3
4 4
More efficient:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
... ignore_index=True)
A
0 0
1 1
2 2
3 3
4 4
(2) Grouping
- With statistical methods
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by="a").sum()
b c
a
a 13.0 13.0
b 12.3 123.0
- apply
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
... 'Parrot', 'Parrot'],
... 'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby("Animal", group_keys=True).apply(lambda x: x)
Animal Max Speed
Animal
Falcon 0 Falcon 380.0
1 Falcon 370.0
Parrot 2 Parrot 24.0
3 Parrot 26.0
>>> df.groupby("Animal", group_keys=False).apply(lambda x: x)
Animal Max Speed
0 Falcon 380.0
1 Falcon 370.0
2 Parrot 24.0
3 Parrot 26.0
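Beyond a single statistic, agg applies several aggregations per group at once. A short sketch on the same Animal DataFrame (agg with a list of function names is standard pandas):
>>> df.groupby("Animal")["Max Speed"].agg(["mean", "max", "min"])
         mean    max    min
Animal
Falcon  375.0  380.0  370.0
Parrot   25.0   26.0   24.0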
(3) Splitting
Splitting boils down to slicing out blocks of rows and columns:
import pandas as pd
data = [
{"name": "apple", "price": 6, "quantity": 100},
{"name": "pear", "price": 8.5, "quantity": 230},
{"name": "banana", "price": 6, "quantity": 150},
]
df = pd.DataFrame(data)
print(df)
df1 = df[0:1][["name", "price"]]
print(df1)  # row 0
"""
name price
0 apple 6.0
"""
df2 = df[1:][["name", "price"]]
print(df2)  # rows from 1 onward
"""
name price
1 pear 8.5
2 banana 6.0
"""
(4) Reshaping
1. Reshaping with a hierarchical index (stack/unstack)
The examples below keep the original Chinese column names: 日期 = date, 最高气温 = max temperature, 最低气温 = min temperature (later: 天气 = weather, 风向 = wind direction).
import pandas as pd
data = [
{"日期": "2022-01-12", "最高气温": 15, "最低气温": 8},
{"日期": "2022-02-12", "最高气温": 12, "最低气温": 7},
{"日期": "2022-02-15", "最高气温": 11, "最低气温": 8},
{"日期": "2022-03-09", "最高气温": 15, "最低气温": 13},
{"日期": "2022-03-13", "最高气温": 19, "最低气温": 15},
{"日期": "2022-06-18", "最高气温": 28, "最低气温": 22},
]
df = pd.DataFrame(data)
# pivot columns into rows, producing a Series with a hierarchical index
df1 = df.stack()
print(df1)
print(type(df1))
"""
0 日期 2022-01-12
最高气温 15
最低气温 8
1 日期 2022-02-12
最高气温 12
最低气温 7
2 日期 2022-02-15
最高气温 11
最低气温 8
3 日期 2022-03-09
最高气温 15
最低气温 13
4 日期 2022-03-13
最高气温 19
最低气温 15
5 日期 2022-06-18
最高气温 28
最低气温 22
dtype: object
<class 'pandas.core.series.Series'>
"""
df2 = df1.unstack()
print(df2)
print(type(df2))
"""
日期 最高气温 最低气温
0 2022-01-12 15 8
1 2022-02-12 12 7
2 2022-02-15 11 8
3 2022-03-09 15 13
4 2022-03-13 19 15
5 2022-06-18 28 22
<class 'pandas.core.frame.DataFrame'>
"""
2. Pivot tables
- Parameters
def pivot_table(
    data: DataFrame,
    values=None,
    index=None,
    columns=None,
    aggfunc: AggFuncType = "mean",
    fill_value=None,
    margins: bool = False,
    dropna: bool = True,
    margins_name: str = "All",
    observed: bool = False,
    sort: bool = True,
) -> DataFrame:
The index parameter defines the grouping; with the default aggfunc="mean", the grouped values are then averaged.
import pandas as pd
data = [
{"日期": "2022-01-12", "最高气温": 15, "最低气温": 8, "天气": "晴", "风向": "东北风"},
{"日期": "2022-02-12", "最高气温": 12, "最低气温": 7, "天气": "晴", "风向": "东北风"},
{"日期": "2022-02-15", "最高气温": 11, "最低气温": 8, "天气": "多云", "风向": "西南风"},
{"日期": "2022-03-09", "最高气温": 15, "最低气温": 13, "天气": "晴", "风向": "西北风"},
{"日期": "2022-03-13", "最高气温": 19, "最低气温": 15, "天气": "小雨", "风向": "北风"},
{"日期": "2022-06-18", "最高气温": 28, "最低气温": 22, "天气": "小雨", "风向": "西南风"},
]
df = pd.DataFrame(data)
print(df)
"""
日期 最高气温 最低气温 天气 风向
0 2022-01-12 15 8 晴 东北风
1 2022-02-12 12 7 晴 东北风
2 2022-02-15 11 8 多云 西南风
3 2022-03-09 15 13 晴 西北风
4 2022-03-13 19 15 小雨 北风
5 2022-06-18 28 22 小雨 西南风
"""
df = df.set_index("日期")
print(df)
"""
最高气温 最低气温 天气 风向
日期
2022-01-12 15 8 晴 东北风
2022-02-12 12 7 晴 东北风
2022-02-15 11 8 多云 西南风
2022-03-09 15 13 晴 西北风
2022-03-13 19 15 小雨 北风
2022-06-18 28 22 小雨 西南风
"""
df1 = pd.pivot_table(df, values=["最高气温", "最低气温"], index=["天气"])
print(df1)
"""
最低气温 最高气温
天气
多云 8.000000 11.0
小雨 18.500000 23.5
晴 9.333333 14.0
"""
df2 = pd.pivot_table(df, values=["最高气温", "最低气温"], index=["天气"], columns=["风向"])
print(df2)
"""
最低气温 最高气温
风向 东北风 北风 西北风 西南风 东北风 北风 西北风 西南风
天气
多云 NaN NaN NaN 8.0 NaN NaN NaN 11.0
小雨 NaN 15.0 NaN 22.0 NaN 19.0 NaN 28.0
晴 7.5 NaN 13.0 NaN 13.5 NaN 15.0 NaN
"""
VI. Handling missing and duplicate values
(1) Missing values
When the data contains NaN values, they can be replaced with other values; this is what df.fillna() is for.
1. Checking for missing values
import pandas as pd
data = [
{"日期": "2022-01-12", "最高气温": 15, "最低气温": 8, "天气": "晴", "风向": "东北风"},
{"日期": "2022-02-12", "最高气温": 12, "最低气温": None, "天气": "晴", "风向": "东北风"},
{"日期": "2022-02-15", "最高气温": 11, "最低气温": 8, "天气": "多云", "风向": "西南风"},
{"日期": "2022-03-09", "最高气温": None, "最低气温": 13, "天气": "晴", "风向": "西北风"},
{"日期": "2022-03-13", "最高气温": 19, "最低气温": 15, "天气": "小雨", "风向": "北风"},
{"日期": "2022-06-18", "最高气温": 28, "最低气温": 22, "天气": "小雨", "风向": "西南风"},
]
df = pd.DataFrame(data)
res = df.isna() # DataFrame.isnull is an alias for DataFrame.isna
print(res)
"""
日期 最高气温 最低气温 天气 风向
0 False False False False False
1 False False True False False
2 False False False False False
3 False True False False False
4 False False False False False
5 False False False False False
"""
2. Dropping rows: df.dropna(axis=0)
df.dropna(axis=0)
3. Dropping columns: df.dropna(axis=1)
df.dropna(axis=1)
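A quick sketch on the weather DataFrame defined above:
# drop every row containing at least one NaN (rows 1 and 3 disappear)
print(df.dropna(axis=0))
# drop every column containing at least one NaN (最高气温 and 最低气温 disappear)
print(df.dropna(axis=1))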
4. Filling missing values: df.fillna("missing")
- Parameters
def fillna(  # type: ignore[override]
    self,
    value: Hashable | Mapping | Series | DataFrame = None,
    method: FillnaOptions | None = None,
    axis: Axis | None = None,
    inplace: bool = False,
    limit: int | None = None,
    downcast: dict | None = None,
) -> DataFrame | None:
    return super().fillna(
        value=value,
        method=method,
        axis=axis,
        inplace=inplace,
        limit=limit,
        downcast=downcast,
    )
The method parameter selects the fill strategy:
df.fillna("missing")  # replace NaN with a string
df.fillna(method="pad")  # forward-fill: reuse the previous value
df.fillna(method="bfill", limit=1)  # back-fill: use the next value; limit caps the fills per column
df.fillna(df.mean(numeric_only=True))  # replace NaN with a summary statistic, e.g. the column mean
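For example, forward-filling the weather data above (a short sketch; row 3's missing 最高气温 takes the value from row 2):
df_filled = df.fillna(method="pad")
print(df_filled.loc[3, "最高气温"])  # 11.0, copied down from row 2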
(2) Duplicate values
1. Detecting duplicate rows (df.duplicated)
This method returns a boolean Series that marks whether each row is a duplicate:
def duplicated(
    self,
    subset: Hashable | Sequence[Hashable] | None = None,
    keep: Literal["first", "last", False] = "first",
) -> Series:
Examples
--------
Consider dataset containing ramen rating.
>>> df = pd.DataFrame({
... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, for each set of duplicated values, the first occurrence
is set on False and all others on True.
>>> df.duplicated()
0 False
1 True
2 False
3 False
4 False
dtype: bool
By using 'last', the last occurrence of each set of duplicated values
is set on False and all others on True.
>>> df.duplicated(keep='last')
0 True
1 False
2 False
3 False
4 False
dtype: bool
By setting ``keep`` on False, all duplicates are True.
>>> df.duplicated(keep=False)
0 True
1 True
2 False
3 False
4 False
dtype: bool
To find duplicates on specific column(s), use ``subset``.
>>> df.duplicated(subset=['brand'])
0 False
1 True
2 False
3 True
4 True
dtype: bool
2. Dropping duplicate rows (df.drop_duplicates)
- Parameters
def drop_duplicates(
    self,
    subset: Hashable | Sequence[Hashable] | None = None,
    keep: Literal["first", "last", False] = "first",
    inplace: bool = False,
    ignore_index: bool = False,
) -> DataFrame | None:
Examples
--------
Consider dataset containing ramen rating.
>>> df = pd.DataFrame({
... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
To remove duplicates on specific column(s), use ``subset``.
>>> df.drop_duplicates(subset=['brand'])
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
To remove duplicates and keep last occurrences, use ``keep``.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
brand style rating
1 Yum Yum cup 4.0
2 Indomie cup 3.5
4 Indomie pack 5.0
VII. Filtering data
- Filtering by column label
- Filtering with logical operators
- Regex filtering
- Regex extraction
- Other string methods
(1) Filtering by column label
- Data
import pandas as pd
data = [
{"日期": "2022-01-12", "最高气温": 15, "最低气温": 8, "天气": "晴", "风向": "东北风"},
{"日期": "2022-02-12", "最高气温": 12, "最低气温": None, "天气": "晴", "风向": "东北风"},
{"日期": "2022-02-15", "最高气温": 11, "最低气温": 8, "天气": "多云", "风向": "西南风"},
{"日期": "2022-03-09", "最高气温": None, "最低气温": 13, "天气": "晴", "风向": "西北风"},
{"日期": "2022-03-13", "最高气温": 19, "最低气温": 15, "天气": "小雨", "风向": "北风"},
{"日期": "2022-06-18", "最高气温": 28, "最低气温": 22, "天气": "小雨", "风向": "西南风"},
]
df = pd.DataFrame(data)
print(df)
- Filtering
df1 = df[df["最高气温"] > 15]
print(df1)
(2) Filtering with logical operators
df1 = df[(df["最高气温"] > 15) & (df["最低气温"] > 15)]
df2 = df[(df["最高气温"] > 15) | (df["最低气温"] > 15)]
df3 = df[df["最高气温"] != 15]
df4 = df[~(df["最高气温"] >= 15)]
(3) Regex filtering
df1 = df[df["最高气温"].astype(str).str.contains(r'\d+', regex=True)]
(4) Regex extraction
The extract method pulls text out of a Series with a regular expression:
import pandas as pd
data = [
{"日期": "2022-01-12", "最高气温": 15, "最低气温": 8, "天气": "晴", "风向": "东北风"},
{"日期": "2022-02-12", "最高气温": 12, "最低气温": None, "天气": "晴", "风向": "东北风"},
{"日期": "2022-02-15", "最高气温": 11, "最低气温": 8, "天气": "多云", "风向": "西南风"},
{"日期": "2022-03-09", "最高气温": None, "最低气温": 13, "天气": "晴", "风向": "西北风"},
{"日期": "2022-03-13", "最高气温": 19, "最低气温": 15, "天气": "小雨", "风向": "北风"},
{"日期": "2022-06-18", "最高气温": 28, "最低气温": 22, "天气": "小雨", "风向": "西南风"},
]
df = pd.DataFrame(data)
df1 = df["日期"].str.extract(r'(\d+)')
print(df1)
"""
0
0 2022
1 2022
2 2022
3 2022
4 2022
5 2022
"""
(5) Other string methods
The string accessor on a Series can do a great deal:
print(type(df["日期"].str))  # <class 'pandas.core.strings.accessor.StringMethods'>
The accessor includes:
- extract
- extractall
- find
- findall
- match
- replace
- split
...
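A short sketch of a few of these on the 日期 column (all standard StringMethods calls):
print(df["日期"].str.split("-").head(2))         # split each date into parts
print(df["日期"].str.replace("-", "/").head(2))  # swap the separator
print(df["日期"].str.findall(r"\d+").head(2))    # all digit runs per date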
VIII. File handling
(1) Excel files
1. Writing (df.to_excel)
Examples
--------
Create, write to and save a workbook:
>>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
... index=['row 1', 'row 2'],
... columns=['col 1', 'col 2'])
>>> df1.to_excel("output.xlsx") # doctest: +SKIP
To specify the sheet name:
>>> df1.to_excel("output.xlsx",
... sheet_name='Sheet_name_1') # doctest: +SKIP
If you wish to write to more than one sheet in the workbook, it is
necessary to specify an ExcelWriter object:
>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer: # doctest: +SKIP
... df1.to_excel(writer, sheet_name='Sheet_name_1')
... df2.to_excel(writer, sheet_name='Sheet_name_2')
ExcelWriter can also be used to append to an existing Excel file:
>>> with pd.ExcelWriter('output.xlsx',
... mode='a') as writer: # doctest: +SKIP
... df.to_excel(writer, sheet_name='Sheet_name_3')
To set the library that is used to write the Excel file,
you can pass the `engine` keyword (the default engine is
automatically chosen depending on the file extension):
>>> df1.to_excel('output1.xlsx', engine='xlsxwriter') # doctest: +SKIP
2. Reading (pd.read_excel)
pd.read_excel("output.xlsx")
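read_excel can also target a specific sheet and index column (a brief sketch; "output.xlsx" and "Sheet_name_1" refer to the file written above):
df = pd.read_excel("output.xlsx", sheet_name="Sheet_name_1", index_col=0)
print(df)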
(2) CSV files
1. Writing (df.to_csv)
Examples
--------
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'
Create 'out.zip' containing 'out.csv'
>>> compression_opts = dict(method='zip',
... archive_name='out.csv') # doctest: +SKIP
>>> df.to_csv('out.zip', index=False,
... compression=compression_opts) # doctest: +SKIP
To write a csv file to a new folder or nested folder you will first
need to create it using either Pathlib or os:
>>> from pathlib import Path # doctest: +SKIP
>>> filepath = Path('folder/subfolder/out.csv') # doctest: +SKIP
>>> filepath.parent.mkdir(parents=True, exist_ok=True) # doctest: +SKIP
>>> df.to_csv(filepath) # doctest: +SKIP
>>> import os # doctest: +SKIP
>>> os.makedirs('folder/subfolder', exist_ok=True) # doctest: +SKIP
>>> df.to_csv('folder/subfolder/out.csv') # doctest: +SKIP
2. Reading (pd.read_csv)
pd.read_csv("out.csv")
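read_csv takes many options; commonly used ones include sep, usecols, and nrows (a sketch, assuming the turtle DataFrame above was saved as out.csv):
df = pd.read_csv(
    "out.csv",
    sep=",",                   # field separator
    usecols=["name", "mask"],  # load only these columns
    nrows=2,                   # read at most 2 rows
)
print(df)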
IX. Case study
Data: Gasoline And Diesel Prices In Romania Notebook
import pandas as pd
df1 = pd.read_csv("files/gasolinePricesInRomania.csv")
df2 = pd.read_csv("files/dieselPricesInRomania.csv")
print(df1.head())
print(df1.count())
print(df2.head())
print(df2.count())
df = pd.merge(df1, df2, on="date")
print(df["date"].unique())
def handle_date(row):
    # keep the substring after the last "-" (the year)
    return row.rsplit(maxsplit=1, sep="-")[-1]
df['date'] = df['date'].map(handle_date)
# per-year sum, mean, max, and min of the fuel prices
res = df.groupby(by="date").agg(["sum", "mean", "max", "min"])
# named-aggregation style with explicit output column names
res1 = df.groupby(by="date").agg(
s1=pd.NamedAgg(column='gasoline price per liter in RON', aggfunc='sum'),
s2=pd.NamedAgg(column='diesel price per liter in RON', aggfunc='sum'),
m1=pd.NamedAgg(column='gasoline price per liter in RON', aggfunc='mean'),
m2=pd.NamedAgg(column='diesel price per liter in RON', aggfunc='mean')
)
print(res1)
          s1        s2        m1        m2
date
2020   452.223   448.315  4.567909  4.528434
2021  2069.067  2046.488  5.731488  5.668942
2022  2563.922  2820.015  7.431658  8.173957
2023    44.844    53.439  6.406286  7.634143