Data contained in pandas objects can be combined together in a number of ways:
pandas.merge connects rows in DataFrame based on one or more keys. This will be familiar to users of SQL or other relational databases, as it impliemnts(工具) database join oprations.
pandas.concat concatenates or "stacks" together objects along an axis.
The combine_first instance method enables splicing(拼接) together overlapping data to fill in missing values in one object with values from another.
I will address each of these and give a number of examples. They'll be utilized in examples throughout the rest of the book.
SQL风格的Join
merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational database(e.g. SQL-based). The merge function in pandas is the main entry point for using theses algorithms on your data.
Let's start with a simple example:
df1 = pd.DataFrame({
'key': 'b, b, a, c, a, a, b'.split(','),
'data1': range(7)
})
df2 = pd.DataFrame({
'key': ['a', 'b', 'd'],
'data2': range(3)
})
df1
df2
key
data1
0
b
0
1
b
1
2
a
2
3
c
3
4
a
4
5
a
5
6
b
6
key
data2
0
a
0
1
b
1
2
d
2
This is an example of a many to one join; the data in df1 has multiple rows labeled a and b, whereas(然而) df2 has only one row for each value in the key column. Calling merge with these objects we obtain:
"merge 默认是内连接, if 没有指定key..."
pd.merge(df1, df2) # data1, key, data2
'merge 默认是内连接, if 没有指定key...'
key
data1
data2
0
b
0
1
Note that I didn't specify which columns to join on. if that infomation is not specified, merge uses the overlapping columns names as keys. It's a good practice to specify explicitly, though:
(cj. 好像不是这样的哦)
# cj test
pd.merge(df1, df2, on='key', how='left')
key
data1
data2
0
b
0
1.0
1
b
1
NaN
2
a
2
NaN
3
c
3
NaN
4
a
4
NaN
5
a
5
NaN
6
b
6
NaN
If the column names are different in each object, you can specify them separately:
(两个df的键不同, 进行合并时可以分别指定)
df3 = pd.DataFrame({
'lkey': 'a b a c a a b'.split(),
'data1': range(7)
})
df4 = pd.DataFrame({
'rkey': ['a', 'b', 'd'],
'data2': range(3)
})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
lkey
data1
rkey
data2
0
a
0
a
0
1
a
2
a
0
2
a
4
a
0
3
a
5
a
0
4
b
1
b
1
5
b
6
b
1
You may notice that the 'c' and 'd' values and associate data are missing from the result. By defualt merge does an inner join; the keys in the result are intersection. or the common set found in both tables. Other possible options are left, right and outer. The outer join takes the union of the keys, combining the effect of applying both left and right joins.
(merge 默认是内连接, 相关的还有左, 右, 外连接;
外连接是包含了左,右连接哦)
To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array fo tuples to be used as a single join key.
When you are joining columns-on-columns, the indexes on the passed DataFrame objects are discarded.
See Table 8-2 for an argument reference on merge. Joining using the DataFrame's row index is the subject of the next section.
left
right
how
on
left_on
right_on
left_index
right_index
sort
suffixes 添加后缀
copy
indecator
按Index合并
In some cases, the merge key(s) in a DataFrame will be found on its index, In this case, you can pass left_index=True or right_index=True to indicate that the index should be used as the merge key:
Another kind of data combination operation is referred to interchangeably as concat-enation, binding, or stacking, NumPy's concatenate function can do this with NumPy arrays:
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
· SQL Server 2025 AI相关能力初探
· Linux系列:如何用 C#调用 C方法造成内存泄露
· 无需6万激活码!GitHub神秘组织3小时极速复刻Manus,手把手教你使用OpenManus搭建本
· Manus爆火,是硬核还是营销?
· 终于写完轮子一部分:tcp代理 了,记录一下
· 别再用vector<bool>了!Google高级工程师:这可能是STL最大的设计失误
· 单元测试从入门到精通