Pandas系列(十)-转换连接详解
目录
- 1. 拼接
- 1.1 append
- 1.2 concat
- 2. 关联
- 2.1 merge
- 2.2 join
数据准备
# 导入相关库 import numpy as np import pandas as pd """ 拼接 有两个DataFrame,都存储了用户的一些信息,现在要拼接起来,组成一个DataFrame,如何实现呢? """ data1 = { "name": ["Tom", "Bob"], "age": [18, 30], "city": ["Bei Jing ", "Shang Hai "] } df1 = pd.DataFrame(data=data1) df1 Out[85]: name age city 0 Tom 18 Bei Jing 1 Bob 30 Shang Hai data2 = { "name": ["Mary", "James"], "age": [35, 18], "city": ["Guang Zhou", "Shen Zhen"] } df2 = pd.DataFrame(data=data2) df2 Out[86]: name age city 0 Mary 35 Guang Zhou 1 James 18 Shen Zhen
1. 拼接
1.1 append
def append(self, other, ignore_index=False,verify_integrity=False, sort=None):
append 是最简单的拼接两个DataFrame的方法。
df1.append(df2) Out[87]: name age city 0 Tom 18 Bei Jing 1 Bob 30 Shang Hai 0 Mary 35 Guang Zhou 1 James 18 Shen Zhen
可以看到,拼接后的索引默认还是原有的索引,如果想要重新生成索引的话,设置参数 ignore_index=True 即可。
df1.append(df2, ignore_index=True) Out[88]: name age city 0 Tom 18 Bei Jing 1 Bob 30 Shang Hai 2 Mary 35 Guang Zhou 3 James 18 Shen Zhen
1.2 concat
除了 append 这种方式之外,还有 concat 这种方式可以实现相同的功能。
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True):
例子
objs=[df1, df2] pd.concat(objs, ignore_index=True) Out[89]: name age city 0 Tom 18 Bei Jing 1 Bob 30 Shang Hai 2 Mary 35 Guang Zhou 3 James 18 Shen Zhen
如果想要区分出不同的DataFrame的数据,可以通过设置参数 keys,当然得设置参数 ignore_index=False。
pd.concat(objs, ignore_index=False, keys=["df1", "df2"]) Out[90]: name age city df1 0 Tom 18 Bei Jing 1 Bob 30 Shang Hai df2 0 Mary 35 Guang Zhou 1 James 18 Shen Zhen
2. 关联
有两个DataFrame,分别存储了用户的部分信息,现在需要将用户的这些信息关联起来,如何实现呢?
2.1 merge
def merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None):
通过 pd.merge 可以关联两个DataFrame,这里我们设置参数 on="name",表示依据 name 来作为关联键。默认how='inner',我们可以设置成outer
data1 = { "name": ["Tom", "Bob", "Mary", "James"], "age": [18, 30, 35, 18], "city": ["Bei Jing ", "Shang Hai ", "Guang Zhou", "Shen Zhen"] } df1 = pd.DataFrame(data=data1) df1 data2 = {"name": ["Bob", "Mary", "James", "Andy"], "sex": ["male", "female", "male", np.nan], "income": [8000, 8000, 4000, 6000] } df2 = pd.DataFrame(data=data2) df2 pd.merge(df1,df2,on="name") Out[91]: name age city sex income 0 Bob 30 Shang Hai male 8000 1 Mary 35 Guang Zhou female 8000 2 James 18 Shen Zhen male 4000 #关联后发现数据变少了,只有 3 行数据,这是因为默认关联的方式是 inner,如果不想丢失任何数据,可以设置参数 how="outer"。 pd.merge(df1,df2,on="name",how="outer") Out[92]: name age city sex income 0 Tom 18.0 Bei Jing NaN NaN 1 Bob 30.0 Shang Hai male 8000.0 2 Mary 35.0 Guang Zhou female 8000.0 3 James 18.0 Shen Zhen male 4000.0 4 Andy NaN NaN NaN 6000.0
如果我们想保留左边所有的数据,可以设置参数 how="left";反之,如果想保留右边的所有数据,可以设置参数 how="right"
pd.merge(df1, df2, on="name", how="left") Out[93]: name age city sex income 0 Tom 18 Bei Jing NaN NaN 1 Bob 30 Shang Hai male 8000.0 2 Mary 35 Guang Zhou female 8000.0 3 James 18 Shen Zhen male 4000.0
有时候,两个 DataFrame 中需要关联的键的名称不一样,可以通过 left_on 和 right_on 来分别设置。
df1.rename(columns={"name": "name1"}, inplace=True) df1 Out[94]: name1 age city 0 Tom 18 Bei Jing 1 Bob 30 Shang Hai 2 Mary 35 Guang Zhou 3 James 18 Shen Zhen df2.rename(columns={"name": "name2"}, inplace=True) df2 Out[95]: name2 sex income 0 Bob male 8000 1 Mary female 8000 2 James male 4000 3 Andy NaN 6000 pd.merge(df1, df2, left_on="name1", right_on="name2") Out[96]: name1 age city name2 sex income 0 Bob 30 Shang Hai Bob male 8000 1 Mary 35 Guang Zhou Mary female 8000 2 James 18 Shen Zhen James male 4000
有时候,两个DataFrame中都包含相同名称的字段,如何处理呢?
我们可以设置参数 suffixes,默认 suffixes=('_x', '_y') 表示将相同名称的左边的DataFrame的字段名加上后缀 _x,右边加上后缀 _y。
df1["sex"] = "male" df1 Out[97]: name1 age city sex 0 Tom 18 Bei Jing male 1 Bob 30 Shang Hai male 2 Mary 35 Guang Zhou male 3 James 18 Shen Zhen male pd.merge(df1, df2, left_on="name1", right_on="name2") Out[98]: name1 age city sex_x name2 sex_y income 0 Bob 30 Shang Hai male Bob male 8000 1 Mary 35 Guang Zhou male Mary female 8000 2 James 18 Shen Zhen male James male 4000 pd.merge(df1, df2, left_on="name1", right_on="name2", suffixes=("_left", "_right")) Out[99]: name1 age city sex_left name2 sex_right income 0 Bob 30 Shang Hai male Bob male 8000 1 Mary 35 Guang Zhou male Mary female 8000 2 James 18 Shen Zhen male James male 4000
2.2 join
def join(self, other, on=None, how='left', lsuffix='', rsuffix='',sort=False):
除了 merge 这种方式外,还可以通过 join 这种方式实现关联。相比 merge,join 这种方式有以下几个不同:
(1)默认参数on=None,表示关联时使用左边和右边的索引作为键,设置参数on可以指定的是关联时左边的所用到的键名
(2)左边和右边字段名称重复时,通过设置参数 lsuffix 和 rsuffix 来解决。
df1.join(df2.set_index("name2"), on="name1", lsuffix="_left") Out[100]: name1 age city sex_left sex income 0 Tom 18 Bei Jing male NaN NaN 1 Bob 30 Shang Hai male male 8000.0 2 Mary 35 Guang Zhou male female 8000.0 3 James 18 Shen Zhen male male 4000.0
数据合并综合代码
from pandas import concat def data_concat(dir_name, to_file_path, drop_duplicates: bool = False): """ 数据纵向合并 :param dir_name: 数据来源文件夹名称 :param to_file_path: 合并数据保存文件夹 :param drop_duplicates: 是否去重 :return: """ objs = (read_excel(f'{dir_name}/{file}') for file in os.listdir(dir_name)) merge_data = concat(objs=objs, ignore_index=True) if drop_duplicates: merge_data.drop_duplicates(inplace=True) merge_data.to_excel(to_file_path, index=False) if __name__ == '__main__': data_concat(dir_name='data1', to_file_path='merge_data.xlsx', drop_duplicates=True)
- 融汇贯通
# -*- coding: utf-8 -*- """ Datetime: 2020/07/05 Author: Zhang Yafei Description: 合并文件 """ from pandas import read_csv, read_excel, merge, concat, DataFrame def read_file(file_path, on): if file_path.endswith('.csv'): return read_csv(file_path) if file_path.endswith('.xls') or file_path.endswith('xlsx'): return read_excel(file_path) def df_to_file(df: DataFrame, file_path: str, index: bool = True, encoding: str = 'utf_8_sig'): if file_path.endswith('.csv'): df.to_csv(file_path, index=index, encoding=encoding) if file_path.endswith('.xls') or file_path.endswith('xlsx'): df.to_excel(file_path, index=index) def merge_two_data(file1: str, file2: str, on: str = None, left_on: str = None, right_on: str = None, how: str = 'inner', to_file: str = None): """ 横向合并两个文件 @param file1: @param file2: @param on: @param left_on: @param right_on: @param how: @param to_file: @return: """ df1 = read_file(file1) df2 = read_file(file2) merge_df = merge(df1, df2, on=on, how=how, left_on=left_on, right_on=right_on) if to_file: if to_file.endswith('.csv'): merge_df.to_csv(to_file, encoding='utf_8_sig', index=False) elif to_file.endswith('xls') or to_file.endswith('xlsx'): merge_df.to_excel(to_file, index=False) else: return merge_df def append_two_file(file1: str, file2: str, to_file: str = None): """ 纵向合并两个文件 @param file1: @param file2: @param to_file: @return: """ df1 = read_file(file1) df2 = read_file(file2) df3 = df1.append(df2, ignore_index=True) if to_file: df_to_file(df3, to_file, index=False) else: return df3 def join_two_file(file1: str, file2: str, on: str = None, how: str = 'left', to_file: str = None): """ 横向合并两个文件 @param file1: @param file2: @param on: @param how: @param to_file: @return: """ df1 = read_file(file1) df2 = read_file(file2) df3 = df1.join(df2, on=on, how=how) if to_file: df_to_file(df3, to_file, index=False) else: return df3 def concat_more_data(axis: int = 0, to_file=None, encoding='utf_8_sig', *files): """ 多个文件合并 @param axis: 0/index 1/column 若axis=1, 默认基于索引将多个文件合并 @param to_file: 导出文件路径 @param encoding: 导出文件编码 @param files: 合并文件路径 @return: """ if len(files) > 1: objs = [read_file(file) for file in files] merge_data = concat(objs=objs, axis=axis) if to_file: df_to_file(merge_data, to_file, index=False, encoding=encoding) else: return merge_data else: raise Exception('合并的文件个数小于2,不能进行合并,请输入大于等于两个文件路径')