Python Study Notes: Concatenating Fields with pandas.Series.str.cat

I. Introduction

During data preprocessing you sometimes need to merge several fields into one string; the str.cat() method does this.

Syntax

Series.str.cat(others=None, sep=None, na_rep=None, join='left')

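Parameters

others -- if given, values are concatenated element-wise with the calling Series; if omitted, all values of the Series are joined into a single string
sep -- separator placed between the joined values; the default None means an empty string (no separator)
na_rep -- string substituted for missing values; with the default None, missing values are skipped when others is omitted, and otherwise any row containing a missing value yields NaN
join -- how indexes are aligned when others is a Series or DataFrame: 'left' (default), 'right', 'outer' or 'inner'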
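
The parameter behaviour described above can be checked with a small sketch. The Series `s` and the list of uppercase letters are made-up sample data, not part of the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
s = pd.Series(['a', 'b', np.nan, 'd'])

# others omitted: the whole Series collapses into one string, NaN is skipped
joined_default = s.str.cat()          # sep=None, so nothing is placed between values
joined_dash = s.str.cat(sep='-')

# others given: element-wise concatenation; a row with a missing value stays NaN
paired = s.str.cat(['A', 'B', 'C', 'D'], sep=',')

# na_rep replaces the missing value before concatenating
paired_filled = s.str.cat(['A', 'B', 'C', 'D'], sep=',', na_rep='-')

print(joined_default)
print(joined_dash)
print(paired)
print(paired_filled)
```

Here `joined_default` has no separator at all, which is why an explicit `sep` is usually worth passing.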

II. Hands-on

1. Build a test dataset

import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id':['A','B','C','D','E'],
                   'v0':['high','tall','high','one','two'],
                   'v1':np.random.rand(5),
                   'v2':np.random.rand(5),
                   'v3':np.random.rand(5),
                   'v4':np.random.rand(5),
                   'v5':np.random.rand(5)})
'''
  user_id    v0        v1        v2        v3        v4        v5
0       A  high  0.269505  0.123080  0.366477  0.529162  0.683024
1       B  tall  0.620859  0.469152  0.039121  0.221539  0.665314
2       C  high  0.657277  0.787288  0.488835  0.690670  0.029768
3       D   one  0.648150  0.234147  0.841002  0.403383  0.313004
4       E   two  0.532817  0.246520  0.277159  0.946502  0.369891
'''

2. Concatenating two columns

# Both columns hold strings
df['user_id'].str.cat(df['v0'])

# Add a separator
df['user_id'].str.cat(df['v0'], sep=' -- ')
'''
0    A -- high
1    B -- tall
2    C -- high
3     D -- one
4     E -- two
'''

3. Concatenating numeric columns

Every column passed to str.cat must contain strings. A numeric column raises a TypeError, so convert it to strings first.

# Type error
df['user_id'].str.cat(df['v1'], sep=' -- ')
# TypeError: Concatenation requires list-likes containing only strings (or missing values). Offending values found in column floating

# Convert the type
df['v1'] = df['v1'].map(str)
df['user_id'].str.cat(df['v1'], sep=' -- ')
'''
0    A -- 0.26950510515647086
1     B -- 0.6208590675841862
2      C -- 0.657277409259944
3     D -- 0.6481499976765789
4     E -- 0.5328165450111593
Name: user_id, dtype: object
'''

# Convert with astype
df['user_id'].str.cat(df['v2'].astype('str'), sep=' -- ')
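
Converting with str or astype keeps every decimal digit, which makes the joined text long. As a minimal sketch, assuming two decimal places are enough, the numbers can be formatted while they are converted. The small DataFrame and the `'{:.2f}'` format are illustrative choices, not part of the original notes.

```python
import pandas as pd

# Hypothetical two-row frame mirroring the columns used above
df_small = pd.DataFrame({'user_id': ['A', 'B'],
                         'v1': [0.269505, 0.620859]})

# Format each float as a string with two decimals, then concatenate
v1_text = df_small['v1'].map('{:.2f}'.format)
labels = df_small['user_id'].str.cat(v1_text, sep=' -- ')

print(labels.tolist())  # ['A -- 0.27', 'B -- 0.62']
```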

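4. Appending a fixed string

For example: how do you append a unit (万元, i.e. ten thousand yuan) to a column?

# Raises an error: a bare string is not accepted as others
df['v1'].str.cat('万元')
# ValueError: Did you mean to supply a `sep` keyword?

# Method 1 (not recommended): build a helper column, then concatenate
df['add_columns'] = '万元'
df['v1'].str.cat(df['add_columns'], sep='-')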
'''
0    0.26950510515647086-万元
1     0.6208590675841862-万元
2      0.657277409259944-万元
3     0.6481499976765789-万元
4     0.5328165450111593-万元
Name: v1, dtype: object
'''

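# Method 2: simply use "+"
df['v1'] + '-万元'

Note that Method 2 requires the column to hold strings already, and it does not skip missing values: a missing entry stays NaN instead of receiving the suffix. Fill missing values beforehand if every row needs the unit.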
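
To see how "+" treats missing values, here is a small sketch. The Series `amounts` and the fill value `'0'` are made up for illustration. In the pandas versions I am aware of, the missing entry comes through as NaN rather than raising.

```python
import numpy as np
import pandas as pd

# Hypothetical string column with one missing entry
amounts = pd.Series(['1.5', np.nan, '3.0'])

# The missing entry stays missing; the other rows get the suffix
with_unit = amounts + '-万元'

# Filling first gives every row a suffix ('0' is an arbitrary placeholder)
filled = amounts.fillna('0') + '-万元'

print(with_unit)
print(filled)
```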

5. Concatenating multiple columns

To concatenate several columns, pass them as a list (in square brackets).

df['user_id'].str.cat([df['v0'], df['v1'], df['add_columns']], sep='-')
'''
0    A-high-0.26950510515647086-万元
1     B-tall-0.6208590675841862-万元
2      C-high-0.657277409259944-万元
3      D-one-0.6481499976765789-万元
4      E-two-0.5328165450111593-万元
Name: user_id, dtype: object
'''
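
Besides a list of Series, `others` also accepts a DataFrame, which can be shorter to write when the columns come from the same frame. The frame below and its `unit` column are hypothetical sample data.

```python
import pandas as pd

# Hypothetical frame; every column to be joined already holds strings
df_multi = pd.DataFrame({'user_id': ['A', 'B', 'C'],
                         'v0': ['high', 'tall', 'one'],
                         'unit': ['万元'] * 3})

# Select the extra columns as a DataFrame and pass it as others
combined = df_multi['user_id'].str.cat(df_multi[['v0', 'unit']], sep='-')

print(combined.tolist())  # ['A-high-万元', 'B-tall-万元', 'C-one-万元']
```

The columns are joined in the order given in the selection list.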

6. Omitting the others parameter

# Default: missing values are skipped
s = pd.Series(['a', 'b', np.nan, 'd'])
s.str.cat(sep='-') # 'a-b-d'

# Specify a replacement for missing values
s.str.cat(sep='-', na_rep='???') # 'a-b-???-d'

7. Index alignment (join)

s = pd.Series(['a', 'b', np.nan, 'd'])
t = pd.Series(['d', 'a', 'e', 'c'], index=[3, 0, 4, 2])

# Left join on the index (the default)
s.str.cat(t, join='left', na_rep='-')

# Outer join
s.str.cat(t, join='outer', na_rep='-')

# Inner join
s.str.cat(t, join='inner', na_rep='-')
## Without na_rep, rows with a missing value produce NaN

# Right join
s.str.cat(t, join='right', na_rep='-')
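
The four join modes differ in which index labels survive. The sketch below reruns the same two Series and shows the result of each mode as an index-to-value mapping. The variable names are my own.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, 'd'])                   # index 0..3
t = pd.Series(['d', 'a', 'e', 'c'], index=[3, 0, 4, 2])  # shuffled index, label 4 is extra

left = s.str.cat(t, join='left', na_rep='-')    # keeps the labels of s
outer = s.str.cat(t, join='outer', na_rep='-')  # union of both indexes
inner = s.str.cat(t, join='inner', na_rep='-')  # only labels present in both
right = s.str.cat(t, join='right', na_rep='-')  # keeps the labels of t

print(left.to_dict())   # {0: 'aa', 1: 'b-', 2: '-c', 3: 'dd'}
print(outer.to_dict())  # {0: 'aa', 1: 'b-', 2: '-c', 3: 'dd', 4: '-e'}
print(inner.to_dict())  # {0: 'aa', 2: '-c', 3: 'dd'}
print(right.to_dict())  # {3: 'dd', 0: 'aa', 4: '-e', 2: '-c'}
```

Values are matched by index label, not by position, which is why label 0 pairs 'a' from s with 'a' from t.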


posted @ 2021-11-04 17:49 Hider1214
