Python Study Notes: Pandas Data Type Conversion

1. Reading clipboard data with Pandas

import pandas as pd
df = pd.read_clipboard()
'''
	国家	受欢迎度	评分	向往度
0	中国	10	10.0	10.0
1	美国	6	5.8	7.0
2	日本	2	1.2	7.0
3	德国	8	6.8	6.0
4	英国	7	6.6	NaN
'''
df.dtypes
'''
国家       object
受欢迎度      int64
评分      float64
向往度     float64
dtype: object
'''

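Common dtypes:
object: generic Python objects (for example the text column 国家 above)
int: integers
float: floating-point numbers
string: the dedicated pandas string dtype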
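
Since read_clipboard() depends on whatever is currently in the clipboard, here is a minimal sketch that rebuilds the same table by hand and inspects the kind of each column. The variable name kinds is only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the sample table so the example does not depend on the clipboard.
# Columns: 国家 (country), 受欢迎度 (popularity), 评分 (score), 向往度 (desirability)
df = pd.DataFrame({
    "国家": ["中国", "美国", "日本", "德国", "英国"],
    "受欢迎度": [10, 6, 2, 8, 7],
    "评分": [10.0, 5.8, 1.2, 6.8, 6.6],
    "向往度": [10.0, 7.0, 7.0, 6.0, np.nan],
})

# dtype.kind is a one-letter code: 'i' integer, 'f' float, 'O' object
kinds = {col: dtype.kind for col, dtype in df.dtypes.items()}
print(df.dtypes)
print(kinds)
```

Note that 向往度 holds only whole numbers but is still float, because the missing value (NaN) cannot be stored in a plain integer column.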

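2. Specifying dtypes when loading data

The simplest ways to load data are pd.DataFrame(data) and pd.read_csv(file_name). Both accept a dtype argument.

# Specify dtypes while reading the file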
import pandas as pd
df = pd.read_csv('data.csv',
                dtype={
                    'a':'string',
                    'b':'int64'
                })

# Set the dtype through the dtype parameter when creating a DataFrame
df = pd.DataFrame({
    'a':[1,2,3],
    'b':[4,5,6]
},
dtype='float32'
)
df
'''
	a	b
0	1.0	4.0
1	2.0	5.0
2	3.0	6.0
'''
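
To see why passing dtype to read_csv matters, here is a small sketch that reads the same CSV text twice. The CSV content is made up for illustration, and io.StringIO stands in for a real data.csv file.

```python
import io
import pandas as pd

# Hypothetical CSV content; column a has leading zeros worth preserving.
csv_text = "a,b\n001,4\n002,5\n003,6\n"

# Without dtype, column a is parsed as integers and the leading zeros are lost.
default = pd.read_csv(io.StringIO(csv_text))

# With dtype, column a stays text and column b is read as 64-bit integers.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"a": "string", "b": "int64"})

print(default["a"].tolist())  # [1, 2, 3]
print(typed["a"].tolist())    # ['001', '002', '003']
print(typed.dtypes)
```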

3. Converting dtypes with astype

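# astype returns a new object; assign the result if you want to keep it
df.受欢迎度.astype('float')

# Pass a dict to convert several columns at once
df.astype({'国家':'string',
          '向往度':'Int64'})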
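
The capital I in 'Int64' above matters. A minimal sketch, assuming a column shaped like 向往度 (whole numbers plus one missing value): the plain NumPy 'int64' cannot represent NaN, while the nullable 'Int64' can.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 7.0, np.nan], name="向往度")

# Plain int64 has no way to store a missing value, so the cast fails.
try:
    s.astype("int64")
    plain_ok = True
except (ValueError, TypeError):
    plain_ok = False

# The nullable Int64 dtype keeps the missing value as <NA>.
nullable = s.astype("Int64")

print(plain_ok)
print(nullable)
```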

4. Converting dtypes with pd.to_xx

to_datetime
to_numeric
to_pickle (serializes an object to a file; not a dtype conversion)
to_timedelta

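4.1 pd.to_datetime: converting to datetime

Convert to dates
Convert timestamps
Convert to dates according to a format

pd.to_datetime(df['date'], format="%m%d%Y")

When a date column mixes several date formats, one option is: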

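# Add a helper column holding the length of each date string
df['col'] = df['date'].apply(len)
# .copy() avoids SettingWithCopyWarning when a column is added to the subset
df_new = df.loc[df['col'] > 10].copy()
# Adjust the length threshold and the format to match your own data
df_new['col2'] = pd.to_datetime(df_new['date'], format="%m%d%Y")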

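Either of the following two approaches also works:

# Values that cannot be converted become NaT
df['date_new'] = pd.to_datetime(df['date'], format="%m%d%Y", errors='coerce')
# Let pandas try to infer the format
# (infer_datetime_format is deprecated since pandas 2.0, where inference is the default)
df['date_new'] = pd.to_datetime(df['date'], infer_datetime_format=True)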
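
As a sketch of how errors='coerce' can handle a mixed-format column without a helper column: parse once per known format, then fill the gaps of the first result with the second. The sample strings and the two formats are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical column mixing two formats plus one invalid value.
dates = pd.Series(["03112000", "12/03/2000", "not a date"])

# Each pass converts only the strings matching its format; the rest become NaT.
compact = pd.to_datetime(dates, format="%m%d%Y", errors="coerce")
slashed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

# Fill the gaps of the first pass with the second; unparseable values stay NaT.
parsed = compact.fillna(slashed)
print(parsed)
```

Counting the remaining NaT values afterwards (parsed.isna().sum()) is a quick way to check whether some format is still unhandled.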

Examples:

# Convert date strings
ss = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'])
pd.to_datetime(ss, format="%m/%d/%Y")
pd.to_datetime(ss, infer_datetime_format=True) # infer the format automatically (deprecated since pandas 2.0)

# Convert timestamps
aa = pd.Series([1490195805, 1590195805, 1690195805])
pd.to_datetime(aa, unit='s')
bb = pd.Series([1490195805433502912, 1590195805433502912, 1690195805433502912])
pd.to_datetime(bb, unit='ns')

# Convert strings
cc = pd.Series(['20200101', '20200202', '202003'])
pd.to_datetime(cc, format='%Y%m%d', errors='ignore') # return the input unchanged (errors='ignore' is deprecated since pandas 2.2)
pd.to_datetime(cc, format='%Y%m%d', errors='coerce') # invalid values become NaT

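Note that epoch timestamps count from 1970-01-01 00:00:00 UTC, so the converted values are UTC times. China Standard Time (UTC+8) is 8 hours ahead, so the offset can be added manually.

print(pd.to_datetime(aa, unit='s'))
print(pd.to_datetime(aa, unit='s', origin=pd.Timestamp('1970-01-01 08:00:00'))) # shift the origin by 8 hours
print(pd.to_datetime(aa, unit='s') + pd.Timedelta(hours=8)) # add 8 hours manually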
'''
0   2017-03-22 15:16:45
1   2020-05-23 01:03:25
2   2023-07-24 10:50:05
dtype: datetime64[ns]
0   2017-03-22 23:16:45
1   2020-05-23 09:03:25
2   2023-07-24 18:50:05
dtype: datetime64[ns]
0   2017-03-22 23:16:45
1   2020-05-23 09:03:25
2   2023-07-24 18:50:05
dtype: datetime64[ns]
'''
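
An alternative to adding 8 hours by hand is to let pandas do the time zone conversion. This is a sketch, assuming the target zone is Asia/Shanghai; it gives the same wall-clock values as the output above while keeping the time zone information explicit.

```python
import pandas as pd

aa = pd.Series([1490195805, 1590195805, 1690195805])

# utc=True marks the parsed values as UTC instead of leaving them naive.
utc = pd.to_datetime(aa, unit="s", utc=True)

# Convert to China Standard Time (UTC+8).
local = utc.dt.tz_convert("Asia/Shanghai")

# Optionally drop the time zone again to get naive local timestamps.
naive_local = local.dt.tz_localize(None)

print(local)
print(naive_local)
```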

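4.2 pd.to_numeric: converting to numeric types

# Syntax
pd.to_numeric(data, errors='raise', downcast=None)

errors: how invalid values are handled; default 'raise', one of {'ignore', 'raise', 'coerce'}:
- 'ignore': an invalid conversion returns the input unchanged (deprecated since pandas 2.2);
- 'raise': an invalid conversion raises an exception;
- 'coerce': an invalid conversion is set to NaN.
downcast: default None, one of {'integer', 'signed', 'unsigned', 'float'}. If it is not None and the data has been successfully converted to a numeric dtype, the result is downcast to the smallest numeric dtype that can hold it:
- 'integer' or 'signed': smallest signed integer dtype (minimum numpy.int8);
- 'unsigned': smallest unsigned integer dtype (minimum numpy.uint8);
- 'float': smallest float dtype (minimum numpy.float32).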

data = pd.Series(['1.0','2',-100])
print(data)
print(pd.to_numeric(data))
print(pd.to_numeric(data, downcast='signed'))

data2 = pd.Series(['apple', '1.0', '2', -100])
print(pd.to_numeric(data2, errors='ignore')) # returns the input unconverted
print(pd.to_numeric(data2, errors='coerce')) # invalid values are replaced with NaN
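
The snippets above print their results without showing them, so here is a small sketch that checks the resulting dtypes directly. It reuses the same sample data; the variable names as_float, as_small and coerced are only for illustration.

```python
import pandas as pd

data = pd.Series(["1.0", "2", -100])

# Mixed strings and numbers become float64 because of '1.0'.
as_float = pd.to_numeric(data)

# All values are whole numbers that fit in 8 bits, so downcast picks int8.
as_small = pd.to_numeric(data, downcast="signed")

# 'apple' cannot be parsed and becomes NaN, which keeps the column float64.
data2 = pd.Series(["apple", "1.0", "2", -100])
coerced = pd.to_numeric(data2, errors="coerce")

print(as_float.dtype, as_small.dtype, coerced.dtype)
print(coerced)
```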

4.3 pd.to_timedelta: converting to timedelta

Converts numbers, timedelta-like strings, and similar inputs to the timedelta dtype.

import numpy as np
print(pd.to_timedelta(np.arange(5), unit='d'))
# TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)

print(pd.to_timedelta('1 days 06:06:01.00003'))
# 1 days 06:06:01.000030

pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])
# TimedeltaIndex(['1 days 06:05:01.000030', '0 days 00:00:00.000015500', NaT], dtype='timedelta64[ns]', freq=None)
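
Once values are timedeltas they can be used in date arithmetic. A minimal sketch, with a made-up start time and offsets:

```python
import pandas as pd

# Hypothetical start time and offsets in seconds.
start = pd.Timestamp("2021-09-15 17:11:00")
offsets = pd.to_timedelta([0, 90, 3600], unit="s")

# Adding a TimedeltaIndex to a Timestamp gives a DatetimeIndex.
stamps = start + offsets

# A single Timedelta can be turned back into a plain number of hours.
hours = pd.to_timedelta("1 days 06:00:00").total_seconds() / 3600

print(stamps)
print(hours)
```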

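5. Inferring dtypes automatically

The convert_dtypes method converts each column to the best-fitting dtype that supports pd.NA, which makes it a fairly smart automatic conversion.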

print(df.dtypes)
'''
国家            object
受欢迎度           int64
评分           float64
向往度          float64
over_long      int64
dtype: object
'''
dfn = df.convert_dtypes()
print(dfn.dtypes)
'''
国家            string
受欢迎度           Int64
评分           Float64
向往度            Int64
over_long      Int64
dtype: object
'''
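
The output above comes from the author's own df. As a self-contained sketch of the same behaviour, this rebuilds a smaller version of the table (without the over_long column) and checks the numeric columns after conversion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "国家": ["中国", "美国", "英国"],
    "受欢迎度": [10, 6, 7],
    "评分": [10.0, 5.8, 6.6],
    "向往度": [10.0, 7.0, np.nan],
})

dfn = df.convert_dtypes()
print(dfn.dtypes)

# 向往度 holds only whole numbers, so it becomes nullable Int64
# and the NaN turns into <NA>; 评分 has fractions and becomes Float64.
print(dfn["向往度"])
```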

6. Selecting columns by dtype

select_dtypes() selects columns according to their dtypes.

df.select_dtypes(include=None, exclude=None) -> 'DataFrame'

Numeric: number, int, float
Boolean: bool
Datetime: datetime64
Timedelta: timedelta64
Categorical: category
String: string
Object: object

df.select_dtypes(include='float')
df.select_dtypes(include='number')
df.select_dtypes(include=['int','object'])
df.select_dtypes(exclude='object')
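
A small sketch of what these selections return, using a cut-down version of the table. The boolean column flag is a hypothetical addition to show that 'number' does not include bool.

```python
import pandas as pd

df = pd.DataFrame({
    "国家": ["中国", "美国"],
    "受欢迎度": [10, 6],
    "评分": [10.0, 5.8],
    "flag": [True, False],  # hypothetical extra column
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
float_cols = df.select_dtypes(include="float").columns.tolist()
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['受欢迎度', '评分']
print(float_cols)    # ['评分']
print(non_numeric)   # ['国家', 'flag']
```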

Reference: 5招学会Pandas数据类型转化 (Learn Pandas data type conversion in 5 moves)

posted @ 2021-09-15 17:11 Hider1214
