数据处理过程的数据类型
当利用pandas进行数据处理的时候,经常会遇到数据类型的问题,当拿到数据的时候,首先需要确定拿到的是正确类型的数据,一般通过数据类型的转化,这篇文章就介绍pandas里面的数据类型(data types也就是常用的dtyps),以及pandas与numpy之间的数据对应关系。
主要介绍object,int64,float64,datetime64,bool等几种类型,category与timedelta两种类型会单独的在其他文章中进行介绍。当然本文中也会涉及简单的介绍。
数据类型的问题一般都是出了问题之后才会发现的,所以有了一些经验之后就会拿到数据之后,就直接看数据类型,是否与自己想要处理的数据格式一致,这样可以从一开始避免一些尴尬的问题出现。那么我们以一个简单的例子,利用jupyter notebook进行一个数据类型的介绍。
import numpy as np
import pandas as pd
df = pd.read_csv("sales_data_types.csv" )
print (df)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125 ,000.00 $162 ,500.00
1 552278 Smith Plumbing $920 ,000.00 $1 ,012,000.00
2 23477 ACME Industrial $50 ,000.00 $62 ,500.00
3 24900 Brekke LTD $350 ,000.00 $490 ,000.00
4 651029 Harbor Co $15 ,000.00 $12 ,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
df.dtypes
Customer Number int64
Customer Name object
2016 object
2017 object
Percent Growth object
Jan Units object
Month int64
Day int64
Year int64
Active object
dtype : object
df['2016' ]+df['2017' ]
0 $125 ,000.00 $162 ,500.00
1 $920 ,000.00 $1 ,012,000.00
2 $50 ,000.00 $62 ,500.00
3 $350 ,000.00 $490 ,000.00
4 $15 ,000.00 $12 ,750.00
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame' >
RangeIndex : 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number 5 non-null int64
Customer Name 5 non-null object
2016 5 non-null object
2017 5 non-null object
Percent Growth 5 non-null object
Jan Units 5 non-null object
Month 5 non-null int64
Day 5 non-null int64
Year 5 non-null int64
Active 5 non-null object
dtypes : int64 (4 ), object (6 )
memory usage : 480.0 + bytes
首先介绍最常用的astype()
比如可以通过astype()将第一列的数据转化为整数int类型
df['Customer Number' ].astype("int" )
0 10002
1 552278
2 23477
3 24900
4 651029
Name: Customer Number, dtype: int32
df["Customer Number" ] = df["Customer Number" ].astype("int" )
print (df)
print ("--------" *10 )
print (df.dtypes)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125 ,000.00 $162 ,500.00
1 552278 Smith Plumbing $920 ,000.00 $1 ,012,000.00
2 23477 ACME Industrial $50 ,000.00 $62 ,500.00
3 24900 Brekke LTD $350 ,000.00 $490 ,000.00
4 651029 Harbor Co $15 ,000.00 $12 ,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
--------------------------------------------------------------------------------
Customer Number int32
Customer Name object
2016 object
2017 object
Percent Growth object
Jan Units object
Month int64
Day int64
Year int64
Active object
dtype: object
print (df)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125 ,000.00 $162 ,500.00
1 552278 Smith Plumbing $920 ,000.00 $1 ,012,000.00
2 23477 ACME Industrial $50 ,000.00 $62 ,500.00
3 24900 Brekke LTD $350 ,000.00 $490 ,000.00
4 651029 Harbor Co $15 ,000.00 $12 ,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
df['2016' ].astype('float' )
ValueError Traceback (most recent call last )
< ipython- input-19 -47 cc9d68cd65> in < module > ()
C:\Anaconda3\lib\site- packages\pandas\core\generic.py in astype(self, dtype, copy , raise_on_error, * * kwargs)
3052 # else , only a single dtype is given
3053 new_data = self._data.astype(dtype= dtype, copy = copy ,
- > 3054 raise_on_error= raise_on_error, * * kwargs)
3055 return self._constructor(new_data).__finalize__(self)
3056
C:\Anaconda3\lib\site- packages\pandas\core\internals.py in astype(self, dtype, * * kwargs)
3187
3188 def astype(self, dtype, * * kwargs):
- > 3189 return self.apply('astype' , dtype= dtype, * * kwargs)
3190
3191 def convert (self, * * kwargs):
C:\Anaconda3\lib\site- packages\pandas\core\internals.py in apply(self, f, axes, filter , do_integrity_check, consolidate, * * kwargs)
3054
3055 kwargs['mgr' ] = self
- > 3056 applied = getattr(b, f)(* * kwargs)
3057 result_blocks = _extend_blocks(applied, result_blocks)
3058
C:\Anaconda3\lib\site- packages\pandas\core\internals.py in astype(self, dtype, copy , raise_on_error, values , * * kwargs)
459 * * kwargs):
460 return self._astype(dtype, copy = copy , raise_on_error= raise_on_error,
462
463 def _astype(self, dtype, copy = False , raise_on_error= True , values = None ,
C:\Anaconda3\lib\site- packages\pandas\core\internals.py in _astype(self, dtype, copy , raise_on_error, values , klass, mgr, * * kwargs)
502
503 # _astype_nansafe works fine with 1 - d only
505 values = values.reshape(self.shape)
506
C:\Anaconda3\lib\site- packages\pandas\types\cast.py in _astype_nansafe(arr, dtype, copy )
535
536 if copy :
538 return arr.view(dtype)
539
ValueError: could not convert string to float : '$15,000.00 '
以上的问题说明了一些问题
如果数据是纯净的数据,可以转化为数字
astype基本也就是两种用作,数字转化为单纯字符串,单纯数字的字符串转化为数字,含有其他的非数字的字符串是不能通过astype进行转化的。
需要引入其他的方法进行转化,也就有了下面的自定义函数方法
通过自定义函数清理数据
def convert_currency (var ):
"""
convert the string number to a float
_ 去除$
- 去除逗号,
- 转化为浮点数类型
"""
new_value = var.replace("," ,"" ).replace("$" ,"" )
return float (new_value)
df["2016" ].apply(convert_currency)
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016 , dtype: float64
df["2016" ].apply(lambda x: x.replace("," ,"" ).replace("$" ,"" )).astype("float64" )
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016 , dtype: float64
df["Percent Growth" ].apply(lambda x: x.replace("%" ,"" )).astype("float" )/100
0 0.30
1 0.10
2 0.25
3 0.04
4 -0.15
Name: Percent Growth, dtype: float64
df["Active" ] = np.where(df["Active" ] == "Y" , True , False )
df["Active" ]
0 True
1 True
2 True
3 True
4 False
Name: Active, dtype: bool
df["2016" ]=df["2016" ].apply(lambda x: x.replace("," ,"" ).replace("$" ,"" )).astype("float64" )
df["2017" ]=df["2017" ].apply(lambda x: x.replace("," ,"" ).replace("$" ,"" )).astype("float64" )
df["Percent Growth" ]=df["Percent Growth" ].apply(lambda x: x.replace("%" ,"" )).astype("float" )/100
df.dtypes
Customer Number int32
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units object
Month int64
Day int64
Year int64
Active bool
dtype: object
print (df)
Customer Number Customer Name 2016 2017 Percent Growth \
0 10002 Quest Industries 125000.0 162500.0 0.30
1 552278 Smith Plumbing 920000.0 1012000.0 0.10
2 23477 ACME Industrial 50000.0 62500.0 0.25
3 24900 Brekke LTD 350000.0 490000.0 0.04
4 651029 Harbor Co 15000.0 12750.0 -0.15
Jan Units Month Day Year Active
0 500 1 10 2015 True
1 700 6 15 2014 True
2 125 3 29 2016 True
3 75 10 27 2015 True
4 Closed 2 2 2014 False
利用pandas中函数进行处理
pd.to_numeric(df["Jan Units" ],errors='coerce' ).fillna(0 )
0 500.0
1 700.0
2 125.0
3 75.0
4 0.0
Name: Jan Units, dtype: float64
pd.to_datetime(df[['Month' , 'Day' , 'Year' ]])
0 2015-01-10
1 2014-06-15
2 2016-03-29
3 2015-10-27
4 2014-02-02
dtype: datetime64[ns]
df["Jan Units" ] = pd.to_numeric(df["Jan Units" ],errors='coerce' )
df["Start_date" ] = pd.to_datetime(df[['Month' , 'Day' , 'Year' ]])
df
Customer Number
Customer Name
2016
2017
Percent Growth
Jan Units
Month
Day
Year
Active
Start_date
0
10002
Quest Industries
125000.0
162500.0
0.30
500.0
1
10
2015
True
2015-01-10
1
552278
Smith Plumbing
920000.0
1012000.0
0.10
700.0
6
15
2014
True
2014-06-15
2
23477
ACME Industrial
50000.0
62500.0
0.25
125.0
3
29
2016
True
2016-03-29
3
24900
Brekke LTD
350000.0
490000.0
0.04
75.0
10
27
2015
True
2015-10-27
4
651029
Harbor Co
15000.0
12750.0
-0.15
NaN
2
2
2014
False
2014-02-02
df.dtypes
Customer Number int32
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active bool
Start_date datetime64[ns]
dtype: object
def convert_percent (val ):
"""
Convert the percentage string to an actual floating point percent
- Remove %
- Divide by 100 to make decimal
"""
new_val = val.replace('%' , '' )
return float (new_val) / 100
df_2 = pd.read_csv("sales_data_types.csv" ,dtype={"Customer_Number" :"int" },converters={
"2016" :convert_currency,
"2017" :convert_currency,
"Percent Growth" :convert_percent,
"Jan Units" :lambda x:pd.to_numeric(x,errors="coerce" ),
"Active" :lambda x: np.where(x=="Y" ,True ,False )
})
df_2.dtypes
Customer Number int64
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active bool
dtype: object
df_2
Customer Number
Customer Name
2016
2017
Percent Growth
Jan Units
Month
Day
Year
Active
0
10002
Quest Industries
125000.0
162500.0
0.30
500.0
1
10
2015
True
1
552278
Smith Plumbing
920000.0
1012000.0
0.10
700.0
6
15
2014
True
2
23477
ACME Industrial
50000.0
62500.0
0.25
125.0
3
29
2016
True
3
24900
Brekke LTD
350000.0
490000.0
0.04
75.0
10
27
2015
True
4
651029
Harbor Co
15000.0
12750.0
-0.15
NaN
2
2
2014
False
至此,pandas里面数据类型目前还有timedelta以及category两个,之后会着重介绍category类型,这是类型是参考了R中的category设计的,在pandas 0.16 之后添加的,之后还会根据需要进行整理pandas的常用方法。
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】凌霞软件回馈社区,博客园 & 1Panel & Halo 联合会员上线
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步