pandas Basics


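Introduction to pandas

Python Data Analysis Library

pandas is a tool built on NumPy, created to handle data analysis tasks. It brings together a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large structured datasets.

Core pandas data structures

A data structure is the way a computer stores and organizes data. A well-chosen data structure usually yields better runtime or storage efficiency, and data structures are often tied to efficient retrieval algorithms and indexing techniques.

Series

A Series can be thought of as a one-dimensional array whose index labels you can set yourself. It resembles a fixed-length ordered dictionary, with an index and values.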

"""
The pandas Series object
"""
import pandas as pd
import numpy as np

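# An empty Series (dtype given explicitly, because the default dtype of an empty Series differs between pandas versions)
s1 = pd.Series(dtype='float64')
print(s1)  # Series([], dtype: float64)
# Create a Series from an array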
data = np.array(['zs', 'ls', 'ww', 'zl'])
s2 = pd.Series(data)
print(s2)
"""
0    zs
1    ls
2    ww
3    zl
dtype: object
"""

# Set custom index labels
s3 = pd.Series(data, index=['s001', 's002', 's003', 's004'])
print(s3)
"""
s001    zs
s002    ls
s003    ww
s004    zl
dtype: object
"""

# Create a Series from a dictionary
data = {'s01': 'zs', 's02': 'li', 's03': 'ww', 's04': 'zl'}
s4 = pd.Series(data)
print(s4)
"""
s01    zs
s02    li
s03    ww
s04    zl
dtype: object
"""

# Create a Series from a scalar
s5 = pd.Series(5,index=['a','b','c'])
print(s5)
"""
a    5
b    5
c    5
dtype: int64
"""

# Read data from a Series
print(s3)
"""
s001    zs
s002    ls
s003    ww
s004    zl
dtype: object
"""
print(s3.iloc[0])  # zs, access by position
print(s3[:2])  # access by slice
"""
s001    zs
s002    ls
dtype: object
"""
print(s3['s003'])  # ww, access by index label
print(s3[['s001', 's003']])  # access by a list of index labels
"""
s001    zs
s003    ww
dtype: object
"""
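The description of a Series as both an array and a fixed-length ordered dictionary can be illustrated with a minimal sketch. The variable names (`labels`, `values`, `has_label`, `name`, `first`) are only illustrative; the sample data reuses the labels from the examples above.

```python
import pandas as pd

# Sample data reusing the labels from the examples above
s = pd.Series(['zs', 'ls', 'ww', 'zl'],
              index=['s001', 's002', 's003', 's004'])

# The two halves of a Series: the index (labels) and the values
labels = list(s.index)
values = list(s.values)

# Dictionary-like behaviour: membership tests and lookups work on labels
has_label = 's002' in s
name = s['s002']

# Array-like behaviour: position-based access goes through .iloc
first = s.iloc[0]

# The index can be relabeled; the values stay where they are
s.index = ['a', 'b', 'c', 'd']
print(s)
```

Note that `in` tests the labels, not the values, which is the same convention a Python dictionary follows for its keys.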

pandas date handling

 

import pandas as pd

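# Date string formats that pandas recognizes
s6 = pd.Series(['2011', '2011-01',
           '2011-01-02',
           '2012/02/01',
           '2011-01-02 08:00:00',
           '01 Jun 2012'])
# to_datetime() converts the values to datetime
# (pandas >= 2.0 needs format='mixed' when the strings use different formats; drop the argument on older versions)
s6 = pd.to_datetime(s6, format='mixed')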
print(s6)
"""
0   2011-01-01 00:00:00
1   2011-01-01 00:00:00
2   2011-01-02 00:00:00
3   2012-02-01 00:00:00
4   2011-01-02 08:00:00
5   2012-06-01 00:00:00
dtype: datetime64[ns]
"""
# datetime data supports date arithmetic
delta = s6-pd.to_datetime('2011-01-01')

print(delta)
"""
0     0 days 00:00:00
1     0 days 00:00:00
2     1 days 00:00:00
3   396 days 00:00:00
4     1 days 08:00:00
5   517 days 00:00:00
dtype: timedelta64[ns]
"""
# Output one field of the dates in s6
print(s6.dt.quarter)
"""
0    1
1    1
2    1
3    1
4    1
5    2
dtype: int64
"""
# Get the offset in days
print(delta.dt.days)
"""
0      0
1      0
2      1
3    396
4      1
5    517
"""
print(s6.dt.month)
"""
0    1
1    1
2    1
3    2
4    1
5    6
dtype: int64
"""

Series.dt provides many date-related operations, including:

Series.dt.year    The year of the datetime.
Series.dt.month    The month as January=1, December=12.
Series.dt.day    The days of the datetime.
Series.dt.hour    The hours of the datetime.
Series.dt.minute    The minutes of the datetime.
Series.dt.second    The seconds of the datetime.
Series.dt.microsecond    The microseconds of the datetime.
Series.dt.isocalendar().week    The ISO week number of the year (replaces Series.dt.week and Series.dt.weekofyear, which were removed in pandas 2.0).
Series.dt.dayofweek    The day of the week with Monday=0, Sunday=6.
Series.dt.weekday    The day of the week with Monday=0, Sunday=6.
Series.dt.dayofyear    The ordinal day of the year.
Series.dt.quarter    The quarter of the date.
Series.dt.is_month_start    Indicates whether the date is the first day of the month.
Series.dt.is_month_end    Indicates whether the date is the last day of the month.
Series.dt.is_quarter_start    Indicator for whether the date is the first day of a quarter.
Series.dt.is_quarter_end    Indicator for whether the date is the last day of a quarter.
Series.dt.is_year_start    Indicate whether the date is the first day of a year.
Series.dt.is_year_end    Indicate whether the date is the last day of the year.
Series.dt.is_leap_year    Boolean indicator if the date belongs to a leap year.
Series.dt.days_in_month    The number of days in the month.
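A few of the accessors listed above can be tried on a small Series. This is a minimal sketch with made-up dates; it assumes pandas 1.1 or later for `isocalendar()`, and the variable names are illustrative only.

```python
import pandas as pd

# Three sample dates in one consistent format
dates = pd.Series(pd.to_datetime(['2011-01-02', '2012-02-29', '2012-06-01']))

years = dates.dt.year.tolist()
quarters = dates.dt.quarter.tolist()
weekdays = dates.dt.dayofweek.tolist()        # Monday=0 ... Sunday=6
leap_years = dates.dt.is_leap_year.tolist()
month_ends = dates.dt.is_month_end.tolist()
iso_weeks = dates.dt.isocalendar().week.tolist()

print(years, quarters, weekdays)
print(leap_years, month_ends, iso_weeks)
```

The ISO week of 2011-01-02 belongs to the last week of 2010, which is why week numbers are best read together with the ISO year that `isocalendar()` also returns.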

DatetimeIndex

Given a number of periods and a frequency, the pd.date_range() function creates a sequence of dates. By default the frequency is one day.

import pandas as pd
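# Daily frequency
datelist = pd.date_range('2019/08/21', periods=5)
print(datelist)
# Monthly (month-end) frequency; pandas >= 2.2 prefers the alias 'ME'
datelist = pd.date_range('2019/08/21', periods=5,freq='M')
print(datelist)
# Build a date sequence over an interval
# (pd.datetime was removed in pandas 2.0; pd.Timestamp works in its place)
start = pd.Timestamp(2017, 11, 1)
end = pd.Timestamp(2017, 11, 5)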
dates = pd.date_range(start, end)
print(dates)

bdate_range() creates a range of business days. Unlike date_range(), it excludes Saturdays and Sundays.

import pandas as pd
datelist = pd.bdate_range('2011/11/03', periods=5)
print(datelist)

 

"""
datetimeindex
"""

import pandas as pd
# Daily frequency
d = pd.date_range('2019-01-01', periods=7)
print(d)
"""
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06', '2019-01-07'],
              dtype='datetime64[ns]', freq='D')
"""
print(d.dtype)
# datetime64[ns]
print(type(d))  # the type
# <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

# Generate a sequence of dates; the default freq is 'D' (daily)
d = pd.date_range('2019-10-01',periods=7)
print(d)
"""
DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
               '2019-10-05', '2019-10-06', '2019-10-07'],
              dtype='datetime64[ns]', freq='D')
"""

# Generate a sequence of dates with freq='M' (month end)
d2 = pd.date_range('2019-10-01',periods=5,freq='M')
print(d2)
"""
DatetimeIndex(['2019-10-31', '2019-11-30', '2019-12-31', '2020-01-31',
               '2020-02-29'],
              dtype='datetime64[ns]', freq='M')
"""

# Generate the dates in the closed interval [start, end]
d3 = pd.date_range('2019-10-1','2019-10-7')
print(d3)
"""
DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
               '2019-10-05', '2019-10-06', '2019-10-07'],
              dtype='datetime64[ns]', freq='D')
"""
# Generate a sequence of dates containing business days only
d4 = pd.bdate_range('2019-10-1',periods=7)
print(d4)
"""
DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
               '2019-10-07', '2019-10-08', '2019-10-09'],
              dtype='datetime64[ns]', freq='B')
"""

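The difference between date_range() and bdate_range() can be checked directly. This is a small sketch using the same start date as the example above; the names `all_days`, `work_days`, `weekend_in_all` and `weekend_in_work` are illustrative.

```python
import pandas as pd

all_days = pd.date_range('2019-10-01', periods=7)     # calendar days
work_days = pd.bdate_range('2019-10-01', periods=7)   # business days only

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 mark the weekend
weekend_in_all = [d.strftime('%Y-%m-%d') for d in all_days if d.dayofweek >= 5]
weekend_in_work = [d.strftime('%Y-%m-%d') for d in work_days if d.dayofweek >= 5]

print(weekend_in_all)
print(weekend_in_work)
```

Both ranges contain seven entries, but the business-day range has to extend further into the calendar to make up for the skipped weekend.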
 

DataFrame

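A DataFrame is a table-like data type. It can be thought of as a two-dimensional array with an index along each of its two dimensions, both of which can be changed. A DataFrame has the following characteristics:

  • Columns can be of different types

  • Size is mutable

  • Labeled axes (rows and columns)

  • Arithmetic operations can be performed on rows and columns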

import pandas as pd

# Create an empty DataFrame
df = pd.DataFrame()
print(df)
"""
Empty DataFrame   # empty
Columns: []       # columns
Index: []         # index
"""

# Create a DataFrame from a list
data = ['Tom', 'Jerry', 'Dog', 'Lily']
df = pd.DataFrame(data)
print(df)
"""
       0
0    Tom
1  Jerry
2    Dog
3   Lily
"""

# Create a DataFrame from a two-dimensional list
# columns=['Name', 'Age'] sets the column labels; without it they default to 0, 1, ...
data = [['Alex', 10],
        ['Bob', 12],
        ['Clarke', 13]
        ]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
"""
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
"""


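# Convert only the Age column to float (dtype=float in the constructor would also try to convert the Name strings, which recent pandas versions reject)
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})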
print(df)
"""
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
"""
# Create a DataFrame from a list of dictionaries
data = [{'a': 1, 'b': 2}, 
        {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
"""
   a   b     c
0  1   2   NaN
1  5  10  20.0
"""

# Create a DataFrame from a dictionary
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 
        'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
print(df)
"""
     Name  Age
s1    Tom   28
s2   Jack   34
s3  Steve   29
s4  Ricky   42
"""

data = {'one': pd.Series([1, 2, 3], 
                         index=['a', 'b', 'c']),
        'two': pd.Series([1, 2, 3, 4], 
                         index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)
"""
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
"""

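The four DataFrame characteristics listed earlier can each be seen in a few lines. This is a minimal sketch with made-up data; the column names `Score` and `Passed` and the variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'],
                   'Age': [28, 34, 29],
                   'Score': [90.5, 80.0, 70.5]},
                  index=['s1', 's2', 's3'])

# 1. Columns can hold different types
print(df.dtypes)

# 2. Size is mutable: adding a column changes the shape
df['Passed'] = df['Score'] >= 75
print(df.shape)

# 3. Both axes carry labels
row_labels = list(df.index)
col_labels = list(df.columns)

# 4. Arithmetic works on columns and on rows
age_plus_one = df['Age'] + 1                          # element-wise on one column
numeric_col_sums = df[['Age', 'Score']].sum()         # one total per column
numeric_row_sums = df[['Age', 'Score']].sum(axis=1)   # one total per row
print(numeric_col_sums)
print(numeric_row_sums)
```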
 

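Operations on the core data structures

Column access

A single column of a DataFrame is a Series. A DataFrame is by definition a labeled two-dimensional array, and each label corresponds to the name of a column.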

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])
"""
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
"""
print(df[['one', 'two']])
"""
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
"""

Adding columns

Adding a column to a DataFrame is simple: create a new column label and assign data to it.

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])

# Access the Name column
print(df['Name'],type(df['Name']))
"""
s1      Tom
s2     Jack
s3    Steve
s4    Ricky
Name: Name, dtype: object <class 'pandas.core.series.Series'>
"""

# Add a score column
df['score']=pd.Series([90, 80, 70, 60],
                      index=['s1','s2','s3','s4'])
print(df)
"""
     Name  Age  score
s1    Tom   28     90
s2   Jack   34     80
s3  Steve   29     70
s4  Ricky   42     60
"""

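When the assigned value is a Series, pandas matches it to the rows by index label rather than by position. The sketch below illustrates this with a deliberately shuffled and incomplete Series; the `passed` and `age_next_year` columns are hypothetical names added for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]},
                  index=['s1', 's2', 's3', 's4'])

# Assigning a Series aligns on index labels, not on position:
# the labels are out of order and 's4' is missing on purpose
df['score'] = pd.Series([70, 90, 80], index=['s3', 's1', 's2'])

# Assigning a plain list is positional and must match the row count
df['passed'] = [True, True, False, False]

# A new column can also be derived from existing ones
df['age_next_year'] = df['Age'] + 1
print(df)
```

The row labeled s4 has no matching entry in the assigned Series, so its score is NaN and the column becomes floating point.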
Deleting columns

A column can be deleted with the del statement or with the pop method that pandas provides. Both are used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print("dataframe is:")
print(df)
"""
dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
"""
# Delete one column: one
del df['one']
print(df)
"""
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
"""

# Delete a column by calling pop
df.pop('two')
print(df)
"""
   three
a   10.0
b   20.0
c   30.0
d    NaN
"""

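Besides del and pop, DataFrame.drop can also remove columns. The following sketch contrasts pop with drop on made-up data, as one way to see which of them modifies the original; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [10, 20, 30],
                   'three': [100, 200, 300]},
                  index=['a', 'b', 'c'])

# pop removes the column in place and returns it as a Series
popped = df.pop('two')

# drop returns a new DataFrame and leaves the original untouched
without_one = df.drop(columns=['one'])

remaining = list(df.columns)
print(popped)
print(remaining)
print(without_one)
```

pop is convenient when the removed column is still needed; drop is convenient when the original DataFrame must stay intact.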
 

 

Row access

To access just a few rows of a DataFrame, use array-style slicing with ":":

import pandas as pd

d = {'one' : pd.Series([1, 2, 3],
                       index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4],
                       index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])
"""
   one  two
c  3.0    3
d  NaN    4
"""

 

The loc method slices a DataFrame by index label. It is used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3],
                       index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4],
                       index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
# Access by index label
print(df.loc['b'])
"""
one    2.0
two    2.0
Name: b, dtype: float64
"""
print(df.loc[['a', 'b']])
"""
   one  two
a  1.0    1
b  2.0    2
"""

Unlike loc, iloc must be given the integer positions of the rows and columns. It is used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3],
                       index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4],
                       index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)
"""
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
"""
# Access by position
print(df.iloc[2])
"""
one    3.0
two    3.0
Name: c, dtype: float64
"""
print(df.iloc[[2, 3]])
"""
   one  two
c  3.0    3
d  NaN    4
"""

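One difference between loc and iloc that is easy to miss is how they treat slice endpoints. A minimal sketch on made-up data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4],
                   'two': [10, 20, 30, 40]},
                  index=['a', 'b', 'c', 'd'])

# loc slices by label and includes both endpoints
by_label = df.loc['b':'c']

# iloc slices by position and excludes the stop position, like a Python list
by_position = df.iloc[1:2]

# Both accept a row selector followed by a column selector
cell_by_label = df.loc['c', 'two']
cell_by_position = df.iloc[2, 1]

print(by_label)
print(by_position)
print(cell_by_label, cell_by_position)
```

Here `df.loc['b':'c']` returns two rows while `df.iloc[1:2]` returns one, even though both slices start at the same row.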
 

Adding rows

import pandas as pd

df = pd.DataFrame([['zs', 12],
                   ['ls', 4]],
                  columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16],
                    ['zl', 8]],
                   columns = ['Name','Age'])

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df2])
print(df)
"""
  Name  Age
0   zs   12
1   ls    4
0   ww   16
1   zl    8
"""

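The output above repeats the labels 0 and 1 because each DataFrame keeps its own index when the rows are combined. If unique row labels are wanted, one option is `pd.concat` with `ignore_index=True`, sketched here with the same data; the names `kept` and `renumbered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

# Default: each frame keeps its own labels, so 0 and 1 appear twice
kept = pd.concat([df, df2])

# ignore_index=True discards the old labels and numbers the rows from 0
renumbered = pd.concat([df, df2], ignore_index=True)

print(kept)
print(renumbered)
```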
 

Deleting rows

Rows are deleted from a DataFrame by index label. If a label is duplicated, all rows carrying it are deleted.

import pandas as pd

df = pd.DataFrame([['zs', 12],
                   ['ls', 4]],
                  columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16],
                    ['zl', 8]],
                   columns = ['Name','Age'])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df2])
print(df)
"""
  Name  Age
0   zs   12
1   ls    4
0   ww   16
1   zl    8
"""
# Delete the rows whose index label is 0
df = df.drop(0)
print(df)
"""
  Name  Age
1   ls    4
1   zl    8
"""

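When only one of several rows sharing a label should go, the labels have to be made unique first. The sketch below shows one possible approach using `reset_index(drop=True)`; the data mirrors the example above and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['zs', 'ls', 'ww', 'zl'],
                   'Age': [12, 4, 16, 8]},
                  index=[0, 1, 0, 1])

# drop(0) removes every row labeled 0
dropped_by_label = df.drop(0)

# To remove only the third row, give the rows unique labels first
unique = df.reset_index(drop=True)
dropped_one = unique.drop(2)

print(dropped_by_label)
print(dropped_one)
```

drop returns a new DataFrame in both cases, so `df` itself still holds all four rows afterwards.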
 

Modifying data in a DataFrame

To change data in a DataFrame, select the part to change and assign new data to it.

import pandas as pd

df = pd.DataFrame([['zs', 12],
                   ['ls', 4]],
                  columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16],
                    ['zl', 8]],
                   columns = ['Name','Age'])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df2])
print(df)
"""
  Name  Age
0   zs   12
1   ls    4
0   ww   16
1   zl    8
"""
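# Select the row label and the column in a single .loc call; chained indexing such as df['Name'][0] may assign to a temporary copy
df.loc[0, 'Name'] = 'Tom'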
print(df)
"""
  Name  Age
0  Tom   12
1   ls    4
0  Tom   16
1   zl    8
"""

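In the output above both rows labeled 0 were changed, because the assignment selects by label. The following sketch compares label-based, position-based and condition-based assignment on the same data; the replacement values 'Jerry' and 10 are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['zs', 'ls', 'ww', 'zl'],
                   'Age': [12, 4, 16, 8]},
                  index=[0, 1, 0, 1])

# Label-based: every row labeled 0 is updated
df.loc[0, 'Name'] = 'Tom'

# Position-based: only the row at position 3 is updated
name_col = df.columns.get_loc('Name')
df.iloc[3, name_col] = 'Jerry'

# Condition-based: update Age wherever the boolean mask is True
df.loc[df['Age'] < 10, 'Age'] = 10

print(df)
```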
 

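Common DataFrame attributes

No.  Attribute or method  Description
1    axes                 Returns the list of row and column axis labels.
2    dtypes               Returns the data type (dtype) of each column; a single column (Series) has dtype.
3    empty                Returns True if the DataFrame is empty.
4    ndim                 Returns the number of dimensions of the underlying data; 2 for a DataFrame.
5    size                 Returns the number of elements in the underlying data.
6    values               Returns the data as an ndarray.
7    head()               Returns the first n rows.
8    tail()               Returns the last n rows.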

Example code:

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df['score']=pd.Series([90, 80, 70, 60],
                      index=['s1','s2','s3','s4'])
# print(df)
"""
     Name  Age  score
s1    Tom   28     90
s2   Jack   34     80
s3  Steve   29     70
s4  Ricky   42     60
"""
print(df.axes)
#[Index(['s1', 's2', 's3', 's4'], dtype='object'), Index(['Name', 'Age', 'score'], dtype='object')]
print(df['Age'].dtype)#int64
print(df.empty)#False
print(df.ndim)#2
print(df.size)#12
print(df.values)
"""
[['Tom' 28 90]
 ['Jack' 34 80]
 ['Steve' 29 70]
 ['Ricky' 42 60]]
"""
print(df.head(3)) # the first three rows of df
"""
     Name  Age  score
s1    Tom   28     90
s2   Jack   34     80
s3  Steve   29     70
"""
print(df.tail(3)) # the last three rows of df
"""
     Name  Age  score
s2   Jack   34     80
s3  Steve   29     70
s4  Ricky   42     60
"""

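The same attributes exist on a single column, where they describe a one-dimensional Series instead of the whole table. A brief sketch comparing the two, using the data from the example above with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42],
                   'score': [90, 80, 70, 60]},
                  index=['s1', 's2', 's3', 's4'])
col = df['Age']

# (number of dimensions, shape, number of elements)
frame_info = (df.ndim, df.shape, df.size)
column_info = (col.ndim, col.shape, col.size)
print(frame_info)
print(column_info)

# head() and tail() return 5 rows unless n is given
first_two = df.head(2)
last_one = df.tail(1)
print(first_two)
print(last_one)
```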
 

 

 

 

posted @ 2019-09-10 10:06  maplethefox