pandas Study Notes (2): pandas Basics

Reading and Writing Files

1. Reading files

txt file:

1 2019-03-22 00:06:24.4463094 中文测试 
2 2019-03-22 00:06:32.4565680 需要编辑encoding 
3 2019-03-22 00:06:32.6835965 ashshsh 
4 2017-03-22 00:06:32.8041945 eggg

Either read_csv or read_table can be used to read it:

import pandas as pd
df = pd.read_table("./1.txt")
print(df)

Out[8]: 
           1 2019-03-22 00:06:24.4463094 中文测试 
0  2 2019-03-22 00:06:32.4565680 需要编辑encoding 
1       3 2019-03-22 00:06:32.6835965 ashshsh 
2           4 2017-03-22 00:06:32.8041945 eggg

df = pd.read_csv("./1.txt")
print(type(df))
<class 'pandas.core.frame.DataFrame'>
print(df.shape)
(3, 1)
print(df)
Out[12]: 
           1 2019-03-22 00:06:24.4463094 中文测试 
0  2 2019-03-22 00:06:32.4565680 需要编辑encoding 
1       3 2019-03-22 00:06:32.6835965 ashshsh 
2           4 2017-03-22 00:06:32.8041945 eggg

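However, both calls return a DataFrame with 3 rows and 1 column, not the 3 rows and 4 columns we expected.

read_csv loads delimited data from a file, a URL, or a file-like object, and its default delimiter is a comma (read_table defaults to a tab).

Because the txt file above contains neither commas nor tabs, the sep delimiter parameter has to be passed when reading: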

df =  pd.read_csv("./1.txt",sep=' ')
df
Out[19]: 
   1  2019-03-22  00:06:24.4463094          中文测试  Unnamed: 4
0  2  2019-03-22  00:06:32.4565680  需要编辑encoding         NaN
1  3  2019-03-22  00:06:32.6835965       ashshsh         NaN
2  4  2017-03-22  00:06:32.8041945          eggg         NaN
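The extra Unnamed: 4 column most likely comes from the trailing space at the end of the first line, and that first line is also consumed as the header, which is why only 3 data rows remain. Below is a minimal sketch of one way around both issues. It assumes the same four lines of text, held in an in-memory buffer instead of ./1.txt, and the column names are hypothetical ones chosen only for illustration.

```python
import io
import pandas as pd

# Same content as 1.txt, held in memory so the example is self-contained.
# Note the trailing space on the first three lines.
text = (
    "1 2019-03-22 00:06:24.4463094 中文测试 \n"
    "2 2019-03-22 00:06:32.4565680 需要编辑encoding \n"
    "3 2019-03-22 00:06:32.6835965 ashshsh \n"
    "4 2017-03-22 00:06:32.8041945 eggg\n"
)

# sep=r"\s+" splits on any run of whitespace, so trailing spaces
# do not create an extra empty column.
# header=None keeps the first line as data instead of using it as the header.
# The column names below are hypothetical, picked for this illustration.
df_txt = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",
    header=None,
    names=["id", "date", "time", "text"],
)
print(df_txt.shape)
print(df_txt)
```

With these options all four lines of the file are kept as data rows, each split into four fields.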

csv file:

As an example, I use the train.csv file provided by the Titanic competition on Kaggle:

df = pd.read_csv("./train.csv")
df
Out[26]: 
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q
[891 rows x 12 columns]

pandas can automatically infer the data type of each column, which makes later processing easier.

df.head(5)
Out[27]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S
[5 rows x 12 columns]
# head(5) shows only the first five rows
print('datatype of column Fare is: ' + str(df['Fare'].dtypes))
datatype of column Fare is: float64

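A single read_csv() call actually does several things:

Reads the csv file from the given local path and converts the data into a DataFrame
Sets the column headers of the dataset from the first row
Handles missing data correctly (empty fields become NaN)
Infers the data type of each column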
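The points above can be checked on a tiny example. This is a sketch only: the CSV text is made up for illustration and is read from an in-memory buffer rather than from train.csv.

```python
import io
import pandas as pd

# A tiny, made-up CSV: a header row, one empty Age field, mixed column types.
csv_text = "Name,Age,Fare\nAnn,22,7.25\nBob,,71.2833\nCid,35,8.05\n"

df_small = pd.read_csv(io.StringIO(csv_text))

# The header is taken from the first row.
print(df_small.columns.tolist())
# The empty Age field is read as NaN.
print(df_small["Age"].isna().sum())
# Age is inferred as float64 because an integer column cannot hold NaN by default.
print(df_small.dtypes)
```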

2. Writing files

Writing directly with pandas:

import pandas as pd

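# any number of lists
a = [1,2,3]
b = [4,5,6]

# the dictionary keys become the column names in the csv
dataframe = pd.DataFrame({'a_name':a,'b_name':b})

# save the DataFrame as csv; index controls whether the row labels are written (default True)
dataframe.to_csv("1.csv",index=False,sep=',')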

a_name	b_name
1	4
2	5
3	6
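To see what the index argument changes, here is a small sketch that relies on to_csv returning the CSV text as a string when no path is given. The variable names are only illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame({"a_name": [1, 2, 3], "b_name": [4, 5, 6]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = dataframe.to_csv()
without_index = dataframe.to_csv(index=False)

# With the default index=True, an unnamed leading column holds the row labels.
print(with_index.splitlines()[0])
# With index=False, only the data columns are written.
print(without_index.splitlines()[0])
```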

Writing row by row with the csv module:

import csv

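# newline="" keeps the csv module from writing a blank line after every row on Windows
with open("2.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)

    # write the column names first
    writer.writerow(["index","a_name","b_name"])
    # use writerows to write several rows at once
    writer.writerows([[0,1,3],[1,2,3],[2,3,4]])

index	a_name	b_name
0	1	3
1	2	3
2	3	4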

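Basic Data Structures

1. Series:

A Series generally consists of four parts: the values data, the index index, the storage type dtype, and the name of the series name. The index can also be given its own name, which is empty by default.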

s = pd.Series(data=[100, 'a', {'dic1': 5}],
              index=pd.Index(['id1', 20, 'third'], name='my_idx'),
              dtype='object',
              name='my_name')
s
Out[3]: 
my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object

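object denotes a mixed type: the example above stores an integer, a string, and a Python dictionary. In addition, at the time of writing pandas also treats a series of pure strings as an object series by default, although it can be stored with the string type instead.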
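A brief sketch of the difference, assuming a pandas version that provides the dedicated string dtype (1.0 or later). The series contents are made up for illustration.

```python
import pandas as pd

# Mixed Python objects can only be stored as object.
mixed = pd.Series([100, "a", {"dic1": 5}])
print(mixed.dtype)

# Explicitly requesting the dedicated string dtype for pure text.
text = pd.Series(["a", "b", "c"], dtype="string")
print(text.dtype)

# The .str accessor works on the string dtype as usual.
print(text.str.upper().tolist())
```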

Accessing attributes:

s.values
Out[4]: array([100, 'a', {'dic1': 5}], dtype=object)
s.index
Out[5]: Index(['id1', 20, 'third'], dtype='object', name='my_idx')
s.dtype
Out[6]: dtype('O')
s.name
Out[7]: 'my_name'
s.shape
Out[8]: (3,)
s['third']
Out[10]: {'dic1': 5}

2. DataFrame:

A DataFrame adds a column index on top of Series. A data frame can be constructed from two-dimensional data together with row and column indexes:

data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]

df = pd.DataFrame(data=data,
                  index=['row_%d' % i for i in range(3)],
                  columns=['col_%d' % i for i in range(3)])
df
Out[3]: 
       col_0 col_1  col_2
row_0      1     a    1.2
row_1      2     b    2.2
row_2      3     c    3.2

In general, though, a data frame is more often constructed from a mapping of column names to data, with a row index added on top:

df = pd.DataFrame(data={'col_0': [1, 2, 3],
                        'col_1': list('abc'),
                        'col_2': [1.2, 2.2, 3.2]},
                  index=['row_%d' % i for i in range(3)])
df
Out[4]: 
       col_0 col_1  col_2
row_0      1     a    1.2
row_1      2     b    2.2
row_2      3     c    3.2

A DataFrame can be transposed with .T:

df.T
Out[5]: 
      row_0 row_1 row_2
col_0     1     2     3
col_1     a     b     c
col_2   1.2   2.2   3.2

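Commonly used basic functions:

info and describe return, respectively, a summary of the table and the main statistics of its numeric columns (df here is the Titanic train.csv DataFrame read earlier):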

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

df.describe()
Out[44]: 
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200
[8 rows x 7 columns]
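That describe covers only the numeric columns by default can be checked on a small frame. This is a sketch that reuses the three-column layout from the DataFrame section. The name df_demo is only illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "col_0": [1, 2, 3],
    "col_1": list("abc"),
    "col_2": [1.2, 2.2, 3.2],
})

stats = df_demo.describe()

# The text column col_1 is left out of the default summary.
print(stats.columns.tolist())
# Individual statistics can be looked up by row label.
print(stats.loc["mean", "col_0"])
```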

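Window Objects

pandas has three kinds of windows: the rolling window rolling, the expanding window expanding, and the exponentially weighted window ewm.

1. Rolling window objects:

.rolling function

To use a rolling-window function, first call .rolling on a series to obtain a rolling object. Its most important parameter is the window size window.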

s = pd.Series([1,2,3,4,5])
roller = s.rolling(window=3)
roller
Out[53]: Rolling [window=3,center=False,axis=0]

roller.mean()
Out[98]: 
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
dtype: float64

Because the window size (window) is 3, the first two results are NaN. From the third element on, each result is the mean of elements n, n-1 and n-2.
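The window arithmetic can be verified by hand. The sketch below also shows min_periods, a rolling parameter not used in the example above, which allows partial windows at the start.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
rolled = s.rolling(window=3).mean()

# Manual check for position 2: the mean of elements 0, 1 and 2.
manual = s.iloc[0:3].mean()
print(rolled.iloc[2], manual)

# min_periods=1 produces a value as soon as one observation is available,
# so the first two results are no longer NaN.
partial = s.rolling(window=3, min_periods=1).mean()
print(partial.tolist())
```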

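.expanding function

This function can be applied to a series of data. Specify the min_periods = n parameter and then apply an appropriate statistical function to the result.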

import numpy as np

df = pd.DataFrame(np.random.randn(10,4),
                  index = pd.date_range('1/1/2018',periods=10),
                  columns= ['A','B','C','D'])
print(df.expanding(min_periods=3).mean())
                   A         B         C         D
2018-01-01       NaN       NaN       NaN       NaN
2018-01-02       NaN       NaN       NaN       NaN
2018-01-03 -1.282240 -0.172426 -0.421581 -0.419471
2018-01-04 -1.292201 -0.284911 -0.350558 -0.272313
2018-01-05 -0.757201 -0.324802 -0.220186 -0.148037
2018-01-06 -0.435084 -0.339102 -0.213042 -0.097067
2018-01-07 -0.627613 -0.049110 -0.107301  0.106537
2018-01-08 -0.888303  0.005999 -0.184805  0.064583
2018-01-09 -0.753720  0.015672 -0.172906  0.115930
2018-01-10 -0.568566 -0.135122  0.052374  0.095163
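# min_periods=n in expanding sets how many observations are needed before a value is produced, so the first n-1 results are NaN, just as with rolling(window=n); unlike rolling, the window then keeps growing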
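Because the output above comes from random data, the relationship between the two is easier to see on fixed values. This is a sketch with a made-up series: both results start with two NaN values, but later entries differ because the expanding window keeps growing.

```python
import pandas as pd

s = pd.Series([1, 3, 6, 10, 15])

expanded = s.expanding(min_periods=3).mean()
rolled = s.rolling(window=3).mean()

# Both have NaN in the first two positions and agree at position 2.
print(expanded.tolist())
print(rolled.tolist())

# At the last position, expanding averages all five values,
# while rolling averages only the last three.
print(expanded.iloc[4], rolled.iloc[4])
```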

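.ewm function

ewm() can be applied to a series of data. Specify one of the com, span, or halflife parameters and then apply an appropriate statistical function to the result. It assigns weights that decay exponentially.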

import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2019', periods=10),
   columns = ['A', 'B', 'C', 'D'])
print(df.ewm(com=0.5).mean())
                   A         B         C         D
2019-01-01  0.687146 -0.228380  0.352405 -1.540756
2019-01-02  0.079633 -0.239311  0.186473 -0.465943
2019-01-03  1.640359  0.740408  0.959875 -0.554901
2019-01-04  0.220392 -0.106495  0.517194 -0.578924
2019-01-05 -0.516006 -0.124318  0.834525  0.010539
2019-01-06 -0.746540 -0.009502  0.466689  0.348502
2019-01-07 -0.174154  0.505380  0.948555 -0.169799
2019-01-08  0.200864  0.754257  0.825790  0.095635
2019-01-09 -0.057712  0.549765  0.265503 -0.523900
2019-01-10 -0.668174  0.822662  0.567739 -0.605294
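To make the exponential weighting concrete, the sketch below recomputes the result by hand on a short made-up series. It relies on the relation alpha = 1 / (1 + com) and on the default adjust=True behaviour, in which each value is a weighted average with weights (1 - alpha)**i.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
com = 0.5
alpha = 1 / (1 + com)  # com and alpha describe the same decay

ewm_com = s.ewm(com=com).mean()
ewm_alpha = s.ewm(alpha=alpha).mean()

# Manual weighted average for the default adjust=True:
# the newest observation gets weight 1, the one before it (1 - alpha), and so on.
manual = []
for t in range(len(s)):
    weights = (1 - alpha) ** np.arange(t, -1, -1)
    values = s.iloc[: t + 1].to_numpy()
    manual.append(np.sum(weights * values) / np.sum(weights))

print(ewm_com.tolist())
print(manual)
```

Passing com=0.5 or alpha=2/3 should give the same series, and both should match the manual calculation.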

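2. Expanding windows

An expanding window, also called a cumulative window, can be understood as a window of dynamic length: its size runs from the start of the series to the position currently being processed, and the aggregation function is applied to these progressively expanding windows. Concretely, for a series a1, a2, a3, a4, the windows at each position are [a1], [a1, a2], [a1, a2, a3], [a1, a2, a3, a4].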

s = pd.Series([1,3,6,10])
s.expanding().mean()
Out[73]: 
0    1.000000
1    2.000000
2    3.333333
3    5.000000
dtype: float64
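The same result can be reproduced from the definition, as a running sum divided by the number of elements seen so far. This is a small sketch for checking the idea, not a replacement for expanding.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 6, 10])

expanding_mean = s.expanding().mean()

# Cumulative sum divided by the window length at each position (1, 2, 3, 4).
by_hand = s.cumsum() / np.arange(1, len(s) + 1)

print(expanding_mean.tolist())
print(by_hand.tolist())
```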

posted @ 2020-12-19 16:10 AiGgBoY
