pandas and Its DataFrame and Series Data Structures

Introduction to Pandas

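Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data.
In 2008, developer Wes McKinney started building Pandas when he needed a high-performance, flexible tool for data analysis.
Before Pandas, Python was used mainly for data munging and preparation and contributed little to data analysis itself. Pandas filled that gap.
With Pandas, the five typical steps of data processing and analysis (load, prepare, manipulate, model, and analyze) can all be carried out regardless of where the data comes from.
Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.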
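
The five steps named above can be sketched roughly as follows. This is only an illustration: the in-memory CSV, the column names (`day`, `units`, `price`), and the choice of a straight-line fit with `np.polyfit` for the "model" step are all assumptions made for the example, not something the original post prescribes.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV standing in for a real data source.
raw = io.StringIO(
    "day,units,price\n"
    "1,10,2.0\n"
    "2,,2.0\n"
    "3,14,2.5\n"
    "4,16,2.5\n"
)

# 1. Load: parse the CSV text into a DataFrame.
sales = pd.read_csv(raw)

# 2. Prepare: fill the missing unit count by linear interpolation.
sales["units"] = sales["units"].interpolate()

# 3. Manipulate: derive a new column from existing ones.
sales["revenue"] = sales["units"] * sales["price"]

# 4. Model: fit a least-squares line units = slope * day + intercept.
slope, intercept = np.polyfit(
    sales["day"].to_numpy(), sales["units"].to_numpy(), deg=1
)

# 5. Analyze: summarize the result.
total_revenue = sales["revenue"].sum()
print(sales)
print(f"trend: {slope:.2f} units/day, total revenue: {total_revenue:.2f}")
```

Real projects would usually load from a file or database and use a dedicated modeling library; the point here is only that each step maps onto a short Pandas or NumPy call.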

Key Features of Pandas

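Fast and efficient DataFrame object with default and customized indexing
Tools for loading data from different file formats into in-memory data objects
Data alignment and integrated handling of missing data
Reshaping and pivoting of data sets
Label-based slicing, indexing, and subsetting of large data sets
Columns can be inserted into or deleted from a data structure
Group-by functionality for aggregating and transforming data
High-performance merging and joining of data
Time series functionality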
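
A few of the features listed above (label-based alignment with missing data, group-by aggregation, and merging) can be illustrated with a minimal sketch. All data values and the names `trips`, `speeds`, `cost`, and `speed_kmh` are made up for the example.

```python
import pandas as pd

# Data alignment: values are matched by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b                     # "x" has no partner in b, so the result there is NaN
filled = a.add(b, fill_value=0)   # treat the missing partner as 0 instead

# Group-by aggregation: mean cost per means of transport.
trips = pd.DataFrame({
    "tool": ["plane", "train", "plane", "bus"],
    "cost": [300, 80, 500, 20],
})
mean_cost = trips.groupby("tool")["cost"].mean()

# Merge: a left join keeps every row of trips; unmatched rows get NaN.
speeds = pd.DataFrame({"tool": ["plane", "train"], "speed_kmh": [800, 200]})
merged = trips.merge(speeds, on="tool", how="left")

print(total)
print(filled)
print(mean_cost)
print(merged)
```

Because `speeds` has no row for `bus`, the merged table carries NaN in `speed_kmh` for that trip, which is the same missing-data convention the alignment example produces.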

Pandas Data Structures

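- Pandas provides its own basic data structures. Python's built-in data types still work alongside them, and you can also define your own data types with classes.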

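Name	Description
Series	One-dimensional labeled array, similar to a one-dimensional NumPy array
DataFrame	Two-dimensional tabular data structure; can be thought of as a container of Series
Panel	Three-dimensional array; can be thought of as a container of DataFrames (deprecated in pandas 0.20 and removed in 0.25, independent of the Python version)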
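
The "container of Series" description can be checked directly. The sketch below uses made-up names (`heights`, `weights`, `people`) and shows that a DataFrame can be assembled from Series and that selecting a single column or row yields a Series again.

```python
import pandas as pd

# Two named Series that share the default integer index 0, 1, 2.
heights = pd.Series([175, 182, 178], name="height")
weights = pd.Series([65, 68, 70], name="weight")

# Placing them side by side (axis=1) yields a DataFrame whose
# column names come from the Series names.
people = pd.concat([heights, weights], axis=1)

col = people["height"]   # one column, returned as a Series
row = people.loc[0]      # one row, returned as a Series indexed by column name

print(people)
print(type(col).__name__, type(row).__name__)
```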

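- pandas.Series() creates a one-dimensional array
- df = pandas.DataFrame() creates a two-dimensional array (table)
  - df.index: the row labels
  - df.columns: the column labels
  - df.values: all the data, as a NumPy array
  - df.describe(): summary statistics such as mean, maximum, and minimum
  - df.T: transpose (swap rows and columns)
  - df.sort_index(): sort by index labels
  - df.sort_values(): sort by values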

# Series, DataFrame
import numpy as np
import pandas as pd


def function1():
    # Build a one-dimensional array with Series; np.nan marks a missing value
    s = pd.Series([1, np.nan, 2])
    # Index on the left, values on the right
    # 0    1.0
    # 1    NaN
    # 2    2.0
    # dtype: float64
    print(s)
 
    # Build a two-dimensional array (table) with DataFrame
    df = pd.DataFrame(np.arange(1, 7).reshape(2, 3))
    #    0  1  2
    # 0  1  2  3
    # 1  4  5  6
    print(df)
 
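    # Build a DataFrame from a dict that maps column names to column data
    dictionary = {'height': pd.Series([175, 182, 178]),
                  # date_range(): generates a sequence of dates; periods is how many to generate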
                  'birthday': pd.date_range("2000-1-1", periods=3),
                  'weight': np.array([65, 68, 70], dtype='int32'),
                  'tool': pd.Categorical(["plane", "train", "bus"])}
    df1 = pd.DataFrame(dictionary)
 
    #    height   birthday  weight   tool
    # 0     175 2000-01-01      65  plane
    # 1     182 2000-01-02      68  train
    # 2     178 2000-01-03      70    bus
    print(df1)
 
    # Row index (the label of each row)
    # RangeIndex(start=0, stop=3, step=1)
    print(df1.index)
 
    # Column names
    # Index(['height', 'birthday', 'weight', 'tool'], dtype='object')
    print(df1.columns)
 
    # All values, as a NumPy array
    # [[175 Timestamp('2000-01-01 00:00:00') 65 'plane']
    #  [182 Timestamp('2000-01-02 00:00:00') 68 'train']
    #  [178 Timestamp('2000-01-03 00:00:00') 70 'bus']]
    print(df1.values)
 
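    # Summary statistics for the numeric columns
    # (newer pandas versions may also summarize the datetime column birthday)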
    #            height     weight
    # count    3.000000   3.000000
    # mean   178.333333  67.666667
    # ...       ...         ...
    # max    182.000000  70.000000
    print(df1.describe())
 
    # Transpose: swap rows and columns
    #                             0                    1                    2
    # height                    175                  182                  178
    # birthday  2000-01-01 00:00:00  2000-01-02 00:00:00  2000-01-03 00:00:00
    # weight                     65                   68                   70
    # tool                    plane                train                  bus
    print(df1.T)
 
    # axis=0 sorts by the row labels (the index); ascending=False gives descending order
    #    height   birthday  weight   tool
    # 2     178 2000-01-03      70    bus
    # 1     182 2000-01-02      68  train
    # 0     175 2000-01-01      65  plane
    print(df1.sort_index(axis=0, ascending=False))
 
    # Sort by the values in the weight column, ascending (ascending is also the default)
    #    height   birthday  weight   tool
    # 0     175 2000-01-01      65  plane
    # 1     182 2000-01-02      68  train
    # 2     178 2000-01-03      70    bus
    print(df1.sort_values(by='weight', ascending=True))

posted @ 2021-01-13 15:41 a最简单
