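Deep Learning-02-Data Preprocessing-DataFrame Slicing

This blog is only a record of my day-to-day study and work; corrections are welcome wherever it falls short.

Using the training set from Kaggle's house price prediction competition as an example, this post walks through common DataFrame slicing operations.

I. Reading the data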

import numpy as np
import pandas as pd

file_path = r'***\kaggle_house_pred_train.csv'  # raw string keeps the backslash literal
data = pd.read_csv(file_path)

data.columns  # column names

Output:

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

data.index  # row labels run from 0 to 1459 (1460 elements; stop=1460 is exclusive) with a step of 1

Output:

RangeIndex(start=0, stop=1460, step=1)
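
To try the same two attributes without the Kaggle file, here is a minimal sketch on a made-up three-row CSV. The name toy and every value in it are my own inventions for illustration; only the column names are borrowed from the dataset above.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the values are invented, not taken from the real file.
csv_text = """Id,MSZoning,LotArea,SalePrice
1,RL,1000,100
2,RL,2000,200
3,RM,3000,300
"""
toy = pd.read_csv(io.StringIO(csv_text))

print(list(toy.columns))  # column names, in file order
print(toy.index)          # RangeIndex(start=0, stop=3, step=1)

# stop is exclusive, so the last label is stop - 1.
n_rows = len(toy.index)
last_label = toy.index[-1]
print(n_rows, last_label)
```

The same reading applies to the real data: stop=1460 means 1460 rows whose last label is 1459.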

II. Common DataFrame slicing methods

The common ways to slice a DataFrame are:

Assume df is a DataFrame.

① By row/column name or position: df[label] or df[index], and df[start_index:end_index]

② Row and column slicing with loc: df.loc[start_index:end_index, start_col:end_col]

③ Row and column slicing with iloc: df.iloc[start_index:end_index, start_col:end_col]

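A note up front: when I was first learning this, I was quite confused about the difference between loc and iloc. I state it here in advance so you can get a feel for it in the examples that follow.

loc stands for location, and the i in iloc stands for integer. loc takes row and column labels (names), while iloc takes integer positions (which row or column, counting from 0). In addition, a loc slice is a closed interval that includes both endpoints, whereas an iloc slice is left-closed and right-open.

Anything method ① can do, loc and iloc can do as well, so this post focuses on loc and iloc and covers method ① only briefly.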
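
The closed versus half-open difference is easiest to see when the row labels are not 0, 1, 2, .... The sketch below is only an illustration: the frame toy, its values, and the labels 10 to 40 are all made up.

```python
import pandas as pd

# Made-up frame whose row labels (10, 20, 30, 40) differ from the positions (0, 1, 2, 3).
toy = pd.DataFrame(
    {"Id": [1, 2, 3, 4], "SalePrice": [100, 200, 300, 400]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[10:30]    # labels 10, 20 and 30: the end label is included
by_position = toy.iloc[0:2]  # positions 0 and 1: the end position is excluded

print(by_label)
print(by_position)
```

Here toy.loc[10:30] returns three rows, while toy.iloc[0:2] returns two.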

III. Examples

DataFrame slicing

For method ①, df[index] selects columns; index can be a single column label or a list of column labels.
```
data['SalePrice']
data[['Id','SalePrice']] #index = ['Id','SalePrice']
```

With df[start_index:end_index], the selection is over rows, and an integer slice is left-closed and right-open.
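
As a small sketch of method ①, assuming a made-up four-row frame (toy and its values are hypothetical), the three forms behave like this:

```python
import pandas as pd

# Made-up frame for illustration only.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "MSZoning": ["RL", "RL", "RM", "RL"],
    "SalePrice": [100, 200, 300, 400],
})

one_col = toy["SalePrice"]           # a single label gives a Series
two_cols = toy[["Id", "SalePrice"]]  # a list of labels gives a DataFrame
rows = toy[1:3]                      # an integer slice picks rows 1 and 2; row 3 is excluded

print(type(one_col).__name__, two_cols.shape, list(rows["Id"]))
```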

Plain [] indexing on a DataFrame does not support the following forms: df[start_index:end_index, start_col:end_col], column slicing with df['start_col':'end_col'], or row selection with df[[index1, index2, ...]].
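
To check this for yourself, the sketch below tries each unsupported form on a made-up frame and records the exception it raises, then shows loc / iloc spellings that do work. The helper error_name and the frame toy are hypothetical names of my own, and the exact exception type may differ between pandas versions, so the code only reports it rather than relying on it.

```python
import pandas as pd

# Made-up frame with the default 0..3 row labels.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [1000, 2000, 3000, 4000],
    "SalePrice": [100, 200, 300, 400],
})


def error_name(action):
    """Run action() and return the name of the exception it raises, or None."""
    try:
        action()
    except Exception as exc:  # broad on purpose: the exact type can vary by version
        return type(exc).__name__
    return None


two_axis = error_name(lambda: toy[0:2, 0:2])         # rows and columns inside one []
col_slice = error_name(lambda: toy["Id":"LotArea"])  # slicing columns by name
row_list = error_name(lambda: toy[[0, 2]])           # a list of row labels

print(two_axis, col_slice, row_list)

# Equivalent selections written with loc / iloc.
ok_two_axis = toy.iloc[0:2, 0:2]
ok_col_slice = toy.loc[:, "Id":"LotArea"]
ok_row_list = toy.loc[[0, 2]]
```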

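Note: in practice, datasets are usually large and keep the default integer index, so the row labels usually coincide with the row positions, i.e. 0 to len(data) - 1.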
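
One caveat worth a quick sketch: labels and positions only coincide until rows are dropped. In the made-up example below (toy and the 150 threshold are my own choices), filtering keeps the original labels, and reset_index(drop=True) is one way to renumber them.

```python
import pandas as pd

# Made-up frame with the default 0..3 row labels.
toy = pd.DataFrame({"Id": [1, 2, 3, 4], "SalePrice": [100, 200, 300, 400]})

# With the default index, label 2 and position 2 are the same row.
same_row = toy.loc[2].equals(toy.iloc[2])

# After filtering, the surviving rows keep their old labels 1, 2, 3.
expensive = toy[toy["SalePrice"] > 150]
first_by_position = expensive.iloc[0]    # position 0 is the row labelled 1
has_label_zero = 0 in expensive.index    # label 0 is gone

# Renumber so that labels match positions again.
renumbered = expensive.reset_index(drop=True)

print(same_row, list(expensive.index), list(renumbered.index))
```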

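loc

loc slices by row and column labels, in the form df.loc[indexs, columns], where indexs and columns can each be a single label, a list of labels, a label slice, or a boolean index. columns can be omitted, but indexs cannot (use : to select all rows).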

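data.loc[:,['Id','Alley']] # select the Id and Alley columns
data.loc[:,'Id':'Alley'] # select columns Id through Alley (both included)
data.loc[1:5,['Id','Alley']] # rows 1 through 5 (5 included), Id and Alley columns
data.loc[1:5,'Id':'Alley'] # rows 1 through 5, columns Id through Alley
data.loc[1:5,:] # rows 1 through 5, all columns
data.loc[[0,2,3],['Id','Alley']] # rows 0, 2, 3, Id and Alley columns
data.loc[0,'Id':'Alley'] # row 0, columns Id through Alley
data.loc[[0,2,3]] # rows 0, 2, 3, all columns
data.loc[0,'MSZoning'] # the value at row 0, column MSZoning

data.loc[data['Alley'].isna(),:] # rows where Alley is missing
data.loc[:,data.dtypes != 'object'] # columns whose dtype is not object; handy in preprocessing for pulling out the numeric columns to normalize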
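
The two boolean-index lines can be sketched end to end on a made-up frame. Everything here (toy, its values, the choice of standardization as the normalization) is an assumption for illustration. I also swap in select_dtypes(include="number") for the dtypes != 'object' comparison, since it does not depend on text columns being stored as object.

```python
import pandas as pd

# Made-up frame: two text columns (one with missing values) and two numeric columns.
toy = pd.DataFrame({
    "Alley": ["Grvl", None, "Pave", None],
    "MSZoning": ["RL", "RL", "RM", "RL"],
    "LotArea": [1000.0, 2000.0, 3000.0, 4000.0],
    "SalePrice": [100, 200, 300, 400],
})

# Boolean row mask: keep the rows where Alley is missing.
missing_alley = toy.loc[toy["Alley"].isna(), :]

# Pull out the numeric columns (alternative to data.dtypes != 'object').
numeric = toy.select_dtypes(include="number")

# Standardize each numeric column to zero mean and unit sample standard deviation.
scaled = (numeric - numeric.mean()) / numeric.std()

print(missing_alley)
print(scaled)
```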

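iloc

iloc slices by integer position, in the form df.iloc[indexs, col_indexs], where indexs and col_indexs can each be a single position, a list of positions, a position slice, or a boolean index.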

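data.iloc[0,2] # the value at row 0, column 2
data.iloc[0,2:5] # row 0, columns 2 through 4
data.iloc[0:4,2:5] # rows 0 through 3, columns 2 through 4
data.iloc[0:4,[2,3]] # rows 0 through 3, columns 2 and 3
data.iloc[[0,3,6],[2,3]] # rows 0, 3, 6, columns 2 and 3
data.iloc[[2,3]] # rows 2 and 3, all columns
data.iloc[:,[2,3]] # all rows, columns 2 and 3

# Boolean indexing works much as with loc, except that iloc expects a boolean array or list rather than a boolean Series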
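
A minimal sketch of boolean indexing with iloc, on a made-up frame (toy and its values are hypothetical). The row mask is converted with to_numpy() because iloc works with a plain boolean array rather than a labelled boolean Series.

```python
import pandas as pd

# Made-up frame for illustration only.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Alley": ["Grvl", None, "Pave", None],
})

mask = toy["Alley"].isna()            # boolean Series, aligned on row labels

rows = toy.iloc[mask.to_numpy()]      # boolean array over row positions
cols = toy.iloc[:, [True, False]]     # boolean list over column positions: keep only Id

print(rows)
print(cols)
```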

Attachment: dataset link

posted @ 2024-10-18 15:16 AfroNicky
