pandas

pandas is a powerful Python library for data manipulation and analysis.

1. Data structure basics

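Series:

Creation: a Series can be built from a list, a NumPy array, a dictionary, and so on. For example:

import pandas as pd
import numpy as np

# Create from a list
s_list = pd.Series([1, 2, 3, 4])
# Create from a NumPy array
s_array = pd.Series(np.array([5, 6, 7, 8]))
# Create from a dictionary; the keys become the Series index
s_dict = pd.Series({'a': 9, 'b': 10, 'c': 11})

Index operations: values can be retrieved through the index, and a new index can be assigned.

print(s_list[1])  # label 1 on the default integer index, i.e. the second element
s_list.index = ['idx1', 'idx2', 'idx3', 'idx4']  # assign a new index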
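
Once the index holds string labels, it helps to state explicitly whether a lookup is by label or by position. The following is a minimal sketch of that distinction; the variable names are illustrative and not part of the original post.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['idx1', 'idx2', 'idx3', 'idx4'])

by_label = s.loc['idx2']          # look up by index label
by_position = s.iloc[1]           # look up by 0-based integer position
subset = s.loc[['idx1', 'idx4']]  # several labels at once

print(by_label, by_position)      # both refer to the second element
print(subset.tolist())
```

Using .loc and .iloc avoids the ambiguity of plain square brackets, whose meaning depends on the type of the index.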

DataFrame:

Creation: a DataFrame can be built from a dictionary, a list of lists, another DataFrame, and so on.

# Create from a dictionary
df_dict = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
# Create from a list of lists
df_lists = pd.DataFrame([[7, 8], [9, 10], [11, 12]], columns=['col3', 'col4'])

Row and column operations: select specific rows and columns, and add or delete rows or columns.

# Select a column
column_data = df_dict['col1']
# Select a row (with loc and iloc)
row_data = df_dict.loc[1]  # by index label
row_data_iloc = df_dict.iloc[2]  # by integer position
# Add a column
df_dict['new_col'] = [100, 200, 300]
# Delete a column
df_dict = df_dict.drop(columns='new_col')
# Add a row
new_row = pd.DataFrame({'col1': [4], 'col2': [5]}, index=[3])
df_dict = pd.concat([df_dict, new_row])
# Delete the row labelled 2
df_dict = df_dict.drop(index=2)
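
After rows are added and dropped, index labels and integer positions no longer coincide. The sketch below repeats the same add and drop steps on a fresh frame to show the effect; the names df, row_by_label, row_by_position, and df_reset are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
new_row = pd.DataFrame({'col1': [4], 'col2': [5]}, index=[3])
df = pd.concat([df, new_row])   # labels are now 0, 1, 2, 3
df = df.drop(index=2)           # labels are now 0, 1, 3

row_by_label = df.loc[3]        # the row whose label is 3
row_by_position = df.iloc[2]    # the third row, which is the same row here

df_reset = df.reset_index(drop=True)  # renumber the labels as 0, 1, 2
print(list(df.index), list(df_reset.index))
```

Calling reset_index(drop=True) is one way to bring labels and positions back in line after such edits.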

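2. Reading and writing data

Reading data in various formats:

CSV files:

df_csv = pd.read_csv('data.csv')
# The separator, encoding, presence of a header row, and other options can be specified
df_csv_custom = pd.read_csv('data_custom.csv', sep=';', encoding='utf-8', header=None)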
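
To try these options without a file on disk, the text can be fed in through io.StringIO. This is a small sketch under the assumption of a semicolon-separated input with no header row; the sample data and the column names passed to names are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a semicolon-separated file that has no header row.
raw = "1;alice;3.5\n2;bob;4.0\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    header=None,                     # the first line is data, not column names
    names=['id', 'name', 'score'],   # supply column names ourselves
)
print(df.dtypes)
```

Without names, header=None would leave the columns labelled 0, 1, 2.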

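Excel files:

# Reading .xlsx files requires an Excel engine such as openpyxl to be installed
df_excel = pd.read_excel('data.xlsx')
# The worksheet, the range to read, and other options can be specified
df_excel_sheet = pd.read_excel('data.xlsx', sheet_name='Sheet2')

SQL databases:

import sqlite3

conn = sqlite3.connect('database.db')
query = "SELECT * FROM table_name"
df_sql = pd.read_sql(query, conn)
conn.close()  # release the connection once the data is loaded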
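
The SQL round trip can be tried end to end with an in-memory SQLite database. The following sketch assumes a hypothetical table named items with made-up rows; it writes the table with to_sql and reads it back with a parameterized query.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # temporary database, nothing is written to disk
try:
    source = pd.DataFrame({'id': [1, 2, 3], 'price': [9.5, 20.0, 3.25]})
    source.to_sql('items', conn, index=False)

    # '?' is the sqlite3 placeholder; values are passed separately via params
    df_sql = pd.read_sql_query(
        'SELECT id, price FROM items WHERE price > ?', conn, params=(5,))
finally:
    conn.close()

print(df_sql)
```

Passing values through params rather than formatting them into the query string keeps the query safe from SQL injection.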

Writing data:

Writing a CSV file:

df_csv.to_csv('output.csv', index=False)  # index=False leaves the row index out of the file

Writing an Excel file:

df_excel.to_excel('output.xlsx', index=False, sheet_name='Sheet1')
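
The effect of index=False is easiest to see by comparing the generated text. As a rough illustration, to_csv returns the CSV as a string when no path is given; the frame below is made-up sample data.

```python
import io
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})

with_index = df.to_csv()                # row index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns

print(with_index)
print(without_index)

# Reading the index-free text back gives the original columns again.
round_trip = pd.read_csv(io.StringIO(without_index))
```

If the index is written and the file is later read without index_col, it comes back as an extra column.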

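3. Data cleaning and preprocessing

In the snippets from here on, df stands for any existing DataFrame.

Handling missing values:

Detecting missing values:

df.isnull()  # returns a boolean DataFrame marking each missing element
df.isna()  # identical to isnull()

Ways to handle missing values:

Dropping rows or columns that contain missing values:

df_drop_rows = df.dropna(axis=0)  # drop rows that contain missing values
df_drop_cols = df.dropna(axis=1)  # drop columns that contain missing values

Filling missing values:

df_fill_constant = df.fillna(0)  # fill with a constant
df_fill_mean = df.fillna(df.mean(numeric_only=True))  # fill numeric columns with their column mean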
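
Here is a small worked example of the detection, dropping, and filling steps on a frame with known gaps. The data and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, 5.0, 6.0],
    'label': ['x', 'y', 'z'],
})

missing_per_column = df.isna().sum()             # number of missing values per column
rows_complete = df.dropna()                      # rows with no missing value at all
rows_a_present = df.dropna(subset=['a'])         # only require column 'a' to be present
filled = df.fillna(df.mean(numeric_only=True))   # each numeric column gets its own mean

print(missing_per_column)
print(filled)
```

The subset argument is useful when only some columns are required to be complete.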

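Handling duplicates:

Detecting duplicates:

df.duplicated()  # returns a boolean Series marking rows that repeat an earlier row

Removing duplicates:

df_drop_duplicates = df.drop_duplicates()  # keeps the first occurrence of each row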
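
By default both methods compare every column and keep the first occurrence. The sketch below, on made-up data, shows how the subset and keep parameters change that.

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['ann', 'bob', 'ann', 'ann'],
    'city': ['Oslo', 'Rome', 'Oslo', 'Bergen'],
})

full_dupes = df.duplicated()                  # compares every column
user_dupes = df.duplicated(subset=['user'])   # compares only the 'user' column
last_per_user = df.drop_duplicates(subset=['user'], keep='last')

print(full_dupes.tolist())
print(user_dupes.tolist())
print(last_per_user)
```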

Data type conversion:

Viewing data types:
```
df.dtypes
```

Converting data types:

df['col_int'] = df['col_int'].astype(int)  # raises if the column contains missing values
df['col_float'] = df['col_float'].astype(float)
df['col_date'] = pd.to_datetime(df['col_date'])
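
Real columns often contain values that cannot be converted. One possible approach, sketched here on invented data, is to coerce bad values to missing and use the nullable Int64 dtype so the integer column can hold the gap.

```python
import pandas as pd

df = pd.DataFrame({
    'col_int': ['1', '2', 'n/a'],
    'col_date': ['2024-01-01', '2024-02-15', 'not a date'],
})

numbers = pd.to_numeric(df['col_int'], errors='coerce')  # 'n/a' becomes NaN
df['col_int'] = numbers.astype('Int64')                  # nullable integers keep the gap as <NA>
df['col_date'] = pd.to_datetime(df['col_date'], errors='coerce')  # bad value becomes NaT

print(df.dtypes)
print(df)
```

Whether silently coercing bad values is acceptable depends on the data; the default errors='raise' is safer when bad input should stop the pipeline.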

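4. Filtering and querying data

Filtering on conditions:

Use boolean indexing to filter rows.

condition = df['col1'] > 10
filtered_df = df[condition]

Filtering on multiple conditions:

condition1 = df['col1'] > 10
condition2 = df['col2'] < 20
filtered_df_multicondition = df[condition1 & condition2]  # combine with & (and), | (or), ~ (not)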
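
As a concrete illustration of combining conditions, the sketch below applies AND, OR, NOT, and a membership test to a small made-up frame. Each comparison needs its own parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

df = pd.DataFrame({'col1': [5, 12, 30, 8], 'col2': [10, 15, 25, 40]})

both = df[(df['col1'] > 10) & (df['col2'] < 20)]     # AND
either = df[(df['col1'] > 10) | (df['col2'] > 30)]   # OR
negated = df[~(df['col1'] > 10)]                     # NOT
in_set = df[df['col1'].isin([5, 30])]                # membership test

print(both)
print(either)
```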

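Filtering with loc and iloc:
- loc selects by index label; a label slice includes both endpoints:
```
df_loc = df.loc[1:3, ['col1', 'col2']]
```
- iloc selects by integer position; the end of a slice is excluded:
```
df_iloc = df.iloc[2:5, 0:2]
```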
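
The different slice semantics of loc and iloc are a common source of off-by-one errors. A quick check on an illustrative frame with the default integer index:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': range(6),
    'col2': range(10, 16),
    'col3': list('abcdef'),
})

by_label = df.loc[1:3, ['col1', 'col2']]  # labels 1, 2 and 3: the end label is included
by_position = df.iloc[1:3, 0:2]           # positions 1 and 2: the end position is excluded

print(len(by_label), len(by_position))
```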

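5. Aggregation and grouping

Aggregation functions:
- Common aggregation functions include sum, mean, median, min, and max.
```
df['col1'].sum()
df[['col1', 'col2']].mean()
```

Grouping:

Use groupby to group rows.

grouped_df = df.groupby('col3')
# Apply aggregation functions to the grouped data
grouped_sum = grouped_df.sum()
grouped_mean = grouped_df.mean(numeric_only=True)  # skip non-numeric columns

Grouping by several keys:

grouped_multilevel = df.groupby(['col3', 'col4']).sum()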
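
When different columns need different aggregations, named aggregation with agg is one option. This is a minimal sketch on invented data; the output column names col1_total, col2_mean, and rows are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    'col3': ['x', 'x', 'y', 'y'],
    'col1': [1, 2, 3, 4],
    'col2': [10.0, 20.0, 30.0, 50.0],
})

summary = df.groupby('col3').agg(
    col1_total=('col1', 'sum'),   # output name = (source column, function)
    col2_mean=('col2', 'mean'),
    rows=('col1', 'size'),        # number of rows in each group
)
print(summary)
```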

6. Joining and merging data

The merge function:

Inner join:

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
inner_merged_df = pd.merge(df1, df2, on='key', how='inner')  # keeps keys B and C

Outer join:

outer_merged_df = pd.merge(df1, df2, on='key', how='outer')  # keeps A, B, C and D; gaps become NaN
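
To see where each row of a join came from, merge accepts indicator=True, which adds a _merge column. The sketch below reuses the two frames from the text and also shows a left join for comparison.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# _merge is 'left_only', 'right_only' or 'both' for each row
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# A left join keeps every row of df1 and fills unmatched value2 with NaN
left = pd.merge(df1, df2, on='key', how='left')

print(outer)
print(left)
```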

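The concat function:

Concatenating by rows:

df3 = pd.DataFrame({'col5': [7, 8, 9]})
# df1 and df3 share no columns, so the result has six rows and the unmatched cells are NaN
concatenated_rows_df = pd.concat([df1, df3], axis=0)

Concatenating by columns:

df4 = pd.DataFrame({'col6': [10, 11, 12]})
concatenated_cols_df = pd.concat([df1, df4], axis=1)  # rows are aligned on the index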
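
Row-wise concatenation is most useful when the frames share columns. The following sketch, with made-up frames named top, bottom, and other, contrasts matching and non-matching columns and shows ignore_index.

```python
import pandas as pd

top = pd.DataFrame({'key': ['A', 'B'], 'value1': [1, 2]})
bottom = pd.DataFrame({'key': ['C'], 'value1': [3]})
other = pd.DataFrame({'col5': [7, 8]})

stacked = pd.concat([top, bottom], ignore_index=True)    # same columns: rows are appended
mismatched = pd.concat([top, other], ignore_index=True)  # different columns: gaps become NaN
side_by_side = pd.concat([top, other], axis=1)           # columns placed next to each other

print(stacked)
print(mismatched)
print(side_by_side)
```

ignore_index=True renumbers the result from 0, which avoids duplicate index labels after stacking.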

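7. Time series

Creating and handling time series data:

Creating time series data:

dates = pd.date_range(start='2024-01-01', end='2024-01-10')
df_time_series = pd.DataFrame({'Date': dates, 'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

Time series index:

df_time_series = df_time_series.set_index('Date')

Resampling:

# For example, resample from daily to monthly frequency.
# In pandas 2.2 and later the month-end alias is 'ME'; 'M' is deprecated there.
monthly_resampled_df = df_time_series.resample('M').sum()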
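
Because the sample data above stays within one month, the monthly result has a single row. The sketch below uses a made-up range that crosses a month boundary so the grouping is visible. It uses the 'MS' (month start) alias, chosen here as an assumption because it behaves the same across pandas versions.

```python
import pandas as pd

dates = pd.date_range(start='2024-01-30', end='2024-02-02')  # four days across two months
ts = pd.DataFrame({'Value': [1, 2, 3, 4]}, index=dates)

# Each bin is labelled with the first day of its month
monthly = ts.resample('MS').sum()
print(monthly)
```

resample requires a datetime-like index, which is why the Date column is moved into the index first.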

Rolling-window functions:

Moving average:

df_time_series['MovingAverage'] = df_time_series['Value'].rolling(window=3).mean()

Moving sum:

df_time_series['MovingSum'] = df_time_series['Value'].rolling(window=3).sum()
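
With window=3 the first two results are NaN because a full window is not yet available. As a small illustration on made-up values, min_periods relaxes that requirement.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

strict = values.rolling(window=3).mean()                  # needs 3 observations per window
lenient = values.rolling(window=3, min_periods=1).mean()  # accepts partial windows at the start

print(strict.tolist())
print(lenient.tolist())
```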

8. Data visualization (with Matplotlib)

Simple plots:

Line chart:

import matplotlib.pyplot as plt

df.plot(x='col1', y='col2', kind='line')
plt.show()

Bar chart:

df.plot(x='col1', y='col2', kind='bar')
plt.show()

Advanced plots:

Multiple subplots:

fig, axes = plt.subplots(nrows=2, ncols=2)
df.plot(ax=axes[0, 0], kind='line')
df.plot(ax=axes[0, 1], kind='bar')
df.plot(ax=axes[1, 0], kind='hist')
df.plot(ax=axes[1, 1], kind='box')
plt.tight_layout()  # keep the subplot labels from overlapping
plt.show()

posted @ 2024-11-06 13:48 by 渔樵江渚
