pandas

The dtypes of the DataFrame class are introduced below:

object: a column of this dtype can hold values of any type, e.g. float64, str, int32, and so on. When read from Excel, empty cells are stored as np.nan (a float64).
float64: NumPy's floating-point type, not Python's built-in float; it outperforms plain float in NumPy computations. Missing values are stored as np.nan.
pd.StringDtype(): pandas' built-in string type, whose missing value is pd.NA. Once a column is explicitly declared with this type, every value in it is a str; unlike object, no value will be inferred as a non-str type.

When a column is declared dtype=object, the actual stored type of each row's value can vary. But when a column is declared float64 or int32, every value must be a number conforming to that type; for float columns np.nan is also allowed (empty cells read from Excel usually arrive as np.nan).
When a column is declared dtype=np.float64, every value must be either empty or a number (or a string convertible to a number); the whole column's type is numpy.float64.
When a column is declared dtype=str, every value is actually stored as a str, but the column's dtype remains object, i.e. non-str values may still be appended later.
When a column is declared dtype=pd.StringDtype(), every value is stored as a str and the column's dtype is string[python], i.e. any value appended later must also be a str.
dtype='string' is an equivalent spelling of dtype=pd.StringDtype().
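
A quick sketch verifying the alias (variable names here are illustrative):

import pandas as pd

s1 = pd.Series(['a', pd.NA], dtype=pd.StringDtype())
s2 = pd.Series(['a', pd.NA], dtype='string')
s1.dtype == s2.dtype  # True; both display as string[python]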

Neither pd.NA nor np.nan is equivalent to None, nor to False. A check such as if val is not pd.NA and val is not np.nan can be simplified to if pd.notna(val).
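
A small sketch of how pd.notna treats the usual missing-value sentinels:

import pandas as pd
import numpy as np

pd.notna(pd.NA)   # False
pd.notna(np.nan)  # False
pd.notna(None)    # False (None also counts as missing)
pd.notna(0)       # True
pd.notna('')      # True (the empty string is a real value)
np.nan == np.nan  # False, which is why a plain == test cannot detect nan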

When calling a function like df_object['cola'].str.contains('abc', na=False), make sure every value in the column is a str; otherwise the call fails once a value is inferred as non-str. So if you intend to use such functions on a column, explicitly declare it with dtype=str or pd.StringDtype().
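
A sketch of the pitfall, with illustrative data:

import pandas as pd
import numpy as np

df_num = pd.DataFrame({'cola': [33, 34.1, np.nan]})  # inferred as float64
# df_num['cola'].str.contains('3', na=False)  # raises: .str requires string values
df_txt = pd.DataFrame({'cola': [33, 34.1, np.nan]}, dtype=str)
df_txt['cola'].str.contains('3', na=False)  # works; the NaN row becomes False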

Examples:

import pandas as pd
import numpy as np

df_object = pd.DataFrame({"cola": [33,34.1,35,36]}, dtype=object)  # column dtype is object
type(df_object['cola'][0])  # output: int
df_object['cola'][0]  # output: 33
type(df_object['cola'][1])  # output: float (note: a plain float here, not numpy.float64)
df_object['cola'][1]  # output: 34.1

df_float = pd.DataFrame({"cola": [33,34.1,35,36]}, dtype=np.float64)  # column dtype is float64
type(df_float['cola'][1])  # output: numpy.float64
df_float['cola'][1]  # output: 34.1

df_str = pd.DataFrame({"cola": [33,34.1,35,36]}, dtype=str)  # column dtype is still object
type(df_str['cola'][1])  # output: str
df_str['cola'][1]  # output: '34.1'

df_pdstr = pd.DataFrame({"cola": [33,34.1,35,36]}, dtype=pd.StringDtype())  # column dtype is string[python]
type(df_pdstr['cola'][1])  # output: str
df_pdstr['cola'][1]  # output: '34.1'

Reading and writing Excel with pandas:
When reading multiple worksheets from one Excel file, it is common to create an ExcelFile object and pass it to subsequent pd.read_excel calls; since the workbook is already parsed into memory, this is faster than reopening the file for every sheet. The object can be closed explicitly or used as a context manager, though short scripts often leave cleanup to garbage collection.
xlsx = pd.ExcelFile(EXCEL_FILEPATH)
df = pd.read_excel(xlsx, sheet_name=SHEET_NAME)
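
A related sketch: passing sheet_name=None reads every worksheet at once and returns a dict keyed by sheet name:

all_sheets = pd.read_excel(xlsx, sheet_name=None)  # {sheet name: DataFrame}
for name, df in all_sheets.items():
    print(name, df.shape)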

When writing back to Excel, create an ExcelWriter object; it should be closed, which is usually handled with a context manager. The default mode='w' removes all existing worksheets in the file and writes new ones; mode='a' keeps the existing sheets and appends new ones. With mode='a', the if_sheet_exists parameter ('error', 'new', 'replace', 'overlay') controls what happens when a sheet of the same name already exists.
with pd.ExcelWriter(EXCEL_FILEPATH, engine='openpyxl', mode='a') as writer:
    df_bank_statement.to_excel(writer, sheet_name='sheet2', index=False)
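
A sketch of rewriting a sheet that already exists, reusing the placeholders above (if_sheet_exists requires pandas >= 1.3):

with pd.ExcelWriter(EXCEL_FILEPATH, engine='openpyxl', mode='a',
                    if_sheet_exists='replace') as writer:
    df_bank_statement.to_excel(writer, sheet_name='sheet2', index=False)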

SQL vs Pandas: a feature-by-feature reference

import pandas as pd
import numpy as np

# Prepare sample data
orders = pd.DataFrame({
    'order_id': range(1, 5),
    'customer_id': [1, 2, 1, 3],
    'amount': [100, 200, 150, 300],
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02']
})

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shanghai']
})

# 1. WHERE 子句
# SQL: SELECT * FROM orders WHERE amount > 150
filtered_df = orders[orders['amount'] > 150]

# Compound condition
# SQL: SELECT * FROM orders WHERE amount > 150 AND customer_id = 1
filtered_df = orders[(orders['amount'] > 150) & (orders['customer_id'] == 1)]

# 2. GROUP BY
# SQL: SELECT customer_id, SUM(amount) as total_amount 
#      FROM orders GROUP BY customer_id
grouped = orders.groupby('customer_id')['amount'].sum()

# Multiple aggregate functions
# SQL: SELECT customer_id, 
#             SUM(amount) as total_amount, 
#             AVG(amount) as avg_amount,
#             COUNT(*) as order_count
#      FROM orders 
#      GROUP BY customer_id
grouped_multi = orders.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'count']
})
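
# The dict form above yields a MultiIndex column header. An alternative
# sketch: named aggregation (pandas >= 0.25) gives flat columns that match
# the SQL aliases directly.
grouped_named = orders.groupby('customer_id').agg(
    total_amount=('amount', 'sum'),
    avg_amount=('amount', 'mean'),
    order_count=('amount', 'count')
)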

# 3. HAVING
# SQL: SELECT customer_id, SUM(amount) as total_amount 
#      FROM orders 
#      GROUP BY customer_id 
#      HAVING SUM(amount) > 200
having_result = (orders.groupby('customer_id')
                      .agg({'amount': 'sum'})
                      .reset_index()
                      .query('amount > 200'))
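
# Alternative sketch: groupby().filter() returns the original rows (not the
# aggregates) of every group whose total passes the HAVING condition.
having_rows = orders.groupby('customer_id').filter(
    lambda g: g['amount'].sum() > 200
)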

# 4. ORDER BY
# SQL: SELECT * FROM orders ORDER BY amount DESC, customer_id ASC
sorted_df = orders.sort_values(['amount', 'customer_id'], 
                             ascending=[False, True])

# 5. JOINS
# INNER JOIN
# SQL: SELECT o.*, c.name 
#      FROM orders o 
#      INNER JOIN customers c ON o.customer_id = c.customer_id
inner_joined = orders.merge(customers, 
                          on='customer_id', 
                          how='inner')

# LEFT OUTER JOIN
# SQL: SELECT o.*, c.name 
#      FROM orders o 
#      LEFT JOIN customers c ON o.customer_id = c.customer_id
left_joined = orders.merge(customers, 
                         on='customer_id', 
                         how='left')

# RIGHT OUTER JOIN
right_joined = orders.merge(customers, 
                          on='customer_id', 
                          how='right')

# FULL OUTER JOIN
full_joined = orders.merge(customers, 
                         on='customer_id', 
                         how='outer')

# CROSS JOIN
# SQL: SELECT * FROM orders CROSS JOIN customers
cross_joined = orders.merge(customers, 
                          how='cross')

# 6. WINDOW FUNCTIONS
# SQL: SELECT *,
#             SUM(amount) OVER(PARTITION BY customer_id) as customer_total,
#             SUM(amount) OVER(PARTITION BY customer_id ORDER BY date) as running_total
window_func = orders.assign(
    customer_total=orders.groupby('customer_id')['amount'].transform('sum'),
    running_total=orders.sort_values('date')
                       .groupby('customer_id')['amount']
                       .cumsum()
)

# 7. ROW_NUMBER()
# SQL: SELECT *,
#             ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY amount DESC) as row_num
orders['row_num'] = (orders.sort_values('amount', ascending=False)
                          .groupby('customer_id')
                          .cumcount() + 1)
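
# Alternative sketch: rank(method='first') numbers rows within each group
# without reordering the frame.
orders['row_num'] = (orders.groupby('customer_id')['amount']
                           .rank(method='first', ascending=False)
                           .astype(int))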

# 8. DISTINCT
# SQL: SELECT DISTINCT customer_id FROM orders
unique_customers = orders['customer_id'].unique()
# 或者
unique_df = orders.drop_duplicates(subset=['customer_id'])

# 9. CASE WHEN
# SQL: SELECT *,
#             CASE WHEN amount > 200 THEN 'High'
#                  WHEN amount > 100 THEN 'Medium'
#                  ELSE 'Low' 
#             END as amount_category
orders['amount_category'] = np.where(orders['amount'] > 200, 'High',
                                   np.where(orders['amount'] > 100, 'Medium', 'Low'))
# 或者使用 cut/qcut
orders['amount_category'] = pd.cut(orders['amount'],
                                 bins=[0, 100, 200, float('inf')],
                                 labels=['Low', 'Medium', 'High'])
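
# Alternative sketch: np.select reads more clearly than nested np.where when
# there are many branches; conditions are evaluated in order, first match wins.
conditions = [orders['amount'] > 200, orders['amount'] > 100]
choices = ['High', 'Medium']
orders['amount_category'] = np.select(conditions, choices, default='Low')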

# 10. UNION/UNION ALL
# SQL: SELECT * FROM orders1 UNION ALL SELECT * FROM orders2
# (assumes orders1 and orders2 are DataFrames with identical columns)
combined = pd.concat([orders1, orders2], ignore_index=True)
# UNION (deduplicated)
combined_unique = pd.concat([orders1, orders2], ignore_index=True).drop_duplicates()

# 11. Subquery
# SQL: SELECT * FROM orders 
#      WHERE customer_id IN (SELECT customer_id FROM customers WHERE city = 'Beijing')
beijing_customers = customers[customers['city'] == 'Beijing']['customer_id']
subquery_result = orders[orders['customer_id'].isin(beijing_customers)]

# 12. EXISTS
# Simulated in pandas with merge + indicator
# SQL: SELECT * FROM customers c
#      WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
exists_result = (customers.merge(orders[['customer_id']].drop_duplicates(),
                               on='customer_id',
                               how='left',
                               indicator=True)
                         .query('_merge == "both"')
                         .drop('_merge', axis=1))
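
# The same pattern covers NOT EXISTS: keep the 'left_only' rows instead.
# SQL: SELECT * FROM customers c
#      WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
not_exists_result = (customers.merge(orders[['customer_id']].drop_duplicates(),
                                     on='customer_id',
                                     how='left',
                                     indicator=True)
                              .query('_merge == "left_only"')
                              .drop('_merge', axis=1))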

Improving pandas performance
calamine: when reading Excel files, the calamine engine is usually faster and supports a wider range of Excel extensions. But when only a small amount of data is needed from a large Excel file, the default openpyxl engine may still be the better choice thanks to its lazy loading.
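A minimal sketch, assuming pandas >= 2.2 and the python-calamine package are installed (EXCEL_FILEPATH is the placeholder used above):

df = pd.read_excel(EXCEL_FILEPATH, engine='calamine')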
Numba (JIT compilation), Cython: these can improve pandas performance in certain scenarios. https://pandas.pydata.org/docs/user_guide/enhancingperf.html
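A minimal sketch of the Numba route, assuming the numba package is installed; the function and data here are illustrative:

import numba
import numpy as np
import pandas as pd

@numba.njit
def weighted_sum(values, weights):
    # compiled loop over plain NumPy arrays, avoiding Python-level overhead
    total = 0.0
    for i in range(len(values)):
        total += values[i] * weights[i]
    return total

df = pd.DataFrame({'v': np.random.rand(1_000_000),
                   'w': np.random.rand(1_000_000)})
result = weighted_sum(df['v'].to_numpy(), df['w'].to_numpy())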
Non-pandas libraries: several libraries outside pandas offer pandas-like APIs for working with DataFrames, typically via parallel or distributed execution, e.g. Dask, Modin, Cylon, and others. https://pandas.pydata.org/community/ecosystem.html#out-of-core

posted @ 2024-10-19 21:04  RolandHe