pandas.DataFrame的pivot()和unstack()实现行转列

示例：

有如下表需要进行行转列：

代码如下：

# -*- coding:utf-8 -*-
import pandas as pd
import MySQLdb
from warnings import filterwarnings
# 由于create table if not exists总会抛出warning，因此使用filterwarnings消除
filterwarnings('ignore', category = MySQLdb.Warning)
from sqlalchemy import create_engine
import sys
if sys.version_info.major<3:
  reload(sys)
  sys.setdefaultencoding("utf-8")
  # 此脚本适用于python2和python3
host,port,user,passwd,db,charset="192.168.1.193",3306,"leo","mysql","test","utf8"

def get_df():
  global host,port,user,passwd,db,charset
  conn_config={"host":host, "port":port, "user":user, "passwd":passwd, "db":db,"charset":charset}
  conn = MySQLdb.connect(**conn_config)
  result_df=pd.read_sql('select UserName,Subject,Score from TEST',conn)
  return result_df

def pivot(result_df):
  df_pivoted_init=result_df.pivot('UserName','Subject','Score')
  df_pivoted = df_pivoted_init.reset_index()  # 将行索引也作为DataFrame值的一部分，以方便存储数据库
  return df_pivoted_init,df_pivoted
  # 返回的两个DataFrame，一个是以姓名作index的，一个是以数字序列作index，前者用于unpivot，后者用于save_to_mysql

def unpivot(df_pivoted_init):
  # unpivot需要进行df_pivoted_init二维表格的行、列索引遍历，需要拼SQL因此不能使用save_to_mysql存数据，这里使用SQL和MySQLdb接口存
  insert_sql="insert into test_unpivot(UserName,Subject,Score) values "
  # 处理值为NaN的情况
  df_pivoted_init=df_pivoted_init.fillna(0)
  for col in df_pivoted_init.columns:
    for index in df_pivoted_init.index:
      value=df_pivoted_init.at[index,col]
      if value!=0:
        insert_sql=insert_sql+"('%s','%s',%s)" %(index,col,value)+','
  insert_sql = insert_sql.strip(',')
  global host, port, user, passwd, db, charset
  conn_config = {"host": host, "port": port, "user": user, "passwd": passwd, "db": db, "charset": charset}
  conn = MySQLdb.connect(**conn_config)
  cur=conn.cursor()
  cur.execute("create table if not exists test_unpivot like TEST")
  cur.execute(insert_sql)
  conn.commit()
  conn.close()

def save_to_mysql(df_pivoted,tablename):
  global host, port, user, passwd, db, charset
  """
  只有使用sqllite时才能指定con=connection实例，其他数据库需要使用sqlalchemy生成engine，engine的定义可以添加?来设置字符集和其他属性
  """
  conn="mysql://%s:%s@%s:%d/%s?charset=%s" %(user,passwd,host,port,db,charset)
  mysql_engine = create_engine(conn)
  df_pivoted.to_sql(name=tablename, con=mysql_engine, if_exists='replace', index=False)

# 从TEST表读取源数据至DataFrame结构
result_df=get_df()
# 将源数据行转列为二维表格形式
df_pivoted_init,df_pivoted=pivot(result_df)
# 将二维表格形式的数据存到新表test中
save_to_mysql(df_pivoted,'test')
# 将被行转列的数据unpivot，存入test_unpivot表中
unpivot(df_pivoted_init)

View Code

结果如下：

关于Pandas DataFrame类自带的pivot方法：

DataFrame.pivot(index=None, columns=None, values=None)：

Return reshaped DataFrame organized by given index / column values.

这里只有3个参数，pivot每次只能处理3个列，其中一个作为行转列后的index，另一个作为行转列之后的columns，最后一个作为行转列后的表格values。

pivot会忽略除了以上3个列之外的其他列数据，因此需要使用DataFrame切片、聚合等操作来达成特定的行列转换目标。

如下例：

补充说明：

在学习到Pandas的层次化索引部分时发现了2个很有意思的函数，也可以进行行列互转，其用法如下：

(很久之后我才意识到，pivot只是封装了unstack的一个快捷方式而已，其本质上还是先用set_index建立层次化索引，然后用unstack进行重塑，就像我在下面示例做的操作)

df=pd.DataFrame(np.random.randn(20).reshape(4,5),index=[['a','a','b','b'],[1,2,3,4]],columns=[10,20,30,40,50])
In [96]: df
Out[96]: 
           10        20        30        40        50
a 1  0.945775  0.768337  0.851630 -1.050475 -1.102554
  2 -0.366129  0.353388 -0.722637 -0.056877  1.178270
b 3  0.885536  0.210911  2.067309  1.283721 -0.432906
  4  0.173504  1.263630  1.264698  0.913879  1.156815
In [98]: df.stack()
Out[98]: 
a  1  10    0.945775
      20    0.768337
      30    0.851630
      40   -1.050475
      50   -1.102554
   2  10   -0.366129
      20    0.353388
      30   -0.722637
      40   -0.056877
      50    1.178270
b  3  10    0.885536
      20    0.210911
      30    2.067309
      40    1.283721
      50   -0.432906
   4  10    0.173504
      20    1.263630
      30    1.264698
      40    0.913879
      50    1.156815
In [99]: df.stack().unstack()
Out[99]: 
           10        20        30        40        50
a 1  0.945775  0.768337  0.851630 -1.050475 -1.102554
  2 -0.366129  0.353388 -0.722637 -0.056877  1.178270
b 3  0.885536  0.210911  2.067309  1.283721 -0.432906
  4  0.173504  1.263630  1.264698  0.913879  1.156815

以上利用了Pandas的层次化索引，实际上这也是层次化索引一个主要的用途，结合本例我们可以把代码改成如下：

result_df=pd.read_sql('select UserName,Subject,Score from TEST',conn)
# 在从数据库中获取的数据格式是这样的：
        UserName Subject Score
0        张三      语文   80.0
1        张三      数学   90.0
2        张三      英语   70.0
3        张三      生物   85.0
4        李四      语文   80.0
5        李四      数学   92.0
6        李四      英语   76.0
7        王五      语文   60.0
8        王五      数学   82.0
9        王五      英语   96.0
10       王五      生物   78.0
# 如果要使用层次化索引，那么我们只需要把UserName和Subject列设置为层次化索引，Score为其对应的值即可，我们借用set_index()函数：
df=result_df.set_index(['UserName','Subject'])
In [112]: df.unstack()
Out[112]: 
         Score                  
Subject     数学    生物    英语    语文
UserName                        
张三        90.0  85.0  70.0  80.0
李四        92.0   NaN  76.0  80.0
王五        82.0  78.0  96.0  60.0
# 使用stack可以将unstack的结果转回来，这样就也在形式上实现了行列互转，之后的操作基本一致了。

posted @ 2018-10-12 15:20 realcp1018 阅读(19169) 评论(0) 收藏举报

刷新页面返回顶部

realcp1018

学如逆水行舟，不进则退。

pandas.DataFrame的pivot()和unstack()实现行转列

公告