pandas学习笔记(七)

pandas Task Special

【任务一】企业收入的多样性

image-20210101232650488

  • 读取两个数据集

    • 两表的证券代码列和日期格式不一致 , 先将其变为一致

      import pandas as pd
      import numpy as np
      df1 = pd.read_csv('./data/test_1/Company.csv')
      df2 = pd.read_csv('./data/test_1/Company_data.csv')
      df1.head()
      df2.head()

      # 从日期列中分割出年份
      df2['年份'] = [x.split('/')[0] for x in df2['日期']]
      # 计算p(x_i)
      df2['p(x_i)'] = df2.groupby(['证券代码','年份'])['收入额'].transform(lambda x: (x/x.sum()))
      # 计算log(p(x_i))
      df2['log_res'] = df2['p(x_i)'].apply(lambda x: np.log(x))

      def my_calculate(x):
         multiple_res = x['p(x_i)'] * x['log_res']
         res = (-1) * multiple_res.sum()
         return res

      res = df2.groupby(['证券代码','年份']).apply(my_calculate)
      res = res.to_frame()
      res = res.reset_index()
      res.columns = ['证券代码','日期','收入熵指标']
       证券代码日期
      0 #000007 2014
      1 #000403 2015
      2 #000408 2016
      3 #000408 2017
      4 #000426 2015
       证券代码日期收入类型收入额
      0 1 2008/12/31 1 1.084218e+10
      1 1 2008/12/31 2 1.259789e+10
      2 1 2008/12/31 3 1.451312e+10
      3 1 2008/12/31 4 1.063843e+09
      4 1 2008/12/31 5 8.513880e+08
       证券代码日期收入额
      0 7 2014 4.429740
      1 403 2015 4.025963
      2 408 2016 4.066295
      3 408 2017 NaN
      4 426 2015 4.449655
    • df1证券代码的格式统一一下,把#和占位0去掉

    • df2日期列取前四位year转为int

      df1['证券代码'] = df1['证券代码'].str.replace('#','')
      df1['证券代码'] = df1['证券代码'].apply(lambda x: int(x))
      df1.merge(res, on=['证券代码','日期'], how='left')
      df1.head()
       证券代码日期收入熵指标
      0 7 2014 3.070462
      1 403 2015 2.790585
      2 408 2016 2.818541
      3 408 2017 NaN
      4 426 2015 3.084266

       

posted @ 2021-01-01 23:48  AiGgBoY  阅读(153)  评论(0编辑  收藏  举报