基于 Python 和 Pandas 的数据分析(7) --- Pickling

上一节我们介绍了几种合并数据的方法. 这一节, 我们将重新开始不动产的例子. 在第四节中我们写了如下代码:

import Quandl
import pandas as pd

fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')

for abbv in fiddy_states[0][1][2:]:
    #print(abbv)
    print("FMAC/HPI_"+str(abbv))

上面这段代码是为了获取 50 个州的简称, 进行遍历, 并生成恰当的 Quandl 查询语句, 用以查出每个州的房价. 如果单个处理, 将会有50个 dataframes. 大家可以查看一下第四节的教程, 发现房价的查询结果都会返回 "Date"索引, 所以我们可以使用 "join" 来做数据合并. (通常情况下, 可能没有现成的索引, 那么可以使用 "concatenate", 并加上参数 on = 'Date')

下面我们开始收集数据:

import Quandl
import pandas as pd

mytoken = 'HK1FqyWqrmUwLCYnWcHa'

fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')

main_df = pd.DataFrame()

for abbv in fiddy_states[0][1][2:]:
    query = "FMAC/HPI_"+str(abbv)
    df = Quandl.get(query, authtoken=mytoken)

    if main_df.empty:
        main_df = df
    else:
        main_df = main_df.join(df)

注意: 这里可能会有一点问题, Quandl 在返回数据的时候, 如果是只有一个列, 它会自动把列名改为 "value". 那么当我们循环的时候, 就会有问题, 变成每个列名都是 "value", 就会报类似这样的错:

ValueError: columns overlap but no suffix specified: Index([u'Value'], dtype='object')

所以这里我们要改一下, 循环之后, 把列名"value" 改成每个州的缩写, 也就是每个 "abbv" 的值. 

df.rename(columns={'NSA Value':str(abbv)+'NSA Value', 'SA Value':str(abbv)+'SA Value'}, inplace=True)

很好, 但是你可能发现, 每运行一次都至少要花费30秒钟, 这是让人很不爽的. 所以我们要想办法把这个数据保存成一个文件, 这样以后需要用的时候, 就基于这个文件就好了.

这里就引出 python 的另一个模块"Pickle", 它可以把对象转成二进制字节流, 也可以通过 unpickling 进行反向转换. 我们可以通过它保存任何 Python 对象, 不管是机器学习的 "classifier", 还是 "Dictionary", 又或者是 "dataframe" 都没有问题. 另外, 除了 Python 的 pickle, Pandas 也有 pickle. 我们会分别学习这两种. 

首先, 我们来看下 Python 的 pickle. 我们可以用它来处理任何 Python 对象, 不一定非得是 dataframe. 但是在我们这个例子中, 我们还是用它来处理 dataframe.

第一步, 引入 pickle:

import pickle

第二步:

pickle_out = open('fiddy_states.pickle','wb')
pickle.dump(main_df, pickle_out)
pickle_out.close()     

在上面的代码中, 首先是以可写的方式创建并打开一个 .pickle 的文件, 然后我们将数据 main_df 写入这个文件. 最后保存关闭这个文件.

尽管我们不想重复地运行获取每个州的房价的代码, 但是我们仍然有偶尔需要运行它的可能, 所以我们还是要把代码整理一下, 以提高它的复用性:

import Quandl
import pandas as pd
import pickle

mytoken = 'HK1FqyWqrmUwLCYnWcHa'

def state_list():
    fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
    return fiddy_states[0][1][2:]
    
def grab_initial_state_data():
    states = state_list()

    main_df = pd.DataFrame()

    for abbv in states:
        query = "FMAC/HPI_"+str(abbv)
        df = Quandl.get(query, authtoken=api_key)
        print(query)
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df)
            
    pickle_out = open('fiddy_states.pickle','wb')
    pickle.dump(main_df, pickle_out)
    pickle_out.close()        
    
grab_initial_state_data()

如上, 我们封装了两个函数, 这样我们就可以很方便地随时获取所有州的列表, 也可以获取所有房价的数据并把它们保存成 pickle 文件.

现在如果想要再次获取房价的信息, 就只要运行下面的代码就可以了:

pickle_in = open('fiddy_states.pickle','rb')
HPI_data = pickle.load(pickle_in)
print(HPI_data)

以可读的方式打开刚刚保存的 pickle 文件, 把文件的内容保存到 HPI_data 变量中, 并打印出来.

            ALNSA Value  ALSA Value  AKNSA Value  AKSA Value  AZNSA Value  AZSA Value  ARNSA Value     ...      WASA Value  WVNSA Value  WVSA Value  WINSA Value  WISA Value  WYNSA Value  WYSA Value
Date                                                                                                   ...
1975-01-31    35.701696   35.757921    34.456361   34.655493    28.872050   28.841062    37.136508     ...       17.459678    40.675291   40.657172    27.742952   27.817227    31.322153   31.293386
1975-02-28    35.896549   35.961777    34.954360   35.123805    29.371317   29.365897    37.428949     ...       17.566117    41.759012   41.696684    28.178063   28.237985    31.808090   31.791372
1975-03-31    36.124717   36.138987    35.465174   35.513620    29.839679   29.825888    37.672883     ...       17.654450    42.890424   42.660165    28.598691   28.558429    32.319995   32.293365
1975-04-30    36.401656   36.355040    35.998432   35.890060    30.198725   30.147602    37.866182     ...       17.752318    44.062355   43.680733    28.968088   28.811228    32.845934   32.753322
1975-05-31    36.644220   36.536191    36.581805   36.298572    30.374603   30.269359    38.063496     ...       17.814398    45.245558   44.758631    29.267055   28.971560    33.354786   33.150798
1975-06-30    36.731053   36.581766    37.208749   36.714638    30.305563   30.135694    38.284160     ...       17.906027    46.371800   45.811578    29.506152   29.074535    33.840685   33.517736
1975-07-31    36.611401   36.455401    37.839987   37.179961    29.978801   29.815028    38.471020     ...       18.040211    47.354109   46.769160    29.713240   29.212413    34.309437   33.919358
1975-08-31    36.353737   36.193565    38.422661   37.742392    29.484860   29.308448    38.551793     ...       18.122491    48.125707   47.588758    29.918971   29.457899    34.741733   34.394430
1975-09-30    36.097182   35.948436    38.932931   38.391789    28.985826   28.809285    38.490112     ...       18.142710    48.614964   48.202548    30.189301   29.836924    35.110433   34.820695
1975-10-31    35.991355   35.853360    39.390670   39.092214    28.726736   28.562418    38.372037     ...       18.193266    48.775900   48.551512    30.500255   30.253867    35.379704   35.174103
1975-11-30    36.111149   36.005626    39.794554   39.774496    28.880050   28.758223    38.359030     ...       18.385560    48.679916   48.592545    30.720418   30.565281    35.553098   35.429515
1975-12-31    36.459427   36.435918    40.129427   40.291694    29.296251   29.189322    38.487981     ...       18.653890    48.456979   48.424852    30.775030   30.765919    35.667694   35.603243
1976-01-31    36.997043   37.055307    40.397424   40.630890    29.705911   29.674028    38.704100     ...       18.955271    48.212566   48.191089    30.711465   30.793688    35.763022   35.730177
1976-02-29    37.599405   37.667727    40.618981   40.815886    29.991147   29.985613    39.017508     ...       19.178024    48.050434   47.978716    30.615016   30.680120    35.889923   35.871059
1976-03-31    38.103297   38.118349    40.819723   40.875483    30.136340   30.122412    39.454330     ...       19.327694    48.016346   47.758569    30.568294   30.525259    36.078335   36.048608
1976-04-30    38.413508   38.364316    41.012457   40.888991    30.171771   30.120693    39.955964     ...       19.466310    48.064191   47.647910    30.655795   30.489796    36.330964   36.228526
1976-05-31    38.557345   38.443675    41.199138   40.880155    30.152499   30.048025    40.385980     ...       19.617460    48.133561   47.615554    30.868413   30.556750    36.666209   36.441969
1976-06-30    38.620795   38.463828    41.380588   40.831079    30.129364   29.960483    40.704702     ...       19.869510    48.182909   47.600807    31.074662   30.620101    37.123928   36.769646
1976-07-31    38.691612   38.526748    41.548439   40.823728    30.139181   29.974532    40.999260     ...       20.245008    48.190939   47.595652    31.221162   30.694918    37.695979   37.267397
1976-08-31    38.768830   38.598017    41.697225   40.958980    30.183743   30.003149    41.332880     ...       20.590821    48.177652   47.640123    31.376070   30.892543    38.319513   37.936444
1976-09-30    38.830829   38.670819    41.883560   41.301406    30.240191   30.056011    41.631820     ...       20.818679    48.185949   47.777173    31.573523   31.204989    38.962741   38.641213
1976-10-31    38.923251   38.774015    42.162554   41.843095    30.373628   30.199889    41.822286     ...       20.997060    48.231964   48.010079    31.783570   31.526815    39.616352   39.386130
1976-11-30    39.102867   38.988602    42.525372   42.503936    30.587154   30.458126    41.966578     ...       21.235578    48.329286   48.242544    32.048837   31.886991    40.227093   40.087263
1976-12-31    39.292385   39.267048    42.945530   43.119184    30.764663   30.652374    42.156294     ...       21.476291    48.512275   48.480111    32.415178   32.405580    40.718681   40.645104
1977-01-31    39.388110   39.450139    43.380501   43.631207    30.877957   30.844815    42.448026     ...       21.691517    48.804702   48.782961    32.849718   32.937665    41.087572   41.049836
1977-02-28    39.417217   39.488843    43.780287   43.992518    31.010112   31.004389    42.833127     ...       21.850019    49.189489   49.116071    33.361979   33.432924    41.393100   41.371344
1977-03-31    39.444205   39.459786    44.105283   44.165532    31.263171   31.248722    43.219935     ...       22.111569    49.644356   49.377838    33.935754   33.887978    41.675873   41.641534
1977-04-30    39.581335   39.530647    44.364021   44.230465    31.581495   31.528031    43.514299     ...       22.569938    50.169375   49.734861    34.454736   34.268166    41.956252   41.837952
1977-05-31    39.876794   39.759235    44.601709   44.256382    31.861477   31.751082    43.731107     ...       23.051615    50.689889   50.144370    34.893232   34.540933    42.276660   42.018108
1977-06-30    40.293017   40.129254    44.839057   44.243621    32.196370   32.015902    43.973704     ...       23.491496    51.108407   50.490962    35.381716   34.864151    42.736895   42.329047
...                 ...         ...          ...         ...          ...         ...          ...     ...             ...          ...         ...          ...         ...          ...         ...
2016-07-31   132.599826  130.608248   176.787750  173.850747   168.526273  166.780562   138.132931     ...      187.322124   145.667676  143.439483   134.177075  130.504647   187.363168  184.746506
2016-08-31   133.022075  131.238156   176.034537  173.649132   169.526923  168.253567   138.164521     ...      189.210701   145.909376  143.768963   134.448757  131.207161   187.940275  185.288656
2016-09-30   132.848388  131.801316   175.196564  173.477463   170.675230  169.760097   137.838737     ...      191.147989   146.013298  144.254722   134.210854  131.783168   187.979164  185.806246
2016-10-31   132.131421  132.018031   174.602969  173.724838   171.990541  171.351792   137.478133     ...      192.798667   145.928107  144.650738   133.892183  132.380145   187.244371  185.776781
2016-11-30   131.351051  132.197436   174.315989  174.198443   172.884633  172.595990   137.410025     ...      194.543839   145.638546  144.906261   133.143545  132.875691   186.143898  185.545706
2016-12-31   130.925074  132.470198   174.203049  174.724112   173.422643  173.669422   137.741628     ...      196.686728   145.135293  145.001205   132.223438  133.512318   185.352769  185.518939
2017-01-31   131.113598  132.746046   174.320504  175.062814   174.134166  174.813060   138.201868     ...      198.674350   144.575795  144.904867   132.083906  134.139242   185.271251  185.955447
2017-02-28   132.122403  133.351309   174.893424  175.479807   175.076866  175.555160   138.752477     ...      200.559705   144.319529  144.868353   132.729630  134.736105   185.958796  186.739416
2017-03-31   133.712535  134.032444   175.890924  175.846975   176.660680  176.455310   139.560338     ...      202.905037   144.621493  144.790175   134.271819  135.171768   187.160834  187.516561
2017-04-30   135.408734  134.708751   177.145475  176.135739   178.784398  177.812440   140.529965     ...      205.335079   145.377521  144.743952   136.681300  135.986929   188.595601  188.151366
2017-05-31   136.708666  135.217818   178.362239  176.359848   180.510333  179.065741   141.558524     ...      207.588036   146.219754  144.722067   139.017516  136.697898   190.032933  188.701542
2017-06-30   137.263876  135.351875   179.231632  176.375707   181.803532  180.178989   142.694403     ...      209.844259   146.920037  144.868521   140.820982  137.269649   191.127304  188.953937
2017-07-31   137.076476  135.037968   179.671911  176.745654   183.181473  181.405424   143.654029     ...      211.724453   147.320848  145.019306   141.847835  137.973854   191.666629  189.027898
2017-08-31   136.831085  135.064366   179.766361  177.458431   184.253238  182.943114   144.008325     ...      213.259237   147.344639  145.185812   141.930517  138.566212   191.847514  189.139049
2017-09-30   136.972719  135.966552   179.574158  177.838111   184.761422  183.787057   143.957040     ...      215.583300   147.047528  145.306474   141.725747  139.177291   192.017190  189.783158
2017-10-31   137.084720  136.932829   178.989936  178.045002   185.403227  184.702703   143.829009     ...      217.946751   146.774686  145.559958   141.486375  139.895949   192.242059  190.711517
2017-11-30   136.886310  137.733946   178.141870  177.993174   186.438127  186.098927   143.456254     ...      220.491188   146.816202  146.150690   140.823935  140.603174   192.402151  191.747904
2017-12-31   136.838747  138.486416   177.462500  177.994371   187.319646  187.524212   142.999699     ...      223.079771   147.152612  147.021966   139.899256  141.302010   192.480743  192.612389
2018-01-31   137.386527  139.162989   177.274582  177.986031   188.063391  188.760985   143.061276     ...      225.560577   147.623851  147.968792   139.601181  141.765811   192.432381  193.138089
2018-02-28   138.560315  139.883118   177.628707  178.165078   189.699826  190.199068   143.989044     ...      228.368460   148.279081  148.832500   140.758958  142.892707   192.392620  193.187242
2018-03-31   140.207132  140.532600   178.439853  178.343795   192.052473  191.799890   145.385364     ...      230.752912   149.224547  149.376685   143.383724  144.307930   192.866379  193.227071
2018-04-30   142.030988  141.234311   179.534347  178.499005   193.974311  192.895602   146.600819     ...      232.064224   150.265376  149.587593   146.410871  145.624758   193.963855  193.517263
2018-05-31   143.759420  142.113835   180.635290  178.656108   195.652644  194.099898   147.709272     ...      232.389332   151.117827  149.543416   148.752405  146.237262   195.499823  194.155389
2018-06-30   145.071132  143.030359   181.458619  178.590479   197.299826  195.586430   148.676113     ...      232.570205   151.704632  149.558389   150.387997  146.566876   197.230004  195.019059
2018-07-31   145.649750  143.479151   181.864196  178.924144   198.463708  196.597856   149.214379     ...      233.058889   152.059646  149.659148   151.169180  147.047478   198.758395  196.045439
2018-08-31   145.615858  143.784381   181.981395  179.702489   199.367515  197.968454   149.460333     ...      233.178394   152.202201  149.984611   151.230192  147.676558   200.044993  197.217841
2018-09-30   145.371929  144.351813   181.828577  180.063556   200.313523  199.245448   149.691593     ...      233.534676   152.105658  150.324990   151.142824  148.414725   201.069979  198.718684
2018-10-31   145.028104  144.868072   181.219464  180.220086   201.062186  200.275900   149.909376     ...      233.585392   151.784579  150.566075   150.583958  148.886498   201.945318  200.320199
2018-11-30   144.636785  145.521425   180.401487  180.242216   201.586444  201.205283   150.080574     ...      233.240225   151.494684  150.842738   149.500263  149.319695   203.000387  202.298014
2018-12-31   144.364052  146.109780   179.764783  180.331613   201.983817  202.192416   150.316729     ...      233.093139   151.254986  151.102280   148.266162  149.774247   203.983473  204.111784

[528 rows x 100 columns]

在终端输出的时候, 会自动省略掉中间的很多数据. 但是最后可以看到一共输出了100列, 因为我们每个州是有两列数据, 所以是一共有50个州的数据都在这.

这里我们也可以用 Pandas 的 pickle 来操作, 这里是在已有 HPI_data 的数据的前提下, 也就是说把一个 dataframe 输出成 pickle 文件, 以及从 pickle 文件读出 dataframe 只需要各一行数据就可以了.

HPI_data.to_pickle('pickle.pickle')
HPI_data2 = pd.read_pickle('pickle.pickle')
print(HPI_data2)

这里大家可以自己打印一下试试, 会发现非常简单, 也非常快就可以得到跟上面输出一样的数据.

你可能会有这样的疑问: 既然 Python 已经有 pickle 模块了, 为什么 Pandas 还要有呢? 我也真的不知道为什么. 但是显然, Pandas 的 pickle 在出来大型数据的时候会更快一些.

现在, 我们已经把房价数据保存成 pickle 文件了, 我们可以准备进入下一节课的进一步研究了.

后面会持续更新, 有任何问题或者错误, 欢迎留言, 希望和大家交流学习.





posted @ 2019-02-15 22:49  rachelross  阅读(1063)  评论(0编辑  收藏  举报