基于 Python 和 Pandas 的数据分析(7) --- Pickling
上一节我们介绍了几种合并数据的方法. 这一节, 我们将重新开始不动产的例子. 在第四节中我们写了如下代码:
import Quandl
import pandas as pd
fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
for abbv in fiddy_states[0][1][2:]:
#print(abbv)
print("FMAC/HPI_"+str(abbv))
上面这段代码是为了获取 50 个州的简称, 进行遍历, 并生成恰当的 Quandl 查询语句, 用以查出每个州的房价. 如果单个处理, 将会有50个 dataframes. 大家可以查看一下第四节的教程, 发现房价的查询结果都会返回 "Date"索引, 所以我们可以使用 "join" 来做数据合并. (通常情况下, 可能没有现成的索引, 那么可以使用 "concatenate", 并加上参数 on = 'Date')
下面我们开始收集数据:
import Quandl
import pandas as pd
mytoken = 'HK1FqyWqrmUwLCYnWcHa'
fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
main_df = pd.DataFrame()
for abbv in fiddy_states[0][1][2:]:
query = "FMAC/HPI_"+str(abbv)
df = Quandl.get(query, authtoken=mytoken)
if main_df.empty:
main_df = df
else:
main_df = main_df.join(df)
注意: 这里可能会有一点问题, Quandl 在返回数据的时候, 如果是只有一个列, 它会自动把列名改为 "value". 那么当我们循环的时候, 就会有问题, 变成每个列名都是 "value", 就会报类似这样的错:
ValueError: columns overlap but no suffix specified: Index([u'Value'], dtype='object')
所以这里我们要改一下, 循环之后, 把列名"value" 改成每个州的缩写, 也就是每个 "abbv" 的值.
df.rename(columns={'NSA Value':str(abbv)+'NSA Value', 'SA Value':str(abbv)+'SA Value'}, inplace=True)
很好, 但是你可能发现, 每运行一次都至少要花费30秒钟, 这是让人很不爽的. 所以我们要想办法把这个数据保存成一个文件, 这样以后需要用的时候, 就基于这个文件就好了.
这里就引出 python 的另一个模块"Pickle", 它可以把对象转成二进制字节流, 也可以通过 unpickling 进行反向转换. 我们可以通过它保存任何 Python 对象, 不管是机器学习的 "classifier", 还是 "Dictionary", 又或者是 "dataframe" 都没有问题. 另外, 除了 Python 的 pickle, Pandas 也有 pickle. 我们会分别学习这两种.
首先, 我们来看下 Python 的 pickle. 我们可以用它来处理任何 Python 对象, 不一定非得是 dataframe. 但是在我们这个例子中, 我们还是用它来处理 dataframe.
第一步, 引入 pickle:
import pickle
第二步:
pickle_out = open('fiddy_states.pickle','wb')
pickle.dump(main_df, pickle_out)
pickle_out.close()
在上面的代码中, 首先是以可写的方式创建并打开一个 .pickle 的文件, 然后我们将数据 main_df 写入这个文件. 最后保存关闭这个文件.
尽管我们不想重复地运行获取每个州的房价的代码, 但是我们仍然有偶尔需要运行它的可能, 所以我们还是要把代码整理一下, 以提高它的复用性:
import Quandl
import pandas as pd
import pickle
mytoken = 'HK1FqyWqrmUwLCYnWcHa'
def state_list():
fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
return fiddy_states[0][1][2:]
def grab_initial_state_data():
states = state_list()
main_df = pd.DataFrame()
for abbv in states:
query = "FMAC/HPI_"+str(abbv)
df = Quandl.get(query, authtoken=api_key)
print(query)
if main_df.empty:
main_df = df
else:
main_df = main_df.join(df)
pickle_out = open('fiddy_states.pickle','wb')
pickle.dump(main_df, pickle_out)
pickle_out.close()
grab_initial_state_data()
如上, 我们封装了两个函数, 这样我们就可以很方便地随时获取所有州的列表, 也可以获取所有房价的数据并把它们保存成 pickle 文件.
现在如果想要再次获取房价的信息, 就只要运行下面的代码就可以了:
pickle_in = open('fiddy_states.pickle','rb')
HPI_data = pickle.load(pickle_in)
print(HPI_data)
以可读的方式打开刚刚保存的 pickle 文件, 把文件的内容保存到 HPI_data 变量中, 并打印出来.
ALNSA Value ALSA Value AKNSA Value AKSA Value AZNSA Value AZSA Value ARNSA Value ... WASA Value WVNSA Value WVSA Value WINSA Value WISA Value WYNSA Value WYSA Value
Date ...
1975-01-31 35.701696 35.757921 34.456361 34.655493 28.872050 28.841062 37.136508 ... 17.459678 40.675291 40.657172 27.742952 27.817227 31.322153 31.293386
1975-02-28 35.896549 35.961777 34.954360 35.123805 29.371317 29.365897 37.428949 ... 17.566117 41.759012 41.696684 28.178063 28.237985 31.808090 31.791372
1975-03-31 36.124717 36.138987 35.465174 35.513620 29.839679 29.825888 37.672883 ... 17.654450 42.890424 42.660165 28.598691 28.558429 32.319995 32.293365
1975-04-30 36.401656 36.355040 35.998432 35.890060 30.198725 30.147602 37.866182 ... 17.752318 44.062355 43.680733 28.968088 28.811228 32.845934 32.753322
1975-05-31 36.644220 36.536191 36.581805 36.298572 30.374603 30.269359 38.063496 ... 17.814398 45.245558 44.758631 29.267055 28.971560 33.354786 33.150798
1975-06-30 36.731053 36.581766 37.208749 36.714638 30.305563 30.135694 38.284160 ... 17.906027 46.371800 45.811578 29.506152 29.074535 33.840685 33.517736
1975-07-31 36.611401 36.455401 37.839987 37.179961 29.978801 29.815028 38.471020 ... 18.040211 47.354109 46.769160 29.713240 29.212413 34.309437 33.919358
1975-08-31 36.353737 36.193565 38.422661 37.742392 29.484860 29.308448 38.551793 ... 18.122491 48.125707 47.588758 29.918971 29.457899 34.741733 34.394430
1975-09-30 36.097182 35.948436 38.932931 38.391789 28.985826 28.809285 38.490112 ... 18.142710 48.614964 48.202548 30.189301 29.836924 35.110433 34.820695
1975-10-31 35.991355 35.853360 39.390670 39.092214 28.726736 28.562418 38.372037 ... 18.193266 48.775900 48.551512 30.500255 30.253867 35.379704 35.174103
1975-11-30 36.111149 36.005626 39.794554 39.774496 28.880050 28.758223 38.359030 ... 18.385560 48.679916 48.592545 30.720418 30.565281 35.553098 35.429515
1975-12-31 36.459427 36.435918 40.129427 40.291694 29.296251 29.189322 38.487981 ... 18.653890 48.456979 48.424852 30.775030 30.765919 35.667694 35.603243
1976-01-31 36.997043 37.055307 40.397424 40.630890 29.705911 29.674028 38.704100 ... 18.955271 48.212566 48.191089 30.711465 30.793688 35.763022 35.730177
1976-02-29 37.599405 37.667727 40.618981 40.815886 29.991147 29.985613 39.017508 ... 19.178024 48.050434 47.978716 30.615016 30.680120 35.889923 35.871059
1976-03-31 38.103297 38.118349 40.819723 40.875483 30.136340 30.122412 39.454330 ... 19.327694 48.016346 47.758569 30.568294 30.525259 36.078335 36.048608
1976-04-30 38.413508 38.364316 41.012457 40.888991 30.171771 30.120693 39.955964 ... 19.466310 48.064191 47.647910 30.655795 30.489796 36.330964 36.228526
1976-05-31 38.557345 38.443675 41.199138 40.880155 30.152499 30.048025 40.385980 ... 19.617460 48.133561 47.615554 30.868413 30.556750 36.666209 36.441969
1976-06-30 38.620795 38.463828 41.380588 40.831079 30.129364 29.960483 40.704702 ... 19.869510 48.182909 47.600807 31.074662 30.620101 37.123928 36.769646
1976-07-31 38.691612 38.526748 41.548439 40.823728 30.139181 29.974532 40.999260 ... 20.245008 48.190939 47.595652 31.221162 30.694918 37.695979 37.267397
1976-08-31 38.768830 38.598017 41.697225 40.958980 30.183743 30.003149 41.332880 ... 20.590821 48.177652 47.640123 31.376070 30.892543 38.319513 37.936444
1976-09-30 38.830829 38.670819 41.883560 41.301406 30.240191 30.056011 41.631820 ... 20.818679 48.185949 47.777173 31.573523 31.204989 38.962741 38.641213
1976-10-31 38.923251 38.774015 42.162554 41.843095 30.373628 30.199889 41.822286 ... 20.997060 48.231964 48.010079 31.783570 31.526815 39.616352 39.386130
1976-11-30 39.102867 38.988602 42.525372 42.503936 30.587154 30.458126 41.966578 ... 21.235578 48.329286 48.242544 32.048837 31.886991 40.227093 40.087263
1976-12-31 39.292385 39.267048 42.945530 43.119184 30.764663 30.652374 42.156294 ... 21.476291 48.512275 48.480111 32.415178 32.405580 40.718681 40.645104
1977-01-31 39.388110 39.450139 43.380501 43.631207 30.877957 30.844815 42.448026 ... 21.691517 48.804702 48.782961 32.849718 32.937665 41.087572 41.049836
1977-02-28 39.417217 39.488843 43.780287 43.992518 31.010112 31.004389 42.833127 ... 21.850019 49.189489 49.116071 33.361979 33.432924 41.393100 41.371344
1977-03-31 39.444205 39.459786 44.105283 44.165532 31.263171 31.248722 43.219935 ... 22.111569 49.644356 49.377838 33.935754 33.887978 41.675873 41.641534
1977-04-30 39.581335 39.530647 44.364021 44.230465 31.581495 31.528031 43.514299 ... 22.569938 50.169375 49.734861 34.454736 34.268166 41.956252 41.837952
1977-05-31 39.876794 39.759235 44.601709 44.256382 31.861477 31.751082 43.731107 ... 23.051615 50.689889 50.144370 34.893232 34.540933 42.276660 42.018108
1977-06-30 40.293017 40.129254 44.839057 44.243621 32.196370 32.015902 43.973704 ... 23.491496 51.108407 50.490962 35.381716 34.864151 42.736895 42.329047
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-07-31 132.599826 130.608248 176.787750 173.850747 168.526273 166.780562 138.132931 ... 187.322124 145.667676 143.439483 134.177075 130.504647 187.363168 184.746506
2016-08-31 133.022075 131.238156 176.034537 173.649132 169.526923 168.253567 138.164521 ... 189.210701 145.909376 143.768963 134.448757 131.207161 187.940275 185.288656
2016-09-30 132.848388 131.801316 175.196564 173.477463 170.675230 169.760097 137.838737 ... 191.147989 146.013298 144.254722 134.210854 131.783168 187.979164 185.806246
2016-10-31 132.131421 132.018031 174.602969 173.724838 171.990541 171.351792 137.478133 ... 192.798667 145.928107 144.650738 133.892183 132.380145 187.244371 185.776781
2016-11-30 131.351051 132.197436 174.315989 174.198443 172.884633 172.595990 137.410025 ... 194.543839 145.638546 144.906261 133.143545 132.875691 186.143898 185.545706
2016-12-31 130.925074 132.470198 174.203049 174.724112 173.422643 173.669422 137.741628 ... 196.686728 145.135293 145.001205 132.223438 133.512318 185.352769 185.518939
2017-01-31 131.113598 132.746046 174.320504 175.062814 174.134166 174.813060 138.201868 ... 198.674350 144.575795 144.904867 132.083906 134.139242 185.271251 185.955447
2017-02-28 132.122403 133.351309 174.893424 175.479807 175.076866 175.555160 138.752477 ... 200.559705 144.319529 144.868353 132.729630 134.736105 185.958796 186.739416
2017-03-31 133.712535 134.032444 175.890924 175.846975 176.660680 176.455310 139.560338 ... 202.905037 144.621493 144.790175 134.271819 135.171768 187.160834 187.516561
2017-04-30 135.408734 134.708751 177.145475 176.135739 178.784398 177.812440 140.529965 ... 205.335079 145.377521 144.743952 136.681300 135.986929 188.595601 188.151366
2017-05-31 136.708666 135.217818 178.362239 176.359848 180.510333 179.065741 141.558524 ... 207.588036 146.219754 144.722067 139.017516 136.697898 190.032933 188.701542
2017-06-30 137.263876 135.351875 179.231632 176.375707 181.803532 180.178989 142.694403 ... 209.844259 146.920037 144.868521 140.820982 137.269649 191.127304 188.953937
2017-07-31 137.076476 135.037968 179.671911 176.745654 183.181473 181.405424 143.654029 ... 211.724453 147.320848 145.019306 141.847835 137.973854 191.666629 189.027898
2017-08-31 136.831085 135.064366 179.766361 177.458431 184.253238 182.943114 144.008325 ... 213.259237 147.344639 145.185812 141.930517 138.566212 191.847514 189.139049
2017-09-30 136.972719 135.966552 179.574158 177.838111 184.761422 183.787057 143.957040 ... 215.583300 147.047528 145.306474 141.725747 139.177291 192.017190 189.783158
2017-10-31 137.084720 136.932829 178.989936 178.045002 185.403227 184.702703 143.829009 ... 217.946751 146.774686 145.559958 141.486375 139.895949 192.242059 190.711517
2017-11-30 136.886310 137.733946 178.141870 177.993174 186.438127 186.098927 143.456254 ... 220.491188 146.816202 146.150690 140.823935 140.603174 192.402151 191.747904
2017-12-31 136.838747 138.486416 177.462500 177.994371 187.319646 187.524212 142.999699 ... 223.079771 147.152612 147.021966 139.899256 141.302010 192.480743 192.612389
2018-01-31 137.386527 139.162989 177.274582 177.986031 188.063391 188.760985 143.061276 ... 225.560577 147.623851 147.968792 139.601181 141.765811 192.432381 193.138089
2018-02-28 138.560315 139.883118 177.628707 178.165078 189.699826 190.199068 143.989044 ... 228.368460 148.279081 148.832500 140.758958 142.892707 192.392620 193.187242
2018-03-31 140.207132 140.532600 178.439853 178.343795 192.052473 191.799890 145.385364 ... 230.752912 149.224547 149.376685 143.383724 144.307930 192.866379 193.227071
2018-04-30 142.030988 141.234311 179.534347 178.499005 193.974311 192.895602 146.600819 ... 232.064224 150.265376 149.587593 146.410871 145.624758 193.963855 193.517263
2018-05-31 143.759420 142.113835 180.635290 178.656108 195.652644 194.099898 147.709272 ... 232.389332 151.117827 149.543416 148.752405 146.237262 195.499823 194.155389
2018-06-30 145.071132 143.030359 181.458619 178.590479 197.299826 195.586430 148.676113 ... 232.570205 151.704632 149.558389 150.387997 146.566876 197.230004 195.019059
2018-07-31 145.649750 143.479151 181.864196 178.924144 198.463708 196.597856 149.214379 ... 233.058889 152.059646 149.659148 151.169180 147.047478 198.758395 196.045439
2018-08-31 145.615858 143.784381 181.981395 179.702489 199.367515 197.968454 149.460333 ... 233.178394 152.202201 149.984611 151.230192 147.676558 200.044993 197.217841
2018-09-30 145.371929 144.351813 181.828577 180.063556 200.313523 199.245448 149.691593 ... 233.534676 152.105658 150.324990 151.142824 148.414725 201.069979 198.718684
2018-10-31 145.028104 144.868072 181.219464 180.220086 201.062186 200.275900 149.909376 ... 233.585392 151.784579 150.566075 150.583958 148.886498 201.945318 200.320199
2018-11-30 144.636785 145.521425 180.401487 180.242216 201.586444 201.205283 150.080574 ... 233.240225 151.494684 150.842738 149.500263 149.319695 203.000387 202.298014
2018-12-31 144.364052 146.109780 179.764783 180.331613 201.983817 202.192416 150.316729 ... 233.093139 151.254986 151.102280 148.266162 149.774247 203.983473 204.111784
[528 rows x 100 columns]
在终端输出的时候, 会自动省略掉中间的很多数据. 但是最后可以看到一共输出了100列, 因为我们每个州是有两列数据, 所以是一共有50个州的数据都在这.
这里我们也可以用 Pandas 的 pickle 来操作, 这里是在已有 HPI_data 的数据的前提下, 也就是说把一个 dataframe 输出成 pickle 文件, 以及从 pickle 文件读出 dataframe 只需要各一行数据就可以了.
HPI_data.to_pickle('pickle.pickle')
HPI_data2 = pd.read_pickle('pickle.pickle')
print(HPI_data2)
这里大家可以自己打印一下试试, 会发现非常简单, 也非常快就可以得到跟上面输出一样的数据.
你可能会有这样的疑问: 既然 Python 已经有 pickle 模块了, 为什么 Pandas 还要有呢? 我也真的不知道为什么. 但是显然, Pandas 的 pickle 在出来大型数据的时候会更快一些.
现在, 我们已经把房价数据保存成 pickle 文件了, 我们可以准备进入下一节课的进一步研究了.
后面会持续更新, 有任何问题或者错误, 欢迎留言, 希望和大家交流学习.