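Basic Data Statistics in Python (1): Convenient Data Acquisition, Data Preparation and Tidying, Data Display
1. Convenient data acquisition
1.1 Local data: opening, reading, writing, and closing files (covered in a separate chapter)
1.2 Network data:
1.2.1 urllib, urllib2, httplib, httplib2 (in Python 3: urllib.request, http.client)
Regular expressions (covered in a separate chapter)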
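As a minimal sketch of the approach in 1.2.1 under Python 3, the snippet below reads text through urllib.request and extracts numbers from it with a regular expression. The helper names fetch_text and extract_prices and the sample data: URL are hypothetical, chosen so that the example runs without network access; an http(s) URL would be passed in the same way.

```python
import re
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Open a URL and return its body decoded as text."""
    with urllib.request.urlopen(url) as resp:
        # Use the charset declared in the response headers when there is one.
        charset = resp.headers.get_content_charset() or default_charset
        return resp.read().decode(charset)


def extract_prices(text):
    """Return every decimal number found in the text as a float."""
    return [float(m) for m in re.findall(r"\d+\.\d+", text)]


# Hypothetical sample: a data: URL stands in for a real web page.
sample_url = "data:text/plain;charset=utf-8,AXP%2061.83%2061.45%2054.01"
page = fetch_text(sample_url)
prices = extract_prices(page)
print(page)
print(prices)
```

Keeping the download and the parsing in separate functions makes it easy to test the regular expression on saved text before pointing the code at a live page.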
1.2.2 Fetching Yahoo Finance data with the matplotlib.finance module
In [7]: from matplotlib.finance import quotes_historical_yahoo_ochl
In [8]: from datetime import date
In [9]: from datetime import datetime
In [10]: import pandas as pd
In [11]: today = date.today()
In [12]: start = (today.year-1, today.month, today.day)
In [14]: quotes = quotes_historical_yahoo_ochl('AXP', start, today)   # fetch the data
In [15]: fields = ['date', 'open', 'close', 'high', 'low', 'volume']
In [16]: list1 = []
In [18]: for i in range(0,len(quotes)):
    ...:     x = date.fromordinal(int(quotes[i][0]))   # first column of each row, converted to a date with date.fromordinal
    ...:     y = datetime.strftime(x,'%Y-%m-%d')       # format the date as YYYY-MM-DD
    ...:     list1.append(y)                           # collect the formatted dates in a list
    ...:
In [19]: quotesdf = pd.DataFrame(quotes,index=list1,columns=fields)   # dates become the index, fields become the column names
In [20]: quotesdf = quotesdf.drop(['date'],axis=1)   # drop the date column
In [21]: print quotesdf
                 open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
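The conversion in the session above does not depend on where the rows come from. The sketch below assumes the quotes are already available as (ordinal date, open, close, high, low, volume) rows, which may help if matplotlib.finance is not available in your environment. The three sample rows are copied from the output shown in section 2.1, and the variable name rows is an assumption.

```python
from datetime import date

import pandas as pd

# Sample rows in the same layout as the quotes returned in the session above.
rows = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

quotesdf = pd.DataFrame(rows, columns=fields)

# Convert each ordinal to a 'YYYY-MM-DD' string and use it as the index.
quotesdf.index = [
    date.fromordinal(int(d)).strftime('%Y-%m-%d') for d in quotesdf['date']
]

# The ordinal column is no longer needed once the index carries the date.
quotesdf = quotesdf.drop(['date'], axis=1)
print(quotesdf)
```

The list comprehension replaces the explicit loop and the intermediate list1, but the result has the same shape as the frame printed in In [21].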
1.2.3 Fetching corpora and other data with the Natural Language Toolkit (NLTK)
1. Install nltk: pip install nltk
2. Download a corpus:
In [1]: import nltk
In [2]: nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
  Identifier> gutenberg
    Downloading package gutenberg to /root/nltk_data...
      Package gutenberg is already up-to-date!
3. Load the data:
In [3]: from nltk.corpus import gutenberg
In [4]: print gutenberg.fileids()
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']
In [5]: texts = gutenberg.words('shakespeare-hamlet.txt')
In [6]: texts
Out[6]: [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]
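Once a corpus is loaded as a sequence of tokens, a common first statistic is a word-frequency count. The sketch below uses only the standard library; the short tokens list is a hypothetical stand-in for gutenberg.words('shakespeare-hamlet.txt'), so it runs without downloading the corpus.

```python
from collections import Counter

# Hypothetical stand-in for the token sequence returned by gutenberg.words(...).
tokens = ['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William',
          'Shakespeare', ']', 'the', 'of']

# Keep alphabetic tokens only and fold case so 'The' and 'the' count together.
words = [t.lower() for t in tokens if t.isalpha()]
freq = Counter(words)

print(len(tokens), 'tokens,', len(freq), 'distinct words')
print(freq.most_common(3))
```

With the real corpus, the same two lines (filter, then Counter) apply unchanged to texts.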
2. Data preparation and tidying
2.1 Adding [column] names to the quotes data
In [79]: quotesdf = pd.DataFrame(quotes)
In [80]: quotesdf
Out[80]:
          0          1          2          3          4           5
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3  735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
[253 rows x 6 columns]
In [81]: fields = ['date','open','close','high','low','volume']
In [82]: quotesdf = pd.DataFrame(quotes,columns=fields)   # set the column names
In [83]: quotesdf
Out[83]:
       date       open      close       high        low      volume
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3  735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
2.2 Adding an [index] to the quotes data
In [84]: quotesdf
Out[84]:
       date       open      close       high        low      volume
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
[253 rows x 6 columns]
In [85]: quotesdf = pd.DataFrame(quotes, index=range(1,len(quotes)+1),columns=fields)   # change the index from 0,1,2... to 1,2,3...
In [86]: quotesdf
Out[86]:
       date       open      close       high        low      volume
1  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
3  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
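Sections 2.1 and 2.2 rebuild the DataFrame each time a label changes. As an alternative sketch, the column names and the index can also be assigned on an existing frame. The sample rows are taken from the output above; the name raw is an assumption used only for this example.

```python
import pandas as pd

# Three rows from the output above, without any labels yet.
raw = pd.DataFrame([
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
])

# Assign column names in place instead of passing columns= to the constructor.
raw.columns = ['date', 'open', 'close', 'high', 'low', 'volume']

# Replace the default 0-based index with a 1-based one.
raw.index = range(1, len(raw) + 1)

print(raw)
```

Assigning labels this way is convenient when the frame came from somewhere else (a file or another function) and only its labels need to change.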
2.3 Date conversion: Gregorian ordinal => ordinary date representation
In [88]: from datetime import date, datetime
In [89]: firstday = date.fromordinal(735190)
In [93]: firstday
Out[93]: datetime.date(2013, 11, 18)
In [95]: firstday = datetime.strftime(firstday,'%Y-%m-%d')
In [96]: firstday
Out[96]: '2013-11-18'
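The conversion works in both directions. The sketch below, using only the standard library, turns an ordinal into a date, formats it, and converts it back; the variable names are chosen for illustration.

```python
from datetime import date

ordinal = 735190

# Ordinal -> date: day 1 is 0001-01-01 in the proleptic Gregorian calendar.
day = date.fromordinal(ordinal)

# date -> formatted string; isoformat() gives the same 'YYYY-MM-DD' layout.
text = day.strftime('%Y-%m-%d')
iso = day.isoformat()

# date -> ordinal, which recovers the original number.
back = day.toordinal()

print(day, text, iso, back)
```

Calling strftime on the date object itself avoids importing datetime just for formatting.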
2.4 Creating a time series:
In [120]: import pandas as pd
In [121]: dates = pd.date_range('20170101', periods=7)   # build a date sequence from a start date and a length
In [122]: dates
Out[122]:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07'],
              dtype='datetime64[ns]', freq='D')
In [123]: import numpy as np
In [124]: dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns=list('ABC'))   # the date sequence is the index, A/B/C are the column names, and the body is 7 rows by 3 columns of random numbers
In [125]: dates
Out[125]:
                   A         B         C
2017-01-01  0.705927  0.311453  1.455362
2017-01-02 -0.331531 -0.358449  0.175375
2017-01-03 -0.284583 -1.760700 -0.582880
2017-01-04 -0.759392 -2.080658 -2.015328
2017-01-05 -0.517370  0.906072 -0.106568
2017-01-06 -0.252802 -2.135604 -0.692153
2017-01-07 -0.275184  0.142973 -1.262126
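date_range also takes a freq argument. As a sketch, assuming business days are what you want for stock data, freq='B' skips weekends; a seeded random generator makes the frame reproducible. The names bdays, frame, and first_week are chosen for this example.

```python
import numpy as np
import pandas as pd

# Seven business days starting from 2017-01-01 (weekends are skipped).
bdays = pd.date_range('2017-01-01', periods=7, freq='B')

# Seeded generator so the random values are the same on every run.
rng = np.random.RandomState(0)
frame = pd.DataFrame(rng.randn(7, 3), index=bdays, columns=list('ABC'))

# A DatetimeIndex can be sliced with date strings; both ends are included.
first_week = frame.loc['2017-01-02':'2017-01-06']

print(frame)
print(first_week)
```

Using separate names for the index and the frame also avoids rebinding dates from a DatetimeIndex to a DataFrame, as happens in In [124].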
2.5 Exercises
In [101]: datetime.now()   # show the current date and time
Out[101]: datetime.datetime(2017, 1, 20, 16, 11, 50, 43258)
=========================================
In [108]: datetime.now().month   # show the current month
Out[108]: 1
=========================================
In [126]: import pandas as pd
In [127]: dates = pd.date_range('2015-02-01',periods=10)
In [128]: dates
Out[128]:
DatetimeIndex(['2015-02-01', '2015-02-02', '2015-02-03', '2015-02-04',
               '2015-02-05', '2015-02-06', '2015-02-07', '2015-02-08',
               '2015-02-09', '2015-02-10'],
              dtype='datetime64[ns]', freq='D')
In [133]: res = pd.DataFrame(range(1,11),index=dates,columns=['value'])
In [134]: res
Out[134]:
            value
2015-02-01      1
2015-02-02      2
2015-02-03      3
2015-02-04      4
2015-02-05      5
2015-02-06      6
2015-02-07      7
2015-02-08      8
2015-02-09      9
2015-02-10     10
3. Data display
3.1 Display options:
In [180]: quotesdf2.index   # show the index
Out[180]:
Index([u'2016-01-20', u'2016-01-21', u'2016-01-22', u'2016-01-25',
       ...
       u'2017-01-11', u'2017-01-12', u'2017-01-13', u'2017-01-17',
       u'2017-01-18', u'2017-01-19'],
      dtype='object', length=253)
In [181]: quotesdf2.columns   # show the column names
Out[181]: Index([u'open', u'close', u'high', u'low', u'volume'], dtype='object')
In [182]: quotesdf2.values   # show the underlying values
Out[182]:
array([[  6.03741455e+01,   6.18359160e+01,   6.23362562e+01,
          6.01288817e+01,   9.04380000e+06],
       ...,
       [  7.76100010e+01,   7.66900020e+01,   7.77799990e+01,
          7.66100010e+01,   7.79110000e+06]])
In [183]: quotesdf2.describe   # describe is referenced here, not called
Out[183]:
<bound method DataFrame.describe of                  open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
Note: In [183] omits the parentheses, so IPython prints the bound method (whose repr embeds the whole frame) instead of a summary. Calling quotesdf2.describe() returns the count, mean, standard deviation, minimum, quartiles, and maximum of each column.
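The difference between referencing describe and calling it can be seen on a small frame. This is a sketch with three rows of open and close values taken from the output above; the names df, method, and summary are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'open': [60.374146, 61.806486, 57.283819],
     'close': [61.835916, 61.453305, 54.016907]},
    index=['2016-01-20', '2016-01-21', '2016-01-22'],
)

# Without parentheses this is only a reference to the method itself.
method = df.describe

# With parentheses the method runs and returns a new DataFrame of statistics.
summary = df.describe()

print(callable(method))
print(summary)
```

The returned summary is an ordinary DataFrame, so individual statistics can be read with, for example, summary.loc['mean', 'close'].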
3.2 Index format: the u prefix in the output above marks a unicode string (this is how Python 2 displays unicode literals)
3.3 Displaying rows:
In [193]: quotesdf.head(2)   # dedicated method: first two rows
Out[193]:
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0
In [194]: quotesdf.tail(2)   # dedicated method: last two rows
Out[194]:
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
In [195]: quotesdf[:2]   # slicing: first two rows
Out[195]:
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0
In [197]: quotesdf[251:]   # slicing: last two rows (the slice is positional, so position 251 is the row labeled 252)
Out[197]:
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
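Because the index in this section starts at 1, row positions and row labels differ by one, which is easy to trip over when slicing. The sketch below, on a small hypothetical frame, makes the choice explicit with iloc (by position) and loc (by label).

```python
import pandas as pd

# Hypothetical five-row frame with a 1-based index, as in section 2.2.
df = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0, 14.0]},
                  index=range(1, 6))

first_two = df.head(2)        # first two rows
last_two = df.tail(2)         # last two rows

by_position = df.iloc[-2:]    # positional: last two rows, whatever the labels
by_label = df.loc[1:2]        # label-based: labels 1 through 2, both included

print(first_two)
print(by_position)
```

iloc[-2:] has the advantage over a hard-coded start such as 251 that it keeps working when the number of rows changes.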
4. Data selection
5. Simple statistics and processing
6. Grouping
7. Merge
posted on 2017-01-20 17:38 by 你的踏板车要滑向哪里