Targeted Scraping of Fixed Data - shouchengcheng

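It rained over the weekend, my plans fell through, and I ended up staying home.

While browsing the web I figured I might as well write a Python crawler. As a programmer who only knows C, I found Python's basic syntax fairly clear and broadly similar.

So I decided to scrape the current prices of funds. I originally meant to leave the script running once it was done, but after finishing it I found that Mathematica has a built-in function for querying historical fund prices, so there was no need.

My approach is simple brute force. I had never written a crawler before and only had a rough idea of how one works, so the code is rather unorthodox.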

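The first step is to get every fund code. This page lists them all: http://fund.eastmoney.com/allfund.html

I use Python 2.7. I fetch the HTML directly with the urllib2 library, then parse out the data and save it to a file.

I was learning Python as I went, so the code may be rough.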

import urllib2

# fetch the target page

url = 'http://fund.eastmoney.com/allfund.html'
f = urllib2.urlopen(url)
html = f.read()

# html now holds the page content, but it is encoded as gb2312, so convert it to utf-8
after_translate = html.decode('gb2312','ignore').encode('utf-8')

print(after_translate)

# use the built-in open(); the urlopen response object f has no open() method
save = open("fundAll-decode.txt",'w')
save.write(after_translate)
save.close()

The code above fetches the raw HTML of the page and saves it first, so it can be analyzed and processed at leisure.
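The decode-then-encode step in the script above can be isolated into a small helper. This is only a minimal sketch, written for Python 3 rather than the Python 2.7 used in this post; the name `to_utf8` and the sample bytes are my own assumptions for illustration and are not part of the original script.

```python
def to_utf8(raw, source_encoding='gb2312'):
    """Re-encode raw page bytes as UTF-8, dropping bytes that fail to decode."""
    # 'ignore' skips undecodable bytes instead of raising UnicodeDecodeError,
    # mirroring html.decode('gb2312', 'ignore') in the script.
    return raw.decode(source_encoding, 'ignore').encode('utf-8')


# Hypothetical sample: the two-character word for "fund", encoded as gb2312.
sample = u'\u57fa\u91d1'.encode('gb2312')
converted = to_utf8(sample)
print(repr(converted))
```

Plain ASCII bytes pass through unchanged, so only the Chinese text on the page is affected by the conversion.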

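Everything fetched is static content, with no GET or POST interaction involved, so parsing the saved data directly yields the fund codes.

For the parsing, I simply use regular expressions: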

import re

def unique(old_list):
    newList = []
    for x in old_list:
        if x not in newList:
            newList.append(x)
    return newList

# read the page saved in the previous step
f = open("fundAll-decode.txt",'r')
html = f.read()
f.close()

# fund codes are 6 digits long, so match every run of 6 digits
p = re.compile(r'\d{6}')

# findall returns the codes as strings; drop the duplicates
r = unique(p.findall(html))

saveFile = open("fundCode.txt",'w')

for i in r:
    saveFile.write(i+'\r\n')

saveFile.close()

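Fund codes are 6 digits long, so I simply match every run of 6 digits. Admittedly I was lazy here, since any other 6-digit number on the page matches as well. Some codes also come back more than once, so I wrote a unique function to filter out the duplicates.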
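A slightly stricter variant of the extraction above could look like the sketch below. It is not what the script in this post does: the lookarounds that reject longer digit runs, the set-based de-duplication, the name `extract_codes`, and the sample fragment are all assumptions of mine, and the code is written for Python 3.

```python
import re

# Six digits that are not part of a longer digit run. The lookarounds are an
# assumption added here; the script in this post uses plain r'\d{6}'.
CODE_PATTERN = re.compile(r'(?<!\d)\d{6}(?!\d)')


def extract_codes(html):
    """Return the unique 6-digit codes found in html, in first-seen order."""
    seen = set()
    codes = []
    for code in CODE_PATTERN.findall(html):
        if code not in seen:  # set lookup instead of scanning a list
            seen.add(code)
            codes.append(code)
    return codes


# Hypothetical fragment standing in for the saved page.
sample = ('<a href="000001.html">(000001)</a> '
          '<a href="000003.html">(000003)</a> id=1234567')
print(extract_codes(sample))
```

Keeping a set next to the result list preserves the original order while avoiding the repeated list scan in unique, which matters once the page holds thousands of codes.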

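For the funds themselves, all that is needed is one function that takes a fund code and returns the current price. On non-trading days the price stays fixed.

The page URL has the form fixed prefix + fund code + .html, which is why all the fund codes had to be collected first.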
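The URL scheme just described can be sketched as a tiny helper. The function name `build_fund_url` and the up-front validation are assumptions I added for illustration; the script in this post simply concatenates the three strings.

```python
URL_BEGIN = 'http://fund.eastmoney.com/'
URL_END = '.html'


def build_fund_url(code):
    """Build the detail-page URL for a 6-digit fund code."""
    # Rejecting malformed codes before any request is an assumption added here.
    if len(code) != 6 or not code.isdigit():
        raise ValueError('expected a 6-digit fund code, got %r' % (code,))
    return URL_BEGIN + code + URL_END


print(build_fund_url('000001'))
```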

import urllib2
import time
import re

url_begin = 'http://fund.eastmoney.com/'
url_end = '.html'

def getPrice(code):
    url = url_begin + code + url_end
    html = urllib2.urlopen(url).read()
    # to avoid crashing on an exception, don't really handle it; just print a log line
    try:
        match1 = re.compile(r'(?<=fundpz\"><span\ class\=\")\D+\d\.\d{4}')
        ret_match1 = match1.findall(html)

        match2 = re.compile(r'\d\.\d{4}')
        ret_match2 = match2.findall(ret_match1[0])
    except Exception,e:
        print "["+ code +"]",Exception,":",e
        return ' '

    return ret_match2[0]

# re.split can leave empty strings in the result; keep only the 6-character codes
def rightCode(a):
    newList = []
    for i in a:
        if len(i) == 6:
            newList.append(i)
    return newList

# read the fund codes from fundCode.txt
def getCodeDict():
    f = open("fundCode.txt",'r')
    match1 = re.compile(r'\r\n')
    x = rightCode(match1.split(f.read()))
    f.close()
    return x

if __name__ == "__main__":
    x = getCodeDict()
    while(1):
        for i in x:
            print getPrice(i)

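The code above parses the HTML twice to get the price. I originally wanted to match the number directly, but that matched several prices, and when a fund is suspended there is no price at all, so that approach simply could not work. In the end I forced a regex match on the value of fundpz.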
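The two-pass match can be tried offline on a small fragment. The sketch below is written for Python 3, and the sample markup is hypothetical: I made it up to fit the regex, and the real page may look different. The name `parse_price` and the choice to return None on a miss are also my own assumptions.

```python
import re

# Pass 1: take the text that follows the fundpz marker, up to the first price.
FUNDPZ_PATTERN = re.compile(r'(?<=fundpz"><span class=")\D+\d\.\d{4}')
# Pass 2: pull the bare number out of that text.
PRICE_PATTERN = re.compile(r'\d\.\d{4}')


def parse_price(html):
    """Return the price string after the fundpz marker, or None if absent."""
    block = FUNDPZ_PATTERN.search(html)
    if block is None:
        return None  # e.g. a suspended fund whose page shows no price
    return PRICE_PATTERN.search(block.group(0)).group(0)


# Hypothetical markup shaped to fit the pattern; not copied from the site.
sample = '<div class="fundpz"><span class="red bold">1.2345</span></div>'
print(parse_price(sample))
print(parse_price('<p>0.9876</p>'))
```

Anchoring the first pass on fundpz is what keeps the other prices on the page from matching, which is the problem the direct numeric match ran into.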

At this point, the script can basically print the price found under the fundpz tag on the page.
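The main loop in the script requests every page back to back and never pauses. One possible variation, sketched here under my own assumptions, takes the fetch function as a parameter and sleeps between rounds. The names `poll_prices`, `fetch_price`, `rounds`, and `delay_seconds` are hypothetical, the prices are made up, and the code is written for Python 3.

```python
import time


def poll_prices(codes, fetch_price, rounds=1, delay_seconds=0.0):
    """Call fetch_price for every code, rounds times, pausing between rounds."""
    results = []
    for round_index in range(rounds):
        for code in codes:
            results.append((code, fetch_price(code)))
        if round_index < rounds - 1:
            time.sleep(delay_seconds)  # no pause needed after the last round
    return results


# Stand-in fetcher with made-up prices, so the sketch runs without network access.
fake_prices = {'000001': '1.2345', '000003': '0.9876'}
print(poll_prices(['000001', '000003'], fake_prices.get, rounds=2))
```

Passing the fetcher in keeps the loop testable offline; in the script from this post, getPrice would be the natural function to pass.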

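To sum up, this code has no generality to speak of. Its only merits are that typing it out got me more familiar with the keyboard and left me with some impression of Python's syntax, and I picked up a little regex along the way. Still, it is better than nothing: having given it a try, good or bad, I am writing it down. Next time I flip through it I will probably sigh, "Damn, I actually wrote this code, and it looks like crap." :)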

posted on 2015-11-01 17:24 shouchengcheng