(Original) Python Crawler Basics (2) --- Sorting the Crawled University of Science and Technology Liaoning Hot News

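I noticed that the page source of the USTL site also contains each article's click count, so why not rank the articles by click count in descending order? It is simple: part (1) of this series already does most of what we need. In this post we only have to add two things: a regular expression that extracts each article's click count, and an insertion sort keyed on that number. OK, done!
A quick note on the first loop of the insertion sort in this post: it finds the largest number in the list and moves it to index 0, where it serves as a sentinel.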
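
A minimal sketch of that sentinel idea, written for Python 3 rather than the Python 2.7 used in the article. The function name `sort_descending_with_sentinel` and the sample data are made up for illustration; only the two-pass structure follows the article.

```python
def sort_descending_with_sentinel(items):
    """Sort (label, clicks) pairs in place by clicks, highest first."""
    # Pass 1: a single right-to-left bubbling pass moves the largest
    # element to index 0.
    for i in range(len(items) - 1, 0, -1):
        if items[i][1] > items[i - 1][1]:
            items[i], items[i - 1] = items[i - 1], items[i]
    # Pass 2: insertion sort. Because items[0] is the maximum, the
    # comparison below is always false once j == 1, so j never runs
    # off the front and no explicit "j > 0" check is needed.
    for i in range(2, len(items)):
        v = items[i]
        j = i
        while v[1] > items[j - 1][1]:
            items[j] = items[j - 1]
            j -= 1
        items[j] = v
    return items


news = [("a", 12), ("b", 40), ("c", 7), ("d", 40), ("e", 25)]
sort_descending_with_sentinel(news)
print(news)
```

Without the sentinel, the inner `while` would need a bounds check on every iteration; the first pass pays for that once per sort instead.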

The code:

# -*- coding: utf-8 -*-
# Program: crawl the ten most-clicked USTL hot news articles
# Version: 0.1
# Date: 2014.06.30
# Language: python 2.7
#---------------------------------

import urllib2,re,sys
# Works around this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 32-34: ordinal not in range(128)
reload(sys)
sys.setdefaultencoding('utf-8')

class USTL_Spider:
    def __init__(self,url,num=10):
        self.myUrl=url
        # holds the collected titles and URLs
        self.datas=[]
        self.num=num
        print 'The Spider is Starting!'

    def ustl_start(self):
        myPage=urllib2.urlopen(self.myUrl+'.html').read().decode('gb2312')
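        # urlopen raises on failure rather than returning None, so check for an empty page instead
        if not myPage:
            print 'No such is needed!'
            return
        # first get the total number of pages
        endPage=self.find_endPage(myPage)
        if endPage==0:
            return
        # process the data on the first page
        self.deal_data(myPage)
        # process the data on all pages after the first
        self.save_data(self.myUrl,endPage)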

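    # get the total number of pages
    def find_endPage(self,myPage):
        # find the line in the page source that carries the "last page" (尾页) link, e.g.: >8</font> xxxxx title="尾页"
        # matching Chinese requires the source to be utf-8 encoded and the pattern written as ur''
        # .*? : non-greedy match of anything
        # re.S : lets . in the regular expression match newlines as well
        myMatch=re.search(ur'>8</font>(.*?)title="尾页"',myPage,re.S)
        endPage=0
        if myMatch:
            # pull the number out of that line, e.g.: xxxx_ NUM .html
            endPage=int(re.match(r'(.*?)_(\d+)\.html',myMatch.group(1),re.S).group(2))
        else:
            print "Can't get endPage!"
        return endPage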

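    # write the tuples in the list, one by one, to sort_ustl.txt in the tests folder on my D: drive
    def save_data(self,url,endPage):
        self.get_data(url,endPage)
        # the with block closes the file even if a write fails
        with open("d:\\tests\\sort_ustl.txt",'w') as f:
            for item in self.datas:
                f.write(item[1]+', '+item[0])
        print 'Over!'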

    # fetch each of the remaining pages
    def get_data(self,url,endPage):
        for i in range(2,endPage+1):
            print 'Now the spider is crawling page %d...' % i
            # pass 'ignore' when decoding so that illegal characters are skipped
            myPage=urllib2.urlopen(url+'_'+str(i)+'.html').read().decode('gb2312','ignore')
            if not myPage:
                print 'No such is needed!'
                return
            self.deal_data(myPage)

    # extract the strings we want and append them to datas
    def deal_data(self,myPage):
        # we want each article's title, URL and click count; append a (title + URL, click count) tuple to datas, then insertion-sort datas
        myItems=re.findall(r'<TD width=565>.*?href="(.*?)">(.*?)</a>.*?class=textthick2> (\d+)</font>',myPage,re.S)
        for site,title,click in myItems:
            self.datas.append(('%s :%5swww.ustl.edu.cn%s\n' %(title,' ',site),click))
        self.insert_sort()

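    # insertion sort; only the top self.num (default 10) articles by click count are kept
    def insert_sort(self):
        # first pass: bubble the largest click count to index 0 so that it acts as a sentinel
        for i in range(len(self.datas)-1,0,-1):
            if int(self.datas[i][1])>int(self.datas[i-1][1]):
                self.datas[i],self.datas[i-1]=self.datas[i-1],self.datas[i]
        # second pass: insertion sort in descending order; the sentinel stops the inner loop before j reaches 0
        for i in range(2,len(self.datas)):
            v=self.datas[i]
            j=i
            while int(v[1])>int(self.datas[j-1][1]):
                self.datas[j]=self.datas[j-1]
                j-=1
            self.datas[j]=v
        # drop everything beyond the top self.num entries
        del self.datas[self.num:]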
        
# the page we want to crawl
ustl=USTL_Spider('http://www.ustl.edu.cn/news/news/RDXW')
#ustl=USTL_Spider('http://www.ustl.edu.cn/news/news/ZHXX')
ustl.ustl_start()
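
The click-count regular expression used in deal_data can be tried on its own, without fetching anything. The sketch below is written for Python 3, and the `SAMPLE` fragment is hypothetical: it is shaped only to fit the pattern, and the real page markup may well differ. The names `SAMPLE`, `ITEM_RE` and `rows` are likewise illustrative.

```python
import re

# Hypothetical HTML fragment, built only to exercise the pattern.
SAMPLE = '''
<TD width=565><a href="/news/news/RDXW/201406/1001.html">First headline</a></TD>
<TD><font class=textthick2> 321</font></TD>
<TD width=565><a href="/news/news/RDXW/201406/1002.html">Second headline</a></TD>
<TD><font class=textthick2> 87</font></TD>
'''

# Same pattern as in deal_data; re.S lets "." span the line breaks
# between the link cell and the click-count cell.
ITEM_RE = re.compile(
    r'<TD width=565>.*?href="(.*?)">(.*?)</a>.*?class=textthick2> (\d+)</font>',
    re.S,
)

# findall yields (site, title, click) string triples, one per article.
rows = [(title, site, int(click)) for site, title, click in ITEM_RE.findall(SAMPLE)]
for title, site, click in rows:
    print(click, title, site)
```

Converting the click count to `int` once, at extraction time, is a small variation on the article's code, which keeps it as a string and calls `int()` at every comparison.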

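Shortcoming: once the first page has been through the insertion sort, I think items from the later pages whose click count is lower than that of the last element in the sorted list could simply be skipped before insertion, with no need to put them into datas at all.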
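
One possible way to realize that shortcut, sketched in Python 3. The helper `add_to_top` and its `limit` parameter are hypothetical and not part of the article's class. The skip only applies once the list already holds `limit` entries, and this sketch also skips ties with the current last place, which is a choice rather than something the article specifies.

```python
def add_to_top(top, item, limit=10):
    """Insert item=(label, clicks) into the descending list `top`.

    Keeps at most `limit` entries (limit is assumed to be at least 1).
    Returns True if the item was inserted, False if it was skipped.
    """
    # The shortcut: when the list is full and the new click count is no
    # higher than the current last place, the item cannot make the ranking.
    if len(top) >= limit and item[1] <= top[-1][1]:
        return False
    # Otherwise walk back from the end to find the insertion point.
    j = len(top)
    while j > 0 and item[1] > top[j - 1][1]:
        j -= 1
    top.insert(j, item)
    # Trim anything pushed past the limit.
    del top[limit:]
    return True


top = []
for entry in [("a", 5), ("b", 9), ("c", 1), ("d", 7), ("e", 3)]:
    add_to_top(top, entry, limit=3)
print(top)
```

When all items are already in memory, the standard library's `heapq.nlargest` covers the same need in one call; the incremental version above fits the page-by-page crawl more directly.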

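[Screenshot of the results]

References:

1. Python list methods, worth a look:

http://www.cnblogs.com/zhengyuxin/articles/1938300.html

2. Python default argument values (mostly because this book is quite good, and very, very thin):

http://woodpecker.org.cn/abyteofpython_cn/chinese/ch07s04.html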

posted @ 2014-07-01 12:09 哈士奇.银桑
