Crawling All of Han Han's Sina Blog Posts with Python
Continuing from the previous post, where we crawled the posts linked from the first listing page: it is easy to see that the URL of each listing page differs in only one place, the page number. So all we need to do is wrap the code from the previous post in an outer loop over the page number, and we can crawl every listing page, and therefore every post.
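To make the URL pattern concrete, here is a minimal sketch (the blog's numeric ID is taken from the full code below, and the count of 7 pages is what the post states) that only prints the listing-page URLs produced by varying the page number:

for page in range(1, 8):  # the blog currently has 7 listing pages
    list_url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html'
    print(list_url)       # each URL differs only in the page number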
# -*- coding: utf-8 -*-
import urllib
import time

url = [' '] * 350        # holds the article URLs found so far
page = 1
link = 1
while page <= 7:         # the blog currently has 7 listing pages
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    i = 0
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]                        # the article URL sits between href=" and .html
        content = urllib.urlopen(url[i]).read()
        open(r'allboke/' + url[i][-26:], 'w+').write(content)  # save the page as soon as its link is found
        print 'link', link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        i = i + 1
        link = link + 1
    else:
        print 'page', page, 'find end!'
    page = page + 1
else:
    print 'all find end'

# The second pass from the previous post (download only after all links
# have been collected) is no longer needed, so it stays commented out:
#i = 0
#while i < 350:
#content = urllib.urlopen(url[i]).read()
#open(r'save/'+url[i][-26:],'w+').write(content)
#print 'downloading',i,url[i]
#i = i + 1
#time.sleep(1)
#else:
print 'download artical finished!'
So the code that saves each page is moved inside the search loop: as soon as a link is found, the corresponding page is downloaded and saved.
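For reference, the sketch below is a rough Python 3 equivalent of the same idea, using urllib.request and a regular expression in place of the chain of str.find() calls. The regex is only an assumption about the listing page's markup, and the decode step ignores non-UTF-8 bytes since only the link URLs are needed:

import os
import re
import urllib.request

os.makedirs('allboke', exist_ok=True)          # the original code assumes this folder already exists

link = 1
for page in range(1, 8):                       # 7 listing pages in total
    list_url = ('http://blog.sina.com.cn/s/articlelist_1191258123_0_'
                + str(page) + '.html')
    con = urllib.request.urlopen(list_url).read().decode('utf-8', 'ignore')
    # Assumed markup: each post appears as <a title=... href="http://blog.sina.com.cn/s/blog_xxxx.html">
    for url in re.findall(r'<a title=[^>]*href="(http://blog\.sina\.com\.cn/s/blog_[^"]+\.html)"', con):
        content = urllib.request.urlopen(url).read()
        # save the page as soon as its link is found, exactly as in the Python 2 version
        with open('allboke/' + url[-26:], 'wb') as f:
            f.write(content)
        print('link', link, url)
        link += 1
    print('page', page, 'find end!')
print('all find end')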
Console output from a successful run:

Result of the run: