Python crawler learning, part 1

Crawl the diameizi site with Python, then download the pictures.

Python environment: 2.7.3

Code: https://gist.github.com/zjjott/5270366

Author's discussion thread: http://tieba.baidu.com/p/2239765168?fr=itb_feed_jing#30880553662l

Target site with the photos to scrape: http://diameizi.diandian.com/

#coding=utf-8
import os
# Very lazy way to enumerate the whole site: let wget spider the page tree
# and capture its log, which mentions every URL it visits
os.system("wget -r --spider http://diameizi.diandian.com 2>|log.txt")
filein=open('log.txt','r')
fileout=open('dst','w+')# file that ends up holding the final list of post URLs
filelist=list(filein)
import urllib2,time
from bs4 import BeautifulSoup
header={
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1'}
def getsite(url):
    req=urllib2.Request(url,None,header)
    site=urllib2.urlopen(req)
    return site.read()# the lines above are all-purpose page-fetching boilerplate
if not os.path.isdir('pic'):
    os.mkdir('pic')# the downloads below fail if pic/ does not exist
try:
    dst=set()
    for p in filelist:
        if p.find('http://diameizi.diandian.com/post')>-1:
            p=p[p.find('http'):].strip()# keep just the URL, drop the trailing newline
            dst.add(p)
    i=0
    for p in dst:
        #if i<191:
        #        i+=1
        #        continue# uncomment to resume an interrupted run at item 191
        pagesoup=BeautifulSoup(getsite(p))
        pageimg=pagesoup.find_all('img')
        for href in pageimg:
            print i,href['src']
            # crude way to derive a filename from the URL, but it works well enough
            picpath="pic/"+href['src'][-55:-13]+href['src'][-4:]
            pic=getsite(href['src'])
            picfile=open(picpath,'wb')
            picfile.write(pic)
            i+=1
            picfile.close()
finally:
    for p in dst:
        fileout.write(p+'\n')
    fileout.close()
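The filename slicing in the script (`href['src'][-55:-13]`) assumes image URLs of a fixed length. A sturdier sketch (Python 3; the example URL is hypothetical) derives the local name from the URL path instead:

```python
import os
from urllib.parse import urlparse

def pic_filename(src):
    # Take the last path component of the image URL as the local filename,
    # instead of slicing at fixed offsets as the script above does
    return os.path.join("pic", os.path.basename(urlparse(src).path))

# hypothetical URL, just to show the shape of the result
print(pic_filename("http://img.example.com/photos/abc123.jpg"))
```

This way the name survives URLs of any length, at the cost of possible collisions if two posts reuse the same basename.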

The log.txt produced above looks roughly like the following.

Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:10--  http://diameizi.diandian.com/
Resolving diameizi.diandian.com (diameizi.diandian.com)... 113.31.29.120, 113.31.29.121
Connecting to diameizi.diandian.com (diameizi.diandian.com)|113.31.29.120|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30502 (30K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-03-29 23:00:11--  http://diameizi.diandian.com/
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `diameizi.diandian.com/index.html'

     0K .......... .......... .........                        94.6K=0.3s

2013-03-29 23:00:12 (94.6 KB/s) - `diameizi.diandian.com/index.html' saved [30502]

Loading robots.txt; please ignore errors.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/robots.txt
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 209 [text/plain]
Saving to: `diameizi.diandian.com/robots.txt'

     0K                                                       100% 20.8M=0s

2013-03-29 23:00:12 (20.8 MB/s) - `diameizi.diandian.com/robots.txt' saved [209/209]

Removing diameizi.diandian.com/robots.txt.
Removing diameizi.diandian.com/index.html.

Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/rss
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/xml]
Remote file exists but does not contain any link -- not retrieving.

Removing diameizi.diandian.com/rss.
unlink: No such file or directory

Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/archive
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 82303 (80K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-03-29 23:00:12--  http://diameizi.diandian.com/archive
Reusing existing connection to diameizi.diandian.com:80.

The script digs the post URLs it needs out of this text file.
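That extraction step can be sketched like this (Python 3; the sample lines mimic the wget log format shown above):

```python
def extract_post_urls(log_text, prefix="http://diameizi.diandian.com/post"):
    # Collect unique post URLs from wget --spider output; the set
    # de-duplicates pages that wget visits more than once
    urls = set()
    for line in log_text.splitlines():
        pos = line.find(prefix)
        if pos > -1:
            urls.add(line[pos:].split()[0])  # keep only the URL token
    return urls

sample = (
    "--2013-03-29 23:00:12--  http://diameizi.diandian.com/post/abc\n"
    "Reusing existing connection to diameizi.diandian.com:80.\n"
    "--2013-03-29 23:00:13--  http://diameizi.diandian.com/post/abc\n"
)
print(extract_post_urls(sample))  # → {'http://diameizi.diandian.com/post/abc'}
```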

The code above has not run successfully yet, probably because of the 2.7.3 environment: the example it was adapted from appears to target Python 3.x, and there are some discrepancies.
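For reference, the `urllib2`-based fetch helper translates to Python 3 roughly as follows (`urllib2` was merged into `urllib.request`); this is only a sketch and has not been run against the site:

```python
from urllib.request import Request, urlopen

HEADER = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; '
                        'rv:8.0.1) Gecko/20100101 Firefox/8.0.1'}

def getsite(url):
    # Python 3 equivalent of the Python 2 getsite(): urllib2.Request ->
    # urllib.request.Request, urllib2.urlopen -> urllib.request.urlopen
    req = Request(url, None, HEADER)
    with urlopen(req) as site:
        return site.read()  # bytes: decode() for HTML, keep raw for images
```

Note that `read()` returns bytes in Python 3, so image data can be written to a `'wb'` file unchanged, while HTML must be decoded before parsing as text.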

posted @ 2013-03-29 23:09  spaceship9