A simple Python crawler
1. The main thing to learn here is the design approach of the program:
a. Fetch and parse the website
b. Find the relevant pages
c. Find the image-link elements
d. Save the images into a folder
.....
Break each step out into its own function; the resulting code is much more readable.
## Note: the code may still raise errors when run and needs further tweaking.
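One error you will hit on a second run: the script creates its download folder with a bare `os.mkdir`, which raises `FileExistsError` if the folder already exists. A minimal sketch of a safer version, assuming the helper name `ensure_folder` (not from the original code):

```python
import os
import tempfile

def ensure_folder(folder):
    # Unlike bare os.mkdir, makedirs with exist_ok=True is a no-op
    # when the folder already exists, so re-running the crawler is safe.
    os.makedirs(folder, exist_ok=True)

# Demonstrate inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'OOXX')
    ensure_folder(target)
    ensure_folder(target)  # second call does not raise FileExistsError
    print(os.path.isdir(target))  # True
```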
import urllib.request
import os

# Base URL of the target site; the original snippet used `url` without ever
# defining it, so this value is an assumption inferred from the 'OOXX' folder name.
url = 'http://jandan.net/ooxx/'

def url_open(url):                      # fetch the page
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def get_page(url):                      # find the current page number
    html = url_open(url).decode('utf-8')
    a = html.find('current-comment-page') + 23   # skip past the marker to the digits
    b = html.find(']', a)
    return html[a:b]

def find_imgs(url):                     # find the image-link elements
    html = url_open(url).decode('utf-8')
    img_addrs = []
    a = html.find('img src=')
    while a != -1:
        b = html.find('.jpg', a, a + 255)
        if b != -1:
            img_addrs.append(html[a + 9:b + 4])
        else:
            b = a + 9
        a = html.find('img src=', b)
    return img_addrs

def save_imgs(folder, img_addrs):       # save the images into the folder
    for each in img_addrs:
        filename = each.split('/')[-1]
        with open(filename, 'wb') as f:
            img = url_open(each)
            f.write(img)

def download_mm(folder='OOXX', pages=10):
    os.mkdir(folder)
    os.chdir(folder)
    page_num = int(get_page(url))
    for i in range(pages):
        page_num -= i
        page_url = url + 'page-' + str(page_num) + '#comments'
        img_addrs = find_imgs(page_url)
        save_imgs(folder, img_addrs)

if __name__ == '__main__':
    download_mm()
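The `str.find()` scanning in `find_imgs` is fragile: it assumes the `.jpg` extension appears within 255 characters and that links start exactly at `img src=`. A regular expression is a more robust sketch of the same extraction; the function name `find_img_urls` and the sample HTML below are made up for illustration:

```python
import re

def find_img_urls(html):
    # Capture the src of every <img> tag that ends in .jpg; this replaces
    # scanning with str.find() at fixed character offsets.
    return re.findall(r'<img\s+src="([^"]+?\.jpg)"', html)

sample = ('<p><img src="http://example.com/photos/a.jpg">'
          '<img src="http://example.com/photos/b.png"></p>')
print(find_img_urls(sample))  # ['http://example.com/photos/a.jpg']
```

Only the `.jpg` address is returned; the `.png` link fails the pattern and is skipped, mirroring the original code's intent.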
Stay positive and keep coding!