Crawlers: Scraping Pictures from Mzitu (妹子图)

Crawler lesson one! A treat for all the otaku out there~

Without further ado, straight to the code!

 

# -*- encoding: utf-8 -*-
# FUNCTION: capture pictures from mzitu.com galleries

import os
import time

import requests
from bs4 import BeautifulSoup

url_list = ['http://www.mzitu.com/201024', 'http://www.mzitu.com/169782']  # galleries of interest

headers = {
    # mzitu rejects image requests without a Referer header (hotlink protection)
    'referer': 'https://www.mzitu.com/201024',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 '
                  'Safari/537.36'
}


def get_page_num(url):
    """Return the number of pages in a gallery and the gallery name."""
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    page_num = soup.find(class_='pagenavi').find_all('a')[-2].text
    name = soup.find(class_='currentpath').text.split()[-1]
    return page_num, name  # page_num is a string


def parse_page(url):
    """
    Get the picture on one page.

    :param url: page URL
    :return: picture URL, picture name
    """
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    pic_url = soup.find(class_='main-image').find('img')['src']
    pic_name = soup.find(class_='main-title').text
    return pic_url, pic_name


def get_pic(pic_url, pic_name, name):
    """Download and save one picture."""
    response = requests.get(pic_url, headers=headers, allow_redirects=False)
    filepath = '/home/f/crawler/Beauty/photo/' + name + '/' + pic_name + '.jpg'
    with open(filepath, 'wb') as f:
        f.write(response.content)


def main():
    for url in url_list:
        page_num, name = get_page_num(url)
        try:
            os.mkdir('/home/f/crawler/Beauty/photo/' + name)
        except FileExistsError:
            pass                                   # directory already exists
        for page in range(1, int(page_num) + 1):   # iterate over every page
            page_url = url + '/' + str(page)
            print(page_url)
            pic_url, pic_name = parse_page(page_url)
            get_pic(pic_url, pic_name, name)
        time.sleep(2)                              # be polite between galleries


if __name__ == '__main__':
    main()
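
A caveat about get_pic: the title scraped from main-title goes straight into the file path, but gallery titles can contain characters that are illegal in filenames, and a failed hotlink check would silently write an error page to disk. Here is a minimal hardened sketch; safe_get_pic, base_dir, and the character blacklist are my own illustrative choices, not part of the original script, and it reuses the headers dict defined above.

import os
import re

import requests


def safe_get_pic(pic_url, pic_name, name, base_dir='/home/f/crawler/Beauty/photo'):
    """Variant of get_pic: sanitize the filename and check the HTTP status.

    safe_get_pic and base_dir are illustrative, not part of the original script.
    """
    # Replace characters that are illegal or awkward in filenames.
    clean_name = re.sub(r'[\\/:*?"<>|]', '_', pic_name).strip()
    response = requests.get(pic_url, headers=headers, allow_redirects=False)
    if response.status_code != 200:  # anything else usually means the hotlink check failed
        print('skipped', pic_url, response.status_code)
        return
    filepath = os.path.join(base_dir, name, clean_name + '.jpg')
    with open(filepath, 'wb') as f:
        f.write(response.content)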

Feel free to bookmark this and work through it at your own pace!
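
One last tip: on a long crawl, a single network hiccup will kill the whole run. A small retry wrapper like the sketch below can be dropped in wherever the script calls requests.get; fetch, max_retries, and the backoff schedule are my own illustrative choices, not part of the original script.

import time

import requests


def fetch(url, headers, max_retries=3, timeout=10):
    """GET a URL, retrying with a growing delay on network errors."""
    for attempt in range(1, max_retries + 1):
        try:
            return requests.get(url, headers=headers, timeout=timeout)
        except requests.RequestException:
            if attempt == max_retries:
                raise                    # give up after the last attempt
            time.sleep(2 * attempt)      # back off a little longer each time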

————————————————————————————————————————————

Follow on WeChat: **爬虫王者**


 