python-爬图片集【bs4】

一、目标

  1、把头像图片下载到本地中:https://www.nanrentu.cc/txtp/

  2、根据每个组的标题创建文件夹,并把每组中的全部头像图片放到对应的文件夹中

二、思路

  1、首先要先分析网页结果,通过分析发现,文件的名称其实就在<ul class = "h-piclist">标签下的ul>a标签里,可以通过对网页发起get请求,然后创建bs4来获取链接和文件名称

   2、根据bs4处理后,通过find("tag",class= "值")来获取整个ul下的内容,在调用.findall(“a”)获取a标签下的所有数据

# -*- coding: utf-8 -*-
#1、导包
import requests
import os
from bs4 import BeautifulSoup
#2、发起请求
url = "https://www.nanrentu.cc/txtp/page_1.html"
hearder = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0"
}
req = requests.get(url=url,headers=hearder)
soup = BeautifulSoup(req.text,'lxml')
a_list = soup.find('ul',class_ = 'h-piclist').findAll('a')
for a in a_list:
    name = a['title']
    hrefs =a['href']
print(name + hrefs)
#打印结果如下

   3、获取到文件夹的名称后,

    1)下面需要利用os模块批量创建文件

for a in a_list:
    name = a['title']
    hrefs =a['href']
    #创建文件夹
    file_path = r'E:\python123\\'
    dir_name = file_path + name
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)

       2)对获取的href链接再次发起,把每个链接中的图片src拿到并获取内容保存到对应文件夹中:

    child_html = requests.get(url=hrefs,headers=hearder)
    child_soup = BeautifulSoup(child_html.text,'lxml')
    child_img_list = child_soup.find('div',class_="info-pic-list").findAll('img')
    for child_img in child_img_list:
        pic_name = child_img['alt']
        pic_src = child_img['src']
        print(pic_name + pic_src)

运行后如下:

for child_img in tqdm(child_img_list):
  pic_name = child_img['alt']
  pic_src = child_img['src']
  print(pic_name + pic_src)
  #保存数据
  with open(dir_name + "/" + pic_name + ".jpg" ,mode='wb')as f :
  f.write(requests.get(pic_src).content)
  print(pic_name + "下载完成")
f.close()

至此,图片就会保存到对应的文件夹中了

补充:为了下载有进度条,看起来更加优雅美观,在for i in tqmd(xxx)循环中加上tqdm即可

from tqdm import tqdm

三、过程【完整代码】

# -*- coding: utf-8 -*-
#1、导包
import requests
import os
from bs4 import BeautifulSoup
from tqdm import tqdm

#2、发起请求
url = "https://www.nanrentu.cc/txtp/page_1.html"
hearder = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0"
}
req = requests.get(url=url,headers=hearder)
soup = BeautifulSoup(req.text,'lxml')

a_list = soup.find('ul',class_ = 'h-piclist').findAll('a')
for a in tqdm(a_list):
    name = a['title']
    hrefs =a['href']
    #3、创建文件夹
    file_path = r'E:\python123\\'
    dir_name = file_path + name
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)

    child_html = requests.get(url=hrefs,headers=hearder)
    child_soup = BeautifulSoup(child_html.text,'lxml')
    child_img_list = child_soup.find('div',class_="info-pic-list").findAll('img')
    for child_img in tqdm(child_img_list):
        pic_name = child_img['alt']
        pic_src = child_img['src']
        print(pic_name + pic_src)
        #4、保存数据
        with open(dir_name + "/" + pic_name + ".jpg" ,mode='wb')as f :
            f.write(requests.get(pic_src).content)
            print(pic_name + "下载完成")
    f.close()

四、总结

  1、以上代码还存在点问题,后续有时间再优化

posted @ 2024-01-06 15:31  zhang0513  阅读(17)  评论(0编辑  收藏  举报