Data Acquisition and Fusion Technology Practice: Assignment 1

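I. Task ①
(Requirement: use the requests and BeautifulSoup libraries to crawl data from the given URL and print the scraped university ranking information to the screen.)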

1. Code and execution
(1) Code:

import urllib.request
from bs4 import BeautifulSoup

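The headers dictionary supplies the header fields and is combined with the url to build an HTTP request object.
Passing the request object to urlopen sends the request, and the returned response object is stored in the response variable.
The response object's read method returns the response body. Data travels over the network as a binary byte stream, so the result has to be decoded with decode.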

# Target URL
url = 'http://www.shanghairanking.cn/rankings/bcur/2020'

# Fetch the page content
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html_content = response.read().decode('utf-8')
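
The request-building and decoding steps above can be tried in isolation, without touching the network. This is only a sketch: the helper names build_request and decode_body are made up for the illustration, and falling back to UTF-8 when no charset is given is an assumption rather than something the script above does.

```python
import urllib.request
from typing import Optional


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Bundle the URL and header fields into a request object (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Turn the raw bytes of a response body into text."""
    # Assumption: fall back to UTF-8 when no charset is supplied.
    return raw.decode(charset or "utf-8", errors="replace")


req = build_request("http://www.shanghairanking.cn/rankings/bcur/2020", "demo-agent/1.0")
print(req.full_url)
# urllib stores header names in capitalized form, hence "User-agent"
print(req.get_header("User-agent"))
print(decode_body("大学排名".encode("utf-8")))
```

With a live response, response.headers.get_content_charset() could supply the charset argument instead of hard-coding 'utf-8'.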

BeautifulSoup loads the decoded content into a BeautifulSoup object so that the HTML can be navigated as a structure.

soup = BeautifulSoup(html_content, 'html.parser')

# Find the table that holds the university ranking information
table = soup.find('table', {'class': 'rk-table'})

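Inspecting the HTML of the 2020 university ranking page with the browser's F12 developer tools shows the following:
the ranking sits under a table node, each university's information is stored in a tr node, and the individual fields of a university are stored in the td nodes under that tr.
The crawl is therefore structured as follows: iterate over all tr nodes (one per university) under the table, collect the td nodes of each tr, and assign them to variables according to the order in which the td nodes map to the fields.
Finally, print the fields of each university one by one.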

# Extract the ranking information from the table
print(f"{'Rank':<5} {'University':<15} {'Province':<10} {'Type':<10} {'Score':<5}")
for row in table.find_all('tr')[1:]:  # the first tr is the header row
    cols = row.find_all('td')
    if len(cols) < 5:
        continue  # skip rows that do not carry all five fields
    rank = cols[0].get_text(strip=True)
    name = cols[1].get_text(strip=True)
    province = cols[2].get_text(strip=True)
    type_ = cols[3].get_text(strip=True)
    score = cols[4].get_text(strip=True)

    # Print one formatted ranking line
    print(f"{rank:<5} {name:<15} {province:<10} {type_:<10} {score:<5}")
# 012202239朱佳杰
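
The tr/td traversal described above can be exercised offline on a small hand-written table. The sample HTML below only mimics the table -> tr -> td layout and is not a copy of the real page; parse_rows is a hypothetical helper, and returning an empty list when the table is missing is an assumption about how one might want to fail.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the layout; the values are placeholders.
SAMPLE_HTML = """
<table class="rk-table">
  <tr><th>Rank</th><th>Name</th><th>Province</th><th>Type</th><th>Score</th></tr>
  <tr><td> 1 </td><td>University A</td><td>Province X</td><td>Comprehensive</td><td>100.0</td></tr>
  <tr><td>2</td><td>University B</td><td>Province Y</td><td>Comprehensive</td><td>90.5</td></tr>
  <tr><td>incomplete row</td></tr>
</table>
"""


def parse_rows(html, min_cols=5):
    """Return one list of cell texts per complete university row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="rk-table")
    if table is None:  # the table is missing or the layout changed
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row has th cells only, so it yields no td and is skipped here too.
        if len(cells) >= min_cols:
            rows.append(cells[:min_cols])
    return rows


for rank, name, province, type_, score in parse_rows(SAMPLE_HTML):
    print(rank, name, province, type_, score)
```

Checking the number of td cells before indexing keeps one malformed row from stopping the whole crawl with an IndexError.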

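(2) Output:

2. Reflections:
1. I now remember the HTTP request/response process and the form the data takes more firmly.
2. I can read the structure of an HTML document more precisely and clearly.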

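II. Task ②
(Requirement: use the requests and re libraries to design a targeted price-comparison crawler for an online store of your choice; crawl the store's search results page for the keyword “书包” (schoolbag) and extract product names and prices.)
1. Code and execution
(1) Code: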

import requests
import time
import random
from bs4 import BeautifulSoup
import os

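Comparing the URLs of different result pages reveals the pattern they follow.
Assigning different values to page_index yields the URL of each page, and these URLs are stored in the urls list.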

keywords = '书包'  # search keyword ("schoolbag")
pages = 3
# The key parameter is the keyword percent-encoded as GBK bytes
base_url = 'https://search.dangdang.com/?key=%CA%E9%B0%FC&act=input'
urls = [base_url if i == 1 else f"{base_url}&page_index={i}" for i in range(1, pages + 1)]
page_data = []
# Make sure the image folder exists
os.makedirs('image', exist_ok=True)  # 012202239朱佳杰
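
The URL pattern above can also be generated from the keyword instead of a hard-coded query string. In this sketch build_page_urls is a hypothetical helper; encoding the keyword as GBK is inferred from the percent-encoded bytes in the original URL (%CA%E9%B0%FC is “书包” in GBK), not from any site documentation.

```python
from urllib.parse import quote

SEARCH_URL = "https://search.dangdang.com/"


def build_page_urls(keyword, pages):
    """Return the search URLs for pages 1..pages of the given keyword."""
    # Percent-encode the keyword as GBK bytes, matching the original URL.
    key = quote(keyword, encoding="gbk")
    first = f"{SEARCH_URL}?key={key}&act=input"
    # Page 1 has no page_index parameter; later pages append it.
    return [first if i == 1 else f"{first}&page_index={i}" for i in range(1, pages + 1)]


for page_url in build_page_urls("书包", 3):
    print(page_url)
```

Building the query from the keyword makes it possible to search for another term without re-encoding the URL by hand.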

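A cookie (browsing data the site stores on the client after a visit) and a user agent (a string describing the operating system and browser) make up the request headers, so that the request resembles one sent by a browser.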

Image_index = 0
headers = {
    'Cookie': 'ddscreen=2; __permanent_id=20241007105845812406897428179553110; __rpm=%7Cmix_317715...1728269927295; search_passback=3aa3a5bba56ad27e6b4e0367f00100008fb16500614e0367; dest_area=country_id%3D9000%26province_id%3D111%26city_id%3D0%26district_id%3D0%26town_id%3D0; pos_9_end=1728269931497; ad_ids=3618801%2C2573745%2C2555721%7C%231%2C1%2C1'
    , 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
    }

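Iterating over the page URLs in urls produces the effect of crawling page after page.
time.sleep pauses for 2 to 5 seconds to imitate a person browsing the site in a browser.
requests fetches the response for a single page directly; after loading it into BeautifulSoup, the image links of the img elements are extracted and stored in images.
Each link in images is then requested, and the returned data is written to image/imagexxxx.jpg.
(Two loops in total: one over the pages and one over the image requests.)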

for url in urls:  # 012202239朱佳杰
    # Wait a random interval to mimic a normal user
    time.sleep(random.randint(2, 5))
    print("=" * 10 + f'Crawling page {url.split("&page_index=")[-1] if "&page_index=" in url else 1}' + "=" * 10)
    response = requests.get(url=url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # The real image link is kept in the data-original attribute
        images = [img.get('data-original') for img in soup.find_all('img', attrs={'data-original': True})]
        for image_url in images:
            # Add a scheme to protocol-relative links such as //host/path.jpg
            if not image_url.startswith(('http://', 'https://')):
                image_url = 'https://' + image_url.lstrip('/')
            try:
                img_response = requests.get(url=image_url, headers=headers)
                if img_response.status_code == 200:
                    filename = f"image{Image_index:04d}.jpg"
                    filepath = os.path.join('image', filename)
                    with open(filepath, 'wb') as img_file:
                        img_file.write(img_response.content)
                    print(f"Image saved to: {filepath}")
                    Image_index += 1
                else:
                    print(f"Image download failed, status code: {img_response.status_code}")
            except requests.RequestException as e:
                print(f"Request for {image_url} failed: {e}")
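
Two small string-handling steps in the loop above, reading the page number from the URL and giving image links a scheme, can be checked offline. Both function names are hypothetical, the example host is a placeholder, and reading page_index with parse_qs is an alternative to the split-based expression in the listing, not the original approach.

```python
from urllib.parse import parse_qs, urlparse


def page_number(url):
    """Read page_index from the query string; page 1 carries none, so default to 1."""
    query = parse_qs(urlparse(url).query)
    return int(query.get("page_index", ["1"])[0])


def normalize_image_url(src):
    """Give protocol-relative or scheme-less image links an explicit https scheme."""
    if src.startswith(("http://", "https://")):
        return src
    return "https://" + src.lstrip("/")


base = "https://search.dangdang.com/?key=%CA%E9%B0%FC&act=input"
print(page_number(base))
print(page_number(base + "&page_index=3"))
print(normalize_image_url("//img.example.com/a/b.jpg"))
```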

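(2) Output:
View the images after converting the file format.

2. Reflections:
1. I understand the composition of URLs better and know how to handle pagination.
2. I learned the basic logic and methods for getting around anti-crawling mechanisms.
3. Compared with Task ①, I have a clearer picture of how the requests library and the urllib.request library differ in the way they send requests.
4. I learned that checking whether the response status code equals 200 shows whether a request succeeded.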
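
For the name-and-price part of the Task ② requirement, a regular-expression extraction could look roughly like the following. This is only a sketch: the sample markup is invented for illustration and is not the store's actual HTML, so ITEM_RE would have to be adapted to whatever the real search page contains, and extract_items is a hypothetical helper.

```python
import re

# Invented markup for illustration only; the real search page will differ.
SAMPLE_HTML = '''
<li><a title="Sample backpack A" href="#">...</a><span class="price">&yen;128.50</span></li>
<li><a title="Sample backpack B" href="#">...</a><span class="price">&yen;59.00</span></li>
'''

ITEM_RE = re.compile(
    r'<a[^>]*\btitle="(?P<name>[^"]+)"[^>]*>.*?'  # product name from the title attribute
    r'<span class="price">&yen;(?P<price>\d+(?:\.\d+)?)</span>',  # number after the yen sign
    re.S,
)


def extract_items(html):
    """Return (name, price) pairs in the order they appear in the HTML text."""
    return [(m.group("name"), float(m.group("price"))) for m in ITEM_RE.finditer(html)]


# Sort by price so the cheapest offer comes first
for name, price in sorted(extract_items(SAMPLE_HTML), key=lambda item: item[1]):
    print(f"{price:>8.2f}  {name}")
```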

III. Task ③
(Requirement: crawl all JPEG and JPG files from a given web page ( https://news.fzu.edu.cn/yxfd.htm ).)

1. Code and execution
(1) Code:

from calendar import error
import requests
import time
import random
from bs4 import BeautifulSoup
import os
import urllib.request

Set the url and the request headers.

url='https://news.fzu.edu.cn/yxfd.htm'
headers={'cookie':'JSESSIONID=CF4FF77D8024443B9F9D260CC3C39EA8',
         'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 SLBrowser/9.0.3.5211 SLBChan/105'}

if not os.path.exists('image'):
    os.makedirs('image')

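Request returns an HTTP request object; passing it to urlopen sends the request, and the body of the response object is read as a byte stream and decoded.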

req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
html_content = response.read().decode('utf-8')

Load the content into a BeautifulSoup object.

soup = BeautifulSoup(html_content, 'html.parser')

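Browsing the page and inspecting its HTML with F12 shows that the image links are stored in the src attribute of img nodes under li -> a -> div.
A CSS selector finds all of these img nodes, and the image link in each src is extracted and stored.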

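# Collect the src attribute of every img node matched by the CSS selector
urls = []
eles = soup.select("li a div img")
for i in eles:
    if i.get('src'):  # skip img nodes without a src attribute
        urls.append(i['src'])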
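
The selector step can be tried on a small fragment without fetching the page. The HTML below is hand-written to mimic the li -> a -> div -> img nesting and is not taken from the real site; collect_srcs is a hypothetical helper, and the [src] attribute selector is one way to skip img nodes that have no src.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the nesting; not the real page.
SAMPLE_HTML = """
<ul>
  <li><a href="#"><div><img src="/images/a.jpg"></div></a></li>
  <li><a href="#"><div><img alt="no src here"></div></a></li>
  <li><a href="#"><div><img src="images/b.jpeg"></div></a></li>
</ul>
<img src="/images/logo.png">
"""


def collect_srcs(html):
    """Return the src of every img nested under li a div that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # "li a div img" is a descendant selector; [src] keeps only nodes with that attribute.
    return [img["src"] for img in soup.select("li a div img[src]")]


print(collect_srcs(SAMPLE_HTML))
```

The logo image at the end is ignored because it does not sit inside an li a div chain.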

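Iterate over the image links in urls, request each image, and store it as image/imagexxxx.jpg. The algorithm is the same as in Task ②, and because the paths are identical, the files overwrite the schoolbag images saved earlier.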

Image_index = 0
for image_url in urls:
    # Prefix site-relative links with the site root; leave absolute links untouched
    if not image_url.startswith(('http://', 'https://')):
        image_url = 'https://news.fzu.edu.cn/' + image_url.lstrip('/')
    try:
        response = requests.get(url=image_url, headers=headers)
        if response.status_code == 200:
            filename = f"image{Image_index:04d}.jpg"
            filepath = os.path.join('image', filename)
            with open(filepath, 'wb') as img_file:
                img_file.write(response.content)
            print(f"Image saved to: {filepath}")
            Image_index += 1
        else:
            print(f"Image download failed, status code: {response.status_code}")
    except requests.RequestException as e:
        print(f"Request for {image_url} failed: {e}")
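
Since the task asks specifically for JPEG and JPG files, the links could be resolved and filtered by extension before downloading. The sketch below makes several assumptions: is_jpeg and jpeg_targets are hypothetical helpers, the sample src values are made up, urljoin replaces the string-prefix approach of the listing, and the separate folder name image_task3 is only a suggestion for keeping the Task ② files from being overwritten.

```python
import os
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://news.fzu.edu.cn/yxfd.htm"
JPEG_EXTENSIONS = (".jpg", ".jpeg")


def is_jpeg(url):
    """True when the URL path ends in .jpg or .jpeg (case-insensitive, query ignored)."""
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    return extension in JPEG_EXTENSIONS


def jpeg_targets(srcs, page_url=PAGE_URL, folder="image_task3"):
    """Pair each JPEG link (made absolute) with a numbered path in its own folder."""
    absolute = [urljoin(page_url, src) for src in srcs]  # resolves relative links
    jpegs = [u for u in absolute if is_jpeg(u)]
    return [(u, os.path.join(folder, f"image{i:04d}.jpg")) for i, u in enumerate(jpegs)]


# Made-up src values for illustration
samples = ["/pics/a.jpg", "images/b.JPEG?v=1", "/images/logo.png", "https://cdn.example.com/c.jpg"]
for link, path in jpeg_targets(samples):
    print(link, "->", path)
```

Each returned pair could then be passed to the download loop in place of the raw src values.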

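(2) Output:
View the images after converting the file format.

2. Reflections:
1. I have become more practiced at parsing HTML structure.
2. I have a deeper understanding of the pattern "crawl the resource links first, then crawl the resources themselves".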

posted on 2024-10-20 19:47 朱艾伦
