🎇Assignment ①

🚀(1) Requirements

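Requirement: use the requests and BeautifulSoup libraries to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the university ranking data to the screen.
Expected output:

Rank	University	Province/City	Type	Total score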
1	清华大学	北京	综合	852.5
2......

✒️(2) Code and screenshots

import requests
from bs4 import BeautifulSoup

target_url = "http://www.shanghairanking.cn/rankings/bcur/2020"

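response = requests.get(target_url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
html_content = response.content  # raw bytes; BeautifulSoup detects the encoding

soup = BeautifulSoup(html_content, 'lxml')
ranking_table = soup.find('table')
if ranking_table is None:
    raise SystemExit("No <table> found; the page layout may have changed.")

print(f"{'Rank':<10}{'University':<30}{'Province/City':<15}{'Type':<15}{'Total score':<10}")

table_rows = ranking_table.find_all('tr')
for row in table_rows[1:]:  # skip the header row
    columns = row.find_all('td')
    if len(columns) >= 5:  # ignore rows that lack the five expected cells
        ranking = columns[0].get_text(strip=True)
        school = columns[1].get_text(strip=True)
        location = columns[2].get_text(strip=True)
        university_type = columns[3].get_text(strip=True)
        score = columns[4].get_text(strip=True)

        # Print the row with left-aligned, fixed-width columns
        print(f"{ranking:<10}{school:<30}{location:<15}{university_type:<15}{score:<10}")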
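The fixed-width format specifiers above count characters, not terminal columns, so rows that contain Chinese text tend to drift out of alignment: each CJK character usually occupies two columns. Below is a minimal sketch of one workaround using only the standard library. The names display_width and pad_right and the column widths are illustrative assumptions, not part of the original code.

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal columns the text occupies."""
    # East Asian Wide ("W") and Fullwidth ("F") characters usually take two columns.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in text)


def pad_right(text, width):
    """Left-align text in a field that is `width` display columns wide."""
    return text + " " * max(0, width - display_width(text))


# Illustrative row and column widths (assumed values)
row = ("1", "清华大学", "北京", "综合", "852.5")
widths = (6, 24, 10, 10, 8)
print("".join(pad_right(cell, w) for cell, w in zip(row, widths)))
```

In the loop above, each `{value:<N}` field could be swapped for `pad_right(value, N)`. The result still depends on the terminal font rendering wide characters at exactly two columns.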

🧾(3) Reflections

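Through this assignment I learned how to use Python's requests and BeautifulSoup libraries to crawl web page data, gained a better understanding of how a page's HTML is structured, and improved my programming skills.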

🎊Assignment ②

🕸️(1) Requirements

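Requirement: use the requests and re libraries to build a targeted price-comparison crawler for an online store of your own choice. Crawl that store's search results page for the keyword “书包” (backpack) and extract the product names and prices.
Expected output:

No.	Price	Product name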
1	65.00	xxx
2......

🎄(2) Code and screenshots

import requests
from bs4 import BeautifulSoup

class AmazonScraper:
    def __init__(self, keyword):
        self.keyword = keyword
        self.base_url = "https://www.amazon.com/"
        self.search_url = "https://www.amazon.com/s?"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
        }

    def search(self, page=1):
        params = {
            "k": self.keyword,
            "page": page,  # pagination parameter (assumed name); without it every call fetches page 1
            "ref": "nb_sb_noss_2",
            "__mk_link_id": "MFController"
        }
        response = requests.get(self.search_url, headers=self.headers, params=params, timeout=10)
        return response.text

    def parse_page(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        items = soup.find_all('div', {'data-component-type': 's-search-result'})
        results = []
        for item in items:
            try:
                title = item.find('span', {'class': 'a-size-medium'}).text.strip()
                # Look the price tag up once and reuse it
                price_tag = item.find('span', {'class': 'a-price-whole'})
                price = price_tag.text.strip() if price_tag else "No Price"
                results.append((title, price))
            except AttributeError:
                # No title span in this result block; skip it
                continue
        return results

    def scrape(self, pages=1):
        all_results = []
        for page in range(1, pages + 1):
            print(f"Scraping page {page}...")
            html = self.search(page)
            results = self.parse_page(html)
            all_results.extend(results)
            print(f"Page {page} scraped: {len(results)} items found.")
        return all_results

if __name__ == "__main__":
    keyword = "backpack"
    scraper = AmazonScraper(keyword)
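    results = scraper.scrape(2)  # scrape the first 2 pages
    # Print in the layout the assignment asks for: No., price, product name
    for index, (title, price) in enumerate(results, start=1):
        print(f"{index}\t{price}\t{title}")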
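The assignment asks for requests and re, while the code above parses with BeautifulSoup. As a hedged sketch of what a re-based extraction step could look like, the snippet below runs a regular expression over a hard-coded HTML sample. The markup, the class names, SAMPLE_HTML, ITEM_PATTERN, and extract_items are all hypothetical. A real store's HTML will differ and would need its own pattern.

```python
import re

# Hypothetical search-result markup; real pages differ and change often.
SAMPLE_HTML = """
<li class="item"><span class="price">65.00</span><a class="name" title="Canvas backpack">...</a></li>
<li class="item"><span class="price">129.90</span><a class="name" title="Laptop backpack">...</a></li>
"""

# One match per product: the price, then the nearest following product title.
ITEM_PATTERN = re.compile(
    r'<span class="price">(?P<price>\d+(?:\.\d+)?)</span>'
    r'.*?<a class="name" title="(?P<name>[^"]+)"',
    re.S,  # let ".*?" cross line breaks
)


def extract_items(html):
    """Return a list of (price, name) tuples found in the HTML text."""
    return [(m.group("price"), m.group("name")) for m in ITEM_PATTERN.finditer(html)]


# Print in the required layout: No., price, product name
for index, (price, name) in enumerate(extract_items(SAMPLE_HTML), start=1):
    print(f"{index}\t{price}\t{name}")
```

Regular expressions are brittle against markup changes, which is probably why an HTML parser is often preferred in practice. The sketch only shows how the re requirement could be met.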

🔮(3) Reflections

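This assignment showed me how complex and challenging web crawling can be. Some sites deploy anti-crawling mechanisms, which caused me some difficulties during crawling. Working through them taught me some common anti-crawling strategies and how to deal with them.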
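The post does not say which countermeasures were tried. One commonly used and fairly polite approach is to retry a failed request after a growing pause rather than hammering the site. Here is a minimal sketch under that assumption. fetch_with_backoff and its parameters are illustrative names, and fetch stands for any zero-argument callable that returns the page text or raises.

```python
import time


def fetch_with_backoff(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    The pause doubles after each failure: base_delay, 2 * base_delay, ...
    `sleep` is injectable so the function can be tested without waiting.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the scraper above it could be used as `html = fetch_with_backoff(lambda: scraper.search(1))`. It only retries when the callable raises, so a blocked response that still arrives with a normal status code would need an explicit check inside the callable.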

🪸Assignment ③

🐻‍❄️(1) Requirements

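Requirement: download every JPEG and JPG file from a given page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or from a page of your own choice.
Expected output: all JPEG and JPG files from the chosen page, saved in a single folder.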

🔒(2) Code and screenshots

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Target page URL
url = 'https://news.fzu.edu.cn/yxfd.htm'

# Create a folder to hold the images
folder_name = 'images'
os.makedirs(folder_name, exist_ok=True)

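# Send the HTTP request and fetch the page
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
response.encoding = 'utf-8'  # set to match the page's encoding

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

# Find all <img> tags
img_tags = soup.find_all('img')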

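# Download and save the images
for img in img_tags:
    img_url = img.get('src')
    if img_url:
        # Resolve relative links against the page URL
        img_url = urljoin(url, img_url)
        # Keep only JPEG/JPG files
        if img_url.lower().endswith(('.jpg', '.jpeg')):
            # Fetch the image bytes
            img_response = requests.get(img_url, timeout=10)
            if img_response.status_code != 200:
                continue  # skip images that fail to download
            # Build the local file path from the last URL segment
            img_name = os.path.join(folder_name, img_url.split('/')[-1])
            # Write the image to disk
            with open(img_name, 'wb') as f:
                f.write(img_response.content)
            print(f'Image saved: {img_name}')

print('All images downloaded.')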
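The extension check and file naming above work on the full URL string, so an image served as `photo.jpg?v=2` would be skipped, and two images with the same file name in different directories would overwrite each other. A hedged sketch of helpers that address both cases follows. is_jpeg_url and unique_filename are illustrative names, and the example.com URLs are placeholders.

```python
import os
import posixpath
from urllib.parse import unquote, urlparse


def is_jpeg_url(img_url):
    """True if the URL's path (ignoring query string and fragment) ends in .jpg/.jpeg."""
    path = urlparse(img_url).path.lower()
    return path.endswith((".jpg", ".jpeg"))


def unique_filename(img_url, used_names):
    """Derive a local file name from the URL path, adding _1, _2, ... on collisions."""
    name = posixpath.basename(unquote(urlparse(img_url).path)) or "image.jpg"
    stem, ext = os.path.splitext(name)
    candidate, counter = name, 1
    while candidate in used_names:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    used_names.add(candidate)
    return candidate


used = set()
print(unique_filename("https://example.com/a/logo.jpg", used))
print(unique_filename("https://example.com/b/logo.jpg?size=big", used))
```

In the download loop, `is_jpeg_url(img_url)` could replace the `endswith` test, and `os.path.join(folder_name, unique_filename(img_url, used))` could replace the `split('/')[-1]` expression.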

🗝️(3) Reflections

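Compared with the first two assignments, the goal here was to download images rather than text, so the approach differed slightly: relative src values had to be resolved with urljoin, and each response was written to disk as binary data. The practice helped me pick up some key crawling techniques.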

posted on 2024-10-16 19:25 pandas2
