Data Collection Assignment 1

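I. Task ①
(Requirement: use the requests and BeautifulSoup libraries to scrape data from the given URL, and print the scraped university ranking information to the screen.)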

1. Code and run
(1) Code:
import re
import urllib.request

from bs4 import BeautifulSoup

# Target URL
url = "http://www.shanghairanking.cn/rankings/bcur/2020"

# Fetch the page content with urllib
response = urllib.request.urlopen(url)
html_content = response.read()

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find the table that holds the ranking data
table = soup.find('table')

# Print the header
print("Rank\t\tUniversity\t\tProvince\t\tType\t\tTotal score")
print("-" * 60)  # separator line

# Iterate over the table rows
for row in table.find_all('tr')[1:]:  # skip the header row
    # Extract the cells of this row
    cols = row.find_all('td')
    if len(cols) < 5:
        continue  # skip rows that lack the expected five columns
    rank = cols[0].text.strip()
    school_name_full = cols[1].get_text(strip=True, separator=" ")
    province = cols[2].text.strip()
    school_type = cols[3].text.strip()
    total_score = cols[4].text.strip()

    # Keep only the first run of Chinese characters as the university name
    school_name = re.search(r'[\u4e00-\u9fa5]+', school_name_full)
    school_name = school_name.group(0) if school_name else "unknown"

    # Print the extracted fields, separated by tabs
    print(f"{rank}\t\t{school_name}\t\t\t{province}\t\t{school_type}\t\t\t{total_score}")
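Tab-separated output tends to drift out of alignment when the university names differ in length. One possible alternative, shown here only as a minimal sketch, is to pad the Chinese fields with the full-width space U+3000 through the fill character of `str.format`. The helper name `format_row`, the column widths, and the sample row are hypothetical and are not taken from the scraped data.

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one CJK character


def format_row(rank, name, province, kind, score):
    """Format one ranking row, padding the CJK fields with full-width spaces."""
    # {1:{5}<12} means: left-align argument 1 in 12 cells, filled with argument 5.
    return "{0:<6}{1:{5}<12}{2:{5}<6}{3:{5}<6}{4:>8}".format(
        rank, name, province, kind, score, FULL_WIDTH_SPACE
    )


if __name__ == "__main__":
    # Hypothetical sample row, for illustration only.
    print(format_row("1", "甲大学", "北京", "综合", "100.0"))
```

Alignment still depends on the terminal font rendering full-width characters at exactly twice the width of ASCII characters.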

Run output:

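II. Task ②
(Requirement: use the requests and re libraries to build a price-comparison scraper for a shopping site of your own choice. Search the site with the keyword "书包" (schoolbag) and scrape the product names and prices from the results page.)
1. Code and run
(1) Code: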
import requests
import re

# Placeholder search URL for the shop; replace it with the real one
SEARCH_URL = 'https://www.example.com/search?q=书包'

# Browser-like request headers, to reduce the chance of being blocked by the site
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

def get_search_results(url):
    response = requests.get(url, headers=HEADERS)
    if response.status_code == 200:
        return response.text
    else:
        print("Error fetching the page")
        return None

def parse_product_info(html):
    # Assumes the product data sits inside specific HTML tags; adjust to the real page.
    # A regular expression captures the product name and the price.
    # The tag and class names in this pattern are placeholders (an assumption),
    # not the markup of any particular shop.
    pattern = re.compile(
        r'<div class="product">.*?<h2>(.*?)</h2>.*?<span class="price">(.*?)</span>',
        re.S,
    )
    matches = pattern.findall(html)
    products = []
    for match in matches:
        product_name, product_price = match
        # Strip non-numeric characters, then convert to float
        digits = re.sub(r'[^\d.]', '', product_price)
        try:
            products.append((product_name.strip(), float(digits)))
        except ValueError:
            continue  # skip entries whose price cannot be parsed
    return products

def main():
    html = get_search_results(SEARCH_URL)
    if html:
        products = parse_product_info(html)
        print("No.\tPrice\tProduct name")
        print("-" * 20)
        for index, (product_name, product_price) in enumerate(products, start=1):
            print(f"{index}\t{product_price}\t{product_name}")

if __name__ == "__main__":
    main()
III. Task ③
(Requirement: download all JPEG and JPG files from the given page, https://news.fzu.edu.cn/yxfd.htm .)

1. Code and run
(1) Code:
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

def sanitize_filename(filename):
    # Remove characters that are not allowed in file names
    return re.sub(r'[\\/*?:"<>|]', '', filename)

def save_images_from_url(url, save_dir):
    # Send the HTTP request
    response = requests.get(url)
    response.encoding = 'utf-8'

    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Create the folder that will hold the images
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    # Find all image tags
    img_tags = soup.find_all('img')

    # Download and save the images
    for img in img_tags:
        img_url = img.get('src')
        if img_url:
            # Keep only JPEG or JPG files
            if img_url.lower().endswith(('.jpg', '.jpeg')):
                # Turn a relative path into a full URL
                full_img_url = urljoin(url, img_url)
                try:
                    img_data = requests.get(full_img_url).content
                    file_name = sanitize_filename(full_img_url.split('/')[-1])
                    file_path = os.path.join(save_dir, file_name)
                    with open(file_path, 'wb') as file:
                        file.write(img_data)
                    print(f'Image saved: {file_path}')
                except requests.exceptions.RequestException as e:
                    print(f'Could not download image: {full_img_url}, error: {e}')

# Target page URL
url = 'https://news.fzu.edu.cn/yxfd.htm'
save_dir = 'downloaded_images'

save_images_from_url(url, save_dir)
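The extension check in the code above looks at the whole `src` string, so a link such as `a.jpg?v=2` would be skipped. A possible refinement, given only as a sketch, is to test the path component returned by `urllib.parse.urlparse` and to drop duplicate links before downloading. The helper names `is_jpeg_url` and `collect_jpeg_urls`, and the sample `src` values in the usage lines, are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def is_jpeg_url(img_url):
    """Return True if the URL path ends in .jpg or .jpeg, ignoring query and fragment."""
    path = urlparse(img_url).path.lower()
    return path.endswith((".jpg", ".jpeg"))


def collect_jpeg_urls(page_url, srcs):
    """Resolve src values against page_url and keep unique JPEG URLs in page order."""
    seen = set()
    result = []
    for src in srcs:
        if not src or not is_jpeg_url(src):
            continue
        full = urljoin(page_url, src)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


if __name__ == "__main__":
    # Hypothetical src values, for illustration only.
    sample = ["images/a.jpg?v=2", "/b.JPEG", "c.png", None, "images/a.jpg?v=2"]
    for link in collect_jpeg_urls("https://news.fzu.edu.cn/yxfd.htm", sample):
        print(link)
```

The resulting list could then be passed to the download loop in place of the raw `src` values.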
Run output:

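Reflections:

Code structure and logic

Modular design: the code is split into several functions, each responsible for one task, which makes it easier to read and maintain.
Error handling: get_search_results checks the response status code to catch failed requests, which is good practice.
Regular expressions: parsing HTML with regular expressions is common and works well for simple, stable markup, but complex patterns are hard to maintain and can be slow.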
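The regular-expression point can be illustrated on an inline HTML fragment, so that the pattern can be tried without a network request. This is only a sketch: the markup, the class names `title` and `price`, and the product names are invented for the example and do not come from any real shop.

```python
import re

# Invented sample markup, for illustration only.
SAMPLE_HTML = """
<li class="item"><a class="title">Backpack A</a><span class="price">¥59.90</span></li>
<li class="item"><a class="title">Backpack B</a><span class="price">¥128</span></li>
"""

# Lazy groups (.*?) stop at the first closing tag; re.S lets "." span newlines.
PRODUCT_RE = re.compile(
    r'<a class="title">(.*?)</a>.*?<span class="price">(.*?)</span>',
    re.S,
)


def extract_products(html):
    """Return (name, price) tuples for every product matched in the html string."""
    products = []
    for name, raw_price in PRODUCT_RE.findall(html):
        digits = re.sub(r"[^\d.]", "", raw_price)  # drop currency signs and spaces
        try:
            products.append((name.strip(), float(digits)))
        except ValueError:
            continue  # price text held no parsable number
    return products


if __name__ == "__main__":
    for name, price in extract_products(SAMPLE_HTML):
        print(f"{price}\t{name}")
```

A pattern like this breaks as soon as the site changes its attribute order or adds attributes, which is the maintenance cost mentioned above.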

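Limitations of the code

Assumptions: the code uses a placeholder URL and an assumed HTML structure, both of which must be adapted to the actual target site.
Anti-scraping measures: real sites may load content dynamically, show CAPTCHAs, or limit requests per IP, all of which need extra handling.
Performance: for sites with a lot of content, pagination and request-rate control are needed to avoid putting too much load on the server.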
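The pagination and rate-control point could look roughly like the sketch below. It assumes the caller supplies a `fetch_page` function that takes a page number; that function, the name `crawl_pages`, and the one-second default delay are all hypothetical choices, not part of the assignment code.

```python
import time


def crawl_pages(fetch_page, page_count, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch_page(1) .. fetch_page(page_count), pausing between requests."""
    results = []
    for page in range(1, page_count + 1):
        if page > 1:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch_page(page))
    return results


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without touching a real site.
    pages = crawl_pages(lambda page: f"page-{page}", page_count=3, delay_seconds=0.1)
    print(pages)
```

Passing `sleep` as a parameter keeps the pause easy to replace in tests; in real use the default `time.sleep` applies.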

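Possible improvements

Dynamic content: for content loaded dynamically, a tool such as Selenium may be needed to simulate browser behavior.
Handling anti-scraping measures: setting request headers, using proxy IPs, and handling cookies may be necessary.
Code optimization: the regular expressions and the HTML parsing can be refined further to improve speed and accuracy.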
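As a rough illustration of the headers, proxy, and cookie points, the sketch below configures an opener with the standard library's `urllib.request` rather than `requests`, so that it runs without third-party packages. The function name `build_opener_with_headers`, the user-agent string, and the proxy address in the comment are hypothetical.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_opener_with_headers(user_agent, proxy=None):
    """Return an opener that keeps cookies, sends a custom User-Agent,
    and optionally routes traffic through the given proxy URL."""
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]
    if proxy:
        # e.g. proxy="http://127.0.0.1:8080" (hypothetical address)
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


if __name__ == "__main__":
    opener = build_opener_with_headers("demo-crawler/0.1")
    print(opener.addheaders)
    # opener.open(url) would then send the request with these settings.
```

With `requests`, a `requests.Session` with its `headers` and `proxies` attributes plays a similar role.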

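Compliance and ethics

Legal compliance: a web scraper must comply with the applicable laws and regulations and respect the site's copyright and privacy policy.
Ethical responsibility: use scrapers in moderation, avoid placing unnecessary load on the site, and respect its robots.txt file.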
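Respecting robots.txt can be checked in code with the standard library's `urllib.robotparser`. The sketch below parses an inline sample so that it runs offline; the sample rules, the crawler name `demo-crawler`, and the example URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text and return a parser that answers can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_robot_rules(SAMPLE_ROBOTS)
    for link in (
        "https://www.example.com/search?q=bag",
        "https://www.example.com/private/page.html",
    ):
        print(link, rules.can_fetch("demo-crawler", link))
```

Against a live site, `set_url()` followed by `read()` would load the real robots.txt instead of the inline sample.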

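Summary

Writing a web scraper means weighing code structure, performance, compliance, and ethical responsibility. Modular design, error handling, and regular expressions make it possible to extract and process page data effectively. The site's anti-scraping measures also need attention and an appropriate response. Above all, the scraper must be used in a way that is legal and ethical.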

posted @ 2024-10-20 19:59 淋祁
