Data Collection: Assignment 1

Assignment ①

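1) Requirements: Using the requests and BeautifulSoup libraries, crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the university ranking information to the screen.
Output format:

Rank	University	Province	Type	Total score
1	Tsinghua University	Beijing	Comprehensive	852.5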
2

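Code excerpt:

import urllib.request
import urllib.error
from bs4 import BeautifulSoup

# Setup assumed from the assignment text; the original excerpt starts at the try block.
url = 'http://www.shanghairanking.cn/rankings/bcur/2020'
request = urllib.request.Request(url)

try:
    response = urllib.request.urlopen(request)
    html_content = response.read()

    soup = BeautifulSoup(html_content, 'html.parser')

    # soup.find returns the first <table> element on the page.
    table = soup.find('table')

    row_format = "{:<10} {:<30} {:<10} {:<10} {:<10}"
    print(row_format.format("Rank", "University", "Province", "Type", "Total score"))

    rows = table.find_all('tr')
    for row in rows[1:]:  # skip the header row
        cols = row.find_all('td')
        if len(cols) >= 5:
            rank = cols[0].text.strip()
            # The name cell spans several lines; keep only the first one.
            school_name = cols[1].text.strip().split('\n')[0]
            province = cols[2].text.strip()
            school_type = cols[3].text.strip()
            total_score = cols[4].text.strip()
            print(row_format.format(rank, school_name, province, school_type, total_score))
        else:
            print(f"Skipped row: {[col.text.strip() for col in cols]}")
except urllib.error.URLError as e:
    print(f"Request failed: {e}")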

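Screenshot of the run:

2) Reflections:

I learned two lessons in this exercise. First, BeautifulSoup raised an error because the lxml parser was not installed. Installing it fixed the problem and taught me to check a library's dependencies up front. Second, the table data printed with ragged columns. Formatting the output with f-strings made it tidier, a reminder that how data is presented affects how easily it can be read and used. Both problems showed me that details matter and that every small issue is a chance to improve.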
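The two lessons above can be sketched in a few lines. This is a minimal illustration, not the code used in the assignment: `pick_parser` and `format_row` are hypothetical helper names, and padding with the full-width space (U+3000) is an assumed technique for keeping columns of Chinese text roughly aligned. The sample row uses the Chinese text as it appears on the site.

```python
import importlib.util

# U+3000 is as wide as one CJK character, so padding with it keeps
# columns of Chinese text roughly aligned in a monospaced terminal.
FULLWIDTH_SPACE = chr(12288)


def pick_parser():
    """Return "lxml" if that parser is installed, else the built-in "html.parser"."""
    return "lxml" if importlib.util.find_spec("lxml") else "html.parser"


def format_row(rank, name, province, kind, score):
    """Build one table row with an f-string (hypothetical column widths)."""
    pad = FULLWIDTH_SPACE
    return f"{rank:<6}{name:{pad}<12}{province:{pad}<6}{kind:{pad}<6}{score:>8}"


print("parser:", pick_parser())
print(format_row("排名", "学校名称", "省市", "学校类型", "总分"))
print(format_row("1", "清华大学", "北京", "综合", "852.5"))
```

The value returned by `pick_parser()` can be passed as the second argument to `BeautifulSoup(html, ...)`, so the script still runs on a machine without lxml.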

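Assignment ②

1) Requirements:
Using the requests and re libraries, design a targeted price-comparison crawler for an online store of your choice: search the store for the keyword “书包” (schoolbag) and scrape the product names and prices from the results page.
Output format:

No.	Price	Product name
1	65.00	xxx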
2

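Code excerpt:

# `matches` is the list of (title, price) tuples captured by the regular
# expression earlier in the script (not shown in this excerpt).

# Find the price attached to the entry titled "当当" (the store's own name).
# The for/else leaves dangdang_price as None when no such entry exists.
for title, price in matches:
    if title.strip() == "当当":
        dangdang_price = price.strip()
        break
else:
    dangdang_price = None

# Drop the store-name entry so that only real products remain.
filtered_matches = [(title, price) for title, price in matches if title.strip() != "当当"]

prev_price = dangdang_price

# Each product is printed with the price captured in the previous match,
# starting from the "当当" entry's price; if there is none, its own price is used.
for title, price in filtered_matches:
    shown_price = prev_price if prev_price else price.strip()
    print("Product name:", title.strip())
    print("Price:", shown_price.replace("&yen;", ""))
    prev_price = price.strip()
    print("------")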

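Screenshot of the run:

2) Reflections:

In this exercise I learned to extract data from a web page with regular expressions, a skill that was entirely new to me. I expected a simple, direct process, but I soon found that the page structure was far more complex than I had imagined. By repeatedly debugging and refining my regular expression, I gradually learned how to locate and extract exactly the information I wanted. The work deepened my understanding of crawlers and made me more confident about running into problems and finding solutions.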
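As an illustration of the regex approach described above, here is a minimal sketch that pulls (name, price) pairs out of a results page. The HTML snippet and its class names are made up for the example and are not the real store's markup, and `extract_items` is a hypothetical helper, so the pattern would need adapting to the actual page.

```python
import re

# Made-up markup for illustration only; a real store's page will differ.
SAMPLE_HTML = """
<li><a class="name" title="Backpack A" href="/p/1">Backpack A</a><span class="price">&yen;65.00</span></li>
<li><a class="name" title="Backpack B" href="/p/2">Backpack B</a><span class="price">&yen;129.90</span></li>
"""

# One pattern captures the name and the price together, so every name
# stays paired with the price that follows it in the same list item.
ITEM_PATTERN = re.compile(
    r'<a class="name" title="(?P<title>[^"]+)".*?'
    r'<span class="price">&yen;(?P<price>[\d.]+)</span>',
    re.S,
)


def extract_items(html):
    """Return a list of (title, price) string pairs found in html."""
    return [(m.group("title").strip(), m.group("price")) for m in ITEM_PATTERN.finditer(html)]


# Print in the layout the assignment asks for: number, price, product name.
for number, (title, price) in enumerate(extract_items(SAMPLE_HTML), start=1):
    print(f"{number}\t{price}\t{title}")
```

The non-greedy `.*?` combined with `re.S` lets the pattern skip over whatever markup sits between the name and its price, including line breaks.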

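Assignment ③

1) Requirements:
Crawl all JPEG and JPG files from a given page (https://news.fzu.edu.cn/yxfd.htm) or from a page of your choice.
Output:
Save all JPEG and JPG files from the chosen page into one folder.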

Code excerpt:

import os
import requests

# get_html() and parse_page() are defined elsewhere in the full script
# (not shown in this excerpt).

def save_news(news_data, images_folder='images'):
    # Create the output folder if it does not exist yet.
    os.makedirs(images_folder, exist_ok=True)
    for i, news in enumerate(news_data):
        try:
            response = requests.get(news['img_url'], stream=True)
            if response.status_code == 200:
                # Stream the image to disk in 1 KB chunks.
                with open(f'{images_folder}/img_{i}.jpg', 'wb') as f:
                    for chunk in response.iter_content(1024):
                        f.write(chunk)
            else:
                print(f'Failed to download image: {news["img_url"]}')
        except Exception as e:
            print(f'Error occurred while processing image {news["img_url"]}: {e}')

def crawl_fzu_news():
    url = 'https://news.fzu.edu.cn/yxfd.htm'
    html = get_html(url)
    news_data = parse_page(html)
    save_news(news_data)

if __name__ == '__main__':
    crawl_fzu_news()

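Screenshot of the run:

2) Reflections:

Writing this Python script taught me a lot about network requests, HTML parsing, and file operations. It made me realize that scraping data from a web page and saving it locally is not a simple task: it involves several steps, and each one needs careful handling. First, I used the requests library to send the network request. Next, I used BeautifulSoup to parse the HTML. Finally, I saved the images to local disk. Overall, the project taught me many practical programming techniques and gave me a deeper understanding of network programming. I ran into some challenges along the way, but through repeated attempts and debugging I completed the task, which was very satisfying. I believe this experience will be useful in my future programming work.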
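The assignment asks specifically for JPEG and JPG files, so one step worth sketching is filtering image URLs by extension before downloading. The example below is an assumption-laden illustration rather than the script's actual `parse_page`: it swaps in the standard library's `HTMLParser` for BeautifulSoup so that it runs without extra packages, and `ImgSrcCollector`, `jpeg_urls`, and the sample HTML are all hypothetical.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def jpeg_urls(html, base_url):
    """Return absolute URLs of images whose path ends in .jpg or .jpeg."""
    collector = ImgSrcCollector()
    collector.feed(html)
    urls = []
    for src in collector.sources:
        absolute = urljoin(base_url, src)  # resolve relative paths
        # Check the extension on the path only, ignoring any query string.
        extension = os.path.splitext(urlparse(absolute).path)[1].lower()
        if extension in (".jpg", ".jpeg"):
            urls.append(absolute)
    return urls


# Made-up markup for illustration only.
SAMPLE_HTML = '<img src="/images/a.jpg"><img src="b.PNG"><img src="pic/c.jpeg?x=1">'
print(jpeg_urls(SAMPLE_HTML, "https://news.fzu.edu.cn/yxfd.htm"))
```

Each returned URL could then be stored under the `img_url` key that `save_news` reads.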

Posted on 2024-10-17 15:46 by 吴鱼子