
2023 Data Collection and Fusion Technology Practice: Assignment 1

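Assignment ①

o *Requirement:* Use the requests and BeautifulSoup libraries to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the university ranking information to the screen.

o *Output:*

*Rank*	*University*	*Province/City*	*Type*	*Total score*
1	清华大学	北京	综合	852.5
2......

Code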

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.shanghairanking.cn/rankings/bcur/2020"
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
soup = BeautifulSoup(html, 'html.parser')

# Keep only the first four rows of the ranking table
tr_list = soup.find('tbody').find_all('tr')[0:4]

print("Rank\tUniversity\tProvince/City\tType\tTotal score")
for row in tr_list:
    # Get the cells of the current row
    cols = row.find_all('td')
    # Extract rank, university name, province/city, type, and total score
    rank = cols[0].text.strip()
    name = cols[1].text.strip()
    location = cols[2].text.strip()
    school_type = cols[3].text.strip()  # avoids shadowing the built-in type()
    score = cols[4].text.strip()
    print(rank + "\t" + name + "\t" + location + "\t" + school_type + "\t" + score)

Output

Reflections

The output code was originally:

for tr in tr_list:
  td_list = tr.find_all('td')
  print(f"{td_list[0].text}\t{td_list[1].text}\t{td_list[2].text}\t{td_list[3].text}\t{td_list[4].text}", end=" ")

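But even with \t as the separator the output still broke across lines, because the text of each cell carries its own surrounding whitespace and newlines, so the formatting was untidy. I changed it to:

for row in tr_list:
    # Get the cells of the current row
    cols = row.find_all('td')
    # Extract rank, university name, province/city, type, and total score
    rank = cols[0].text.strip()
    name = cols[1].text.strip()
    location = cols[2].text.strip()
    school_type = cols[3].text.strip()  # avoids shadowing the built-in type()
    score = cols[4].text.strip()
    print(rank + "\t" + name + "\t" + location + "\t" + school_type + "\t" + score)

so that each row's output stays on one line as far as possible.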
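
The effect of strip() can be seen on made-up cell strings. The sketch below is illustrative only: the sample text and the helper name clean_cell are assumptions, not taken from the ranking page. It also shows one possible way to collapse whitespace that sits inside the text, which strip() alone leaves in place.

```python
def clean_cell(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Hypothetical raw cell text, similar to what a <td> with nested tags might yield
raw_cells = ["\n  1\n ", "\n 清华大学 \n Tsinghua University \n", " 北京 ", " 综合 ", " 852.5\n"]

stripped = [c.strip() for c in raw_cells]     # removes only leading/trailing whitespace
cleaned = [clean_cell(c) for c in raw_cells]  # also removes inner newlines

print("\t".join(cleaned))
```

With strip() alone, the second sample cell still contains a newline between the two names, which is the kind of residue that can leave a row spread over more than one line.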

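Because this site's anti-crawling measures are not strict, this exercise was fairly easy and took less time than the others. I started from a crawler template and adapted it to the assignment. Completing it gave me a basic understanding of web crawlers.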

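Assignment ②

o *Requirement:* Use the requests and re libraries to build a targeted price-comparison crawler for an online store of your choice. Search the store with the keyword “书包” (school bag) and extract the product names and prices from the results page.

o *Output:*

*No.*	*Price*	*Product name*
1	65.00	xxx
2......

Code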

from urllib.parse import quote

from bs4 import BeautifulSoup
import requests
import re

def gethtml(url):
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r.text

def getGoodsMsg(msg):
    soup = BeautifulSoup(msg, 'html.parser')
    # Search results are <li> items inside <ul class="bigimg">
    ul = soup.find('ul', class_='bigimg')
    if ul is None:
        print("No result list found")
        return
    li = ul.find_all("li")

    print("No.\tPrice\tProduct name")
    # "单品标题" is an attribute value in the page's HTML, so it stays untranslated
    title_regex = r'<a.*?单品标题.*?>(.*?)</a>'
    price_regex = r'<p class="price">.*?<span class="search_now_price">(.*?)</span>'
    count = 1
    for i in li:
        price = re.search(price_regex, str(i))
        name = re.search(title_regex, str(i))
        # Skip items where either pattern fails to match
        if not (price and name):
            continue
        price = price.group(1)
        # Strip any tags nested inside the title link
        name = BeautifulSoup(name.group(1), 'html.parser').get_text()
        print(str(count) + "\t" + price + "\t" + name)
        count += 1

if __name__ == "__main__":
    # The key parameter carries the search keyword percent-encoded as GBK
    pageurl = 'http://search.dangdang.com/?key=' + quote('书包', encoding='gbk') + '&act=input'
    msg = gethtml(pageurl)
    getGoodsMsg(msg)

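Run results

Reflections

I first tried to crawl Taobao, but fetching the page redirected automatically to the login page, and Taobao's anti-crawling measures are heavy, so I switched to Pinduoduo, which turned out to have the same problem. After trying several shopping sites, I found that Dangdang lets you search for products without logging in, so I chose it as the final target. I inspected the page's tags with F12, then iterated over the sibling list items and used regular expressions to extract each product's name and price, printing them in the required format. This assignment gave me a deeper understanding of regular expressions.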
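
The extraction step can be tried offline on a small fragment. The following is a minimal sketch under stated assumptions: sample_item is a made-up snippet that only imitates one search-result item (the real Dangdang markup may differ), and extract_item, TITLE_RE, PRICE_RE and TAG_RE are hypothetical names introduced for illustration. It uses only the re module and returns None when either pattern does not match.

```python
import re

# Hypothetical fragment imitating one search-result item; real markup may differ
sample_item = (
    '<li><a title="xxx" dd_name="单品标题" href="#">Sample <font>backpack</font></a>'
    '<p class="price"><span class="search_now_price">65.00</span></p></li>'
)

TITLE_RE = re.compile(r'<a[^>]*单品标题[^>]*>(.*?)</a>', re.S)
PRICE_RE = re.compile(r'<span class="search_now_price">(.*?)</span>', re.S)
TAG_RE = re.compile(r'<[^>]+>')  # crude tag stripper for the captured title


def extract_item(item_html):
    """Return (price, name), or None when either pattern does not match."""
    price = PRICE_RE.search(item_html)
    name = TITLE_RE.search(item_html)
    if price is None or name is None:
        return None
    return price.group(1), TAG_RE.sub("", name.group(1)).strip()


print(extract_item(sample_item))
print(extract_item("<li>no price here</li>"))
```

Checking both matches before calling group() keeps one malformed list item from stopping the whole run.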

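Assignment ③

o *Requirement:* Crawl all JPEG and JPG files from a given page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or a page of your choice.

o *Output:* Save all JPEG and JPG files from the chosen page in one folder.

Code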

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import os

def get_html(url):
  r = requests.get(url)
  r.raise_for_status()
  r.encoding = r.apparent_encoding
  return r.text

def get_img(html, path, base_url):
  soup = BeautifulSoup(html, 'html.parser')
  # The article body sits in <div id="vsb_content">
  body = soup.find('div', id='vsb_content')
  if body is None:
    return
  # Collect the src of every <img>, keeping only JPEG/JPG files
  img_list = [img.get('src') for img in body.find_all('img') if img.get('src')]
  img_list = [s for s in img_list if s.lower().split('?')[0].endswith(('.jpg', '.jpeg'))]

  index = 1
  for img_url in img_list:
    # Image paths in the page are relative; resolve them against the page URL
    img_url = urljoin(base_url, img_url)
    img_data = requests.get(img_url).content
    with open(os.path.join(path, str(index)+'.jpg'), 'wb') as f:
      f.write(img_data)
    print('Image {} downloaded'.format(index))
    index += 1

if __name__ == '__main__':
  url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
  path = r'C:\Users\lenovo\Desktop\大学\数据采集与融合技术\实践\作业1_images'
  os.makedirs(path, exist_ok=True)
  html = get_html(url)
  get_img(html, path, url)

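Run results

Reflections

This crawl needed neither anti-crawling workarounds nor pagination, which made it much easier. Even so, the finished code produced no output for a long time. I eventually found that the image file names in the page source are relative paths, and the prefix http://xcb.fzu.edu.cn has to be added before they point to the actual files.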
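
Adding the prefix by hand works for this page. As an alternative, urllib.parse.urljoin from the standard library resolves a relative path against the page URL. The sketch below is illustrative: the src values are invented, and is_jpeg is a hypothetical helper that keeps only .jpg and .jpeg files as the assignment asks.

```python
from urllib.parse import urljoin

PAGE_URL = "https://xcb.fzu.edu.cn/info/1071/4481.htm"

# Hypothetical src values; the actual paths on the page are not reproduced here
srcs = ["/__local/a/b/photo1.jpg", "images/photo2.JPEG", "https://example.com/logo.png"]


def is_jpeg(url):
    """True when the URL (ignoring any query string) ends in .jpg or .jpeg."""
    return url.lower().split("?", 1)[0].endswith((".jpg", ".jpeg"))


# Relative paths are resolved against the page URL; absolute URLs pass through unchanged
absolute = [urljoin(PAGE_URL, s) for s in srcs]
jpeg_urls = [u for u in absolute if is_jpeg(u)]

for u in jpeg_urls:
    print(u)
```

urljoin handles both root-relative paths (starting with /) and paths relative to the page's directory, so no startswith('http') check is needed.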

Posted 2023-09-21 17:14 by 失重漂流
