102102146 洪松渝 Data Collection and Fusion Technology, Assignment 1
Assignment ①:
Requirement: Use the requests and BeautifulSoup libraries to scrape the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the scraped university ranking information to the screen.
Output:

| 排名 (Rank) | 学校名称 (University) | 省市 (Province/City) | 学校类型 (Type) | 总分 (Total score) |
| --- | --- | --- | --- | --- |
| 1 | 清华大学 | 北京 | 综合 | 852.5 |
| 2 | ...... | ...... | ...... | ...... |
Code:
```python
import urllib.request
import bs4
from bs4 import BeautifulSoup


def getText(url):
    """Fetch the ranking page and return its HTML, or '' on failure."""
    try:
        req = urllib.request.Request(url)
        resp = urllib.request.urlopen(req)
        return resp.read().decode()
    except Exception:
        return ""


def UniList(ulist, html):
    """Parse each table row into a [rank, name, score] triple."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            index = tr('td')[0].text.replace('\n', '').replace(' ', '')
            name = tr('a', class_='name-cn')[0].text.replace('\n', '').replace(' ', '')
            score = tr('td')[4].text.replace('\n', '').replace(' ', '')
            ulist.append([index, name, score])


def printList(ulist, num):
    """Print the first num entries, padding with the full-width space
    (chr(12288)) so the Chinese columns line up."""
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'https://www.shanghairanking.cn/rankings/bcur/202011'
    html = getText(url)   # fetch the ranking page
    UniList(uinfo, html)  # extract rank, name, and score
    printList(uinfo, 26)  # print the top entries


main()
```
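The assignment asks for the requests library, while the fetch above goes through urllib.request. For reference, a minimal requests-based version of getText (the 10-second timeout is my own choice, not part of the original code):

```python
import requests


def getText(url):
    """Fetch the page with requests; return '' on any request error."""
    try:
        r = requests.get(url, timeout=10)  # timeout is an assumption, not in the original
        r.raise_for_status()               # turn HTTP error codes into exceptions
        r.encoding = r.apparent_encoding   # let requests guess the encoding
        return r.text
    except requests.RequestException:
        return ""
```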
Run screenshot:
Reflections: This was my first hands-on attempt at scraping a static web page, and my first time using the requests and bs4 libraries.
Assignment ②:
Requirement: Use the requests and re libraries to build a price-comparison crawler for an online store of your choice. Search the store for the keyword “书包” (schoolbag) and scrape the product names and prices from the results page.
Output:

| 序号 (No.) | 价格 (Price) | 商品名 (Product name) |
| --- | --- | --- |
| 1 | 65.00 | xxx |
| 2 | ...... | ...... |
Code:
```python
import requests
from bs4 import BeautifulSoup

# Dangdang search results for the keyword "书包" (GBK URL-encoded as %CA%E9%B0%FC)
url = "http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0"
}

response = requests.get(url=url, headers=headers)
response.encoding = 'GBK'  # the page is GBK-encoded
soup = BeautifulSoup(response.text, 'html.parser')

# product links carry the name in a title attribute; prices sit in span.price_n
names = soup.find_all('a', class_='pic')
prices = soup.find_all('span', class_='price_n')

for i, (name, price) in enumerate(zip(names, prices), 1):
    print(f"编号:{i}")
    print(f"名称:{name.get('title')}")
    print(f"价格:{price.text}")
    print()
```
Result screenshot:
Reflections:
I found the regex approach rather cumbersome, so I switched to the BeautifulSoup library to parse the page instead; a sketch of what the re-based extraction might look like follows.
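For comparison, here is a minimal sketch of the re-based extraction the assignment originally asked for. The patterns assume the price sits in `<span class="price_n">&yen;…</span>` and the product title in a `title` attribute on the `a.pic` link, with `class` appearing before `title`; the live page may differ.

```python
import re
import requests

url = "http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input"
headers = {'User-Agent': "Mozilla/5.0"}

resp = requests.get(url, headers=headers, timeout=10)
resp.encoding = 'GBK'

# Assumed markup; adjust the patterns if the page structure differs.
prices = re.findall(r'<span class="price_n">&yen;([\d.]+)</span>', resp.text)
names = re.findall(r'<a[^>]*class="pic"[^>]*title="([^"]+)"', resp.text)

for i, (name, price) in enumerate(zip(names, prices), 1):
    print(f"{i}\t{price}\t{name}")
```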
Assignment ③:
Requirement: Crawl a given page (https://xcb.fzu.edu.cn/info/1071/4481.htm), or a page of your own choice, and download all of its JPEG and JPG files.
Output: save all JPEG and JPG files from the chosen page into a single folder.
Code:
```python
import os
import requests
from bs4 import BeautifulSoup

url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
hd = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}


def getHTML(url):
    """Fetch the page, or return '' on any request error."""
    try:
        r = requests.get(url, headers=hd)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''


def download_img(url, img_name):
    """Download one image to img_name."""
    response = requests.get(url=url, headers=hd).content  # use the parameter, not a global
    with open(img_name, 'wb') as f:
        f.write(response)


text = getHTML(url)
soup = BeautifulSoup(text, 'html.parser')
imglist = soup.find_all('img')

folder_name = 'images2'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

for img in imglist:
    url_ = img.attrs['src']
    img_url = 'https://xcb.fzu.edu.cn{}'.format(url_)
    img_name = os.path.basename(img_url).split('?')[0]
    # keep only JPEG/JPG files, as the assignment requires
    if not img_name.lower().endswith(('.jpg', '.jpeg')):
        continue
    img_path = os.path.join(folder_name, img_name)
    download_img(img_url, img_path)
    print(f'下载图片:{img_path}')
```
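One fragility in the code above: it prefixes `https://xcb.fzu.edu.cn` onto every `src`, which breaks if an image URL is already absolute. The standard-library `urljoin` handles both cases; a small sketch (the example paths are illustrative, not taken from the actual page):

```python
from urllib.parse import urljoin

page_url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'

# urljoin resolves relative paths against the page URL
# and leaves absolute URLs untouched:
print(urljoin(page_url, '/images/photo.jpg'))              # -> https://xcb.fzu.edu.cn/images/photo.jpg
print(urljoin(page_url, 'https://cdn.example.com/x.jpg'))  # unchanged
```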
Result screenshot:
Reflections:
This assignment gave me an initial understanding of, and hands-on practice with, scraping images.