Data Collection and Fusion Technology: Assignment 1
Task 1
Requirement: use the requests and BeautifulSoup libraries to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the scraped university ranking information to the screen.
Expected output:
| Rank | University | Province/City | Type | Total Score |
|---|---|---|---|---|
| 1 | 清华大学 | 北京 | 综合 | 852.5 |
| 2 | ... | ... | ... | ... |
Code
```python
from bs4 import BeautifulSoup
import requests

url = "http://www.shanghairanking.cn/rankings/bcur/2020"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # the page is UTF-8; 'gkd' is not a valid codec name
data = response.text
soup = BeautifulSoup(data, "html.parser")

# Each ranked university is one <tr> inside the ranking table's <tbody>
university_list = soup.select("tbody tr")

# {5} is the fill character: the full-width space chr(12288), so that
# Chinese text centers correctly in fixed-width columns
template = "{0:{5}^13}\t{1:{5}^13}\t{2:{5}^13}\t{3:{5}^10}\t{4:{5}^10}"
print(template.format("排名", "学校", "省市", "类型", "总分", chr(12288)))
for item in university_list[0:30]:
    tds = item.select("td")
    # Keep the first five cells: rank, name, province, type, total score
    info = [td.text.replace(" ", "").replace("\n", "") for td in tds[0:5]]
    print(template.format(info[0], info[1], info[2], info[3], info[4], chr(12288)))
```
Screenshot

Reflections
The scraped columns did not line up neatly at first: Chinese characters render as double width, so padding with ordinary ASCII spaces misaligns mixed Chinese/ASCII text, and a full-width fill character such as chr(12288) is needed (see the sketch below).
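To make the alignment point concrete, here is a minimal self-contained sketch (the two sample rows are hypothetical, not scraped) contrasting ASCII-space padding with full-width-space padding:

```python
# Minimal column-alignment demo; the sample rows are made up for illustration.
rows = [("1", "清华大学", "北京"), ("2", "上海交通大学", "上海")]

ascii_tpl = "{0:^6}\t{1:^13}\t{2:^8}"         # pads with half-width ASCII spaces
cjk_tpl = "{0:{3}^6}\t{1:{3}^13}\t{2:{3}^8}"  # pads with a caller-supplied fill char

print("ASCII-space fill (columns drift):")
for row in rows:
    print(ascii_tpl.format(*row))

print("Full-width-space fill (columns line up):")
for row in rows:
    print(cjk_tpl.format(*row, chr(12288)))
```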
Task 2
Requirement: search a shopping site with the keyword "书包" (schoolbag) and scrape the product names and prices from the results page.
Expected output:
| No. | Price | Product Name |
|---|---|---|
| 1 | 65.00 | xxx |
| 2 | ... | ... |
Code
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36 Edg/117.0.2045.43',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}

# "%CA%E9%B0%FC" is the keyword "书包" percent-encoded in GBK
url = 'http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input'
response = requests.get(url, headers=headers)
response.encoding = 'gbk'  # the search page is served in GBK, not UTF-8
html = response.text
soup = BeautifulSoup(html, "lxml")

# Each product is an <li ddt-pit="...">; the name sits in <p class="name">,
# the price in <p class="price">
names = soup.select("li[ddt-pit] p.name a")
prices = soup.select("li[ddt-pit] p.price span")
for i, (name, price) in enumerate(zip(names, prices), start=1):
    print(i, name["title"], price.text)
```
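A note on the encoding line that was originally commented out: requests takes the charset from the Content-Type header and otherwise falls back to ISO-8859-1, which turns a GBK page into mojibake. Here is a minimal sketch of a safer default that avoids hard-coding the codec, using only documented requests attributes (`apparent_encoding` sniffs the charset from the body bytes):

```python
import requests

resp = requests.get('http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input')
# resp.encoding comes from the Content-Type header; when it is missing or
# the generic ISO-8859-1 fallback, trust the body-sniffed guess instead.
if resp.encoding is None or resp.encoding.lower() == 'iso-8859-1':
    resp.encoding = resp.apparent_encoding
print(resp.encoding)  # should report a GBK-family codec for this page
```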
Screenshot

Reflections
Taobao's anti-scraping measures are fairly thorough, so I scraped Dangdang instead.
Task 3
Requirement: crawl all JPEG and JPG files from a given page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or a page of your choice.
Expected output: all JPEG and JPG files from the chosen page saved into a single folder.
Code
```python
import requests
import os
from urllib.parse import urljoin
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36 Edg/117.0.2045.43',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
response = requests.get(url, headers=headers)
response.encoding = 'utf8'
html = response.text
soup = BeautifulSoup(html, "html.parser")
img_list = soup.find_all('img', src=True)

img_path = "./fzuxcbedu"
if not os.path.exists(img_path):
    os.mkdir(img_path)

count = 0
for img in img_list:
    # Resolve the src against the page URL; this handles both relative
    # paths and already-absolute URLs, unlike hard-prefixing the host
    img_url = urljoin(url, img.attrs['src'])
    # The task only asks for JPEG/JPG files, so skip other formats
    if not img_url.lower().endswith(('.jpg', '.jpeg')):
        continue
    count += 1
    img_response = requests.get(img_url, headers=headers)
    file_name = f'{img_path}/图片{count}.jpg'
    with open(file_name, 'wb') as f:
        f.write(img_response.content)
    print(img_url)
```
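The first version of this loop prefixed `https://xcb.fzu.edu.cn` onto every src, which breaks when a page uses absolute image URLs. urljoin resolves both cases correctly (the cdn.example.com URL below is hypothetical, just to show pass-through):

```python
from urllib.parse import urljoin

base = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
print(urljoin(base, '/images/a.jpg'))  # https://xcb.fzu.edu.cn/images/a.jpg
print(urljoin(base, 'b.jpg'))          # https://xcb.fzu.edu.cn/info/1071/b.jpg
print(urljoin(base, 'https://cdn.example.com/c.jpg'))  # absolute URL passes through
```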
Screenshot


Reflections
BeautifulSoup was easier to use than the re library: CSS selectors map directly onto the document structure, whereas a regular expression has to be written against the raw HTML text.
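For comparison, here is a minimal sketch of the same image extraction done with re (the pattern is an illustrative assumption: it only handles double-quoted src attributes and will miss less regular markup, which is exactly why BeautifulSoup feels simpler):

```python
import re
import requests
from urllib.parse import urljoin

url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
html = requests.get(url).text

# Naive pattern: captures the double-quoted src value of each <img> tag
pattern = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)
for src in pattern.findall(html):
    img_url = urljoin(url, src)
    if img_url.lower().endswith(('.jpg', '.jpeg')):
        print(img_url)
```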
