Data Collection and Fusion Technology: Assignment 1
Task 1
Requirement: use the requests and BeautifulSoup libraries to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the scraped university ranking information to the screen.
Expected output:
| Rank | University | Province/City | Type | Total Score |
|---|---|---|---|---|
| 1 | 清华大学 | 北京 | 综合 | 852.5 |
| 2 | ... | ... | ... | ... |
Code
```python
from bs4 import BeautifulSoup
import requests

url = "http://www.shanghairanking.cn/rankings/bcur/2020"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # the page is UTF-8; 'gkd' is not a valid codec name
data = response.text
soup = BeautifulSoup(data, "html.parser")

# Each ranked university is one <tr> inside the ranking table's <tbody>
university_list = soup.select("tbody tr")

# {5} is the fill character: the full-width space chr(12288), so that
# Chinese text centers correctly in fixed-width columns
template = "{0:{5}^13}\t{1:{5}^13}\t{2:{5}^13}\t{3:{5}^10}\t{4:{5}^10}"
print(template.format("排名", "学校", "省市", "类型", "总分", chr(12288)))
for item in university_list[0:30]:
    tds = item.select("td")
    # Keep the first five cells: rank, name, province, type, total score
    info = [td.text.replace(" ", "").replace("\n", "") for td in tds[0:5]]
    print(template.format(info[0], info[1], info[2], info[3], info[4], chr(12288)))
```
Screenshot

Reflections
The scraped columns did not line up neatly at first: Chinese characters render as double width, so padding with ordinary ASCII spaces misaligns mixed Chinese/ASCII text, and a full-width fill character such as chr(12288) is needed (see the sketch below).
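To make the alignment point concrete, here is a minimal self-contained sketch (the two sample rows are hypothetical, not scraped) contrasting ASCII-space padding with full-width-space padding:

```python
# Minimal column-alignment demo; the sample rows are made up for illustration.
rows = [("1", "清华大学", "北京"), ("2", "上海交通大学", "上海")]

ascii_tpl = "{0:^6}\t{1:^13}\t{2:^8}"         # pads with half-width ASCII spaces
cjk_tpl = "{0:{3}^6}\t{1:{3}^13}\t{2:{3}^8}"  # pads with a caller-supplied fill char

print("ASCII-space fill (columns drift):")
for row in rows:
    print(ascii_tpl.format(*row))

print("Full-width-space fill (columns line up):")
for row in rows:
    print(cjk_tpl.format(*row, chr(12288)))
```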
Task 2
Requirement: search a shopping site with the keyword "书包" (schoolbag) and scrape the product names and prices from the results page.
Expected output:
| No. | Price | Product Name |
|---|---|---|
| 1 | 65.00 | xxx |
| 2 | ... | ... |
Code
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36 Edg/117.0.2045.43',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}

# "%CA%E9%B0%FC" is the keyword "书包" percent-encoded in GBK
url = 'http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input'
response = requests.get(url, headers=headers)
response.encoding = 'gbk'  # the search page is served in GBK, not UTF-8
html = response.text
soup = BeautifulSoup(html, "lxml")

# Each product is an <li ddt-pit="...">; the name sits in <p class="name">,
# the price in <p class="price">
names = soup.select("li[ddt-pit] p.name a")
prices = soup.select("li[ddt-pit] p.price span")
for i, (name, price) in enumerate(zip(names, prices), start=1):
    print(i, name["title"], price.text)
```
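A note on the encoding line that was originally commented out: requests takes the charset from the Content-Type header and otherwise falls back to ISO-8859-1, which turns a GBK page into mojibake. Here is a minimal sketch of a safer default that avoids hard-coding the codec, using only documented requests attributes (`apparent_encoding` sniffs the charset from the body bytes):

```python
import requests

resp = requests.get('http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input')
# resp.encoding comes from the Content-Type header; when it is missing or
# the generic ISO-8859-1 fallback, trust the body-sniffed guess instead.
if resp.encoding is None or resp.encoding.lower() == 'iso-8859-1':
    resp.encoding = resp.apparent_encoding
print(resp.encoding)  # should report a GBK-family codec for this page
```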
Screenshot

Reflections
Taobao's anti-scraping measures are fairly thorough, so I scraped Dangdang instead.
Task 3
Requirement: crawl all JPEG and JPG files from a given page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or a page of your choice.
Expected output: all JPEG and JPG files from the chosen page saved into a single folder.
Code
```python
import requests
import os
from urllib.parse import urljoin
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36 Edg/117.0.2045.43',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
response = requests.get(url, headers=headers)
response.encoding = 'utf8'
html = response.text
soup = BeautifulSoup(html, "html.parser")
img_list = soup.find_all('img', src=True)

img_path = "./fzuxcbedu"
if not os.path.exists(img_path):
    os.mkdir(img_path)

count = 0
for img in img_list:
    # Resolve the src against the page URL; this handles both relative
    # paths and already-absolute URLs, unlike hard-prefixing the host
    img_url = urljoin(url, img.attrs['src'])
    # The task only asks for JPEG/JPG files, so skip other formats
    if not img_url.lower().endswith(('.jpg', '.jpeg')):
        continue
    count += 1
    img_response = requests.get(img_url, headers=headers)
    file_name = f'{img_path}/图片{count}.jpg'
    with open(file_name, 'wb') as f:
        f.write(img_response.content)
    print(img_url)
```
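The first version of this loop prefixed `https://xcb.fzu.edu.cn` onto every src, which breaks when a page uses absolute image URLs. urljoin resolves both cases correctly (the cdn.example.com URL below is hypothetical, just to show pass-through):

```python
from urllib.parse import urljoin

base = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
print(urljoin(base, '/images/a.jpg'))  # https://xcb.fzu.edu.cn/images/a.jpg
print(urljoin(base, 'b.jpg'))          # https://xcb.fzu.edu.cn/info/1071/b.jpg
print(urljoin(base, 'https://cdn.example.com/c.jpg'))  # absolute URL passes through
```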
Screenshot


Reflections
BeautifulSoup was easier to use than the re library: CSS selectors map directly onto the document structure, whereas a regular expression has to be written against the raw HTML text.
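For comparison, here is a minimal sketch of the same image extraction done with re (the pattern is an illustrative assumption: it only handles double-quoted src attributes and will miss less regular markup, which is exactly why BeautifulSoup feels simpler):

```python
import re
import requests
from urllib.parse import urljoin

url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
html = requests.get(url).text

# Naive pattern: captures the double-quoted src value of each <img> tag
pattern = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)
for src in pattern.findall(html):
    img_url = urljoin(url, src)
    if img_url.lower().endswith(('.jpg', '.jpeg')):
        print(img_url)
```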
