102102146 洪松渝 Data Collection and Fusion Technology, Assignment 1
Assignment ①:
Requirement: Use the requests and BeautifulSoup libraries to scrape the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the scraped university ranking information to the screen.
Output:

| 排名 (Rank) | 学校名称 (University) | 省市 (Province/City) | 学校类型 (Type) | 总分 (Total score) |
| --- | --- | --- | --- | --- |
| 1 | 清华大学 | 北京 | 综合 | 852.5 |
| 2 | ...... | ...... | ...... | ...... |
Code:
```python
import urllib.request
import bs4
from bs4 import BeautifulSoup


def getText(url):
    """Fetch the ranking page and return its HTML, or '' on failure."""
    try:
        req = urllib.request.Request(url)
        resp = urllib.request.urlopen(req)
        return resp.read().decode()
    except Exception:
        return ""


def UniList(ulist, html):
    """Parse each table row into a [rank, name, score] triple."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            index = tr('td')[0].text.replace('\n', '').replace(' ', '')
            name = tr('a', class_='name-cn')[0].text.replace('\n', '').replace(' ', '')
            score = tr('td')[4].text.replace('\n', '').replace(' ', '')
            ulist.append([index, name, score])


def printList(ulist, num):
    """Print the first num entries, padding with the full-width space
    (chr(12288)) so the Chinese columns line up."""
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'https://www.shanghairanking.cn/rankings/bcur/202011'
    html = getText(url)   # fetch the ranking page
    UniList(uinfo, html)  # extract rank, name, and score
    printList(uinfo, 26)  # print the top entries


main()
```
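The assignment asks for the requests library, while the fetch above goes through urllib.request. For reference, a minimal requests-based version of getText (the 10-second timeout is my own choice, not part of the original code):

```python
import requests


def getText(url):
    """Fetch the page with requests; return '' on any request error."""
    try:
        r = requests.get(url, timeout=10)  # timeout is an assumption, not in the original
        r.raise_for_status()               # turn HTTP error codes into exceptions
        r.encoding = r.apparent_encoding   # let requests guess the encoding
        return r.text
    except requests.RequestException:
        return ""
```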
Run screenshot:
Reflections: This was my first hands-on attempt at scraping a static web page, and my first time using the requests and bs4 libraries.
Assignment ②:
Requirement: Use the requests and re libraries to build a price-comparison crawler for an online store of your choice. Search the store for the keyword “书包” (schoolbag) and scrape the product names and prices from the results page.
Output:

| 序号 (No.) | 价格 (Price) | 商品名 (Product name) |
| --- | --- | --- |
| 1 | 65.00 | xxx |
| 2 | ...... | ...... |
Code:
```python
import requests
from bs4 import BeautifulSoup

# Dangdang search results for the keyword "书包" (GBK URL-encoded as %CA%E9%B0%FC)
url = "http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0"
}

response = requests.get(url=url, headers=headers)
response.encoding = 'GBK'  # the page is GBK-encoded
soup = BeautifulSoup(response.text, 'html.parser')

# product links carry the name in a title attribute; prices sit in span.price_n
names = soup.find_all('a', class_='pic')
prices = soup.find_all('span', class_='price_n')

for i, (name, price) in enumerate(zip(names, prices), 1):
    print(f"编号:{i}")
    print(f"名称:{name.get('title')}")
    print(f"价格:{price.text}")
    print()
```
Result screenshot:
Reflections:
I found the regex approach rather cumbersome, so I switched to the BeautifulSoup library to parse the page instead; a sketch of what the re-based extraction might look like follows.
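For comparison, here is a minimal sketch of the re-based extraction the assignment originally asked for. The patterns assume the price sits in `<span class="price_n">&yen;…</span>` and the product title in a `title` attribute on the `a.pic` link, with `class` appearing before `title`; the live page may differ.

```python
import re
import requests

url = "http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input"
headers = {'User-Agent': "Mozilla/5.0"}

resp = requests.get(url, headers=headers, timeout=10)
resp.encoding = 'GBK'

# Assumed markup; adjust the patterns if the page structure differs.
prices = re.findall(r'<span class="price_n">&yen;([\d.]+)</span>', resp.text)
names = re.findall(r'<a[^>]*class="pic"[^>]*title="([^"]+)"', resp.text)

for i, (name, price) in enumerate(zip(names, prices), 1):
    print(f"{i}\t{price}\t{name}")
```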
Assignment ③:
Requirement: Crawl a given page (https://xcb.fzu.edu.cn/info/1071/4481.htm), or a page of your own choice, and download all of its JPEG and JPG files.
Output: save all JPEG and JPG files from the chosen page into a single folder.
Code:
```python
import os
import requests
from bs4 import BeautifulSoup

url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
hd = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}


def getHTML(url):
    """Fetch the page, or return '' on any request error."""
    try:
        r = requests.get(url, headers=hd)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''


def download_img(url, img_name):
    """Download one image to img_name."""
    response = requests.get(url=url, headers=hd).content  # use the parameter, not a global
    with open(img_name, 'wb') as f:
        f.write(response)


text = getHTML(url)
soup = BeautifulSoup(text, 'html.parser')
imglist = soup.find_all('img')

folder_name = 'images2'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

for img in imglist:
    url_ = img.attrs['src']
    img_url = 'https://xcb.fzu.edu.cn{}'.format(url_)
    img_name = os.path.basename(img_url).split('?')[0]
    # keep only JPEG/JPG files, as the assignment requires
    if not img_name.lower().endswith(('.jpg', '.jpeg')):
        continue
    img_path = os.path.join(folder_name, img_name)
    download_img(img_url, img_path)
    print(f'下载图片:{img_path}')
```
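One fragility in the code above: it prefixes `https://xcb.fzu.edu.cn` onto every `src`, which breaks if an image URL is already absolute. The standard-library `urljoin` handles both cases; a small sketch (the example paths are illustrative, not taken from the actual page):

```python
from urllib.parse import urljoin

page_url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'

# urljoin resolves relative paths against the page URL
# and leaves absolute URLs untouched:
print(urljoin(page_url, '/images/photo.jpg'))              # -> https://xcb.fzu.edu.cn/images/photo.jpg
print(urljoin(page_url, 'https://cdn.example.com/x.jpg'))  # unchanged
```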
Result screenshot:
Reflections:
This assignment gave me an initial understanding of, and hands-on practice with, scraping images.