Web Scraping Assignment
Description
(2) Use the get() function of the requests library to access one of the following sites 20 times. Print the return status and the text content, and compute the lengths of the page content returned by the text attribute and the content attribute. (Choose the page by the last digit of your student ID; completing this part earns a passing grade.)
Sites: a: Baidu home page (student IDs ending in 1, 2)
b: Sogou home page (IDs ending in 3, 4)
c: Google home page (IDs ending in 5, 6)
d: 360 Search home page (IDs ending in 7, 8)
e: Bing home page (IDs ending in 9, 0)
import requests

url = "https://hao.360.com/"
for i in range(20):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise on non-2xx status codes
        print(r.status_code)              # return status
        print(r.text)                     # decoded page text
    except requests.RequestException:
        print("error")
print("length of the text attribute:", len(r.text))
print("length of the content attribute:", len(r.content))
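The two lengths the task asks for generally differ because text is a decoded str (counted in characters) while content is raw bytes. A minimal sketch of the effect, with no network access needed:

```python
# text vs. content: a decoded str counts characters, raw bytes count octets.
s = "学号07"                 # sample string with two Chinese characters
raw = s.encode("utf-8")      # what a response's content attribute holds
print(len(s))                # 4 characters
print(len(raw))              # 8 bytes: each Chinese character takes 3 bytes in UTF-8
```

For a mostly-Chinese page, the content length will therefore be noticeably larger than the text length.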
(3) Below is a simple HTML page. Save it as a string and complete the computations that follow. (Good grade)
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程(runoob.com)</title>
</head>
<body>
<h1>我的第一个标题学号07</h1>
<p id="first">我的第一个段落。</p>
<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>
</body>
</html>
Requirements: a. Print the contents of the head tag together with the last two digits of your student ID.
b. Get the contents of the body tag.
c. Get the tag object whose id is "first".
d. Extract and print the Chinese characters in the HTML page.
Reference: textbook, p. 269
from bs4 import BeautifulSoup
import re

path = 'C:/Users/asus/Desktop/html.html'
with open(path, 'r', encoding='utf-8') as htmlfile:
    htmlhandle = htmlfile.read()
soup = BeautifulSoup(htmlhandle, "html.parser")
print(soup.head, "48")                 # a: head tag plus last two digits of the student ID
print(soup.body)                       # b: contents of the body tag
print(soup.find_all(id="first"))       # c: tag object with id="first"
chinese = re.findall(r'[\u4e00-\u9fa5]+', soup.text)
print(chinese)                         # d: Chinese characters in the page
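The regex in step d can be checked without the HTML file: [\u4e00-\u9fa5] covers the common CJK Unified Ideographs, so findall returns runs of consecutive Chinese characters. A minimal sketch on a fragment of the page text (assumed already extracted):

```python
import re

# Extract runs of Chinese characters; digits, ASCII, and the full-width
# period 。 (U+3002, outside the \u4e00-\u9fa5 range) break the runs.
text = "我的第一个标题学号07我的第一个段落。row 1, cell 1"
runs = re.findall(r'[\u4e00-\u9fa5]+', text)
print(runs)  # ['我的第一个标题学号', '我的第一个段落']
```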
(4) Scrape the Chinese university ranking site: http://www.zuihaodaxue.com/zuihaodaxuepaiming2018.html
Requirements: scrape the university ranking for the year determined by the last digit of your student ID:
IDs ending in 1, 2: the 2020 ranking
IDs ending in 3, 4: the 2016 ranking
IDs ending in 5, 6: the 2017 ranking
IDs ending in 7, 8: the 2018 ranking
IDs ending in 9, 0: the 2019 ranking
import csv
import os
import requests
from bs4 import BeautifulSoup

allUniv = []

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(soup):
    data = soup.find_all('tr')
    for tr in data:
        ltd = tr.find_all('td')
        if len(ltd) == 0:            # skip header rows that have no <td> cells
            continue
        singleUniv = [td.string for td in ltd]
        allUniv.append(singleUniv)

def writercsv(save_road, num, title):
    # Append if the file already exists; otherwise create it and write the header.
    if os.path.isfile(save_road):
        with open(save_road, 'a', newline='') as f:
            csv_write = csv.writer(f, dialect='excel')
            for i in range(num):
                csv_write.writerow(allUniv[i])
    else:
        with open(save_road, 'w', newline='') as f:
            csv_write = csv.writer(f, dialect='excel')
            csv_write.writerow(title)
            for i in range(num):
                csv_write.writerow(allUniv[i])

title = ["排名", "学校名称", "省市", "总分", "生源质量", "培养结果", "科研规模",
         "科研质量", "顶尖成果", "顶尖人才", "科技服务", "产学研究合作", "成果转化", "学生国际化"]
save_road = "D:\\pm.csv"

def main():
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2018.html'
    html = getHTMLText(url)
    soup = BeautifulSoup(html, "html.parser")
    fillUnivList(soup)
    writercsv(save_road, 20, title)

main()
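The CSV-writing step can be exercised without scraping: csv.writer with dialect='excel' joins each row with commas, and writing the header only when the file is first created keeps appended runs from repeating it. A minimal sketch using an in-memory buffer and placeholder rows (the names and scores below are illustrative, not scraped data):

```python
import csv
import io

# Placeholder ranking rows for illustration only (not real scraped data).
title = ["排名", "学校名称", "省市", "总分"]
rows = [["1", "清华大学", "北京", "95.3"],
        ["2", "北京大学", "北京", "82.1"]]

buf = io.StringIO()                      # stands in for the CSV file on disk
writer = csv.writer(buf, dialect="excel")
writer.writerow(title)                   # header row, written once on creation
writer.writerows(rows)                   # one row per university
print(buf.getvalue())
```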