Crawling All the Campus News

Posted on 2019-04-15 20:59 by 赵铭杰
Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3002
import re
import requests
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

def getclick(newurl):
    # A regex can be pre-compiled into a pattern with re.compile, but
    # re.search also accepts the pattern string directly.
    # Extract the news id: split the captured group on '/' and take the
    # second element.
    id = re.search('_(.*).html', newurl).group(1).split('/')[1]
    # format() fills the id into the {} placeholder (instead of the
    # %d-style way of passing parameters).
    clickurl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(id)
    # The server response ends with something like "html('5423');":
    # split on '.', take the last piece, then strip the "html('" prefix
    # and the "');" suffix before converting to int.
    click = int(requests.get(clickurl).text.split('.')[-1].lstrip("html('").rstrip("');"))
    return click

def getonpages(listurl):
    res = requests.get(listurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            title = news.select('.news-list-title')[0].text
            time = news.select('.news-list-info')[0].contents[0].text
            url1 = news.select('a')[0]['href']
            bumen = news.select('.news-list-info')[0].contents[1].text
            description = news.select('.news-list-description')[0].text
            read = requests.get(url1)
            read.encoding = 'utf-8'
            soupd = BeautifulSoup(read.text, 'html.parser')
            detail = soupd.select('.show-content')[0].text
            click = getclick(url1)
            print(title, click)

# The element with class 'a1' holds the total news count (e.g. "2317条");
# strip the trailing character before converting to int.
count = int(soup.select('.a1')[0].text.rstrip('条'))
pages = count // 10 + 1
for i in range(2, 4):
    pagesurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    getonpages(pagesurl)
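The two string-parsing steps in getclick can be tried offline on sample strings. Note the sample detail-page URL and the sample API response below are assumptions modeled on the formats the code's comments describe, not live output from the server:

```python
import re

# Sample detail-page URL (assumed format: ..._MMDD/ID.html).
newurl = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0404/11029.html'
# group(1) captures '0404/11029'; splitting on '/' and taking index 1
# leaves just the news id.
news_id = re.search('_(.*).html', newurl).group(1).split('/')[1]
print(news_id)  # 11029

# Sample click-count response (assumed format).
resp_text = "$('#hits').html('5423');"
# split('.')[-1] -> "html('5423');", then strip the wrapper characters.
click = int(resp_text.split('.')[-1].lstrip("html('").rstrip("');"))
print(click)  # 5423
```

Keep in mind that lstrip and rstrip remove any characters in the given set, not a literal prefix or suffix, which is why this works here but would misbehave if the number itself contained one of those characters.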

Reference sites: https://www.jb51.net/article/141830.htm
https://blog.csdn.net/k_koris/article/details/82950654
https://www.cnblogs.com/tina-python/p/5508402.html
https://www.cnblogs.com/keye/p/7868059.html
https://www.cnblogs.com/benric/p/4965224.html
http://www.runoob.com/python/att-string-split.html