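Getting All the Information for a News Article
Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2894
Task:
Given the link to a news article, newsUrl, get all the information for that article:
Title, author, publishing unit, reviewer, source
Publish time: converted to the datetime type
Click count: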
- newsUrl
- newsId (extracted with regular expressions, re)
- clickUrl (str.format(newsId))
- requests.get(clickUrl)
- newClick (extracted with string processing or a regular expression)
- int()
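The click-count steps above can be sketched as three small functions. This is only a minimal sketch: the helper names (extract_news_id, build_click_url, parse_click_count) are made up for illustration, the URL template is copied from the click-count URL used in this post, and the shape of the counter response (a fragment like $('#hits').html('123');) is an assumption that should be checked against the real response before relying on it.

```python
import re

# URL template copied from the click-count URL used in this post.
CLICK_URL_TEMPLATE = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'


def extract_news_id(news_url):
    """Return the numeric id just before '.html' at the end of the URL."""
    match = re.search(r'/(\d+)\.html$', news_url)
    if match is None:
        raise ValueError('no news id found in ' + news_url)
    return match.group(1)


def build_click_url(news_id):
    """Fill the news id into the click-count URL with str.format."""
    return CLICK_URL_TEMPLATE.format(news_id)


def parse_click_count(response_text):
    """Pull the hit count out of the counter response and convert it to int.

    Assumption: the response contains a fragment such as
    $('#hits').html('123'); adjust the pattern if the real response differs.
    """
    match = re.search(r"hits'\)\.html\('(\d+)'\)", response_text)
    if match is None:
        raise ValueError('no hit count found in response')
    return int(match.group(1))
```

With these in place, the network step is just requests.get(build_click_url(news_id)).text passed to parse_click_count, so the parsing can be tried on a saved response without hitting the server.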
Wrap the whole process in a simple, clear function.
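One way to keep such a function simple is to separate downloading from parsing. Below is a rough sketch of the parsing half for the info line only. The function name parse_show_info and the English dictionary keys are my own choices, and the layout of the info line (labels 发布时间, 作者, 审核, 来源 each followed by a colon and a value) is assumed from the fields listed in the task, so it may need adjusting for the real page.

```python
import re
from datetime import datetime

# Page labels mapped to English keys (the keys are this sketch's own choice).
TEXT_FIELDS = {
    '作者': 'author',
    '审核': 'auditor',
    '来源': 'source',
}


def parse_show_info(info_text):
    """Parse the info line of a news page into a dict.

    Assumed format: '发布时间:2019-04-02 11:57:00 作者:... 审核:... 来源:...'.
    Both ASCII and full-width colons are accepted after each label.
    """
    result = {}

    # Publish time: two tokens (date and time), converted to datetime.
    time_match = re.search(
        r'发布时间[::]\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', info_text)
    if time_match:
        result['publish_time'] = datetime.strptime(
            time_match.group(1), '%Y-%m-%d %H:%M:%S')

    # Remaining fields: a label, a colon, then one run of non-space text.
    for label, key in TEXT_FIELDS.items():
        match = re.search(label + r'[::]\s*(\S+)', info_text)
        if match:
            result[key] = match.group(1)
    return result
```

Looking fields up by label rather than by position means a missing field simply leaves its key out of the dict instead of shifting every other value.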
The newsURL is:
http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0402/11131.html
The code is:
# -*- coding: utf-8 -*-
import requests
from datetime import datetime
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0402/11131.html'
clickNumURL = 'http://oa.gzcc.cn/api.php?op=count&id=11131&modelid=80'


def newsTime(shareinfo):
    # The first two whitespace-separated tokens are the date (after its label) and the time
    newsDate = shareinfo.split()[0].split(':')[1]
    newsTime = shareinfo.split()[1]
    dt = newsDate + " " + newsTime
    # datetime.strptime converts a text string into a datetime object
    showtime = datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")
    print("新闻发布时间:", end="")
    print(showtime)


def click(click_num_url):
    # Request the click-count URL and cut the number out of the returned text
    return_click_num = requests.get(click_num_url)
    click_info = BeautifulSoup(return_click_num.text, 'html.parser')
    click_num = int(click_info.text.split('.html')[3].split("'")[1])
    print("点击次数:", end="")
    print(click_num)


resourses = requests.get(url)
resourses.encoding = 'UTF-8'
soup = BeautifulSoup(resourses.text, 'html.parser')

# BeautifulSoup's select method looks up elements by class name and returns a list
print("\n新闻标题:" + soup.select('.show-title')[0].text)

# The fifth token of the info line is used as the publishing unit
publishing_unit = soup.select('.show-info')[0].text.split()[4].split(':')[1]
print("新闻发布单位:", end="")
print(publishing_unit)

# The third token of the info line is the author
print("作者:", end="")
writer = soup.select('.show-info')[0].text.split()[2].split(':')[1]
print(writer)

# Strip the full-width spaces used for indentation from the article body
print("新闻内容:" + soup.select('.show-content')[0].text.replace('\u3000', ''))

shareinfo = soup.select('.show-info')[0].text
newsTime(shareinfo)
click(clickNumURL)
The overall result: