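Fetching All the Information for a News Article

Given the link newsUrl of a news article, fetch all of that article's information:

Title, author, publishing unit, reviewer, source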

Publication time: convert it to a datetime object
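A minimal sketch of the datetime conversion. The helper name parse_publish_time, the format string, and the sample timestamp are assumptions for illustration; the format assumes the page shows the time as YYYY-MM-DD HH:MM:SS and should be adjusted if the site differs.

```python
from datetime import datetime

# Assumed layout of the time string on the page (not confirmed by the post).
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_publish_time(text):
    """Convert a publication-time string into a datetime object."""
    # strip() also removes stray non-breaking spaces around the value.
    return datetime.strptime(text.strip(), TIME_FORMAT)


if __name__ == "__main__":
    # Hypothetical sample value, used only to exercise the helper.
    print(parse_publish_time("2019-04-02 10:44:00"))
```

If the string does not match the format, strptime raises ValueError, which makes a layout change on the site visible immediately instead of silently producing a wrong time.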

Click count:

  • newsUrl
  • newsId (extracted with a regular expression, re)
  • clickUrl (built with str.format(newsId))
  • requests.get(clickUrl)
  • newClick (extracted with string processing or a regular expression)
  • int()
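The steps above can be sketched as three small helpers that work on plain strings, so they can be tried without touching the network. The helper names and the sample response text are assumptions for illustration; only the counter URL pattern comes from this post.

```python
import re

# Counter endpoint as used in this post; {} is filled with the news ID.
CLICK_URL = "http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80"


def extract_news_id(news_url):
    """Return the digits that precede '.html' at the end of the URL."""
    match = re.search(r"/(\d+)\.html$", news_url)
    if match is None:
        raise ValueError("no news ID found in " + news_url)
    return match.group(1)


def build_click_url(news_id):
    """Fill the news ID into the counter URL template."""
    return CLICK_URL.format(news_id)


def parse_click_count(response_text):
    """Return the last integer found in the counter response."""
    numbers = re.findall(r"\d+", response_text)
    if not numbers:
        raise ValueError("no number found in the response")
    return int(numbers[-1])


if __name__ == "__main__":
    news_id = extract_news_id(
        "http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0402/11131.html")
    print(build_click_url(news_id))
    # Hypothetical response body; the real reply may look different.
    print(parse_click_count("$('#hits').html('158');"))
```

In a real run, the text passed to parse_click_count would be requests.get(build_click_url(news_id)).text.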

Wrap the whole process in one simple, clear function.

import re

import requests
from bs4 import BeautifulSoup


def newsnum(url):
    # The news ID is the run of digits just before ".html" in the article URL.
    return re.match(r'http://news\.gzcc\.cn/html/2019/.*/(\d+)\.html', url).group(1)


def newstime(soup):
    # Take the text between "发布时间:" (publication time) and the next non-breaking space.
    pattern1 = re.compile(r'发布时间:(.*?)\xa0', re.S)
    return re.findall(pattern1, soup.select('.show-info')[0].text)[0]


def click(url):
    # Query the counter API with the news ID and take the last number in the reply
    # as the click count, converted to int.
    clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsnum(url))
    res = requests.get(clickUrl)
    return int(re.findall(r'(\d+)', res.text)[-1])


def main(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Whitespace-separated pieces of the info line.
    info = soup.select('.show-info')[0].text.split()

    print("News ID: " + newsnum(url))
    print("Title: " + soup.select('.show-title')[0].text)
    print("Publication time: " + newstime(soup))
    print(info[2])
    print(info[3])
    print(info[4])
    print("Content: " + soup.select('.show-content p')[0].text)
    print("Click count: {}".format(click(url)))


if __name__ == "__main__":
    url = "http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0402/11131.html"
    main(url)
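The listing picks the author, reviewer, and source out of the info line by position, which breaks if a field is missing. As a hedged alternative, the sketch below turns the info line into a dict keyed by label. It assumes the fields are separated by non-breaking spaces and written as label:value; the function name parse_info and the sample values are hypothetical.

```python
import re


def parse_info(info_text):
    """Split an info line into a {label: value} dict."""
    fields = {}
    # Assumption: fields are separated by one or more non-breaking spaces.
    for chunk in re.split(r"\xa0+", info_text.strip()):
        # Label is everything before the first ASCII or full-width colon.
        match = re.match(r"\s*([^:：]+)[:：]\s*(.*)", chunk, re.S)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


if __name__ == "__main__":
    # Hypothetical info line; the labels mean publication time, author,
    # reviewer, and source.
    sample = ("发布时间:2019-04-02 10:44:00\xa0\xa0作者:Alice"
              "\xa0\xa0审核:Bob\xa0\xa0来源:News Office")
    for label, value in parse_info(sample).items():
        print(label, value)
```

Looking fields up by label, for example parse_info(text).get("作者"), returns None for a missing field instead of raising IndexError.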

Output:

posted @ 2019-04-03 17:07  qwertuyt124