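1. Features
Fetch the daily current-affairs news from the Zhonggong site (中公网), building the paginated listing URLs so the pages can be crawled in bulk.
Request each daily news page and extract the content at the positions matched by XPath.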
import re
import time

import requests
from lxml import etree


def write_info(info):
    # Append text to the output file
    with open('时政2.txt', 'a', encoding='utf-8') as f:
        f.write(info)


url_temp = 'http://gd.zgsydw.com/ziliao/shizheng/{}.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

# Build the listing-page URLs
url_f = []
for k in range(11, 34):
    url_f.append(url_temp.format(str(k)))

try:
    for url in url_f:  # Iterate over the listing pages
        response = requests.get(url, headers=headers)
        html = etree.HTML(response.content.decode('gbk'))
        item = html.xpath('//*[@class="whole clearfix zg_list_one"]/div[@class="zg_list_lf"]/div[@class="zpxx_nr_change1"]/div/ul/li')
        page = re.findall(r'http://gd\.zgsydw\.com/ziliao/shizheng/(.*?)\.html', url)
        for i in item:
            title = i.xpath('.//a/span/b/text()')[0]  # Title of the daily news entry
            href = i.xpath('.//@href')[0]  # URL of the daily news page
            print('Now on page {}'.format(page[0]) + '\n', title, href)
            write_info(title)
            res = requests.get(href, headers=headers)
            html2 = etree.HTML(res.content.decode('gbk'))
            item1 = html2.xpath('//*[@class="whole clearfix zg_show_one"]/div[@class="zg_show_lf"]/div[@class="show_con"]')
            for box in item1:
                # The first news item is plain text directly under the div
                one = box.xpath('.//div[@class="show_con_box"]/text()')[0]
                write_info(one)
                # The remaining items are in <p> tags; the first five are skipped
                two = box.xpath('.//div[@class="show_con_box"]/p/text()')
                for j in two[5:]:
                    write_info(j)
                    write_info('\n')
            time.sleep(3)  # Pause between article requests
except Exception as e:
    print(e)
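The URL-building loop in the script above could also be written as a list comprehension, and the page number can be recovered with a pattern whose dots are escaped. This is only a minimal sketch, assuming the same URL template and page range as the script; `build_page_urls` and `page_number` are hypothetical helper names, not part of the original code.

```python
import re

# Same template as the crawler script
URL_TEMPLATE = 'http://gd.zgsydw.com/ziliao/shizheng/{}.html'


def build_page_urls(start, stop):
    """Return the listing-page URLs for pages start .. stop - 1."""
    return [URL_TEMPLATE.format(k) for k in range(start, stop)]


def page_number(url):
    """Return the page number in a listing URL as a string, or None if it does not match."""
    match = re.search(r'/ziliao/shizheng/(\d+)\.html$', url)
    return match.group(1) if match else None


urls = build_page_urls(11, 34)
print(len(urls), urls[0], page_number(urls[0]))
```

Keeping these two steps in small functions makes them easy to check without sending any requests.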
import re

with open('时政2.txt', 'r', encoding='utf-8') as f:
    s = f.read()

# Strip the boilerplate lead-in sentence from the scraped text
research = re.sub(r'和各位考生探讨考试中的疑惑,以下为正文内容详情:', '', s)

with open('时政3.txt', 'w', encoding='utf-8') as f:
    f.write(research)
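Because the phrase being removed is a fixed string rather than a real pattern, the cleanup step can probably be done with `str.replace` as well. A small sketch under that assumption; `strip_boilerplate` is a hypothetical helper name and the sample text is made up for illustration.

```python
# The fixed lead-in sentence removed by the cleanup script
BOILERPLATE = '和各位考生探讨考试中的疑惑,以下为正文内容详情:'


def strip_boilerplate(text, phrase=BOILERPLATE):
    """Return text with every occurrence of the fixed phrase removed."""
    return text.replace(phrase, '')


# Made-up sample: a title, the boilerplate sentence, then the body
sample = 'Title ' + BOILERPLATE + 'Body'
print(strip_boilerplate(sample))
```

Wrapping the step in a function lets it run on each article before writing, which would avoid the second pass over the file.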
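2. Known issues
Encoding: the pages declare charset="GB2312", but decoding fails on some of them. GBK is a superset of GB2312, so decoding every page as gbk resolves this.
XPath position: the first news item has no p tag of its own, so it is matched as the text directly under the div with xpath('.//div[@class="show_con_box"]/text()')[0] and then combined with the text of the p tags.
Pages from 2019 use a different tag structure, so they cannot be matched completely.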
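The encoding issue can be reproduced offline. One plausible cause, which is an assumption rather than something verified against the site, is that some pages contain characters that exist in GBK but not in GB2312. The sketch below uses the character 镕 as such an example; `try_decode` is a hypothetical helper.

```python
def try_decode(data, encoding):
    """Decode bytes with the given encoding, returning None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# '镕' is encodable in GBK but is not part of GB2312
raw = '朱镕基'.encode('gbk')

print(try_decode(raw, 'gb2312'))  # strict GB2312 decoding fails on this input
print(try_decode(raw, 'gbk'))     # GBK decoding succeeds

# Text that is valid GB2312 decodes identically with gbk
plain = '时政'.encode('gb2312')
print(try_decode(plain, 'gbk'))
```

This is why decoding everything as gbk is safe here: pages that really are GB2312 are unaffected, and pages with the extra characters stop failing.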
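The idea of combining the text directly under the div with the text of the p tags can be illustrated without any network access. Note the swap: this sketch uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies, and its `.text` attribute plays the role of the first `text()` node. The markup is an invented, simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: the first item is bare text, the rest are <p> tags
SNIPPET = """
<div class="show_con">
  <div class="show_con_box">First news item
    <p>Second news item</p>
    <p>Third news item</p>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET.strip())
box = root.find(".//div[@class='show_con_box']")

# Text directly under the div, before its first child element
first = (box.text or '').strip()
# Text of each <p> child
rest = [p.text for p in box.findall('p')]

items = [first] + rest
print(items)
```

With lxml on the real pages, the same two parts come from the `/text()` and `/p/text()` expressions used in the script.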
3. References
Encoding
https://blog.csdn.net/lxdcyh/article/details/4018054
lxml
https://blog.csdn.net/mouday/article/details/105376949
List comprehensions
https://blog.csdn.net/lexi3555/article/details/80633441