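I. Functionality

Scrape the daily current-affairs news from 中公网 (gd.zgsydw.com). The paginated listing URLs are constructed up front so that pages can be fetched in bulk.

Request each listing page and each day's news page, then extract the content at the positions matched by XPath.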

import requests
import time
from lxml import etree
import re

def write_info(info):
    # Append text to the output file; the with block closes the file on exit.
    with open('时政2.txt','a',encoding='utf-8') as f:
        f.write(info)


url_temp = 'http://gd.zgsydw.com/ziliao/shizheng/{}.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
url_f = []
for k in range(11,34):
    url_f.append(url_temp.format(str(k)))   # build the URL of each listing page
    
try:
    for url in url_f:    # iterate over the listing pages
        response = requests.get(url, headers=headers)
        html = etree.HTML(response.content.decode('gbk'))
        item = html.xpath('//*[@class="whole clearfix zg_list_one"]/div[@class="zg_list_lf"]/div[@class="zpxx_nr_change1"]/div/ul/li')
        # page number taken from the listing URL, used only for the progress message
        page = re.findall(r'http://gd\.zgsydw\.com/ziliao/shizheng/(.*?)\.html', url)
        for i in item:
            title = i.xpath('.//a/span/b/text()')[0]    # headline
            href = i.xpath('.//@href')[0]   # URL of that day's news page
            print('Now on page {}'.format(page[0]) + '\n', title, href)
            write_info(title)
            res = requests.get(href, headers=headers)
            html2 = etree.HTML(res.content.decode('gbk'))
            item1 = html2.xpath('//*[@class="whole clearfix zg_show_one"]/div[@class="zg_show_lf"]/div[@class="show_con"]')
            for box in item1:
                # the first news item is bare text directly under the div, not inside a <p>
                one = box.xpath('.//div[@class="show_con_box"]/text()')[0]
                write_info(one)
                two = box.xpath('.//div[@class="show_con_box"]/p/text()')
                for j in two[5:]:    # skip the first five <p> texts
                    write_info(j)
                    write_info('\n')
                time.sleep(3)    # pause before the next article request
except Exception as e:
    print(e)
Scraping code
import re


with open('时政2.txt','r',encoding='utf-8') as f:
    s = f.read()
    # remove the site's boilerplate lead-in phrase from the scraped text
    research = re.sub(r'和各位考生探讨考试中的疑惑,以下为正文内容详情:','',s)
    
with open('时政3.txt','w',encoding='utf-8') as f:
    f.write(research)
Initial cleanup

 

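II. Known issues

Encoding: the pages declare charset="GB2312", but some pages fail to decode with that codec. Decoding every page as gbk, which is a superset of GB2312, resolves this.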
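
To illustrate the encoding issue, here is a minimal sketch using only the standard library. The character 镕 is used as a stand-in for whatever characters the problem pages contain (that choice is an assumption, not taken from the site); it exists in GBK but not in GB2312. The helper name decode_page and the errors='replace' fallback are also illustrative additions, not part of the original script.

```python
def decode_page(raw: bytes) -> str:
    """Decode page bytes as GBK; undecodable bytes become U+FFFD (assumed policy)."""
    return raw.decode('gbk', errors='replace')


# Stand-in for a page whose text contains a character outside GB2312.
sample = '朱镕基'.encode('gbk')

try:
    sample.decode('gb2312')
    gb2312_ok = True
except UnicodeDecodeError:
    # The declared charset cannot represent every byte sequence on the page.
    gb2312_ok = False

print(gb2312_ok)
print(decode_page(sample))
```

In the scraper this would correspond to calling decode_page(response.content) instead of response.content.decode('gbk').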

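XPath position: the first news item has no p tag of its own, so it is picked up with xpath('.//div[@class="show_con_box"]/text()')[0], which matches the text directly under the div, and the result is then combined with the text of the p tags.

Pages from 2019 use a different tag structure, so they cannot be fully matched.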
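
The difference between the two expressions can be seen on a small hand-written fragment. The HTML below is made up for illustration and only mimics the show_con_box layout described here; it is not copied from the site.

```python
from lxml import etree

# Hypothetical fragment: the first item is bare text, the rest sit in <p> tags.
snippet = ('<div class="show_con_box">First item text'
           '<p>Second item</p><p>Third item</p></div>')
html = etree.HTML(snippet)
box = html.xpath('//div[@class="show_con_box"]')[0]

# text() directly under the div only sees the bare first item.
direct = box.xpath('./text()')
# p/text() only sees the items wrapped in <p>.
paragraphs = box.xpath('./p/text()')

# Combining both recovers every item in document order.
combined = direct[:1] + paragraphs
print(combined)
```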
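
One possible way to keep the run going when a page does not match is sketched below. Indexing an empty XPath result with [0] raises IndexError, and because the whole loop sits inside a single try block, the first mismatching page ends the scrape. The helper first_or_default is hypothetical and not part of the original script.

```python
def first_or_default(matches, default=''):
    """Return the first XPath match, or `default` when nothing matched."""
    return matches[0] if matches else default


# XPath queries return plain lists, so ordinary lists stand in for them here.
print(first_or_default(['headline', 'other']))
print(repr(first_or_default([])))
```

In the scraper, a call such as first_or_default(i.xpath('.//a/span/b/text()')) would yield an empty title for an unmatched page instead of aborting; whether to skip such pages or log them is left open.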

 

III. References

Encoding

https://blog.csdn.net/lxdcyh/article/details/4018054

lxml

https://blog.csdn.net/mouday/article/details/105376949

List comprehensions

https://blog.csdn.net/lexi3555/article/details/80633441

 

posted on 2020-09-02 11:14  喻解