Scraping current-affairs news from 中公网

1. What it does

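Fetch the daily current-affairs news from 中公网. The listing pages are numbered, so building their URLs from a template lets the scraper collect many pages in one run.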
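The URL construction can be written as a list comprehension. This is a minimal sketch: the template and the page range 11 to 33 are taken from the scraper in this post, while the names `URL_TEMPLATE` and `build_page_urls` are introduced here only for illustration.

```python
# Template copied from the scraper; {} is replaced by the page number.
URL_TEMPLATE = 'http://gd.zgsydw.com/ziliao/shizheng/{}.html'


def build_page_urls(first_page, last_page):
    """Return the listing-page URLs for first_page..last_page (inclusive)."""
    return [URL_TEMPLATE.format(page) for page in range(first_page, last_page + 1)]


# Same range as range(11, 34) in the scraper: pages 11 through 33.
urls = build_page_urls(11, 33)
print(len(urls), urls[0], urls[-1])
```

Keeping the range in one function makes it easy to widen or narrow the crawl without touching the rest of the code.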

Request each day's news page and extract the content at the positions matched by XPath.

import requests
import time
from lxml import etree
import re

def write_info(info):
    # Append text to the output file; the with block closes it automatically.
    with open('时政2.txt','a',encoding='utf-8') as f:
        f.write(info)


url_temp = 'http://gd.zgsydw.com/ziliao/shizheng/{}.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
url_f = []
for k in range(11,34):
    url_f.append(url_temp.format(str(k)))   # build the listing-page URLs
    
for url in url_f:    # iterate over the listing pages
    try:
        response = requests.get(url, headers=headers, timeout=10)
        html = etree.HTML(response.content.decode('gbk'))
        item = html.xpath('//*[@class="whole clearfix zg_list_one"]/div[@class="zg_list_lf"]/div[@class="zpxx_nr_change1"]/div/ul/li')
        # Page number taken from the URL, used only for progress output.
        page = re.findall(r'http://gd\.zgsydw\.com/ziliao/shizheng/(.*?)\.html', url)
        for li in item:
            title = li.xpath('.//a/span/b/text()')[0]    # headline
            href = li.xpath('.//@href')[0]   # URL of that day's article
            print('Now on page {}'.format(page[0]) + '\n', title, href)
            write_info(title)
            res = requests.get(href, headers=headers, timeout=10)
            html2 = etree.HTML(res.content.decode('gbk'))
            item1 = html2.xpath('//*[@class="whole clearfix zg_show_one"]/div[@class="zg_show_lf"]/div[@class="show_con"]')
            for box in item1:
                # The first news item is plain text directly under the div.
                one = box.xpath('.//div[@class="show_con_box"]/text()')[0]
                write_info(one)
                # The remaining items are in p tags; the first five are skipped.
                two = box.xpath('.//div[@class="show_con_box"]/p/text()')
                for j in two[5:]:
                    write_info(j)
                    write_info('\n')
            time.sleep(3)    # pause between article requests
    except Exception as e:
        # Report the failing page and move on to the next one.
        print(url, e)

Scraping code

import re


with open('时政2.txt','r',encoding='utf-8') as f:
    s = f.read()
    # Remove the boilerplate sentence that the site adds to every article.
    research = re.sub(r'和各位考生探讨考试中的疑惑，以下为正文内容详情：','',s)
    
with open('时政3.txt','w',encoding='utf-8') as f:
    f.write(research)

Preliminary cleanup
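The phrase removed in the cleanup step is a fixed string, not a pattern, so plain `str.replace` should give the same result without a regular expression. A minimal sketch, operating on a string in memory; the names `BOILERPLATE` and `remove_boilerplate` are introduced here only for illustration.

```python
# The fixed sentence that the cleanup step strips from the scraped text.
BOILERPLATE = '和各位考生探讨考试中的疑惑，以下为正文内容详情：'


def remove_boilerplate(text):
    """Return text with every occurrence of the boilerplate sentence removed."""
    return text.replace(BOILERPLATE, '')


sample = '标题' + BOILERPLATE + '正文'
cleaned = remove_boilerplate(sample)
print(cleaned)
```

A regular expression only becomes necessary if the phrase varies from page to page.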

2. Known issues

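Encoding: the pages declare charset="GB2312", but decoding with that charset fails on some pages. Decoding everything as gbk, which is a superset of GB2312, solves this.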
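The difference between the two codecs can be reproduced without any network access. The sketch below uses a string containing '镕', a character that GBK can encode but GB2312 cannot; the helper name `decode_page` is an assumption made for this example.

```python
def decode_page(raw_bytes):
    """Decode a page body as GBK, which also covers every GB2312 character."""
    return raw_bytes.decode('gbk')


# Bytes as a server might send them for a page that declares GB2312
# but contains a character outside that character set.
raw = '朱镕基'.encode('gbk')

try:
    raw.decode('gb2312')
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

text = decode_page(raw)
print(gb2312_ok, text)
```

If a page ever fails under gbk as well, `gb18030` is a further superset that may be worth trying; that is a suggestion, not something tested against this site.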

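XPath position: the first news item is not wrapped in its own p tag. xpath('.//div[@class="show_con_box"]/text()')[0] picks up the text that sits directly under the div, and this is combined with the text matched from the p tags.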
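A small lxml example shows the situation. The HTML in `SAMPLE` is made up for illustration and only mimics the described layout (loose text first, then p tags); it is not copied from the site. Selecting `.//text()` returns all text nodes under the div in document order, which is one possible way to collect both kinds of item in a single query.

```python
from lxml import etree

# Invented markup that mimics the layout described above.
SAMPLE = '''
<div class="show_con">
  <div class="show_con_box">First item sits directly in the div.
    <p>Second item</p>
    <p>Third item</p>
  </div>
</div>
'''


def extract_items(html_text):
    """Collect the loose text and the p-tag text of each show_con_box div."""
    root = etree.HTML(html_text)
    items = []
    for box in root.xpath('//div[@class="show_con_box"]'):
        # .//text() yields every descendant text node in document order.
        for node in box.xpath('.//text()'):
            stripped = node.strip()
            if stripped:  # skip whitespace-only nodes between tags
                items.append(stripped)
    return items


items = extract_items(SAMPLE)
print(items)
```

Unlike the scraper, this sketch does not skip the first five p tags, so that step would still need to be applied to real pages.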

Pages from 2019 use different markup, so the XPath expressions above do not match all of them.
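One common way to cope with inconsistent markup is to try several XPath expressions in turn and use the first one that matches. The sketch below is only an outline of that idea: the second expression in `CANDIDATES` is hypothetical, since the actual structure of the 2019 pages is not recorded here, and `first_match` is a helper name introduced for this example.

```python
from lxml import etree


def first_match(node, xpaths):
    """Return the result of the first XPath that matches anything, else []."""
    for expression in xpaths:
        result = node.xpath(expression)
        if result:
            return result
    return []


CANDIDATES = [
    # Shortened form of the expression used in the scraper.
    '//div[@class="show_con"]/div[@class="show_con_box"]',
    # Hypothetical fallback; replace with the real layout of the older pages.
    '//div[@class="article"]',
]

new_page = etree.HTML('<div class="show_con"><div class="show_con_box">a</div></div>')
old_page = etree.HTML('<div class="article"><p>b</p></div>')

print(len(first_match(new_page, CANDIDATES)), len(first_match(old_page, CANDIDATES)))
```

An empty result then means "skip this page" instead of an IndexError from indexing with [0].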

3. References

Encoding

https://blog.csdn.net/lxdcyh/article/details/4018054

lxml

https://blog.csdn.net/mouday/article/details/105376949

List comprehensions

https://blog.csdn.net/lexi3555/article/details/80633441

posted on 2020-09-02 11:14 喻解