【2022.05.20】Adaptive scraping of all notice content on a CAPTCHA-free web page (1)

What I learned

XPath, plus Python string replacement

Adaptive URL joining, since many sites use incomplete (relative) href values (see the urljoin sketch right after this list)

Using Selenium to scrape dynamically rendered page content (a sketch follows the reference links at the end)
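
For the URL joining part, urllib.parse.urljoin resolves relative hrefs against the URL of the page they appear on; a minimal sketch (the example paths here are made up for illustration):

from urllib.parse import urljoin

base = "http://www.lswz.gov.cn/html/ywpd/lstk/tj-sgsj.shtml"

# A relative href is resolved against the directory of the current page.
print(urljoin(base, "202205/notice_1.shtml"))
# -> http://www.lswz.gov.cn/html/ywpd/lstk/202205/notice_1.shtml

# An already-absolute href passes through unchanged.
print(urljoin(base, "http://example.com/notice_2.shtml"))
# -> http://example.com/notice_2.shtml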

Preface

The goal this time: given a URL and a set of XPath expressions, scrape the content of every notice on the same page.

Code

Source

import requests
from lxml import etree
import json
import time
from urllib.parse import urljoin

# Global variables: load the scraping targets from config.json
with open('config.json', 'r', encoding='utf-8') as f:
    JsonFile = json.load(f)

base_url = JsonFile['url']
notice_title_href_xpath = JsonFile['notice_title_href_xpath']
notice_title_xpath = JsonFile['notice_title_xpath']
notice_content_xpath = JsonFile['notice_content_xpath']
search = JsonFile['search']
# Substitute the search keyword for the '替换' placeholder in the content XPath
notice_content_xpath = notice_content_xpath.replace("替换", search)
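# For example, with search = "玉米" the line above turns
#     //*[contains(text(),'替换') and @style]/text()
# into
#     //*[contains(text(),'玉米') and @style]/text()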


# Return candidate links to the next page (not finished yet).
# Note: xpath() returns a list of hrefs, and they may be relative URLs.
def get_next_page_url(current_url):
    current_html = requests.get(current_url)
    current_html.encoding = "utf-8"
    selector = etree.HTML(current_html.text)
    next_page_url = selector.xpath("""//*[contains(text(),'下一页') or contains(text(),'下页') or contains(text(),'next') or contains(text(),'Next')]/@href""")
    print("Next page link:", next_page_url)
    return next_page_url

if __name__ == '__main__':
    current_url = base_url

    current_html = requests.get(current_url)
    current_html.encoding = "utf-8"
    selector = etree.HTML(current_html.text)
    notice = selector.xpath(notice_title_href_xpath)  # hrefs of every notice on the page
    print(notice)

    # Fetch every notice on the current page
    for result in notice:
        # Build the absolute URL (the href is often relative)
        result_url = urljoin(current_url, result)
        print("URL: ", result_url)
        result_html = requests.get(result_url)
        result_html.encoding = "utf-8"
        result_detail = etree.HTML(result_html.text)
        result_title = result_detail.xpath(notice_title_xpath)
        print("Title: ", result_title)

        result_content = result_detail.xpath(notice_content_xpath)
        print("Content: ")
        for result_print in result_content:
            print(result_print)
        print("\n")
        time.sleep(1)  # be polite: pause between requests
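
The next-page helper above is still unfinished; as a rough sketch of where it is headed (my assumption, not the finished code), the main block could walk through pages like this, reusing the list of candidate hrefs the helper already returns:

current_url = base_url
while current_url:
    # ... scrape the notices on current_url exactly as in the loop above ...
    candidates = get_next_page_url(current_url)  # list of hrefs, possibly relative
    # Take the first candidate and make it absolute; stop when none is found.
    # (A real version should also guard against pages that link back to themselves.)
    current_url = urljoin(current_url, candidates[0]) if candidates else None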

Configuration file (JSON has no comments, so each ?-prefixed key holds the description of the key below it)

{
    "?url": "[Editable] The web page to scrape",
    "url": "http://www.lswz.gov.cn/html/ywpd/lstk/tj-sgsj.shtml",
    "?search": "[Editable] The notice keyword to search for",
    "search": "玉米",
    "?notice_title_href_xpath": "[Do not modify] XPath locating each notice's href",
    "notice_title_href_xpath": "//*[@class='lists diylist']/li/a/@href",
    "?notice_title_xpath": "[Do not modify] XPath locating each notice's title",
    "notice_title_xpath": "//div[@class='pub-det-title']/text()",
    "?notice_content_xpath": "[Do not modify] XPath locating each notice's content",
    "notice_content_xpath": "//*[contains(text(),'替换') and @style]/text()"
}

References

XPath中text方法和string方法的用法总结 - Jock2018的博客 - CSDN博客

python爬虫使用requests请求无法获取网页元素时终极解决方案 - 咖啡少女不加糖。 - 博客园 (cnblogs.com)

requests get不到完整页面源码 - SegmentFault 思否

python JS 反爬用 request 获取到的 HTML 不全,请各位帮我看看 - Python 技术论坛 (learnku.com)
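
The links above all describe the same failure mode: requests only fetches the raw HTML, so content rendered by JavaScript never shows up in the response. A minimal Selenium sketch for that case, assuming a local ChromeDriver and reusing base_url and notice_title_href_xpath from the script above (the rendered source then feeds the same etree.HTML + XPath pipeline):

import time
from lxml import etree
from selenium import webdriver

driver = webdriver.Chrome()   # assumes ChromeDriver is installed and on PATH
driver.get(base_url)
time.sleep(2)                 # crude wait for JS to render; WebDriverWait is more robust
rendered_html = driver.page_source
driver.quit()

selector = etree.HTML(rendered_html)
print(selector.xpath(notice_title_href_xpath))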

posted @ 2022-05-20 19:41  Mokou