【2022.05.20】Adaptive scraping of all announcement content from a CAPTCHA-free web page (1)
What I learned
XPath, plus Python string replacement
Adaptive URL joining, since many sites ship incomplete hrefs
Using Selenium to scrape dynamically rendered pages
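The adaptive URL joining mentioned above is handled by `urljoin` from the standard library, which resolves whatever kind of href a site emits against the page URL. A minimal sketch (the URLs below are made-up examples, not from the crawled site):

```python
from urllib.parse import urljoin

# A page URL and the kinds of hrefs commonly found in its links
page = "http://www.example.com/html/ywpd/lstk/tj-sgsj.shtml"

print(urljoin(page, "tj-sgsj_2.shtml"))                   # sibling-relative href
print(urljoin(page, "/news/detail.shtml"))                # root-relative href
print(urljoin(page, "http://other.example.com/a.shtml"))  # absolute href is kept as-is
```

Because `urljoin` leaves absolute URLs untouched, the crawler can call it unconditionally on every extracted href.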
Foreword
The goal this time: given a URL and a set of XPath expressions, scrape the content of every announcement linked from a single page.
Code
Source
import json
import time
import requests
from lxml import etree
from urllib.parse import urljoin

# Global configuration
with open('config.json', 'r', encoding='utf-8') as f:
    JsonFile = json.load(f)
base_url = JsonFile['url']
notice_title_href_xpath = JsonFile['notice_title_href_xpath']
notice_title_xpath = JsonFile['notice_title_xpath']
notice_content_xpath = JsonFile['notice_content_xpath']
search = JsonFile['search']
# Substitute the search keyword into the content XPath template
notice_content_xpath = notice_content_xpath.replace("替换", search)

# Return the link to the next page (this part is not finished yet)
def get_next_page_url(current_url):
    current_html = requests.get(current_url)
    current_html.encoding = "utf-8"
    selector = etree.HTML(current_html.text)
    next_page_url = selector.xpath("""//*[contains(text(),'下一页') or contains(text(),'下页') or contains(text(),'next') or contains(text(),'Next')]/@href""")
    print("Next-page link:", next_page_url)
    return next_page_url

if __name__ == '__main__':
    current_url = base_url
    current_html = requests.get(current_url)
    current_html.encoding = "utf-8"
    selector = etree.HTML(current_html.text)
    notice = selector.xpath(notice_title_href_xpath)
    print(notice)
    # print(current_html)
    # next_page_url_xpath = """//*[@href = 'tj-sgsj_2.shtml']/@href"""
    # next_page_url = selector.xpath(next_page_url_xpath)
    # print("Next-page link:", next_page_url)
    # Fetch every announcement on the current page
    for result in notice:
        # Build the absolute URL (many sites use relative hrefs)
        result_url = urljoin(current_url, result)
        print("URL:", result_url)
        result_html = requests.get(result_url)
        result_html.encoding = "utf-8"
        result_detail = etree.HTML(result_html.text)
        result_title = result_detail.xpath(notice_title_xpath)
        print("Title:", result_title)
        result_content = result_detail.xpath(notice_content_xpath)
        print("Content:")
        for result_print in result_content:
            print(result_print)
        print("\n")
        time.sleep(1)  # be polite to the server
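The unfinished `get_next_page_url` above can be exercised offline: its "next page" XPath works on any lxml tree, so a small inline HTML snippet (invented here for illustration) is enough to verify the selector before pointing it at a live site:

```python
from lxml import etree
from urllib.parse import urljoin

# Invented pagination markup resembling what the XPath targets
html = """
<div class="pagination">
  <a href="tj-sgsj.shtml">上一页</a>
  <a href="tj-sgsj_2.shtml">下一页</a>
</div>
"""

selector = etree.HTML(html)
hrefs = selector.xpath(
    "//*[contains(text(),'下一页') or contains(text(),'下页') "
    "or contains(text(),'next') or contains(text(),'Next')]/@href"
)
# Resolve against the current page URL, mirroring the crawler's urljoin step
next_url = urljoin("http://www.example.com/html/tj-sgsj.shtml", hrefs[0]) if hrefs else None
print(next_url)
```

Guarding with `if hrefs` also gives the main loop a natural stop condition: on the last page the XPath returns an empty list and `next_url` becomes `None`.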
Configuration file
{
"?url": "[editable] URL of the page to crawl",
"url": "http://www.lswz.gov.cn/html/ywpd/lstk/tj-sgsj.shtml",
"?search": "[editable] keyword to look for in announcement content",
"search": "玉米",
"?notice_title_href_xpath": "[do not edit] XPath selecting each announcement's href",
"notice_title_href_xpath": "//*[@class='lists diylist']/li/a/@href",
"?notice_title_xpath": "[do not edit] XPath selecting each announcement's title",
"notice_title_xpath": "//div[@class='pub-det-title']/text()",
"?notice_content_xpath": "[do not edit] XPath selecting each announcement's content",
"notice_content_xpath": "//*[contains(text(),'替换') and @style]/text()"
}
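The "替换" ("replace") placeholder in `notice_content_xpath` is swapped for the search keyword at runtime. A self-contained sketch of that substitution, run against an invented HTML fragment (both the trimmed config and the announcement body are made up for illustration):

```python
import json
from lxml import etree

# A trimmed, invented config mirroring the structure of config.json
config = json.loads("""
{
  "search": "玉米",
  "notice_content_xpath": "//*[contains(text(),'替换') and @style]/text()"
}
""")

# Substitute the keyword into the XPath template, as the crawler does
xpath = config["notice_content_xpath"].replace("替换", config["search"])

# Invented announcement body: only styled elements mentioning the keyword match
html = """
<div>
  <p style="text-indent:2em">今日玉米价格平稳。</p>
  <p>与关键词无关的段落。</p>
</div>
"""
lines = etree.HTML(html).xpath(xpath)
print(lines)
```

Keeping the placeholder in the JSON means the keyword can change without touching the XPath, at the cost of reserving the literal string "替换" in the template.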