Python in practice: scraping web pages with requests + BeautifulSoup
I. Install requests
1. Install with pip
(venv) liuhongdi@192 news % pip3 install requests
2. Check the version of the installed library:
(venv) liuhongdi@192 news % pip3 show requests
Name: requests
Version: 2.31.0
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache 2.0
Location: /Users/liuhongdi/python_work/tutorial/news/venv/lib/python3.12/site-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by:
II. Install BeautifulSoup
1. Install with pip
(venv) liuhongdi@192 news % pip3 install beautifulsoup4
2. Check the installed library's information:
(venv) liuhongdi@192 news % pip3 show beautifulsoup4
Name: beautifulsoup4
Version: 4.12.3
Summary: Screen-scraping library
Home-page:
Author:
Author-email: Leonard Richardson <leonardr@segfault.org>
License: MIT License
Location: /Users/liuhongdi/python_work/tutorial/news/venv/lib/python3.12/site-packages
Requires: soupsieve
Required-by:
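The same version check can also be done from inside Python. The following is a minimal sketch that uses the standard library's importlib.metadata; the helper name installed_version is hypothetical and introduced here only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        # The distribution is not present in the current environment
        return None


if __name__ == "__main__":
    # Distribution names as used with pip above
    for name in ("requests", "beautifulsoup4"):
        print(name, installed_version(name))
```

Run inside the same virtual environment, this should report the versions that pip3 show lists; a None result suggests the package is missing from that environment.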
Note: 刘宏缔的架构森林 (Liu Hongdi's Architecture Forest), a blog focused on IT technology
Website: https://imgtouch.com
This article: https://blog.imgtouch.com/index.php/2024/02/20/python-shi-zhan-yong-requests-zuo-pa-chong/
Code: https://github.com/liuhongdi/ or https://gitee.com/liuhongdi
Author: Liu Hongdi (刘宏缔). Email: 371125307@qq.com
III. Fetch a page with requests + BeautifulSoup and parse it
import requests
from bs4 import BeautifulSoup

# Scrape the technology channel of guancha.cn (观察者网)
# Host prefix for the links found in the page
base_url = "https://www.guancha.cn"
# Page to crawl
url = "https://www.guancha.cn/GongYe%C2%B7KeJi/list_1.shtml"
# Query parameters (none are needed here)
params = {}
# Request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}
# Fetch the page
response = requests.get(url, params=params, headers=headers)
# print(response.text)
# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
# Find the element we need
container = soup.find("ul", {"class": "column-list fix"})
# Get every li under the ul
nodes = container.find_all("li")
# Iterate over the items and print each link and its title
for node in nodes:
    anchor = node.find("a")
    link = base_url + anchor["href"]
    print(link)
    text = anchor.string
    print(text)
Output:
https://www.guancha.cn/industry-science/2024_02_20_725732.shtml
在南极取得突破!国产极地重型载具完成技术测试
https://www.guancha.cn/industry-science/2024_02_20_725718.shtml
我国科学家发现非常规反铁磁体
https://www.guancha.cn/industry-science/2024_02_17_725482.shtml
我国猴痘mRNA疫苗将进入临床试验
https://www.guancha.cn/industry-science/2024_02_17_725468.shtml
首飞失败后,日本新型H3火箭2号机发射升空
https://www.guancha.cn/industry-science/2024_02_17_725467.shtml
“我们真的看到新工业革命来临”?
https://www.guancha.cn/industry-science/2024_02_17_725454.shtml
OpenAI视频生成模型,会让哪些人失业?
https://www.guancha.cn/industry-science/2024_02_16_725430.shtml
OpenAI发布首个视频生成模型:输文字出视频,1分钟流畅高清
https://www.guancha.cn/industry-science/2024_02_16_725411.shtml
自主研制离子成像技术探测量子态,我国科学家有了新发现
https://www.guancha.cn/industry-science/2024_02_14_725323.shtml
新春伊始,一批大国重器取得新突破
https://www.guancha.cn/industry-science/2024_02_14_725314.shtml
微型机器人在国际空间站首次模拟手术任务
https://www.guancha.cn/industry-science/2024_02_12_725209.shtml
实现突破性进展!这一领域,我国处于全球第一梯队
https://www.guancha.cn/industry-science/2024_02_08_724888.shtml
向理解高温超导机理迈出重要一步,中国科学家首次观测到
https://www.guancha.cn/industry-science/2024_02_07_724761.shtml
Vision Pro开卖炸出各种显眼包,有人戴着开车…
https://www.guancha.cn/industry-science/2024_02_07_724738.shtml
我国编制首部脑机接口研究伦理指引
https://www.guancha.cn/industry-science/2024_02_05_724538.shtml
英伟达对华“阉割版”芯片已可接受预订,但经销商说…
https://www.guancha.cn/industry-science/2024_02_02_724305.shtml
研究:月球正在缩小,南极月震使月球基地可能没那么宜居
https://www.guancha.cn/industry-science/2024_02_02_724227.shtml
这项重大突破,避免了“美国人比中国人更了解中国人”
https://www.guancha.cn/industry-science/2024_02_01_724214.shtml
此前只有两个国家掌握这一技术,我国实现突破
https://www.guancha.cn/industry-science/2024_01_31_724012.shtml
对标GPT-4,讯飞星火V3.5发布
https://www.guancha.cn/industry-science/2024_01_30_723955.shtml
AI作品是否享有著作权?北京互联网法院曾判决支持
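The script above assumes that the ul is always found and that every li contains a link. As a hedged sketch of a more defensive variant, the parsing step could be wrapped in a function that skips incomplete items and builds absolute URLs with urljoin. The function name extract_links and the SAMPLE_HTML snippet are hypothetical, made up for illustration so the example runs without network access; they are not taken from the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (absolute_url, title) pairs for the li items of the target list."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("ul", {"class": "column-list fix"})
    if container is None:
        # The page layout changed or the request returned something unexpected
        return []
    results = []
    for node in container.find_all("li"):
        anchor = node.find("a")
        if anchor is None or not anchor.get("href"):
            # Skip list items that carry no usable link
            continue
        # urljoin handles both root-relative and already absolute hrefs
        link = urljoin(base_url, anchor["href"])
        # get_text also works when the anchor wraps nested tags
        title = anchor.get_text(" ", strip=True)
        results.append((link, title))
    return results


# Hypothetical markup, invented for this example only
SAMPLE_HTML = """
<ul class="column-list fix">
  <li><h4><a href="/industry-science/example_1.shtml">First headline</a></h4></li>
  <li><span>no link here</span></li>
  <li><a href="https://example.com/full.html"><b>Second</b> headline</a></li>
</ul>
"""

if __name__ == "__main__":
    for link, title in extract_links(SAMPLE_HTML, "https://www.guancha.cn"):
        print(link)
        print(title)
```

In a real run, response.text would be passed in place of SAMPLE_HTML. get_text is used here rather than .string because .string returns None when the anchor has more than one child node.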