Python in practice: scraping web pages with requests + BeautifulSoup

I. Install requests

1. Install with pip

(venv) liuhongdi@192 news % pip3 install requests

2. Check the version of the installed library:

(venv) liuhongdi@192 news % pip3 show requests
Name: requests
Version: 2.31.0
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache 2.0
Location: /Users/liuhongdi/python_work/tutorial/news/venv/lib/python3.12/site-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by:

II. Install BeautifulSoup

1. Install with pip

(venv) liuhongdi@192 news % pip3 install beautifulsoup4

2. Check the details of the installed library:

(venv) liuhongdi@192 news % pip3 show beautifulsoup4
Name: beautifulsoup4
Version: 4.12.3
Summary: Screen-scraping library
Home-page:
Author:
Author-email: Leonard Richardson <leonardr@segfault.org>
License: MIT License
Location: /Users/liuhongdi/python_work/tutorial/news/venv/lib/python3.12/site-packages
Requires: soupsieve
Required-by:
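
The same version check can also be done from inside Python, which is handy when a script should fail early if a dependency is missing. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    # installed_version is a hypothetical helper name used for illustration.
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as used on the pip command line above.
    for name in ("requests", "beautifulsoup4"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'not installed'}")
```

Note that the lookup uses the distribution name (beautifulsoup4), not the import name (bs4).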

Note: 刘宏缔的架构森林 (Liu Hongdi's Architecture Forest), a blog focused on IT technology
Website: https://imgtouch.com
This article: https://blog.imgtouch.com/index.php/2024/02/20/python-shi-zhan-yong-requests-zuo-pa-chong/
Code: https://github.com/liuhongdi/ or https://gitee.com/liuhongdi
Author: 刘宏缔 (Liu Hongdi), email: 371125307@qq.com

III. Fetch a page with requests and parse it with BeautifulSoup

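import requests
from bs4 import BeautifulSoup

# Scrape the technology channel of guancha.cn (观察者网)

# Host prefix for the relative links found in the page
base_url = "https://www.guancha.cn"

# Page to fetch
url = "https://www.guancha.cn/GongYe%C2%B7KeJi/list_1.shtml"
# Query parameters (none are needed for this page)
params = {}
# Request headers: send a browser-like User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}

# Fetch the page; the timeout keeps the script from hanging on a stalled connection
response = requests.get(url, params=params, headers=headers, timeout=10)
# Stop here if the server returned an HTTP error status (4xx/5xx)
response.raise_for_status()
# print(response.text)

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find the <ul> element that holds the article list
container = soup.find('ul', {'class': 'column-list fix'})

# Get every <li> under that <ul>
nodes = container.find_all('li')

# Print the absolute link and the title of each article
for node in nodes:
    anchor = node.find('a')
    link = base_url + anchor['href']
    print(link)
    text = anchor.string
    print(text)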
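
The parsing step can be tried offline, without hitting the site, by feeding BeautifulSoup a small HTML string. The sketch below is not the author's code: the sample markup, its example paths, and the extract_links helper are all hypothetical. It also swaps in two more defensive choices: urllib.parse.urljoin builds the absolute link (so an href that is already absolute is left alone), and get_text replaces .string, because .string returns None when the <a> tag has more than one child.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the list the script above looks for.
SAMPLE_HTML = """
<ul class="column-list fix">
  <li><h4><a href="/industry-science/example_1.shtml">First title</a></h4></li>
  <li><h4><a href="https://www.guancha.cn/industry-science/example_2.shtml"><span>Second</span> title</a></h4></li>
  <li>No link in this item</li>
</ul>
"""


def extract_links(html, base_url):
    """Return (absolute_link, title) pairs from the article list in html."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("ul", {"class": "column-list fix"})
    if container is None:
        # The page layout changed or the request returned something unexpected.
        return []
    results = []
    for node in container.find_all("li"):
        anchor = node.find("a")
        if anchor is None or not anchor.get("href"):
            continue  # skip list items that carry no link
        link = urljoin(base_url, anchor["href"])
        # Join nested text fragments with a space and trim the whitespace.
        title = anchor.get_text(" ", strip=True)
        results.append((link, title))
    return results


if __name__ == "__main__":
    for link, title in extract_links(SAMPLE_HTML, "https://www.guancha.cn"):
        print(link)
        print(title)
```

Keeping the parsing in a function that takes an HTML string makes it easy to test against saved pages before running it on live responses.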

Output (the article titles are printed in Chinese, as published on the site):

https://www.guancha.cn/industry-science/2024_02_20_725732.shtml
在南极取得突破!国产极地重型载具完成技术测试
https://www.guancha.cn/industry-science/2024_02_20_725718.shtml
我国科学家发现非常规反铁磁体
https://www.guancha.cn/industry-science/2024_02_17_725482.shtml
我国猴痘mRNA疫苗将进入临床试验
https://www.guancha.cn/industry-science/2024_02_17_725468.shtml
首飞失败后,日本新型H3火箭2号机发射升空
https://www.guancha.cn/industry-science/2024_02_17_725467.shtml
“我们真的看到新工业革命来临”?
https://www.guancha.cn/industry-science/2024_02_17_725454.shtml
OpenAI视频生成模型,会让哪些人失业?
https://www.guancha.cn/industry-science/2024_02_16_725430.shtml
OpenAI发布首个视频生成模型:输文字出视频,1分钟流畅高清
https://www.guancha.cn/industry-science/2024_02_16_725411.shtml
自主研制离子成像技术探测量子态,我国科学家有了新发现
https://www.guancha.cn/industry-science/2024_02_14_725323.shtml
新春伊始,一批大国重器取得新突破
https://www.guancha.cn/industry-science/2024_02_14_725314.shtml
微型机器人在国际空间站首次模拟手术任务
https://www.guancha.cn/industry-science/2024_02_12_725209.shtml
实现突破性进展!这一领域,我国处于全球第一梯队
https://www.guancha.cn/industry-science/2024_02_08_724888.shtml
向理解高温超导机理迈出重要一步,中国科学家首次观测到
https://www.guancha.cn/industry-science/2024_02_07_724761.shtml
Vision Pro开卖炸出各种显眼包,有人戴着开车…
https://www.guancha.cn/industry-science/2024_02_07_724738.shtml
我国编制首部脑机接口研究伦理指引
https://www.guancha.cn/industry-science/2024_02_05_724538.shtml
英伟达对华“阉割版”芯片已可接受预订,但经销商说…
https://www.guancha.cn/industry-science/2024_02_02_724305.shtml
研究:月球正在缩小,南极月震使月球基地可能没那么宜居
https://www.guancha.cn/industry-science/2024_02_02_724227.shtml
这项重大突破,避免了“美国人比中国人更了解中国人”
https://www.guancha.cn/industry-science/2024_02_01_724214.shtml
此前只有两个国家掌握这一技术,我国实现突破
https://www.guancha.cn/industry-science/2024_01_31_724012.shtml
对标GPT-4,讯飞星火V3.5发布
https://www.guancha.cn/industry-science/2024_01_30_723955.shtml
AI作品是否享有著作权?北京互联网法院曾判决支持
Posted 2024-02-20 16:26 by 刘宏缔的架构森林