Python in practice: scraping web pages with requests + BeautifulSoup

I. Install requests

1. Install with pip

(venv) liuhongdi@192 news % pip3 install requests

2. Check the version of the installed library:

(venv) liuhongdi@192 news % pip3 show requests
Name: requests
Version: 2.31.0
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache 2.0
Location: /Users/liuhongdi/python_work/tutorial/news/venv/lib/python3.12/site-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by:

II. Install BeautifulSoup

1. Install with pip

(venv) liuhongdi@192 news % pip3 install beautifulsoup4

2. Check the information for the installed library:

(venv) liuhongdi@192 news % pip3 show beautifulsoup4
Name: beautifulsoup4
Version: 4.12.3
Summary: Screen-scraping library
Home-page:
Author:
Author-email: Leonard Richardson <leonardr@segfault.org>
License: MIT License
Location: /Users/liuhongdi/python_work/tutorial/news/venv/lib/python3.12/site-packages
Requires: soupsieve
Required-by:
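
The same version check can be done from inside Python, which is handy when a script should verify its dependencies before it runs. The following is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); installed_version is a hypothetical helper name, not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Use the distribution names passed to pip, not the import names
    # (beautifulsoup4 is imported as bs4).
    for name in ("requests", "beautifulsoup4"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'not installed'}")
```

Run inside the same virtual environment as the pip3 commands above, otherwise the lookup reports on a different set of installed packages.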

Note: Liu Hongdi's Architecture Forest (刘宏缔的架构森林), a blog focused on IT technology
Site: https://imgtouch.com
This article: https://blog.imgtouch.com/index.php/2024/02/20/python-shi-zhan-yong-requests-zuo-pa-chong/
Code: https://github.com/liuhongdi/ or https://gitee.com/liuhongdi
Author: Liu Hongdi (刘宏缔), email: 371125307@qq.com

III. Fetch a page with requests + BeautifulSoup and parse it

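import requests
from bs4 import BeautifulSoup

# Scrape the science and technology channel of guancha.cn (观察者网)

# Host prefix for the links found in the page
base_url = "https://www.guancha.cn"

# Page to crawl
url = "https://www.guancha.cn/GongYe%C2%B7KeJi/list_1.shtml"
# Query parameters (none are needed for this page)
params = {}
# Request headers: send a browser-like User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}

# Fetch the page; give up after 10 seconds and raise on an HTTP error status
response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()
# print(response.text)

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find the <ul> that holds the article list
container = soup.find('ul', {'class': 'column-list fix'})
if container is None:
    raise SystemExit("Article list not found; the page layout may have changed")

# Get every <li> under the <ul>
nodes = container.find_all('li')

# Walk the items and print each article's link and title
for node in nodes:
    anchor = node.find('a')
    if anchor is None:
        # Skip list items that contain no link
        continue
    link = base_url + anchor['href']
    print(link)
    text = anchor.string
    print(text)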

Output:

https://www.guancha.cn/industry-science/2024_02_20_725732.shtml
在南极取得突破！国产极地重型载具完成技术测试
https://www.guancha.cn/industry-science/2024_02_20_725718.shtml
我国科学家发现非常规反铁磁体
https://www.guancha.cn/industry-science/2024_02_17_725482.shtml
我国猴痘mRNA疫苗将进入临床试验
https://www.guancha.cn/industry-science/2024_02_17_725468.shtml
首飞失败后，日本新型H3火箭2号机发射升空
https://www.guancha.cn/industry-science/2024_02_17_725467.shtml
“我们真的看到新工业革命来临”？
https://www.guancha.cn/industry-science/2024_02_17_725454.shtml
OpenAI视频生成模型，会让哪些人失业？
https://www.guancha.cn/industry-science/2024_02_16_725430.shtml
OpenAI发布首个视频生成模型：输文字出视频，1分钟流畅高清
https://www.guancha.cn/industry-science/2024_02_16_725411.shtml
自主研制离子成像技术探测量子态，我国科学家有了新发现
https://www.guancha.cn/industry-science/2024_02_14_725323.shtml
新春伊始，一批大国重器取得新突破
https://www.guancha.cn/industry-science/2024_02_14_725314.shtml
微型机器人在国际空间站首次模拟手术任务
https://www.guancha.cn/industry-science/2024_02_12_725209.shtml
实现突破性进展！这一领域，我国处于全球第一梯队
https://www.guancha.cn/industry-science/2024_02_08_724888.shtml
向理解高温超导机理迈出重要一步，中国科学家首次观测到
https://www.guancha.cn/industry-science/2024_02_07_724761.shtml
Vision Pro开卖炸出各种显眼包，有人戴着开车…
https://www.guancha.cn/industry-science/2024_02_07_724738.shtml
我国编制首部脑机接口研究伦理指引
https://www.guancha.cn/industry-science/2024_02_05_724538.shtml
英伟达对华“阉割版”芯片已可接受预订，但经销商说…
https://www.guancha.cn/industry-science/2024_02_02_724305.shtml
研究：月球正在缩小，南极月震使月球基地可能没那么宜居
https://www.guancha.cn/industry-science/2024_02_02_724227.shtml
这项重大突破，避免了“美国人比中国人更了解中国人”
https://www.guancha.cn/industry-science/2024_02_01_724214.shtml
此前只有两个国家掌握这一技术，我国实现突破
https://www.guancha.cn/industry-science/2024_01_31_724012.shtml
对标GPT-4，讯飞星火V3.5发布
https://www.guancha.cn/industry-science/2024_01_30_723955.shtml
AI作品是否享有著作权？北京互联网法院曾判决支持
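
The script builds each link by concatenating base_url with the href. That works here because, as the output shows, the hrefs on this page are root-relative paths. As a hedged sketch for pages where an href might already be absolute or protocol-relative, urljoin from the standard library resolves all of these cases. absolute_link is a hypothetical helper, and the example.com addresses are placeholders, not links from the real page.

```python
from urllib.parse import urljoin

# Same site root as base_url in the script above.
BASE_URL = "https://www.guancha.cn"


def absolute_link(href, base=BASE_URL):
    """Resolve an href taken from the page against the site root."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Root-relative href, the form this page uses
    print(absolute_link("/industry-science/2024_02_20_725732.shtml"))
    # Already absolute href (placeholder host): returned unchanged
    print(absolute_link("https://example.com/a.shtml"))
    # Protocol-relative href (placeholder host): inherits the https scheme
    print(absolute_link("//example.com/a.shtml"))
```

In the loop, the concatenation could then be swapped for a call to absolute_link with the href, with no other changes to the script.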
