Python Web Scraping Notes 01 (Fetching Web Page Data with requests)

1.1 Libraries used:

import requests
from bs4 import BeautifulSoup

1.2 Workflow

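# 1. Fetch the page; requests.get returns a Response object
res = requests.get(baseurl, params=params, headers=headers)
# params holds the query-string parameters and must be a dict
# 2. Set the encoding used to decode the response body
res.encoding = 'utf-8'
# 3. Hand res.text to BeautifulSoup for parsing
soup = BeautifulSoup(res.text, 'lxml')
# The second argument selects the parser
# 4. Use soup to locate elements and work with them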
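
Steps 1 and 2 can be tried without touching the network. The sketch below is only an illustration: the URL, parameters and User-Agent are made-up values, and the request is prepared but never sent, so you can see how the params dict ends up in the query string. The second half shows why step 2 matters: the same bytes give different text depending on the encoding used to decode them.

```python
import requests

# Hypothetical values, chosen only for illustration.
baseurl = 'https://example.com/search'
params = {'q': 'python', 'page': '2'}
headers = {'User-Agent': 'Mozilla/5.0'}

# Build and prepare the request without sending it, to inspect the final URL.
prepared = requests.Request('GET', baseurl, params=params, headers=headers).prepare()
print(prepared.url)

# Step 2 in miniature: identical bytes, decoded with two different encodings.
raw = '湖北省'.encode('utf-8')
right = raw.decode('utf-8')        # the original text
wrong = raw.decode('iso-8859-1')   # mojibake: one character per byte
print(right == '湖北省', wrong == '湖北省')
```

If the page declares no charset, requests may fall back to a default that does not match the page, which is why the notes set res.encoding by hand before reading res.text.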

1.3 Functions used

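1. bs4.element.Tag.find('tag name', attrs={'attribute name': 'attribute value'})
Return type: bs4.element.Tag (None if nothing matches)
Example: city.find('div')
2. bs4.element.Tag.find_all('tag name', attrs={'attribute name': 'attribute value'})
Return type: bs4.element.ResultSet (a subclass of list)
Example: cities = table.find_all('tr')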
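
To see the difference between the two calls, here is a minimal sketch on a made-up HTML fragment (the table contents and class names are assumptions for illustration). It uses Python's built-in 'html.parser' so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only to show the two calls.
html = '''
<table>
  <tr><td class="name">A</td><td class="code">1</td></tr>
  <tr><td class="name">B</td><td class="code">2</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

first_row = soup.find('tr')                            # first match only: a Tag
rows = soup.find_all('tr')                             # every match: a list-like ResultSet
names = soup.find_all('td', attrs={'class': 'name'})   # filter by attribute
missing = soup.find('div')                             # no match: None

print(type(first_row).__name__, len(rows))
print([td.text for td in names])
print(missing)
```

Because find returns None when nothing matches, chaining .text directly onto it raises AttributeError on an unexpected page layout, so a None check is worth adding in real scrapers.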

1.4 Example: scraping the administrative divisions of Hubei Province from Baidu Baike

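# Import the two libraries
import requests
from bs4 import BeautifulSoup
# Request headers with a browser User-Agent, so the site is less likely to flag the request as a crawler
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53'
}
# Page URL
url = 'https://baike.baidu.com/item/湖北省行政区划'
rs = requests.get(url=url, headers=headers)
rs.encoding = 'utf-8'
# Parse the response with BeautifulSoup
soup = BeautifulSoup(rs.text, 'lxml')
# Pick the administrative-divisions table (index 1, i.e. the second <table> on the page)
table = soup.find_all('table')[1]
# Find the rows of the table
cities = table.find_all('tr')
# Process each row, skipping the header row
for city in cities[1:]:
    # Prefecture-level unit: the first <div> in the row
    city_level = city.find('div').text
    # County-level units: the remaining <div>s in the row
    counties_level = city.find_all('div')[1:]
    counties = []
    # Split each scraped string on the '、' separator
    for item in counties_level:
        counties.append(item.text.split('、'))
    # Also add an entry for the urban area ('市区') of every city
    counties.append(['市区'])
    for county in counties:
        for item in county:
            # Skip the empty strings left over from splitting
            if item != '':
                # '湖北省' is Hubei Province
                print('湖北省', city_level, item)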
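
The row-processing logic can be checked offline before pointing it at the live page. The sketch below is an assumption-laden illustration, not the page's real markup: SAMPLE is a made-up fragment that imitates the layout the example relies on (first div in a row is the city, later divs hold county names separated by '、'), and parse_rows is a hypothetical helper name. It leaves out the extra '市区' entry and skips rows that do not contain any div.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the assumed row layout.
SAMPLE = '''
<table>
  <tr><th>City</th><th>Counties</th></tr>
  <tr><td><div>CityA</div></td><td><div>X、Y、</div><div>Z</div></td></tr>
  <tr><td><div>CityB</div></td><td><div>P、Q</div></td></tr>
</table>
'''

def parse_rows(html):
    """Return (city, county) pairs from the first table in html."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    pairs = []
    for row in table.find_all('tr')[1:]:       # skip the header row
        divs = row.find_all('div')
        if not divs:                           # row without the expected layout
            continue
        city = divs[0].text.strip()
        for div in divs[1:]:
            for county in div.text.split('、'):
                if county.strip():             # drop empty pieces left by split
                    pairs.append((city, county.strip()))
    return pairs

pairs = parse_rows(SAMPLE)
for city, county in pairs:
    print(city, county)
```

Returning a list of pairs instead of printing inside the loop makes the result easy to test or write to a file later.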

posted @ 2022-07-13 17:09 xiiii
