Python Web Scraping, Learn as You Go (Installing and Using bs4)

The BeautifulSoup Module

I. Installation

pip install bs4
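
A quick way to confirm the install worked is to import the package and parse a tiny snippet. This is only a minimal sketch: the HTML string is made up for the check, and the version printed will depend on your environment.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version of the bs4 package
print(bs4.__version__)

# Parse a throwaway snippet with Python's built-in parser
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.text)
```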

II. Usage

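Using bs4 requires some familiarity with basic HTML syntax. As an example, we will try to scrape agricultural product prices from the Beijing Xinfadi market. URL: http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml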

The steps are as follows:

1. Fetch the page

Use requests to get the page content

import requests

url = 'http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml'
resp = requests.get(url)
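
In practice a bare requests.get can hang or quietly return an error page. The sketch below wraps the fetch in a small helper that sets a timeout, fails on HTTP errors, and lets requests guess the text encoding. The helper name fetch_html, the 10 second timeout, and the User-Agent value are assumptions chosen for illustration, not something the target site is known to require.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of url, raising on HTTP errors.

    fetch_html is a hypothetical helper; the timeout and header
    values are illustrative assumptions.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()                  # raise for 4xx / 5xx responses
    resp.encoding = resp.apparent_encoding   # use the encoding detected from the body
    return resp.text


# Usage (performs a real network request):
# html = fetch_html('http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml')
```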

2. Parse the data

Hand the page source to BeautifulSoup, which produces a bs object

import requests
from bs4 import BeautifulSoup

url = 'http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml'
resp = requests.get(url)

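page = BeautifulSoup(resp.text, 'html.parser')  # use Python's built-in HTML parser

# Look up data in the bs object
# find(tag, attribute=value) returns the first match
# find_all(tag, attribute=value) returns every match
# table = page.find("table", class_="hq_table")
table = page.find("table", attrs={"class": "hq_table"})

trs = table.find_all("tr")[1:]  # every row except the first (the header)
for tr in trs:  # iterate over the rows
    tds = tr.find_all("td")  # all td cells in this row
    name = tds[0].text  # .text returns the text inside the tag; .get_text() is equivalent, and .string also works when the tag has a single child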
    low = tds[1].text
    avg = tds[2].text
    high = tds[3].text
    gui = tds[4].text
    kind = tds[5].text
    date = tds[6].text
    print(name, low, avg, high, gui, kind, date)
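
The same lookup can be tried offline on a small inline table, which makes it easier to see what find and find_all return. This is a sketch under stated assumptions: the HTML and its values are invented for illustration, only the class name hq_table is taken from the code above, and parse_rows is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; only the class name mirrors the real page
html = """
<table class="hq_table">
  <tr><td>Name</td><td>Low</td><td>Avg</td><td>High</td></tr>
  <tr><td> Cabbage </td><td>0.5</td><td>0.6</td><td>0.7</td></tr>
  <tr><td>Carrot</td><td>1.0</td><td>1.2</td><td>1.4</td></tr>
</table>
"""


def parse_rows(html):
    """Return the data rows of the hq_table table as lists of strings."""
    page = BeautifulSoup(html, "html.parser")
    table = page.find("table", class_="hq_table")
    if table is None:  # find returns None when nothing matches
        return []
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        # get_text(strip=True) trims surrounding whitespace in each cell
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


rows = parse_rows(html)
print(rows)
```

Checking for None before calling find_all avoids an AttributeError when the page layout changes and the table is no longer found.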

3. Save the data

Write the parsed data to a csv file

import requests
from bs4 import BeautifulSoup
import csv

url = 'http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml'
resp = requests.get(url)

# Hand the page source to BeautifulSoup, which produces a bs object
page = BeautifulSoup(resp.text, 'html.parser')  # use Python's built-in HTML parser

# Look up data in the bs object
# find(tag, attribute=value) returns the first match
# find_all(tag, attribute=value) returns every match
# table = page.find("table", class_="hq_table")
table = page.find("table", attrs={"class": "hq_table"})

trs = table.find_all("tr")[1:]  # every row except the first (the header)

# The with statement closes the file even if a row fails to parse
with open('菜价.csv', 'w', encoding='utf-8', newline='') as f:
    csvwriter = csv.writer(f)
    for tr in trs:  # iterate over the rows
        tds = tr.find_all("td")  # all td cells in this row
        name = tds[0].text  # .text returns the text inside the tag
        low = tds[1].text
        avg = tds[2].text
        high = tds[3].text
        gui = tds[4].text
        kind = tds[5].text
        date = tds[6].text
        # print(name, low, avg, high, gui, kind, date)
        csvwriter.writerow([name, low, avg, high, gui, kind, date])
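
The writing step can be checked without touching the network or the disk by sending rows to an in-memory buffer. The sketch below is illustrative only: write_rows is a hypothetical helper, the sample row is invented, and the header labels (in particular spec and unit for the gui and kind columns) are guesses at what those columns mean.

```python
import csv
import io


def write_rows(rows, fileobj):
    """Write a header line followed by rows to an open text file object."""
    writer = csv.writer(fileobj)
    # Header names are assumptions about the column meanings
    writer.writerow(["name", "low", "avg", "high", "spec", "unit", "date"])
    writer.writerows(rows)


# Made-up sample row, written to memory instead of a real file
buffer = io.StringIO()
write_rows([["Cabbage", "0.5", "0.6", "0.7", "", "jin", "2021-03-31"]], buffer)
print(buffer.getvalue())
```

To write a real file, pass a file object opened with newline='' (as in the code above) in place of the buffer, so the csv module controls the line endings itself.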

III. Example

Scraping desktop wallpapers

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://www.umei.cc/bizhitupian/weimeibizhi/'
resp = requests.get(url)
resp.encoding = 'utf-8'  # set the encoding explicitly to avoid garbled text

os.makedirs('img', exist_ok=True)  # the output directory must exist before writing

# print(resp.text)
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="TypeList").find_all('a')  # chained lookup
# print(alist)
for a in alist:
    href = urljoin(url, a.get("href"))  # .get() reads an attribute value; urljoin also resolves relative links
    # Fetch the source of the child page
    child_page_resp = requests.get(href)
    child_page_resp.encoding = 'utf-8'
    # Extract the image download URL from the child page
    child_page = BeautifulSoup(child_page_resp.text, "html.parser")
    p = child_page.find('p', align="center")  # find the <p align="center"> tag
    img = p.find('img')  # find the <img> tag inside it
    file_name = img.get('alt') + '.jpg'
    file_content = requests.get(urljoin(href, img.get('src'))).content  # .content is the raw bytes
    with open(os.path.join('img', file_name), 'wb') as f:
        f.write(file_content)
print('over!')
# print(img.get('alt'), img.get('src'))  # attributes can also be read as img['alt'] or img['src']
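
Two details in the loop above tend to break in practice: links in the page may be relative, and an alt text may contain characters that are not allowed in file names. The sketch below shows one way to handle both. The helpers absolute_url and safe_file_name are hypothetical, the relative path 1.htm and the sample title are invented, and the set of replaced characters is an assumption based on what Windows forbids in file names.

```python
import re
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href against the page it was found on."""
    return urljoin(page_url, href)


def safe_file_name(title, ext=".jpg"):
    """Replace characters that are invalid in Windows file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title or "").strip()
    return (cleaned or "untitled") + ext  # fall back when alt is missing or empty


# Hypothetical relative link and title, for illustration only
print(absolute_url("https://www.umei.cc/bizhitupian/weimeibizhi/",
                   "/bizhitupian/weimeibizhi/1.htm"))
print(safe_file_name('sea/sky: "dawn"'))
```

urljoin leaves links that are already absolute unchanged, so it is safe to apply to every href.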

posted @ 2021-03-31 16:12 wangshanglinju
