Learning Python Web Scraping Step by Step (Installing and Using bs4)
The BeautifulSoup module
I. Installation
pip install bs4
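II. Usage
Using bs4 requires some familiarity with basic HTML syntax. As an exercise, we will scrape agricultural product prices from the Beijing Xinfadi market. URL: http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml
The steps are as follows:
1. Fetch the page
Use requests to fetch the page content.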
    import requests

    url = 'http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml'
    resp = requests.get(url)
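The request step can be made a little more defensive. The following is a minimal sketch, not part of the original walkthrough: the helper name `fetch_html`, the `User-Agent` value, and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its decoded text (hypothetical helper)."""
    # Some sites reject the default requests User-Agent; this value is an assumption.
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    resp.raise_for_status()
    # Use the encoding guessed from the response body to reduce garbled text.
    resp.encoding = resp.apparent_encoding
    return resp.text
```

Calling `fetch_html(url)` with the URL above would return the page source as a string, provided the site is reachable and still serves that address.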
2. Parse the data
Pass the page source to BeautifulSoup, which builds a bs object from it.
    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml'
    resp = requests.get(url)
    page = BeautifulSoup(resp.text, 'html.parser')  # specify the HTML parser

    # Look up data in the bs object:
    #   find(tag, attribute=value)      returns the first match
    #   find_all(tag, attribute=value)  returns every match
    # table = page.find("table", class_="hq_table")  # class_ because class is a Python keyword
    table = page.find("table", attrs={"class": "hq_table"})
    trs = table.find_all("tr")[1:]  # every row except the first (header) row
    for tr in trs:  # iterate over the rows
        tds = tr.find_all("td")  # all td cells in this row
        name = tds[0].text  # .text is the text inside the tag; .string or .get_text() also work
        low = tds[1].text
        avg = tds[2].text
        high = tds[3].text
        gui = tds[4].text  # specification
        kind = tds[5].text
        date = tds[6].text
        print(name, low, avg, high, gui, kind, date)
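To try `find` and `find_all` without a network connection, the same lookups can be run against an inline HTML string. This is a sketch under assumptions: `SAMPLE_HTML` and `parse_rows` are made up for illustration, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that imitates a price table.
SAMPLE_HTML = """
<table class="hq_table">
  <tr><td>Name</td><td>Low</td><td>Avg</td><td>High</td></tr>
  <tr><td>Cabbage</td><td>0.5</td><td>0.6</td><td>0.7</td></tr>
  <tr><td>Potato</td><td>1.0</td><td>1.2</td><td>1.4</td></tr>
</table>
"""


def parse_rows(html):
    """Return the table body as a list of rows, each a list of cell strings."""
    page = BeautifulSoup(html, "html.parser")
    table = page.find("table", class_="hq_table")
    if table is None:
        # find() returns None when nothing matches, so guard before chaining.
        return []
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Checking for `None` matters in practice: if the site changes its layout, `find` returns `None` and a chained `.find_all` would raise an `AttributeError`.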
3. Save the data
Write the parsed data to a CSV file.
    import csv

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml'
    resp = requests.get(url)

    # Pass the page source to BeautifulSoup, which builds a bs object
    page = BeautifulSoup(resp.text, 'html.parser')  # specify the HTML parser

    # Look up data in the bs object:
    #   find(tag, attribute=value)
    #   find_all(tag, attribute=value)
    # table = page.find("table", class_="hq_table")
    table = page.find("table", attrs={"class": "hq_table"})
    trs = table.find_all("tr")[1:]  # every row except the first (header) row

    # newline='' stops the csv module from writing blank lines between rows on Windows;
    # the with block closes the file even if an error occurs
    with open('菜价.csv', 'w', encoding='utf-8', newline='') as f:
        csvwriter = csv.writer(f)
        for tr in trs:  # iterate over the rows
            tds = tr.find_all("td")  # all td cells in this row
            name = tds[0].text  # .text is the text inside the tag
            low = tds[1].text
            avg = tds[2].text
            high = tds[3].text
            gui = tds[4].text  # specification
            kind = tds[5].text
            date = tds[6].text
            # print(name, low, avg, high, gui, kind, date)
            csvwriter.writerow([name, low, avg, high, gui, kind, date])
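The CSV step can be checked on its own by writing a few rows and reading them back. The sketch below uses only the standard library. The helper names, the header row, and the sample rows are assumptions for illustration, and the file goes to a temporary directory rather than `菜价.csv`.

```python
import csv
import os
import tempfile

# Hypothetical header and rows standing in for scraped data.
HEADER = ["name", "low", "avg", "high"]
rows = [
    ["Cabbage", "0.5", "0.6", "0.7"],
    ["Potato, washed", "1.0", "1.2", "1.4"],  # the comma is quoted automatically
]


def save_rows(path, header, rows):
    """Write a header line followed by the data rows."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows(path):
    """Read the file back as a list of rows."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "prices.csv")
save_rows(path, HEADER, rows)
loaded = load_rows(path)
print(loaded)
```

Writing a header row first makes the file easier to open in a spreadsheet. If Excel shows garbled Chinese text, `encoding="utf-8-sig"` is a common alternative.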
III. Example
Scraping desktop wallpapers
    import os

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.umei.cc/bizhitupian/weimeibizhi/'
    resp = requests.get(url)
    resp.encoding = 'utf-8'  # avoid garbled characters
    # print(resp.text)

    os.makedirs('img', exist_ok=True)  # make sure the output directory exists

    main_page = BeautifulSoup(resp.text, "html.parser")
    alist = main_page.find("div", class_="TypeList").find_all('a')  # chained lookup
    # print(alist)
    for a in alist:
        href = a.get("href")  # .get() returns the value of an attribute
        # Fetch the source of the child page
        child_page_resp = requests.get(href)
        child_page_resp.encoding = 'utf-8'
        # Extract the image download URL from the child page
        child_page = BeautifulSoup(child_page_resp.text, "html.parser")
        p = child_page.find('p', align="center")  # the <p align="center"> tag
        img = p.find('img')  # the <img> tag inside it
        file_name = img.get('alt') + '.jpg'
        file_content = requests.get(img.get('src')).content  # .content is the raw bytes
        with open(os.path.join('img', file_name), 'wb') as f:
            f.write(file_content)
        # print(img.get('alt'), img.get('src'))  # img['alt'] or img['src'] also work
    print('over!')
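Two details can trip up a crawler like this one: `href` values may be relative rather than absolute, and an `alt` text may contain characters that are not valid in file names. The following sketch uses only the standard library. The helper names and the sample paths (`1.htm`, `2.htm`) are hypothetical and not taken from the site.

```python
import re
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href against the page it was found on; absolute URLs pass through."""
    return urljoin(page_url, href)


def safe_file_name(title, ext=".jpg"):
    """Replace characters that Windows forbids in file names (hypothetical helper)."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or "untitled") + ext


base = "https://www.umei.cc/bizhitupian/weimeibizhi/"
print(absolute_url(base, "/bizhitupian/weimeibizhi/1.htm"))  # root-relative link
print(absolute_url(base, "2.htm"))                           # page-relative link
print(safe_file_name("sunset: lake/view"))
```

In the loop above, `requests.get(absolute_url(url, href))` and `safe_file_name(img.get('alt'))` would be the natural places to apply these helpers.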