[Python Web Scraping] Scraping page content and images, plus scraping with Selenium
Not much written description below, just the essentials.

First, install the necessary libraries:

# Install BeautifulSoup
pip install beautifulsoup4
# Install requests
pip install requests
# Install selenium (needed for example ⑤ below)
pip install selenium

Next, on to the code!
① Scraping h4 text from a site that redirects
import requests
from bs4 import BeautifulSoup

# Scrape the h4 headings from a site that redirects
url = "http://www.itest.info/courses"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for courses in soup.find_all('h4'):
    print(courses.text)
    print()
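Since the heading mentions a redirecting site, note that requests follows redirects automatically. If you want to see the redirect chain yourself, here is a minimal sketch (the URL is reused from above; the rest is my own illustration, not from the original post):

import requests

r = requests.get("http://www.itest.info/courses", allow_redirects=True)
# r.history holds the intermediate redirect responses; r.url is the final URL
for resp in r.history:
    print(resp.status_code, resp.url)
print("Final:", r.status_code, r.url)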
② Scraping titles from V2EX
import requests
from bs4 import BeautifulSoup

# Scrape topic titles from V2EX
url = "https://www.v2ex.com"
v2ex = BeautifulSoup(requests.get(url).text, 'html.parser')
for span in v2ex.find_all('span', class_='item_hot_topic_title'):
    print(span.find('a').text, span.find('a')['href'])
for title in v2ex.find_all("a", class_="topic-link"):
    print(title.text, url + title["href"])
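The url + title["href"] concatenation only works when every href is root-relative. A more robust approach (my own suggestion, not from the original post) is urllib.parse.urljoin, which handles relative and absolute hrefs alike:

from urllib.parse import urljoin

base = "https://www.v2ex.com"
# urljoin copes with root-relative, relative, and absolute hrefs
print(urljoin(base, "/t/12345"))               # -> https://www.v2ex.com/t/12345
print(urljoin(base, "https://example.com/x"))  # absolute hrefs pass through unchanged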
③ Scraping images from Jandan
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

def download_file(url):
    '''Download an image'''
    print('Downloading %s' % url)
    local_filename = url.split('/')[-1]
    # Directory where the images are saved
    img_path = "/Users/zhangc/Desktop/GitTest/project_Buger_2/Python爬虫/img/" + local_filename
    print(local_filename)
    r = requests.get(url, stream=True, headers=headers)
    with open(img_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()
    return img_path

url = 'http://jandan.net/drawings'
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

def valid_img(src):
    '''Check whether the address matches the expected keywords'''
    return src.endswith('jpg') and '.sinaimg.cn' in src

for img in soup.find_all('img', src=valid_img):
    src = img['src']
    if not src.startswith('http'):
        src = 'http:' + src
    download_file(src)
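download_file assumes the img/ directory already exists and that every request succeeds. A slightly more defensive variant (my own sketch; the relative out_dir default is just an example) creates the directory on demand and fails early on HTTP errors:

import requests
from pathlib import Path

def download_file_safe(url, out_dir="img"):
    '''Download url into out_dir, creating the directory if needed.'''
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    local_path = Path(out_dir) / url.split('/')[-1]
    r = requests.get(url, stream=True, timeout=10)
    r.raise_for_status()  # stop on 4xx/5xx instead of writing an empty file
    with open(local_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
    return local_path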
④ Scraping Zhihu hot topic titles
import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
}
url = "https://www.zhihu.com/explore"
zhihu = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
for title in zhihu.find_all('a', class_="ExploreSpecialCard-contentTitle"):
    print(title.text)
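If the loop prints nothing, check the raw response first: Zhihu may serve a login wall, or render the cards with JavaScript, which is exactly what the Selenium version below works around. A quick hedged check (the minimal UA string is my own placeholder):

import requests

headers = {"user-agent": "Mozilla/5.0"}  # any browser-like UA
r = requests.get("https://www.zhihu.com/explore", headers=headers, timeout=10)
print(r.status_code)                   # 200 means the page itself was served
print("ExploreSpecialCard" in r.text)  # False suggests the cards are rendered client-side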
⑤ Scraping Zhihu hot topics with Selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium version of the Zhihu scraper
url = "https://www.zhihu.com/explore"
driver = webdriver.Chrome(service=Service("/Users/zhangc/Desktop/GitTest/project_Buger_2/poium测试库/tools/chromedriver"))
driver.get(url)
info = driver.find_element(By.CSS_SELECTOR, "div.ExploreHomePage-specials")
for title in info.find_elements(By.CSS_SELECTOR, "div.ExploreHomePage-specialCard > div.ExploreSpecialCard-contentList > div.ExploreSpecialCard-contentItem > a.ExploreSpecialCard-contentTitle"):
    print(title.text, title.get_attribute('href'))
driver.quit()
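Because the Explore page loads content dynamically, the find_element call can fire before the cards exist. A common remedy is an explicit wait; here is a self-contained sketch (it assumes chromedriver is on PATH, unlike the hard-coded path above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://www.zhihu.com/explore")
# Wait up to 10 seconds for the specials container before touching it
info = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.ExploreHomePage-specials")))
print(info.tag_name)
driver.quit()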
Without accumulating small steps, one cannot travel a thousand li; without gathering small streams, one cannot form rivers and seas.
If you repost this article, please also take a look at my blog: https://www.cnblogs.com/Owen-ET/;
My GitHub: https://github.com/Owen-ET
Neither good nor evil is the essence of the mind; with good and with evil comes the movement of intention; knowing good and evil is innate knowledge; doing good and removing evil is the investigation of things.