"Hand in Hand into the Python World", Part 2
The Web Scraping Series
-
What is a web crawler? What kinds of crawlers are there? What kind of job can I look for after learning web scraping, and how much does it pay?
-
Which modules and frameworks are commonly used for web scraping?
- Commonly used modules include requests, bs4, lxml, re, selenium, and appium (a minimal sketch of how the first two fit together follows after this list).
- Common scraping frameworks include Scrapy and PySpider.
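As a quick illustration of the most common pairing, requests fetches the page and BeautifulSoup (bs4) parses it. This is only a minimal sketch; the URL is a placeholder, not a site used later in this article.

import requests
from bs4 import BeautifulSoup

# Minimal fetch-and-parse sketch; example.com is just a placeholder URL
response = requests.get('https://example.com', timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text if soup.title else 'no <title> found')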
-
Appetizer: a small script for online translation
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

def main(keys=''):
    url = 'http://fy.iciba.com/ajax.php?a=fy'
    data = {
        'f': 'auto',
        't': 'auto',
        'w': keys
    }
    response = requests.post(url, headers=headers, data=data)
    info = response.text
    data_list = json.loads(info)
    try:
        val = data_list['content']['word_mean']  # Chinese to English
    except KeyError:
        val = data_list['content']['out']        # English to Chinese
    return val

if __name__ == '__main__':
    keys = input('Please enter the English or Chinese text to translate...')
    if not keys:
        print('Please enter valid Chinese or English text to translate...')
    else:
        data = main(keys)
        print(data)
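The request above will hang or raise if the network is flaky. A hedged variant that adds a timeout and basic error handling might look like the sketch below; it assumes the same endpoint and response shape as the script above, and only the error handling is new.

import requests

def translate(keys, timeout=10):
    # Same iciba endpoint and payload as above; timeout and exception handling are additions
    url = 'http://fy.iciba.com/ajax.php?a=fy'
    data = {'f': 'auto', 't': 'auto', 'w': keys}
    try:
        response = requests.post(url, data=data, timeout=timeout)
        response.raise_for_status()
        content = response.json()['content']
    except (requests.RequestException, ValueError, KeyError) as exc:
        return 'translation failed: {}'.format(exc)
    # Prefer the dictionary entry (word_mean), fall back to the sentence translation (out)
    return content.get('word_mean') or content.get('out')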
-
Scraping the Douban Movies Top 250
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

# Fetch the page
response = requests.get(url=url, headers=headers)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

# The ranking list lives in <ol class="grid_view">
ol = soup.find(name='ol', attrs={'class': 'grid_view'})
li_list = ol.find_all(name='li')

# Prepare the Excel workbook
wb = Workbook()
sheet = wb.active
sheet['A1'].value = 'No.'
sheet['B1'].value = 'Title'
sheet['C1'].value = 'Rating'
sheet['D1'].value = 'Quote'
sheet['E1'].value = 'Image'

for index, li in enumerate(li_list, start=1):
    name = li.find(name='span', attrs={'class': 'title'})
    rate = li.find(name='span', attrs={'class': 'rating_num'})
    inq = li.find(name='span', attrs={'class': 'inq'})
    img = li.find(name='img')
    imgs = img['src']
    sheet['A' + str(index + 1)].value = index
    sheet['B' + str(index + 1)].value = name.text
    sheet['C' + str(index + 1)].value = rate.text
    sheet['D' + str(index + 1)].value = inq.text if inq else ''  # some entries have no quote
    sheet['E' + str(index + 1)].value = imgs

wb.save('douban.xlsx')
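The page above only lists the first 25 entries; the remaining pages of the Top 250 are reached through the start query parameter (0, 25, 50, ...). A sketch of the paging loop is shown below; parse_page is a hypothetical helper that mirrors the span lookups in the script above, and the shortened User-Agent is just for the sketch.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # shortened UA, sketch only

def parse_page(html):
    # Hypothetical helper: extract (title, rating) pairs from one page
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for li in soup.find(name='ol', attrs={'class': 'grid_view'}).find_all(name='li'):
        title = li.find(name='span', attrs={'class': 'title'}).text
        rating = li.find(name='span', attrs={'class': 'rating_num'}).text
        rows.append((title, rating))
    return rows

all_rows = []
for start in range(0, 250, 25):  # 10 pages, 25 movies each
    url = 'https://movie.douban.com/top250?start={}'.format(start)
    response = requests.get(url, headers=headers, timeout=10)
    all_rows.extend(parse_page(response.text))

print(len(all_rows))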
-
Scraping Autohome news
import requests
from bs4 import BeautifulSoup

def run(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    soup = BeautifulSoup(response.text, 'html.parser')
    # Locate the <ul class="article"> news list
    ul = soup.find(name='ul', attrs={'class': 'article'})
    # Get all the <li> entries
    li_list = ul.find_all(name='li')
    infos = []
    for li in li_list:
        name = li.find(name='h3')
        name1 = name.text if name else ''
        href = li.find(name='a')
        href1 = 'http:' + href['href'] if href else ''
        info = li.find(name='p')
        info1 = info.text if info else ''
        infos.append({'title': name1, 'href': href1, 'info': info1})
    print(infos)

if __name__ == '__main__':
    url = 'https://www.autohome.com.cn/news/'
    run(url)
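This script only prints the results. If you want the same Excel output as the Douban example, a small sketch with openpyxl could look like the following; it assumes run() is changed to return infos instead of printing it, and save_to_excel is a hypothetical helper name.

from openpyxl import Workbook

def save_to_excel(infos, filename='autohome.xlsx'):
    # infos is the list of {'title', 'href', 'info'} dicts built in run()
    wb = Workbook()
    sheet = wb.active
    sheet.append(['Title', 'Link', 'Summary'])  # header row
    for item in infos:
        sheet.append([item['title'], item['href'], item['info']])
    wb.save(filename)

Usage: after changing print(infos) to return infos in run(), call save_to_excel(run(url)).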