一、Introduction to web crawlers
A web crawler (also called a web spider or web robot; less common names include ant, automatic indexer, emulator, or worm) is essentially a computer program or script that automatically crawls and downloads pages from the World Wide Web according to a defined set of rules and algorithms. It is a key component of search engines.
How most software works today: largely by sending HTTP requests and receiving data
PC-side web pages
mobile apps
A crawler simulates those HTTP requests to pull data from someone else's server
Getting past anti-scraping measures: every site defends itself differently, and this can get complicated
Send an HTTP request [requests, selenium] ----> third-party server ----> parse the wanted data out of the response [selenium, bs4] ----> store it (file, Excel, MySQL, Redis, MongoDB, ...)
scrapy: a dedicated crawler framework
Crawler protocol: every site serves a robots.txt under its root path; this file states which parts of the site may be crawled and which may not
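The robots.txt rules can be checked programmatically with the standard library's urllib.robotparser; the rules below are a made-up example, not any real site's file:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt body, parsed directly instead of fetched:
rules = """
User-agent: *
Disallow: /admin/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Ask whether a given User-Agent may fetch a given URL
print(rp.can_fetch('*', 'https://example.com/admin/secret'))  # False
print(rp.can_fetch('*', 'https://example.com/index.html'))    # True
```

In a real crawler you would call `rp.set_url('https://<site>/robots.txt')` followed by `rp.read()` to fetch the live file before asking `can_fetch`.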
二、Introduction to the requests module
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
三、GET requests
3.1 Basic GET request
pip install requests
import requests
info = requests.get('https://baike.baidu.com/item/%E4%B8%AD%E5%9B%BD%E5%8E%86%E5%8F%B2/152769')
print(info.text)
3.2 GET requests with params
import requests
# query string hand-encoded into the URL (%E5%8E%86%E5%8F%B2 is '历史' URL-encoded)
info = requests.get('https://www.baidu.com/s?wd=%E5%8E%86%E5%8F%B2')
print(info.text)
# or let requests build and encode the query string from a dict
info = requests.get('https://www.baidu.com/s', params={
    'wd': '历史',
    'country': 'china'
})
print(info.text)
# encode/decode URL escapes by hand with the standard library
from urllib import parse
info = parse.quote('历史')
print(info)   # %E5%8E%86%E5%8F%B2
info = parse.unquote('%E5%8E%86%E5%8F%B2')
print(info)   # 历史
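The params dict and the hand-encoded URL are equivalent; a PreparedRequest shows the final URL that requests would send, without any network traffic (a sketch using requests' public Request/prepare API):

```python
import requests
from urllib import parse

# Build the request but don't send it; .prepare() applies the URL encoding
req = requests.Request('GET', 'https://www.baidu.com/s', params={'wd': '历史'})
prepared = req.prepare()
print(prepared.url)         # https://www.baidu.com/s?wd=%E5%8E%86%E5%8F%B2
print(parse.quote('历史'))  # %E5%8E%86%E5%8F%B2 -- the same encoding
```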
3.3 GET requests with headers
import requests
# many sites reject requests that lack a browser-like User-Agent
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
info = requests.get('https://dig.chouti.com/', headers=header)
print(info.text)
3.4 Requests with cookies
# way 1: put the full Cookie string into the headers
import requests
data = {
    'linkId': '36996038'
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    'Cookie': 'deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI3MzAyZDQ5Yy1mMmUwLTRkZGItOTZlZi1hZGFmZTkwMDBhMTEiLCJleHBpcmUiOiIxNjYxNjU0MjYwNDk4In0.4Y4LLlAEWzBuPRK2_z7mBqz4Tw5h1WeqibvkBG6GM3I; __snaker__id=ozS67xizRqJGq819; YD00000980905869%3AWM_TID=M%2BzgJgGYDW5FVFVAVQbFGXQ654xCRHj8; _9755xjdesxxd_=32; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1666756750,1669172745; gdxidpyhxdE=W7WrUDABQTf1nd8a6mtt5TQ1fz0brhRweB%5CEJfQeiU61%5C1WnXIUkZH%2FrE4GnKkGDX767Jhco%2B7xUMCiiSlj4h%2BRqcaNohAkeHsmj3GCp2%2Fcj4HmXsMVPPGClgf5AbhAiztHgnbAz1Xt%5CIW9DMZ6nLg9QSBQbbeJSBiUGK1RxzomMYSU5%3A1669174630494; YD00000980905869%3AWM_NI=OP403nvDkmWQPgvYedeJvYJTN18%2FWgzQ2wM3g3aA3Xov4UKwq1bx3njEg2pVCcbCfP9dl1RnAZm5b9KL2cYY9eA0DkeJo1zfCWViwVZUm303JyNdJVAEOJ1%2FH%2BJFZxYgMVI%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee92bb45a398f8d1b34ab5a88bb7c54e839b8aacc1528bb8ad89d45cb48ae1aac22af0fea7c3b92a8d90fcd1b266b69ca58ed65b94b9babae870a796babac9608eeff8d0d66dba8ffe98d039a5edafa2b254adaafcb6ca7db3efae99b266aa9ba9d3f35e81bdaea4e55cfbbca4d2d1668386a3d6e1338994fe84dc53fbbb8fd1c761a796a1d2f96e81899a8af65e9a8ba3d4b3398aa78285c95e839b81abb4258cf586a7d9749bb983b7cc37e2a3; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjZHVfNTMyMDcwNzg0NjAiLCJleHBpcmUiOiIxNjcxNzY1NzQ3NjczIn0.50e-ROweqV0uSd3-Og9L7eY5sAemPZOK_hRhmAzsQUk; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1669173865'
}
info = requests.post('https://dig.chouti.com/link/vote', data=data, headers=header)
print(info.text)
# way 2: pass cookies separately through the cookies parameter
data = {
    'linkId': '36996038'
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
}
info = requests.post('https://dig.chouti.com/link/vote', data=data, headers=header, cookies={'key': 'value'})
print(info.text)
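Besides a plain dict, the cookies parameter also accepts a RequestsCookieJar, which additionally carries domain/path scoping; a small offline sketch (the cookie name and value here are made up):

```python
import requests

jar = requests.cookies.RequestsCookieJar()
# set() scopes the cookie to a domain and path, which a plain dict cannot do
jar.set('token', 'abc123', domain='dig.chouti.com', path='/')
print(jar.get('token'))    # abc123
print(jar.get_dict())      # {'token': 'abc123'}
# either jar or the plain dict {'token': 'abc123'} can be passed as cookies=...
```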
四、POST requests
4.1 Basic POST request
import requests
data = {
    'username': '1652814964@qq.com',
    'password': '******',
    'captcha': 'cccc',
    'remember': '1',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login'
}
info = requests.post('http://www.aa7a.cn/user.php', data=data)
print(info.text)
print(info.cookies)   # cookies set by the login response
# carry the login cookies on the next request
info1 = requests.get('http://www.aa7a.cn/', cookies=info.cookies)
print('1652814964@qq.com' in info1.text)   # True means we are logged in
4.2 POST requests with parameters
import requests
# json= sends the body as application/json; data= sends a form-encoded body
info = requests.post('http://www.aa7a.cn/user.php', json={})
print(info.text)
4.3 Using requests.Session
import requests
session = requests.Session()
data = {
    'username': '1652814964@qq.com',
    'password': '*******',
    'captcha': 'cccc',
    'remember': '1',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login'
}
# the session stores any cookies the login response sets...
info = session.post('http://www.aa7a.cn/user.php', data=data)
# ...and sends them automatically on later requests
info1 = session.get('http://www.aa7a.cn/')
print('1652814964@qq.com' in info1.text)
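To see why the second GET needs no explicit cookies, you can plant a cookie on a session by hand (no network; the cookie name ECS_ID is made up) and inspect the header a prepared request would carry:

```python
import requests

s = requests.Session()
# pretend this cookie came back from the login POST
s.cookies.set('ECS_ID', 'abc123', domain='www.aa7a.cn')

# prepare_request() merges session state (cookies, headers) into the request
req = requests.Request('GET', 'http://www.aa7a.cn/')
prepared = s.prepare_request(req)
print(prepared.headers.get('Cookie'))  # ECS_ID=abc123
```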
五、The Response object
5.1 Response attributes
import requests
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
}
response = requests.get('https://www.jianshu.com', params={'name': 'jason', 'age': 18}, headers=header)
print(response.text)                 # body decoded as str
print(response.content)              # raw body bytes
print(response.status_code)          # HTTP status code
print(response.headers)              # response headers
print(response.cookies)              # cookies as a RequestsCookieJar
print(response.cookies.get_dict())   # cookies as a plain dict
print(response.cookies.items())      # cookies as (name, value) pairs
print(response.url)                  # final, fully encoded URL
print(response.history)              # the redirect chain, if any
print(response.encoding)             # encoding used to decode .text
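.text is simply .content decoded with .encoding; a Response built by hand (no network) makes the relationship visible. Note that _content is an internal attribute, set here only for this offline demonstration:

```python
import requests

r = requests.models.Response()
r.status_code = 200
r._content = '你好'.encode('utf-8')   # the raw body bytes, exposed as .content
r.encoding = 'utf-8'

print(r.content)    # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(r.text)       # 你好 -- the bytes decoded with r.encoding
r.encoding = 'gbk'  # a wrong encoding garbles .text but never .content
print(r.text)
```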
六、Fetching binary data
import requests
# small file: write the whole body in one go
image = requests.get('http://www.aa7a.cn/data/afficheimg/20220913pmsadf.png')
with open('image.png', 'wb') as f:
    f.write(image.content)
# large file: write it chunk by chunk to keep memory use down
mp4 = requests.get(
    'https://vd3.bdstatic.com/mda-mk21ctb1n2ke6m6m/sc/cae_h264/1635901956459502309/mda-mk21ctb1n2ke6m6m.mp4')
with open('video.mp4', 'wb') as f:
    for chunk in mp4.iter_content():
        f.write(chunk)
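iter_content() exists so a big body never has to sit in memory all at once. The loop below shows the same chunked-write pattern with io.BytesIO standing in for the network response (with requests you would also pass stream=True to get() and a chunk_size to iter_content(), e.g. resp.iter_content(chunk_size=1024)):

```python
import io
import os
import tempfile

# io.BytesIO stands in for a streamed response body
source = io.BytesIO(b'x' * 10_000)

written = 0
path = os.path.join(tempfile.gettempdir(), 'chunked_demo.bin')
with open(path, 'wb') as f:
    while True:
        chunk = source.read(1024)   # read at most 1 KiB at a time
        if not chunk:
            break
        f.write(chunk)
        written += len(chunk)

print(written)  # 10000
```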
七、Parsing JSON responses
import requests
res = requests.get(
    'https://api.map.baidu.com/place/v2/search?ak=6E823f587c95f0148c19993539b99295&region=%E4%B8%8A%E6%B5%B7&query=%E8%82%AF%E5%BE%B7%E5%9F%BA&output=json')
print(res.text)            # raw JSON text
print(type(res.text))      # <class 'str'>
print(res.json()['results'][0]['name'])   # .json() parses it into Python objects
print(type(res.json()))    # <class 'dict'>
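What .json() does can be shown without calling the Baidu API: build a Response by hand (no network; _content is internal, and the body below is a made-up result merely shaped like the API's):

```python
import json
import requests

r = requests.models.Response()
r.status_code = 200
r._content = json.dumps(
    {'results': [{'name': '肯德基(测试店)'}]}
).encode('utf-8')
r.encoding = 'utf-8'

print(type(r.text))                  # <class 'str'> -- still raw JSON text
parsed = r.json()                    # json.loads under the hood
print(type(parsed))                  # <class 'dict'>
print(parsed['results'][0]['name'])  # 肯德基(测试店)
```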