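Web Crawling: Basic Usage of urllib.request (Part 1)
The urllib module
Overview of the urllib module:
urllib provides a collection of tools for working with URLs. It consists of four submodules: urllib.request, urllib.error, urllib.parse and urllib.robotparser.
- urllib.request: opens URLs and reads their content
- urllib.error: contains the exceptions raised by urllib.request
- urllib.parse: parses URLs
- urllib.robotparser: parses robots.txt files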
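To make the submodule list a little more concrete, here is a minimal sketch of urllib.parse and urllib.robotparser that runs without any network access. The search URL, the example.com addresses and the two robots.txt lines are made-up inputs chosen only for illustration.

```python
from urllib.parse import urlparse, parse_qs, urlencode
from urllib.robotparser import RobotFileParser

# Split a URL into its components.
parts = urlparse("https://www.baidu.com/s?wd=python&pn=10")
print(parts.scheme, parts.netloc, parts.path)  # https www.baidu.com /s
print(parse_qs(parts.query))                   # {'wd': ['python'], 'pn': ['10']}

# Build a percent-encoded query string from a dict (UTF-8 by default).
query = urlencode({"wd": "爬虫", "pn": 0})
print(query)

# Feed robots.txt rules to the parser directly instead of downloading them.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed_public = robots.can_fetch("*", "https://example.com/public/page.html")
allowed_private = robots.can_fetch("*", "https://example.com/private/page.html")
print(allowed_public, allowed_private)
```

In a real crawler you would normally call set_url() and read() on the RobotFileParser so that the rules come from the target site itself.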
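Signature of urllib.request.urlopen() (as of the Python versions this post was written for; the cafile, capath and cadefault parameters were removed in Python 3.13 in favor of context):
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
About urlopen():
Parameters of urlopen:
- url: the URL to open (a string or a Request object)
- data: the data to submit in a POST request
- timeout: the timeout for the request, in seconds
Methods of the object returned by urlopen:
- read(), readline(), readlines(), fileno(), close(): read from or close the HTTPResponse, much like a file object
- info(): returns an HTTPMessage object holding the response headers sent by the remote server
- getcode(): returns the HTTP status code. For HTTP requests, 200 means the request succeeded and 404 means the URL was not found
- geturl(): returns the URL that was actually retrieved, which can differ from the requested one after a redirect
Attributes of the object returned by urlopen:
- status: the HTTP status code, such as 200 or 404 (by default urlopen raises urllib.error.HTTPError for 4xx and 5xx responses instead of returning them)
- reason: the reason phrase sent by the server, such as OK (a string, not a number)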
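The methods and attributes listed above can be tried out without touching a real website. The sketch below starts a throwaway HTTP server on 127.0.0.1 using the standard http.server module and inspects the response that urlopen returns. HelloHandler and inspect_response are hypothetical names made up for this example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET with a small plain-text body."""

    def do_GET(self):
        body = b"hello, urllib"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_response():
    """Fetch one page from a local server and summarize the response object."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return {
                "status": resp.status,            # attribute
                "reason": resp.reason,            # attribute
                "code": resp.getcode(),           # same value as status
                "url": resp.geturl(),
                "content_type": resp.info().get("Content-Type"),
                "body": resp.read().decode("utf-8"),
            }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for key, value in inspect_response().items():
        print(key, "=", value)
```

Using the response as a context manager (the with statement) closes the connection automatically, so an explicit close() call is not needed.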
Using the url parameter
GET requests
As an example, let us fetch the Baidu URL https://www.baidu.com/:

```python
# demoe5.py
from urllib import request

url = 'https://www.baidu.com/'
f = request.urlopen(url)
data = f.read()
print('Status:', f.status, f.reason)
print('Data:', data)
```
Running the program produces:

```
C:\Pycham\venv\Scripts\python.exe C:/Pycham/demoe5.py
Status: 200 OK
Data: b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

Process finished with exit code 0
```
The data returned by read() is of type bytes. It has to be decoded with decode() to turn it into a str.
We change the last line of the program to
print('Data:', data.decode('utf-8'))

```python
# demoe5.py
from urllib import request

url = 'https://www.baidu.com/'
f = request.urlopen(url)
data = f.read()
print('Status:', f.status, f.reason)
print('Data:', data.decode('utf-8'))
```
Running the program produces:

```
C:\Pycham\venv\Scripts\python.exe C:/Pycham/demoe5.py
Status: 200 OK
Data: <html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

Process finished with exit code 0
```

The output is now readable text that matches the source of the page.
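Decoding with a hard-coded 'utf-8' works for this page, but not every site uses UTF-8. One possible refinement is to read the charset from the Content-Type response header and fall back to UTF-8 only when none is declared. In the sketch below, decode_body is a hypothetical helper, and a hand-built email.message.Message stands in for response.headers (http.client.HTTPMessage is a subclass of it) so that the example runs offline.

```python
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes using the charset declared in the headers, if any."""
    charset = headers.get_content_charset() or default
    # errors="replace" keeps going if a few bytes do not match the charset.
    return raw.decode(charset, errors="replace")


# Stand-in for response.headers of a page served as GBK.
gbk_headers = Message()
gbk_headers["Content-Type"] = "text/html; charset=gbk"
gbk_text = decode_body("百度一下".encode("gbk"), gbk_headers)

# No charset declared: the helper falls back to UTF-8.
plain_headers = Message()
plain_headers["Content-Type"] = "text/html"
utf8_text = decode_body("百度一下".encode("utf-8"), plain_headers)

print(gbk_text, utf8_text)
```

With a live response the call would look like decode_body(f.read(), f.headers).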
Using the data parameter
POST requests
The data parameter of urlopen() defaults to None. When data is not None, urlopen() sends the request as a POST.
POST data must be bytes or an iterable of bytes, not a str. A str has to be encoded with encode() first, and the result is then passed to the Request object as the data argument. The form encoding itself is done with a function from the urllib.parse library, urlencode().

```python
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')  # data should be bytes
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
```
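If you do not pass the data argument, urllib uses a GET request. One difference between GET and POST is that POST requests often have side effects: they change the state of the system in some way (for example, by placing an order on a website for a hundredweight of tinned spam to be delivered to your door). The HTTP standard makes it clear that POST requests are intended to cause side effects and GET requests are intended never to, but nothing enforces this: a GET request can have side effects and a POST request can have none. Data can also be passed in a GET request by encoding it in the URL itself.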
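The last point, passing data in a GET request by encoding it in the URL, can be sketched as follows. The example reuses the placeholder server address from the POST example and only builds the Request objects, so nothing is sent over the network.

```python
import urllib.parse
import urllib.request

base_url = "http://www.someserver.com/cgi-bin/register.cgi"
values = {"name": "Michael Foord", "language": "Python"}
query = urllib.parse.urlencode(values)

# GET: the encoded data becomes the query string of the URL.
get_req = urllib.request.Request(base_url + "?" + query)

# POST: the same encoded data is sent as the request body (bytes).
post_req = urllib.request.Request(base_url, data=query.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Request.get_method() shows which HTTP method urllib would use: it returns POST as soon as data is set and GET otherwise, unless the method argument overrides it.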
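Using the timeout parameter
When the network is poor or the server is misbehaving, a request can be very slow or fail altogether. In that case we want to set a timeout (in seconds) on the request instead of letting the program wait for a result indefinitely. For example:

```python
# demoe5.py
from urllib import request

url = 'https://www.baidu.com/'
# Give up if the server does not respond within 0.1 seconds
f = request.urlopen(url, timeout=0.1)
data = f.read()
print('Status:', f.status, f.reason)
print('Data:', data)
```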
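When the timeout expires, urlopen raises an exception rather than returning, so a crawler normally wraps the call in try/except. The sketch below is one way to do that. fetch is a hypothetical helper, and the demo uses a local socket that accepts connections but never answers as a stand-in for a slow server.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=3.0):
    """Return (body, error); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(), None
    except urllib.error.HTTPError as exc:
        # Must come before URLError, because HTTPError is a subclass of it.
        return None, "HTTP error %d %s" % (exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # Connection problems, including a timeout while connecting.
        return None, "URL error: %s" % exc.reason
    except socket.timeout:
        # A timeout while waiting for or reading the response.
        return None, "timed out"


if __name__ == "__main__":
    # A listening socket that never replies plays the slow server.
    silent = socket.socket()
    silent.bind(("127.0.0.1", 0))
    silent.listen(1)
    port = silent.getsockname()[1]
    print(fetch("http://127.0.0.1:%d/" % port, timeout=0.5))
    silent.close()
```

Returning an error string keeps the example short. A real crawler might log the failure and retry with a longer timeout instead.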
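Wrapping a request with Request
Many websites expect certain request headers before they will respond, either to keep crawlers from overloading the site or because they serve different versions of a page to different browsers. The most common of these headers is User-Agent.
Signature:
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
Use Request() to wrap the request, then fetch the page with urlopen().
Headers that are commonly set:
- User-Agent: can carry the browser name and version, the operating system name and version, and the default language
- Referer: can be used for hotlink protection. Some sites only serve their images when the request comes from http://***.com, and they determine this by checking the Referer header
- Connection: tells the server whether to keep the connection open after the response (for example keep-alive or close). Note that urllib.request sends Connection: close itself, so setting keep-alive here is unlikely to have any effect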
Method 1:

```python
import urllib.request

url = "https://www.baidu.com/"
# Create the Request object
req = urllib.request.Request(url)
# Add an HTTP header
req.add_header('User-Agent', 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)')
# Send the request and get the result
response2 = urllib.request.urlopen(req)
data = response2.read()
print(response2.status, response2.reason)  # status code and reason phrase
print(len(data))                           # length of the page in bytes
print(data.decode())                       # page content as text
```
Output:

```
C:\Pycham\venv\Scripts\python.exe C:/Pycham/demoe5.py
200 OK
227
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

Process finished with exit code 0
```
Method 2:

```python
import urllib.request

url = 'https://www.baidu.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': 'https://www.baidu.com/',
    'Connection': 'keep-alive'
}
# Pass all headers at once when creating the Request object
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req).read()
data = response.decode('utf-8')
print(data)
```
Output (this time the response is the full Baidu home page rather than the short redirect stub; only the tail is shown, and the long inline JavaScript before it is omitted):

```
...
</script>

<script>
if(navigator.cookieEnabled){
	document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
}
</script>

</body>
</html>

Process finished with exit code 0
```
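Both methods end up storing the headers on the Request object, and they can be inspected before anything is sent. The sketch below is a small offline check. The User-Agent string is a made-up value, and the point to notice is that Request normalizes header names with str.capitalize(), so the stored key is User-agent rather than User-Agent.

```python
import urllib.request

url = "https://www.baidu.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",  # made-up value
    "Referer": "https://www.baidu.com/",
}

# Method 2 style: pass a dict of headers to the constructor.
req = urllib.request.Request(url, headers=headers)
# Method 1 style: add one more header afterwards.
req.add_header("Connection", "keep-alive")

# Header names are stored capitalized: first letter upper, the rest lower.
print(req.header_items())
print(req.get_header("User-agent"))  # the stored key
print(req.has_header("User-Agent"))  # lookup is case-sensitive
```

Nothing is sent until the Request is passed to urlopen(), so this is a convenient way to check a crawler's headers while developing.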