Getting Started with Python 3 Web Crawlers (Part 1)
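A web crawler, also called a web spider, fetches the content of a web page from its address (URL). The URL is simply the link we type into the browser's address bar.
Enter a URL in the browser's address bar, right-click anywhere on the page, and choose Inspect. (The name differs between browsers: Chrome calls it Inspect and Firefox calls it Inspect Element, but the function is the same.)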
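- Most sites publish a crawler policy in robots.txt (for example, https://www.baidu.com/robots.txt), which spells out what crawlers may and may not access.
- Technically, anything you can see in the browser can be crawled.
- Legally, some crawling sits in a gray area, and some of it is against the law.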
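The robots.txt rules mentioned in the list above can be checked in code with the standard library's urllib.robotparser. The sketch below parses a made-up rule set; the rules, the example.com URLs, and the crawler name my-crawler are illustrative assumptions, not Baidu's real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-crawler" is a made-up user agent name.
print(parser.can_fetch("my-crawler", "https://example.com/index.html"))         # True
print(parser.can_fetch("my-crawler", "https://example.com/private/data.html"))  # False
```

For a live site, the same parser can load the real file with parser.set_url(...) followed by parser.read() before calling can_fetch().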
I. URL: the more formal name is Uniform Resource Locator. Its general format is as follows (parts in square brackets [] are optional):
protocol :// hostname[:port] / path / [;parameters][?query]#fragment
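The first three parts are the main ones:
protocol: the first part is the protocol; Google, for example, uses https.
hostname[:port]: the second part is the host name, with the port number as an optional component. Sites served over http default to port 80, and sites served over https default to port 443.
path: the third part is the location of the resource on the host, such as a directory and file name. This is what we usually call the path, and it matters because each path requests a different resource from the server. For example, the two pairs of slippers on JD.com have the paths
100006079301.html and 100003887822.html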
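To make the parts of a URL concrete, here is a minimal sketch that splits one with the standard library's urllib.parse. The host item.jd.com, the query string, and the fragment are assumptions added for illustration; only the path comes from the text. The helper name describe_url is also made up.

```python
from urllib.parse import urlparse

# Default ports for the two protocols discussed above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url: str) -> dict:
    """Split a URL into the parts named in the general format."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        # parts.port is None when the URL has no explicit port,
        # so fall back to the protocol's default.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Hypothetical URL: only the path 100006079301.html appears in the text.
info = describe_url("https://item.jd.com/100006079301.html?from=demo#comment")
for name, value in info.items():
    print(f"{name}: {value}")
```

An explicit port, as in http://localhost:8080/, takes precedence over the default in this sketch.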
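II. The first step of a web crawler is to fetch a page's HTML from its URL. In Python 3, this can be done with urllib.request or with requests.
1. The requests module
- Install: pip3 install requests. urllib is part of the Python 3 standard library (urllib2 existed only in Python 2); requests is a third-party package built on top of urllib3.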
# **** Basic usage ****
# Import the module
import requests

# Send a GET request; a Response object is returned
resp = requests.get('https://www.baidu.com')

# Body of the response, decoded as text
print(resp.text)

# Save the page to a local file
with open('a.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)

# Status code of the response
print(resp.status_code)
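For comparison, the built-in urllib.request mentioned at the start of this section can do the same job without installing anything. This is only a rough sketch: the helper names build_request and fetch_html, the short User-Agent string, and the 10-second timeout are assumptions chosen for illustration.

```python
import urllib.request


def build_request(url: str, user_agent: str = "Mozilla/5.0") -> urllib.request.Request:
    """Create a GET request object with a User-Agent header (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send the request and decode the body, falling back to UTF-8."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Building the request does not touch the network.
req = build_request("https://www.baidu.com")
print(req.get_method(), req.full_url)
```

Calling fetch_html('https://www.baidu.com') performs the actual download, roughly what requests.get(...).text does in the snippet above.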
III. A simple example
1. First, let's look at requests.get(), which sends a GET request to the server.
import requests

if __name__ == '__main__':
    url = "http://www.baidu.com/"
    req = requests.get(url=url)
    # Decode the response body as UTF-8 so the Chinese text prints correctly
    req.encoding = 'utf-8'
    print(req.text)
Output:
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
Paste the result into a text file, save it as 1.html, and open it in a browser to see the page we crawled.
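The example sets the encoding to utf-8 before printing. As far as I know, requests falls back to ISO-8859-1 when a text response declares no charset, which would garble the Chinese text. The sketch below reproduces that effect with plain bytes and then saves the correctly decoded text to a file instead of pasting it by hand. The helper name save_html and the temporary-directory location are assumptions.

```python
import tempfile
from pathlib import Path

# Bytes as a server would send them for a UTF-8 encoded page.
raw = "<title>百度一下,你就知道</title>".encode("utf-8")

garbled = raw.decode("iso-8859-1")  # wrong guess: the Chinese characters become mojibake
correct = raw.decode("utf-8")       # right encoding: the text comes back intact


def save_html(text: str, path: Path) -> None:
    """Write the page to disk as UTF-8 so a browser can open it."""
    path.write_text(text, encoding="utf-8")


# Written to the system temp directory here; any writable path works.
out_file = Path(tempfile.gettempdir()) / "1.html"
save_html(correct, out_file)
print(out_file.read_text(encoding="utf-8"))
```

With requests, the equivalent call would be save_html(req.text, Path('1.html')) after setting req.encoding.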
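2. Scraping the reviews of a JD.com product
Open the product page and locate the content we want.
Right-click > Inspect > Network > click the product reviews tab > search for part of a review's text to find the matching request.
This gives us the URL of the request.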
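Before writing the crawler, it helps to take the request URL apart. The sketch below uses the same URL as the script that follows and reads its query parameters with urllib.parse. Treating page as a zero-based page index and pageSize as the number of reviews per page is an inference from the parameter names, not something JD.com documents here, and the helper name with_page is made up.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

COMMENT_URL = (
    "https://club.jd.com/comment/productPageComments.action"
    "?callback=fetchJSON_comment98&productId=100006079301&score=0"
    "&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
)


def with_page(url: str, page: int) -> str:
    """Return the same URL with only the page parameter changed."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; each key appears once here.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    params["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(params)))


params = parse_qs(urlsplit(COMMENT_URL).query)
print(params["productId"], params["page"], params["pageSize"])

# URLs for the first three pages, assuming page is a zero-based index.
for page in range(3):
    print(with_page(COMMENT_URL, page))
```

Looping over with_page() like this is one way to fetch several pages; adding a pause between requests keeps the load on the site low.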
Now write the code:
# -*- coding: utf-8 -*-
# Scrape the reviews of the best-selling lipstick on JD.com
# (this request fetches page 0, with 10 reviews per page)
import requests

# A browser-like User-Agent; without it the request may be rejected
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
resp = requests.get('https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100006079301&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1', headers=headers)
sp = resp.text
print(sp)
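Note that JD.com has anti-crawling measures that check the User-Agent, which is why the request passes a headers dict.
The full data set is too large to paste here, so we read only one page.
Output: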
fetchJSON_comment98({"productAttr":null,"productCommentSummary":{"skuId":100006079301,"averageScore":5,"defaultGoodCount":226069,"defaultGoodCountStr":"22万+","commentCount":267810,"commentCountStr":"26万+","goodCount":40390,"goodCountStr":"4万+","goodRate":0.96,"goodRateShow":96,"generalCount":539,"generalCountStr":"500+","generalRate":0.012,"generalRateShow":1,"poorCount":812,"poorCountStr":"800+","poorRate":0.028,"poorRateShow":3,"videoCount":315,"videoCountStr":"300+","afterCount":538,"afterCountStr":"500+","showCount":6217,"showCountStr":"6200+","oneYear":0,"sensitiveBook":0,"fixCount":0,"plusCount":0,"plusCountStr":"0","buyerShow":0,"poorRateStyle":4,"generalRateStyle":2,"goodRateStyle":144,"installRate":0,"productId":100006079301,"score1Count":812,"score2Count":174,"score3Count":365,"score4Count":427,"score5Count":39963},"hotCommentTagStatistics":[{"id":"51197de19f0ba763","name":"高端大气","count":83,"type":4,"canBeFiltered":true,"stand":1,"rid":"51197de19f0ba763","ckeKeyWordBury":"eid=100^^tagid=51197de19f0ba763^^pid=20006^^sku=100006079301^^sversion=1000^^token=cf387ae8a15eca59"},{"id":"8d5387624e70bbd2","name":"质地细腻","count":21,"type":4,"canBeFiltered":true,"stand":1,"rid":"8d5387624e70bbd2","ckeKeyWordBury":"eid=100^^tagid=8d5387624e70bbd2^^pid=20006^^sku=100006079301^^sversion=1000^^token=4060dfd99e9f9368"},{"id":"13bacd43c59bb7d8","name":"质感极佳","count":19,"type":4,"canBeFiltered":true,"stand":1,"rid":"13bacd43c59bb7d8","ckeKeyWordBury":"eid=100^^tagid=13bacd43c59bb7d8^^pid=20006^^sku=100006079301^^sversion=1000^^token=22259161e6f06a4a"},{"id":"4244676cbb4a9a7a","name":"质地柔软舒适","count":18,"type":4,"canBeFiltered":true,"stand":1,"rid":"4244676cbb4a9a7a","ckeKeyWordBury":"eid=100^^tagid=4244676cbb4a9a7a^^pid=20006^^sku=100006079301^^sversion=1000^^token=d4c916443c62aa1d"},{"id":"eeb3d5553c5b4d96","name":"少女感十足","count":10,"type":4,"canBeFiltered":true,"stand":1,"rid":"eeb3d5553c5b4d96","ckeKeyWordBury":"eid=100^^tagid=eeb3d5553c5b4d96^^pid=20006^^sku=100006079301
^^sversion=1000^^token=0cfa91f4619d42cd"},{"id":"751bb69d96ad1a03","name":"不卡唇纹","count":3,"type":4,"canBeFiltered":true,"stand":1,"rid":"751bb69d96ad1a03","ckeKeyWordBury":"eid=100^^tagid=751bb69d96ad1a03^^pid=20006^^sku=100006079301^^sversion=1000^^token=13417f3acd3ee870"},{"id":"2f0435df19147d25","name":"没有异味","count":2,"type":4,"canBeFiltered":true,"stand":1,"rid":"2f0435df19147d25","ckeKeyWordBury":"eid=100^^tagid=2f0435df19147d25^^pid=20006^^sku=100006079301^^sversion=1000^^token=d7844a0a37a59036"},{"id":"3a1d9b72c8f37e71","name":"乌黑亮丽","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"3a1d9b72c8f37e71","ckeKeyWordBury":"eid=100^^tagid=3a1d9b72c8f37e71^^pid=20006^^sku=100006079301^^sversion=1000^^token=1aa00f26771093f3"},{"id":"3a57805849e14dbb","name":"安全可靠","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"3a57805849e14dbb","ckeKeyWordBury":"eid=100^^tagid=3a57805849e14dbb^^pid=20006^^sku=100006079301^^sversion=1000^^token=633dd3c0804f145d"},{"id":"d1a61b29c4d11818","name":"分量足够","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"d1a61b29c4d11818","ckeKeyWordBury":"eid=100^^tagid=d1a61b29c4d11818^^pid=20006^^sku=100006079301^^sversion=1000^^token=04b8c858e9da8a54"}],"jwotestProduct":null,"maxPage":100,"testId":"cmt","score":0,"soType":5,"imageListCount":500,"vTagStatistics":null,"csv":"eid=100^^tagid=ALL^^pid=20006^^sku=100010958774^^sversion=1001^^pageSize=11","comments":[{"id":13868985681,"guid":"8bf8cff2-173c-45f7-a2cd-44e5fb295c13","content":"情人节的时候首发抢的,焦急的等待快递的到来,很满意,很不错,不仅送了礼盒,还有面膜的小样,瞬间感觉熬夜都是值得的,开心。口红的包装很有质感,设计的也很有档次,颜色拿捏的也很细致,非常不错,没有色差。而且保湿效果也很好,不会很干,看上去很棒。非常的满意。","vcontent":"情人节的时候首发抢的,焦急的等待快递的到来,很满意,很不错,不仅送了礼盒,还有面膜的小样,瞬间感觉熬夜都是值得的,开心。口红的包装很有质感,设计的也很有档次,颜色拿捏的也很细致,非常不错,没有色差。而且保湿效果也很好,不会很干,看上去很棒。非常的满意。","creationTime":"2020-03-03 
19:26:32","isDelete":false,"isTop":false,"userImageUrl":"misc.360buyimg.com/user/myjd-2015/css/i/peisong.jpg","topped":0,"replyCount":0,"score":5,"imageStatus":1,"title":"","usefulVoteCount":1,"userClient":2,"discussionId":688960275,"imageCount":4,"anonymousFlag":1,"plusAvailable":201,"mobileVersion":"","images":[{"id":1070279057,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/86015/17/13903/221606/5e5e3ee8E7c2135bb/3015e0cc8d1dcf28.jpg","imgTitle":"","status":0},{"id":1070279058,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/90879/28/13906/189795/5e5e3ee7E16c8d3fb/00e1e7f373c4c967.jpg","imgTitle":"","status":0},{"id":1070279059,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/101831/8/13791/230984/5e5e3ee7E290fb10c/53a6be934b8cd090.jpg","imgTitle":"","status":0},{"id":1070279060,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/107793/21/7596/62938/5e5e3ee7E1f6b32ea/88234ceb80ab0c4d.jpg","imgTitle":"","status":0}],"mergeOrderStatus":2,"productColor":"限量款196","productSize":"","textIntegral":40,"imageIntegral":40,"status":1,"referenceId":"100010958774","referenceTime":"2020-02-11 09:49:01","nickname":"j***0","replyCount2":0,"userImage":"misc.360buyimg.com/user/myjd-2015/css/i/peisong.jpg","orderId":0,"integral":80,"productSales":"[]","referenceImage":"jfs/t1/148778/34/16762/154478/5fc9e554E45dd107a/fa5f882090cf848e.jpg","referenceName":"兰蔻(LANCOME)口红196 3.4g 菁纯丝绒雾面哑光唇膏 化妆品礼盒 胡萝卜色","firstCategory":1316,"secondCategory":1387,"thirdCategory":1425,"aesPin":null,"days":21,"afterDays":0}]});
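The response above is JSONP: a JSON document wrapped in a call to fetchJSON_comment98(...). A minimal sketch for unwrapping it and reading the reviews follows. The sample string is a trimmed-down stand-in with the same shape as the real output, and the helper name parse_jsonp is made up for this example.

```python
import json
import re


def parse_jsonp(text: str) -> dict:
    """Strip a JSONP wrapper such as fetchJSON_comment98(...); and decode the JSON inside."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("response is not in callback(...) form")
    return json.loads(match.group(1))


# Trimmed-down sample with the same shape as the response above.
sample = (
    'fetchJSON_comment98({"maxPage":100,"comments":['
    '{"id":13868985681,"nickname":"j***0","score":5,"content":"很满意"}]});'
)

data = parse_jsonp(sample)
for comment in data["comments"]:
    print(comment["nickname"], comment["score"], comment["content"])
```

In the crawler itself, parse_jsonp(resp.text) would take the place of printing the raw text, leaving a dictionary whose comments list holds one entry per review.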