python_爬虫
1、网络爬虫
1、定义:网络蜘蛛,网络机器人,抓取网络数据的程序
2、总结:用Python程序去模仿人访问网站,模仿得越逼真越好
3、目的:通过对大量有效数据的分析,辅助判断市场走势、支持公司决策
2、企业获取数据的方式
1、公司自有
2、第三方数据平台购买
1、数据堂、贵阳大数据交易所
3、爬虫程序爬取数据
市场上没有或者价格太高,利用爬虫程序去爬取
3、Python做爬虫的优势
1、Python:请求模块、解析模块丰富成熟
2、PHP:多线程,异步支持不够好
3、JAVA:代码笨重,代码量大
4、C/C++:虽然效率高,但代码成型太慢
4、爬虫的分类
1、通用的网络爬虫(搜索引擎引用,需要遵守robots协议)
1、搜索引擎如何获取一个新网站的URL
1、网站主动向搜索引擎提供(百度站长平台)
2、和DNS服务商(如万网)合作,快速收录新网站
2、聚焦网络爬虫(需要什么爬取什么)
自己写的爬虫程序:面向主题爬虫,面向需求爬虫
5、爬取数据步骤
1、确定需要爬取的URL地址
2、通过HTTP/HTTPS协议来获取响应的HTML页面
3、提取HTML页面里有用的数据
1、所需数据,保存
2、页面中其他的URL,继续重复第2步
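下面是按照这5个步骤组织的一个最小示意(其中的起始地址和正则只是演示用的占位,不针对具体网站):
import re
import urllib.request

start_url = "http://www.example.com/"      # 1、确定需要爬取的URL地址(占位)
todo = [start_url]
seen = set(todo)

while todo:
    url = todo.pop(0)
    # 2、通过HTTP/HTTPS协议来获取响应的HTML页面
    html = urllib.request.urlopen(url).read().decode('utf-8')
    # 3、提取HTML页面里有用的数据(这里把页面标题当作"所需数据"保存)
    title = re.search(r'<title>(.*?)</title>', html, re.S)
    if title:
        print(url, title.group(1))
    # 页面中其他的URL,加入队列后继续重复第2步
    for link in re.findall(r'href="(http[^"]+)"', html):
        if link not in seen:
            seen.add(link)
            todo.append(link)
    break   # 示意:只处理一页,实际按需求控制循环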
6、Chrome浏览器插件
1、插件安装步骤
1、右上角->更多工具->扩展程序
2、点开 开发者模式
3、把插件拖拽到浏览器界面
2、插件介绍
1、Proxy SwitchyOmega:代理切换插件
2、XPath Helper:网页数据解析插件
3、JSON View:查看json格式的数据(好看)
7、Fiddler抓包工具
1、抓包设置
1、设置Fiddler抓包工具
2、设置浏览器代理
Proxy SwitchyOmega ->选项->新建情景模式->HTTP 127.0.0.1 8888
2、Fiddler常用菜单
1、Inspector:查看抓到数据包的详细内容
2、常用选项
1、Headers:客户端发送到服务器的header,包含Web客户端信息、Cookie、传输状态等
2、WebForms:显示请求的POST的数据
3、Raw:将整个请求显示为纯文本
8、Anaconda 和 spyder
1、Anaconda:开源的python发行版本
2、Spyder:集成的开发工具
spyder常用快捷键
1、注释/取消注释:ctrl+1
2、保存:ctrl+s
3、运行程序:F5
9、WEB
1、HTTP 和 HTTPS
1、HTTP:80
2、HTTPS:443,HTTP的升级版
2、GET 和 POST
1、GET:查询参数会在URL上显示出来
2、POST:查询参数和提交的数据在form表单里,不会在URL地址上显示
3、URL
http:// item.jd.com :80 /2660656.html #detail
协议 域名/IP地址 默认端口 资源路径 锚点(可选)
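可以用标准库urllib.parse的urlparse()验证上面各部分的拆分(演示代码如下):
from urllib.parse import urlparse

result = urlparse("http://item.jd.com:80/2660656.html#detail")
print(result.scheme)     # http              协议
print(result.netloc)     # item.jd.com:80    域名/IP地址 + 端口
print(result.path)       # /2660656.html     资源路径
print(result.fragment)   # detail            锚点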
4、User-Agent
记录用户浏览器、操作系统等,为了让用户获取更好的HTML页面效果
Mozilla:Firefox(Gecko内核)
IE:Trident(自家内核)
Linux:KHTML(like Gecko)
Apple:Webkit(like KHTML)
Google:Chrome(like Webkit)
10、爬虫请求模块
1、urllib.request
1、版本
1、Python2中:urllib 和 urllib2
2、Python3中:把两者合并,urllib.request
2、常用方法
1、urllib.request.urlopen('URL')
作用:向网站发起请求并获取响应
urlopen()得到响应对象response,其read()方法返回bytes类型
import urllib.request

url = 'http://www.baidu.com/'
# 发起请求并获取响应对象
response = urllib.request.urlopen(url)
# 响应对象的read()方法获取响应内容
# read()方法得到的是bytes类型
# read() bytes --> string
html = response.read().decode('utf-8')
print(html)
2、urllib.request.Request(url,headers={})
1、重构User-Agent,爬虫和反爬虫斗争第一步
2、使用步骤
1、构建请求对象request:Request()
2、获取响应对象response:urlopen(request)
3、利用响应对象response.read().decode('utf-8')
# -*- coding: utf-8 -*-
import urllib.request

url = 'http://www.baidu.com/'
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

#1、构建请求对象
request = urllib.request.Request(url,headers=headers)
#2、得到响应对象
response = urllib.request.urlopen(request)
#3、获取响应对象的内容
html = response.read().decode('utf-8')
print(html)
3、请求对象request方法
1、add_header()
作用:添加或修改headers(User-Agent)
2、get_header('User-agent'),注意只有U是大写
作用:获取已有的HTTP报头的值
import urllib.request

url = 'http://www.baidu.com/'
headers = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'

request = urllib.request.Request(url)
# 请求对象方法add_header()
request.add_header("User-Agent",headers)
# 获取响应对象
response = urllib.request.urlopen(request)
# get_header()方法获取User-agent
# 注意User-agent的写法,只有U是大写的
print(request.get_header('User-agent'))
# 获取响应码
print(response.getcode())
# 获取响应报头信息,返回结果是一个字典
print(response.info())

html = response.read().decode('utf-8')
print(html)
4、响应对象response方法
1、read();读取服务器响应的内容
2、getcode():
作用:返回HTTP的响应状态码
200:成功
4XX:客户端请求错误(请求已到达服务器,如404页面不存在)
5XX:服务器内部错误(服务器处理请求时出错)
3、info():
作用:返回服务器的响应报头信息
2、urllib.parse
1、quote('中文字符串')
2、urlencode(字典)
3、unquote("编码之后的字符串"),解码
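三个方法的简单对照演示:
from urllib import parse

print(parse.quote("美女"))                   # %E7%BE%8E%E5%A5%B3
print(parse.urlencode({"wd": "美女"}))       # wd=%E7%BE%8E%E5%A5%B3
print(parse.unquote("%E7%BE%8E%E5%A5%B3"))   # 美女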
import urllib.request
import urllib.parse

url = 'http://www.baidu.com/s?wd='
key = input('请输入要搜索的内容')
# 编码,拼接URL
key = urllib.parse.quote(key)
fullurl = url + key
print(fullurl)   # http://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
request = urllib.request.Request(fullurl,headers=headers)
resp = urllib.request.urlopen(request)
html = resp.read().decode('utf-8')
print(html)
import urllib.request
import urllib.parse

baseurl = "http://www.baidu.com/s?"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
key = input("请输入要搜索的内容")

# urlencode编码,参数一定是字典
d = {"wd":key}
d = urllib.parse.urlencode(d)
url = baseurl + d

req = urllib.request.Request(url,headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read().decode('utf-8')
print(html)
练习:爬取百度贴吧
1、简单版
# -*- coding: utf-8 -*- """ 百度贴吧数据抓取 要求: 1、输入贴吧的名称 2、输入抓取的起始页和终止页 3、把每一页的内容保存到本地:第一页.html 第二页.html http://tieba.baidu.com/f?kw=%E6%B2%B3%E5%8D%97%E5%A4%A7%E5%AD%A6&ie=utf-8&pn=0 """ import urllib.request import urllib.parse baseurl = "http://tieba.baidu.com/f?" headers = {'User-Agent':'User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'} title = input("请输入要查找的贴吧") begin_page = int(input("请输入起始页")) end_page = int(input("请输入起始页")) #RUL进行编码 kw = {"kw":title} kw = urllib.parse.urlencode(kw) #写循环拼接URL,发请求获取响应,写入本地文件 for page in range(begin_page,end_page+1): pn = (page-1)*50 #拼接URL url = baseurl + kw + "&pa=" + str(pn) #发请求,获取响应 req = urllib.request.Request(url,headers=headers) res = urllib.request.urlopen(req) html = res.read().decode("utf-8") #写文件保存在本地 filename = "第" + str(page) +"页.html" with open(filename,'w',encoding='utf-8') as f: print("正在下载第%d页"%page) f.write(html) print("第%d页下载成功"%page)
2、函数版
import urllib.request
import urllib.parse

# 发请求,获取响应,得到html
def getPage(url):
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
    req = urllib.request.Request(url,headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")
    return html

# 保存html文件到本地
def writePage(filename,html):
    with open(filename,'w',encoding="utf-8") as f:
        f.write(html)

# 主函数
def workOn():
    name = input("请输入贴吧名")
    begin = int(input("请输入起始页"))
    end = int(input("请输入终止页"))
    baseurl = "http://tieba.baidu.com/f?"
    kw = {"kw":name}
    kw = urllib.parse.urlencode(kw)
    for page in range(begin,end+1):
        pn = (page-1)*50
        url = baseurl + kw + "&pn=" + str(pn)
        html = getPage(url)
        filename = "第" + str(page) + "页.html"
        writePage(filename,html)

if __name__ == "__main__":
    workOn()
3、封装为类
import urllib.request
import urllib.parse

class BaiduSpider:
    def __init__(self):
        self.baseurl = "http://tieba.baidu.com/f?"
        self.headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

    def getPage(self,url):
        '''发请求,获取响应,得到html'''
        req = urllib.request.Request(url,headers=self.headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode("utf-8")
        return html

    def writePage(self,filename,html):
        '''保存html文件到本地'''
        with open(filename,'w',encoding="utf-8") as f:
            f.write(html)

    def workOn(self):
        '''主函数'''
        name = input("请输入贴吧名")
        begin = int(input("请输入起始页"))
        end = int(input("请输入终止页"))
        kw = {"kw":name}
        kw = urllib.parse.urlencode(kw)
        for page in range(begin,end+1):
            pn = (page-1)*50
            url = self.baseurl + kw + "&pn=" + str(pn)
            html = self.getPage(url)
            filename = "第" + str(page) + "页.html"
            self.writePage(filename,html)

if __name__ == "__main__":
    # 创建对象
    baiduSpider = BaiduSpider()
    # 调用类内的方法
    baiduSpider.workOn()
1、解析
1、数据分类
1、结构化数据
特点:有固定的格式:HTML、XML、JSON等
2、非结构化数据
示例:图片、音频、视频,这类数据一般存储为二进制
2、正则表达式(re模块)
1、使用流程
1、创建编译对象:p = re.compile(r"\d")
2、对字符串匹配:result = p.match('123ABC')
3、获取匹配结果:print(result.group())
2、常用方法
1、match(s):只匹配字符串开头,返回一个对象
2、search(s):从开始往后去匹配第一个,返回一个对象
3、group():从match和search返回的对象中取值
4、findall(s):全部匹配,返回一个列表
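match()、search()、findall()的区别可以用下面的小例子对照(仅作演示):
import re

p = re.compile(r"\d+")
s = "abc123def456"

print(p.match(s))                  # None,match只从字符串开头匹配
print(p.match("123abc").group())   # 123
print(p.search(s).group())         # 123,search在整个字符串中找第一个
print(p.findall(s))                # ['123', '456'],findall返回列表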
3、表达式
.:任意字符(不能匹配\n)
[...]:包含[]中的一个内容
\d:数字
\w:字母、数字、下划线
\s:空白字符
\S:非空字符
*:前一个字符出现0次或多次
?:0次或1次
+:1次或多次
{m}:前一个字符出现m次
贪婪匹配:在整个表达式匹配成功前提下,尽可能多的去匹配
非贪婪匹配:整个表达式匹配成功前提下,尽可能少的去匹配
4、示例:
import re s = """<div><p>仰天大笑出门去,我辈岂是篷篙人</p></div> <div><p>天生我材必有用,千金散尽还复来</p></div> """ #创建编译对象,贪婪匹配 p =re.compile("<div>.*</div>",re.S) result = p.findall(s) print(result) #['<div><p>仰天大笑出门去,我辈岂是篷篙人</p></div>\n\t <div><p>天生我材必有用,千金散尽还复来</p></div>'] #非贪婪匹配 p1 = re.compile("<div>.*?</div>",re.S) result1 = p1.findall(s) print(result1) #['<div><p>仰天大笑出门去,我辈岂是篷篙人</p></div>', '<div><p>天生我材必有用,千金散尽还复来</p></div>']
5、findall()的分组
解释:先按整体匹配出来,然后再匹配()中的内容;如果有2个或多个(),则以元组方式显示
import re

s = 'A B C D'

p1 = re.compile("\w+\s+\w+")
print(p1.findall(s))   # ['A B','C D']

# 1、先按照整体去匹配['A B','C D']
# 2、再显示括号里面的内容:['A','C']
p2 = re.compile("(\w+)\s+\w+")
print(p2.findall(s))   # ['A','C']

# 1、先按照整体匹配['A B','C D']
# 2、有两个或多个分组时,以元组方式显示括号内容:[('A','B'),('C','D')]
p3 = re.compile("(\w+)\s+(\w+)")
print(p3.findall(s))   # [('A','B'),('C','D')]
6、练习,猫眼电影榜单top100
# -*- coding: utf-8 -*- """ 1、爬取猫眼电影top100榜单 1、程序运行,直接爬取第一页 2、是否继续爬取(y/n) y:爬取第2页 n:爬取结束,谢谢使用 3、把每一页的内容保存到本地,第一页.html 第一页:http://maoyan.com/board/4?offset=0 第二页:http://maoyan.com/board/4?offset=10 4、解析:电影名,主演,上映时间 """ import urllib.request import re class MaoyanSpider: '''爬取猫眼电影top100榜单''' def __init__(self): self.baseurl = "http://maoyan.com/board/4?offset=" self.headers = {'User-Agent':'User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'} def getPage(self,url): '''获取html页面''' #创建请求对象 res = urllib.request.Request(url,headers= self.headers) #发送请求 rep = urllib.request.urlopen(res) #得到响应结果 html = rep.read().decode("utf=8") return html def wirtePage(self,filename,html): '''保存至本地文件''' # with open(filename,'w',encoding="utf-8") as f: # f.write(html) content_list = self.match_contents(html) for content_tuple in content_list: movie_title = content_tuple[0].strip() movie_actors = content_tuple[1].strip()[3:] releasetime = content_tuple[2].strip()[5:15] with open(filename,'a',encoding='utf-8') as f: f.write(movie_title+"|" + movie_actors+"|" + releasetime+'\n') def match_contents(self,html): '''匹配电影名,主演,和上映时间''' #正则表达式 # ''' # <div class="movie-item-info"> # <p class="name"><a href="/films/1203" title="霸王别姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王别姬</a></p> # <p class="star"> # 主演:张国荣,张丰毅,巩俐 # </p> # <p class="releasetime">上映时间:1993-01-01(中国香港)</p> </div> # ''' regex = r'<div class="movie-item-info">.*?<a.*? title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>.*?</div>' p = re.compile(regex,re.S) content_list = p.findall(html) return content_list def workOn(self): '''主函数''' for page in range(0,10): #拼接URL url = self.baseurl + str(page*10) #filename = '猫眼/第' + str(page+1) + "页.html" filename = '猫眼/第' + str(page+1) + "页.txt" print("正在爬取%s页"%(page+1)) html = self.getPage(url) self.wirtePage(filename,html) #用于记录输入的命令 flag = False while True: msg = input("是否继续爬取(y/n)") if msg == "y": flag = True elif msg == "n": print("爬取结束,谢谢使用") flag = False else: print("您输入的命令无效") continue if flag : break else: return None print("所有内容爬取完成") if __name__ == "__main__": spider = MaoyanSpider() spider.workOn()
3、Xpath
4、BeautifulSoup
2、请求方式及方案
1、GET(查询参数都在URL地址中显示)
2、POST
1、特点:查询参数在Form表单里保存
2、使用:
urllib.request.urlopen(url,data = data ,headers = headers)
data:表单数据data必须以bytes类型提交,不能是字典
3、案例:有道翻译
1、利用Fiddler抓包工具抓取WebForms里表单数据
2、对POST数据进行处理bytes数据类型
3、发送请求获取响应
from urllib import request,parse
import json

# 1、处理表单数据
# Form表单的数据放到字典中,然后再进行编码转换
word = input('请输入要翻译的内容:')
data = {"i":word,
        "from":"AUTO",
        "to":"AUTO",
        "smartresult":"dict",
        "client":"fanyideskweb",
        "salt":"1536648367302",
        "sign":"f7f6b53876957660bf69994389fd0014",
        "doctype":"json",
        "version":"2.1",
        "keyfrom":"fanyi.web",
        "action":"FY_BY_REALTIME",
        "typoResult":"false"}
# 2、把data转换为bytes类型
data = parse.urlencode(data).encode('utf-8')

# 3、发请求获取响应
# 此处的URL为抓包工具抓到的POST的URL
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

req = request.Request(url,data=data,headers=headers)
res = request.urlopen(req)
result = res.read().decode('utf-8')
print(type(result))   # <class 'str'>
print(result)         # result为json格式的字符串
'''{"type":"ZH_CN2EN",
    "errorCode":0,
    "elapsedTime":1,
    "translateResult":[
        [{"src":"你好",
          "tgt":"hello"
        }]
    ]
}'''
# 把json格式的字符串转换为Python字典
dic = json.loads(result)
print(dic["translateResult"][0][0]["tgt"])
4、json模块
json.loads('json格式的字符串')
作用:把json格式的字符串转换为Python字典
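一个最小的演示:
import json

json_str = '{"name": "Tom", "age": 20}'
d = json.loads(json_str)
print(type(d))     # <class 'dict'>
print(d["name"])   # Tom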
3、Cookie模拟登陆
1、Cookie 和 Session
cookie:通过在客户端记录的信息确定用户身份
session:通过在服务器端记录的信息确定用户身份
2、案例:使用cookie模拟登陆人人网
1、获取到登录信息的cookie(登录一次抓包)
2、发送请求得到响应
from urllib import request url = "http://www.renren.com/967982493/profile" headers = { 'Host': 'www.renren.com', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2', #Accept-Encoding: gzip, deflate 'Referer': 'http://www.renren.com/SysHome.do', 'Cookie': 'anonymid=jlxfkyrx-jh2vcz; depovince=SC; _r01_=1; jebe_key=6aac48eb-05fb-4569-8b0d-5d71a4a7a3e4%7C911ac4448a97a17c4d3447cbdae800e4%7C1536714317279%7C1%7C1536714319337; jebecookies=a70e405c-c17a-4877-8164-00823b5e092c|||||; JSESSIONID=abcq8TskVWDMEgvjGslxw; ick_login=d1b4c959-7554-421e-8a7f-b97edd577b3a; ick=c6c7cac9-d9ac-49e5-9e74-9ac481136db1; XNESSESSIONID=e94666d4bdb8; wp_fold=0; BAIDU_SSP_lcr=https://www.baidu.com/link?url=n0NWyopmrKuQ6xUulfbYUud3nr02sIODSKI8sfzvS2G&wd=&eqid=e7cd8eed0003aeaa000000055b9864da; _de=5EE7F4A4EC35EE3510B8477EDD9F1F27; p=dc67b283c53b57a3c9f20e04cb9ca2d43; first_login_flag=1; ln_uact=13333759329; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; t=cb96dfe9e344a2d817027a2c8f7f0c4c3; societyguester=cb96dfe9e344a2d817027a2c8f7f0c4c3; id=967982493; xnsid=34a50049; loginfrom=syshome', 'Connection': 'keep-alive', 'Upgrade-Insecure-Requests': '1', } req = request.Request(url,headers = headers) res = request.urlopen(req) html = res.read().decode('utf-8') print(html)
3、requests模块
1、安装(Conda prompt终端)
1、(base) ->conda install requests
2、常用方法
1、get():向网站发送请求,并获取响应对象
1、用法:response = requests.get(url,headers=headers)
2、response的属性
1、response.text:获取响应内容(字符串)
说明:当响应头中未指定字符编码时,requests默认按ISO-8859-1解码,可以手动指定:response.encoding='utf-8'
2、response.content:获取响应内容(bytes)
1、应用场景:爬取图片,音频等非结构化数据
2、示例:爬取图片
3、response.status_code:返回服务器的响应码
import requests url = "http://www.baidu.com/" headers = {"User-Agent":"Mozilla5.0/"} #发送请求获取响应对象 response = requests.get(url,headers) #改变编码方式 response.encoding = 'utf-8' #获取响应内容,text返回字符串 print(response.text) #content返回bytes print(response.content) print(response.status_code)#200
3、get():查询参数 params(字典格式)
1、没有查询参数
res = requests.get(url,headers=headers)
2、有查询参数
params= {"wd":"python"}
res = requests.get(url,params=params,headers=headers)
2、post():参数名data
1、data={} #data参数为字典,不用转为bytes数据类型
2、示例:
import requests
import json

# 1、处理表单数据
word = input('请输入要翻译的内容:')
data = {"i":word,
        "from":"AUTO",
        "to":"AUTO",
        "smartresult":"dict",
        "client":"fanyideskweb",
        "salt":"1536648367302",
        "sign":"f7f6b53876957660bf69994389fd0014",
        "doctype":"json",
        "version":"2.1",
        "keyfrom":"fanyi.web",
        "action":"FY_BY_REALTIME",
        "typoResult":"false"}

# 此处的URL为抓包工具抓到的POST的URL
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

response = requests.post(url,data=data,headers=headers)
response.encoding = 'utf-8'
result = response.text
print(type(result))   # <class 'str'>
print(result)         # result为json格式的字符串
'''{"type":"ZH_CN2EN",
    "errorCode":0,
    "elapsedTime":1,
    "translateResult":[
        [{"src":"你好",
          "tgt":"hello"
        }]
    ]
}'''
# 把json格式的字符串转换为Python字典
dic = json.loads(result)
print(dic["translateResult"][0][0]["tgt"])
3、代理:proxies
1、爬虫和反爬虫斗争的第二步
获取代理IP的网站
1、西刺代理
2、快代理
3、全国代理
2、普通代理:proxies={"协议":"IP地址:端口号"}
proxies = {'HTTP':"123.161.237.114:45327"}
import requests url = "http://www.taobao.com" proxies = {"HTTP":"123.161.237.114:45327"} headers = {"User-Agent":"Mozilla5.0/"} response = requests.get(url,proxies=proxies,headers=headers) response.encoding = 'utf-8' print(response.text)
3、私密代理:proxies={"协议":"http://用户名:密码@IP地址:端口号"}
proxies={'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'}
import requests url = "http://www.taobao.com/" proxies = {'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'} headers = {"User-Agent":"Mozilla5.0/"} response = requests.get(url,proxies=proxies,headers=headers) response.encoding = 'utf-8' print(response.text)
4、案例:爬取链家地产二手房信息
1、存入mysql数据库
import pymysql

# 连接数据库(host, user, password)
db = pymysql.connect("localhost","root","123456",charset='utf8')
cursor = db.cursor()

cursor.execute("create database if not exists testspider;")
cursor.execute("use testspider;")
cursor.execute("create table if not exists t1(id int);")
cursor.execute("insert into t1 values(100);")
db.commit()

cursor.close()
db.close()
2、存入MongoDB数据库
import pymongo

# 连接mongoDB数据库
conn = pymongo.MongoClient('localhost',27017)
# 创建数据库并得到数据库对象
db = conn.testpymongo
# 创建集合并得到集合对象
myset = db.t1
# 向集合中插入一个数据
myset.insert({"name":"Tom"})
""" 爬取链家地产二手房信息(用私密代理实现) 目标:爬取小区名称,总价 步骤: 1、获取url https://cd.lianjia.com/ershoufang/pg1/ https://cd.lianjia.com/ershoufang/pg2/ 2、正则匹配 3、写入到本地文件 """ import requests import re import multiprocessing as mp BASE_URL = "https://cd.lianjia.com/ershoufang/pg" proxies = {'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'} headers = {"User-Agent":"Mozilla5.0/"} regex = '<div class="houseInfo">.*?<a.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>' def getText(BASE_URL,proxies,headers,page): url = BASE_URL+str(page) res = requests.get(url,proxies=proxies,headers=headers) res.encoding = 'utf-8' html = res.text return html def saveFile(page,regex=regex): html = getText(BASE_URL,proxies,headers,page) p = re.compile(regex,re.S) content_list = p.findall(html) for content_tuple in content_list: cell = content_tuple[0].strip() price = content_tuple[1].strip() with open('链家.txt','a') as f: f.write(cell+" "+price+"\n") if __name__ == "__main__": pool = mp.Pool(processes = 10) pool.map(saveFile,[page for page in range(1,101)])
import requests import re import multiprocessing as mp import pymysql import warnings BASE_URL = "https://cd.lianjia.com/ershoufang/pg" proxies = {'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'} headers = {"User-Agent":"Mozilla5.0/"} regex = '<div class="houseInfo">.*?<a.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>' c_db = "create database if not exists spider;" u_db = "use spider;" c_tab = "create table if not exists lianjia(id int primary key auto_increment,\ name varchar(30),\ price decimal(20,2))charset=utf8;" db = pymysql.connect("localhost","root",'123456',charset="utf8") cursor = db.cursor() warnings.filterwarnings("error") try: cursor.execute(c_db) except Warning: pass cursor.execute(u_db) try: cursor.execute(c_tab) except Warning: pass def getText(BASE_URL,proxies,headers,page): url = BASE_URL+str(page) res = requests.get(url,proxies=proxies,headers=headers) res.encoding = 'utf-8' html = res.text return html def writeToMySQL(page,regex=regex): html = getText(BASE_URL,proxies,headers,page) p = re.compile(regex,re.S) content_list = p.findall(html) for content_tuple in content_list: cell = content_tuple[0].strip() price = float(content_tuple[1].strip())*10000 s_insert = "insert into lianjia(name,price) values('%s','%s');"%(cell,price) cursor.execute(s_insert) db.commit() if __name__ == "__main__": pool = mp.Pool(processes = 20) pool.map(writeToMySQL,[page for page in range(1,101)])
import requests import re import multiprocessing as mp import pymongo BASE_URL = "https://cd.lianjia.com/ershoufang/pg" proxies = {'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'} headers = {"User-Agent":"Mozilla5.0/"} regex = '<div class="houseInfo">.*?<a.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>' #链接mongoDB数据库 conn = pymongo.MongoClient('localhost',27017) #创建数据库并得到数据库对象 db = conn.spider; #创建集合并得到集合对象 myset = db.lianjia def getText(BASE_URL,proxies,headers,page): url = BASE_URL+str(page) res = requests.get(url,proxies=proxies,headers=headers) res.encoding = 'utf-8' html = res.text return html def writeToMongoDB(page,regex=regex): html = getText(BASE_URL,proxies,headers,page) p = re.compile(regex,re.S) content_list = p.findall(html) for content_tuple in content_list: cell = content_tuple[0].strip() price = float(content_tuple[1].strip())*10000 d = {"houseName":cell,"housePrice":price} #向集合中插入一个数据 myset.insert(d) if __name__ == "__main__": pool = mp.Pool(processes = 20) pool.map(writeToMongoDB,[page for page in range(1,101)])
4、WEB客户端验证(有些网站需要先登录才可以访问):auth
1、auth = ("用户名","密码"),是一个元组
import requests
import re

regex = r'<a.*?>(.*?)</a>'

class NoteSpider:
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla5.0/"}
        # auth参数为元组
        self.auth = ("tarenacode","code_2013")
        self.url = "http://code.tarena.com.cn/"

    def getParsePage(self):
        res = requests.get(self.url,auth=self.auth,
                           headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        p = re.compile(regex,re.S)
        r_list = p.findall(html)
        # 调用writePage()方法
        self.writePage(r_list)

    def writePage(self,r_list):
        print("开始写入")
        for r_str in r_list:
            with open('笔记.txt','a') as f:
                f.write(r_str + "\n")
        print("写入完成")

if __name__ == "__main__":
    obj = NoteSpider()
    obj.getParsePage()
5、SSL证书认证:verify
1、verify=True:默认,做SSL证书认证
2、verify=False: 忽略证书认证
import requests url = "http://www.12306.cn/mormhweb/" headers = {"User-Agent":"Mozilla5.0/"} res = requests.get(url,verify=False,headers=headers) res.encoding = "utf-8" print(res.text)
4、Handler处理器(urllib.request,了解)
1、定义
用于构建自定义的opener对象;urlopen()使用的是默认opener,无法满足代理等高级需求
2、常用方法
1、build_opener(Handler处理器对象)
2、opener.open(url),相当于执行了urlopen
3、使用流程
1、创建相关Handler处理器对象
http_handler = urllib.request.HTTPHandler()
2、创建自定义opener对象
opener = urllib.request.build_opener(http_handler)
3、利用opener对象的open方法发送请求
4、Handler处理器分类
1、HTTPHandler()
import urllib.request url = "http://www.baidu.com/" #1、创建HTTPHandler处理器对象 http_handler = urllib.request.HTTPHandler() #2、创建自定义的opener对象 opener = urllib.request.build_opener(http_handler) #3、利用opener对象的open方法发送请求 req = urllib.request.Request(url) res = opener.open(req) print(res.read().decode("utf-8"))
2、ProxyHandler(代理IP):普通代理
import urllib.request url = "http://www.baidu.com" #1、创建handler proxy_handler = urllib.request.ProxyHandler({"HTTP":"123.161.237.114:45327"}) #2、创建自定义opener opener = urllib.request.build_opener(proxy_handler) #3、利用opener的open方法发送请求 req = urllib.request.Request(url) res = opener.open(req) print(res.read().decode("utf-8"))
3、ProxyBasicAuthHandler(密码管理器对象):私密代理
1、密码管理器使用流程
1、创建密码管理器对象
pwd = urllib.request.HTTPPasswordMgrWithDefaultRealm()
2、添加私密代理用户名,密码,IP地址,端口号
pwd.add_password(None,"IP:端口","用户名","密码")
2、urllib.request.ProxyBasicAuthHandler(密码管理器对象)
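按照上面的流程写的一个使用示意(其中的用户名、密码为占位,IP和端口沿用前面私密代理示例中的值):
import urllib.request

url = "http://www.baidu.com/"
# 1、创建密码管理器对象
pwd = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# 2、添加私密代理的IP地址、端口、用户名、密码(用户名和密码为占位)
pwd.add_password(None,"114.67.228.126:16819","用户名","密码")
# 3、创建ProxyBasicAuthHandler处理器对象,并配合ProxyHandler指定代理地址
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler(pwd)
proxy_handler = urllib.request.ProxyHandler({"http":"114.67.228.126:16819"})
# 4、创建自定义opener并发送请求
opener = urllib.request.build_opener(proxy_handler,proxy_auth_handler)
res = opener.open(url)
print(res.read().decode("utf-8"))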
1、CSV模块使用流程
1、Python语句打开CSV文件:
with open('test.csv','a',newline='',encoding='utf-8') as f:
pass
2、初始化写入对象,使用writer()方法:
writer = csv.writer(f)
3、写入数据使用writerow()方法
writer.writerow(["霸王别姬",1993])
4、示例:
import csv

# 打开csv文件,如果不写newline='',则每一条数据中间会出现一条空行
with open("test.csv",'a',newline='') as f:
    # 初始化写入对象
    writer = csv.writer(f)
    # 写入数据
    writer.writerow(['id','name','age'])
    writer.writerow([1,'Lucy',20])
    writer.writerow([2,'Tom',25])
import csv with open("猫眼/第一页.csv",'w',newline="") as f: writer = csv.writer(f) writer.writerow(['电影名','主演','上映时间']) ''' 如果使用utf-8会出现['\ufeff霸王别姬', '张国荣,张丰毅,巩俐', '1993-01-01'] 使用utf-8-sig['霸王别姬', '张国荣,张丰毅,巩俐', '1993-01-01'] 两者的区别: UTF-8以字节为编码单元,它的字节顺序在所有系统中都是一様的,没有字节序的问题, 也因此它实际上并不需要BOM(“ByteOrder Mark”)。 但是UTF-8 with BOM即utf-8-sig需要提供BOM。 ''' with open("猫眼/第1页.txt",'r',encoding="utf-8-sig") as file: while True: data_list = file.readline().strip().split("|") print(data_list) writer.writerow(data_list) if data_list[0]=='': break
2、Xpath工具(解析HTML)
1、Xpath
在XML文档中查找信息的语言,同样适用于HTML文档的检索
2、Xpath辅助工具
1、Chrome插件:Xpath Helper
打开/关闭:Ctrl + Shift + X
2、FireFox插件:XPath checker
3、Xpath表达式编辑工具:XML Quire
3、Xpath匹配规则
<?xml version="1.0" encoding="ISO-8859-1"?> <bookstore> <book> <title lang="en">Harry Potter</title> <author>J K. Rowling</author> <year>2005</year> <price>29.99</price> </book> <book> <title lang="chs">Python</title> <author>Joe</author> <year>2018</year> <price>49.99</price> </book> </bookstore>
1、匹配演示
1、查找bookstore下面的所有节点:/bookstore
2、查找所有的book节点://book
3、查找所有book节点下title节点中,lang属性为‘en’的节点://book/title[@lang='en']
2、选取节点
/:从根节点开始选取。如 /bookstore,表示选取"/"前面节点的子节点
//:从整个文档中查找某个节点。如 //price,表示选取"//"前面节点的所有后代节点
@:选取某个节点的属性。如 //title[@lang="en"]
3、@使用
1、选取1个节点://title[@lang='en']
2、选取N个节点://title[@lang]
3、选取节点属性值://title/@lang
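结合lxml对上面的bookstore文档(节选)演示@的三种用法(演示代码):
from lxml import etree

xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chs">Python</title>
    <price>49.99</price>
  </book>
</bookstore>"""

root = etree.XML(xml)
# 1、选取1个节点:lang属性为en的title节点
print(root.xpath("//title[@lang='en']")[0].text)   # Harry Potter
# 2、选取N个节点:所有带lang属性的title节点
print(len(root.xpath("//title[@lang]")))           # 2
# 3、选取节点属性值:所有title节点的lang属性
print(root.xpath("//title/@lang"))                 # ['en', 'chs']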
4、匹配多路径
1、符号: |
2、示例:
获取所有book节点下的title节点和price节点
//book/title|//book/price
5、函数
contains():匹配一个属性值中包含某些字符串的节点
//title[contains(@lang,'e')]
text():获取文本
last():获取最后一个元素
//ul[@class='pagination']/li[last()]
not():取反
//*[@id="content"]/div[2]//p[not(@class='otitle')]
6、可以通过解析出来的标签对象继续调用xpath函数往下寻找标签
语法:获取的标签对象.xpath("./div/span")
""" 糗事百科https://www.qiushibaike.com/8hr/page/1/ 匹配内容 1、用户昵称,div/div/a/h2.text 2、内容,div/a/div/span.text 3、点赞数,div/div/span/i.text 4、评论数,div/div/span/a/i.text """ import requests from lxml import etree url = "https://www.qiushibaike.com/8hr/page/1/" headers = {'User-Agent':"Mozilla5.0/"} res = requests.get(url,headers=headers) res.encoding = "utf-8" html = res.text #先获取所有段子的div列表 parseHtml = etree.HTML(html) div_list = parseHtml.xpath("//div[contains(@id,'qiushi_tag_')]") print(len(div_list)) #遍历列表 for div in div_list: #获取用户昵称 username = div.xpath('./div/a/h2')[0].text print(username) #获取内容 content = div.xpath('.//div[@class="content"]/span')[0].text print(content) #获取点赞 laughNum = div.xpath('./div/span/i')[0].text print(laughNum) #获取评论数 pingNum = div.xpath('./div/span/a/i')[0].text print(pingNum)
3、解析HTML源码
1、lxml库:HTML/XML解析库
1、安装
conda install lxml
pip install lxml
2、使用流程
1、利用lxml库的etree模块构建解析对象
2、解析对象调用xpath工具定位节点信息
3、使用
1、导入模块from lxml import etree
2、创建解析对象:parseHtml = etree.HTML(html)
3、调用xpath进行解析:r_list = parseHtml.xpath("//title[@lang='en']")
说明:只要调用了xpath,则结果一定是列表
from lxml import etree html = """<div class="wrapper"> <i class="iconfont icon-back" id="back"></i> <a href="/" id="channel">新浪社会</a> <ul id="nav"> <li><a href="http://domestic.firefox.sina.com/" title="国内">国内</a></li> <li><a href="http://world.firefox.sina.com/" title="国际">国际</a></li> <li><a href="http://mil.firefox.sina.com/" title="军事">军事</a></li> <li><a href="http://photo.firefox.sina.com/" title="图片">图片</a></li> <li><a href="http://society.firefox.sina.com/" title="社会">社会</a></li> <li><a href="http://ent.firefox.sina.com/" title="娱乐">娱乐</a></li> <li><a href="http://tech.firefox.sina.com/" title="科技">科技</a></li> <li><a href="http://sports.firefox.sina.com/" title="体育">体育</a></li> <li><a href="http://finance.firefox.sina.com/" title="财经">财经</a></li> <li><a href="http://auto.firefox.sina.com/" title="汽车">汽车</a></li> </ul> <i class="iconfont icon-liebiao" id="menu"></i> </div>""" #1、创建解析对象 parseHtml = etree.HTML(html) #2、利用解析对象调用xpath工具, #获取a标签中href的值 s1 = "//a/@href" #获取单独的/ s2 = "//a[@id='channel']/@href" #获取后面的a标签中href的值 s3 = "//li/a/@href" s3 = "//ul[@id='nav']/li/a/@href"#更准确 #获取所有a标签的内容,1、首相获取标签对象,2、遍历对象列表,在通过对象.text属性获取文本值 s4 = "//a" #获取新浪社会 s5 = "//a[@id='channel']" #获取国内,国际,....... s6 = "//ul[@id='nav']//a" r_list = parseHtml.xpath(s6) print(r_list) for i in r_list: print(i.text)
4、案例:抓取百度贴吧帖子里面的图片
1、目标:抓取贴吧中帖子图片
2、思路
1、先获取贴吧主页的URL:河南大学,下一页的URL规律
2、获取河南大学吧中每个帖子的URL
3、对每个帖子发送请求,获取帖子里面所有图片的URL
4、对图片URL发送请求,以wb的方式写入本地文件
""" 步骤 1、获取贴吧主页的URL http://tieba.baidu.com/f?kw=河南大学&pn=0 http://tieba.baidu.com/f?kw=河南大学&pn=50 2、获取每个帖子的URL,//div[@class='t_con cleafix']/div/div/div/a/@href https://tieba.baidu.com/p/5878699216 3、打开每个帖子,找到图片的URL,//img[@class='BDE_Image']/@src http://imgsrc.baidu.com/forum/w%3D580/sign=da37aaca6fd9f2d3201124e799ed8a53/27985266d01609240adb3730d90735fae7cd3480.jpg 4、保存到本地 """ import requests from lxml import etree class TiebaPicture: def __init__(self): self.baseurl = "http://tieba.baidu.com" self.pageurl = "http://tieba.baidu.com/f" self.headers = {'User-Agent':"Mozilla5.0/"} def getPageUrl(self,url,params): '''获取每个帖子的URL''' res = requests.get(url,params=params,headers = self.headers) res.encoding = 'utf-8' html = res.text #从HTML页面获取每个帖子的URL parseHtml = etree.HTML(html) t_list = parseHtml.xpath("//div[@class='t_con cleafix']/div/div/div/a/@href") print(t_list) for t in t_list: t_url = self.baseurl + t self.getImgUrl(t_url) def getImgUrl(self,t_url): '''获取帖子中所有图片的URL''' res = requests.get(t_url,headers=self.headers) res.encoding = "utf-8" html = res.text parseHtml = etree.HTML(html) img_url_list = parseHtml.xpath("//img[@class='BDE_Image']/@src") for img_url in img_url_list: self.writeImg(img_url) def writeImg(self,img_url): '''将图片保存如文件''' res = requests.get(img_url,headers=self.headers) html = res.content #保存到本地,将图片的URL的后10位作为文件名 filename = img_url[-10:] with open(filename,'wb') as f: print("%s正在下载"%filename) f.write(html) print("%s下载完成"%filename) def workOn(self): '''主函数''' kw = input("请输入你要爬取的贴吧名") begin = int(input("请输入起始页")) end = int(input("请输入终止页")) for page in range(begin,end+1): pn = (page-1)*50 #拼接某个贴吧的URl params = {"kw":kw,"pn":pn} self.getPageUrl(self.pageurl,params=params) if __name__ == "__main__": spider = TiebaPicture() spider.workOn()
1、动态网站数据抓取 - Ajax
1、Ajax动态加载
1、特点:动态加载(滚动鼠标滑轮时加载)
2、抓包工具:查询参数在WebForms -> QueryString
2、案例:豆瓣电影top100榜单
import requests
import json
import csv

url = "https://movie.douban.com/j/chart/top_list"
headers = {'User-Agent':"Mozilla5.0/"}
params = {"type":"11",
          "interval_id":"100:90",
          "action":"",
          "start":"0",
          "limit":"100"}

res = requests.get(url,params=params,headers=headers)
res.encoding = "utf-8"
# 得到json格式的数组[]
html = res.text
# 把json格式的数组转为python的列表
ls = json.loads(html)

with open("豆瓣100.csv",'a',newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name","score"])
    for dic in ls:
        name = dic['title']
        score = dic['rating'][1]
        writer.writerow([name,score])
2、json模块
1、作用:json格式类型 和 Python数据类型相互转换
2、常用方法
1、json.loads():json格式 --> Python数据类型
json python
对象 字典
数组 列表
2、json.dumps():Python数据类型 --> json格式的字符串
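loads()和dumps()的对照演示:
import json

py_dict = {"name": "霸王别姬", "score": 9.6}
json_str = json.dumps(py_dict, ensure_ascii=False)   # Python字典 --> json字符串
print(json_str)              # {"name": "霸王别姬", "score": 9.6}
print(json.loads(json_str))  # json字符串 --> Python字典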
3、selenium + phantomjs 强大的网络爬虫
1、selenium
1、定义:WEB自动化测试工具,应用于WEB自动化测试
2、特点:
1、可运行在浏览器上,根据指令操作浏览器,让浏览器自动加载页面
2、只是一个工具,不支持浏览器功能,只能与第三方浏览器结合使用
3、安装
conda install selenium
pip install selenium
2、phantomjs
1、Windows
1、定义:无界面浏览器(无头浏览器)
2、特点:
1、把页面加载到内存中执行,不渲染显示界面
2、运行高效
3、安装
1、把安装包拷贝到Python安装路径Script...
2、Ubuntu
1、下载phantomjs安装包放到一个路径下
2、用户主目录:vi .bashrc
export PHANTOM_JS=/home/.../phantomjs-...
export PATH=$PHANTOM_JS/bin:$PATH
3、source .bashrc
4、终端:phantomjs
3、示例代码
# 导入selenium库中的webdriver
from selenium import webdriver

# 创建打开phantomjs的对象
driver = webdriver.PhantomJS()
# 访问百度
driver.get("http://www.baidu.com/")
# 获取网页截图
driver.save_screenshot("百度.png")
4、常用方法
1、driver.get(url)
2、driver.page_source.find("内容"):
作用:从html源码中搜索字符串,搜索成功返回非-1,搜索失败返回-1
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.baidu.com/")

r1 = driver.page_source.find("kw")
r2 = driver.page_source.find("aaaa")
print(r1,r2)   # 1053 -1
3、driver.find_element_by_id("id值").text
4、driver.find_element_by_name("属性值")
5、driver.find_element_by_class_name("属性值")
6、对象名.send_keys("内容")
7、对象名.click()
8、driver.quit()
5、案例:登录豆瓣网站
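一个按前面常用方法组织的登录流程示意(页面元素的name、class定位方式为假设,需按实际页面调整):
from selenium import webdriver
import time

driver = webdriver.PhantomJS()
driver.get("https://www.douban.com/")
time.sleep(2)

# 找到用户名、密码输入框并输入内容(定位方式为假设)
driver.find_element_by_name("form_email").send_keys("用户名")
driver.find_element_by_name("form_password").send_keys("密码")
# 点击登录按钮(定位方式为假设)
driver.find_element_by_class_name("btn-submit").click()
time.sleep(2)

# 截图确认是否登录成功
driver.save_screenshot("douban.png")
driver.quit()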
4、BeautifulSoup
1、定义:用于解析HTML或XML的库,常配合lxml等解析器使用
2、安装并导入
安装:
pip install beautifulsoup4
conda install beautifulsoup4
导入模块:from bs4 import BeautifulSoup as bs
3、示例
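一个最小的用法示意(HTML片段为演示用):
from bs4 import BeautifulSoup as bs

html = """<div class="movie">
    <p class="name">霸王别姬</p>
    <p class="score">9.6</p>
</div>"""

soup = bs(html,'lxml')
# find():返回第一个匹配的节点对象
print(soup.find("p",{"class":"name"}).get_text())   # 霸王别姬
# find_all():返回所有匹配节点的列表
for p in soup.find_all("p"):
    print(p.get_text())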
4、BeautifulSoup支持的解析库
1、lxml HTML解析器, 'lxml'速度快,文档容错能力强
2、Python标准库 'html.parser',速度一般
3、lxml XML解析器 'xml':速度快
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time

driver = webdriver.PhantomJS()
driver.get("https://www.douyu.com/directory/all")

while True:
    html = driver.page_source
    # 创建解析对象
    soup = bs(html,'lxml')
    # 直接调用方法去查找元素
    # 存放所有主播的元素对象
    names = soup.find_all("span",{"class":"dy-name ellipsis fl"})
    numbers = soup.find_all("span",{"class":"dy-num fr"})
    # name、number都是对象,通过get_text()取文本
    for name,number in zip(names,numbers):
        print("观众人数:",number.get_text(),"主播:",name.get_text())
    if html.find("shark-pager-disable-next") == -1:
        driver.find_element_by_class_name("shark-pager-next").click()
        time.sleep(4)
    else:
        break
使用pytesseract识别验证码
1、安装 sudo pip3 install pytesseract(另需在系统中安装tesseract-ocr识别引擎)
2、使用步骤:
1、打开验证码图片:Image.open('验证码图片路径')
2、使用pytesseract模块中的image_to_string()方法进行识别
from PIL import Image
from pytesseract import *

# 1、加载图片
image = Image.open('t1.png')
# 2、识别过程
text = image_to_string(image)
print(text)
使用captcha模块生成验证码
1、安装 sudo pip3 install captcha
import random
from PIL import Image
import numpy as np
from captcha.image import ImageCaptcha

digit = ['0','1','2','3','4','5','6','7','8','9']
alphabet = [chr(i) for i in range(97,123)] + [chr(i) for i in range(65,91)]
char_set = digit + alphabet
#print(char_set)

def random_captcha_text(char_set=char_set,captcha_size=4):
    '''默认获取一个随机的含有四个元素的列表'''
    captcha_text = []
    for i in range(captcha_size):
        ele = random.choice(char_set)
        captcha_text.append(ele)
    return captcha_text

def gen_captcha_text_and_image():
    '''默认随机得到一个包含四个字符的图片验证码并返回字符串'''
    image = ImageCaptcha()
    captcha_text = random_captcha_text()
    # 将列表转为字符串
    captcha_text = ''.join(captcha_text)
    captchaInfo = image.generate(captcha_text)
    # 生成验证码图片
    captcha_image = Image.open(captchaInfo)
    captcha_image = np.array(captcha_image)
    im = Image.fromarray(captcha_image)
    im.save('captcha.png')
    return captcha_text

if __name__ == '__main__':
    gen_captcha_text_and_image()
去重
1、去重分为两个步骤,创建两个队列(列表)
1、一个队列存放已经爬取过的url,存放之前先判断这个url是否已经存在于已爬队列中,通过这样的方式去重
2、另外一个队列存放待爬取的url,如果该url不在已爬队列中则放入到待爬取队列中
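两个队列配合去重的过程可以用下面的小片段示意(URL为占位,逻辑与后面的完整案例一致):
crawl_queue = []     # 待爬取队列
crawled_queue = []   # 已爬取队列(完整案例中存放的是URL的hash值)

def add_url(url):
    '''新发现的URL:不在已爬队列、也不在待爬队列中才入队'''
    if url not in crawled_queue and url not in crawl_queue:
        crawl_queue.append(url)

add_url("http://www.example.com/page1")
add_url("http://www.example.com/page1")   # 重复URL,不会再次入队

while crawl_queue:
    url = crawl_queue.pop(0)      # 广度优先:从队头取出
    print("爬取", url)
    crawled_queue.append(url)     # 爬完后记入已爬队列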
使用去重和广度优先遍历爬取豆瓣网
import re from bs4 import BeautifulSoup import basicspider import hashlibHelper def get_html(url): """ 获取一页的网页源码信息 """ headers = [("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")] html = basicspider.downloadHtml(url, headers=headers) return html def get_movie_all(html): """ 获取当前页面中所有的电影的列表信息 """ soup = BeautifulSoup(html, "html.parser") movie_list = soup.find_all('div', class_='bd doulist-subject') #print(movie_list) return movie_list def get_movie_one(movie): """ 获取一部电影的精细信息,最终拼成一个大的字符串 """ result = "" soup = BeautifulSoup(str(movie),"html.parser") title = soup.find_all('div', class_="title") soup_title = BeautifulSoup(str(title[0]), "html.parser") for line in soup_title.stripped_strings: result += line try: score = soup.find_all('span', class_='rating_nums') score_ = BeautifulSoup(str(score[0]), "html.parser") for line in score_.stripped_strings: result += "|| 评分:" result += line except: result += "|| 评分:5.0" abstract = soup.find_all('div', class_='abstract') abstract_info = BeautifulSoup(str(abstract[0]), "html.parser") for line in abstract_info.stripped_strings: result += "|| " result += line result += '\n' print(result) return result def save_file(movieInfo): """ 写文件的操作,这里使用的追加的方式来写文件 """ with open("doubanMovie.txt","ab") as f: #lock.acquire() f.write(movieInfo.encode("utf-8")) #lock.release() crawl_queue = []#待爬取队列 crawled_queue = []#已爬取队列 def crawlMovieInfo(url): '''抓取一页数据''' 'https://www.douban.com/doulist/3516235/' global crawl_queue global crawled_queue html = get_html(url) regex = r'https://www\.douban\.com/doulist/3516235/\?start=\d+&sort=seq&playable=0&sub_type=' p = re.compile(regex,re.S) itemUrls = p.findall(html) #两步去重过程 for item in itemUrls: #将item进行hash然后判断是否已经在已爬队列中 hash_irem = hashlibHelper.hashStr(item) if hash_irem not in crawled_queue:#已爬队列去重 crawl_queue.append(item) crawl_queue = list(set(crawl_queue))#将待爬队列去重 #处理当前页面 movie_list = get_movie_all(html) for movie in movie_list: save_file(get_movie_one(movie)) #将url转为hash值并存入已爬队列中 hash_url = hashlibHelper.hashStr(url) crawled_queue.append(hash_url) if __name__ == "__main__": #广度优先遍历 seed_url = 'https://www.douban.com/doulist/3516235/?start=0&sort=seq&playable=0&sub_type=' crawl_queue.append(seed_url) while crawl_queue: url = crawl_queue.pop(0) crawlMovieInfo(url) print(crawled_queue) print(len(crawled_queue))
import hashlib

def hashStr(strInfo):
    '''对字符串进行hash'''
    hashObj = hashlib.sha256()
    hashObj.update(strInfo.encode('utf-8'))
    return hashObj.hexdigest()

def hashFile(fileName):
    '''对文件进行hash'''
    hashObj = hashlib.md5()
    with open(fileName,'rb') as f:
        while True:
            # 不要一次性全部读取出来,如果文件太大,内存不够
            data = f.read(2048)
            if not data:
                break
            hashObj.update(data)
    return hashObj.hexdigest()

if __name__ == "__main__":
    print(hashStr("hello"))
    print(hashFile('猫眼电影.txt'))
from urllib import request
from urllib import parse
from urllib import error
import random
import time

def downloadHtml(url,headers=[()],proxy={},timeout=None,decodeInfo='utf-8',num_tries=10,useProxyRatio=11):
    '''
    支持user-agent等HTTP Request Headers
    支持proxy
    超时的考虑
    编码的问题,如果不是UTF-8编码怎么办
    服务器错误返回5XX怎么办
    客户端错误返回4XX怎么办
    考虑延时的问题
    '''
    time.sleep(random.randint(1,2))   # 控制访问,不要太快
    # 通过useProxyRatio设置是否使用代理
    if random.randint(1,10) > useProxyRatio:
        proxy = None
    # 创建ProxyHandler
    proxy_support = request.ProxyHandler(proxy)
    # 创建opener
    opener = request.build_opener(proxy_support)
    # 设置user-agent
    opener.addheaders = headers
    # 安装opener
    request.install_opener(opener)

    html = None
    try:
        # 这里可能出现很多异常
        # 可能会出现编码异常
        # 可能会出现网络下载异常:客户端的异常404,403
        #                        服务器的异常5XX
        res = request.urlopen(url)
        html = res.read().decode(decodeInfo)
    except UnicodeDecodeError:
        print("UnicodeDecodeError")
    except (error.URLError,error.HTTPError) as e:
        # 客户端的异常404,403(可能被反爬了)
        if hasattr(e,'code') and 400 <= e.code < 500:
            print("Client Error " + str(e.code))
        elif hasattr(e,'code') and 500 <= e.code < 600:
            if num_tries > 0:
                time.sleep(random.randint(1,3))   # 设置等待的时间
                html = downloadHtml(url,headers,proxy,timeout,decodeInfo,num_tries-1)
    return html

if __name__ == "__main__":
    url = "http://maoyan.com/board/4?offset=0"
    headers = [("User-Agent","Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50")]
    print(downloadHtml(url,headers=headers))
Scrapy框架
在终端直接输入scrapy查看可以使用的命令
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
使用步骤:
1、创建一个项目:scrapy startproject 项目名称
scrapy startproject tencentSpider
2、进入到项目中,创建一个爬虫
cd tencentSpider
scrapy genspider tencent hr.tencent.com #tencent表示创建爬虫的名字,hr.tencent.com表示入口,要爬取的数据必须在这个域名之下
3、修改程序的逻辑
1、settings.py
1、设置ua
2、关闭robots协议
3、关闭cookie
4、打开ItemPipelines
5、设置日志日别: LOG_LEVEL = 'WARNING'
# -*- coding: utf-8 -*- # Scrapy settings for tencentSpider project # # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation: # # https://doc.scrapy.org/en/latest/topics/settings.html # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html # https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'tencentSpider' SPIDER_MODULES = ['tencentSpider.spiders'] NEWSPIDER_MODULE = 'tencentSpider.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'tencentSpider (+http://www.yourdomain.com)' USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0' # Obey robots.txt rules ROBOTSTXT_OBEY = False #是否遵循robots协议 # Configure maximum concurrent requests performed by Scrapy (default: 16) #CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0) # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs #DOWNLOAD_DELAY = 3 # The download delay setting will honor only one of: #CONCURRENT_REQUESTS_PER_DOMAIN = 16 #CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default) COOKIES_ENABLED = False # Disable Telnet Console (enabled by default) #TELNETCONSOLE_ENABLED = False # Override the default request headers: #DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', #} # Enable or disable spider middlewares # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html #SPIDER_MIDDLEWARES = { # 'tencentSpider.middlewares.TencentspiderSpiderMiddleware': 543, #} # Enable or disable downloader middlewares # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html #DOWNLOADER_MIDDLEWARES = { # 'tencentSpider.middlewares.TencentspiderDownloaderMiddleware': 543, #} # Enable or disable extensions # See https://doc.scrapy.org/en/latest/topics/extensions.html #EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, #} # Configure item pipelines # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html ITEM_PIPELINES = { 'tencentSpider.pipelines.TencentspiderPipeline': 300,#值表示优先级 } # Enable and configure the AutoThrottle extension (disabled by default) # See https://doc.scrapy.org/en/latest/topics/autothrottle.html #AUTOTHROTTLE_ENABLED = True # The initial download delay #AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies #AUTOTHROTTLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to # each remote server #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: #AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings #HTTPCACHE_ENABLED = True #HTTPCACHE_EXPIRATION_SECS = 0 #HTTPCACHE_DIR = 'httpcache' #HTTPCACHE_IGNORE_HTTP_CODES = [] #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
2、items.py:ORM
import scrapy

class TencentspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # 抓取招聘的职位、链接、岗位类型
    positionName = scrapy.Field()
    positionLink = scrapy.Field()
    positionType = scrapy.Field()
3、pipelines.py:保存数据的逻辑
import json

class TencentspiderPipeline(object):
    def process_item(self, item, spider):
        with open('tencent.json','ab') as f:
            text = json.dumps(dict(item),ensure_ascii=False) + '\n'
            f.write(text.encode('utf-8'))
        return item
4、spiders/tencent.py:主体的逻辑
import scrapy
from tencentSpider.items import TencentspiderItem

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    #start_urls = ['http://hr.tencent.com/']
    # start_urls = []
    # for i in range(0,530,10):
    #     url = "https://hr.tencent.com/position.php?keywords=python&start="
    #     url += str(i)+"#a"
    #     start_urls.append(url)

    url = "https://hr.tencent.com/position.php?keywords=python&start="
    offset = 0
    start_urls = [url + str(offset) + "#a"]

    def parse(self, response):
        for each in response.xpath('//tr[@class="even"]|//tr[@class="odd"]'):
            item = TencentspiderItem()   # item是一个空字典
            item['positionName'] = each.xpath('./td[1]/a/text()').extract()[0]
            item['positionLink'] = "https://hr.tencent.com/" + each.xpath('./td[1]/a/@href').extract()[0]
            item['positionType'] = each.xpath('./td[2]/text()').extract()[0]
            yield item

        # 提取链接
        if self.offset < 530:
            self.offset += 10
            nextPageUrl = self.url + str(self.offset) + "#a"
        else:
            return
        # 对下一页发起请求
        yield scrapy.Request(nextPageUrl,callback=self.parse)
4、运行爬虫
scrapy crawl tencent
5、运行爬虫 并将数据保存到指定文件中
scrapy crawl tencent -o 文件名
如何在scrapy框架中设置代理服务器
1、可以在middlewares.py文件中的DownloaderMiddleware类中的process_request()方法中,来完成代理服务器的设置
2、然后将代理服务器池放在settings.py文件中,定义一个proxyList = [.....]
3、process_request()方法里面通过random.choice(proxyList)随机选一个代理服务器(示意代码见本节末尾)
注意:
1、这里的代理服务器如果是私密的,有用户名和密码时,需要做一层简单的加密处理Base64
2、在scrapy生成一个基础爬虫时使用:scrapy genspider tencent hr.tencent.com,如果要想生成一个高级的爬虫CrawlSpider
scrapy genspider -t crawl tencent2 hr.tencent.com
CrawlSpider这个爬虫可以更加灵活地提取URL等信息,需要了解Rule、LinkExtractor的用法
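按上面的思路写的一个中间件示意(类名、proxyList中的地址均为示意占位;写好后还需在settings.py的DOWNLOADER_MIDDLEWARES中注册):
# middlewares.py 中设置代理的示意
import random
import base64

# 实际可放在settings.py中定义:proxyList = [.....]
proxyList = ["http://123.161.237.114:45327", "http://114.67.228.126:16819"]

class RandomProxyMiddleware(object):
    '''普通代理:随机选一个代理服务器'''
    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(proxyList)

class PrivateProxyMiddleware(object):
    '''私密代理:需要在请求头中加入Base64编码的用户名和密码'''
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://114.67.228.126:16819"   # 占位
        auth = base64.b64encode("用户名:密码".encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + auth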
Scrapy-Redis搭建分布式爬虫
Redis是一种内存数据库(也提供了将数据持久化到磁盘的机制);Scrapy-Redis利用Redis存放共享的请求队列和去重集合,从而让多台机器协同爬取同一个任务