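A crawler for syncing Huawei Cloud photos under Python 2.7

1. Background

As Huawei phone sales grow, the bundled Huawei Cloud service is being used more and more widely. It automatically syncs photos, contacts, notes and so on, which is genuinely convenient, but that convenience comes with data-management risk.
At present Huawei only offers the www.hicloud.com website for managing this data and provides no sync tool for Windows, so managing and syncing the data is quite inconvenient.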

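2. Features

After a few days of experimenting, the current code does the following:
1. Opens the login URL automatically, displays the captcha, and waits for the captcha to be typed in by hand;
2. If the captcha is wrong, opens the login URL again, with at most 3 attempts in total (in the code, a wrong username or password exits immediately);
3. Enters the album folder and, following the album list, collects the real URLs of the photos and videos;
4. Option 1: save the real file URLs to a text file, then batch-download them manually with a tool such as Xunlei;
Option 2: create local folders and sync the photos, videos and other files from the server to disk one by one in a single thread;
Option 3: improve on option 2 by fetching the files with multiple threads.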

3. Code walkthrough

A. Login flow

Visiting http://www.hicloud.com triggers several automatic redirects:
1. The page first does a refresh redirect to http://www.hicloud.com/others/login.action
2. Then a redirect to https://hwid1.vmall.com/casserver/logout?service=https://www.hicloud.com:443/logout
3. Then a redirect to https://hwid1.vmall.com/casserver/remoteLogin?service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&loginUrl=https://hwid1.vmall.com/oauth2/account/login?reqClientType=1&lang=zh-cn&adUrl=https://www.hicloud.com:443/others/show_advert.action
4. Then a redirect to https://hwid1.vmall.com/oauth2/account/login?reqClientType=1&validated=true&service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&adUrl=https://www.hicloud.com:443/others/show_advert.action&lang=zh-cn
This link renders the login page, and the program logs in through link 4 directly.
(What a pain. Did the Huawei admins think this many redirects would keep crawlers out? At first I captured the traffic in Firefox, and the redirects made it much harder to follow. Then I brought out Fiddler4, and that settled it!)
5. Link 4 contains a request that refreshes the captcha:
https://hwid1.vmall.com/casserver/randomcode?randomCodeType=emui4_login&_t=1462786575782
The _t parameter is the local system time in milliseconds.
6. Next, https://hwid1.vmall.com/casserver/remoteLogin is called with a POST.
7. After a successful login there are 3 more redirects:
https://www.hicloud.com:443/others/login.action?lang=zh-cn&ticket=1ST-157502-OV1212126aV9BcM9Sh2Dpe-cas
https://www.hicloud.com:443/others/login.action?lang=zh-cn
https://www.hicloud.com:443/home
If login fails, the server redirects back to link 4 (the URL below is the redirect after a wrong captcha), which is why this program logs in through link 4 directly.
https://hwid1.vmall.com/oauth2/account/login?validated=true&errorMessage=random_code_error|user_pwd_continue_error&service=https%3A%2F%2Fwww.hicloud.com%3A443%2Fothers%2Flogin.action%3Flang%3Dzh-cn&loginChannel=1000002&reqClientType=1&adUrl=https%3A%2F%2Fwww.hicloud.com%3A443%2Fothers%2Fshow_advert.action%3Flang%3Dzh-cn&lang=zh-cn&viewT
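
The outcome of a login attempt can be read from the URL the redirects end on. The following is only a rough sketch of that check, not the code from section 5: classify_login_redirect is a made-up name, and the assumption that a final URL on www.hicloud.com means success is taken from the redirect chain in step 7. The import fallback is there so the sketch runs on Python 2.7 and Python 3.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7


def classify_login_redirect(final_url):
    """Return 'password', 'captcha', 'ok' or 'unknown' for a post-login URL.

    Hypothetical helper: it looks at the errorMessage query parameter
    that the failure redirect carries, using the same two markers the
    main program searches for.
    """
    parts = urlparse(final_url)
    errors = parse_qs(parts.query).get('errorMessage', [''])[0]
    if 'user_pwd_error' in errors:
        return 'password'
    if 'random_code_error' in errors:
        return 'captcha'
    # Assumption: a successful login ends on the hicloud site itself
    if parts.hostname == 'www.hicloud.com':
        return 'ok'
    return 'unknown'
```

With the failure URL shown above this returns 'captcha', and with https://www.hicloud.com:443/home it returns 'ok'.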

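B. Function notes

1. hw.enableCookies()
Sets up some global urllib2 behaviour, such as turning on debug output and enabling cookie handling. Note the word "global": the opener is installed process-wide, which is how urllib2 works.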
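
A minimal sketch of that idea, a cookie-aware opener installed globally, might look like this. It is an illustration rather than the listing from section 5: install_cookie_opener is a name made up here, and the import fallback is only there so the sketch also runs on Python 3, where urllib2 and cookielib were renamed.

```python
try:
    import urllib2 as urlreq            # Python 2.7
    import cookielib as cookiejar
except ImportError:
    import urllib.request as urlreq     # Python 3
    import http.cookiejar as cookiejar


def install_cookie_opener(debuglevel=0):
    """Build an opener that keeps cookies and install it process-wide.

    Hypothetical helper. After install_opener(), every plain
    urlopen() call in the process goes through this opener, so the
    session cookies set during login are sent automatically.
    """
    jar = cookiejar.CookieJar()
    opener = urlreq.build_opener(
        urlreq.HTTPCookieProcessor(jar),
        urlreq.HTTPHandler(debuglevel=debuglevel),    # debuglevel=1 prints traffic
        urlreq.HTTPSHandler(debuglevel=debuglevel),
    )
    urlreq.install_opener(opener)
    return jar, opener
```

Returning the jar makes it easy to inspect which cookies the server has set so far, for example with len(jar).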

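2. hw.getLoginPage()
Requests link 4 described above and returns the response body, which is processed in a later step.
The page supplies several parameters needed when submitting the password check.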

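3. hw.getRadomCode()
Requests a captcha image from the server-side captcha generator and displays it through the system shell.
After the image is shown the process blocks, waiting for the user to type the captcha by hand. (I considered using an OCR package to recognise the characters, but the few public packages I found online were all basically useless on Huawei's captchas, so I gave up on that.)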
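
The captcha request itself is just the randomcode URL with a millisecond timestamp appended, mirroring the page's own JavaScript (new Date().getTime()). A small sketch of building that URL follows; captcha_url and its now parameter are names invented for the example.

```python
import time

RANDOM_CODE_URL = 'https://hwid1.vmall.com/casserver/randomcode'


def captcha_url(now=None):
    """Return the captcha image URL for the given time (seconds since epoch).

    Hypothetical helper: _t is the current time in milliseconds,
    which keeps the browser (or a proxy) from serving a cached image.
    """
    if now is None:
        now = time.time()
    millis = int(now * 1000)
    return '%s?randomCodeType=emui4_login&_t=%d' % (RANDOM_CODE_URL, millis)
```

Passing now explicitly is only for testing; in normal use captcha_url() takes the current time.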

4. hw.genLoginData(content)
Builds the POST string for the password-check submission from the results of steps 2 and 3.
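
The idea is to copy the hidden form fields from the login page and add the account, password and captcha. The program does this with lxml and XPath; the sketch below swaps in the standard library's HTMLParser so that it has no third-party dependency. HiddenInputCollector and build_login_body are names made up for illustration.

```python
try:
    from html.parser import HTMLParser   # Python 3
    from urllib.parse import urlencode
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7
    from urllib import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect (name, value) pairs of hidden inputs, in page order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Inputs without a name attribute are never submitted, so skip them
        if tag == 'input' and attrs.get('type') == 'hidden' and 'name' in attrs:
            self.fields.append((attrs['name'], attrs.get('value') or ''))


def build_login_body(page, account, password, authcode):
    """Return the urlencoded POST body for the password check."""
    collector = HiddenInputCollector()
    collector.feed(page)
    fields = collector.fields + [
        ('userAccount', account),
        ('password', password),
        ('authcode', authcode),
    ]
    return urlencode(fields)
```

A list of pairs is passed to urlencode so that the fields stay in the order the page lists them, which is the same reason the program uses an OrderedDict.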

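5. hw.checkUserPwd(postdata)
Calls the password-check URL to verify the credentials.
After a successful login, the CSRFToken is extracted from the returned page with a regular expression. This value is critical and is used in many later requests.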
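
As a small illustration of the regular-expression step, here is a sketch that pulls the token out of a page containing the script fragment the site returns (var CSRFToken = "..."). extract_csrf_token is a hypothetical name, and the pattern is slightly stricter than the one in the listing.

```python
import re

# Matches: var CSRFToken = "9b64dc...";
CSRF_PATTERN = re.compile(r'var\s+CSRFToken\s*=\s*"([^"]*)"')


def extract_csrf_token(page):
    """Return the CSRFToken found in the page text, or None if absent."""
    match = CSRF_PATTERN.search(page)
    return match.group(1) if match else None
```

Returning None when nothing matches lets the caller tell a missing token apart from an empty one.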

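6. hw.getAlbumPage()
Goes straight to the Huawei Cloud photo home page, https://www.hicloud.com:443/album
In normal use, after logging in the user has to click through several steps to reach the photo home page, which means several round trips behind the scenes. A crawler can simply skip those unimportant requests.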

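7. hw.getAlbumList()
The album home page has two views: one grouped by time and one grouped by album name. We use the latter.
So the album list is fetched first. Note that for this request the server replies with JSON.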
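
To show what that JSON looks like once parsed, here is a short sketch using a trimmed copy of the response quoted in the getAlbumList source. parse_album_list is a made-up helper name; the field names (ownerId, albumList, albumId, photoNum) come from that sample response.

```python
import json


def parse_album_list(payload):
    """Return (ownerId, [(albumId, photoNum), ...]) from the album-list JSON."""
    data = json.loads(payload)
    albums = [(a['albumId'], a['photoNum']) for a in data.get('albumList', [])]
    return data.get('ownerId'), albums


# Trimmed sample response, taken from the comment in getAlbumList()
sample = ('{"ownerId":"220012300029851117","code":0,'
          '"albumList":['
          '{"albumId":"default-album-1","albumName":"default-album-1","photoNum":2521},'
          '{"albumId":"default-album-2","albumName":"default-album-2","photoNum":101}],'
          '"recShareList":[]}')
owner_id, albums = parse_album_list(sample)
```

Each album's photoNum is what later tells the paging loop how many files to expect.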

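8. hw.getFileList(page,'albumList','albumId')
Using the JSON returned in step 7, loops over the albums and fetches the URLs of the files in each one.
This request also returns JSON, but the body is gzip-compressed, and it turns out that Fiddler4 decompresses it automatically.
(During testing, the responses received through the Fiddler proxy had already been decompressed, so the error only appeared once the program was run for real... While writing this post I also found that Fiddler has a switch controlling whether gzip responses are decompressed automatically. Fiddler is very powerful; I'll write up how to use it another time.)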
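
The fix for the gzip problem is to check the Content-Encoding response header and decompress before decoding. A self-contained sketch of that step, with decode_body as an invented name:

```python
import gzip
import io


def decode_body(raw, content_encoding):
    """Return the response body as text, decompressing gzip if needed.

    raw is the bytes read from the response; content_encoding is the
    value of the Content-Encoding header (may be None).
    """
    if (content_encoding or '').lower() == 'gzip':
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode('utf-8')
```

With urllib2 the header value would come from response.info().get('Content-Encoding'). A body that was already decompressed by a proxy simply passes through the second line unchanged.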

9. hw.getFileList(page,'ownShareList','shareId')
This does the same job as step 8. Huawei Cloud, somewhat oddly, keeps a separate album directory for WeChat; its JSON node is ownShareList, whereas step 8 uses albumList.
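
Steps 8 and 9 share one pattern: pick a list node and an id key, then page through each album with an increasing offset until everything has been seen or an empty page comes back. The sketch below shows only that control flow. fetch_page stands in for the real HTTP request and is an assumption of this example, as are the function names.

```python
def fetch_all_files(album_id, total, fetch_page, page_size=100):
    """Collect an album's files by paging with an increasing offset.

    fetch_page(album_id, offset, count) is a caller-supplied function
    returning a list of files. The loop stops after `total` files or
    at the first empty page, so an overstated total cannot loop forever.
    """
    files = []
    offset = 0
    while offset < total:
        batch = fetch_page(album_id, offset, page_size)
        if not batch:
            break
        files.extend(batch)
        offset += len(batch)
    return files


def walk_albums(listing, fetch_page):
    """Return {album id: files} for both ordinary and shared albums."""
    result = {}
    # (list node, id key) pairs, as used in steps 8 and 9
    for parent_key, child_key in (('albumList', 'albumId'),
                                  ('ownShareList', 'shareId')):
        for entry in listing.get(parent_key, []):
            album_id = entry[child_key]
            result[album_id] = fetch_all_files(
                album_id, entry['photoNum'], fetch_page)
    return result
```

Stopping on an empty page matters because photoNum in the album list and the number of files actually returned are not guaranteed to agree.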

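The function used in steps 8 and 9 offers three download options; uncomment the line for the option you want:
#Option 1: save the download URLs to a text file without downloading the files
#icurrentnum += self.saveFileList2Txt(each[childkey],page,icurrentnum)
#Option 2: download the files locally in a single thread
#icurrentnum += self.downFileList(each[childkey],page)
#Option 3: download the files locally with multiple threads
#each[childkey] is a unicode string
#print each[childkey].encode('gbk')
icurrentnum += self.downFileListMultiThread(each[childkey],page)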
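
Option 3 starts one thread per file in batches of 25 and waits for each batch to finish before starting the next. Stripped of the download details, the pattern is roughly the following; run_in_batches and worker are placeholder names, and the worker here can be any function.

```python
import threading


def run_in_batches(jobs, worker, batch_size=25):
    """Run worker(*job) for every job, batch_size threads at a time.

    jobs is a list of argument tuples. Each batch is started and then
    joined before the next begins, so at most batch_size threads are
    alive at once. Returns the number of jobs run.
    """
    done = 0
    for start in range(0, len(jobs), batch_size):
        threads = [threading.Thread(target=worker, args=job)
                   for job in jobs[start:start + batch_size]]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        done += len(threads)
    return done
```

For downloads the jobs would be (url, filename) pairs. Batching is a simple way to cap concurrency without a thread pool, at the cost of each batch waiting for its slowest file.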

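That is the end of the walkthrough; see the code for the details, none of it is complicated.
I should also say that I did not think through or polish the exception handling, although the code did work for me.
In my own case, after half a year with a Huawei phone I had 2536 files on the server, 9.24 GB in total. On the evening of 2016-05-14 I synced all of it to my local disk over my home 20M China Unicom broadband connection. I forget exactly how long it took, but the program never exited abnormally, which says a lot for Python's stability.
There is no guarantee that Huawei will not change its back-end logic after seeing this, but the overall approach should still hold.
As far as anti-crawler measures go, Taobao currently does relatively well: mainly because its logic changes quickly, and secondly because it is complex.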

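4. Summary

a. I have not been learning Python or crawlers for long, less than a month in total on and off. I drew on a lot of material from the web; it was hard work at times but rewarding.
b. Python really is powerful and easy to get started with, and there is a wealth of material online. Package management is done well, and installing packages with pip is simple and convenient.
c. Python's standard library ships with its source code (under c:\python27\lib, where c:\python27 is my Python install directory). When you are unsure about something, reading the source is the best approach, since material found online contains plenty of errors.
You don't need to understand all of it: learning takes time, and the source is long with many classes. Start from one point and widen out gradually, and you can also pick up coding idioms from the source along the way.
d. I also have to complain about Python's character-encoding handling: there are far too many pitfalls.
encode and decode kept me stuck for nearly a week; by now I more or less understand them and can use them.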
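
The rule that finally made it click for me: text (unicode) is encoded into bytes, and bytes are decoded back into text, always with an explicit encoding. A tiny sketch, using the album name from the sample response and two helper names invented for the example:

```python
# -*- coding: utf-8 -*-


def to_bytes(text, encoding):
    """Encode text to bytes; bytes are returned unchanged."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


def to_text(data, encoding):
    """Decode bytes to text; text is returned unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


name = u'\u5fae\u4fe1'                 # the album name "微信" as text
gbk_bytes = to_bytes(name, 'gbk')      # what a Chinese Windows console/file name expects
utf8_bytes = to_bytes(name, 'utf-8')   # what the server sends in JSON
```

The same two characters take 4 bytes in GBK and 6 in UTF-8, so bytes must always be decoded with the encoding they were produced in.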

5. Source code

synchuaweiphoto.py

# -*- coding: utf-8 -*-
__author__ = 'zhongtang'

import urllib
import urllib2
import cookielib
import time
from collections import OrderedDict  # in the standard library since Python 2.7
import re
import json
import os
import threading
import gzip
import StringIO

import requests
from PIL import Image
from lxml import etree

import htmltool

class HuaWei:
    # Huawei Cloud login
    '''
    Visiting http://www.hicloud.com triggers several redirects:
    1. The page first does a refresh redirect to http://www.hicloud.com/others/login.action
    2. Then a redirect to https://hwid1.vmall.com/casserver/logout?service=https://www.hicloud.com:443/logout
    3. Then a redirect to https://hwid1.vmall.com/casserver/remoteLogin?service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&loginUrl=https://hwid1.vmall.com/oauth2/account/login?reqClientType=1&lang=zh-cn&adUrl=https://www.hicloud.com:443/others/show_advert.action
    4. Then a redirect to https://hwid1.vmall.com/oauth2/account/login?reqClientType=1&validated=true&service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&adUrl=https://www.hicloud.com:443/others/show_advert.action&lang=zh-cn
    This link renders the login page; the program logs in through link 4 directly.
    5. Link 4 contains a request that refreshes the captcha: https://hwid1.vmall.com/casserver/randomcode?randomCodeType=emui4_login&_t=1462786575782
    6. Next, https://hwid1.vmall.com/casserver/remoteLogin is called with a POST
    7. After a successful login there are 3 more redirects:
    https://www.hicloud.com:443/others/login.action?lang=zh-cn&ticket=1ST-157502-OVRaMo6aV232229Sh2Dpe-cas
    https://www.hicloud.com:443/others/login.action?lang=zh-cn
    https://www.hicloud.com:443/home
    If login fails, the server redirects back to link 4, which is why link 4 is used for login.
    https://hwid1.vmall.com/oauth2/account/login?validated=true&errorMessage=random_code_error|user_pwd_continue_error&service=https%3A%2F%2Fwww.hicloud.com%3A443%2Fothers%2Flogin.action%3Flang%3Dzh-cn&loginChannel=1000002&reqClientType=1&adUrl=https%3A%2F%2Fwww.hicloud.com%3A443%2Fothers%2Fshow_advert.action%3Flang%3Dzh-cn&lang=zh-cn&viewT
    '''

    def __init__(self):
        self.username = 'username@yeah.net'  # account name
        self.passwd = 'userpassword'  # account password
        self.authcode = ''  # captcha text
        self.baseUrl = 'https://hwid1.vmall.com'
        self.loginUrl = self.baseUrl + '/oauth2/account/login?reqClientType=1&validated=true&service=https://www.hicloud.com:443/others/login.action&loginChannel=1000002&reqClientType=1&adUrl=https://www.hicloud.com:443/others/show_advert.action&lang=zh-cn'
        #self.loginUrl='https://www.hicloud.com'
        self.randomUrl = self.baseUrl + '/casserver/randomcode'
        self.checkpwdUrl = self.baseUrl + '/casserver/remoteLogin'
        self.successUrl = 'https://www.hicloud.com:443/album'
        self.getalbumsUrl = 'https://www.hicloud.com/album/getCloudAlbums.action'
        self.getalbumfileUrl = 'https://www.hicloud.com/album/getCloudFiles.action'
        self.loginHeaders = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0',
            'Connection': 'keep-alive'
        }
        self.CSRFToken = ''
        self.OnceMaxFile = 100  # maximum number of files fetched per request
        self.FileList = {}  # photo list
        self.ht = htmltool.htmltool()
        self.curPath = self.ht.getPyFileDir()
        self.FileNum = 0

    # Set up urllib2 cookie handling
    def enableCookies(self):
        # Create a cookie container
        self.cookies = cookielib.CookieJar()
        # Bind the container to an HTTP cookie processor
        cookieHandler = urllib2.HTTPCookieProcessor(self.cookies)
        # debuglevel=1 prints the HTTP and HTTPS traffic for debugging
        httpHandler = urllib2.HTTPHandler(debuglevel=1)
        httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
        self.opener = urllib2.build_opener(cookieHandler, httpHandler, httpsHandler)
        # Install the opener; every later urlopen() call will use it
        urllib2.install_opener(self.opener)

    # Current time in milliseconds as a string, like JavaScript's Date.getTime()
    def getJstime(self):
        itime = int(time.time() * 1000)
        return str(itime)

    # Fetch the captcha image, show it, and read the captcha from the user
    def getRadomCode(self, repeat=2):
        '''
        -- js
        function chgRandomCode(ImgObj, randomCodeImgSrc) {
        ImgObj.src = randomCodeImgSrc+"?randomCodeType=emui4_login&_t=" + new Date().getTime();
        };
        -- http
        GET /casserver/randomcode?randomCodeType=emui4_login&_t=1462786575782 HTTP/1.1
        '''
        data = ''
        ostime = self.getJstime()
        filename = os.path.join(self.curPath, ostime + '.png')
        url = self.randomUrl + "?randomCodeType=emui4_login&_t=" + ostime
        try:
            request = urllib2.Request(url, headers=self.loginHeaders)
            response = urllib2.urlopen(request)
            data = response.read()
        except Exception:
            time.sleep(5)
            print u'Failed to fetch the captcha image [%s], attempt:\n[%s]' % (url, 2 - repeat)
            if repeat > 0:
                return self.getRadomCode(repeat - 1)
        if len(data) <= 0:
            return
        with open(filename, 'wb') as f:
            f.write(data)
        im = Image.open(filename)
        im.show()
        self.authcode = raw_input('Enter the 4-character captcha: ')
        # Delete the captcha file
        os.remove(filename)
        return

    # Build the urlencoded POST body for the password check
    def genLoginData(self, content):
        '''
        1<input type="hidden" id="form_submit" name="submit" value="true">
        2<input type="hidden" id="form_loginUrl" name="loginUrl" value="https://hwid1.vmall.com/oauth2/account/login" />
        3<input type="hidden" id="form_service" name="service" value="https://www.hicloud.com:443/others/login.action?lang=zh-cn" />
        4<input type="hidden" id="form_loginChannel" name="loginChannel" value="1000002" />
        5<input type="hidden" id="form_reqClientType" name="reqClientType" value="1" />
        6<input type="hidden" id="form_deviceID" name="deviceID" value="" />
        7<input type="hidden" id="form_adUrl" name="adUrl" value="https://www.hicloud.com:443/others/show_advert.action?lang=zh-cn" />
        8<input type="hidden" id="form_lang" name="lang" value="zh-cn" />
        9<input type="hidden" id="form_inviterUserID" name="inviterUserID" value="" />
        10<input type="hidden" id="form_inviter" name="inviter" value="" />
        11<input type="hidden" id="form_viewType" name="viewType" value="0" />
        12<input type="hidden" id="form_quickAuth" name="quickAuth" value="" />
        <input type="hidden" id="form_loginUrlForBind"  value="https://hwid1.vmall.com/oauth2/portal/thirdAccountBindByPhoneForPCWeb.jsp?themeName=cloudTheme" />
        '''
        tree = etree.HTML(content)
        form = tree.xpath('//div[@class="login-box"]')[0]
        params = OrderedDict()
        # Copy hidden fields 1-12 listed above, keeping their order
        for name in ('submit', 'loginUrl', 'service', 'loginChannel',
                     'reqClientType', 'deviceID', 'adUrl', 'lang',
                     'inviterUserID', 'inviter', 'viewType', 'quickAuth'):
            params[name] = form.xpath('//*[@name="%s"]/@value' % name)[0]
        params['userAccount'] = self.username
        params['password'] = self.passwd
        params['authcode'] = self.authcode
        return urllib.urlencode(params)

    # Fetch the login page (link 4) and return its HTML as unicode
    def getLoginPage(self):
        request = urllib2.Request(self.loginUrl, headers=self.loginHeaders)
        response = urllib2.urlopen(request)
        page = response.read()
        return page.decode('utf-8')

    # POST the login form (step 6 of the login flow) and return the final URL
    def checkUserPwd(self, postdata):
        request = urllib2.Request(self.checkpwdUrl, data=postdata, headers=self.loginHeaders)
        response = urllib2.urlopen(request)
        response.read()
        # urllib2 follows the redirects; after a failed login the final URL
        # carries an errorMessage parameter that the main program checks
        return response.geturl()

    # Extract the CSRFToken from a page; returns '1' if found, '0' otherwise
    def getCSRFToken(self, page):
        '''
        <input type="hidden" value="" id="userHeadPic">
        <input type="hidden" value="1" id="activeUserState"/>
        <input type="hidden" value='[{"deviceType":0,"deviceID":"1231231231212312312312","terminalType":"huawei mt7-tl00","deviceAliasName":"HUAWEI MT7-TL00"}]' id="deviceList" />
        <input type="hidden" value='www.hicloud.com' id="server" />
        <input type="hidden" value='1' id="biFlag" />
        <input type="hidden" value='https://dc.hicloud.com' id="biUrl" />
        <script>
                var CSRFToken = "9b64dcad38d269147f2c27dc12171e60aade2a22316de213";
                var accountType = "1";
                var accountTypeLh = "4";
        </script>
        '''
        self.CSRFToken = ''
        pattern = re.compile('CSRFToken = "(.*?)"', re.S)
        # Save the CSRFToken
        content = re.search(pattern, page)
        if content:
            self.CSRFToken = content.group(1)
            return '1'
        else:
            return '0'

    # Open the album page and pick up the CSRFToken, which all later requests need
    def getAlbumPage(self):
        request = urllib2.Request(self.successUrl, headers=self.loginHeaders)
        response = urllib2.urlopen(request)
        page = response.read()
        return self.getCSRFToken(page.decode('utf-8'))

    """
    Description    : save a remote image to a local file
    @param imgUrl  : URL of the image to save
    @param imgName : local file name for the image
    @return none
    """
    def saveImage(self, imgUrl, imgName="default.jpg"):
        # Download the file with requests.get; the URL is https, and
        # verify=False turns off certificate verification
        response = requests.get(imgUrl, stream=True, verify=False)
        image = response.content
        filename = imgName
        print("Saving file " + filename + "\n")
        try:
            # The with block closes the file, so no finally clause is needed
            with open(filename, "wb") as jpg:
                jpg.write(image)
        except IOError:
            print("IO Error\n")

    """
    Description     : download with one thread per file; note that the number of threads is not limited here
    @param urllist  : URLs of the files to download
    @param namelist : local file names, in the same order as urllist
    @return none
    """
    def downFileMultiThread(self, urllist, namelist):
        task_threads = []  # holds the threads
        for fileurl, filename in zip(urllist, namelist):
            t = threading.Thread(target=self.saveImage, args=(fileurl, filename))
            task_threads.append(t)
        for task in task_threads:
            task.start()
        for task in task_threads:
            task.join()

    # Download an album's files with multiple threads; each album goes to its own directory
    def downFileListMultiThread(self, dirname, hjsondata):
        if len(hjsondata) <= 0:
            return 0
        hjson2 = json.loads(hjsondata)
        # Create the directory and switch into it
        self.ht.mkdir(dirname)
        i = 0
        urllist = []
        namelist = []
        if "fileList" in hjson2:
            for each in hjson2["fileList"]:
                urllist.append(each["fileUrl"].encode('gbk'))
                namelist.append(each["fileName"].encode('gbk'))
                self.FileNum += 1
                i += 1
                # Download concurrently every 25 files (or at the last group), then clear the lists
                if i % 25 == 0 or i == len(hjson2["fileList"]):
                    self.downFileMultiThread(urllist, namelist)
                    urllist = []
                    namelist = []
        return i

    # Download an album's files one at a time; each album goes to its own directory
    def downFileList(self, dirname, hjsondata):
        if len(hjsondata) <= 0:
            return 0
        hjson2 = json.loads(hjsondata)
        # Create the directory and switch into it
        self.ht.mkdir(dirname)
        i = 0
        if "fileList" in hjson2:
            for each in hjson2["fileList"]:
                self.saveImage(each["fileUrl"].encode('gbk'), each["fileName"].encode('gbk'))
                self.FileNum += 1
                i += 1
                # Pause for 2 seconds after every 5 files
                if i % 5 == 0:
                    time.sleep(2)
        return i

    # Save an album's file URLs to a text file; each album gets its own file
    def saveFileList2Txt(self, filename, hjsondata, flag):
        if len(hjsondata) <= 0:
            return 0
        hjson2 = json.loads(hjsondata)
        lfilename = filename + u".txt"
        if flag == 0:  # create a new file
            print u'Creating album file ' + lfilename + "\n"
            # A new file means a new album, so restart the count
            self.FileNum = 0
            f = open(lfilename, 'wb')
        else:  # append to the existing file
            f = open(lfilename, 'a')
        i = 0
        if "fileList" in hjson2:
            for each in hjson2["fileList"]:
                f.write(each["fileUrl"].encode('gbk') + "\n")
                self.FileNum += 1
                # Insert a page separator every 1000 lines
                if self.FileNum % 1000 == 0:
                    f.write('\n\n\n\n\n\n--------------------page %s ------------------\n\n\n\n\n\n' % (self.FileNum // 1000))
                i += 1
        f.close()
        return i

    # Loop over the albums and fetch each album's files
    def getFileList(self, hjsondata, parentkey, childkey):
        # step 3 getCoverFiles.action: fetch the album file list in a loop, at most 100 records per request.
        # count is always sent as the maximum (49 in the captured request below), whether or not that many files remain; currentNum advances each time until an empty list comes back.
        # The last response is an empty list:
        #{"albumSortFlag":true,"code":0,"info":"success!","fileList":[]}
        # On the first request, even if the album holds only 2 files, count is still the maximum 49.
        #albumIds[]=default-album-102-221216000029851117&ownerId=220012300029851117&height=300&width=300&count=49&currentNum=0&thumbType=imgcropa&fileType=0
        #[{u'photoNum': 2518, u'albumName': u'default-album-1', u'iversion': -1, u'albumId': u'default-album-1', u'flversion': -1, u'createTime': 1448065264550L, u'size': 0},
        #{u'photoNum': 100, u'albumName': u'default-album-2', u'iversion': -1, u'albumId': u'default-album-2', u'flversion': -1, u'createTime': 1453090781646L, u'size': 0}]
        hjson = json.loads(hjsondata.decode('utf-8'))
        if parentkey in hjson:
            for each in hjson[parentkey]:
                paraAlbum = OrderedDict()
                paraAlbum['albumIds[]'] = each[childkey]
                paraAlbum['ownerId'] = hjson['ownerId']
                paraAlbum['height'] = '300'
                paraAlbum['width'] = '300'
                paraAlbum['count'] = self.OnceMaxFile
                paraAlbum['thumbType'] = 'imgcropa'
                paraAlbum['fileType'] = '0'
                itotal = each['photoNum']
                icurrentnum = 0
                while icurrentnum < itotal:
                    paraAlbum['currentNum'] = icurrentnum
                    paraAlbumstr = urllib.urlencode(paraAlbum)
                    request = urllib2.Request(self.getalbumfileUrl, headers=self.loginHeaders, data=paraAlbumstr)
                    response = urllib2.urlopen(request)
                    rheader = response.info()
                    page = response.read()
                    # Decompress the body if the server sent it gzip-encoded
                    if rheader.get('Content-Encoding') == 'gzip':
                        data = StringIO.StringIO(page)
                        gz = gzip.GzipFile(fileobj=data)
                        page = gz.read()
                        gz.close()
                    page = page.decode('utf-8')
                    # Option 1: save the download URLs to a text file without downloading the files
                    #fetched = self.saveFileList2Txt(each[childkey],page,icurrentnum)
                    # Option 2: download the files locally in a single thread
                    #fetched = self.downFileList(each[childkey],page)
                    # Option 3: download the files locally with multiple threads
                    # each[childkey] is a unicode string
                    #print each[childkey].encode('gbk')
                    fetched = self.downFileListMultiThread(each[childkey], page)
                    # An empty page means the album is exhausted; stop instead of looping forever
                    if not fetched:
                        break
                    icurrentnum += fetched
        return

    # step 1 getCloudAlbums: fetch the album list
    def getAlbumList(self):
        self.loginHeaders = {
            'Host': 'www.hicloud.com',
            'Connection': 'keep-alive',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Origin': 'https://www.hicloud.com',
            'X-Requested-With': 'XMLHttpRequest',
            'CSRFToken': self.CSRFToken,
            'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
            'DNT': '1',
            'Referer': 'https://www.hicloud.com/album',
            'Accept-Encoding': 'gzip,deflate',
            'Accept-Language': 'zh-CN',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
        }
        request = urllib2.Request(self.getalbumsUrl, headers=self.loginHeaders)
        response = urllib2.urlopen(request)
        page = response.read()
        '''# Sample response
        {"ownerId":"220012300029851117","code":0,
        "albumList":[{"albumId":"default-album-1","albumName":"default-album-1","createTime":1448065264550,"photoNum":2521,"flversion":-1,"iversion":-1,"size":0},
                     {"albumId":"default-album-2","albumName":"default-album-2","createTime":1453090781646,"photoNum":101,"flversion":-1,"iversion":-1,"size":0}],
        "ownShareList":[{"ownerId":"220012300029851117","resource":"album","shareId":"default-album-102-220123000029851117","shareName":"微信","photoNum":2,"flversion":-1,"iversion":-1,"createTime":1448070407055,"source":"HUAWEI MT7-TL00","size":0,"ownerAcc":"jdstkxx@yeah.net","receiverList":[]}],
        "recShareList":[]}
        '''
        if len(page) <= 0:
            print u'Failed to fetch the album list: empty response!\n\n%s\n\n' % page.decode('utf-8')
        return page

# Main program
hw = HuaWei()
hw.enableCookies()
count = 0
while count < 3:
    count += 1
    content = hw.getLoginPage()
    if content == '':
        print 'Failed to fetch the login page, exiting!\n\n[%s]\n\n' % (content)
        break
    # Fetch the captcha
    hw.getRadomCode()
    # Build the POST data needed for the checkUserPwd submission
    postdata = hw.genLoginData(content)
    #print postdata
    reUrl = hw.checkUserPwd(postdata)
    if reUrl.find("user_pwd_error") != -1:
        print u'Wrong username or password, exiting!\n\n[%s]\n\n' % (reUrl)
        break
    elif reUrl.find("random_code_error") != -1:
        print u'Wrong captcha, retrying!\n\n[%s]\n\n' % (reUrl)
        continue
    else:
        print 'Logged in to Huawei Cloud successfully!\n\n'
        iRet = hw.getAlbumPage()
        # getAlbumPage() returns the string '0' when no CSRFToken was found
        if iRet == '0':
            print 'Failed to open the album page: no CSRFToken found!\n\n'
            break
        print 'Opened the album home page and got the CSRFToken!\n\n'
        page = hw.getAlbumList()
        if page == '':
            print 'Failed to fetch the album list!\n\n'
            break
        # Process the album list
        hw.getFileList(page, 'albumList', 'albumId')
        # Process the shared album list
        hw.getFileList(page, 'ownShareList', 'shareId')
        print 'Finished. With option 1, open the album files in Xunlei to batch-download them!\n\n'
        break

htmltool.py

# -*- coding:utf-8 -*-
__author__ = 'zhongtang'

import re
import HTMLParser
import cgi
import sys
import os

# Helper class for handling page markup
class htmltool:
    # Remove img tags, runs of 1-7 spaces, and &nbsp;
    removeImg = re.compile('<img.*?>| {1,7}|&nbsp;')
    # Remove hyperlink tags
    removeAddr = re.compile('<a.*?>|</a>')
    # Tags that are replaced with \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Replace the table cell tag <td> with \t
    replaceTD = re.compile('<td>')
    # Replace single or double <br> with \n
    replaceBR = re.compile('<br><br>|<br>')
    # Strip all remaining tags
    removeExtraTag = re.compile('<.*?>')
    # Collapse runs of blank lines
    removeNoneLine = re.compile('\n+')

    # Convert HTML to text
    # e.g. '&lt;abc&gt;' --> '<abc>'
    def html2txt(self, html):
        html_parser = HTMLParser.HTMLParser()
        txt = html_parser.unescape(html)
        return txt.strip()

    # Convert text to HTML
    # e.g. '<abc>' --> '&lt;abc&gt;'
    def txt2html(self, txt):
        html = cgi.escape(txt)
        return html.strip()

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        x = re.sub(self.removeNoneLine, "\n", x)
        # strip() removes leading and trailing whitespace
        return x.strip()

    # Return the directory of the running script, decoded as utf-8
    def getPyFileDir(self):
        # Get the script path
        path = sys.path[0]
        # For a plain script this is the script's directory; for a py2exe
        # build it is the path of the compiled file, so take its directory
        if os.path.isdir(path):
            return path.decode('utf-8')
        elif os.path.isfile(path):
            return os.path.dirname(path).decode('utf-8')

    # Create a directory under the script directory and switch into it
    def mkdir(self, path):
        path = path.strip()
        pathDir = self.getPyFileDir()
        # Build the full path as a unicode string
        path = u'%s\\%s' % (pathDir, path)
        # Create the directory only if it does not exist yet
        if not os.path.exists(path):
            #print u'Creating folder [%s]\n' %(path)
            os.makedirs(path)
        #else:
            #print u'Folder [%s] already exists\n'  %(path)
        os.chdir(path)
        return path

Posted 2016-05-19 10:28 by 黯然销魂掌2015 on the blog 朝花夕拾 (http://www.cnblogs.com/zhongtang)
