Scraping High-Resolution Images from Taobao

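My wife frets every day over what to wear, and every morning she agonizes over her outfit. It occurred to me: why not look at how the models on Taobao dress? I happen to be learning Python Scrapy crawlers, so why not scrape the high-resolution images from Taobao?
Environment: python3 + scrapy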

I. Writing tb.py under spiders

1. Write the start_requests function

    def start_requests(self):
        return [scrapy.Request(url="https://www.taobao.com/",  callback=self.start_search)]

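We start from the Taobao home page. I don't set headers here because I randomly rotate the UA and IP in middlewares.py. The callback is start_search. This step is fairly simple.

2. Next comes the start_search function. This step asks me for the keyword to crawl and then goes to Taobao's search results page.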

    def start_search(self,response):
        keyword = input("Please input what you want to search: ").strip()
        keyword = urllib.request.quote(keyword)
        for i in range(1,2): # could be improved with a function that detects automatically whether there is a next page
            url = "https://s.taobao.com/search?q=" + keyword + "&s=" + str(i * 44)
            yield scrapy.Request(url=url,callback=self.parse_search_page)
            time.sleep(20)
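
To make the pagination idea concrete, here is a minimal sketch of a helper that builds one search URL per result page. The helper name build_search_urls is hypothetical, and it assumes that the s parameter is a zero-based item offset with 44 items per page, which the original code implies through i * 44 but does not confirm.

```python
from urllib.parse import urlencode

# Page size implied by the `i * 44` offset in the original code (an assumption).
ITEMS_PER_PAGE = 44


def build_search_urls(keyword, pages):
    """Return one search URL per result page, starting from offset 0."""
    urls = []
    for page in range(pages):
        # urlencode percent-encodes the keyword, so non-ASCII input is safe.
        query = urlencode({"q": keyword, "s": page * ITEMS_PER_PAGE})
        urls.append("https://s.taobao.com/search?" + query)
    return urls


search_urls = build_search_urls("red dress", 2)
```

In the spider, each of these URLs would be passed to scrapy.Request with callback=self.parse_search_page, just as in start_search.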

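This is where the difficulties started. Difficulty one: Taobao redirects to the login page at unpredictable times. I tried many approaches and never managed to log in to Taobao, so that is something I need to keep studying. Difficulty two: most of Taobao's pages are loaded dynamically, so XPath and CSS selectors are useless on the response. Regular expressions still work, though.

Below is part of a Taobao page: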

<link rel="dns-prefetch" href="//res.mmstat.com" />
<link href="//img.alicdn.com/tps/i3/T1OjaVFl4dXXa.JOZB-114-114.png" rel="apple-touch-icon-precomposed" />
<style>
  blockquote,body,button,dd,dl,dt,fieldset,form,h1,h2,h3,h4,h5,h6,hr,input,legend,li,ol,p,pre,td,textarea,th,ul{margin:0;padding:0}body,button,

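It is all dynamically loaded. That is not so bad, though, because I can extract what I need directly with regular expressions. I found that detail pages are identified by a uid (called Nid in the page source), so I extract the uid with a regular expression.

In the original page source, Taobao puts the Nids of all items on the page into one list. That makes things easy: the regex uids = re.compile('auctionNids\"\:\[\"(.*?)\"\]').findall(html)[0].split(",") pulls out the whole list, and then the item detail page URLs can be assembled: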

for uid in uids:
    detailUrl = "https://detail.tmall.com/item.htm?id=" + uid
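
As a self-contained illustration of this first attempt, the sketch below runs a similar extraction on a made-up fragment. The sample string and the helper name extract_nids are hypothetical; the real page layout may differ. It pulls the digits out of each quoted entry rather than splitting on commas, which would leave stray quote characters around the IDs in a list of this shape, and it returns an empty list instead of raising IndexError when the list is missing.

```python
import re

# Hypothetical fragment shaped like the list described above.
sample_html = '{"auctionNids":["111","222","333"],"other":1}'

NID_LIST = re.compile(r'"auctionNids":\[(.*?)\]')


def extract_nids(html):
    """Return the item IDs found in the auctionNids list, or [] if absent."""
    match = NID_LIST.search(html)
    if match is None:
        return []
    # Each entry is a quoted run of digits; capture only the digits.
    return re.findall(r'"(\d+)"', match.group(1))


nids = extract_nids(sample_html)
detail_urls = ["https://detail.tmall.com/item.htm?id=" + nid for nid in nids]
```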

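This is where it went wrong: requests came back with 500 errors. After a long investigation I finally found that there are two kinds of detail pages, Taobao and Tmall, and the two are different. So I had to go back to the page source to tell them apart.

In the source, the isTmall field indicates whether an item belongs to Taobao or Tmall. It could of course be extracted with a regex, but how would the extracted values be matched one-to-one with the Nids extracted earlier? A mismatch would cause errors too, and the source has no list of isTmall values. So a new approach was needed, one that does not take the Nids from the list above. After several experiments, the following regex does the job: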

uids_and_isTmail = re.compile(r'"nid":"(.*?)".*?"isTmall":(.*?),').findall(html)

This regex captures the nid together with the isTmall value, so the detail page URLs can be assembled:

uids_and_isTmail = re.compile(r'"nid":"(.*?)".*?"isTmall":(.*?),').findall(html)
for uid_and_isTmail in uids_and_isTmail:
    if uid_and_isTmail[1] == "true":
        detailUrl = "https://detail.tmall.com/item.htm?id=" + str(uid_and_isTmail[0])
    else:
        detailUrl = "https://item.taobao.com/item.htm?id=" + str(uid_and_isTmail[0])

With the detail page URLs assembled and exception handling added, the next function can process these detail pages.
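
The pairing can be checked outside Scrapy with a small runnable sketch. The sample string below is invented for illustration and only mimics the field order the regex relies on (nid before isTmall within each item); the helper name build_detail_urls is hypothetical.

```python
import re

# Same pattern as in the post: one (nid, isTmall) pair per item.
ITEM = re.compile(r'"nid":"(.*?)".*?"isTmall":(.*?),')

# Hypothetical search-page fragment with one Tmall item and one Taobao item.
sample_html = (
    '{"nid":"111","title":"a","isTmall":true,"x":1},'
    '{"nid":"222","title":"b","isTmall":false,"x":2}'
)


def build_detail_urls(html):
    """Map each item to the Tmall or Taobao detail URL according to isTmall."""
    urls = []
    for nid, is_tmall in ITEM.findall(html):
        host = "detail.tmall.com" if is_tmall == "true" else "item.taobao.com"
        urls.append("https://" + host + "/item.htm?id=" + nid)
    return urls


detail_urls = build_detail_urls(sample_html)
```

Because both values come from the same match, they cannot drift out of step the way two separately extracted lists could.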

The parse_search_page function is as follows:

    def parse_search_page(self, response):
        """Process the search results page."""
        html = response.body.decode("utf8", "ignore")
        try:
            # Find each uid and whether it belongs to Tmall, because Taobao and Tmall
            # detail pages differ; each result is a tuple
            uids_and_isTmail = re.compile(r'"nid":"(.*?)".*?"isTmall":(.*?),').findall(html)
            for uid_and_isTmail in uids_and_isTmail:
                if uid_and_isTmail[1] == "true":
                    detailUrl = "https://detail.tmall.com/item.htm?id=" + str(uid_and_isTmail[0])
                else:
                    detailUrl = "https://item.taobao.com/item.htm?id=" + str(uid_and_isTmail[0])
                yield scrapy.Request(detailUrl, callback=self.parsePictureUrl)
                time.sleep(10)  # be a polite crawler
        except Exception as e:
            print(e)
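
A note on the time.sleep calls: Scrapy runs callbacks in a single event loop, so sleeping inside a callback pauses the whole crawler rather than just spacing out requests. A possible alternative, shown here only as a sketch and not as the configuration used in this project, is to let Scrapy handle the delay through its built-in settings. The value 10 simply mirrors the sleep above.

```python
# settings.py (sketch): let Scrapy space out requests instead of time.sleep
DOWNLOAD_DELAY = 10              # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the wait between 0.5x and 1.5x of the delay
AUTOTHROTTLE_ENABLED = True      # adjust the delay according to server latency
```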

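Next comes the function that handles the data returned by the detail page:

What I want are the large high-resolution images, that is, the images in the item description on Taobao. The source shows that these images are also loaded dynamically, and their URLs are nowhere to be found in it. After analyzing the captured traffic, I found that the URL of the page that loads them does appear in the source.

The descUrl there is the URL for the large high-resolution images. From here it is easy: extract it with re, pictureUrl = re.compile('descUrl.*?:.*?//(.*?)\'').findall(html)[0]

Then requesting this URL failed again. The page opens fine in a browser, but Scrapy kept reporting a 500 error. After a long investigation I found that the browser automatically prepends http://. The prefix is not shown in the address bar, but it is part of the actual request. So the URL has to be assembled once more: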

    def parsePictureUrl(self, response):
        """Get the URL of the page that holds the high-resolution images from the detail page."""
        html = response.body.decode("utf8", "ignore")
        try:
            pictureUrl = re.compile('descUrl.*?:.*?//(.*?)\'').findall(html)[0]
            # the URL must start with http to be reachable
            pictureUrl = "http://" + pictureUrl
            yield scrapy.Request(pictureUrl, callback=self.parsePicture)
        except Exception as e:
            print(e)
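
An alternative to prepending the scheme by hand is to keep the leading // in the captured value and resolve it against the page URL. The sketch below uses urljoin from the standard library; the sample string and the helper name extract_desc_url are hypothetical, and the regex is a variant of the one above that assumes the value is wrapped in single quotes. Note that the result inherits the scheme of the detail page (https here) rather than always using http.

```python
import re
from urllib.parse import urljoin

# Variant of the post's pattern that keeps the leading "//" in the capture.
DESC_URL = re.compile(r"descUrl.*?:.*?'(//.*?)'")

# Hypothetical detail-page fragment.
sample_html = "descUrl : '//dsc.taobaocdn.com/i1/example/desc',"


def extract_desc_url(html, page_url):
    """Return the absolute description URL, or None when it is not found."""
    match = DESC_URL.search(html)
    if match is None:
        return None
    # urljoin fills in the missing scheme from the page the value came from.
    return urljoin(page_url, match.group(1))


desc_url = extract_desc_url(sample_html, "https://item.taobao.com/item.htm?id=111")
```

Inside a Scrapy callback, response.urljoin(url) performs the same resolution against the response's own URL.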

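The response to this request contains the download URL of every high-resolution image.

The img src attributes hold the download URLs of the individual images. XPath or CSS could be used to extract them here, but I again used a regex: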

    def parsePicture(self, response):
        """The page holding the high-resolution images returns a JSON file with the URL of each image; collect these URLs and hand them to Scrapy for download."""
        html = response.body.decode("utf8","ignore")
        try:
            downPictureUrlList = re.compile('src=.*?\"(.*?)\"').findall(html)
            for downPictureUrl in downPictureUrlList:
                item = TbItem()  # a fresh item per URL, so items already yielded are not overwritten
                item["img"] = [downPictureUrl]
                yield item
        except Exception:
            print("can not find down page")
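
For comparison, here is a sketch of the non-regex route using only the standard library's HTMLParser, which avoids depending on how the attributes are ordered or spaced. The sample markup is invented, and it assumes the response embeds ordinary img tags; the class name ImgSrcCollector and the helper extract_image_urls are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags land here too.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_image_urls(markup):
    collector = ImgSrcCollector()
    collector.feed(markup)
    return collector.sources


# Hypothetical response body: a script variable wrapping HTML markup.
sample = (
    "var desc='<p><img src=\"https://img.alicdn.com/a.jpg\">"
    "<img align=\"middle\" src=\"https://img.alicdn.com/b.jpg\" /></p>';"
)
image_urls = extract_image_urls(sample)
```

Since the image field holds a list, all URLs from one page could also be placed in a single item instead of one item per URL.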

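That completes the spider code. Taobao's high-resolution images are buried deep: it takes three layers of requests to reach the real download URLs, and there are plenty of pitfalls along the way. Next is the code for items.py.

items.py is simple. For now I only save images, so there is a single field; more can be added later.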

class TbItem(scrapy.Item):
    img=scrapy.Field()

II. Next, the code in settings

1. Set the field used for automatic download and the storage location:

import os
img_dir=os.path.join(os.path.abspath(os.path.dirname(__file__)),"images")
print(img_dir)
IMAGES_URLS_FIELD='img'
IMAGES_STORE=img_dir

2. Add the class that downloads images automatically:

ITEM_PIPELINES = {
   # 'taobao.pipelines.TaobaoPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline':1
}

3. Set ROBOTSTXT_OBEY = False:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

4. Set the class that randomly rotates the UA and IP:

DOWNLOADER_MIDDLEWARES = {
   'taobao.middlewares.RandomIpAndUserAgentMiddleware': 543,
   'taobao.middlewares.TaobaoSpiderMiddleware': None,
}

5. Because Scrapy downloads the images automatically, there is no need to write custom pipelines, but Scrapy's own download class must be enabled:

ITEM_PIPELINES = {
   # 'taobao.pipelines.TaobaoPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline':1
}

That completes the settings.

The middleware class that randomly rotates the UA and IP will be covered in a separate blog post, so I won't go into it here.
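
Until then, the general shape of such a downloader middleware can be sketched as follows. This is not the implementation from this project, only an illustration under stated assumptions: the user agent strings and the proxy address are placeholders, and the class reuses the name registered in DOWNLOADER_MIDDLEWARES purely for readability. It relies on two Scrapy conventions: a downloader middleware may define process_request(request, spider), and the built-in proxy handling reads request.meta["proxy"].

```python
import random

# Placeholder pools (assumptions); a real project would load its own values.
USER_AGENTS = ["ExampleBrowser/1.0", "ExampleBrowser/2.0"]
PROXIES = ["http://127.0.0.1:8888"]


class RandomIpAndUserAgentMiddleware:
    """Pick a random User-Agent and proxy for every outgoing request."""

    def __init__(self, user_agents, proxies):
        self.user_agents = user_agents
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware instance.
        return cls(USER_AGENTS, PROXIES)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # None tells Scrapy to continue processing the request
```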

III. That completes the whole crawler. The full spider.py is attached below for reference:

# -*- coding: utf-8 -*-
import scrapy
import re
import time
import urllib.request
from taobao.items import TbItem


class TbSpider(scrapy.Spider):
    name = 'tb'
    allowed_domains = ['taobao.com', "detail.tmall.com", "s.taobao.com", "item.taobao.com", "dsc.taobaocdn.com",
                       "img.alicdn.com"]
    start_urls = ['https://www.taobao.com/']

    def start_requests(self):
        return [scrapy.Request(url="https://www.taobao.com/",  callback=self.start_search)]

    def start_search(self,response):
        keyword = input("Please input what you want to search: ").strip()
        keyword = urllib.request.quote(keyword)
        for i in range(1,2): # could be improved with a function that detects automatically whether there is a next page
            url = "https://s.taobao.com/search?q=" + keyword + "&s=" + str(i * 44)
            yield scrapy.Request(url=url,callback=self.parse_search_page)
            time.sleep(20)

    def parse_search_page(self, response):
        """Process the search results page."""
        html = response.body.decode("utf8", "ignore")
        try:
            # Find each uid and whether it belongs to Tmall, because Taobao and Tmall
            # detail pages differ; each result is a tuple
            uids_and_isTmail = re.compile(r'"nid":"(.*?)".*?"isTmall":(.*?),').findall(html)
            for uid_and_isTmail in uids_and_isTmail:
                if uid_and_isTmail[1] == "true":
                    detailUrl = "https://detail.tmall.com/item.htm?id=" + str(uid_and_isTmail[0])
                else:
                    detailUrl = "https://item.taobao.com/item.htm?id=" + str(uid_and_isTmail[0])
                yield scrapy.Request(detailUrl, callback=self.parsePictureUrl)
                time.sleep(10)
        except Exception as e:
            print(e)

    def parsePictureUrl(self, response):
        """Get the URL of the page that holds the high-resolution images from the detail page."""
        html = response.body.decode("utf8", "ignore")
        try:
            pictureUrl = re.compile('descUrl.*?:.*?//(.*?)\'').findall(html)[0]
            # the URL must start with http to be reachable
            pictureUrl = "http://" + pictureUrl
            yield scrapy.Request(pictureUrl, callback=self.parsePicture)
        except Exception as e:
            print(e)

    def parsePicture(self, response):
        """The page holding the high-resolution images returns a JSON file with the URL of each image; collect these URLs and hand them to Scrapy for download."""
        html = response.body.decode("utf8","ignore")
        try:
            downPictureUrlList = re.compile('src=.*?\"(.*?)\"').findall(html)
            for downPictureUrl in downPictureUrlList:
                item = TbItem()  # a fresh item per URL, so items already yielded are not overwritten
                item["img"] = [downPictureUrl]
                yield item
        except Exception:
            print("can not find down page")

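IV. Summary of this project:

What I learned:

1. I now have a clear picture of the whole processing flow of a basic spider and understand how data flows between Scrapy's functions.

2. I learned to read page source. Taobao's pages are all dynamically loaded, and getting what you want takes several layers of digging, but there is always a pattern.

3. I learned to use exception handling. It matters a great deal, because it keeps the program from stopping just because one URL fails.

4. Debugging Scrapy. Knowing how to debug really matters when writing crawlers.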

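What I still need to learn:

1. Scrapy's logging system and how to record Scrapy logs.

2. Simulated login. My earlier simulated login to Zhihu worked without errors, but this one failed, and I don't know why.

3. Database handling, learning MySQL first and then MongoDB.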
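
On the logging point, one possible direction is to replace the print calls with the standard logging module, which Scrapy builds on: every spider exposes self.logger, and the LOG_FILE and LOG_LEVEL settings control where messages go. The sketch below is a standalone illustration rather than code from this project; the helper name first_match and the logger name are hypothetical.

```python
import logging
import re

logger = logging.getLogger("tb")  # inside a spider, self.logger plays this role


def first_match(pattern, html, url):
    """Return the first captured group, logging a warning when nothing matches."""
    match = re.search(pattern, html)
    if match is None:
        # A warning with the URL makes failed pages easy to find in the log later.
        logger.warning("no match for %r on %s", pattern, url)
        return None
    return match.group(1)


nid = first_match(r'"nid":"(\d+)"', '{"nid":"42"}', "https://s.taobao.com/search")
```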

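Shortcomings of the program:

1. It only saves the images locally; code to save them to a database will be added later.
2. It has only been tested on clothing; other categories are untested.
3. It can only save images. If the clothes in an image look good, there is no way to trace the image back to the Taobao shop, and no way to filter for images of the same type.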


Finally, the GitHub repository: https://github.com/573320328/taobao

posted @ 2018-01-21 20:01 outback123