Using re with XPath in Scrapy

When using XPath in Scrapy, XPath syntax alone does not always get you the data you want.

Take the following HTML source as an example:

<div class="db_contout">
    <div class="db_cont">
        <div class="details_nav">
            <a href="http://movie.mtime.com/79055/addimage.html" class="db_addpic" target="_blank">
                <strong class="px16">+</strong> 添加图片</a>
            <ul id="imageNavUl">
                <li><i>&nbsp;</i><a href="http://movie.mtime.com/79055/posters_and_images/">全部图片</a></li>
                <li><i>&nbsp;</i><a href="#">剧照</a></li>
                <li><i>&nbsp;</i><a href="#">海报</a></li>
                <li><i>&nbsp;</i><a href="#">工作照</a></li>
                <li><i>&nbsp;</i><a href="#">新闻图片</a></li>
                <li><i>&nbsp;</i><a href="#">桌面</a></li>
                <li><i>&nbsp;</i><a href="#">封套</a></li>
            </ul>
        </div>
        <div class="db_pictypeout">
            <div class="pictypenav clearfix">
                <ul id="imageSubNavUl" class="fl mt3">
                </ul>
                <div id="filters" class="db_selbox fr">
                </div>
            </div>
            <dl id="imagesDiv" class="db_pictypelist clearfix">
            </dl>
            <div id="pageDiv">
            </div>
        </div>
    </div>
</div>
<div id="M13_B_DB_Movie_FooterTopTG"></div>
<script type="text/javascript">
    var imageList = [{"stagepicture":[{"officialstageimage":[{"id":1059362,"title":"官方剧照 #16","type":6,"subType":6001,"status":1,"img_220":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_220X220.jpg","img_1000":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_1000X1000.jpg","width":3233,"height":2000,"fileSize":5472,"enterTime":"2009-07-09","enterNickName":"jackali","description":"","commentCount":0,"imgDetailUrl":"http://movie.mtime.com/79055/posters_and_images/1059362/","topNum":4,"newIndex":37,"typeHotIndex":0,"typeNewIndex":37,"img_235":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_235X235.jpg"},{"id":829271,"title":"官方剧照 #06","type":6,"subType":6001,"status":1,"img_220":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_220X220.jpg","img_1000":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_1000X1000.jpg","width":842,"height":477,"fileSize":74,"enterTime":"2008-12-17","enterNickName":"边界","description":"","commentCount":0,"imgDetailUrl":"http://movie.mtime.com/79055/posters_and_images/829271/","topNum":0,"newIndex":51,"typeHotIndex":1,"typeNewIndex":51,"img_235":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_235X235.jpg"},{"id":625583,"title":"官方剧照 

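To get the image source paths that follow img_1000, I found no way to reach them directly with XPath syntax: the URLs sit inside the JavaScript text of a script element, which XPath can select as a node but cannot parse. As a workaround, following http://www.cnblogs.com/Garvey/p/6697162.html, I apply re to the selected node to get the content I need.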
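What the pattern captures can be checked in isolation with the standard-library re module before wiring it into a spider. The sketch below is only an illustration: script_text is a shortened stand-in assembled from the first two entries of the source above, and the names script_text and IMG_1000_PATTERN are assumptions made for this example.

```python
import re

# Shortened stand-in for the inline script shown above (an assumption for
# this example; the real page carries many more fields and entries).
script_text = (
    'var imageList = [{"stagepicture":[{"officialstageimage":['
    '{"id":1059362,'
    '"img_220":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_220X220.jpg",'
    '"img_1000":"http://img31.mtime.cn/pi/2014/02/28/042610.59909056_1000X1000.jpg"},'
    '{"id":829271,'
    '"img_220":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_220X220.jpg",'
    '"img_1000":"http://img31.mtime.cn/pi/2014/02/28/042556.29233713_1000X1000.jpg"}'
    ']}]}];'
)

# The lazy quantifier (.+?) stops at the first 'jpg"' after each img_1000 key,
# so every match yields exactly one URL.
IMG_1000_PATTERN = re.compile(r'"img_1000":"(.+?jpg)"')

urls = IMG_1000_PATTERN.findall(script_text)
for url in urls:
    print(url)
```

Only the img_1000 values are captured; the img_220 entries are skipped because the key is part of the pattern. The spider that follows applies the same pattern through the selector's re() method.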

import scrapy

# S0819MtimeTiantangItem is the item class defined in this project's items.py;
# import it from your own project package.


class MtimeSpider(scrapy.Spider):
    name = "mtime"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["mtime.com"]
    start_urls = (
        'http://movie.mtime.com/79055/posters_and_images/posters/hot.html',
    )

    def parse(self, response):
        # Select the inline <script> nodes, then pull every img_1000 URL out with a regex
        allpics = response.xpath("//script[@type='text/javascript']").re(r'"img_1000":"(.+?jpg)"')
        print(len(allpics))
        # Name each picture by its 1-based position in the match list
        for i, pic in enumerate(allpics, start=1):
            item = S0819MtimeTiantangItem()
            item['name'] = str(i)
            item['addr'] = pic
            print("+++++" + item['addr'])
            print("+++++" + item['name'])
            yield item
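Because the script body is a JSON array assigned to imageList, an alternative to the regex is to decode that array and walk it. This is a rough sketch of that idea, not the approach used in this post: the sample text, the example.com URLs, and the helper collect_values are all hypothetical, and it assumes the real page's array is valid JSON.

```python
import json

# Hypothetical sample shaped like the script above, with made-up URLs.
script_text = (
    'var imageList = [{"stagepicture":[{"officialstageimage":['
    '{"id":1,"img_220":"http://example.com/a_220X220.jpg",'
    '"img_1000":"http://example.com/a_1000X1000.jpg"},'
    '{"id":2,"img_220":"http://example.com/b_220X220.jpg",'
    '"img_1000":"http://example.com/b_1000X1000.jpg"}'
    ']}]}]; var other = 1;'
)


def collect_values(node, key):
    """Recursively gather every value stored under `key` in nested JSON data."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            else:
                found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for child in node:
            found.extend(collect_values(child, key))
    return found


# Locate the first '[' after the variable name, then let raw_decode parse one
# JSON value from there and ignore whatever JavaScript follows it.
start = script_text.index('[', script_text.index('var imageList'))
image_list, _end = json.JSONDecoder().raw_decode(script_text[start:])

urls = collect_values(image_list, "img_1000")
print(urls)
```

The regex is shorter and is enough when only one field is needed; decoding the JSON may be more convenient when several fields (title, width, height) are wanted together.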

 

posted @ 2017-08-20 08:43  笑面浮屠