Getting Started with Scrapy

Where to get Scrapy:

  • Official site: https://scrapy.org/
  • GitHub: https://github.com/scrapy/scrapy

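Building the Scrapy documentation

  • Download the Scrapy source from GitHub.
  • Go to scrapy-master\docs and build the HTML documentation as described in README.rst:
cd scrapy-master\docs
pip install -r requirements.txt
make html

The pip command installs the Sphinx Python library along with the other documentation requirements, and make html writes the generated documentation to build/html.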

Setting up the development environment

Windows:

pip install wheel
pip install lxml
pip install pyopenssl
pip install twisted
pip install pywin32
pip install scrapy

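If the installation fails with:

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

download a prebuilt wheel (.whl) for the package that failed to build from https://www.lfd.uci.edu/~gohlke/pythonlibs/ and install it by passing pip the full path to the downloaded file:

  pip install <full path to the downloaded file>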
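
To confirm that the packages listed above actually ended up installed, a small check along these lines may help. This is only a sketch: the distribution names in PACKAGES are assumed to match the pip commands above, and installed_versions is a hypothetical helper, not part of Scrapy.

```python
from importlib import metadata

# Assumption: these are the distribution names behind the pip commands above.
PACKAGES = ["wheel", "lxml", "pyOpenSSL", "Twisted", "pywin32", "Scrapy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed is the one to fetch as a prebuilt wheel.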

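Debugging Scrapy

Scrapy is not easy to debug, but logs alone are sometimes not enough when you want to study how the framework works internally. Here is one way to debug Scrapy.

The demo uses the official example project: https://github.com/scrapy/quotesbot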

 

Running the scrapy command is essentially the same as running this Python script:

#!/usr/bin/python

from scrapy.cmdline import execute
execute()

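So, save the code above as debug.py in the Scrapy project directory.

Next, configure the debugger to run debug.py. Because execute() reads its arguments from sys.argv, pass the usual scrapy arguments (for example, crawl followed by a spider name) as script parameters.

You can then set breakpoints and debug with debug.py as the entry point.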
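
As an alternative to setting script parameters in the debugger, the arguments can be passed to execute() directly. The sketch below assumes a spider named toscrape-css (taken from the quotesbot example; replace it with a spider from your own project). build_argv and main are illustrative helpers. The scrapy.cfg check only looks at the current directory, which is a simplification.

```python
#!/usr/bin/python
"""debug.py: run a Scrapy command in-process so an IDE debugger can step into it."""
import os

# Assumption: a spider with this name exists in the project (quotesbot has one).
SPIDER_NAME = "toscrape-css"


def build_argv(spider_name):
    """Return the argv list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def main():
    # The crawl command is only available inside a Scrapy project.
    # Simplification: only the current working directory is checked.
    if not os.path.exists("scrapy.cfg"):
        print("Run debug.py from the directory that contains scrapy.cfg.")
        return

    # Imported lazily so the module can be loaded without starting a crawl.
    from scrapy.cmdline import execute

    # execute() uses sys.argv when called without arguments; passing argv
    # explicitly removes the need for debugger script parameters.
    execute(build_argv(SPIDER_NAME))


if __name__ == "__main__":
    main()
```

With this variant, running debug.py under the debugger starts the crawl without any extra configuration.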

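Using the Scrapy shell

Start the shell:

scrapy shell 'http://quotes.toscrape.com/page/1/'

On Windows, wrap the URL in double quotes instead of single quotes.

Using CSS selectors

Get the title element: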

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]    

Get the title text:

>>> response.css('title::text').extract()
['Quotes to Scrape']

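The response object

response.body is the raw response body as bytes.

response.text is the same body decoded to a str using the response encoding.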

LinkExtractor

The HTML of the target page at the time of writing:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>职位搜索 | 社会招聘 | Tencent 腾讯招聘</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <!-- Js Css -->
    <link media="screen" href="//cdn.m.tencent.com/hr_static/css/all.css?max_age=86412" type="text/css"
          rel="stylesheet"/>
    <script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/jquery-1.7.2.min.js"></script>
    <script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/jquery-ui-1.7.2.custom.min.js"></script>
    <script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/thickbox.js"></script>
    <link media="screen" href="//cdn.m.tencent.com/hr_static/css/thickbox.css" type="text/css" rel="stylesheet"/>
    <script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/functions.js"></script>
    <script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/utils.js"></script>
    <script language="javascript" src="//vm.gtimg.cn/tencentvideo/txp/js/txplayer.js" charset="utf-8"></script>
    <script type="text/javascript" src="//cdn.m.tencent.com/hr_static/js/all.js?max_age=86412"></script>
    <!-- Js Css -->
    <script>
        var keywords_json = [];
    
    </script>
</head>

<body>
<div id="header">
    <div class="maxwidth">
        <a href="index.php" class="left" id="logo"><img src="//cdn.m.tencent.com/hr_static/img/logo.png"/></a>
        <div class="right" id="headertr">
            <div class="right pl9" id="topshares">
                <div class="shares">
                    <span class="left">分享到:</span>
                    <!--<a href="javascript:;" onclick="shareto('qqt','top');" id="qqt" title="分享到腾讯微博">分享到腾讯微博</a>-->
                    <a href="javascript:;" onclick="shareto('qzone','top');" id="qzone" title="分享到QQ空间">分享到QQ空间</a>
                    <!--<a href="javascript:;" onclick="shareto('pengyou','top');" id="pengyou" title="分享到腾讯朋友">分享到腾讯朋友</a>-->
                    <a href="javascript:;" onclick="shareto('sinat','top');" id="sinat" title="分享到新浪微博">分享到新浪微博</a>
                    <!--<a href="javascript:;"  onclick="shareto('renren','top');"id="renren" title="分享到人人网">分享到人人网</a>-->
                    <!--<a href="javascript:;"  onclick="shareto('kaixin001','top');"id="kaixin" title="分享到开心网">分享到开心网</a>-->
                    <div class="clr"></div>
                </div>
                <!--<a href="javascript:;">分享</a>-->
            </div>
            <!--<div class="right pl9">-->
            <!--<a href="http://t.qq.com/QQjobs" id="tqq" target="_blank">收听腾讯招聘</a>-->
            <!--</div>-->
            <div class="right pr9">
                <a href="login.php" id="header_login_anchor">登录</a><span class="plr9">|</span><a href="reg.php">注册</a>
                <span class="plr9">|</span><a href="question.php">反馈建议</a>
                <span class="plr9">|</span><a href="http://careers.tencent.com/global" target="_blank">Tencent Global
                Talent</a>
                <script>
                        var User_Account = "";
                    
                </script>
            </div>
            <div class="clr"></div>
        </div>
        <div class="clr"></div>
    </div>
    <div id="menus">
        <div class="maxwidth">
            <ul id="menu" class="left">
                <li id="nav1"><a href="index.php">&nbsp;</a></li>
                <li id="nav2" class="active"><a href="social.php">&nbsp;</a></li>
                <li id="nav3"><a href="about.php">&nbsp;</a></li>
                <li id="nav4"><a href="workInTencent.php">&nbsp;</a></li>
            </ul>
            <a class="right texti9" target="_blank" id="navxy" href="http://join.qq.com">校园招聘</a>
            <div class="clr"></div>
        </div>
    </div>
</div>
<div id="sociaheader">
</div>
<div id="position" class="maxwidth">
    <a name="a" id="a"></a>
    <div class="left wcont_b box">
        <div class="blueline">
            <div class="butzwss"></div>
        </div>
        <form id="searchform" class="buts1">
            <div id="searchrow1">
                <div id="search1"><input id="search2" name="keywords" t="请输入关键词" value="" class="left"/><input
                        class="left" id="search3" type="submit" value=""/>
                    <div class="clr"></div>
                </div>
                <input type="hidden" name="lid" value="0"/>
                <input type="hidden" name="tid" value="0"/>
            </div>
            <div id="searchrow2">
                <div class="srow2l left"></div>
                <div class="left items pl9 itemnone" id="additems">
                    <a href="position.php?keywords=&tid=0" class="item active"><span><font>全部</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=2218"><span><font>深圳</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=2156"><span><font>北京</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=2175"><span><font>上海</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=2196"><span><font>广州</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=2268"><span><font>成都</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=2252"><span><font>杭州</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=2426"><span><font>昆明</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=33"><span><font>美国</font></span></a>
                    <a class="item" href="position.php?keywords=&tid=0&lid=2459"><span><font>中国香港</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2418"><span><font>长春</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2355"><span><font>武汉</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2226"><span><font>重庆</font></span></a>
                    <a class="item itemhide" href="position.php?keywords=&tid=0&lid=90"><span><font>荷兰</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2406"><span><font>沈阳</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2381"><span><font>西安</font></span></a>
                    <a class="item itemhide" href="position.php?keywords=&tid=0&lid=59"><span><font>日本</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2436"><span><font>贵阳</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2393"><span><font>太原</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2346"><span><font>郑州</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2314"><span><font>南宁</font></span></a>
                    <a class="item itemhide" href="position.php?keywords=&tid=0&lid=2442"><span><font>呼和浩特</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2458"><span><font>西宁</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=95"><span><font>雄安新区</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=81"><span><font>新加坡</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2320"><span><font>合肥</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2439"><span><font>兰州</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2448"><span><font>银川</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2225"><span><font>天津</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2407"><span><font>大连</font></span></a>
                    <a class="item itemhide" href="position.php?keywords=&tid=0&lid=2453"><span><font>乌鲁木齐</font></span></a>
                    <a class="item itemhide" href="position.php?keywords=&tid=0&lid=2336"><span><font>石家庄</font></span></a>
                    <a class="item itemhide"
                       href="position.php?keywords=&tid=0&lid=2283"><span><font>福州</font></span></a>
                </div>
                <div class="left"><a href="javascript:;" class="more2">更多</a></div>
                <div class="clr"></div>
            </div>
            <div id="searchrow3">
                <div class="srow2l left"></div>
                <div class="left items pl9">
                    <a href="position.php?keywords=&lid=0" class="item active"><span><font>全部</font></span></a>
                    <a class="item" href="position.php?keywords=&lid=0&tid=87"><span><font>技术类</font></span></a>
                    <a class="item" href="position.php?keywords=&lid=0&tid=82"><span><font>产品/项目类</font></span></a>
                    <a class="item" href="position.php?keywords=&lid=0&tid=83"><span><font>市场类</font></span></a>
                    <a class="item" href="position.php?keywords=&lid=0&tid=81"><span><font>设计类</font></span></a>
                    <a class="item" href="position.php?keywords=&lid=0&tid=84"><span><font>职能类</font></span></a>
                    <a class="item" href="position.php?keywords=&lid=0&tid=85"><span><font>内容编辑类</font></span></a>
                    <a class="item" href="position.php?keywords=&lid=0&tid=86"><span><font>客户服务类</font></span></a>
                </div>
                <div class="clr"></div>
            </div>
        </form>
        <table class="tablelist" cellpadding="0" cellspacing="0">
            <tr class="h">
                <td class="l" width="374">职位名称</td>
                <td>职位类别</td>
                <td>人数</td>
                <td>地点</td>
                <td>发布时间</td>
            </tr>
            <tr class="even">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44053&keywords=&tid=0&lid=0">28601-211
                    微信支付交通行业车主产品C端运营经理(深圳)</a></td>
                <td>产品/项目类</td>
                <td>1</td>
                <td>深圳</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="odd">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44054&keywords=&tid=0&lid=0">28601-321
                    微信支付交通行业商务拓展经理(深圳)</a></td>
                <td>市场类</td>
                <td>1</td>
                <td>深圳</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="even">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44044&keywords=&tid=0&lid=0">WXG02-116
                    微信质量平台开发工程师(广州)</a></td>
                <td>技术类</td>
                <td>1</td>
                <td>广州</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="odd">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44046&keywords=&tid=0&lid=0">WXG03-211
                    公众平台产品运营(广州)</a></td>
                <td>产品/项目类</td>
                <td>1</td>
                <td>广州</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="even">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44048&keywords=&tid=0&lid=0">WXG03-211
                    公众平台数据运营(广州)</a></td>
                <td>产品/项目类</td>
                <td>1</td>
                <td>广州</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="odd">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44049&keywords=&tid=0&lid=0">WXG03-211小程序数据分析(广州)</a>
                </td>
                <td>技术类</td>
                <td>1</td>
                <td>广州</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="even">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44052&keywords=&tid=0&lid=0">28601-211
                    微信支付交通行业车主产品B端运营经理(深圳)</a></td>
                <td>产品/项目类</td>
                <td>1</td>
                <td>深圳</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="odd">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44043&keywords=&tid=0&lid=0">TEG04-腾讯微校产品运营经理(教育合作拓展)(深圳)</a>
                </td>
                <td>产品/项目类</td>
                <td>1</td>
                <td>深圳</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="even">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44047&keywords=&tid=0&lid=0">WXG03-211
                    开放平台产品运营(广州)</a></td>
                <td>产品/项目类</td>
                <td>1</td>
                <td>广州</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="odd">
                <td class="l square"><a target="_blank" href="position_detail.php?id=44023&keywords=&tid=0&lid=0">MIG02-业务管理经理(深圳)</a>
                </td>
                <td>产品/项目类</td>
                <td>1</td>
                <td>深圳</td>
                <td>2018-09-07</td>
            </tr>
            <tr class="f">
                <td colspan="5">
                    <div class="left"><span class="lightblue total">3319</span>个职位</div>
                    <div class="right">
                        <div class="pagenav"><a href="javascript:;" class="noactive" id="prev">上一页</a><a class="active"
                                                                                                         href="javascript:;">1</a><a
                                href="position.php?&start=10#a">2</a><a href="position.php?&start=20#a">3</a><a
                                href="position.php?&start=30#a">4</a><a href="position.php?&start=40#a">5</a><a
                                href="position.php?&start=50#a">6</a><a href="position.php?&start=60#a">7</a><a
                                href="position.php?&start=70#a">...</a><a href="position.php?&start=3310#a">332</a><a
                                href="position.php?&start=10#a" id="next">下一页</a>
                            <div class="clr"></div>
                        </div>
                    </div>
                    <div class="clr"></div>
                </td>
            </tr>
        </table>
    </div>
    <div class="right wcont_s box">
        <div class="blueline">
            <div class="butcjwt"></div>
        </div>
        <div class="module_faqs square"><a href="faq.php?id=5" title="如何应聘腾讯公司的职位?">如何应聘腾讯公司的职位?</a><a
                href="faq.php?id=3" title="应届生如何应聘?">应届生如何应聘?</a><a href="faq.php?id=19"
                                                                    title="腾讯应聘流程是什么?">腾讯应聘流程是什么?</a><a
                href="faq.php?id=20" title="我注册了简历,但为什么没有人联系我?">我注册了简历,但为什么没...</a><a href="faq.php?id=22"
                                                                                      title="我忘记密码了,怎么办?">我忘记密码了,怎么办?</a><a
                href="faq.php?id=23" title="如何进行简历修改?">如何进行简历修改?</a></div>
    </div>
    <div class="clr"></div>
</div>
<div id="homeDep">
    <table id="homeads">
        <tr>
            <td align="center"><a href="http://tencent.avature.net/career" target="blank">全球招聘</a></td>
            <td align="center"><a href="http://game.qq.com/hr/" target="blank">互动娱乐事业群招聘</a></td>
            <td align="center"><a href="http://hr.tencent.com/position.php?lid=&tid=&keywords=WXG" target="blank">微信事业群招聘</a>
            </td>
            <td align="center"><a href="http://hr.qq.com/" target="blank">技术工程事业群招聘</a></td>
            <td align="center"><a href="http://snghr.tencent.com" target="blank">社交网络事业群招聘</a></td>
            <td align="center"><a href="http://mighr.qq.com" target="blank">移动互联网事业群招聘</a></td>
            <td align="center"><a href="http://hr.tencent.com/position.php?keywords=OMG" target="blank">网络媒体事业群招聘</a>
            </td>
        </tr>
    </table>
</div>
<div id="footer">
    <div>
        <a href="http://www.tencent.com/" target="_blank">关于腾讯</a><span>|</span><a
            href="http://www.qq.com/contract.shtml" target="_blank">服务条款</a><span>|</span><a
            href="http://hr.tencent.com/" target="_blank">腾讯招聘</a><span>|</span><a
            href="http://careers.tencent.com/global" target="_blank">Tencent Global Talent</a><span>|</span><a
            href="http://gongyi.qq.com/" target="_blank">腾讯公益</a><span>|</span><a href="http://service.qq.com/"
                                                                                  target="_blank">客服中心</a>
    </div>
    <p>Copyright &copy; 1998 - 2018 Tencent. All Rights Reserved.</p>
</div>
<script type="text/javascript" src="//tajs.qq.com/stats?sId=64934792" charset="UTF-8"></script>
</body>
</html>

Using LinkExtractor in the shell

Start the shell on the job listing page, then import LinkExtractor at the Python prompt:

 scrapy shell "http://hr.tencent.com/position.php?&start=0#a"
 from scrapy.linkextractors import LinkExtractor

 

 

Posted 2018-08-14 15:27 by 逐梦客!