Web Crawling with lxml (Part 1)

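As the name suggests, a web crawler fetches data from the internet and then processes it until only the part you care about remains. For example, to find out what salaries for test automation engineers really looked like in 2019, you could collect the salary ranges of every test automation job posted on the recruitment platforms and then rank and analyze them. In practice, of course, it is not as simple as it sounds.

In Python web crawling, extracting data from a fetched page mainly relies on lxml, the re module, and beautifulsoup4. This post starts with lxml. Install it first with the command: pip3 install lxml. Once the installation succeeds, it is ready to use. The page itself is fetched with the requests library, which is not covered in detail here. In Chrome, install the XPath Helper extension from the Chrome Web Store; once it is installed, the browser displays its icon. lxml locates data mainly through XPath expressions, and anyone familiar with UI automation testing will recognize XPath as one of the element-locating strategies used there. The example here extracts the upcoming movies from Douban Movies, see the screenshot: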

Open the browser's developer tools and inspect the elements: all of the movie data sits inside ul elements whose class is lists, see the screenshot:

So the first step is to select those ul elements. The code:

#!/usr/bin/env python
# -*-coding:utf-8 -*-


import requests
from lxml import etree


def get_douban():
    # Fetch the page source, sending a browser-like User-Agent header.
    r = requests.get(
        url='https://movie.douban.com/cinema/nowplaying/xian/',
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
    return r.text


def get_douban_movies():
    # Parse the HTML, then select every <ul> whose class attribute is "lists".
    html = etree.HTML(get_douban(), parser=etree.HTMLParser(encoding='utf-8'))
    uls = html.xpath('//ul[@class="lists"]')
    print(uls)


if __name__ == '__main__':
    get_douban_movies()
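
The same XPath query can be tried without any network request. The sketch below is only an illustration: SAMPLE_HTML and find_lists are hypothetical names, and the markup is a made-up stand-in for the Douban page that keeps nothing but the two ul elements with class lists.

```python
from lxml import etree

# Hypothetical stand-in for the Douban page: two <ul class="lists"> blocks.
SAMPLE_HTML = """
<html><body>
  <ul class="lists"><li class="list-item" data-title="Now Showing A"></li></ul>
  <ul class="lists"><li class="list-item" data-title="Upcoming B"></li></ul>
</body></html>
"""


def find_lists(page_source):
    """Return every <ul> element whose class attribute is exactly "lists"."""
    html = etree.HTML(page_source)
    return html.xpath('//ul[@class="lists"]')


lists = find_lists(SAMPLE_HTML)
print(len(lists))                # how many <ul> elements matched
print([ul.tag for ul in lists])  # each match is an Element object
```

xpath() always returns a list here, even when only one element matches, which is why a specific ul has to be picked out by index.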

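Running the script against the Douban page prints a list containing two elements: the first is the list of movies now showing, and the second is the list of upcoming movies. So we take the second item of the list, loop over it, and serialize each entry. The modified code: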

#!/usr/bin/env python
# -*-coding:utf-8 -*-


import requests
from lxml import etree


def get_douban():
    # Fetch the page source, sending a browser-like User-Agent header.
    r = requests.get(
        url='https://movie.douban.com/cinema/nowplaying/xian/',
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
    return r.text


def get_douban_movies():
    html = etree.HTML(get_douban(), parser=etree.HTMLParser(encoding='utf-8'))
    # Index 1 is the second <ul class="lists">, the upcoming movies.
    upcoming = html.xpath('//ul[@class="lists"]')[1]
    # Iterating over an element yields its child elements, here one <li> per movie.
    for li in upcoming:
        print(etree.tostring(li, encoding='utf-8').decode('utf-8'))


if __name__ == '__main__':
    get_douban_movies()

 

The page data obtained after running the code:

C:\Python37\python3.exe D:/git/GITHUB/WebCrawler/dataParsing/即将上映.py
<li id="30165034" class="list-item" data-title="&#x6628;&#x65E5;&#x5947;&#x8FF9;" data-wish="5906" data-duration="116&#x5206;&#x949F;" data-region="&#x82F1;&#x56FD;" data-director="&#x4E39;&#x5C3C;&#xB7;&#x535A;&#x4F0A;&#x5C14;" data-actors="&#x5E0C;&#x7C73;&#x4EC0;&#xB7;&#x5E15;&#x7279;&#x5C14; / &#x8389;&#x8389;&#xB7;&#x8A79;&#x59C6;&#x65AF; / &#x51EF;&#x7279;&#xB7;&#x9EA6;&#x514B;&#x91D1;&#x519C;" data-category="upcoming" data-enough="false" data-subject="30165034">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/30165034/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2561245949.jpg" alt="昨日奇迹" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/30165034/?from=playing_poster" target="_blank" title="昨日奇迹" data-psource="title">
                                        昨日奇迹
                                    </a>
                            </li>
                            <li class="release-date">
                                08月16日上映
                            </li>
                        </ul>
                    </li>
                    
<li id="33381471" class="list-item" data-title="&#x53E4;&#x7A91;&#x8FF7;&#x8E2A;" data-wish="284" data-duration="86&#x5206;&#x949F;" data-region="&#x4E2D;&#x56FD;&#x5927;&#x9646;" data-director="&#x8881;&#x6770;" data-actors="&#x90ED;&#x96EA;&#x8299; / &#x7F57;&#x5F6C; / &#x5218;&#x6C38;&#x5947;" data-category="upcoming" data-enough="false" data-subject="33381471">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/33381471/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558727463.jpg" alt="古窑迷踪" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/33381471/?from=playing_poster" target="_blank" title="古窑迷踪" data-psource="title">
                                        古窑迷踪
                                    </a>
                            </li>
                            <li class="release-date">
                                08月16日上映
                            </li>
                        </ul>
                    </li>
                    
<li id="34482589" class="list-item" data-title="&#x6211;&#x4EEC;&#x7684;&#x56DB;&#x5341;&#x5E74;" data-wish="269" data-duration="94&#x5206;&#x949F;" data-region="&#x4E2D;&#x56FD;&#x5927;&#x9646;" data-director="&#x674E;&#x6613;&#x7965; &#x9C8D;&#x632F;&#x6C5F; &#x674E;&#x632F;&#x4F1F; &#x970D;&#x731B;" data-actors="&#x674E;&#x6613;&#x7965; / &#x9C8D;&#x632F;&#x6C5F; / &#x91D1;&#x5B8F;" data-category="upcoming" data-enough="false" data-subject="34482589">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/34482589/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2562828338.jpg" alt="我们的四十年" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/34482589/?from=playing_poster" target="_blank" title="我们的四十年" data-psource="title">
                                        我们的四十年
                                    </a>
                            </li>
                            <li class="release-date">
                                08月16日上映
                            </li>
                                <li class="sbtn">
                                    <a class="ticket-btn" href="https://movie.douban.com/ticket/redirect/?movie_id=34482589" target="_blank">
                                        选座购票
                                    </a>
                                </li>
                        </ul>
                    </li>
                    
<li id="33383770" class="list-item" data-title="&#x730E;&#x88AD;" data-wish="77" data-duration="90&#x5206;&#x949F;" data-region="&#x4E2D;&#x56FD;&#x5927;&#x9646;" data-director="&#x5218;&#x8273;&#x6770;" data-actors="&#x674E;&#x5929;&#x70E8; / &#x51AF;&#x521A; / &#x8C2D;&#x98DE;&#x71D5;" data-category="upcoming" data-enough="false" data-subject="33383770">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/33383770/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561001593.jpg" alt="猎袭" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/33383770/?from=playing_poster" target="_blank" title="猎袭" data-psource="title">
                                        猎袭
                                    </a>
                            </li>
                            <li class="release-date">
                                08月22日上映
                            </li>
                        </ul>
                    </li>
                    
<li id="27163278" class="list-item" data-title="&#x901F;&#x5EA6;&#x4E0E;&#x6FC0;&#x60C5;&#xFF1A;&#x7279;&#x522B;&#x884C;&#x52A8;" data-wish="28176" data-duration="134&#x5206;&#x949F;" data-region="&#x7F8E;&#x56FD;" data-director="&#x5927;&#x536B;&#xB7;&#x96F7;&#x5947;" data-actors="&#x9053;&#x6069;&#xB7;&#x5F3A;&#x68EE; / &#x6770;&#x68EE;&#xB7;&#x65AF;&#x5766;&#x68EE; / &#x4F0A;&#x5FB7;&#x91CC;&#x65AF;&#xB7;&#x827E;&#x5C14;&#x5DF4;" data-category="upcoming" data-enough="false" data-subject="27163278">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/27163278/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561542272.jpg" alt="速度与激情:特别行动" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/27163278/?from=playing_poster" target="_blank" title="速度与激情:特别行动" data-psource="title">
                                        速度与激情:特...
                                    </a>
                            </li>
                            <li class="release-date">
                                08月23日上映
                            </li>
                        </ul>
                    </li>
                    
<li id="26331839" class="list-item" data-title="&#x4FDD;&#x6301;&#x6C89;&#x9ED8;" data-wish="20899" data-duration="96&#x5206;&#x949F;" data-region="&#x4E2D;&#x56FD;&#x5927;&#x9646;" data-director="&#x5468;&#x53EF;" data-actors="&#x5468;&#x8FC5; / &#x5434;&#x9547;&#x5B87; / &#x7956;&#x5CF0;" data-category="upcoming" data-enough="false" data-subject="26331839">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/26331839/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558702991.jpg" alt="保持沉默" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/26331839/?from=playing_poster" target="_blank" title="保持沉默" data-psource="title">
                                        保持沉默
                                    </a>
                            </li>
                            <li class="release-date">
                                08月23日上映
                            </li>
                                <li class="sbtn">
                                    <a class="ticket-btn" href="https://movie.douban.com/ticket/redirect/?movie_id=26331839" target="_blank">
                                        选座购票
                                    </a>
                                </li>
                        </ul>
                    </li>
                    
<li id="30232732" class="list-item" data-title="&#x4FA0;&#x8DEF;&#x76F8;&#x9022;" data-wish="2207" data-duration="96&#x5206;&#x949F;(&#x4E2D;&#x56FD;&#x5927;&#x9646;)" data-region="&#x4E2D;&#x56FD;&#x5927;&#x9646;" data-director="&#x90B5;&#x4E9A;&#x5CF0;" data-actors="&#x59DC;&#x6B66; / &#x90B5;&#x5175; / &#x59DA;&#x5A06;" data-category="upcoming" data-enough="false" data-subject="30232732">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/30232732/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2560507417.jpg" alt="侠路相逢" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/30232732/?from=playing_poster" target="_blank" title="侠路相逢" data-psource="title">
                                        侠路相逢
                                    </a>
                            </li>
                            <li class="release-date">
                                08月23日上映
                            </li>
                        </ul>
                    </li>
                    
<li id="27602052" class="list-item" data-title="&#x547C;&#x4F26;&#x8D1D;&#x5C14;&#x57CE;" data-wish="131" data-duration="98&#x5206;&#x949F;" data-region="&#x4E2D;&#x56FD;&#x5927;&#x9646;" data-director="&#x6D82;&#x4EEC;" data-actors="&#x8428;&#x4EC1;&#x9AD8;&#x5A03; / &#x963F;&#x5C14;&#x5FB7;&#x90A3; / &#x963F;&#x8339;&#x5A1C;" data-category="upcoming" data-enough="false" data-subject="27602052">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/27602052/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2554877386.jpg" alt="呼伦贝尔城" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/27602052/?from=playing_poster" target="_blank" title="呼伦贝尔城" data-psource="title">
                                        呼伦贝尔城
                                    </a>
                            </li>
                            <li class="release-date">
                                08月23日上映
                            </li>
                        </ul>
                    </li>
                    
<li id="26897048" class="list-item" data-title="&#x78A7;&#x8840;&#x4E39;&#x7802;" data-wish="110" data-duration="101&#x5206;&#x949F;" data-region="&#x4E2D;&#x56FD;&#x5927;&#x9646;" data-director="&#x53F8;&#x5C0F;&#x51AC;" data-actors="&#x5F20;&#x5BB6;&#x9F0E; / &#x5218;&#x957F;&#x5FB7; / &#x68A6;&#x79E6;" data-category="upcoming" data-enough="false" data-subject="26897048">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/26897048/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2390932870.jpg" alt="碧血丹砂" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/26897048/?from=playing_poster" target="_blank" title="碧血丹砂" data-psource="title">
                                        碧血丹砂
                                    </a>
                            </li>
                            <li class="release-date">
                                08月23日上映
                            </li>
                        </ul>
                    </li>
                    
<li id="34658362" class="list-item" data-title="&#x5230;&#x4F60;&#x8EAB;&#x8FB9;" data-wish="53" data-duration="96&#x5206;&#x949F;" data-region="&#x4E2D;&#x56FD;&#x5927;&#x9646;" data-director="&#x5F90;&#x5E06;" data-actors="&#x9AD8;&#x5229;&#x8679; / &#x6218;&#x6708;&#x6E90; / &#x9648;&#x4E00;&#x51E1;" data-category="upcoming" data-enough="false" data-subject="34658362">
                        <ul class="">
                            <li class="poster">
                                    <a href="https://movie.douban.com/subject/34658362/?from=playing_poster" target="_blank" data-psource="poster">
                                        <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2564956343.jpg" alt="到你身边" rel="nofollow" class=""/>
                                    </a>
                            </li>
                            <li class="stitle">
                                    <a class="" href="https://movie.douban.com/subject/34658362/?from=playing_poster" target="_blank" title="到你身边" data-psource="title">
                                        到你身边
                                    </a>
                            </li>
                            <li class="release-date">
                                08月23日上映
                            </li>
                        </ul>
                    </li>
            

Process finished with exit code 0
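
The `&#x...;` sequences in the attribute values above are numeric character references for the same Chinese characters, and lxml decodes them when an attribute is read. A minimal sketch, assuming a single li trimmed from the output above (SAMPLE_LI is a hypothetical name used only for this illustration):

```python
from lxml import etree

# One <li> trimmed from the output above; its attributes use character references.
SAMPLE_LI = (
    '<ul class="lists">'
    '<li class="list-item" '
    'data-title="&#x6628;&#x65E5;&#x5947;&#x8FF9;" '
    'data-region="&#x82F1;&#x56FD;"></li>'
    '</ul>'
)

root = etree.HTML(SAMPLE_LI)
li = root.xpath('//li[@class="list-item"]')[0]

title = li.get('data-title')          # read an attribute through the element API
region = li.xpath('@data-region')[0]  # read an attribute through XPath
print(title, region)
```

Both access styles return ordinary Python strings, so no extra decoding step is needed before the values are stored or printed.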

 

The next step is to extract the details of each upcoming movie, such as its title, poster, region, director, and cast. The modified code:

#!/usr/bin/env python
# -*-coding:utf-8 -*-


import requests
from lxml import etree


def get_douban():
    # Fetch the page source, sending a browser-like User-Agent header.
    r = requests.get(
        url='https://movie.douban.com/cinema/nowplaying/xian/',
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
    return r.text


def get_douban_movies():
    movies = []
    html = etree.HTML(get_douban(), parser=etree.HTMLParser(encoding='utf-8'))
    # Index 1 is the second <ul class="lists">, the upcoming movies.
    upcoming = html.xpath('//ul[@class="lists"]')[1]
    # Each child <li> carries the movie details in its data-* attributes.
    for li in upcoming:
        title = li.xpath('@data-title')[0]
        duration = li.xpath('@data-duration')[0]
        region = li.xpath('@data-region')[0]
        director = li.xpath('@data-director')[0]
        actors = li.xpath('@data-actors')[0]
        # The poster URL is the src of the <img> nested inside the <li>.
        poster = li.xpath('.//img/@src')[0]
        movie = {
            '电影名称': title,   # movie title
            '时长': duration,    # running time
            '地区': region,      # region
            '导演': director,    # director
            '主演': actors,      # cast
            '海报': poster       # poster URL
        }
        movies.append(movie)
    for item in movies:
        print(item)

if __name__ == '__main__':
    get_douban_movies()

Running the code prints one dictionary for each upcoming movie.
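
Indexing an XPath result with [0] raises IndexError when a movie lacks that attribute or has no image. The sketch below shows one possible way to make the extraction more tolerant and to pick up the release date as well. It is only an illustration under stated assumptions: SAMPLE_UPCOMING, first_or_default, and parse_upcoming are hypothetical names, the English dictionary keys are chosen for the example, and the markup is a simplified imitation of the li structure shown in the output earlier.

```python
from lxml import etree

# Simplified, made-up markup that imitates the <li> structure printed earlier.
SAMPLE_UPCOMING = """
<ul class="lists">
  <li class="list-item" data-title="Movie A" data-region="Region A">
    <ul class="">
      <li class="poster"><a href="#"><img src="https://example.com/a.jpg" alt="Movie A"/></a></li>
      <li class="release-date">
          08月16日上映
      </li>
    </ul>
  </li>
  <li class="list-item" data-title="Movie B">
    <ul class=""></ul>
  </li>
</ul>
"""


def first_or_default(values, default=''):
    """Return the first XPath result, stripped, or a default when the list is empty."""
    return values[0].strip() if values else default


def parse_upcoming(page_source):
    html = etree.HTML(page_source)
    movies = []
    # Select only the direct <li class="list-item"> children of the outer list.
    for li in html.xpath('//ul[@class="lists"]/li[@class="list-item"]'):
        movies.append({
            'title': li.get('data-title', ''),    # get() falls back to '' if missing
            'region': li.get('data-region', ''),
            'poster': first_or_default(li.xpath('.//img/@src')),
            'release_date': first_or_default(
                li.xpath('.//li[@class="release-date"]/text()')),
        })
    return movies


for movie in parse_upcoming(SAMPLE_UPCOMING):
    print(movie)
```

With this approach a movie that lacks a poster or a region still produces a record with empty strings, so one incomplete entry does not stop the whole run.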

 



posted @ 2019-08-11 22:57  无涯(WuYa)