Web Crawling with lxml (Part 1)
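A web crawler, as the name suggests, fetches data from the internet and then processes it into the form you need. For example, to find out what salaries for automation test engineers looked like in 2019, you could collect the salary ranges of every automation testing job posted on a recruitment platform and then rank and analyze them. In practice, of course, it is not as simple as it sounds.

In Python web crawling, the main tools for extracting data from a fetched page are lxml, the re module, and beautifulsoup4. This post starts with lxml. Install it first with: pip3 install lxml. Once the installation succeeds, it is ready to use. Fetching the page itself is done with the requests library, which is not covered in detail here. In Chrome, install the XPath Helper extension from the Chrome Web Store; once it is installed, its icon appears in the browser.

lxml is used here mainly through XPath queries. Anyone familiar with UI automation testing will recognize XPath as one of the ways to locate elements. The example in this post fetches the upcoming releases from Douban Movie (see the screenshot):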
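Open the browser's developer tools and inspect the page elements: all the movie data sits inside ul elements whose class is lists (see the screenshot).
So the first step is to select those ul elements. The code: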
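Before hitting the live site, the class-based lookup can be tried offline. The following is a minimal sketch, assuming a hand-written HTML fragment that contains two ul elements with class lists. The wrapper ids, the data-title values, and the helper name find_lists are illustrative assumptions and are not taken from the Douban page.

```python
from lxml import etree

# Hand-written fragment that mimics the structure described above.
# The ids and titles are made up for illustration only.
SAMPLE_HTML = """
<html><body>
  <div id="nowplaying">
    <ul class="lists">
      <li class="list-item" data-title="Movie A"></li>
    </ul>
  </div>
  <div id="upcoming">
    <ul class="lists">
      <li class="list-item" data-title="Movie B"></li>
      <li class="list-item" data-title="Movie C"></li>
    </ul>
  </div>
</body></html>
"""


def find_lists(html_text):
    """Return every <ul> element whose class attribute is exactly "lists"."""
    tree = etree.HTML(html_text)
    return tree.xpath('//ul[@class="lists"]')


lists = find_lists(SAMPLE_HTML)
print(len(lists))                   # number of matching <ul> elements
print([ul.tag for ul in lists])     # each match is an Element object
print([len(ul) for ul in lists])    # len() counts direct child elements
```

Note that xpath() always returns a list here, even when only one element matches, so the result has to be indexed or looped over before its children can be read.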
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from lxml import etree


def get_douban():
    # Fetch the page, sending a browser-like User-Agent header.
    r = requests.get(
        url='https://movie.douban.com/cinema/nowplaying/xian/',
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
    return r.text


def get_douban_movies():
    html = etree.HTML(get_douban(), parser=etree.HTMLParser(encoding='utf-8'))
    # Select every <ul> whose class attribute is "lists".
    uls = html.xpath('//ul[@class="lists"]')
    print(uls)


if __name__ == '__main__':
    get_douban_movies()
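Running the code gives the following result:
As the result shows, the returned list contains two elements: the first is the now-playing list and the second is the upcoming list. So we take the second element of the list, then loop over its children and serialize each one. The modified code: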
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from lxml import etree


def get_douban():
    # Fetch the page, sending a browser-like User-Agent header.
    r = requests.get(
        url='https://movie.douban.com/cinema/nowplaying/xian/',
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
    return r.text


def get_douban_movies():
    html = etree.HTML(get_douban(), parser=etree.HTMLParser(encoding='utf-8'))
    # Index 1 is the second matching <ul>, i.e. the upcoming list.
    upcoming = html.xpath('//ul[@class="lists"]')[1]
    # Iterating over an element yields its direct children, here the <li> items.
    for li in upcoming:
        print(etree.tostring(li, encoding='utf-8').decode('utf-8'))


if __name__ == '__main__':
    get_douban_movies()
The page data printed after running the code:
C:\Python37\python3.exe D:/git/GITHUB/WebCrawler/dataParsing/即将上映.py
<li id="30165034" class="list-item" data-title="昨日奇迹" data-wish="5906" data-duration="116分钟" data-region="英国" data-director="丹尼·博伊尔" data-actors="希米什·帕特尔 / 莉莉·詹姆斯 / 凯特·麦克金农" data-category="upcoming" data-enough="false" data-subject="30165034">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/30165034/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2561245949.jpg" alt="昨日奇迹" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/30165034/?from=playing_poster" target="_blank" title="昨日奇迹" data-psource="title"> 昨日奇迹 </a> </li>
    <li class="release-date"> 08月16日上映 </li>
  </ul>
</li>
<li id="33381471" class="list-item" data-title="古窑迷踪" data-wish="284" data-duration="86分钟" data-region="中国大陆" data-director="袁杰" data-actors="郭雪芙 / 罗彬 / 刘永奇" data-category="upcoming" data-enough="false" data-subject="33381471">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/33381471/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558727463.jpg" alt="古窑迷踪" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/33381471/?from=playing_poster" target="_blank" title="古窑迷踪" data-psource="title"> 古窑迷踪 </a> </li>
    <li class="release-date"> 08月16日上映 </li>
  </ul>
</li>
<li id="34482589" class="list-item" data-title="我们的四十年" data-wish="269" data-duration="94分钟" data-region="中国大陆" data-director="李易祥 鲍振江 李振伟 霍猛" data-actors="李易祥 / 鲍振江 / 金宏" data-category="upcoming" data-enough="false" data-subject="34482589">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/34482589/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2562828338.jpg" alt="我们的四十年" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/34482589/?from=playing_poster" target="_blank" title="我们的四十年" data-psource="title"> 我们的四十年 </a> </li>
    <li class="release-date"> 08月16日上映 </li>
    <li class="sbtn"> <a class="ticket-btn" href="https://movie.douban.com/ticket/redirect/?movie_id=34482589" target="_blank"> 选座购票 </a> </li>
  </ul>
</li>
<li id="33383770" class="list-item" data-title="猎袭" data-wish="77" data-duration="90分钟" data-region="中国大陆" data-director="刘艳杰" data-actors="李天烨 / 冯刚 / 谭飞燕" data-category="upcoming" data-enough="false" data-subject="33383770">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/33383770/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561001593.jpg" alt="猎袭" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/33383770/?from=playing_poster" target="_blank" title="猎袭" data-psource="title"> 猎袭 </a> </li>
    <li class="release-date"> 08月22日上映 </li>
  </ul>
</li>
<li id="27163278" class="list-item" data-title="速度与激情:特别行动" data-wish="28176" data-duration="134分钟" data-region="美国" data-director="大卫·雷奇" data-actors="道恩·强森 / 杰森·斯坦森 / 伊德里斯·艾尔巴" data-category="upcoming" data-enough="false" data-subject="27163278">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/27163278/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561542272.jpg" alt="速度与激情:特别行动" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/27163278/?from=playing_poster" target="_blank" title="速度与激情:特别行动" data-psource="title"> 速度与激情:特... </a> </li>
    <li class="release-date"> 08月23日上映 </li>
  </ul>
</li>
<li id="26331839" class="list-item" data-title="保持沉默" data-wish="20899" data-duration="96分钟" data-region="中国大陆" data-director="周可" data-actors="周迅 / 吴镇宇 / 祖峰" data-category="upcoming" data-enough="false" data-subject="26331839">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/26331839/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2558702991.jpg" alt="保持沉默" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/26331839/?from=playing_poster" target="_blank" title="保持沉默" data-psource="title"> 保持沉默 </a> </li>
    <li class="release-date"> 08月23日上映 </li>
    <li class="sbtn"> <a class="ticket-btn" href="https://movie.douban.com/ticket/redirect/?movie_id=26331839" target="_blank"> 选座购票 </a> </li>
  </ul>
</li>
<li id="30232732" class="list-item" data-title="侠路相逢" data-wish="2207" data-duration="96分钟(中国大陆)" data-region="中国大陆" data-director="邵亚峰" data-actors="姜武 / 邵兵 / 姚娆" data-category="upcoming" data-enough="false" data-subject="30232732">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/30232732/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2560507417.jpg" alt="侠路相逢" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/30232732/?from=playing_poster" target="_blank" title="侠路相逢" data-psource="title"> 侠路相逢 </a> </li>
    <li class="release-date"> 08月23日上映 </li>
  </ul>
</li>
<li id="27602052" class="list-item" data-title="呼伦贝尔城" data-wish="131" data-duration="98分钟" data-region="中国大陆" data-director="涂们" data-actors="萨仁高娃 / 阿尔德那 / 阿茹娜" data-category="upcoming" data-enough="false" data-subject="27602052">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/27602052/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2554877386.jpg" alt="呼伦贝尔城" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/27602052/?from=playing_poster" target="_blank" title="呼伦贝尔城" data-psource="title"> 呼伦贝尔城 </a> </li>
    <li class="release-date"> 08月23日上映 </li>
  </ul>
</li>
<li id="26897048" class="list-item" data-title="碧血丹砂" data-wish="110" data-duration="101分钟" data-region="中国大陆" data-director="司小冬" data-actors="张家鼎 / 刘长德 / 梦秦" data-category="upcoming" data-enough="false" data-subject="26897048">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/26897048/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2390932870.jpg" alt="碧血丹砂" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/26897048/?from=playing_poster" target="_blank" title="碧血丹砂" data-psource="title"> 碧血丹砂 </a> </li>
    <li class="release-date"> 08月23日上映 </li>
  </ul>
</li>
<li id="34658362" class="list-item" data-title="到你身边" data-wish="53" data-duration="96分钟" data-region="中国大陆" data-director="徐帆" data-actors="高利虹 / 战月源 / 陈一凡" data-category="upcoming" data-enough="false" data-subject="34658362">
  <ul class="">
    <li class="poster"> <a href="https://movie.douban.com/subject/34658362/?from=playing_poster" target="_blank" data-psource="poster"> <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2564956343.jpg" alt="到你身边" rel="nofollow" class=""/> </a> </li>
    <li class="stitle"> <a class="" href="https://movie.douban.com/subject/34658362/?from=playing_poster" target="_blank" title="到你身边" data-psource="title"> 到你身边 </a> </li>
    <li class="release-date"> 08月23日上映 </li>
  </ul>
</li>
Process finished with exit code 0
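Taking the second list by index works only as long as the page keeps its two lists in the same order. Every item in the output above carries data-category="upcoming", so a possible alternative is to filter on that attribute directly. The following is a minimal sketch against a hand-written fragment; the "nowplaying" category value, the sample titles, and the helper name upcoming_items are assumptions made for illustration and should be checked against the real page.

```python
from lxml import etree

# Hand-written fragment. Only data-category="upcoming" is taken from the
# output above; the "nowplaying" value and the titles are assumptions.
SAMPLE_HTML = """
<ul class="lists">
  <li class="list-item" data-title="Movie A" data-category="nowplaying"></li>
</ul>
<ul class="lists">
  <li class="list-item" data-title="Movie B" data-category="upcoming"></li>
  <li class="list-item" data-title="Movie C" data-category="upcoming"></li>
</ul>
"""


def upcoming_items(html_text):
    """Return the <li> items marked as upcoming, wherever they appear."""
    tree = etree.HTML(html_text)
    # Two predicates in a row must both hold for an element to match.
    return tree.xpath('//li[@class="list-item"][@data-category="upcoming"]')


items = upcoming_items(SAMPLE_HTML)
for li in items:
    # encoding='unicode' makes tostring() return str instead of bytes.
    print(etree.tostring(li, encoding='unicode').strip())
```

The trade-off is that this version depends on an attribute value instead of list position, so it keeps working if the lists are reordered but breaks if the attribute is renamed.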
The next step is to extract the details of each upcoming movie, such as its title, poster, region, director, and lead actors. The modified code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from lxml import etree


def get_douban():
    # Fetch the page, sending a browser-like User-Agent header.
    r = requests.get(
        url='https://movie.douban.com/cinema/nowplaying/xian/',
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
    return r.text


def get_douban_movies():
    movies = []
    html = etree.HTML(get_douban(), parser=etree.HTMLParser(encoding='utf-8'))
    # Index 1 is the second matching <ul>, i.e. the upcoming list.
    upcoming = html.xpath('//ul[@class="lists"]')[1]
    for li in upcoming:
        # The movie details are stored as data-* attributes on each <li>.
        title = li.xpath('@data-title')[0]
        duration = li.xpath('@data-duration')[0]
        region = li.xpath('@data-region')[0]
        director = li.xpath('@data-director')[0]
        actors = li.xpath('@data-actors')[0]
        # The poster URL is the src of the <img> nested inside the item.
        poster = li.xpath('.//img/@src')[0]
        movie = {
            'title': title,
            'duration': duration,
            'region': region,
            'director': director,
            'actors': actors,
            'poster': poster
        }
        movies.append(movie)
    for item in movies:
        print(item)


if __name__ == '__main__':
    get_douban_movies()
Running the code prints one dictionary per movie (see the screenshot of the result):
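Indexing each XPath result with [0] raises IndexError as soon as one item lacks an attribute or a poster image. A more defensive variant might look like the sketch below. It runs against a hand-written fragment with made-up values, and the helper names first and parse_movies are hypothetical; Element.get() with a default is used in place of the attribute XPath.

```python
from lxml import etree

# Hand-written fragment with made-up values. The second item deliberately
# lacks a poster and most attributes to exercise the fallbacks.
SAMPLE_HTML = """
<ul class="lists">
  <li class="list-item" data-title="Movie B" data-region="Region X"
      data-director="Director Y">
    <ul><li class="poster"><a href="#"><img src="https://example.com/b.jpg"/></a></li></ul>
  </li>
  <li class="list-item" data-title="Movie C"></li>
</ul>
"""


def first(values, default=''):
    """Return the first XPath result, or a default when the list is empty."""
    return values[0] if values else default


def parse_movies(html_text):
    """Build one dict per movie item, tolerating missing fields."""
    tree = etree.HTML(html_text)
    movies = []
    for li in tree.xpath('//ul[@class="lists"]/li[@class="list-item"]'):
        movies.append({
            # get() returns the default instead of raising when absent.
            'title': li.get('data-title', ''),
            'region': li.get('data-region', ''),
            'director': li.get('data-director', ''),
            'poster': first(li.xpath('.//img/@src')),
        })
    return movies


movies = parse_movies(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

Whether an empty string is the right fallback depends on what the data is used for; returning None or skipping incomplete items are equally reasonable choices.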