Python爬取51job招聘网Python工作薪资
python岗位的薪资情况如何?用python爬虫来收集数据
一,需求内容
51job招聘网站的python岗位的薪资情况
二,准备工具
1,urllib库
2,beautifulsoup库
#导入库
from bs4 import BeautifulSoup from urllib.request import urlopen
三,分析源码
<div class="el"> <p class="t1 "> <em class="check" name="delivery_em" onclick="checkboxClick(this)">'l</em> <input class="checkbox" type="checkbox" name="delivery_jobid" value="118428015" jt="0" style="display:none" /> <span> <a target="_blank" title="Python开发工程师实习生" href="https://jobs.51job.com/hefei-ssq/118428015.html?s=01&t=0" onmousedown=""> Python开发工程师实习生 </a> </span> </p> <span class="t2"><a target="_blank" title="合肥合和信息科技有限公司" href="https://jobs.51job.com/all/co5410918.html">合肥合和信息科技有限公司</a></span> <span class="t3">合肥-蜀山区</span> <span class="t4">4.5-6千/月</span> <span class="t5">11-10</span> </div>
从源码中可以看到所需的内在<p>标签和<span>标签中
soup = BeautifulSoup(html,"html.parser") #通过标签选择 titles=soup.select("p[class='t1'] a") salaries=soup.select("span[class='t4']") # CSS 选择器
四,输出结果
for i in range(len(titles)): print("{:30}{}".format(titles[i].get('title'),salaries[i+1].get_text()))
五,模拟请求头
header ={ "Connection": "keep-alive", "Upgrade-Insecure-Requests": "1", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36", "Accept":" text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", "Accept-Encoding": "gzip,deflate", "Accept-Language": "zh-CN,zh;q=0.8"};
六,完整代码
from bs4 import BeautifulSoup from urllib.request import urlopen header ={ "Connection": "keep-alive", "Upgrade-Insecure-Requests": "1", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36", "Accept":" text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", "Accept-Encoding": "gzip,deflate", "Accept-Language": "zh-CN,zh;q=0.8"}; html = urlopen("https://search.51job.com/list/000000,000000,0000,00,9,99,Python,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99°reefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=").read().decode('GBK') soup = BeautifulSoup(html,"html.parser") titles=soup.select("p[class='t1'] a") salaries=soup.select("span[class='t4']") # CSS 选择器 for i in range(len(titles)): print("{:30}{}".format(titles[i].get('title'),salaries[i+1].get_text()))