Scraping specific fields from a job-listing site
Example: scraping listings for "数据分析" (data analysis) positions
Requirements: Python, plus the following packages:
pip install requests
pip install lxml
pip install pandas
Step 1: Imports
import time
import requests
from lxml import etree
import pandas as pd
from pandas import DataFrame
from pandas import Series
Step 2: Find a URL that actually opens
url = 'http://search.51job.com/list/080200,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
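The long %25E6... run in the middle of the path is the keyword 数据分析, URL-encoded twice (once by the search form, and once more when it is embedded in the path). A minimal sketch of how to reproduce that segment yourself, assuming Python 3's urllib.parse; the query-string parameters are left off for brevity:

from urllib.parse import quote

keyword = quote(quote('数据分析'))   # double URL-encoding
print(keyword)                       # %25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590
url = ('http://search.51job.com/list/080200,000000,0000,00,9,99,'
       + keyword + ',2,1.html')      # append the query string from the full URL above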
Step 3: Send a request to this URL, and set the response encoding so the Chinese text decodes correctly
head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
s = requests.Session()
res = s.get(url, headers=head)
res.encoding = 'gbk'
root = etree.HTML(res.text)
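Before parsing, it is worth a quick check that the request actually succeeded and returned the results page (a simple sanity check, not part of the original):

print(res.status_code)   # expect 200
print(res.text[:300])    # peek at the raw HTML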
Step 4: Inspect the original page to locate the fields you need
Extract the data:
position = root.xpath('//div[@class="el"]/p/span/a/@title')
company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
job = DataFrame([position,company,place,salary,date]).T
job.columns = ['position','company','place','salary','date']
job['page'] = 1
job.head()
Test run
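One thing to watch during the test run: DataFrame([...]).T pads shorter rows with NaN, so if the five XPath queries return different numbers of matches, the columns will silently misalign. A quick length check, using the variables defined above:

for name, col in [('position', position), ('company', company),
                  ('place', place), ('salary', salary), ('date', date)]:
    print(name, len(col))   # all five counts should match (one per listing)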
Wrap it in a function (the key is finding the pattern by which the URL increments from page to page)
def Crawler_51job(df_origin, n):
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'}
    s = requests.Session()
    for i in range(1, n + 1):
        # curr_page is the parameter that increments from page to page. Note the
        # keyword here, %E4%BF%A1%E7%94%A8%E7%AE%A1%E7%90%86, decodes to 信用管理
        # (credit management); substitute your own URL-encoded keyword as needed.
        url = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=000000%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=%E4%BF%A1%E7%94%A8%E7%AE%A1%E7%90%86&keywordtype=2&curr_page=' + str(i) + '&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9'
        res = s.get(url, headers=head)
        res.encoding = 'gbk'
        root = etree.HTML(res.text)
        position = root.xpath('//div[@class="el"]/p/span/a/@title')
        company = root.xpath('//div[@class="el"]/span[@class="t2"]/a/@title')
        place = root.xpath('//div[@class="el"]/span[@class="t3"]/text()')
        salary = root.xpath('//div[@class="el"]/span[@class="t4"]/text()')
        date = root.xpath('//div[@class="el"]/span[@class="t5"]/text()')
        df = DataFrame([position, company, place, salary, date]).T
        df.columns = ['position', 'company', 'place', 'salary', 'date']
        df['page'] = i
        time.sleep(2)   # pause between requests to avoid hammering the server
        df_origin = pd.concat([df_origin, df])
    return df_origin
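One caveat: a single failed request (timeout, connection reset) raises inside the loop and loses everything scraped so far. A hedged hardening sketch; fetch is a hypothetical helper, not part of the original code:

import requests

def fetch(session, url, headers, retries=3):
    # hypothetical helper: retry a GET a few times before giving up
    for attempt in range(retries):
        try:
            res = session.get(url, headers=headers, timeout=10)
            res.raise_for_status()
            return res
        except requests.RequestException as e:
            print('attempt', attempt + 1, 'failed:', e)
    return None

Inside Crawler_51job you would then write res = fetch(s, url, head) and continue to the next page whenever it returns None.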
Call the function and write out the data
job = Crawler_51job(job, 17)
job.to_csv('51job.csv', index=False)
Result (the file looks garbled in Excel, so check it in Notepad instead)
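If you would rather have Excel open the file cleanly, writing a UTF-8 byte-order mark usually fixes the garbled display; the encoding argument below is my addition, not in the original:

job.to_csv('51job.csv', index=False, encoding='utf-8-sig')   # BOM lets Excel detect UTF-8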
And that completes a small crawler program.