Scraping job postings to find out which kind of Python programmer earns the most
The text and images in this article come from the internet and are for learning and exchange only, with no commercial use; copyright belongs to the original authors. If there is any problem, please contact us promptly so we can handle it.
This article comes from Tencent Cloud. Author: Python进阶者
Taking these 10 positions — Python crawler, data analysis, backend, data mining, full-stack development, ops development, senior development engineer, big data, machine learning, and architect — this article scrapes the corresponding job postings and requirements from Lagou, then uses data analysis and visualization to show each position's average salary and its education and work-experience requirements.
This is a consolidated version of two earlier articles (Python position analysis, parts 1 and 2), typeset by CSDN. The piece has picked up readers again these past few days (and it is genuinely decent — writing it took several days at the time), so I am republishing it for new readers. It is long; read patiently.
Crawler preparation
1. First, fetch the salary, education, and work-experience requirements
Lagou loads its listings dynamically, so we need to analyze the page first. The approach:
Open F12 (developer tools) and locate where the page data is stored.
We find that the page content is fetched via a POST request that returns JSON, so we can take the JSON directly.
We only need the salary, education, work experience, and the individual posting. In the returned JSON dictionary these map to positionId, salary, education, and workYear (positionId is the ID of the posting's detail page). The code:
Saving to file:
```python
import os
import csv
import pandas as pd

def file_do(list_info):
    # Check the file size
    file_size = os.path.getsize(r'G:\lagou_anv.csv')
    if file_size == 0:
        # Header row
        name = ['ID', '薪资', '学历要求', '工作经验']
        # Build a DataFrame and write it out
        file_test = pd.DataFrame(columns=name, data=list_info)
        file_test.to_csv(r'G:\lagou_anv.csv', encoding='gbk', index=False)
    else:
        with open(r'G:\lagou_anv.csv', 'a+', newline='') as file_test:
            # Append to the end of the file
            writer = csv.writer(file_test)
            writer.writerows(list_info)
```
Fetching the basic data:
```python
import time
import requests
from fake_useragent import UserAgent

# 1. POST request URL
req_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
# 2. Request headers
headers = {
    'Accept': 'application/json,text/javascript,*/*;q=0.01',
    'Connection': 'keep-alive',
    'Cookie': 'your own Cookie value here, required',
    'Host': 'www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
    'User-Agent': str(UserAgent().random),
}

def get_info(headers):
    # 3. Loop over all 30 result pages
    for i in range(1, 31):
        data = {
            'first': 'true',
            'kd': 'Python爬虫',
            'pn': i
        }
        # 3.1 Send the POST request
        req_result = requests.post(req_url, data=data, headers=headers)
        req_result.encoding = 'utf-8'
        print("Page %d: " % i + str(req_result.status_code))
        # 3.2 Parse the JSON and locate the data we need
        req_info = req_result.json()
        req_info = req_info['content']['positionResult']['result']
        print(len(req_info))
        list_info = []
        # 3.3 Pull out the specific fields
        for j in range(0, len(req_info)):
            salary = req_info[j]['salary']
            education = req_info[j]['education']
            workYear = req_info[j]['workYear']
            positionId = req_info[j]['positionId']
            list_one = [positionId, salary, education, workYear]
            list_info.append(list_one)
        print(list_info)
        # Save to file
        file_do(list_info)
        time.sleep(1.5)
```
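The field extraction in step 3.3 can be checked offline. Here `sample` is a hypothetical payload shaped like Lagou's JSON (`content` → `positionResult` → `result`); the real response carries many more keys per posting.

```python
# A minimal offline sketch of step 3.3; `sample` is made-up data
# mimicking the nested structure of the real response.
sample = {
    "content": {
        "positionResult": {
            "result": [
                {"positionId": 4924053, "salary": "15k-25k",
                 "education": "本科", "workYear": "3-5年"},
            ]
        }
    }
}

def extract_rows(payload):
    # Keep only [positionId, salary, education, workYear] per posting
    rows = []
    for job in payload["content"]["positionResult"]["result"]:
        rows.append([job["positionId"], job["salary"],
                     job["education"], job["workYear"]])
    return rows

print(extract_rows(sample))
```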
Run result:
2. Use the obtained positionId to visit each posting's detail page
Rebuild the detail-page link from the positionId:
```python
import csv

position_url = []

def read_csv():
    # Read the saved CSV
    with open(r'G:\lagou_anv.csv', 'r', newline='') as file_test:
        reader = csv.reader(file_test)
        i = 0
        for row in reader:
            if i != 0:
                # Rebuild the detail-page link from the positionId
                url_single = "https://www.lagou.com/jobs/%s.html" % row[0]
                position_url.append(url_single)
            i = i + 1
    print('Total: ' + str(i - 1))
    print(position_url)
```
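The link reconstruction itself is a one-liner and easy to verify in isolation; the format string is the same one used in `read_csv`, and the positionId below is made up for illustration.

```python
def build_position_url(position_id):
    # Rebuild a Lagou job-detail URL from a positionId
    return "https://www.lagou.com/jobs/%s.html" % position_id

print(build_position_url(4924053))  # hypothetical positionId
```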
Visit each detail page, grab the job description (duties and requirements), and clean the data:
```python
import re
import time

# get_response, write_file, write_file2 and position_urls are the
# author's helpers defined elsewhere in the project.
def get_info():
    for position_url in position_urls:
        work_duty = ''
        work_requirement = ''
        response00 = get_response(position_url, headers=headers)
        time.sleep(1)
        content = response00.xpath('//*[@id="job_detail"]/dd[2]/div/p/text()')
        # Data cleaning: collect the duty lines
        j = 0
        for i in range(len(content)):
            content[i] = content[i].replace('\xa0', ' ')
            if content[i][0].isdigit():
                if j == 0:
                    content[i] = content[i][2:].replace('、', ' ')
                    content[i] = re.sub('[;;.0-9。]', '', content[i])
                    work_duty = work_duty + content[i] + '/'
                    j = j + 1
                elif content[i][0] == '1' and not content[i][1].isdigit():
                    break
                else:
                    content[i] = content[i][2:].replace('、', ' ')
                    content[i] = re.sub('[、;;.0-9。]', '', content[i])
                    work_duty = work_duty + content[i] + '/'
        m = i
        # Job duties
        write_file(work_duty)
        print(work_duty)
        # Data cleaning: collect the requirement lines
        j = 0
        for i in range(m, len(content)):
            content[i] = content[i].replace('\xa0', ' ')
            if content[i][0].isdigit():
                if j == 0:
                    content[i] = content[i][2:].replace('、', ' ')
                    content[i] = re.sub('[、;;.0-9。]', '', content[i])
                    work_requirement = work_requirement + content[i] + '/'
                    j = j + 1
                elif content[i][0] == '1' and not content[i][1].isdigit():
                    # Stop at the start of the next numbered block
                    break
                else:
                    content[i] = content[i][2:].replace('、', ' ')
                    content[i] = re.sub('[、;;.0-9。]', '', content[i])
                    work_requirement = work_requirement + content[i] + '/'
        # Job requirements
        write_file2(work_requirement)
        print(work_requirement)
        print("-----------------------------")
```
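The core cleaning step above drops the leading list number ('1、', '2、', …) and then strips punctuation and digits with re.sub. In isolation, with a made-up duty line:

```python
import re

line = "2、负责数据抓取;"  # a made-up job-duty line for illustration
# Drop the '2、' prefix, then strip punctuation and digits
cleaned = re.sub('[、;;.0-9。]', '', line[2:])
print(cleaned)
```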
Run result:
3. Four chart types for visualization, plus the data-cleaning approach
Treemap:
```python
# 1. Treemap of education requirements
from pyecharts import TreeMap

education_table = {}
for x in education:
    education_table[x] = education.count(x)
key = []
values = []
for k, v in education_table.items():
    key.append(k)
    values.append(v)
data = []
for i in range(len(key)):
    dict_01 = {"value": 40, "name": "我是A"}
    dict_01["value"] = values[i]
    dict_01["name"] = key[i]
    data.append(dict_01)
tree_map = TreeMap("矩形树图", width=1200, height=600)
tree_map.add("学历要求", data, is_label_show=True, label_pos='inside')
```
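As a side note, the manual frequency table above (a dict plus list.count, which rescans the list for every element) can be built in one pass with collections.Counter. The `education` list here is sample stand-in data, not the crawled column:

```python
from collections import Counter

education = ["本科", "本科", "大专", "不限", "本科"]  # sample stand-in data

# One-pass frequency count, then the name/value dicts the treemap expects
counts = Counter(education)
data = [{"name": k, "value": v} for k, v in counts.items()]
print(data)
```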
Rose (Nightingale) pie chart:
```python
# 2. Rose pie chart of salaries
import re
import math
from pyecharts import Pie

def assort_salary(str_01):
    '''
    Salary classification
    parameter: str_01 -- original string, e.g. '20k-30k'
    returns: (a0+b0)/2 -- parsed midpoint as a number, e.g. 25.0
    '''
    reg_str01 = r"(\d+)"
    res_01 = re.findall(reg_str01, str_01)
    if len(res_01) == 2:
        a0 = int(res_01[0])
        b0 = int(res_01[1])
    else:
        a0 = int(res_01[0])
        b0 = int(res_01[0])
    return (a0 + b0) / 2

salary_table = {}
for x in salary:
    salary_table[x] = salary.count(x)
key = ['5k以下', '5k-10k', '10k-20k', '20k-30k', '30k-40k', '40k以上']
a0, b0, c0, d0, e0, f0 = [0, 0, 0, 0, 0, 0]
for k, v in salary_table.items():
    ave_salary = math.ceil(assort_salary(k))
    print(ave_salary)
    if ave_salary < 5:
        a0 = a0 + v
    elif ave_salary in range(5, 10):
        b0 = b0 + v
    elif ave_salary in range(10, 20):
        c0 = c0 + v
    elif ave_salary in range(20, 30):
        d0 = d0 + v
    elif ave_salary in range(30, 40):
        e0 = e0 + v
    else:
        f0 = f0 + v
values = [a0, b0, c0, d0, e0, f0]
pie = Pie("薪资玫瑰图", title_pos='center', width=900)
pie.add("salary", key, values, center=[40, 50], is_random=True,
        radius=[30, 75], rosetype="area", is_legend_show=False,
        is_label_show=True)
```
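The midpoint logic can be sketched and checked on its own; this standalone version mirrors assort_salary, including the single-number fallback for strings like '15k以上':

```python
import re
import math

def salary_midpoint(s):
    # '20k-30k' -> 25.0; single-number strings like '15k以上' -> 15.0
    nums = [int(n) for n in re.findall(r"\d+", s)]
    if len(nums) == 1:
        nums.append(nums[0])
    return (nums[0] + nums[1]) / 2

print(salary_midpoint("20k-30k"))             # 25.0
print(math.ceil(salary_midpoint("10k-15k")))  # 13
```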
Plain bar chart:
```python
# 3. Bar chart of work-experience requirements
from pyecharts import Bar

workYear_table = {}
for x in workYear:
    workYear_table[x] = workYear.count(x)
key = []
values = []
for k, v in workYear_table.items():
    key.append(k)
    values.append(v)
bar = Bar("柱状图")
bar.add("workYear", key, values, is_stack=True, center=(40, 60))
```
Word cloud:
```python
import re
import jieba
import numpy
import pandas as pd
from pyecharts import WordCloud

stopwords_path = r'H:\PyCoding\Lagou_analysis\stopwords.txt'

def read_txt():
    with open(r'G:\lagou\Content\ywkf_requirement.txt', encoding='gbk') as file:
        text = file.read()
    content = text
    # Strip punctuation and line breaks
    content = re.sub('[,,。. \r\n]', '', content)
    segment = jieba.lcut(content)
    words_df = pd.DataFrame({'segment': segment})
    # quoting=3 means nothing in stopwords.txt is treated as quoted
    stopwords = pd.read_csv(stopwords_path, index_col=False, quoting=3,
                            sep="\t", names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    # Count occurrences per word (named aggregation; the old
    # dict-style agg({"计数": numpy.size}) was removed in newer pandas)
    words_stat = words_df.groupby(by=['segment'])['segment'].agg(计数=numpy.size)
    words_stat = words_stat.reset_index().sort_values(by=["计数"], ascending=False)
    test = words_stat.head(200).values
    codes = [test[i][0] for i in range(0, len(test))]
    counts = [test[i][1] for i in range(0, len(test))]
    wordcloud = WordCloud(width=1300, height=620)
    wordcloud.add("必须技能", codes, counts, word_size_range=[20, 100])
    wordcloud.render(r'H:\PyCoding\Lagou_analysis\cloud_pit\ywkf_bxjn.html')
```
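The groupby counting can equally be done with collections.Counter once segmentation and stopword filtering are finished; `words` below is a hypothetical token list standing in for the jieba output:

```python
from collections import Counter

words = ["python", "爬虫", "python", "数据", "python"]  # sample tokens

# Top-200 (word, count) pairs, most frequent first
top = Counter(words).most_common(200)
codes = [w for w, _ in top]
counts = [c for _, c in top]
print(codes[0], counts[0])
```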
[Charts] Python crawler positions: education requirements, monthly salary, work-experience requirements
[Charts] Python data analysis positions: education requirements, monthly salary, work-experience requirements