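Lagou Crawler -- To Be Corrected
I came across an article on the CSDN WeChat official account about scraping and analyzing Lagou job postings, thought it was very good, and decided to copy it myself. But I ran into a lot of errors the article never mentioned. One false step really does bring lasting regret!
First, the code: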
import requests
from fake_useragent import UserAgent

url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
headers = {
    'Accept': 'application/json,text/javascript,*/*;q=0.01',
    'Connection': 'keep-alive',
    'Cookie': 'user_trace_token=20190219170421-589c51fd-3425-11e9-94ca-525400f775ce; LGUID=20190219170421-589c556f-3425-11e9-94ca-525400f775ce; JSESSIONID=ABAAABAAAIAACBI83910AF8CFDCD43C502B73B369BC11AE; PRE_UTM=; PRE_HOST=www.so.com; PRE_SITE=http%3A%2F%2Fwww.so.com%2Flink%3Fm%3Dah6rTRiEAqghnfjOchMrldC9g09Z6O4EM8yoD1U73IL58lzzlfsBR1G3ekEi1hDYMb8HzC5keoRl8AIGdSPOI6dMEYY3t8OajAwFORdOWma8%253D; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; sajssdk_2015_cross_new_user=1; ab_test_random_num=0; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216905871fb4c9-0ba623cd0a19b4-5d4e211f-921600-16905871fba156%22%2C%22%24device_id%22%3A%2216905871fb4c9-0ba623cd0a19b4-5d4e211f-921600-16905871fba156%22%7D; _putrc=C4C6A5FE2C61AA92123F89F2B170EADC; login=true; hasDeliver=0; gate_login_token=dae438a7f0f2180190e414e0d39025bcfe698c75269decc65dbe3ad7b5d645a8; unick=%E5%8F%B6%E7%90%86%E4%BD%A9; _gid=GA1.2.689310331.1550567063; _gat=1; _ga=GA1.2.149442038.1550567063; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1550567064,1550575932; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1550576293; LGSID=20190219193211-ff3ee358-3439-11e9-826e-5254005c3644; LGRID=20190219193811-d650b01a-343a-11e9-826e-5254005c3644; TG-TRACK-CODE=index_search; SEARCH_ID=11dee42b2b7a4d68bd2b4434a28a282c; index_location_city=%E5%B9%BF%E5%B7%9E',
    'Host': 'www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=sug&fromSearch=true&suginput=p',
    'User-Agent': str(UserAgent().random),
    'X-Requested-With': 'XMLHttpRequest'
}
proxies = {'https': '49.86.183.149:9999'}
# rsp=requests.post(url=url,proxies=proxies)
rsp = requests.request("post", url=url, proxies=proxies, headers=headers, timeout=10)
print(rsp.content.decode())
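Scraping job postings from Lagou takes a fair amount of analysis, but plenty of write-ups already exist online, so I won't repeat them here. In short: log in, search for python positions, find the URL that carries the job posting data, and then look in the JSON response under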
['content']['positionResult']['result']
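As a rough illustration of walking that key path, here is a minimal sketch. Only the ['content']['positionResult']['result'] path comes from the analysis above; the sample payload and the field names inside each entry (positionName, salary) are assumptions made up for illustration, not captured from a real response.

```python
def extract_positions(payload):
    """Return the list stored at ['content']['positionResult']['result'].

    Returns an empty list when any key along the path is missing, for
    example when the server replies with an error message instead of data.
    """
    node = payload
    for key in ("content", "positionResult", "result"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Hypothetical payload shaped like the path above; field names are assumed.
sample = {
    "content": {
        "positionResult": {
            "result": [
                {"positionName": "Python Developer", "salary": "10k-20k"},
                {"positionName": "Data Analyst", "salary": "8k-15k"},
            ]
        }
    }
}

for job in extract_positions(sample):
    print(job["positionName"], job["salary"])
```

With a real response, the same function could be fed rsp.json() in place of the sample dict.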
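Now for the errors I ran into:
1. Error 10060 or 10061 (the Windows socket errors for a connection timeout and a refused connection): at first I did not use a proxy IP, so Lagou identified me as a crawler and banned my IP. After that, every further request was "actively refused"!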
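2. When using a proxy IP, note that Lagou uses the https protocol, so the key in proxies must be 'https', and the proxy itself must also support https. Check this carefully on the proxy listing site, and double-check the port number as well. Also pick one of the "green" IPs on the listing, otherwise you will get errors too.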
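3. While fixing these errors I also consolidated what I knew and learned how to send a POST request with requests. There are two ways: requests.request("post", url=url, proxies=proxies) or requests.post(url, proxies=proxies)
4. I also learned about a third-party library: fake_useragent generates User-Agent strings. You can ask for a specific browser type or have it produce a random User-Agent.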
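Point 2 can be sketched as a small helper that keys the proxy entry on the scheme of the target URL, so an https URL always gets an 'https' entry. The function name build_proxies is hypothetical, and the proxy address is simply the one from the code above, which may well no longer work.

```python
from urllib.parse import urlsplit


def build_proxies(target_url, proxy_host, proxy_port):
    """Build a requests-style proxies dict keyed by the target URL's scheme."""
    scheme = urlsplit(target_url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError("unsupported URL scheme: %r" % scheme)
    port = int(proxy_port)
    if not 0 < port < 65536:
        raise ValueError("port out of range: %r" % proxy_port)
    return {scheme: "%s:%d" % (proxy_host, port)}


url = "https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false"
proxies = build_proxies(url, "49.86.183.149", 9999)
print(proxies)
```

The resulting dict can be passed as proxies= to either requests.post(...) or requests.request("post", ...); both send the same POST request.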