Introduction to Python Web Crawlers
一、What is a web crawler?
A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.
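At its core that "fetch a page, extract its links, follow them by some rule" loop is all a crawler is. A toy sketch with no network at all — the PAGES dict stands in for the web and its link structure (all names here are made up for illustration):

```python
# Toy crawler: breadth-first "fetch, extract links, repeat" over a fake web.
PAGES = {
    '/a': ['/b', '/c'],   # page /a links to /b and /c
    '/b': ['/c'],
    '/c': [],
}

def crawl(start):
    seen, queue, order = set(), [start], []
    while queue:
        url = queue.pop(0)
        if url in seen:            # the "rule": visit each page once
            continue
        seen.add(url)
        order.append(url)
        queue.extend(PAGES[url])   # stands in for downloading + link extraction
    return order

print(crawl('/a'))   # ['/a', '/b', '/c']
```

A real crawler replaces the dict lookup with an HTTP GET and the link list with HTML parsing — which is exactly what requests and BeautifulSoup4 below provide.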
二、Python web crawling
Two third-party packages are needed: requests and BeautifulSoup4.
pip install requests
pip install BeautifulSoup4
Summary of common methods:

response = requests.get('URL')     # fetch a page
response.text                      # body as text (a str)
response.content                   # body as raw bytes (e.g. for images)
response.encoding                  # the encoding used to decode .text (settable)
response.apparent_encoding         # encoding detected from the downloaded bytes
response.status_code               # HTTP status code
response.cookies.get_dict()        # cookies as a dict
requests.get('http://www.autohome.com.cn/news/', cookies={'xx': 'xxx'})
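The text/content/encoding distinction can be shown without touching the network by filling in a Response object by hand (`_content` is an internal requests field, set here only for this offline demo; `requests.get()` normally fills all of these itself):

```python
import requests

# Hand-built Response, no network: just to show the attribute semantics.
r = requests.models.Response()
r.status_code = 200
r._content = '汽车之家'.encode('utf-8')   # what .content holds: raw bytes
r.encoding = 'utf-8'                      # how .text should be decoded

print(r.status_code)   # 200
print(r.content)       # the raw bytes
print(r.text)          # '汽车之家'
```

When a server mislabels its charset, setting `response.encoding = response.apparent_encoding` re-decodes `.text` using the encoding detected from the bytes — the trick used in the demo below.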
The beautifulsoup4 module
soup = BeautifulSoup('htmlstr', features='html.parser')
v1 = soup.find('div')              # first <div> tag
v1 = soup.find(id='i1')            # first tag with id="i1"
v1 = soup.find('div', id='i1')     # first <div> with id="i1"
v2 = soup.find_all('div')          # every <div> tag
v2 = soup.find_all(id='i1')
v2 = soup.find_all('div', id='i1')
v1.text         # text content (a str)
v1.attrs        # attribute dict
v2[0].attrs     # v2 is a list, so index into it first
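A small runnable version of the summary above, against an inline HTML string (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div id="i1"><a href="/news/1.html">title</a></div><div>other</div>'
soup = BeautifulSoup(html, features='html.parser')

v1 = soup.find('div', id='i1')          # first <div> with id="i1"
v2 = soup.find_all('div')               # both <div> tags, as a list

print(v1.attrs)                         # {'id': 'i1'}
print(v1.find('a').attrs.get('href'))   # /news/1.html
print(len(v2))                          # 2
print(v2[1].text)                       # other
```

Note that `find` returns a single tag (or None), while `find_all` always returns a list — mixing the two up is a common source of `AttributeError`.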
三、A first demo
import requests
import uuid
from bs4 import BeautifulSoup

response = requests.get(url='https://www.autohome.com.cn/news/')   # download the page
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, features='html.parser')        # build the BeautifulSoup object
target = soup.find(id='auto-channel-lazyload-article')             # locate the news column

li_list = target.find_all('li')
for i in li_list:
    a = i.find('a')
    if a:                                    # skip <li> items that carry no link
        print(a.attrs.get('href'))
        txt = a.find('h3').text              # headline text
        img_url = a.find('img').attrs.get('src')
        print(img_url)
        img_response = requests.get(url='https:' + img_url)
        file_name = str(uuid.uuid4()) + '.jpg'
        with open(file_name, 'wb') as f:     # save the image bytes
            f.write(img_response.content)
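The demo depends on autohome's live markup, which may have changed since. The same find/find_all pattern can be exercised offline against a snippet shaped like that news list (the markup here is a made-up stand-in, not the real page):

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded news page (structure invented for illustration).
html = '''
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a href="//www.autohome.com.cn/news/1.html">
      <h3>headline one</h3><img src="//img.autoimg.cn/1.jpg"></a></li>
    <li><p>an item with no link</p></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, features='html.parser')
target = soup.find(id='auto-channel-lazyload-article')

results = []
for li in target.find_all('li'):
    a = li.find('a')
    if a:   # the second <li> has no <a>, so it is skipped
        results.append((a.attrs.get('href'),
                        a.find('h3').text,
                        a.find('img').attrs.get('src')))
print(results)
```

The `if a:` guard is doing real work: list pages routinely contain ad or placeholder `<li>` items, and calling `.find('h3')` on None would crash the loop.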
四、Chouti: log in and upvote
'''
Chouti's little trick: the cookie that gets authorized is not the one returned
by the username/password login, but the one returned by the very first GET; the
login request carries that first cookie along, and the server authorizes it.
'''
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
post_data = {
    'phone': '8615191481351',
    'password': '11111111',
    'oneMonth': 1
}

# 1. first GET: this is the cookie the server will later authorize
ret1 = requests.get(
    url='https://dig.chouti.com',
    headers=headers
)
cookie1 = ret1.cookies.get_dict()
print(cookie1)

# 2. log in, sending the first cookie along so it gets authorized
ret2 = requests.post(
    url='https://dig.chouti.com/login',
    data=post_data,
    headers=headers,
    cookies=cookie1
)
cookie2 = ret2.cookies.get_dict()
print(cookie2)

# 3. upvote a link using the now-authorized gpsd cookie
ret3 = requests.post(
    url='https://dig.chouti.com/link/vote?linksId=21910661',
    cookies={
        'gpsd': cookie1['gpsd']
        # 'gpsd': 'f59363bb59b30fe7126b38756c6e5680'
    },
    headers=headers
)
print(ret3.text)

# 4. cancel the vote
ret = requests.post(
    url='https://dig.chouti.com/vote/cancel/vote.do',
    cookies={'gpsd': cookie1['gpsd']},
    data={'linksId': 21910661},
    headers=headers
)
print(ret.text)
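Passing `cookies=` by hand at every step works, but `requests.Session` keeps one cookie jar across requests, so whatever the first GET sets is re-sent automatically on the later POSTs. A minimal offline sketch of the mechanism (the cookie value is a placeholder, and the commented calls assume the same Chouti endpoints as above):

```python
import requests

s = requests.Session()
# Simulate the Set-Cookie from the first GET (placeholder value, no network):
s.cookies.set('gpsd', 'xxx')
print(s.cookies.get_dict())   # {'gpsd': 'xxx'}

# A real flow would then simply be:
#   s.get('https://dig.chouti.com', headers=headers)
#   s.post('https://dig.chouti.com/login', data=post_data, headers=headers)
#   s.post('https://dig.chouti.com/link/vote?linksId=21910661', headers=headers)
# with the gpsd cookie carried along by the session automatically.
```

One caveat: a Session re-sends every cookie it has collected, so the manual `cookies={'gpsd': ...}` form above is still useful when you want to send exactly one cookie and nothing else.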
More on requests parameters: http://www.cnblogs.com/wupeiqi/articles/6283017.html