Python Web Scraping Cheat Sheet
I can't even remember how long it has been since my last post. With graduation approaching, the thesis and the internship have arrived together, so I will probably publish fewer programming articles from now on and focus more on network security. The Windows API articles will gradually be removed from my blog; going forward I will publish, at irregular intervals, Python articles on tools such as webdirscan and weblogon_brust, plus binary and mobile-app topics. Looking back on the road so far, I have learned a great deal and met many experts I could learn from, and I still believe that steady, down-to-earth progress is the most reliable path. I will also keep updating my open-source webdirscan tool and release some small information-gathering utilities.
Scraper environment setup
selenium
Description: simulates browser access, with a GUI
Install: pip3 install selenium
Basic usage:
import selenium
from selenium import webdriver
driver=webdriver.Chrome()
chromedriver
Description: the WebDriver binary for Google Chrome
Install: download the build matching your Chrome version from https://chromedriver.chromium.org/ and put it on your PATH
Basic usage:
import selenium
from selenium import webdriver
driver=webdriver.Chrome()
phantomjs
Description: simulates browser access, headless (no GUI)
Install: download from https://phantomjs.org/download
Configuration:
export PATH=${PATH}:/root/phantomjs/bin
export OPENSSL_CONF=/etc/ssl/
beautifulsoup4
Description: HTML/XML parsing library
Install: pip3 install beautifulsoup4
Usage:
from bs4 import BeautifulSoup
soup=BeautifulSoup('<html></html>','lxml')
pyquery
Description: jQuery-like HTML parsing library
Install: pip3 install pyquery
Usage:
from pyquery import PyQuery as pq
doc=pq('<html>HELLO</html>')
result=doc('html').text()
print(result)
pymysql
Description: MySQL client
Install: pip3 install pymysql
Usage:
import pymysql
conn=pymysql.connect(host='127.0.0.1',user='root',password='root',port=3306,db='mysql')
cursor=conn.cursor()
cursor.execute('select * from db')
cursor.fetchone()
If you hit a MySQL authentication error, run:
update mysql.user set authentication_string=PASSWORD('root'), plugin='mysql_native_password' where user='root';
pymongo
Description: MongoDB client
Install: pip3 install pymongo
Usage:
import pymongo
client=pymongo.MongoClient('localhost')
db=client['newtestdb']
db['table'].insert_one({'name':'Bob'})
db['table'].find_one({'name':'Bob'})
pyredis
Description: Redis client
Install: pip3 install redis
Usage:
import redis
r=redis.Redis('localhost',6379)
r.set('name','Bob')
r.get('name')
flask (proxy)
Description: lightweight web framework, used here for proxy-pool style services
Install: pip3 install flask
Usage:
import flask
django
Description: full-featured web framework
Install: pip3 install django
Usage:
import django
jupyter
Description: notebook environment with in-browser Markdown editing and live code execution
Install: pip3 install jupyter
Usage:
jupyter notebook
All-in-one install
pip3 install requests selenium beautifulsoup4 pyquery pymysql pymongo redis flask django jupyter
What is a web scraper?
A simple request
import requests
response=requests.get('https://www.baidu.com')
#response.encoding='utf8'
print(response.text)
print(response.headers)
print(response.status_code)
headers={'User-Agent':"**********"}
response=requests.get('https://www.baidu.com',headers=headers)
print(response.status_code)
# download an image
response=requests.get('https://www.baidu.com/gif.ico')
print(response.content)
with open('/var/tmp/1.ico','wb') as f:
    f.write(response.content)
JavaScript rendering
Options: selenium/WebDriver, Splash, PyV8, Ghost.py
from selenium import webdriver
driver=webdriver.Chrome()
driver.get('http://m.weibo.com')
print(driver.page_source)
The urllib library
What is urllib?
Built-in request modules:
urllib.request  request module
urllib.error  exception module
urllib.parse  URL parsing module
urllib.robotparser  robots.txt parsing module
Changes from Python 2:
python2:
import urllib2
response=urllib2.urlopen('https://www.baidu.com')
python3:
import urllib.request
response=urllib.request.urlopen('https://www.baidu.com')
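Apart from urllib.request, the urllib.parse module listed above can be exercised entirely offline; a minimal sketch (the URL is just an example):

```python
from urllib.parse import urlparse, urlencode

# Split a URL into named components
parts = urlparse('https://www.baidu.com/s?wd=python')
print(parts.scheme)  # https
print(parts.netloc)  # www.baidu.com
print(parts.query)   # wd=python

# Build a query string from a dict
query = urlencode({'wd': 'python', 'pn': 10})
print(query)  # wd=python&pn=10
```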
The requests library
import requests
response=requests.get('https://www.baidu.com')
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)
import requests
response=requests.get('http://www.baidu.com?id=1')
print(response.text)
Query parameters:
import requests
data={
'name':'germay',
'age':22
}
response=requests.get('url',params=data)
print(response.text)
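requests serializes the params dict into the query string for you; the standard library's urllib.parse.urlencode performs essentially the same encoding, so the final URL can be previewed without making a request:

```python
from urllib.parse import urlencode

data = {
    'name': 'germay',
    'age': 22
}
# Same query string requests would append after '?'
query = urlencode(data)
print(query)  # name=germay&age=22
```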
JSON:
import requests
import json
response=requests.get('url')
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))
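response.json() is essentially a shortcut for json.loads(response.text); the equivalence can be checked offline with a literal JSON string standing in for the response body (the payload below is invented for illustration):

```python
import json

text = '{"name": "germay", "age": 22}'  # stands in for response.text
data = json.loads(text)
print(type(data))    # <class 'dict'>
print(data['name'])  # germay
```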
Binary content:
import requests
response=requests.get('url/img.ico')
print(response.text)
print(response.content)
with open('a.ico','wb') as f:
    f.write(response.content)
headers:
import requests
headers={
'User-Agent':'Mozilla/5.0'
}
response=requests.get('URL/explore',headers=headers)
print(response.text)
POST requests:
import requests
data={'name':'asd','age':22}
response=requests.post('http://www',data=data)
print(response.text)
import requests
data={'name':'asd','age':22}
headers={
'User-Agent':'asdasdasd'
}
response=requests.post('URL',data=data,headers=headers)
print(response.json())
Response attributes:
import requests
response=requests.get('URL')
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.url)
print(response.history)
Status code checks
import requests
response=requests.get('URL')
exit() if not response.status_code==requests.codes.ok else print('ok')
#exit() if not response.status_code==200 else print('ok')
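requests.codes.ok is simply the integer 200; the standard library's http.HTTPStatus enumerates the same status codes if you want readable names without requests:

```python
from http import HTTPStatus

# HTTPStatus members compare equal to their integer codes
print(HTTPStatus.OK == 200)       # True
print(int(HTTPStatus.NOT_FOUND))  # 404
print(HTTPStatus.OK.phrase)       # OK
```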
Advanced usage
File upload:
import requests
files={'file':open('img.jpg','rb')}
response=requests.post('URL',files=files)
print(response.text)
Getting cookies
import requests
response=requests.get('URL')
print(response.cookies)
for key,value in response.cookies.items():
    print(key+'='+value)
Session persistence:
import requests
s=requests.Session()
s.get('set cookie URL')
response=s.get('get cookie url')
print(response.text)
Certificate verification
import requests
response=requests.get('URL')
print(response.status_code)
import requests
from requests.packages import urllib3
urllib3.disable_warnings()
response=requests.get('URL',verify=False)  # skip certificate verification
print(response.status_code)
import requests
response=requests.get('URL',cert=('/path/server.crt','/path/server.key'))
print(response.status_code)
Proxy settings:
import requests
proxies={
'http':'http://127.0.0.1:1080',
'https':'https://127.0.0.1:1080'
}
response=requests.get('URL',proxies=proxies)
print(response.status_code)
Install: pip install 'requests[socks]'
import requests
proxies={
'http':'socks5://127.0.0.1:1080',
'https':'socks5://127.0.0.1:1080'
}
response=requests.get('URL',proxies=proxies)
print(response.status_code)
Timeout settings:
import requests
response=requests.get('https://www.baidu.com',timeout=1)
Authentication (HTTP basic auth):
import requests
from requests.auth import HTTPBasicAuth
r=requests.get('http://127.0.0.1:9090',auth=HTTPBasicAuth('user','123'))
print(r.status_code)
import requests
r=requests.get('http://127.0.0.1:9090',auth=('user','123'))
print(r.status_code)
Exception handling:
import requests
from requests.exceptions import ReadTimeout,HTTPError,RequestException
try:
    response=requests.get('URL',timeout=1)
    print(response.status_code)
except ReadTimeout:
    print('...')
except HTTPError:
    print('...')
except RequestException:
    print('...')
re: regular expressions
match
re.match
Matches a pattern from the start of the string; returns None when nothing matches.
re.match(pattern,string,flags=0)
Plain matching:
import re
content="Hello 123 4567 World_This is a Regex Demo"
result=re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$',content)
print(result.group())
print(result.span())
Generic matching:
import re
content="Hello 123 4567 World_This is a Regex Demo"
result=re.match(r"^Hello.*Demo$",content)
print(result.group())
print(result.span())
Group matching:
import re
content="Hello 1234567 World_This is a Regex Demo"
result=re.match(r'^Hello\s(\d+)\sWorld.*Demo$',content)  # (\d+) captures the digits
print(result.group(1))
print(result.span())
Greedy matching:
import re
content="Hello 123 4567 World_This is a Regex Demo"
result=re.match(r"^He.*(\d+).*Demo$",content)
print(result)
print(result.group(1))
print(result.span())
Non-greedy matching:
import re
content="Hello 123 4567 World_This is a Regex Demo"
result=re.match(r"^He.*?(\d+).*Demo$",content)
print(result)
print(result.group(1))
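The two modes are easiest to compare side by side: the greedy .* swallows as many characters as it can and leaves the capture group only the last digit, while the non-greedy .*? stops as early as possible:

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
greedy = re.match(r'^He.*(\d+).*Demo$', content)
lazy = re.match(r'^He.*?(\d+).*Demo$', content)
print(greedy.group(1))  # 7
print(lazy.group(1))    # 1234567
```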
Matching with flags:
import re
content='''Hello 123 4567 World_This
is a Regex Demo
'''
result=re.match(r'^He.*?(\d+).*?Demo$',content,re.S)  # re.S lets . match newlines as well
print(result)
print(result.group(1))
Escaping:
import re
content='price is $5.00'
result=re.match(r'price is \$5\.00',content)
print(result)
print(result.group())
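Rather than escaping metacharacters by hand, re.escape builds the escaped pattern automatically; a small sketch:

```python
import re

content = 'price is $5.00'
pattern = re.escape('$5.00')  # escapes $ and . for us
result = re.search(pattern, content)
print(result.group())  # $5.00
```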
search (returns the first match)
re.search:
import re
content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
result=re.search(r'Hello.*?(\d+).*?Demo',content)
print(result)
print(result.group(1))
findall (returns all matches)
result=re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>',html,re.S)
print(result)
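The findall line above refers to an html string scraped earlier; with a small inline sample (invented here, in the same <li>/<a> shape), the pattern can be run standalone:

```python
import re

# A made-up fragment in the shape the pattern expects
html = '''<ul>
<li><a href="/song/1" singer="Alice">Song One</a></li>
<li><a href="/song/2" singer="Bob">Song Two</a></li>
</ul>'''
results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)  # [('/song/1', 'Alice', 'Song One'), ('/song/2', 'Bob', 'Song Two')]
```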
Sub
Replaces every match in the string and returns the resulting string.
import re
content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content=re.sub(r'\d+','',content)
print(content)
import re
content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content=re.sub(r'\d+','replace',content)
print(content)
Replacing while keeping the matched text
import re
content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content=re.sub(r'(\d+)',r'\1 8910',content)  # raw string, so \1 is a group backreference
print(content)
compile
Compiles a regular expression into a reusable pattern object.
import re
content='''Extra strings Hello 1234567 World_This
is a Regex Demo Extra strings'''
pattern=re.compile(r"Hello.*Demo",re.S)
result=re.search(pattern,content)  # search, since the string does not start with Hello
print(result)
Technology knows no borders.