A Simple Python Crawler (Part 1): Sending GET and POST Requests with the Built-in urllib, Using a Proxy Handler, and Using a Proxy IP Pool
====== A simple crawler ======
Simply put, a crawler requests a URL, extracts the useful information from the returned data, and then saves that information to a database or a file. Visiting pages and extracting data by hand is very slow, which is why we write a program to collect the useful information; that is what a crawler does.
1. Concept
A web crawler is also called a web spider (Web Spider): if the Internet is pictured as a spider web, the spider is the program crawling around on it. A web crawler locates pages by their addresses, i.e. by URL. As a simple example, the string we type into the browser's address bar is a URL, for example: https://www.baidu.com/
A URL is a Uniform Resource Locator. Its general format is as follows (parts in square brackets [] are optional):
protocol :// hostname[:port] / path / [;parameters][?query]#fragment
A URL consists of three parts:
(1) protocol: the first part is the protocol; Baidu, for example, uses the https protocol;
(2) hostname[:port]: the second part is the host name (with an optional port number; the default is 80 for HTTP). Baidu's host name is www.baidu.com, which is the address of the server;
(3) path: the third part is the concrete location of the resource on the host, such as a directory and file name.
A web crawler fetches page content based on such URLs. The short urlparse sketch below shows how an example URL maps onto these parts.
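As a quick illustration (not from the original article; the example URL and its query string are made up), urllib.parse.urlparse splits a URL into exactly these pieces:

import urllib.parse

parts = urllib.parse.urlparse("https://www.baidu.com:443/s?wd=python#top")
print(parts.scheme)    # https         -> protocol
print(parts.hostname)  # www.baidu.com -> hostname
print(parts.port)      # 443           -> port
print(parts.path)      # /s            -> path
print(parts.query)     # wd=python     -> query
print(parts.fragment)  # top           -> fragment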
2. A simple crawler example
In Python 3.x we can use the urllib package to fetch web pages. urllib is a URL-handling package that bundles several modules for working with URLs:
1. urllib.request is used to open and read URLs;
2. urllib.error contains the exceptions raised by urllib.request, which can be caught with try (see the sketch after the first example below);
3. urllib.parse contains functions for parsing URLs;
4. urllib.robotparser parses robots.txt files. It provides a single RobotFileParser class whose can_fetch() method tells you whether the crawler is allowed to download a page (a short sketch follows this list).
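A minimal robotparser sketch (not from the original article; the robots.txt URL and the page being checked are only placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")  # placeholder robots.txt location
rp.read()                                       # download and parse robots.txt
# can_fetch(useragent, url): may this user agent crawl the given URL?
print(rp.can_fetch("*", "https://www.baidu.com/s?wd=python"))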
With the urllib.request.urlopen() function we can easily open a website, read its content, and print it.
urlopen takes several optional parameters (such as data and timeout); see the Python documentation for details.
With this much we can already write the simplest possible program:
# crawler example
from urllib import request

if __name__ == "__main__":
    response = request.urlopen("http://qiaoliqiang.cn")
    html = response.read()
    html = html.decode("utf-8")
    print(html)
Result:
E:\pythonWorkSpace\FirstProject\venv\Scripts\python.exe E:/pythonWorkSpace/FirstProject/HelloPython/reptile.py
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
........
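As mentioned in point 2 of the module list above, errors raised by urllib.request can be caught with try. A minimal sketch (not from the original article; it reuses the same demo URL and an arbitrary 5-second timeout):

from urllib import request, error

try:
    response = request.urlopen("http://qiaoliqiang.cn", timeout=5)
    print(response.read().decode("utf-8"))
except error.HTTPError as e:   # the server answered with an error status code
    print("HTTP error:", e.code)
except error.URLError as e:    # DNS failure, refused connection, timeout, ...
    print("URL error:", e.reason)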
The code above has one flaw: we must know the site's encoding in advance to decode the response correctly, so it needs to be improved.
3. Detecting the page encoding automatically
There are many ways to detect a page's encoding; I prefer to use a third-party library.
First install the third-party library chardet, a module for detecting character encodings. Installation only requires the command below (or, in PyCharm, go to File -> Settings -> Project Interpreter, click the + button, and search for chardet):
pip install chardet
Once it is installed we can use chardet.detect() to determine a page's encoding, so now we can write a small program that detects the encoding by itself.
# crawler example 2 (encoding detected automatically)
from urllib import request
import chardet

if __name__ == "__main__":
    response = request.urlopen("http://qiaoliqiang.cn/")
    html = response.read()
    charset = chardet.detect(html)  # returns a dict
    print(charset)  # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
    html = html.decode(charset["encoding"])
    print(html)
Result:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>XXXXXXXXXXX</title>
<script src="JS/jquery-1.8.3.js"></script>
...
That gives us a minimal crawler; the next step is to extract information from what it returns, i.e. to parse the fetched data.
======== Sending GET and POST requests with urllib ========
First set up a backend server: a small SpringBoot project whose filter prints the request method and the request parameters:
package cn.qs.filter;

import java.io.IOException;
import java.util.Enumeration;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import cn.qs.bean.user.User;

/**
 * Login filter
 *
 * @author Administrator
 */
@WebFilter(filterName = "loginFilter", urlPatterns = "/*")
public class LoginFilter implements Filter {

    private static final Logger logger = LoggerFactory.getLogger(LoginFilter.class);

    public LoginFilter() {
    }

    public void destroy() {
    }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        String method = req.getMethod();
        System.out.println("Request method: " + method);

        // print every request parameter
        Enumeration<String> parameterNames = request.getParameterNames();
        while (parameterNames.hasMoreElements()) {
            String key = (String) parameterNames.nextElement();
            System.out.println(key + " \t " + request.getParameter(key));
        }

        // write a Chinese string back so the client can test decoding
        response.setContentType("text/html;charset=UTF-8");
        response.getWriter().write("回传中文");
    }

    public void init(FilterConfig fConfig) throws ServletException {
    }
}
(1) A GET request without parameters
from urllib import request

if __name__ == "__main__":
    response = request.urlopen("http://localhost:8088/login.html")
    html = response.read()
    # decode
    html = html.decode('utf-8')
    print(html)
Result:
回传中文
The JavaWeb console prints:
Request method: GET
(2) A GET request with parameters
import urllib.request
import urllib.parse

# base URL
base_url = 'http://localhost:8088/login.html'
# build a dict of parameters
data_dict = {
    "username": "张三",
    "password": "13221321",
    "utype": "1",
    "vcode": "2132312"
}
# urlencode serializes the dict into a query string, which is then appended to the base URL
data_string = urllib.parse.urlencode(data_dict)
print(data_string)
new_url = base_url + "?" + data_string
response = urllib.request.urlopen(new_url)
print(response.read().decode('utf-8'))
Result:
password=13221321&utype=1&vcode=2132312&username=%E5%BC%A0%E4%B8%89
回传中文
The JavaWeb console prints:
Request method: GET
password 13221321
utype 1
vcode 2132312
username 张三
(3) A POST request with parameters
import urllib.request
import urllib.parse

# build a dict of parameters
data_dict = {"username": "张三", "password": "123456"}
# urlencode serializes the dict into a string
data_string = urllib.parse.urlencode(data_dict)
# convert the string to bytes, because POST data must be sent as bytes
last_data = bytes(data_string, encoding='utf-8')
print(last_data)
# if the data argument is passed to urlopen, the request is sent as POST instead of GET
response = urllib.request.urlopen("http://localhost:8088/login.html", data=last_data)
# the parameters arrive as form fields, i.e. this simulates a form submission via POST
print(response.read().decode('utf-8'))
Result:
b'password=123456&username=%E5%BC%A0%E4%B8%89'
回传中文
The JavaWeb console prints:
Request method: POST
password 123456
username 张三
Supplement: an example in which Python reads records from a database, sends a request for each record's url, method and param, and writes the results to an HTML file:
#!/usr/bin/python3
import pymysql
from urllib import request
import urllib.parse
import chardet
import json


# send the request described by one database row
def requestUrl(result):
    url = str(result['url'])
    method = str(result['method'])
    data = str(result['param'])
    if url is None or method is None:
        return

    if data is not None:
        data = str(data)
        data = data.replace("form=", "")  # strip the "form=" prefix
        # handle array parameters
        if data.startswith('[') and data.endswith(']'):
            datas = json.loads(data)
            if len(datas) > 0:
                data = json.dumps(datas[0])
            else:
                data = '{"time": 1}'
        elif "{}" == data or "" == data:
            data = '{"time": 1}'
        else:
            data = '{"time": 1}'

    try:
        # POST request
        if 'POST' in method:
            # convert the string to bytes, because POST data must be sent as bytes
            last_data = bytes(data, encoding='utf-8')
            response = urllib.request.urlopen(url, data=last_data)
            result['responseResult'] = response.read().decode('utf-8')
        else:
            # GET request: urlencode needs a dict, so parse the JSON string first
            data_string = urllib.parse.urlencode(json.loads(data))
            new_url = url + "?" + data_string
            response = urllib.request.urlopen(new_url)
            result['responseResult'] = response.read().decode('utf-8')
    except Exception as e:
        result['responseResult'] = "error, reason: " + str(e)


# write the crawled data to a local HTML file
def out_html(datas):
    if datas is None:
        return

    file = open('D:\\out.html', 'w', encoding='utf-8')
    file.write("<html>")
    file.write(r'''
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    ''')
    file.write("<head>")
    file.write("<title>Crawl results</title>")
    # borders for the table
    file.write(r'''
    <style>
    table{width:100%;table-layout: fixed;word-break: break-all; word-wrap: break-word;}
    table td{border:1px solid black;width:300px}
    </style>
    ''')
    file.write("</head>")
    file.write("<body>")
    file.write("<table cellpadding='0' cellspacing='0'>")
    # fill the table with one row per record
    for data in datas:
        file.write("<tr>")
        file.write("<td>%s</td>" % data['interfaceName'])
        file.write('<td><a href=' + str(data['url']) + '>' + str(data['url']) + '</a></td>')
        file.write("<td>%s</td>" % data['method'])
        file.write("<td>%s</td>" % data['param'])
        file.write("<td>%s</td>" % data['responseResult'])
        file.write("</tr>")
    file.write("</table>")
    file.write("</body>")
    file.write("</html>")


# main
if __name__ == '__main__':
    # open the database connection
    db = pymysql.connect("localhost", "root", "123456", "pycraw")
    # get a cursor that returns rows as dicts
    cursor = db.cursor(cursor=pymysql.cursors.DictCursor)
    # SQL query
    sql = "SELECT * FROM interface"
    try:
        # execute the query
        cursor.execute(sql)
        # fetch all rows
        results = cursor.fetchall()
        for result in results:
            requestUrl(result)
        out_html(results)
        print("done")
    except Exception as e:
        print(e)
    # close the database connection
    db.close()
Result: the output is written to D:\out.html as a table with one row per interface (screenshot omitted).
Supplement: you can also build a Request object yourself to set the request method, request headers, request data and so on, and then send it.
The request.urlopen calls above have a limitation: they cannot customize request headers. urllib.request.Request(url, headers, data) lets you set headers, and a handler allows even more advanced customization (things like dynamic cookies and proxies cannot be handled by a Request object alone). Both approaches are examined below.
1. Using urllib.request.Request
1. Java backend endpoint
@RequestMapping("/t4") public void test4(ServletRequest servletRequest) throws IOException { System.out.println("******"); HttpServletRequest servletRequest1 = (HttpServletRequest) servletRequest; System.out.println(servletRequest1.getMethod() + "\t" + servletRequest1.getContentType()); Enumeration<String> headerNames = servletRequest1.getHeaderNames(); System.out.println("======"); while (headerNames.hasMoreElements()) { String s = headerNames.nextElement(); Enumeration<String> headers = servletRequest1.getHeaders(s); while (headers.hasMoreElements()) { System.out.println("key: " + s + "\tvalues: " + headers.nextElement()); } } System.out.println("======"); Enumeration<String> parameterNames = servletRequest1.getParameterNames(); while (parameterNames.hasMoreElements()) { String s = parameterNames.nextElement(); System.out.println("key: " + s + "\tvalues: " + StringUtils.join(servletRequest1.getParameterValues(s))); } List<String> strings = IOUtils.readLines(servletRequest1.getInputStream()); System.out.println("======"); System.out.println(strings); }
2. Testing the Request object in Python
# author: qlq
# date: 2022/7/20 14:28
import json
import urllib.parse
from urllib import request

if __name__ == '__main__':
    # the default is a GET request
    req = request.Request("http://localhost:8088/inner/t4")
    request.urlopen(req)

    # explicit GET request
    header2 = {"key21": "value21", "key22": "value22"}
    req2 = request.Request("http://localhost:8088/inner/t4?param1=1&param2=2", headers=header2, method="GET")
    request.urlopen(req2)

    # POST request with url-encoded parameters; the server receives them as request parameters
    header3 = {"key31": "value31", "key2": "value32"}
    jsonParam3 = {"param311": "value311", "param322": "value322"}
    # urlencode serializes the dict into a string
    jsonParamStr = urllib.parse.urlencode(jsonParam3)
    # convert the string to bytes, because POST data must be sent as bytes
    last_data = bytes(jsonParamStr, encoding='utf-8')
    req3 = request.Request("http://localhost:8088/inner/t4?param31=1&param32=2", data=last_data, headers=header2, method="POST")
    request.urlopen(req3)

    # POST request whose body is sent as JSON
    header4 = {"key41": "value41", "key42": "value42", "Content-Type": "application/json"}
    jsonParam4 = {"param41": "value41", "param42": "value42"}
    req4 = request.Request("http://localhost:8088/inner/t4?param31=1&param32=2", data=json.dumps(jsonParam4).encode('utf-8'),
                           headers=header4, method="POST")
    request.urlopen(req4)
3. Backend console output:
******
GET null
======
key: accept-encoding values: identity
key: host values: localhost:8088
key: user-agent values: Python-urllib/3.9
key: connection values: close
======
======
[]
******
GET null
======
key: accept-encoding values: identity
key: host values: localhost:8088
key: user-agent values: Python-urllib/3.9
key: key21 values: value21
key: key22 values: value22
key: connection values: close
======
key: param1 values: 1
key: param2 values: 2
======
[]
******
POST application/x-www-form-urlencoded
======
key: accept-encoding values: identity
key: content-type values: application/x-www-form-urlencoded
key: content-length values: 35
key: host values: localhost:8088
key: user-agent values: Python-urllib/3.9
key: key21 values: value21
key: key22 values: value22
key: connection values: close
======
key: param31 values: 1
key: param32 values: 2
key: param311 values: value311
key: param322 values: value322
======
[]
******
POST application/json
======
key: accept-encoding values: identity
key: content-length values: 44
key: host values: localhost:8088
key: user-agent values: Python-urllib/3.9
key: key41 values: value41
key: key42 values: value42
key: content-type values: application/json
key: connection values: close
======
key: param31 values: 1
key: param32 values: 2
======
[{"param41": "value41", "param42": "value42"}]
4. A look at the source code
urllib.request.urlopen is defined as follows:
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):
    global _opener
    if cafile or capath or cadefault:
        import warnings
        warnings.warn("cafile, capath and cadefault are deprecated, use a "
                      "custom context instead.",
                      DeprecationWarning, 2)
        if context is not None:
            raise ValueError(
                "You can't pass both context and any of cafile, capath, and "
                "cadefault"
            )
        if not _have_ssl:
            raise ValueError('SSL support not available')
        context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH,
                                             cafile=cafile,
                                             capath=capath)
        https_handler = HTTPSHandler(context=context)
        opener = build_opener(https_handler)
    elif context:
        https_handler = HTTPSHandler(context=context)
        opener = build_opener(https_handler)
    elif _opener is None:
        _opener = opener = build_opener()
    else:
        opener = _opener
    return opener.open(url, data, timeout)
Following the call into open, we reach urllib.request.OpenerDirector.open:
def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
    # accept a URL or a Request object
    if isinstance(fullurl, str):
        req = Request(fullurl, data)
    else:
        req = fullurl
        if data is not None:
            req.data = data

    req.timeout = timeout
    protocol = req.type

    # pre-process request
    meth_name = protocol+"_request"
    for processor in self.process_request.get(protocol, []):
        meth = getattr(processor, meth_name)
        req = meth(req)

    sys.audit('urllib.Request', req.full_url, req.data, req.headers,
              req.get_method())
    response = self._open(req, data)

    # post-process response
    meth_name = protocol+"_response"
    for processor in self.process_response.get(protocol, []):
        meth = getattr(processor, meth_name)
        response = meth(req, response)

    return response
Here you can see the check: if the URL passed in is a string, a Request object is constructed from it; otherwise fullurl is already a Request object and is used directly for the rest of the processing.
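In other words, the two call styles below are equivalent; a tiny sketch (not from the original article, reusing the local test endpoint from earlier):

from urllib import request

resp1 = request.urlopen("http://localhost:8088/inner/t4")                   # plain string URL
resp2 = request.urlopen(request.Request("http://localhost:8088/inner/t4"))  # prebuilt Request object
print(resp1.status, resp2.status)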
2. Basic handler usage
Test code:
import urllib.request url = "http://baidu.com" headers = { "User-Agent": "Mozilla/5.0" } request = urllib.request.Request(url = url, headers = headers) # handler build_opener open handler = urllib.request.HTTPHandler() opener = urllib.request.build_opener(handler) response = opener.open(request) # read result = response.read().decode("utf-8") print(result)
Result:
<html>
<meta http-equiv="refresh" content="0;url=http://www.baidu.com/">
</html>
3. Using a proxy handler
Example: checking your own machine's IP.
1. Without a proxy IP
# author: qlq
# date: 2022/7/25 10:22
# desc:
import urllib.request

url = "http://www.ipdizhichaxun.com/"
headers = {
    "User-Agent": "Mozilla/5.0"
}
request = urllib.request.Request(url=url, headers=headers)

# handler -> build_opener -> open
handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)
response = opener.open(request)

# read
result = response.read().decode("utf-8")

# save result
with open("detail.html", "w", encoding="utf-8") as fp:
    fp.write(result)
The IP information can then be seen in detail.html:
2. With a proxy IP:
First, apply for some free proxy IPs from providers such as 快代理 (Kuaidaili) or 太阳代理.
import urllib.request url = "http://www.ipdizhichaxun.com/" headers = { "User-Agent": "Mozilla/5.0" } request = urllib.request.Request(url=url, headers=headers) proxies = { "http": "110.80.160.175:4331" } # handler build_opener open handler = urllib.request.ProxyHandler(proxies=proxies) # handler = urllib.request.HTTPHandler() opener = urllib.request.build_opener(handler) response = opener.open(request) # read result = response.read().decode("utf-8") # save result with open("detail.html", "w", encoding="utf-8") as fp: fp.write(result)
Result: detail.html now shows the proxy's IP instead of your own (screenshot omitted).
3. Using a proxy IP pool
Using a proxy pool simply means picking one proxy from a set of candidates, either at random or in round-robin order (a round-robin sketch follows the pool example below):
(1) Random selection with the random module
import random

proxies = [
    {"http": "110.80.160.175:4331"},
    {"http": "222.242.136.190:4335"},
    {"http": "122.246.94.116:4324"},
    {"http": "125.117.128.138:4368"},
    {"http": "113.133.20.3:4331"}
]

print(random.choice(proxies))
print(random.choices(proxies))
Result:
{'http': '222.242.136.190:4335'}
[{'http': '125.117.128.138:4368'}]
(2) Using the proxy pool
import random
import urllib.request

url = "http://www.ipdizhichaxun.com/"
headers = {
    "User-Agent": "Mozilla/5.0"
}
request = urllib.request.Request(url=url, headers=headers)

proxies_pool = [
    {"http": "110.80.160.175:4331"},
    {"http": "222.242.136.190:4335"},
    {"http": "122.246.94.116:4324"},
    {"http": "125.117.128.138:4368"},
    {"http": "113.133.20.3:4331"}
]
proxies = random.choice(proxies_pool)

# handler -> build_opener -> open
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)

# read
result = response.read().decode("utf-8")

# save result
with open("detail.html", "w", encoding="utf-8") as fp:
    fp.write(result)
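As an alternative to random.choice, here is a minimal round-robin sketch (not from the original article; it reuses the same pool entries, and the actual request is commented out because the proxy addresses are only placeholders):

import itertools
import urllib.request

proxies_pool = [
    {"http": "110.80.160.175:4331"},
    {"http": "222.242.136.190:4335"},
    {"http": "122.246.94.116:4324"}
]
proxy_cycle = itertools.cycle(proxies_pool)  # endless round-robin iterator

for _ in range(3):
    proxies = next(proxy_cycle)  # take the next proxy in turn
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies=proxies))
    print("using proxy:", proxies)
    # response = opener.open("http://www.ipdizhichaxun.com/")  # same target as above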
4. Parsing HTML elements with BeautifulSoup4
For installing the beautifulsoup4 package and importing it into PyCharm, see the next article.
import random
import urllib.request

from bs4 import BeautifulSoup

url = "http://www.ipdizhichaxun.com/"
headers = {
    "User-Agent": "Mozilla/5.0"
}
request = urllib.request.Request(url=url, headers=headers)

proxies_pool = [
    {"http": "220.184.160.208:4314"},
    {"http": "123.144.61.160:4346"},
    {"http": "113.229.174.27:4331"},
    {"http": "114.239.149.191:4345"},
    {"http": "180.120.181.150:4331"}
]
proxies = random.choice(proxies_pool)

# handler -> build_opener -> open
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)

# read
result = response.read().decode("utf-8")

# save result
# with open("detail.html", "w", encoding="utf-8") as fp:
#     fp.write(result)

# extract result
soup = BeautifulSoup(result, 'html.parser', from_encoding='utf-8')
# find the content area
div = soup.find('div', id='wrapper')
# find the result paragraph among its children
resultp = div.findChild("p", class_="result")
print(resultp.getText())
5. Parsing elements with XPath
import random
import urllib.request

from lxml import etree

url = "http://www.ipdizhichaxun.com/"
headers = {
    "User-Agent": "Mozilla/5.0"
}
request = urllib.request.Request(url=url, headers=headers)

proxies_pool = [
    {"http": "182.204.181.169:4313"},
    {"http": "183.92.217.100:4325"}
]
proxies = random.choice(proxies_pool)

# handler -> build_opener -> open
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)

# read
result = response.read().decode("utf-8")
result = etree.HTML(result)

prompt = result.xpath("//div[@id='wrapper']//*[@class='result']//text()")
print(prompt)
Supplement: making HTTPS requests
1. Method 1: pass a context argument to urlopen
import urllib.request
import ssl

context = ssl._create_unverified_context()
response = urllib.request.urlopen("https://baidu.com", context=context)
result = response.read().decode("utf-8")
print(result)
2. Method 2: disable certificate verification globally
import ssl

ssl._create_default_https_context = ssl._create_unverified_context
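With the global override in place, the same request as in method 1 works without passing an explicit context. A small sketch (not from the original article, reusing the baidu URL from method 1):

import ssl
import urllib.request

ssl._create_default_https_context = ssl._create_unverified_context

response = urllib.request.urlopen("https://baidu.com")
print(response.read().decode("utf-8"))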