Scraping email addresses from web pages with Python 3
Analysis showed that the data to scrape is what the following URLs return:
http://xxx.com?method=getrequest&gesnum=00000001
http://xxx.com?method=getrequest&gesnum=00000002
http://xxx.com?method=getrequest&gesnum=00000003
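The URLs differ only in the zero-padded gesnum parameter, so they can be generated rather than listed by hand. A minimal sketch, assuming a helper named build_urls (hypothetical, not part of the original script) and the placeholder host xxx.com used above:

```python
def build_urls(start, stop):
    """Return the URLs for gesnum values start..stop (inclusive), zero-padded to 8 digits."""
    base = 'http://xxx.com?method=getrequest&gesnum={:08d}'
    # range() excludes its upper bound, hence stop + 1
    return [base.format(n) for n in range(start, stop + 1)]


urls = build_urls(1, 3)
for u in urls:
    print(u)
```

Building the list in memory avoids the intermediate num.txt file, although writing the numbers to a file works just as well.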
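The returned JSON contains lone backslash ("\") escape characters, which the following call did not handle properly:
req = requests.get(url=url, headers=headers, verify=False, timeout=60).json()
So the response body is taken as raw bytes, decoded, and processed as text instead:
req = requests.get(url=url, headers=headers, verify=False, allow_redirects=False, timeout=60)
data = json.dumps(bytes.decode(req.content, 'UTF-8'))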
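To see why the bytes route gets around the problem, the sketch below reproduces it offline. The response body in raw is made up for illustration (it is an assumption, not data from the real site). Note that json.dumps does not parse anything here: it only wraps the decoded text in a JSON string literal, so the email regex later runs against plain text.

```python
import json

# Hypothetical response body: "\x" is not a valid JSON escape sequence.
raw = rb'{"email": "xxxx@hotmail.com", "note": "a\xb"}'

text = raw.decode('utf-8')

# Parsing fails on the lone backslash, which is what .json() would hit as well.
try:
    json.loads(text)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

# json.dumps only serializes the string; the backslash is escaped, nothing is parsed.
dumped = json.dumps(text)

print(parsed_ok)
print(dumped)
```

Since only the email address is needed, searching the decoded text directly would likely work too; the json.dumps step mainly keeps the data as a single quoted string.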
#!/usr/bin/python3
# -*- coding:utf-8 -*-
# Written on Windows 7 x64 with Notepad++ and Python 3.5.0
import json
import re

import requests
import urllib3

# verify=False is used below, so silence urllib3's certificate warnings
urllib3.disable_warnings()

cookie = '''JSESSIONID=1B7407076DE01727BC48DCD56FF9BA70; entsoft=entsoft; JSESSIONID=4877B5AC1DF6307E90CF1641D3863A6C; radId=45991FBF-0BC4-3BA4-08E2-00072022FB2C'''

headers = {
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': cookie,
}


# Write 00000001-00000300 to num.txt, one number per line
def getNum():
    filename = 'C:\\Users\\Administrator\\Desktop\\脚本\\num.txt'
    with open(filename, 'w') as file:
        # range() excludes its upper bound, so 301 is needed to reach 300
        for i in range(1, 301):
            file.write(("%08d" % i) + '\n')


def main():
    # url = 'http://xxx.com?method=getrequest&gesnum=00000001'
    getNum()
    filename = 'C:\\Users\\Administrator\\Desktop\\脚本\\num.txt'
    with open(filename, 'r') as file:
        for line in file:
            # Strip the trailing newline so it does not end up in the URL
            url = 'http://xxx.com?method=getrequest&gesnum={line}'.format(line=line.strip())
            # print(url)
            # req = requests.get(url=url, headers=headers, verify=False, timeout=60).json()
            # Problem: the lone "\" escape characters in the returned JSON were not handled,
            # so the approach below is used instead
            req = requests.get(url=url, headers=headers, verify=False, allow_redirects=False, timeout=60)
            # print(req.content)
            # response.text returns Unicode text
            # response.content returns raw bytes
            # The Unicode text raised an error, so the raw bytes are decoded explicitly;
            # json.dumps then serializes the decoded text as a JSON string
            data = json.dumps(bytes.decode(req.content, 'UTF-8'))
            # print(data)
            # Match an email address with a regular expression
            emailRegex = r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}"
            email = re.search(emailRegex, data)
            print(email)


if __name__ == '__main__':
    main()
<_sre.SRE_Match object; span=(158, 184), match='xxxx@hotmail.com'>
<_sre.SRE_Match object; span=(145, 170), match='xxxx@nordictelecom.net'>
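The output above is the repr of the Match object, not the address itself. One way to get the bare address is to call group(0) on the match. A minimal sketch, assuming a helper named first_email (hypothetical) and a made-up sample string; the regular expression is the one from the script:

```python
import re

EMAIL_REGEX = r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}"


def first_email(text):
    """Return the first email address found in text, or None if there is none."""
    match = re.search(EMAIL_REGEX, text)
    # re.search returns None when nothing matches, so guard before calling group()
    return match.group(0) if match else None


sample = '{"name": "test", "mail": "xxxx@hotmail.com"}'
print(first_email(sample))
print(first_email('no address here'))
```

Printing the address this way would make a later clean-up pass over the saved output unnecessary.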
#!/usr/bin/python3
# -*- coding:utf-8 -*-
# Written on Windows 7 x64 with Notepad++ and Python 3.5.0


def main():
    filename = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle.txt"
    filename1 = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle_handle.txt"
    # Both files are closed automatically when the with block ends
    with open(filename, 'r') as file, open(filename1, 'w') as file1:
        for line in file:
            # Drop the first 48 characters, i.e. the prefix
            # "<_sre.SRE_Match object; span=(158, 184), match='"
            # (48 characters when both span values have three digits)
            data = line[48:]
            print(data)
            file1.write(data)


if __name__ == '__main__':
    main()
xxxx@hotmail.com'>
xxxx@nordictelecom.net'>
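The fixed offset of 48 only lines up when both span values have three digits, and it leaves the trailing '> on each line. As an alternative, the address can be pulled out of the saved repr with a regular expression. A minimal sketch, assuming a helper named email_from_repr (hypothetical) and sample lines in the format shown above:

```python
import re

# Capture whatever sits between match=' and the closing quote
MATCH_REPR = re.compile(r"match='([^']*)'")


def email_from_repr(line):
    """Extract the address from a saved Match repr line, or return None."""
    found = MATCH_REPR.search(line)
    return found.group(1) if found else None


lines = [
    "<_sre.SRE_Match object; span=(158, 184), match='xxxx@hotmail.com'>",
    "<_sre.SRE_Match object; span=(5, 21), match='xxxx@hotmail.com'>",
]
for entry in lines:
    print(email_from_repr(entry))
```

Both sample lines yield the same address even though their prefixes differ in length.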