Python爬虫学习笔记(四)

Request：

Test1（基本属性：POST）：

代码1：

import requests

# 发送POST请求
data = {

}
response = requests.post(url, data=data)

POST请求

Test2（auth认证）：

代码2：

import requests

# 发送POST请求
data = {

}
response = requests.post(url, data=data)

#内网 => 需要认证
auth = (user, pwd)
response = requests.get(url, auth=auth)

auth认证

Test3（添加代理IP）：

代码3：

import requests

# 1.请求url
url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',

}
free_proxy = {
    'http': '58.249.55.222:9797'
}
response = requests.get(url=url, headers=headers, proxies=free_proxy)
print(response.status_code)

添加代理IP

Test4（SSL认证）：

代码4：

import requests

url = 'https://www.12306.cn/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',

}
response = requests.get(url=url, headers=headers)
data = response.content.decode()
with open('03-ssl.html', 'w', encoding='utf-8')as f:
    f.write(data)

错误示例

错误原因：

https是有第三方CA证书认证的
但12306虽然是https，但是他不是CA证书，他是自己颁布的证书
解决方法：
告诉web的服务器忽略证书访问

错误原因及解决方法

正确操作：

import requests

url = 'https://www.12306.cn/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',

}
response = requests.get(url=url, headers=headers, verify=False)
data = response.content.decode()
with open('03-ssl.html', 'w', encoding='utf-8')as f:
    f.write(data)

忽略证书进行SSL认证

Test5（Cookies）：

以抓取https://www.yaozh.com/为例

代码5：

import requests

# 请求数据url
member_url = 'https://www.yaozh.com/mamber'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',

}
# cookies字符串
cookies = 'acw_tc=2f624a2116146683053267637e08ee667240e1e39bfc4dbbbd0b8a8f33eaa4; PHPSESSID=em18pu35k89loei7glff6tthu5; _ga=GA1.2.1727298837.1614668307; _gid=GA1.2.781607072.1614668307; _gat=1; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1614668307; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1614668313; yaozh_logintime=1614668314; yaozh_user=1038868%09s1mpL3; yaozh_userId=1038868; yaozh_jobstatus=kptta67UcJieW6zKnFSe2JyYnoaSZ5htnZqdg26qb21rg66flM6bh5%2BscZdyVNaWz9Gwl4Ny2G%2BenofNlKqpl6XKppZVnKmflWlxg2lolJeaA7663ea03b06b64c50D898E7C9Aa493SkZSblWmHcNiemZtVq56lloN0pG2SaZ%2BGam2SbGiZnZWSmZyYaodw4g%3D%3Dc0addf8502ad2cb2f15b3ee2b0e1386b; db_w_auth=854448%09s1mpL3; UtzD_f52b_saltkey=oZ3HIEZl; UtzD_f52b_lastvisit=1614664715; UtzD_f52b_lastact=1614668315%09uc.php%09; UtzD_f52b_auth=25786v5G66Jeq3x5p14pj246attR7pdjk5X409nOsIqKeaEypUvH%2BA1LTN4dNrChFQS1hmPQNfLGoyIxAD8PRqgk4bA; yaozh_uidhas=1; yaozh_mylogin=1614668317; acw_tc=2f624a2116146683053267637e08ee667240e1e39bfc4dbbbd0b8a8f33eaa4'
# 但需要的是cookies的字典类型
cookies_dict = {

}
cookies_list = cookies.split('; ')
for cookie in cookies_list:
    cookies_dict[cookie.split('=')[0]] = cookie.split('=')[1]

response = requests.get(url=member_url, headers=headers, cookies=cookies_dict)
data = response.content.decode()
with open('05-cookie.html', 'w', encoding='utf-8')as f:
    f.write(data)

Cookies - 代码传参登陆

由返回页面可知，登录成功。

Test6（cookies - session）：

以抓取https://www.yaozh.com/为例

代码6：

Cookies - 代码带着cookie去请求数据

由返回页面可知，登录成功。

数据解析：

·正则表达式：

Test1（正则表达式 - 贪婪模式）：

代码1：

import re

# 贪婪模式：从开头匹配到结尾
# 非贪婪模式：使用?
one = 'mdfjiefhuehfgieufn213431241234n'
pattern = re.compile('m(.*)n')
result = pattern.findall(one)
print(result)

贪婪模式

返回1：

['dfjiefhuehfgieufn213431241234']

View Code

Test2（正则表达式 - 费贪婪模式）：

代码2：

import re

# 贪婪模式：从开头匹配到结尾
# 非贪婪模式：使用?
one = 'mdfjiefhuehfgieufn213431241234n'
pattern = re.compile('m(.*?)n')
result = pattern.findall(one)
print(result)

View Code

返回2：

['dfjiefhuehfgieuf']

View Code

Test3（正则表达式 - 匹配换行符）：

代码3：

import re

# .除了换行符之外的匹配
one = """
    maiwgdyuwagdwyadg
    1234567612121134n
"""
pattern = re.compile('m(.*)n')
result = pattern.findall(one)
print(result)

不匹配换行符

返回3：

[]

代码4：

import re

# .除了换行符之外的匹配
one = """
    maiwgdyuwagdwyadg
    1234567612121134n
"""
pattern = re.compile('m(.*)n', re.S)
result = pattern.findall(one)
print(result)

匹配换行符

返回4：

代码5：

import re

# .除了换行符之外的匹配
one = """
    maiwgdyuwagdwyadg
    1234567612121134N
"""
pattern = re.compile('m(.*)n', re.S | re.I)
result = pattern.findall(one)
print(result)

忽略大小写

返回5：

Test4（正则表达式 - 纯数字的正则）：

代码6：

import re

# 纯数字的正则 \d 0 - 9之间的一个数
pattern = re.compile('^\d+$')
one = '1234'

# 匹配判断的方法
result = pattern.match(one)

print(result.group())

View Code

Test5（正则表达式 - 范围运算）：

代码7：

import re

# 范围运算 [123]  [1-9]
one = '798345'

pattern = re.compile('[1-9]')
result = pattern.findall(one)
print(result)

View Code

['7', '9', '8', '3', '4', '5']

注：

match：从头匹配，匹配一次
search：从任意位置，匹配一次
findall：查找符合正则的内容
sub：替换字符串
split：拆分

posted @ 2021-03-02 16:45 3cH0_Nu1L 阅读(236) 评论(0) 编辑收藏举报

刷新页面返回顶部

3cH0_Nu1L

Python爬虫学习笔记(四)

Request：

Test1（基本属性：POST）：

代码1：

Test2（auth认证）：

代码2：

Test3（添加代理IP）：

代码3：

Test4（SSL认证）：

代码4：

错误原因：

正确操作：

返回：

Test5（Cookies）：

代码5：

返回：

Test6（cookies - session）：

代码6：

返回：

数据解析：

·正则表达式：

Test1（正则表达式 - 贪婪模式）：

代码1：

返回1：

Test2（正则表达式 - 费贪婪模式）：

代码2：

返回2：

Test3（正则表达式 - 匹配换行符）：

代码3：

返回3：

代码4：

返回4：

代码5：

返回5：

Test4（正则表达式 - 纯数字的正则）：

代码6：

返回：

Test5（正则表达式 - 范围运算）：

代码7：

返回：

注：

公告