Python encoding issues
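Linux and macOS default to UTF-8.
Windows defaults to the system's ANSI code page, which is GBK (cp936) on Simplified Chinese systems.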
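To see which defaults apply on a particular machine, the standard library can be asked directly. This is a minimal sketch assuming Python 3; the variable names are only illustrative.

```python
import locale
import sys

# Encoding Python uses for str <-> bytes conversions when none is given.
# In Python 3 this is always 'utf-8'.
default_encoding = sys.getdefaultencoding()

# Encoding that open() normally uses for text files when encoding= is omitted.
# Typically UTF-8 on Linux and macOS, and the ANSI code page
# (for example cp936) on Windows.
preferred_encoding = locale.getpreferredencoding(False)

print(default_encoding)
print(preferred_encoding)
```

Because the second value differs between platforms, passing encoding= explicitly to open() avoids depending on it.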
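Python 2
Python 2 uses ASCII as its default encoding. A string literal read from a source file is a byte string in whatever encoding the file declares: the declared encoding is the encoding of those bytes.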
GBK bytes --decode('gbk')--> unicode --encode('utf-8')--> UTF-8 bytes
UTF-8 bytes --decode('utf-8')--> unicode --encode('gbk')--> GBK bytes
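The two conversion chains above can be sketched as follows. The sample is written in Python 3 syntax, where the intermediate type is str rather than Python 2's unicode; the sample bytes (the word "中文" in GBK) are an assumption chosen for illustration.

```python
# "中文" encoded as GBK (assumed sample input).
gbk_bytes = b"\xd6\xd0\xce\xc4"

# GBK -> Unicode -> UTF-8
as_unicode = gbk_bytes.decode("gbk")
utf8_bytes = as_unicode.encode("utf-8")

# UTF-8 -> Unicode -> GBK
back_to_gbk = utf8_bytes.decode("utf-8").encode("gbk")

print(as_unicode)
print(utf8_bytes)
print(back_to_gbk == gbk_bytes)
```

Unicode is always the pivot: there is no direct GBK to UTF-8 step, only decode followed by encode.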
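Python 3
Python 3 uses Unicode by default. When it reads text from a file, it first decodes the data to Unicode, whatever the file's encoding, so every string (str) used in Python 3 is Unicode.
Python 3 also has a second type, bytes, which is used for storage and network transmission.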
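The difference between the two Python 3 types can be seen in a few lines. This is a small sketch; the sample string is an assumption.

```python
text = "编码"                  # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: what is stored on disk or sent over the network

print(type(text), len(text))   # length counts characters
print(type(data), len(data))   # length counts bytes

# Decoding with the same codec gives back the original str.
round_trip = data.decode("utf-8")
print(round_trip == text)
```

The lengths differ because each of these characters takes three bytes in UTF-8.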
Garbled text with requests
For example:
import requests

# 1. Specify the URL
url = 'https://www.baidu.com'
# 2. Send a GET request; this returns a response object
response = requests.get(url=url)
# 3. Read the response body: .text is str, .content is bytes
response_text = response.text
with open('./re2.html', "w", encoding="utf-8") as f:
    f.write(response_text)
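The re2.html file written by the code above contains garbled text.
Cause:
response.text decodes the downloaded data to Unicode with 'latin1' (ISO-8859-1) by default, because requests falls back to that encoding when the Content-Type header is a text type with no charset. The page data is actually UTF-8, so response.text comes out garbled.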
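The effect can be reproduced without any network access. In this sketch page_bytes stands in for response.content, and the sample text is an assumption.

```python
# Pretend these are the raw bytes the server sent (UTF-8 encoded).
page_bytes = "百度".encode("utf-8")

# What .text yields when the bytes are decoded as latin-1.
wrong_text = page_bytes.decode("latin-1")

# What the text should be when decoded with the real encoding.
right_text = page_bytes.decode("utf-8")

print(repr(wrong_text))
print(right_text)
print(len(wrong_text), len(right_text))
```

Decoding as latin-1 never raises an error, since every byte value is a valid latin-1 character. That is why the problem shows up as garbled output rather than as an exception.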
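Solutions:
1. If Requests cannot detect the correct encoding, tell it what the correct one is
import requests

# 1. Specify the URL
url = 'https://www.baidu.com'
# 2. Send a GET request; this returns a response object
response = requests.get(url=url)
# Set the real encoding before reading .text
response.encoding = 'utf-8'
print(type(response))
# 3. Read the response body: .text is str, .content is bytes
response_text = response.text
with open('./re3.html', "w", encoding="utf-8") as f:
    f.write(response_text)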
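2. Re-encode the wrongly decoded Unicode string into bytes, using the same wrong codec it was decoded with
import requests

# 1. Specify the URL
url = 'https://www.baidu.com'
# 2. Send a GET request; this returns a response object
response = requests.get(url=url)
# Manually setting the encoding to utf-8 (not used in this approach)
# response.encoding = 'utf-8'
# 3. Read the response body: .text is str, .content is bytes.
#    Encode response.text with 'latin-1' to get the original bytes back.
response_text = response.text.encode('latin-1')
with open('./re3.html', "wb") as f:
    f.write(response_text)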
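Approach 2 works because latin-1 maps each byte value 0 to 255 to the code point with the same number, so decoding and then re-encoding with it loses nothing. A minimal offline sketch, with an assumed sample string standing in for the page:

```python
original = "百度一下"

raw = original.encode("utf-8")          # bytes as sent by the server
garbled = raw.decode("latin-1")         # wrongly decoded text, like response.text
recovered_bytes = garbled.encode("latin-1")  # undo the wrong decode

# The recovered bytes are identical to what the server sent,
# so decoding them as UTF-8 yields the intended text.
recovered = recovered_bytes.decode("utf-8")

print(recovered_bytes == raw)
print(recovered)
```

This only holds when the wrong codec was latin-1. Other single-byte codecs do not necessarily define all 256 byte values, so the same trick may fail with them.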
3. Use response.content directly to get the data as raw bytes
import requests

# 1. Specify the URL
url = 'https://www.baidu.com'
# 2. Send a GET request; this returns a response object
response = requests.get(url=url)
# Manually setting the encoding to utf-8 (not used in this approach)
# response.encoding = 'utf-8'
# 3. Read the response body: .text is str, .content is bytes
response_content = response.content
with open('./re3.html', "wb") as f:
    f.write(response_content)
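Writing the bytes in binary mode means no decoding happens at all, so the file on disk is byte-for-byte what the server sent. The sketch below checks this offline; content stands in for response.content, and the file name and sample markup are assumptions.

```python
import os
import tempfile

# Stand-in for response.content: UTF-8 bytes from the server.
content = "<title>百度</title>".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")

    # Binary mode: bytes are written unchanged, no codec involved.
    with open(path, "wb") as f:
        f.write(content)

    # Read back as raw bytes and as UTF-8 text.
    with open(path, "rb") as f:
        on_disk = f.read()
    with open(path, encoding="utf-8") as f:
        as_text = f.read()

print(on_disk == content)
print(as_text)
```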