'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
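Problem: While scraping data from Douyu live streaming with Python, decoding the response with str(bytes_read, encoding) raised the error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The code is as follows: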
from urllib import request


class Spider():
    url = 'https://www.douyu.com/'

    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        htmls = r.read()  # raw response bytes (HTML)
        htmls = str(htmls, encoding='utf-8')  # the error is raised here

    def go(self):
        self.__fetch_content()


spider = Spider()
spider.go()
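Cause: Stepping through with a debugger shows that the bytes returned by r.read() begin with b'\x1f\x8b\x08'. That is the gzip signature, so the response body is gzip-compressed. Compressed bytes are not valid UTF-8, which is why decoding fails at byte 0x8b. The bytes therefore have to be decompressed before they are decoded. The corrected code is as follows: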
from urllib import request
from io import BytesIO
import gzip


class Spider():
    url = 'https://www.douyu.com/'

    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        htmls = r.read()  # raw, gzip-compressed response bytes
        buff = BytesIO(htmls)
        f = gzip.GzipFile(fileobj=buff)  # decompress from the in-memory buffer
        htmls = f.read().decode('utf-8')  # decode only after decompressing

    # Entry point
    def go(self):
        self.__fetch_content()


spider = Spider()
spider.go()
After this change, decoding works correctly.
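The fix above assumes the response is always gzip-compressed. As a rough sketch of a more defensive variant, the helper below checks for the gzip signature first and decompresses only when it is present. The names decode_body and GZIP_MAGIC are hypothetical, introduced here for illustration, and the example simulates a compressed body with gzip.compress rather than fetching the real page.

```python
import gzip

# First two bytes of every gzip stream (hypothetical constant name).
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decompress raw if it starts with the gzip signature, then decode it."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Simulate a gzip-compressed response body; no network access is needed.
compressed = gzip.compress("<html>斗鱼</html>".encode("utf-8"))

try:
    compressed.decode("utf-8")  # decoding compressed bytes directly fails
except UnicodeDecodeError as exc:
    print(exc)

print(decode_body(compressed))             # compressed input is decompressed first
print(decode_body(b"<html>plain</html>"))  # uncompressed input passes through
```

In the Spider class this would replace the BytesIO and GzipFile lines with a single call such as htmls = decode_body(r.read()), so the code keeps working if the server returns an uncompressed body.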