Notes on fixing garbled Chinese text when scraping gb2312 pages with Scrapy

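The target pages are encoded in gb2312, so the scraped Chinese text came out garbled. I solved this with a middleware, specifically a DownloaderMiddleware.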

    from scrapy.http import HtmlResponse

    class NewsDownloaderMiddleware:
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
            # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            # Rebuild the response with its encoding explicitly set to UTF-8.
            response = HtmlResponse(url=response.url, body=response.body, encoding='utf-8')
            return response

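In other words, the page is handled as UTF-8 from the download stage onward: the response is rebuilt with its encoding set explicitly, so the spider no longer depends on the charset the page declares. The middleware also has to be activated by adding the following line to the settings.py configuration file: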

    DOWNLOADER_MIDDLEWARES = {'news.middlewares.NewsDownloaderMiddleware': 1000}

With this in place, the spider files need no extra transcoding and Chinese text displays correctly.
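If the body bytes really are gb2312 rather than UTF-8, setting the encoding label alone will not change them, and an explicit transcoding step may be needed. Below is a minimal sketch of that step using only the standard library. The helper name `transcode_to_utf8` and the sample markup are my own assumptions for illustration and are not part of the original middleware. It decodes with gb18030, which is designed to be backward compatible with gb2312.

```python
def transcode_to_utf8(body: bytes, source_encoding: str = "gb18030") -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8.

    Hypothetical helper, not part of Scrapy.
    """
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = body.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# Simulate the raw body of a gb2312 page (sample markup is made up).
raw = "<p>中文</p>".encode("gb2312")

# Decoding gb2312 bytes as UTF-8 is what produces the garbled text.
garbled = raw.decode("utf-8", errors="replace")

# After transcoding, the bytes decode cleanly as UTF-8.
fixed = transcode_to_utf8(raw).decode("utf-8")
print(fixed)
```

Inside `process_response`, the result of such a helper could be passed as the `body` argument of `HtmlResponse` together with `encoding='utf-8'`, so that the label and the bytes agree.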

posted @ 2021-03-05 11:18  思何