Outputting and saving Chinese text with Scrapy
1. Decoding Chinese in a JSON file:
#!/usr/bin/python
# coding=utf-8
# author=dahu
import json

with open('huxiu.json', 'r') as f:
    data = json.load(f)
print data[0]['title']
for key in data[0]:
    print '"%s":"%s",' % (key, data[0][key])
Writing Chinese to JSON:
#!/usr/bin/python
# coding=utf-8
# author=dahu
import json

data = {
    "desc": "女友不是你想租想租就能租",
    "link": "/article/214877.html",
    "title": "押金8000元,共享女友门槛不低啊"
}
with open('tmp.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False)  # pass ensure_ascii=False to keep the Chinese readable
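To see exactly what `ensure_ascii` changes, compare the two dumps side by side (a minimal sketch; the title string is taken from the example above):

```python
# -*- coding: utf-8 -*-
import json

title = u"押金8000元,共享女友门槛不低啊"

# Default: every non-ASCII character becomes a \uXXXX escape sequence.
escaped = json.dumps({"title": title})

# With ensure_ascii=False, the Chinese characters are written as-is.
readable = json.dumps({"title": title}, ensure_ascii=False)

print(escaped)   # the title appears as \uXXXX escapes
print(readable)  # the title appears as readable Chinese
```

Both strings parse back to the same data; only the on-disk representation differs.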
2. When Scrapy saves JSON files, non-ASCII text easily ends up as escape sequences.
For example:
$ scrapy crawl huxiu --nolog -o huxiu.json
$ head huxiu.json
[
{"title": "\u62bc\u91d18000\u5143\uff0c\u5171\u4eab\u5973\u53cb\u95e8\u69db\u4e0d\u4f4e\u554a", "link": "/article/214877.html", "desc": "\u5973\u53cb\u4e0d\u662f\u4f60\u60f3\u79df\u60f3\u79df\u5c31\u80fd\u79df"},
{"title": "\u5f20\u5634\uff0c\u817e\u8baf\u8981\u5582\u4f60\u5403\u836f\u4e86", "link": "/article/214879.html", "desc": "\u201c\u8033\u65c1\u56de\u8361\u7740Pony\u9a6c\u7684\u6559\u8bf2\uff1a\u597d\u597d\u7528\u8111\u5b50\u60f3\u60f3\uff0c\u4e0d\u5145\u94b1\uff0c\u4f60\u4eec\u4f1a\u53d8\u5f3a\u5417\uff1f\u201d"},
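Note that the `\uXXXX` sequences are not corruption; they are standard JSON escapes, and `json.loads` recovers the original text. A quick check using a fragment of the first escaped title above:

```python
# -*- coding: utf-8 -*-
import json

# A fragment of the escaped output from huxiu.json above
# (the backslashes are doubled here so they reach json.loads literally).
raw = '{"title": "\\u62bc\\u91d18000\\u5143"}'

data = json.loads(raw)
print(data["title"])  # prints: 押金8000元
```

So the file is technically valid JSON; it is just unreadable to humans, which is what the pipeline below fixes.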
Combining this with the trick above for writing Chinese to JSON:
Change settings.py as follows:
ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
}
That is, uncomment this block (it is commented out in the default settings.py).
Change pipelines.py to the following:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
# import codecs


class CoolscrapyPipeline(object):
    # def __init__(self):
    #     self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        # self.file.write(line)
        with open('data_cn1.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',\n')
        return item
The commented-out lines show an alternative approach. The key point is that once the pipeline is enabled in settings, process_item is called automatically for every scraped item, so we can save the data in whatever format we want.
Now run in the terminal:
scrapy crawl huxiu --nolog
If you still pass -o file.json, both file.json and the file defined in the pipeline are generated, but file.json will still contain the escaped \uXXXX sequences.
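As a side note, newer Scrapy releases (1.2 and later) expose a FEED_EXPORT_ENCODING setting; if your version supports it, one line in settings.py makes -o file.json itself write readable UTF-8, with no custom pipeline needed (check your Scrapy version before relying on this):

```python
# settings.py
# Available since Scrapy 1.2: feed exports (-o file.json) are written
# as UTF-8 instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = 'utf-8'
```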
3. Going further
From the analysis above we can draw another conclusion: ITEM_PIPELINES in settings controls which pipelines run. What if we enable several at once:
ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
    'coolscrapy.pipelines.CoolscrapyPipeline1': 300,
}
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
# import codecs


class CoolscrapyPipeline(object):
    # def __init__(self):
    #     self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        # self.file.write(line)
        with open('data_cn1.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',\n')
        return item


class CoolscrapyPipeline1(object):
    def process_item(self, item, spider):
        with open('data_cn2.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',hehe\n')
        return item
Run:
$ scrapy crawl huxiu --nolog
$ head -n 2 data_cn*
==> data_cn1.json <==
{"title": "押金8000元,共享女友门槛不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},
{"title": "张嘴,腾讯要喂你吃药了", "link": "/article/214879.html", "desc": "“耳旁回荡着Pony马的教诲:好好用脑子想想,不充钱,你们会变强吗?”"},

==> data_cn2.json <==
{"title": "押金8000元,共享女友门槛不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},hehe
{"title": "张嘴,腾讯要喂你吃药了", "link": "/article/214879.html", "desc": "“耳旁回荡着Pony马的教诲:好好用脑子想想,不充钱,你们会变强吗?”"},hehe
As you can see, both files were generated, and in exactly the format we wanted!