July 2018 post archive - my8100

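scrapy_redis: Updating score/request.priority with multiple threads

Summary: 0. Background. When crawling with scrapy_redis, forgetting to set request.priority or setting it incorrectly (a Rule can also set request.priority through its process_request parameter) leaves the requests that extract items at the tail of the sorted set xxx:requests, where they keep occupying memory.

posted @ 2018-07-26 18:52 my8100 Views (567) Comments (0) Recommendations (0)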
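The effect described in this summary can be sketched without Redis. The snippet below is a minimal pure-Python stand-in, assuming the scrapy_redis priority queue stores each request with score = -request.priority and pops the lowest score first; the helper names `score_for` and `pop_order` and the example.com URLs are hypothetical.

```python
def score_for(priority):
    # Assumption: score = -request.priority, so a higher priority
    # yields a lower score and is popped earlier.
    return -priority


def pop_order(requests):
    """Return URLs in the order a lowest-score-first queue would pop them."""
    scored = [(score_for(priority), url) for url, priority in requests]
    # sorted() is stable, so requests with equal scores keep insertion order.
    return [url for _, url in sorted(scored, key=lambda pair: pair[0])]


# Item request left at the default priority 0, listing pages given priority 10.
unfixed = pop_order([
    ("https://example.com/item/1", 0),
    ("https://example.com/list?page=2", 10),
    ("https://example.com/list?page=3", 10),
])

# Same requests after raising the item request's priority above the listings.
fixed = pop_order([
    ("https://example.com/item/1", 20),
    ("https://example.com/list?page=2", 10),
    ("https://example.com/list?page=3", 10),
])
```

With the default priority the item request is popped last, so it stays in the set as long as new listing requests keep arriving. Raising its priority moves it to the front.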

Scrapy middleware extension: Re-requesting through a proxy for specific response status codes

Summary: 0. References https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect https://doc.scrapy.org/en/latest/

posted @ 2018-07-18 18:47 my8100 Views (5480) Comments (0) Recommendations (1)

Scrapy middleware extension: Submitting items to MySQL in batches, synchronously or asynchronously

Summary: 0. References https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=mongo#write-items-to-mongodb Added 2018-07-21: asynchronous version https://twistedmatrix.com/docum

posted @ 2018-07-18 12:55 my8100 Views (2582) Comments (0) Recommendations (0)

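Scrapy hidden bug: After a forced shutdown, the number of saved requests read from requests.queue may be wrong

Summary: The problem description and a solution have been submitted to Scrapy issues: The size of requests.queue may be wrong when resuming crawl from unclean shutdown. #3333

posted @ 2018-07-16 09:39 my8100 Views (672) Comments (0) Recommendations (0)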

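Scrapyd improvement, step 2: Adding STOP and START links to the Web Interface to call the Scrapyd API with one click

Summary: 0. The problem. Scrapyd provides the APIs below for starting and stopping a project. Building on "Scrapyd improvement, step 1: Adding charset=UTF-8 to the Web Interface to avoid garbled Chinese when viewing logs", the plan is to go further and add START and STOP links to the page. http://scrapyd.readt

posted @ 2018-07-15 18:47 my8100 Views (1117) Comments (0) Recommendations (0)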
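As a rough illustration of what a START or STOP link would have to call, the sketch below builds the POST requests for Scrapyd's schedule.json and cancel.json endpoints. It is not the post's implementation: the helper `scrapyd_call`, the base URL http://localhost:6800/, and the project, spider, and job names are all assumptions made for the example.

```python
from urllib.parse import urlencode, urljoin

# Map the link label to the Scrapyd JSON endpoint it would call.
ENDPOINTS = {"start": "schedule.json", "stop": "cancel.json"}


def scrapyd_call(base_url, action, **params):
    """Return (url, body) for a POST to a Scrapyd JSON endpoint."""
    url = urljoin(base_url, ENDPOINTS[action])
    body = urlencode(params).encode("utf-8")
    return url, body


# Hypothetical project, spider, and job id; replace with real values.
start_url, start_body = scrapyd_call(
    "http://localhost:6800/", "start", project="myproject", spider="myspider"
)
stop_url, stop_body = scrapyd_call(
    "http://localhost:6800/", "stop", project="myproject", job="hypothetical-job-id"
)
```

Each pair can then be sent with `urllib.request.urlopen(url, data=body)`, which issues a POST because `data` is set.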

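Scrapyd improvement, step 1: Adding charset=UTF-8 to the Web Interface to avoid garbled Chinese when viewing logs

Summary: 0. Symptom and cause. As the figure below shows, the log links in Scrapyd's Web Interface point directly at the log files, and the Content-Type in the Response Headers does not declare the character set charset=UTF-8, so non-ASCII characters appear garbled when a log is viewed in the browser. 1.解

posted @ 2018-07-15 16:18 my8100 Views (1538) Comments (0) Recommendations (0)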
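The cause named in this summary can be reproduced in a few lines. The sketch below is only an illustration, not the change made to Scrapyd: `with_utf8_charset` is a hypothetical helper, and latin-1 stands in for whatever fallback encoding a browser might choose when no charset is declared.

```python
def with_utf8_charset(content_type):
    """Append charset=UTF-8 to a text Content-Type that does not declare one."""
    lowered = content_type.lower()
    if not lowered.startswith("text/"):
        return content_type  # leave binary types untouched
    if "charset=" in lowered:
        return content_type  # respect an existing declaration
    return content_type + "; charset=UTF-8"


# UTF-8 bytes of a log line containing non-ASCII text.
raw = "爬虫 started".encode("utf-8")

# Decoding with the wrong encoding yields garbled text.
garbled = raw.decode("latin-1")

# Decoding with the declared charset recovers the original line.
restored = raw.decode("utf-8")

header = with_utf8_charset("text/plain")
```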

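scrapy_redis: Migrating crawl progress saved in jobdir to Redis

Summary: 0. References: "Scrapy hidden bug: After a forced shutdown, the number of saved requests read from requests.queue may be wrong". 1. Notes. With jobdir set in Scrapy, stopping the crawler leaves the following directory structure: the file requests.queue/p0 holds the unscheduled requests with priority=0

posted @ 2018-07-11 19:07 my8100 Views (1292) Comments (0) Recommendations (1)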
