October 2017 post archive - my8100

Python re: regular expressions, simple and sufficient

Summary: 0. 1. References: "Python Regular Expression Guide" https://docs.python.org/2/library/re.html https://docs.python.org/2/howto/regex.html https://docs.python.org/3/library/re.html 2

posted @ 2017-10-30 10:34 my8100 Views (334) Comments (0) Recommendations (0)

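Python: batch-loading HTML with wkhtmltopdf and pdfkit to generate PDFs, suitable for blog backups and packaging official documentation

Summary: 0. 1. References: "Python crawler: converting Liao Xuefeng's tutorial into a PDF e-book" https://github.com/lzjun567/crawler_html2pdf wkhtmltopdf is a very good tool that converts HTML to PDF on multiple platforms; pdfkit is wkhtmltopd

posted @ 2017-10-28 18:53 my8100 Views (13180) Comments (1) Recommendations (0)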

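requests: the header 'Content-Type': 'text/html' causes encoding to be misdetected as 'ISO-8859-1', so Chinese text is decoded incorrectly

Summary: 0. Accessing baidu with requests without setting a UA gives an r.headers['Content-Type'] of text/html; with a Chrome UA: Content-Type:text/html; charset=utf-8 1. References: "Code analysis of the Chinese encoding problem in the Python requests library" is

posted @ 2017-10-26 16:22 my8100 Views (3271) Comments (0) Recommendations (0)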
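
The mismatch described in the entry above can be reproduced without any network access. The following is a minimal sketch, assuming the server sent UTF-8 bytes while the client fell back to ISO-8859-1 because the Content-Type header carried no charset. The sample string and variable names are illustrative and not taken from the original post.

```python
# Simulate a UTF-8 response body that is decoded with the wrong fallback encoding.
original = "中文"                      # hypothetical page text
raw_bytes = original.encode("utf-8")    # bytes as the server would send them

# Decoding with ISO-8859-1 never raises, but it yields mojibake.
garbled = raw_bytes.decode("iso-8859-1")

# Re-encoding with the same wrong codec recovers the raw bytes,
# which can then be decoded correctly as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

With requests itself, a common remedy is to assign `r.encoding` (for example from `r.apparent_encoding`) before reading `r.text`; whether that matches the fix in the original post is not visible from the summary.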

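Python: extracting web page tables and saving them as CSV

Summary: 0. 1. References: W3C HTML tables; table tags. Locating table elements: the page source turns out to have no thead or tbody... 2. Extracting table data: a table title may contain a hyperlink, which splits the title, or the table may have no title at all. Line breaks in table content; tag patterns. 2.1 Extract the list of all table titles 2.2 Write each table to its own CSV file. Code at

posted @ 2017-10-22 16:11 my8100 Views (13157) Comments (0) Recommendations (0)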
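
The steps named in the entry above (collect rows per table, keep cell text together when a link or line break splits it, write each table to its own CSV) can be sketched with the standard library alone. This is an assumption-laden illustration, not the code from the original post: the `TableParser` class, the sample HTML, and the in-memory buffer are all hypothetical, and nested tables are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by row and by table."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables, each a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "br" and self._cell is not None:
            self._cell.append(" ")  # treat a line break inside a cell as a space

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments so text split by nested tags (e.g. links) stays in one cell,
            # then collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


html = """
<table>
  <tr><th>Name</th><th><a href="#">Home</a> page</th></tr>
  <tr><td>a</td><td>line one<br>line two</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Write each table to its own CSV; an in-memory buffer stands in for a file here.
csv_texts = []
for rows in parser.tables:
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    csv_texts.append(buffer.getvalue())

print(csv_texts[0])
```

Because the parser only reacts to table, tr, td, th and br tags, it works the same whether or not the page wraps rows in thead or tbody.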

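The HTML character entity &nbsp; denotes a non-breaking space, whose unicode representation is u'\xa0'; beyond the range of gbk encoding?

Summary: 0. Contents 1. References 2. Locating the problem: the unicode representation of the non-breaking space is u'\xa0', beyond the range of gbk encoding? 3. How to handle it: .extract_first().replace(u'\xa0', u' ').strip().encode('utf-8','replace') 1. References Beautiful Soup a

posted @ 2017-10-22 13:06 my8100 Views (6082) Comments (0) Recommendations (0)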
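
The question raised in the entry above can be checked directly. The following is a small sketch in Python 3 syntax (the original post uses Python 2 `u''` literals); the sample string is hypothetical. It tries to encode a string containing the non-breaking space as gbk, then applies the same replace-and-strip cleanup quoted in the summary.

```python
# Hypothetical scraped string containing a non-breaking space (U+00A0).
text = "price:\xa0100 "

# Check whether Python's gbk codec can encode the string as-is.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Replace the non-breaking space with an ordinary space, then strip.
cleaned = text.replace("\xa0", " ").strip()
encoded = cleaned.encode("gbk")

print(gbk_ok, encoded)
```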

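Scrapy Selectors

Summary: 0. 1. References: 《用Python写网络爬虫》 (Web Scraping with Python), section 2.2, three approaches to scraping a web page: re / lxml / BeautifulSoup. Note that lxml internally converts CSS selectors into equivalent XPath selectors. The results show that, when scraping our example page, Beautiful Soup, compared with the other two approaches,

posted @ 2017-10-20 17:33 my8100 Views (3184) Comments (1) Recommendations (1)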

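Scrapy related: Splash in practice

Summary: 0. 1. References https://github.com/scrapy-plugins/scrapy-splash#configuration (treat this as authoritative) "Scrapy related: installing Splash, a JavaScript rendering service" 2. Practice 2.1 After creating a new project, modify setting

posted @ 2017-10-19 17:56 my8100 Views (628) Comments (0) Recommendations (0)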

Scrapy related: installing Splash, a JavaScript rendering service

Summary: 0. splash: mermaid; to splash, to spatter 1. References: "First experience using Splash"; "Installing docker on Windows" https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/ Splash is ou

posted @ 2017-10-19 17:45 my8100 Views (1993) Comments (0) Recommendations (0)

MongoDB and its use with Scrapy

Summary: 0 1. Using MongoDB with Scrapy https://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb Write items to MongoDB In this example we’ll w

posted @ 2017-10-18 12:11 my8100 Views (1138) Comments (0) Recommendations (0)

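Summary: 0. Symptom: crawled items are written to a JSON Lines (.jl) file. Each item is converted to str with the default ensure_ascii = True, so non-ASCII characters are converted to `\uXXXX`, and each '{xxx}' unit is written to the file. Goal: note that the result should finally be confirmed by opening it in chrome or notepad++, fire

posted @ 2017-10-16 18:30 my8100 Views (5515) Comments (1) Recommendations (1)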
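
The `ensure_ascii` behaviour described in the entry above can be seen with the standard `json` module alone. A minimal sketch, assuming a hypothetical item dictionary; it is not the code from the original post.

```python
import json

item = {"title": "中文", "id": 1}  # hypothetical crawled item

# Default: ensure_ascii=True, so non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as-is in the output string.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same object, so the difference only affects how the file reads to a person. In Scrapy, setting `FEED_EXPORT_ENCODING = 'utf-8'` is one way to get unescaped output from the feed exporters; whether the original post used it is not visible from the summary.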

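wb blacklist batch operations

Summary: 0. References: yu961549745/WeiboBlackList, Weibo batch blocking 1. Code block.py Updates: multithreading; urllib.request replaced with requests + session. Reading the cookie from firefox or chrome instead would be more convenient, but I could not be bothered to change it

posted @ 2017-10-11 12:21 my8100 Views (579) Comments (0) Recommendations (0)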
