Web Crawler - Post Category - aaronthon

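Using and Understanding scrapy-redis

Summary: scrapy-redis is a Redis-based component for Scrapy that makes it quick to build a simple distributed crawler. In essence, the component provides three features: scheduler (the scheduler), dupefilter (URL deduplication rules, used by the scheduler), and pipeline (data persistence). scrapy-redis components 1. URL ... Read more

posted @ 2018-07-23 12:38 aaronthon Views(3466) Comments(0) Recommended(2)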
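
The three features named in the summary above map onto three Scrapy settings. What follows is a minimal `settings.py` sketch rather than the post's own configuration: the class paths are the ones published by the scrapy-redis project, while the Redis URL and the pipeline priority of 300 are illustrative assumptions.

```python
# settings.py (sketch): switch a Scrapy project over to the scrapy-redis components.

# Scheduler: keeps the request queue in Redis so several crawler processes can share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Dupefilter: stores request fingerprints in Redis; the scheduler consults it
# before enqueueing a request.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Pipeline: pushes scraped items into Redis for later processing.
# The priority value 300 is an assumption chosen for illustration.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Assumed connection string for a Redis server running locally.
REDIS_URL = "redis://localhost:6379"
```

With these settings, every process that runs the same spider reads from and writes to the same Redis-backed queue, which is what makes the crawl distributed.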

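A Detailed Guide to Scrapy for Crawlers

Summary: Performance. When writing a crawler, most of the cost lies in I/O requests. Requesting URLs in a single process with a single thread inevitably means waiting on each one, which slows down the whole run. import requests def fetch_async(url): response = requests.get(url) return response url_l Read more

posted @ 2018-04-27 16:34 aaronthon Views(1015) Comments(1) Recommended(2)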
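
One common way around the waiting described in the summary above is to issue the blocking requests from a thread pool. The sketch below is an illustration, not the post's code: `fetch` is a hypothetical stand-in that sleeps instead of calling `requests.get(url)`, so the example runs without network access, and `fetch_all` and `max_workers=5` are assumed names and values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a blocking call such as requests.get(url)."""
    time.sleep(0.1)  # simulate time spent waiting on the network
    return f"response for {url}"


def fetch_all(urls, max_workers=5):
    """Run fetch() for every URL concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    start = time.perf_counter()
    results = fetch_all(urls)
    print(len(results), "responses in", round(time.perf_counter() - start, 2), "s")
```

Because the threads spend their time waiting rather than computing, the waits overlap instead of adding up one after another.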

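A Detailed Guide to requests for Crawlers

Summary: requests. The Python standard library provides modules such as urllib, urllib2, and httplib for HTTP requests (these are the Python 2 names; Python 3 uses urllib.request and http.client), but its API is terrible. It was built for a different time and a different web. It requires an enormous amount of work, even method overrides, to accomplish the simplest tasks. Requests uses the Apache2 License Read more

posted @ 2018-04-26 21:31 aaronthon Views(790) Comments(0) Recommended(2)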
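
To make the comparison in the summary above concrete, here is a small sketch of the manual steps the standard library asks for when building a GET request with query parameters. The URL, parameters, and User-Agent string are placeholders chosen for illustration, and `build_get` is a hypothetical helper. Nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(url, params):
    """Build a GET request by hand: encode the query string, then attach headers."""
    full_url = f"{url}?{urlencode(params)}"
    return Request(
        full_url,
        headers={"User-Agent": "example-crawler/0.1"},  # placeholder value
        method="GET",
    )


req = build_get("https://example.com/search", {"q": "scrapy", "page": 1})
print(req.full_url)  # https://example.com/search?q=scrapy&page=1

# With Requests, the encoding step is folded into a single call:
#   requests.get(url, params=params, headers=headers)
```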

Scraping Douyin Videos

Summary: import requests user_id = '58841646784' # 6556303280 # Get all of a user's posts """ signature = _bytedAcrawler.sign('user ID') douyin_falcon:node_modules/byted-acrawler/dist/runtime """ import subprocess signat... Read more

posted @ 2018-04-25 09:17 aaronthon Views(1343) Comments(0) Recommended(0)

Scraping Lagou

Summary: import re import requests all_cookie_dict = {} # ##################################### Step 1: visit the login page ##################################### r1 = requests.get( url='https://passport.lagou.com/login/l... Read more

posted @ 2018-04-24 17:16 aaronthon Views(201) Comments(0) Recommended(0)

Logging in to GitHub Automatically with a Crawler

Summary: import requests from bs4 import BeautifulSoup r1 = requests.get( url='https://github.com/login' ) s1 = BeautifulSoup(r1.text, 'html.parser') token = s1.find(name='input', attrs={'name': 'authent... Read more

posted @ 2018-04-24 10:14 aaronthon Views(368) Comments(0) Recommended(0)
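
The summary above fetches the login page and then looks up a hidden input in the form. The sketch below shows that lookup step in isolation. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and it works on a made-up HTML snippet: the field name `csrf_token`, its value, and the `HiddenInputParser` class are all hypothetical, not taken from GitHub or from the post.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value", "")


# Hypothetical login form used only for this example.
SAMPLE_LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""

parser = HiddenInputParser()
parser.feed(SAMPLE_LOGIN_HTML)
token = parser.fields["csrf_token"]
print(token)  # abc123
```

In a login flow of this kind, the extracted token is typically posted back together with the credentials, using the cookies received with the login page.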

Scraping cnblogs Blog Posts

Summary: # import os import requests from bs4 import BeautifulSoup # Log in, imitating the user's browser r1 = requests.get( # The cnblogs page to scrape url='https://zzk.cnblogs.com/s/blogpost?Keywords=blog%3aaronthon%201&pageindex=9', #... Read more

posted @ 2018-04-23 11:09 aaronthon Views(140) Comments(0) Recommended(0)

Scraping Jandan Articles

Summary: # import os import requests from bs4 import BeautifulSoup r1 = requests.get( url='http://jandan.net/', # Browser information headers={ 'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleW... Read more

posted @ 2018-04-23 10:11 aaronthon Views(161) Comments(0) Recommended(0)

Scraping Chouti Hot List Articles

Summary: import os import requests from bs4 import BeautifulSoup # Log in, imitating the user's browser r1 = requests.get( # The page to scrape url='https://dig.chouti.com/', # Browser information headers={ 'user-agent':'Mozilla/5.0 (... Read more

posted @ 2018-04-22 11:03 aaronthon Views(402) Comments(0) Recommended(0)

Preparation

Summary: 1. Download BeautifulSoup and requests 1. First go to https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, download the file Twisted‑18.7.0‑cp36‑cp36m‑win_amd64.whl, and save it to a folder. 2. Open cmd ... Read more

posted @ 2018-04-21 21:00 aaronthon Views(142) Comments(0) Recommended(0)

Crawler Example

Summary: import requests import re import json def getPage(url): response=requests.get(url) return response.text def parsePage(s): com=re.compile('<div class=" Read more

posted @ 2018-04-20 18:46 aaronthon Views(169) Comments(0) Recommended(0)
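
The summary above splits the crawler into a fetch step and a regex-based parse step. The parse step can be sketched on its own as follows. The post's actual pattern is cut off in the summary, so the markup, the pattern, and the field names `title` and `score` here are all hypothetical, chosen only to show the technique of named groups with `re.finditer`.

```python
import re

# Hypothetical markup standing in for a fetched page.
SAMPLE_HTML = """
<div class="item"><span class="title">First</span><span class="score">9.1</span></div>
<div class="item"><span class="title">Second</span><span class="score">8.7</span></div>
"""

# re.S lets "." match newlines, so one item may span several lines.
ITEM_PATTERN = re.compile(
    r'<div class="item">.*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="score">(?P<score>.*?)</span>',
    re.S,
)


def parse_page(html):
    """Yield one dict per matched item, using the named groups of the pattern."""
    for match in ITEM_PATTERN.finditer(html):
        yield {"title": match.group("title"), "score": float(match.group("score"))}


items = list(parse_page(SAMPLE_HTML))
print(items)
```

Regular expressions work for markup this regular; for less predictable pages, an HTML parser is usually the sturdier choice.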
