Post category - Crawlers / Scrapy Framework

Summary: Command reference: [https://github.com/scrapy/scrapyd-client](https://github.com/scrapy/scrapyd-client) [https://scrapyd.readthedocs.io](https://scrapyd.readthedocs.io) Read more
posted @ 2023-07-17 11:49 运维爱背锅 Views(121) Comments(0) Recommended(0)
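A minimal deployment sketch against the docs linked above; the project and spider names are illustrative, and it assumes the project's scrapy.cfg already has a `[deploy]` section pointing at a running scrapyd:

```Bash
pip install -U scrapyd-client
# push the project described by scrapy.cfg to the default deploy target
scrapyd-deploy -p scrapy_demo
# then schedule a run through scrapyd's HTTP JSON API
curl http://localhost:6800/schedule.json -d project=scrapy_demo -d spider=demo
```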
Summary: Scrapy has built-in support for crawling over HTTP/2: [https://docs.scrapy.org/en/latest/topics/settings.html?highlight=H2DownloadHandler#download-handlers-base](https://docs.scrapy.org/en/latest/topics/settings.html?highlight=H2DownloadHandler#download-handlers-base) Read more
posted @ 2023-07-17 11:47 运维爱背锅 Views(172) Comments(0) Recommended(0)
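Per the settings docs linked above, HTTP/2 is enabled by swapping in the https download handler; a sketch of the relevant settings.py lines (this also requires Twisted's http2 extra to be installed):

```Python
# settings.py — route https requests through Scrapy's HTTP/2 handler
DOWNLOAD_HANDLERS = {
    "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
}
```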
Summary: **Advanced method:** **Basic method:** pass arguments with -a when running the spider ```Bash scrapy crawl <spider_name> -a key=value ``` then read the kwargs in the spider class's __init__ magic method ```Python class Bang123Spider(RedisCrawlSpid Read more
posted @ 2023-07-17 11:44 运维爱背锅 Views(27) Comments(0) Recommended(0)
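A minimal sketch of the basic (-a) approach; the spider name and the `category` argument are made up for illustration:

```Python
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapy crawl demo -a category=books  ->  kwargs["category"]
        self.category = kwargs.get("category", "all")

    def start_requests(self):
        # hypothetical URL, just to show the argument being used
        yield scrapy.Request(f"https://example.com/{self.category}")

    def parse(self, response):
        self.logger.info("crawled %s", response.url)
```

Run it with `scrapy crawl demo -a category=books`. Scrapy's base Spider also assigns -a arguments as attributes automatically, so the explicit __init__ is mainly useful for defaults and validation.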
Summary: Define the configuration options in settings.py ```Python MONGODB_HOST = "127.0.0.1" MONGODB_PORT = 27017 MONGODB_DB_NAME = "bang123" ``` pipelines.py: ```Python from scrapy.pipeli Read more
posted @ 2023-07-17 11:44 运维爱背锅 Views(20) Comments(0) Recommended(0)
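A sketch of what such a pipeline can look like with pymongo, reading those settings via from_crawler; the class name is illustrative, and it still needs to be registered under ITEM_PIPELINES:

```Python
import pymongo
from itemadapter import ItemAdapter

class MongoDBPipeline:
    def __init__(self, host, port, db_name):
        self.host, self.port, self.db_name = host, port, db_name

    @classmethod
    def from_crawler(cls, crawler):
        # pull the custom options defined in settings.py
        return cls(
            host=crawler.settings.get("MONGODB_HOST", "127.0.0.1"),
            port=crawler.settings.getint("MONGODB_PORT", 27017),
            db_name=crawler.settings.get("MONGODB_DB_NAME", "bang123"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.host, self.port)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one collection per item class, storing a plain dict copy
        self.db[type(item).__name__].insert_one(ItemAdapter(item).asdict())
        return item
```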
Summary: Scrapy's strengths are efficiency and asynchronous I/O, so bolting Selenium onto it is of limited value... because Selenium is slow... Example: fetching the titles of the recommended products on the Taobao homepage. Spider class taobao.py ```Python import scrapy from scrapy.http import HtmlRespon Read more
posted @ 2023-07-17 11:42 运维爱背锅 Views(49) Comments(0) Recommended(0)
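One common integration point (not necessarily the one used in the post) is a downloader middleware that renders each page in a headless browser and hands Scrapy the final HTML; a sketch, with the caveat that this serializes downloads and gives up most of Scrapy's async speed:

```Python
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    """Render each request in headless Chrome and short-circuit the downloader."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # returning a Response here skips Scrapy's own (async) download
        return HtmlResponse(
            url=request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
```

The middleware still has to be enabled in DOWNLOADER_MIDDLEWARES for it to run.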
Summary: Install the package ```Bash pip install -U scrapy-redis ``` settings.py ```Python ##### Scrapy-Redis ##### ### Point Scrapy at Redis ### # the other defaults live in scrapy_redis.default.py Read more
posted @ 2023-07-17 11:40 运维爱背锅 Views(62) Comments(0) Recommended(0)
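The essential switches, per scrapy-redis's documented settings, look roughly like this; the Redis URL is a local-default assumption:

```Python
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # queue requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared dedupe fingerprints
SCHEDULER_PERSIST = True                # keep queue + dupefilter between runs
REDIS_URL = "redis://127.0.0.1:6379/0"  # assumed local Redis
```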
Summary: See the official documentation: [https://docs.scrapy.org/en/latest/topics/jobs.html?highlight=JOBDIR#jobs-pausing-and-resuming-crawls](https://docs.scrapy.org/en/latest/topics/jobs.html?highlight=JOBDIR#jobs-pausing-and-resuming-crawls) Read more
posted @ 2023-07-17 11:39 运维爱背锅 Views(465) Comments(0) Recommended(0)
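The documented pattern is a persistent job directory passed via the JOBDIR setting; the spider name and directory below are illustrative:

```Bash
# first run: pending requests and seen-request state persist under crawls/demo-1
scrapy crawl demo -s JOBDIR=crawls/demo-1
# stop gracefully with a single Ctrl-C (a second one force-kills and loses state),
# then resume later with the very same JOBDIR
scrapy crawl demo -s JOBDIR=crawls/demo-1
```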
Summary: A CrawlSpider automatically discovers URLs according to its specified rules and crawls them. Pros: well suited to whole-site crawling and automatic pagination. Cons: hard to pass values along via meta, so it only fits pages where all the data can be collected from a single page. ```Python import scrapy from scrapy.http import HtmlRespon Read more
posted @ 2023-07-17 11:38 运维爱背锅 Views(25) Comments(0) Recommended(0)
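A minimal CrawlSpider sketch with hypothetical URL patterns, showing the rules doing the link discovery (a CrawlSpider must not override parse, which is why the callback gets its own name):

```Python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    start_urls = ["https://example.com/"]

    rules = (
        # follow pagination links without a callback: crawl-only
        Rule(LinkExtractor(allow=r"/page/\d+")),
        # detail pages get parsed
        Rule(LinkExtractor(allow=r"/post/\d+"), callback="parse_post"),
    )

    def parse_post(self, response):
        yield {"title": response.css("h1::text").get()}
```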
Summary: num = 0 ```Python import scrapy from scrapy.http import HtmlResponse from scrapy_demo.items import DoubanItem """ This example is mainly about passing values between callbacks via meta. """ class DoubanSpi Read more
posted @ 2023-07-17 11:36 运维爱背锅 Views(6) Comments(0) Recommended(0)
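The core of the technique, sketched with made-up selectors: stash a half-built item in meta on the way out, and pick it up in the next callback:

```Python
import scrapy

class ListSpider(scrapy.Spider):
    name = "list"
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        for row in response.css("li.item"):
            item = {"title": row.css("a::text").get()}
            # carry the partially-filled item along to the detail page
            yield response.follow(
                row.css("a::attr(href)").get(),
                callback=self.parse_detail,
                meta={"item": item},
            )

    def parse_detail(self, response):
        item = response.meta["item"]  # read it back on the other side
        item["body"] = response.css("div.body::text").get()
        yield item
```

Recent Scrapy versions also offer cb_kwargs for this, which keeps meta free for framework keys such as proxy.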
Summary: Suppose we have defined an IP proxy pool in settings.py ```Python ##### Custom settings IP_PROXY_POOL = ( "127.0.0.1:6789", "127.0.0.1:6789", "127.0.0.1:6789", "127.0.0.1:6789", ) ``` To use it in the spider file Read more
posted @ 2023-07-17 11:36 运维爱背锅 Views(109) Comments(0) Recommended(0)
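A sketch of consuming that pool from a downloader middleware (the middleware name is illustrative, and it must be enabled in DOWNLOADER_MIDDLEWARES):

```Python
import random

class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # read the IP_PROXY_POOL tuple defined in settings.py
        return cls(crawler.settings.getlist("IP_PROXY_POOL"))

    def process_request(self, request, spider):
        # request.meta["proxy"] is the hook Scrapy's HTTP downloader honours
        request.meta["proxy"] = "http://" + random.choice(self.proxies)
```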
Summary: ```Python # Scrapy settings for scrapy_demo project # # For simplicity, this file contains only settings considered important or # commonly used. You Read more
posted @ 2023-07-17 11:35 运维爱背锅 Views(24) Comments(0) Recommended(0)
Summary: ```Python import scrapy from scrapy.http.request import Request from scrapy.http.response.html import HtmlResponse from scrapy_demo.items import Forum Read more
posted @ 2023-07-17 11:34 运维爱背锅 Views(14) Comments(0) Recommended(0)
Summary: # Create a project **Run the command** ```Bash scrapy startproject <project_name> ``` # **Project structure** ![](https://secure2.wostatic.cn/static/dkJyXRT5EDBrNskNyzpNyY/image.png?auth_key=1689564783 Read more
posted @ 2023-07-17 11:33 运维爱背锅 Views(42) Comments(0) Recommended(0)
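For a concrete run, assuming the scrapy_demo project name used elsewhere in these posts:

```Bash
scrapy startproject scrapy_demo    # creates the project skeleton
cd scrapy_demo
scrapy genspider demo example.com  # generates a first spider stub
```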
Summary: ![](https://secure2.wostatic.cn/static/6mSAqCGta7HpNwgYGG5D13/image.png?auth_key=1689564711-ucXZC28uz1CritVB5QTEff-0-46f7c0a9a3589af32224146e59889692) Read more
posted @ 2023-07-17 11:32 运维爱背锅 Views(9) Comments(0) Recommended(0)
