Scrapy Framework - Post Category - 运维爱背锅

Deploying crawler projects with Scrapyd and scrapyd-client

Summary: Command reference: [https://github.com/scrapy/scrapyd-client](https://github.com/scrapy/scrapyd-client) [https://scrapyd.readthedocs.io](https://scrapyd.readthedocs Read more

posted @ 2023-07-17 11:49 运维爱背锅 Views (121) Comments (0) Recommendations (0)
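
Once a project is deployed, a crawl is usually started through Scrapyd's `schedule.json` endpoint. The sketch below only builds that HTTP request with the standard library. The host `http://localhost:6800`, the project and spider names, and the helper name `build_schedule_request` are illustrative assumptions, not taken from the post.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_schedule_request(base_url, project, spider, **spider_args):
    """Build a POST request for Scrapyd's schedule.json endpoint.

    Extra keyword arguments are sent as additional form fields, which
    Scrapyd forwards to the spider as spider arguments.
    """
    payload = {"project": project, "spider": spider, **spider_args}
    return Request(
        urljoin(base_url, "schedule.json"),
        data=urlencode(payload).encode("utf-8"),
        method="POST",
    )


# Hypothetical names, for illustration only.
request = build_schedule_request(
    "http://localhost:6800", "myproject", "somespider", category="books"
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running Scrapyd instance.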

Crawling HTTP/2 websites with the Scrapy framework

Summary: Scrapy has built-in support for crawling over HTTP/2: [https://docs.scrapy.org/en/latest/topics/settings.html?highlight=H2DownloadHandler#download-handlers-base](https://docs.scrapy Read more

posted @ 2023-07-17 11:47 运维爱背锅 Views (172) Comments (0) Recommendations (0)
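
As a rough illustration of the setting the linked documentation section covers, the following `settings.py` fragment routes `https` requests through Scrapy's HTTP/2 download handler. It assumes Twisted is installed with its HTTP/2 extras, and it is a sketch rather than the post's exact configuration.

```python
# settings.py fragment (sketch).
# Assumes the HTTP/2 extras for Twisted are installed,
# e.g. pip install "Twisted[http2]".
DOWNLOAD_HANDLERS = {
    # Send https:// requests through the HTTP/2 handler.
    "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
}
```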

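How to pass arguments to a spider at startup in Scrapy

Summary: **Advanced method:** **General method:** Pass arguments with -a when running the spider: `scrapy crawl <spider name> -a key=values`, then read them from kwargs in the spider class's `__init__` magic method: `class Bang123Spider(RedisCrawlSpid` Read more

posted @ 2023-07-17 11:44 运维爱背锅 Views (27) Comments (0) Recommendations (0)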

Integrating MongoDB in a Scrapy pipeline

Summary: Set the configuration options in settings.py: `MONGODB_HOST = "127.0.0.1"`, `MONGODB_PORT = 27017`, `MONGODB_DB_NAME = "bang123"`. pipelines.py: `from scrapy.pipeli` Read more

posted @ 2023-07-17 11:44 运维爱背锅 Views (20) Comments (0) Recommendations (0)

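Integrating Selenium with Scrapy: a case study fetching recommended products from the Taobao home page

Summary: Scrapy's strengths are efficiency and asynchronous operation, so integrating Selenium adds little value, because Selenium is slow. Case study: a spider class that fetches the titles of recommended products on the Taobao home page, toabao.py: `import scrapy`, `from scrapy.http import HtmlRespon` Read more

posted @ 2023-07-17 11:42 运维爱背锅 Views (49) Comments (0) Recommendations (0)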

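Implementing a distributed crawler with the scrapy-redis component

Summary: Install the package: `pip install -U scrapy-redis`. settings.py: `##### Scrapy-Redis #####`, `### Redis configuration for Scrapy ###`, `# Other default settings are in scrapy_redis.default.py` Read more

posted @ 2023-07-17 11:40 运维爱背锅 Views (62) Comments (0) Recommendations (0)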
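
The settings block in the excerpt is truncated, so here is a small sketch of the options the scrapy-redis README lists for sharing a queue between workers. The Redis URL assumes a local instance on the default port; the post's actual values may differ.

```python
# settings.py fragment (sketch)

# Store the request queue in Redis instead of in memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all workers through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the seen-set when a crawl stops.
SCHEDULER_PERSIST = True

# Assumed local Redis instance.
REDIS_URL = "redis://127.0.0.1:6379"
```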

Scrapy's built-in JOBDIR setting for pausing and resuming crawls

Summary: See the official documentation: [https://docs.scrapy.org/en/latest/topics/jobs.html?highlight=JOBDIR#jobs-pausing-and-resuming-crawls](https://docs.scrapy.org/en/latest/topics Read more

posted @ 2023-07-17 11:39 运维爱背锅 Views (465) Comments (0) Recommendations (0)
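
For orientation, a minimal sketch of the setting the linked page documents. The directory name is an arbitrary example; the same value can be given per run on the command line instead of in settings.py.

```python
# settings.py fragment (sketch)

# Directory where Scrapy keeps the pending request queue and the
# seen-request fingerprints for one crawl job. Reusing the same
# directory on the next run resumes that job.
# Command-line equivalent:
#   scrapy crawl somespider -s JOBDIR=crawls/somespider-1
JOBDIR = "crawls/somespider-1"
```

Each job needs its own directory, and the crawl should be stopped gracefully (a single Ctrl-C) so that the state is written out.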

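Scrapy CrawlSpider spider class: usage example

Summary: A CrawlSpider automatically discovers URLs according to the specified rules and crawls them. Advantages: well suited to whole-site crawling and automatic pagination. Disadvantages: passing data through meta is relatively difficult, so it only suits cases where all the data can be obtained from a single page. `import scrapy`, `from scrapy.http import HtmlRespon` Read more

posted @ 2023-07-17 11:38 运维爱背锅 Views (25) Comments (0) Recommendations (0)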

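Scrapy request meta parameter: usage example crawling Douban movies

Summary: num = 0 `import scrapy`, `from scrapy.http import HtmlResponse`, `from scrapy_demo.items import DoubanItem`, `""" This example is mainly for learning how to pass data via meta. """`, `class DoubanSpi` Read more

posted @ 2023-07-17 11:36 运维爱背锅 Views (6) Comments (0) Recommendations (0)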

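How to import settings configuration in a Scrapy spider class

Summary: Suppose we define an IP address pool in settings.py: `##### custom settings`, `IP_PROXY_POOL = ( "127.0.0.1:6789", "127.0.0.1:6789", "127.0.0.1:6789", "127.0.0.1:6789", )`. To ... in the spider file Read more

posted @ 2023-07-17 11:36 运维爱背锅 Views (109) Comments (0) Recommendations (0)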

Scrapy settings.py: common configuration

Summary: `# Scrapy settings for scrapy_demo project # # For simplicity, this file contains only settings considered important or # commonly used. You` Read more

posted @ 2023-07-17 11:35 运维爱背锅 Views (24) Comments (0) Recommendations (0)

Scrapy spider file code: basics and detailed explanations

Summary: `import scrapy`, `from scrapy.http.request import Request`, `from scrapy.http.response.html import HtmlResponse`, `from scrapy_demo.items import Forum` Read more

posted @ 2023-07-17 11:34 运维爱背锅 Views (14) Comments (0) Recommendations (0)

Creating a Scrapy project and spider files

Summary: Create the project. **Run the command:** `scrapy startproject`. **Project structure:** ![](https://secure2.wostatic.cn/static/dkJyXRT5EDBrNskNyzpNyY/image.png?auth_key=1689564783 Read more

posted @ 2023-07-17 11:33 运维爱背锅 Views (42) Comments (0) Recommendations (0)

Scrapy framework architecture

Summary: ![](https://secure2.wostatic.cn/static/6mSAqCGta7HpNwgYGG5D13/image.png?auth_key=1689564711-ucXZC28uz1CritVB5QTEff-0-46f7c0a9a3589af32224146e59889692) Read more

posted @ 2023-07-17 11:32 运维爱背锅 Views (9) Comments (0) Recommendations (0)

