Post archive for December 18, 2018 - 北伽

December 18, 2018

Summary: Main program code:

    import scrapy
    from scrapyDemo.items import ScrapydemoItem

    class PostSpider(scrapy.Spider):
        name = 'home'
        # allowed_domains = ['www. ...

posted @ 2018-12-18 18:13 北伽 Views (657) Comments (0) Recommendations (0)

Using Scrapy to drive a browser and fetch NetEase News data

Summary: Spider code:

    import scrapy
    from selenium import webdriver

    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com'] ...

posted @ 2018-12-18 18:09 北伽 Views (329) Comments (0) Recommendations (0)

UA spoofing in the Scrapy framework

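Summary: Example: searching Baidu for "ip" normally shows your own machine's IP; the spider disguises its requests so that they appear to come from another machine. (Strictly speaking, UA spoofing changes the User-Agent header; changing the IP the site sees also requires a proxy.) Spider code:

    import scrapy


    class UatestSpider(scrapy.Spider):
        name = 'UATest'
        # allowed_domains = ['www.xxx.c ...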

posted @ 2018-12-18 18:03 北伽 Views (1119) Comments (0) Recommendations (0)
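The summary of the UA spoofing post is cut off before it shows how the disguise is applied. One common approach in Scrapy, sketched here as an assumption rather than as the post's actual code, is a downloader middleware that rewrites every outgoing request. `RandomUAProxyMiddleware`, `USER_AGENTS`, and `PROXIES` are hypothetical names, and the proxy address is a placeholder, not a working proxy. Setting the `User-Agent` header changes which browser the site believes it is talking to; setting `request.meta['proxy']` is what changes the IP the site sees.

```python
import random

# Hypothetical pool of User-Agent strings; any real browser UA could be used.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.0.1 Safari/605.1.15",
]

# Placeholder proxy address from a documentation IP range (assumption).
PROXIES = ["http://203.0.113.10:8080"]


class RandomUAProxyMiddleware:
    """Downloader middleware sketch.

    Scrapy calls process_request() for each request before it is downloaded.
    """

    def process_request(self, request, spider):
        # Pick a different browser identity for this request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Route the request through a proxy so the site sees the proxy's IP.
        request.meta["proxy"] = random.choice(PROXIES)
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

To take effect, a middleware like this would be registered under `DOWNLOADER_MIDDLEWARES` in the project's `settings.py`, using the project's own module path.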

How to use Selenium in the Scrapy framework

Summary: Main program code:

    import scrapy
    from selenium import webdriver

    class SelenuimtestSpider(scrapy.Spider):
        name = 'selenuimTest'
        # allowed_domains = ['w ...

posted @ 2018-12-18 17:56 北伽 Views (276) Comments (0) Recommendations (0)

Full-site crawling with Scrapy: using CrawlSpider

Summary: Data source: Qiushibaike (糗事百科). Spider code:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class QiubaiSpi ...

posted @ 2018-12-18 17:52 北伽 Views (151) Comments (0) Recommendations (0)

Text classification with Baidu AI natural language processing

Summary: Preface: you need to register and log in on the Baidu AI platform and create a project first. Spider code:

    import scrapy
    from BaiDuAi.items import BaiduaiItem

    class AiSpider(scrapy.Spider):
        name = 'ai'
        # allowed_doma ...

posted @ 2018-12-18 17:48 北伽 Views (931) Comments (0) Recommendations (0)

Two forms of distributed crawler based on scrapy-redis

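Summary: Distributed deployment with Redis. 1. Can the Scrapy framework implement distributed crawling on its own? No, for two reasons. First, Scrapy instances deployed on several machines each have their own scheduler, so the machines cannot divide the URLs in the start_urls list among themselves (they cannot share a single scheduler). Second, the data crawled by several machines cannot go through one shared pipeline to ...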

posted @ 2018-12-18 17:44 北伽 Views (438) Comments (0) Recommendations (0)
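The two obstacles named in the scrapy-redis summary (no shared scheduler, no shared pipeline) are what scrapy-redis addresses by moving both into Redis. A minimal `settings.py` sketch is shown below. The setting names and class paths are those provided by scrapy-redis; the Redis host and port and the pipeline priority are assumptions for a local setup, and the post's own configuration is not visible in the truncated summary.

```python
# settings.py fragment (sketch): every machine running the spider uses the
# same values, so they all talk to the same Redis server.

# Replace Scrapy's per-process scheduler with one backed by a shared Redis queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all machines using fingerprints stored in Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and fingerprints in Redis between runs.
SCHEDULER_PERSIST = True

# Push scraped items into Redis so that all machines write to one place.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 400,  # priority is an assumption
}

# Location of the shared Redis server (assumed local defaults).
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
```

With these settings, each machine pulls requests from the same Redis queue instead of its own in-memory scheduler, and items from every machine end up in the same Redis store.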

北伽

Every day on which we have not danced is a betrayal of life
