Scrapy Framework: The Basics
I. Installation
Precompiled .whl packages for Python modules can be downloaded from https://www.lfd.uci.edu/~gohlke/pythonlibs/; place the downloaded files under your Python Scripts directory.
Scrapy depends on Twisted, which must be downloaded from the site above and placed under the Scripts directory as well:
pip install C:\python\Anaconda3\Twisted-18.7.0-cp36-cp36m-win_amd64.whl
pip install scrapy
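Once both packages are installed, a quick check confirms that Scrapy is available on the command line:

scrapy version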
II. Creating a Scrapy Project
1. PyCharm has no built-in Scrapy integration, so create the project from the command line; when it finishes, open the generated project in a new PyCharm window:
scrapy startproject projectname
2. Create the spider file with one of the following commands:
# command                    spider name   site to crawl
scrapy genspider             baidu         baidu.com
scrapy genspider -t crawl    baidu         baidu.com
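For reference, the default (basic) template generates a skeleton like the one below; the exact contents vary slightly across Scrapy versions, and the -t crawl variant instead produces a CrawlSpider with a rules tuple like the one used in section IV:

import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # extraction logic goes here
        pass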
3. Modify the configuration file:
settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.2.3.6000'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'xiaoshuo_pc.pipelines.XiaoshuoPcPipeline': 300,
}
4. Run the spider:
scrapy crawl name                         # 'name' is the spider's name attribute
scrapy crawl name -o book.json            # export the yielded items to a file (json, xml, or csv)
scrapy crawl name -o book.json -t json    # -t sets the export format explicitly; it can usually be omitted
Note: the first time I ran a spider I hit a "no module named win32api" error. Python does not ship with bindings for the Windows system API, so the third-party pywin32 package is required. Download the build matching your Python version from http://sourceforge.net/projects/pywin32/files%2Fpywin32/, put the installer in the Scripts directory and double-click it to run (or simply pip install pypiwin32).
III. Example: scraping a novel
Create the entry-point file main.py so the spider can be launched from inside PyCharm:
from scrapy.cmdline import execute

execute("scrapy crawl shiqik".split())  # 'shiqik' is the name attribute defined in the spider below

The spider itself:

import scrapy


class ShiqikSpider(scrapy.Spider):
    name = 'shiqik'
    allowed_domains = ['81zw.us']  # must match the crawled site, or follow-up requests get filtered out
    start_urls = ['https://www.81zw.us/book/1379/6970209.html']

    def parse(self, response):
        # chapter title and body text
        title = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace(' ', '\n')
        yield {"title": title, "content": content}
        # the "next chapter" link; on the last chapter it no longer points to a .html page
        next_page = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
        if next_page.find(".html") != -1:
            print("continuing to the next url")
            new_url = response.urljoin(next_page)
            yield scrapy.Request(new_url, callback=self.parse, dont_filter=True)
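The ITEM_PIPELINES setting in section II registers xiaoshuo_pc.pipelines.XiaoshuoPcPipeline, but its body never appears in the post. A minimal sketch of such a pipeline, assuming the goal is simply to append each yielded chapter to one text file, could look like this:

class XiaoshuoPcPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('book.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each chapter: title on its own line, then the content
        self.file.write(item['title'] + '\n')
        self.file.write(item['content'] + '\n\n')
        return item

    def close_spider(self, spider):
        self.file.close()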
IV. Example: scraping a novel with CrawlSpider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BayizhongwenSpider(CrawlSpider):
    name = 'bayizhongwen'
    allowed_domains = ['81zw.us']
    # start_urls = ['https://www.81zw.us/book/1215/863759.html']
    start_urls = ['https://www.81zw.us/book/1215']

    rules = (
        # chapter links in the table of contents
        Rule(LinkExtractor(restrict_xpaths=r'//dl/dd[2]/a'), callback='parse_item', follow=True),
        # the "next chapter" button on each chapter page
        Rule(LinkExtractor(restrict_xpaths=r'//div[@class="bottem1"]/a[3]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace(' ', '\n')
        print({"title": title, "content": content})
        yield {"title": title, "content": content}
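The restrict_xpaths expressions can be verified interactively before running the crawl; scrapy shell fetches the page and exposes the same response object the spider sees:

scrapy shell https://www.81zw.us/book/1215
>>> response.xpath('//dl/dd[2]/a/@href').extract_first()   # first chapter link in the table of contents

The "next chapter" XPath can be checked the same way from any chapter URL.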
The second walkthrough below downloads wallpaper images using Scrapy's built-in ImagesPipeline.

I. Create the project
(venv) C:\Users\noc\PycharmProjects>scrapy startproject tupian
II. Create the spider
(venv) C:\Users\noc\PycharmProjects\tupian>scrapy genspider zol zol.com.cn
III. Update the settings
settings.py:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'tupian.pipelines.TupianPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,  # 'scrapy.contrib.pipeline.images.ImagesPipeline' is the deprecated pre-1.0 path
}
# directory where downloaded images are stored
IMAGES_STORE = 'e:/img'
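Two caveats worth knowing: ImagesPipeline requires Pillow (pip install Pillow), and by default it downloads from an item field named image_urls, saving each file under a SHA1 hash of its URL. If you would rather name files after the page title, one common approach, sketched here as a hypothetical TupianImagesPipeline in pipelines.py (not part of the original post), is to subclass ImagesPipeline and register the subclass in ITEM_PIPELINES instead:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class TupianImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # attach the image name to each download request
        for url in item['image_urls']:
            yield scrapy.Request(url, meta={'name': item['image_name']})

    def file_path(self, request, response=None, info=None):
        # save as <image_name>.jpg instead of the SHA1-hash default
        return request.meta['name'].strip() + '.jpg'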
IV. Create the entry-point file start.py
from scrapy.cmdline import execute

execute("scrapy crawl zol".split())  # 'zol' is the name attribute defined in the spider
V. Spider code
import scrapy


class ZolSpider(scrapy.Spider):
    name = 'zol'
    allowed_domains = ['zol.com.cn']
    start_urls = ['http://desk.zol.com.cn/bizhi/7239_89590_2.html']  # first page of the wallpaper gallery

    def parse(self, response):
        image_url = response.xpath('//img[@id="bigImg"]/@src').extract()  # URL of the current full-size image
        image_name = response.xpath('string(//h3)').extract_first()  # image title
        yield {"image_urls": image_url, "image_name": image_name}  # ImagesPipeline reads the 'image_urls' field by default
        next_page = response.xpath('//a[@id="pageNext"]/@href').extract_first()  # link behind the "next image" button
        if next_page.find('.html') != -1:  # on the last image the link no longer contains .html
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
VI. middlewares.py
from random import choice

from fake_useragent import UserAgent  # third-party package: pip install fake-useragent

from tupian.settings import USER_AGENT


# random User-Agent middleware
class UserAgentDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # alternative: pick randomly from a USER_AGENT list defined in settings.py
        # if self.user_agent:
        #     request.headers.setdefault(b'User-Agent', choice(USER_AGENT))
        request.headers.setdefault(b'User-Agent', UserAgent().random)


# proxy middleware
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # anonymous proxy: request.meta['proxy'] = 'http://ip:port'
        request.meta['proxy'] = 'http://124.235.145.79:80'
        # authenticated proxy: request.meta['proxy'] = 'http://user:passwd@ip:port'
        # request.meta['proxy'] = 'http://398707160:j8inhg2g@139.224.116.10:16816'
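Neither middleware does anything until it is enabled in settings.py; for example:

DOWNLOADER_MIDDLEWARES = {
    'tupian.middlewares.UserAgentDownloaderMiddleware': 543,
    'tupian.middlewares.ProxyMiddleware': 544,
}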