Scrapy Framework: The Basics
I. Installation
Precompiled .whl packages for Python modules can be downloaded from https://www.lfd.uci.edu/~gohlke/pythonlibs/; place the downloaded files under your Python Scripts directory.
Scrapy depends on Twisted, which must be downloaded from the site above and placed under the Scripts directory as well:
pip install C:\python\Anaconda3\Twisted-18.7.0-cp36-cp36m-win_amd64.whl
pip install scrapy
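Once both packages are installed, a quick check confirms that Scrapy is available on the command line:

scrapy version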
II. Creating a Scrapy Project
1. PyCharm has no built-in Scrapy integration, so create the project from the command line; when it finishes, open the generated project in a new PyCharm window:
scrapy startproject projectname
2. Create the spider file with one of the following commands:
# command                    spider name   site to crawl
scrapy genspider             baidu         baidu.com
scrapy genspider -t crawl    baidu         baidu.com
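For reference, the default (basic) template generates a skeleton like the one below; the exact contents vary slightly across Scrapy versions, and the -t crawl variant instead produces a CrawlSpider with a rules tuple like the one used in section IV:

import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # extraction logic goes here
        pass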
3. Modify the configuration file:
settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.2.3.6000'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'xiaoshuo_pc.pipelines.XiaoshuoPcPipeline': 300,
}
4. Run the spider:
scrapy crawl name                         # 'name' is the spider's name attribute
scrapy crawl name -o book.json            # export the yielded items to a file (json, xml, or csv)
scrapy crawl name -o book.json -t json    # -t sets the export format explicitly; it can usually be omitted
Note: the first time I ran a spider I hit a "no module named win32api" error. Python does not ship with bindings for the Windows system API, so the third-party pywin32 package is required. Download the build matching your Python version from http://sourceforge.net/projects/pywin32/files%2Fpywin32/, put the installer in the Scripts directory and double-click it to run (or simply pip install pypiwin32).
III. Example: scraping a novel
Create the entry-point file main.py so the spider can be launched from inside PyCharm:
from scrapy.cmdline import execute

execute("scrapy crawl shiqik".split())  # 'shiqik' is the name attribute defined in the spider below

The spider itself:

import scrapy


class ShiqikSpider(scrapy.Spider):
    name = 'shiqik'
    allowed_domains = ['81zw.us']  # must match the crawled site, or follow-up requests get filtered out
    start_urls = ['https://www.81zw.us/book/1379/6970209.html']

    def parse(self, response):
        # chapter title and body text
        title = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace(' ', '\n')
        yield {"title": title, "content": content}
        # the "next chapter" link; on the last chapter it no longer points to a .html page
        next_page = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
        if next_page.find(".html") != -1:
            print("continuing to the next url")
            new_url = response.urljoin(next_page)
            yield scrapy.Request(new_url, callback=self.parse, dont_filter=True)
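The ITEM_PIPELINES setting in section II registers xiaoshuo_pc.pipelines.XiaoshuoPcPipeline, but its body never appears in the post. A minimal sketch of such a pipeline, assuming the goal is simply to append each yielded chapter to one text file, could look like this:

class XiaoshuoPcPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('book.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each chapter: title on its own line, then the content
        self.file.write(item['title'] + '\n')
        self.file.write(item['content'] + '\n\n')
        return item

    def close_spider(self, spider):
        self.file.close()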
IV. Example: scraping a novel with CrawlSpider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BayizhongwenSpider(CrawlSpider):
    name = 'bayizhongwen'
    allowed_domains = ['81zw.us']
    # start_urls = ['https://www.81zw.us/book/1215/863759.html']
    start_urls = ['https://www.81zw.us/book/1215']

    rules = (
        # chapter links in the table of contents
        Rule(LinkExtractor(restrict_xpaths=r'//dl/dd[2]/a'), callback='parse_item', follow=True),
        # the "next chapter" button on each chapter page
        Rule(LinkExtractor(restrict_xpaths=r'//div[@class="bottem1"]/a[3]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace(' ', '\n')
        print({"title": title, "content": content})
        yield {"title": title, "content": content}
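The restrict_xpaths expressions can be verified interactively before running the crawl; scrapy shell fetches the page and exposes the same response object the spider sees:

scrapy shell https://www.81zw.us/book/1215
>>> response.xpath('//dl/dd[2]/a/@href').extract_first()   # first chapter link in the table of contents

The "next chapter" XPath can be checked the same way from any chapter URL.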
The second walkthrough below downloads wallpaper images using Scrapy's built-in ImagesPipeline.

I. Create the project
(venv) C:\Users\noc\PycharmProjects>scrapy startproject tupian
II. Create the spider
(venv) C:\Users\noc\PycharmProjects\tupian>scrapy genspider zol zol.com.cn
III. Update the settings
settings.py:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'tupian.pipelines.TupianPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,  # 'scrapy.contrib.pipeline.images.ImagesPipeline' is the deprecated pre-1.0 path
}
# directory where downloaded images are stored
IMAGES_STORE = 'e:/img'
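Two caveats worth knowing: ImagesPipeline requires Pillow (pip install Pillow), and by default it downloads from an item field named image_urls, saving each file under a SHA1 hash of its URL. If you would rather name files after the page title, one common approach, sketched here as a hypothetical TupianImagesPipeline in pipelines.py (not part of the original post), is to subclass ImagesPipeline and register the subclass in ITEM_PIPELINES instead:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class TupianImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # attach the image name to each download request
        for url in item['image_urls']:
            yield scrapy.Request(url, meta={'name': item['image_name']})

    def file_path(self, request, response=None, info=None):
        # save as <image_name>.jpg instead of the SHA1-hash default
        return request.meta['name'].strip() + '.jpg'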
IV. Create the entry-point file start.py
from scrapy.cmdline import execute

execute("scrapy crawl zol".split())  # 'zol' is the name attribute defined in the spider
V. Spider code
import scrapy


class ZolSpider(scrapy.Spider):
    name = 'zol'
    allowed_domains = ['zol.com.cn']
    start_urls = ['http://desk.zol.com.cn/bizhi/7239_89590_2.html']  # first page of the wallpaper gallery

    def parse(self, response):
        image_url = response.xpath('//img[@id="bigImg"]/@src').extract()  # URL of the current full-size image
        image_name = response.xpath('string(//h3)').extract_first()  # image title
        yield {"image_urls": image_url, "image_name": image_name}  # ImagesPipeline reads the 'image_urls' field by default
        next_page = response.xpath('//a[@id="pageNext"]/@href').extract_first()  # link behind the "next image" button
        if next_page.find('.html') != -1:  # on the last image the link no longer contains .html
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
VI. middlewares.py
from random import choice

from fake_useragent import UserAgent  # third-party package: pip install fake-useragent

from tupian.settings import USER_AGENT


# random User-Agent middleware
class UserAgentDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # alternative: pick randomly from a USER_AGENT list defined in settings.py
        # if self.user_agent:
        #     request.headers.setdefault(b'User-Agent', choice(USER_AGENT))
        request.headers.setdefault(b'User-Agent', UserAgent().random)


# proxy middleware
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # anonymous proxy: request.meta['proxy'] = 'http://ip:port'
        request.meta['proxy'] = 'http://124.235.145.79:80'
        # authenticated proxy: request.meta['proxy'] = 'http://user:passwd@ip:port'
        # request.meta['proxy'] = 'http://398707160:j8inhg2g@139.224.116.10:16816'
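Neither middleware does anything until it is enabled in settings.py; for example:

DOWNLOADER_MIDDLEWARES = {
    'tupian.middlewares.UserAgentDownloaderMiddleware': 543,
    'tupian.middlewares.ProxyMiddleware': 544,
}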