scrapy

Creating a project

scrapy startproject xxx

cd xxx

scrapy genspider  name  name.com         The first name is the spider's name; the second is the address of the site to crawl, which is used to generate start_urls (the starting URLs)

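Edit items.py             Declare the data you want to scrape

Edit spiders/xxx.py       Write the spider: handle requests and responses, and extract the data (yield item)

Edit pipelines.py         Write the pipeline that processes the items returned by the spider

Edit settings.py          Enable the pipeline components and adjust any other settings

scrapy crawl <spider name>      Run the spider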

allowed_domains = ['chouti.com']   Domains the spider is allowed to crawl. To crawl several sites, add more domains to the list after chouti.com

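Override start_requests to choose the function that handles the initial requests
from scrapy import Request

def start_requests(self):
    # Send one request per start URL and route each response to self.parse
    for url in self.start_urls:
        yield Request(url, callback=self.parse)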

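Extracting data
Selector(response=response).xpath()
    # //   search among all descendants
    # /    search among direct children
    # .//  search among the descendants of the current node
    # /div[@id=""]   a child div whose id equals the given value
    # obj.extract()   convert every matched node to a string and return them as a list
    # obj.extract_first()   convert the matched nodes to strings and return only the first one (None if nothing matched)
    # div/text()   get the text of an element
    # a/@href   get an attribute
    # //a[starts-with(@href,"/all/hot/recent")]/@href   href starts with the given prefix
    # //a[re:test(@href,"/all/hot/recent/\d+")]/@href   href matches a regular expression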


Recursion
 # yield Request(url=url,callback=self.parse)   adds the new URL to the scheduler so that it gets visited
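
To make the idea concrete, the toy loop below imitates what happens after a callback yields new requests: URLs go into a queue, each one is fetched once, and its callback may queue more. It is a standard-library simplification, not Scrapy's scheduler; the PAGES dictionary and the crawl() and parse() functions are invented for illustration.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical site: each URL maps to the relative links found on that page.
PAGES = {
    "https://example.com/all/hot/recent/1": ["/all/hot/recent/2"],
    "https://example.com/all/hot/recent/2": ["/all/hot/recent/1", "/all/hot/recent/3"],
    "https://example.com/all/hot/recent/3": [],
}


def parse(url):
    """Stand-in for a spider callback: yield every link found on the page."""
    for href in PAGES.get(url, []):
        # Relative links must be made absolute before they are requested.
        yield urljoin(url, href)


def crawl(start_url):
    queue = deque([start_url])   # pending requests
    seen = {start_url}           # URLs already queued, so none is fetched twice
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for next_url in parse(url):
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return visited


order = crawl("https://example.com/all/hot/recent/1")
print(order)
```

In a real spider the queue and the duplicate check are handled by Scrapy itself: requests are filtered for duplicates unless dont_filter=True is passed, and response.urljoin(href) builds the absolute URL.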


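Persistence
 # yield Item(name='',title='')   hands the item to the pipelines for processing
pipeline
    class Foo:
        def process_item(self, item, spider):
            # Return the item so that any later pipeline also receives it
            return item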
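
A slightly fuller pipeline might look like the sketch below, which writes each item as one line of JSON. The class name JsonLinesPipeline and the default file name items.jl are hypothetical; open_spider, close_spider and process_item are the hook names Scrapy calls on a pipeline object.

```python
import json


class JsonLinesPipeline:
    """Write every item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so that later pipelines still receive it.
        return item
```

Because the class is plain Python, it can be exercised by hand (calling the three methods with spider=None) before it is registered in settings.py.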

ITEM_PIPELINES = {
    "xx.xx.xx.Foo": 300,   # Registers the pipeline. 300 is its order value (conventionally 0-1000); lower values run first
}
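
The numbers in ITEM_PIPELINES are order values: Scrapy passes each item through the pipelines from the lowest value to the highest. The fragment below is only an illustration of that ordering; both dotted paths are hypothetical, and the sorted() call merely mimics the order, it is not something settings.py needs.

```python
# settings.py fragment; both dotted paths are hypothetical.
ITEM_PIPELINES = {
    "myproject.pipelines.SaveToFilePipeline": 800,
    "myproject.pipelines.CleanFieldsPipeline": 300,
}

# Items pass through the pipelines in ascending order of their number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Here the cleaning step (300) sees each item before the saving step (800), regardless of the order in which the entries are written.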

Selectors

# hxs = response.xpath('//a')

# hxs = response.xpath('//a[2]')

# hxs = response.xpath('//a[@id]')

# hxs = response.xpath('//a[@id="i1"]')

# hxs = response.xpath('//a[@href="link.html"][@id="i1"]')

# hxs = response.xpath('//a[contains(@href, "link")]')

# hxs = response.xpath('//a[starts-with(@href, "link")]')

# hxs = response.xpath(r'//a[re:test(@id, "i\d+")]')

# hxs = response.xpath(r'//a[re:test(@id, "i\d+")]/text()').extract()

# hxs = response.xpath(r'//a[re:test(@id, "i\d+")]/@href').extract()

# hxs = response.xpath('/html/body/ul/li/a/@href').extract()

# hxs = response.xpath('//body/ul/li/a/@href').extract_first()

1. Scraping Douban and saving the result to a file (plain scraping; the data is not handed to any pipeline)

import scrapy


class DoubanSpider(scrapy.Spider):
    # Spider name; start this spider with: scrapy crawl douban
    name = 'douban'
    # Domains the spider is allowed to crawl
    allowed_domains = ['book.douban.com']
    # Start URLs; the spider's first batch of requests is built from this list
    start_urls = ['https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4']

    def parse(self, response):
        content = response.xpath('//li[@class="subject-item"]')
        # Collects the item built for each book
        items = []
        for book in content:
            # .extract_first() returns the first match as a string, or None if nothing matched
            book_name = book.xpath('normalize-space(.//h2//a/text())').extract_first()
            author = book.xpath("normalize-space(.//div[@class='pub']/text())").extract_first()
            score = book.xpath('.//span[@class="rating_nums"]/text()').extract_first()

            # A plain dict serves as the item (the original code never created `item`)
            item = {
                'book_name': book_name,
                'author': author,
                'score': score,
            }
            items.append(item)
        return items

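    #  Commands to run from the command line
    #  scrapy crawl douban  -o filename.json   write the output to a file in JSON format
    #  scrapy crawl douban  -o filename.csv    write the output to a file in CSV format
    #  scrapy list   list the names of the spiders in the current project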

posted @ 2019-05-15 09:42 Yzc_start
