Scrapy Basics: Crawling Data

1. Create a Scrapy project with: scrapy startproject scrapyspider (scrapyspider is the project name)
2. From the project root, generate a spider with: scrapy genspider jobbole blog.jobbole.com (spider name, then the domain to crawl); the project layout this produces is sketched below
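For reference, the project generated by the two commands above (using the names given) looks roughly like this; middlewares.py may or may not appear depending on the Scrapy version:

    scrapyspider/
        scrapy.cfg            # deploy configuration
        scrapyspider/
            __init__.py
            items.py          # item definitions (step 6)
            middlewares.py    # downloader/spider middlewares
            pipelines.py      # export pipelines (step 7)
            settings.py       # project settings
            spiders/
                __init__.py
                jobbole.py    # generated by scrapy genspider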
3. Open the project in PyCharm; the generated spider template looks like this:

    import scrapy


    class JobboleSpider(scrapy.Spider):
        name = 'jobbole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/all-posts/']

        def parse(self, response):
            pass

4. In the template above, parse is the function that handles the response for each URL in start_urls; a filled-in version looks like this:

    # requires: from urllib import parse
    #           from scrapy import Request
    def parse(self, response):
        # 1. Extract each article URL from the list page, let Scrapy download it,
        #    and hand the response to parse_detail to extract the concrete fields
        post_nodes = response.xpath("//div[@id='archive']/div[contains(@class,'floated-thumb')]/div[@class='post-thumb']/a")
        for post_node in post_nodes:
            image_url = post_node.xpath("img/@src").extract_first()
            url = post_node.xpath("@href").extract_first()
            yield Request(url=parse.urljoin(response.url, url),
                          meta={"front_image_url": parse.urljoin(response.url, image_url)},
                          callback=self.parse_detail)
        # 2. Extract the next-page URL and hand it back to Scrapy for download
        next_url = response.xpath("//a[@class='next page-numbers']/@href").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

(1) Each article Request hands the downloaded page to the callback parse_detail; the cover-image
    URL is passed along inside the request via meta={"front_image_url": parse.urljoin(response.url, image_url)}
(2) The next list page is yielded as a Request whose callback is parse itself, so the same
    function keeps processing list pages; parse.urljoin is illustrated below
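parse.urljoin here is the standard-library urllib.parse.urljoin. It is used because the extracted hrefs may be relative; urljoin joins a relative path onto the current page's URL and passes absolute URLs through unchanged. A quick illustration (the article ID is made up):

    from urllib import parse

    # a relative href is joined onto the list-page URL
    parse.urljoin("http://blog.jobbole.com/all-posts/", "/114466/")
    # -> 'http://blog.jobbole.com/114466/'

    # an absolute href passes through unchanged
    parse.urljoin("http://blog.jobbole.com/all-posts/", "http://blog.jobbole.com/114466/")
    # -> 'http://blog.jobbole.com/114466/'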

5. The detail-page parsing function parse_detail

    # requires: import re; import datetime
    #           from webspider.items import JobBoleArticleItem
    #           get_md5 is a project helper (a sketch follows below)
    def parse_detail(self, response):
        article_item = JobBoleArticleItem()
        # front_image_url was passed in via the Request's meta dict in parse()
        front_image_url = response.meta.get("front_image_url", "")
        title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first()
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract_first("").replace('·', '').strip()
        # extract_first() may return None, so check before converting to int
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first()
        if praise_nums:
            praise_nums = int(praise_nums)
        else:
            praise_nums = 0
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first("")
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0
        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract_first("")
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0
        tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # drop the "N 评论" (comment-count) entry that shares the meta line with the tags
        tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
        tags = ",".join(tag_list)
        article_item["url_object_id"] = get_md5(response.url)
        article_item["title"] = title
        article_item["url"] = response.url
        try:
            # strptime parses the date string; fall back to today on failure
            create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
        except Exception:
            create_date = datetime.datetime.now().date()
        article_item["create_date"] = create_date
        # Scrapy's ImagesPipeline expects a list of image URLs, so wrap the value in a list
        article_item["front_image_url"] = [front_image_url]
        article_item["praise_nums"] = praise_nums
        article_item["comment_nums"] = comment_nums
        article_item["fav_nums"] = fav_nums
        article_item["tags"] = tags

        yield article_item

  (1) Parse the concrete fields out of the response and validate each value
  (2) front_image_url is the value passed in manually through the request meta, read back with:
      front_image_url = response.meta.get("front_image_url", "")
  (3) Fill the processed values into the item and yield it; a sketch of the get_md5 helper follows
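get_md5 used above is not part of Scrapy; it is a small project-local helper (often kept in a utils module) that hashes a URL into a fixed-length ID suitable as a primary key. A minimal sketch, assuming it only needs to handle str input:

    import hashlib


    def get_md5(url):
        # md5 needs bytes, so encode str input first
        if isinstance(url, str):
            url = url.encode("utf-8")
        return hashlib.md5(url).hexdigest()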

6. The item must be defined by you in items.py, for example:

import scrapy


class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field()

7. Exporting the data (to a database or a local file) and downloading images are done by defining
   pipelines in pipelines.py and registering them in settings.py
  (1) Store the data in MySQL asynchronously via Twisted's adbapi, which Scrapy's event loop is
      built on (reason: when the crawl is fast, synchronous inserts become the bottleneck)

        # requires: import MySQLdb
        #           import MySQLdb.cursors
        #           from twisted.enterprise import adbapi
        class JobBoleMysqlTwistedPipeline(object):

            def __init__(self, dbpool):
                self.dbpool = dbpool

            # classmethod hook: Scrapy calls this with the settings object,
            # so the database config can be read from settings.py
            @classmethod
            def from_settings(cls, settings):
                dbparms = dict(
                    host=settings["MYSQL_HOST"],
                    db=settings["MYSQL_DBNAME"],
                    user=settings["MYSQL_USER"],
                    passwd=settings["MYSQL_PASSWORD"],
                    charset="utf8",
                    cursorclass=MySQLdb.cursors.DictCursor,
                    use_unicode=True
                )
                # the first argument is the DB-API module name, spelled "MySQLdb"
                dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
                return cls(dbpool)

            def process_item(self, item, spider):
                # runInteraction turns the blocking insert into an async Twisted call
                query = self.dbpool.runInteraction(self.do_insert, item)
                query.addErrback(self.handle_error)
                return item

            def handle_error(self, failure):
                # log asynchronous insert errors instead of silently dropping them
                print(failure)

            def do_insert(self, cursor, item):
                insert_sql = """
                            insert into jobbole(title, create_date, url, url_object_id, front_image_url, comment_nums, fav_nums, praise_nums, tags)
                            values (%s, %s, %s, %s, %s, %s, %s, %s, %s)
                        """
                # front_image_url is stored on the item as a list; save its first element
                cursor.execute(insert_sql, (item["title"], item["create_date"], item["url"],
                                            item["url_object_id"], item["front_image_url"][0],
                                            item["comment_nums"], item["fav_nums"],
                                            item["praise_nums"], item["tags"]))

  (2) Image download

    # Subclass Scrapy's ImagesPipeline and override item_completed to record the local
    # path of each downloaded image; the actual downloading is done by Scrapy itself
    # requires: from scrapy.pipelines.images import ImagesPipeline
    class JobBoleImagePipeline(ImagesPipeline):
        def item_completed(self, results, item, info):
            # results is a list of (success, file_info) tuples, one per image URL
            for ok, value in results:
                if ok:
                    item["front_image_path"] = value["path"]
            return item
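ImagesPipeline also needs to know which item field holds the image URLs and where to store the files; a minimal settings.py sketch (the images directory is an assumed location):

    import os

    # the item field that contains the list of image URLs
    IMAGES_URLS_FIELD = "front_image_url"
    # save images under an images/ folder next to settings.py
    project_dir = os.path.abspath(os.path.dirname(__file__))
    IMAGES_STORE = os.path.join(project_dir, 'images')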

  (3) Register the pipelines in ITEM_PIPELINES in settings.py; the lower the number, the earlier the pipeline runs

        ITEM_PIPELINES = {
            # 'webspider.pipelines.JobBoleImagePipeline': 1,
            'webspider.pipelines.JobBoleMysqlTwistedPipeline': 2,
        }
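With the item, spider, and pipelines in place, start the crawl from the project root using the spider name defined in the class:

        scrapy crawl jobbole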

 
