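Crawling My Own CSDN Blog with Scrapy

I recently started working with Scrapy, an open-source crawling framework. After reading some documentation and other people's technical blog posts, I tried following their example to crawl my own blog.

First, create the project: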

scrapy startproject myblog

Writing items.py:

I plan to scrape each article's title, its link, and the number of times it has been read.

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class MyBlogItem(scrapy.Item):
    article_name = scrapy.Field()
    article_url = scrapy.Field()
    article_readcount = scrapy.Field()

Writing pipelines.py:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs
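class MyBlogPipeline(object):
    def __init__(self):
        # Open the output file once; codecs encodes everything written to it as UTF-8
        self.file = codecs.open('myblog_data.json', mode='wb', encoding='utf-8')
    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII characters as \uXXXX by default
        line = json.dumps(dict(item)) + '\n'
        # Python 2: decode the escapes back into real characters before writing
        self.file.write(line.decode('unicode_escape'))
        return item
    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes; release the file handle
        self.file.close()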

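Scrapy returns scraped Chinese text as Unicode, and json.dumps writes it out as \uXXXX escape sequences by default, which raises the question of how to get readable UTF-8 output. The process_item code above handles this reasonably well: it decodes the escapes with unicode_escape and writes the result to a file opened with UTF-8 encoding.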
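
As a point of comparison, here is a minimal sketch of a different way to get the same effect on Python 3, where str has no decode method: passing ensure_ascii=False to json.dumps keeps non-ASCII characters as they are, so no unicode_escape round trip is needed. This is not the method used in the pipeline above. The helper name to_json_line and the sample item are made up for illustration.

```python
import json


def to_json_line(item):
    """Serialize a dict to one JSON line, keeping Chinese text readable."""
    # ensure_ascii=False stops json.dumps from emitting \uXXXX escapes
    return json.dumps(item, ensure_ascii=False) + "\n"


# Hypothetical item; the field names follow MyBlogItem
sample = {"article_name": "中文标题", "article_readcount": "467"}

line = to_json_line(sample)    # readable Chinese characters
escaped = json.dumps(sample)   # default behaviour, shown for comparison
```

Writing line to a file opened with encoding='utf-8' would then produce UTF-8 output directly.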

Writing settings.py:

# -*- coding: utf-8 -*-
# Scrapy settings for myblog project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'myblog'
SPIDER_MODULES = ['myblog.spiders']
NEWSPIDER_MODULE = 'myblog.spiders'
COOKIES_ENABLED = False
ITEM_PIPELINES = {
    'myblog.pipelines.MyBlogPipeline':300
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myblog (+http://www.yourdomain.com)'

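COOKIES_ENABLED is set to False here, so the spider does not send cookies. Sites that identify visitors by their cookies then have a harder time tracing the crawler, which lowers the risk of being banned.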

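ITEM_PIPELINES is a dict that sets which pipelines are enabled. Each key is the path of a pipeline class you defined, and each value is its order, by convention in the 0-1000 range; items pass through pipelines with lower values first.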
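
To make the ordering concrete, here is a small sketch in plain Python that sorts a pipeline dict by its order values. MyBlogPipeline comes from this project; DedupPipeline is a hypothetical second pipeline added only for illustration, and run_order is a made-up helper, not a Scrapy function.

```python
# Same shape as ITEM_PIPELINES in settings.py
ITEM_PIPELINES = {
    "myblog.pipelines.MyBlogPipeline": 300,
    "myblog.pipelines.DedupPipeline": 100,  # hypothetical, for illustration only
}


def run_order(pipelines):
    """Return pipeline paths sorted by ascending order value."""
    return [path for path, order in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = run_order(ITEM_PIPELINES)
```

With these values, the hypothetical DedupPipeline (100) would see each item before MyBlogPipeline (300) writes it to disk.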

Writing the spider:

#!/usr/bin/env python
# __author__ = 'root'
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from myblog.items import MyBlogItem
import re
class MyBlogSpider(Spider):
    name = "myblog"
    download_delay = 1
    allowed_domains=["blog.csdn.net"]
    start_urls=[
        "http://blog.csdn.net/bnxf00000/article/details/2785136"
    ]
    def parse(self, response):
        sel = Selector(response)
        item = MyBlogItem()
        templist=[]
        article_url = str(response.url)
        article_name = sel.xpath('//div[@id="article_details"]/div/h1/span/a/text()').extract()
        article_readcount = sel.xpath('//div[@id="article_details"]/div[2]/span[@class="link_view"]/text()').extract()
        for temp in article_readcount:
            result = re.match('(\d+)',temp)
            if result:
                templist.append(result.group(0))
        #article_readcount = re.match('\d+',article_readcount)
        item['article_name'] = [n.encode('utf-8') for n in article_name]
        item['article_url'] = article_url.encode('utf-8')
        item['article_readcount']=[n.encode('utf-8') for n in templist]
        yield item
        urls = sel.xpath('//li[@class="next_article"]/a/@href').extract()
        for url in urls:
            #print url
            url = "http://blog.csdn.net" + url
            #print url
            yield Request(url, callback=self.parse)

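The idea is to parse each page for its "next article" link and yield a Request object for it, so the spider goes on to crawl the following article until there is none left.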
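
The link-following loop can be sketched without Scrapy, using only the standard library. This is a simplified model, not how Scrapy schedules requests internally: FAKE_PAGES is a hypothetical stand-in for the real site, mapping each article URL to the relative href of its "next article" link, and crawl is a made-up helper.

```python
from urllib.parse import urljoin

BASE = "http://blog.csdn.net"

# Hypothetical site map: article URL -> relative "next article" href (None if absent)
FAKE_PAGES = {
    "http://blog.csdn.net/user/article/details/1": "/user/article/details/2",
    "http://blog.csdn.net/user/article/details/2": "/user/article/details/3",
    "http://blog.csdn.net/user/article/details/3": None,
}


def crawl(start_url, pages):
    """Follow "next article" links from start_url until none is left."""
    visited = []
    url = start_url
    # Stop when a page has no next link, or when a link points back to a seen page
    while url is not None and url not in visited:
        visited.append(url)
        href = pages.get(url)
        # Relative hrefs are resolved against the site root, like the string
        # concatenation in parse() above
        url = urljoin(BASE, href) if href else None
    return visited


visited = crawl("http://blog.csdn.net/user/article/details/1", FAKE_PAGES)
```

Using urljoin rather than plain concatenation also copes with hrefs that are already absolute.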

Run it:

scrapy crawl myblog

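Part of the results (screenshot):

This is my first spider. I followed other people's code and explanations closely and added the read-count handling myself. Next I plan to read through the Scrapy source code.

Reference: http://blog.csdn.net/u012150179/article/details/34486677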

posted @ 2015-05-04 10:49 ShadonSniper
