Crawler Failure - 傻傻de


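Yesterday I wrote a crawler with CrawlSpider to scrape book data from China Book Network (中国图书网, bookschina.com): category, title, author, publisher, publication date, page count, price, rating, and number of comments. Every field is extracted directly with XPath, and the URLs of the next page and of each detail page are extracted with XPath as well.
The program ran fine at first, but after a few minutes disaster struck! As the screenshot below shows, I could no longer open the site even directly in a browser...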

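In the request headers I had set only the User-Agent field, and only a single User-Agent value. I used no proxy IPs and set no download delay. I assumed the site would not block me, and yet...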
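The missing delay is the cheapest thing to fix. Below is a minimal sketch of a Scrapy `settings.py` fragment that slows the crawl down. The setting names are standard Scrapy settings, but every value here is an assumption chosen for illustration, not something tuned against this site.

```python
# settings.py (sketch): throttle the crawl so requests do not arrive in a burst.
# All numeric values are illustrative assumptions.

# Base wait (seconds) between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Randomize the wait to between 0.5x and 1.5x DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Send at most one request at a time to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as a lower bound on the adjusted delay, so the crawl never goes faster than the base wait.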

Part of the spider code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bookschina.items import BookschinaItem

class BcSpider(CrawlSpider):
    name = 'bc'
    allowed_domains = ['bookschina.com']
    start_urls = ['http://www.bookschina.com/books/kinder/']

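    rules = (
        # Follow every category link in the category list.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='categoriesList']//ul/li/a"), follow=True),
        # Follow the "next page" (下一页) link on each listing page.
        Rule(LinkExtractor(restrict_xpaths="//a[text()='下一页']"), follow=True),
        # Send each book's detail page to parse_item.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='bookList']/ul/li/div[1]/a"), callback='parse_item')
    )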

    def parse_item(self, response):
        item = BookschinaItem()
        item["b_cate"] = response.xpath("//div[@class='crumbsNav clearfix']/div[2]/a/text()").extract_first()
        item["s_cate"] = response.xpath("//div[@class='crumbsNav clearfix']/div[3]/a/text()").extract_first()
        item["book_name"] = response.xpath("//div[@class='padLeft10']/h1/text()").extract_first()
        item["book_author"] = response.xpath("//div[@class='padLeft10']/div[@class='author']/a/text()").extract_first()
        item["book_press"] = response.xpath("//div[@class='padLeft10']/div[@class='publisher']/a/text()").extract_first()
        item["pub_date"] = response.xpath("//div[@class='padLeft10']/div[@class='publisher']/i/text()").extract_first()
        item["pages_num"] = response.xpath("//div[@class='padLeft10']/div[@class='otherInfor']/i/text()").extract_first()
        item["book_score"] = response.xpath("//div[@class='padLeft10']/div[@class='startWrap']/em/text()").extract_first()
        item["comments_num"] = response.xpath("//div[@class='padLeft10']/div[@class='startWrap']/a/text()").extract_first()
        item["book_price"] = response.xpath("//div[@class='bookInfo']//span[@class='sellPrice']/text()").extract_first()

        yield item

Lesson learned: when writing a crawler, always put anti-anti-crawling measures in place!
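One such measure is rotating the User-Agent (and optionally the proxy) per request instead of sending the same one every time. The following is a minimal sketch of a Scrapy downloader middleware that does this. The class name `RandomUserAgentMiddleware`, the `USER_AGENTS` pool, and the `PROXIES` list are hypothetical names introduced for illustration; the User-Agent strings are examples only, and the proxy list is deliberately left empty.

```python
import random

# Hypothetical pool of browser User-Agent strings (examples only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.0.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0",
]

# Hypothetical proxy URLs such as "http://host:port"; empty means no proxy.
PROXIES = []


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random entry from the pool.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # If proxies are configured, route this request through one of them.
        if PROXIES:
            request.meta["proxy"] = random.choice(PROXIES)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

To enable it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in `settings.py`, for example as `'bookschina.middlewares.RandomUserAgentMiddleware': 400`; that module path is an assumption based on the project name in the import above. Rotating headers only helps together with a download delay, since a single IP sending a flood of requests can still be blocked.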

posted on 2018-12-02 22:03 傻傻de Views (252) Comments (0)
