Scraping 圣墟 with Scrapy

# -*- coding: utf-8 -*-
import scrapy
from sx.items import SxItem


class SkSpider(scrapy.Spider):
    name = 'sk'
    allowed_domains = ['biqiuge.com']
    start_urls = ['https://www.biqiuge.com/book/4772/']

    def parse(self, response):
        # The index page lists every chapter as <dd><a href="...">title</a></dd>
        for box in response.xpath("//div[@class='listmain']/dl/dd"):
            href = box.xpath('./a/@href').extract_first()
            if href is None:
                continue  # skip <dd> entries that carry no link
            # urljoin resolves both relative and site-absolute hrefs
            yield scrapy.Request(response.urljoin(href), callback=self.parse_2)

    def parse_2(self, response):
        # Each chapter page yields one item holding the title and the body text
        item = SxItem()
        item['title'] = response.xpath('//div[@class="content"]/h1/text()').extract_first(default='')
        # The body arrives as one text node per line; end each with a newline
        lines = response.xpath('//div[@id="content"]/text()').extract()
        item['content'] = ''.join(line + '\n' for line in lines)
        yield item
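The two string operations the spider relies on, building a chapter URL and joining the body's text nodes, can be tried without Scrapy. This is a minimal sketch using only the standard library. The helper names `build_chapter_url` and `join_lines` and the chapter file name `1.html` are hypothetical, and unlike the spider, `join_lines` also strips padding and drops empty fragments.

```python
from urllib.parse import urljoin

# Index page taken from the spider's start_urls
INDEX_URL = 'https://www.biqiuge.com/book/4772/'


def build_chapter_url(index_url, href):
    """Resolve a chapter href against the index page URL.

    Works for site-absolute hrefs ('/book/4772/1.html') as well as
    relative ones ('1.html'), unlike plain string concatenation.
    """
    return urljoin(index_url, href)


def join_lines(fragments):
    """Join text fragments into one string, one line per fragment.

    Leading and trailing whitespace (including non-breaking spaces)
    is stripped, and fragments that end up empty are dropped.
    """
    cleaned = (fragment.strip() for fragment in fragments)
    return '\n'.join(line for line in cleaned if line)
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same resolution against the URL of the current response.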

The settings.py configuration file; a download delay needs to be set:

BOT_NAME = 'sx'

SPIDER_MODULES = ['sx.spiders']
NEWSPIDER_MODULE = 'sx.spiders'


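# Do not consult the site's robots.txt before crawling
ROBOTSTXT_OBEY = False

# Base delay, in seconds, between consecutive requests to the same site
DOWNLOAD_DELAY = 3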

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}



ITEM_PIPELINES = {
    'sx.pipelines.SxPipeline': 300,
}
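If a fixed delay turns out to be too aggressive or too slow, Scrapy has a few related settings that could be combined with `DOWNLOAD_DELAY`. The fragment below is a sketch of possible additions to settings.py; the values are illustrative assumptions, not tuned for this site.

```python
# Possible throttling settings for settings.py (illustrative values only)

# Base delay in seconds between requests to the same site
DOWNLOAD_DELAY = 3

# Wait a random time between 0.5x and 1.5x DOWNLOAD_DELAY
# instead of a fixed interval (this is Scrapy's default behaviour)
RANDOMIZE_DOWNLOAD_DELAY = True

# Send at most one request at a time to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to server latency;
# it does not go below DOWNLOAD_DELAY or above AUTOTHROTTLE_MAX_DELAY
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 30
```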

 

pipelines.py:
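class SxPipeline(object):
    def open_spider(self, spider):
        # Open the output file once per crawl; the text is Chinese,
        # so set the encoding explicitly
        self.file = open('圣墟.txt', 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        # Write the chapter title on its own line, then the chapter body
        self.file.write(item['title'] + '\n')
        self.file.write(item['content'])
        return item

    def close_spider(self, spider):
        # Flush and close the file when the crawl finishes
        self.file.close()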
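Scrapy does not guarantee that chapter pages are processed in the order their requests were yielded, so a pipeline that appends each item as it arrives may write chapters out of order. One possible workaround is to buffer the chapters and sort them before writing. The sketch below assumes every item carries a hypothetical `index` field holding the chapter's position on the index page; that field is not part of the original `SxItem`, and the spider would have to set it. The class name `OrderedChapterWriter` is also made up, and plain dicts stand in for items so the example runs without Scrapy.

```python
import os
import tempfile


class OrderedChapterWriter:
    """Buffers chapters and writes them sorted by a hypothetical 'index' field."""

    def __init__(self, path):
        self.path = path
        self.chapters = []

    def process_item(self, item, spider=None):
        # Buffer instead of writing immediately, since arrival order is arbitrary
        self.chapters.append((item['index'], item['title'], item['content']))
        return item

    def close_spider(self, spider=None):
        # Sort by chapter index, then write every chapter in one pass
        with open(self.path, 'w', encoding='utf-8') as f:
            for _, title, content in sorted(self.chapters, key=lambda c: c[0]):
                f.write(title + '\n')
                f.write(content)


# Demo: two chapters arrive in reverse order and are written in index order
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
writer = OrderedChapterWriter(demo_path)
writer.process_item({'index': 2, 'title': 'Chapter 2', 'content': 'second\n'})
writer.process_item({'index': 1, 'title': 'Chapter 1', 'content': 'first\n'})
writer.close_spider()
with open(demo_path, encoding='utf-8') as f:
    demo_text = f.read()
```

The trade-off is that the whole book is held in memory until the crawl ends, and nothing is written if the crawl is killed before `close_spider` runs.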

 

posted @ 东哥加油!!!