Spider parameters in the Scrapy framework

Using a Spider
- Variables
- Functions
Common parsing functions in a Spider
Functions and attributes of the response object

Using a Spider

Variables

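# Class attributes of a scrapy.Spider subclass.
# Spider name; must be unique within the project
name = 'douban'
# Domains the spider is allowed to crawl; requests to other domains are filtered out
allowed_domains = ['movie.douban.com']
# URLs the crawl starts from
start_urls = ['https://movie.douban.com/top250?start=0&filter=']
# Per-spider settings; these override the values in settings.py
custom_settings = {
    "LOG_LEVEL": "WARNING"
}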

Functions

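# Methods of a scrapy.Spider subclass; they rely on `import scrapy`.
# Called once when the crawl starts; yields the initial requests
def start_requests(self):
    url = 'https://movie.douban.com/top250?start=0&filter='
    yield scrapy.Request(
        # URL to request; defining start_requests() takes the place of start_urls
        url=url,
        # Method that Scrapy calls with the downloaded response
        callback=self.parse,
    )

# Default callback; handles the response of each request
def parse(self, response):
    print(response)

# Called when the spider closes; reason is a string such as 'finished'
def closed(self, reason):
    print("Called when the spider closes")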

Common parsing functions in a Spider

imgs = response.xpath('//ol[@class="grid_view"]/li')
for img in imgs:
    # Select the alt text of the poster image once; every call below reuses this selector list
    alt = img.xpath('./div[@class="item"]/div[@class="pic"]/a/img/@alt')
    # get() returns the first match as a string, or None if nothing matches
    print(alt.get())
    # getall() returns every match as a list of strings
    print(alt.getall())
    # On a selector list like this one, extract() is equivalent to getall()
    print(alt.extract())
    # extract_first() is equivalent to get()
    print(alt.extract_first())
    # re() returns a list of the strings that match the regular expression
    print(alt.re('肖.*'))
    # re_first() returns the first regex match, or None if nothing matches
    print(alt.re_first('肖.*'))

Functions and attributes of the response object

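# response.status: HTTP status code
print(response.status)
# response.body: response content as bytes
print(response.body)
# response.body.decode('utf-8'): response content decoded to a str as UTF-8
# (response.text does the same using the detected encoding)
print(response.body.decode('utf-8'))
# response.url: URL of the response
print(response.url)
# response.urljoin('abc'): resolves a relative URL against the response URL
print(response.urljoin('abc'))
# response.encoding: character encoding of the page
print(response.encoding)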
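
`response.urljoin()` resolves its argument against the response URL following the same rules as the standard library's `urllib.parse.urljoin` (for HTML responses Scrapy also honours a `<base>` tag, which this sketch ignores). The relative references below are made-up examples; they show why `'abc'` replaces the last path segment rather than being appended.

```python
from urllib.parse import urljoin

base = "https://movie.douban.com/top250?start=0&filter="

# A bare name replaces the last path segment and drops the query string
sibling = urljoin(base, "abc")
# A query-only reference keeps the path and swaps the query
next_page = urljoin(base, "?start=25&filter=")
# A leading slash resolves from the site root (hypothetical path)
rooted = urljoin(base, "/chart")
# An absolute URL is returned unchanged
absolute = urljoin(base, "https://example.com/x")

for url in (sibling, next_page, rooted, absolute):
    print(url)
```

The query-only form is the one that matters most for pagination links, which are often written as just `?start=25&filter=` in the page source.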

Posted 2023-02-11 16:00 by 淦丘比
