scrapy

Scrapy is a web crawling framework.

1. Environment setup

Install the dependency: pip install Scrapy. The version installed here is 2.8.0.
List the templates that scrapy can use when creating a spider module:

(base) E:\>scrapy genspider --list
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

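Create a crawler project. As an example, the steps below create a project named scrapy_test and a spider based on the default basic template:
1. Create the project (one project can contain multiple spider modules): scrapy startproject scrapy_test
2. Switch to the project root directory: cd scrapy_test
3. Create the spider module: scrapy genspider -t basic csdn csdn.com
4. Run the spider: scrapy crawl csdn
A little extra setup usually makes debugging easier. There are two ways to debug:
1. Create a script file main.py in the project root directory with the following content: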
```
from scrapy.cmdline import execute

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# execute(["scrapy", "crawl", "<spider name>"])
execute(["scrapy", "crawl", "csdn"])
```
2. Open the Scrapy shell from a command prompt: scrapy shell <URL to fetch>
Modify the project's configuration file settings.py:

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

2. Selector

The Selector class provides a css method, which selects HTML elements using CSS selector syntax so that data can be extracted from them. For example:

from scrapy import Selector

html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>This is a test HTML document</title>
</head>
<body>
    <div class="info first" id="intro">
        <p class="age">Age: 23</p>
        <p class="name">Name: XCER</p>
        <p class="work">Occupation: student</p>
        <p>Gender: male</p>
    </div>
    <div class="info second" id="photo">
        <img src="" alt="Portrait photo"/>
    </div>
</body>
</html>
"""

sel = Selector(text=html)

# Select the elements whose class attribute includes "info"
div = sel.css(".info")
print(div)

# ::text selects the text nodes of the matched elements
info = sel.css("div p::text").extract()
# ['Age: 23', 'Name: XCER', 'Occupation: student', 'Gender: male']
print(info)

age = sel.css("div p.age::text").extract()
if age:
    # Age: 23
    print(age[0])

name = sel.css("div p[class='name']::text").extract()
if name:
    # Name: XCER
    print(name[0])

# The standard CSS pseudo-class is spelled nth-child, with a hyphen
work = sel.css("div p:nth-child(3)::text").extract()
if work:
    # Occupation: student
    print(work[0])

The class also provides an xpath method, which selects HTML elements using XPath syntax so that data can be extracted from them. For example:

from scrapy import Selector

html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>This is a test HTML document</title>
</head>
<body>
    <div class="info first" id="intro">
        <p class="age">Age: 23</p>
        <p class="name">Name: XCER</p>
        <p class="work">Occupation: student</p>
        <p>Gender: male</p>
    </div>
    <div class="info second" id="photo">
        <img src="" alt="Portrait photo"/>
    </div>
</body>
</html>
"""

sel = Selector(text=html)

# XPath expressions can be combined with functions such as contains()
# Select the div elements whose class attribute contains "info"
div = sel.xpath("//div[contains(@class, 'info')]").extract()
print(div)

info = sel.xpath("//div/p/text()").extract()
# ['Age: 23', 'Name: XCER', 'Occupation: student', 'Gender: male']
print(info)

age = sel.xpath("//p[@class='age']/text()").extract()
if age:
    # Age: 23
    print(age[0])

# last()-2 is the third p from the end, i.e. the second of the four p elements
name = sel.xpath("//div[@id='intro']/p[last()-2]/text()").extract()
if name:
    # Name: XCER
    print(name[0])

work = sel.xpath("//div/p[3]/text()").extract()
if work:
    # Occupation: student
    print(work[0])

3. Countering anti-scraping measures in Scrapy

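Rotate the User-Agent in a downloader middleware, using the fake-useragent library
Set up an IP proxy pool, for example by scraping the proxy IPs listed on Xici free proxy (西刺免费代理IP)
Limit the download speed
Disable cookies
Customize settings per spider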
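
The first two items can be sketched as a pair of downloader middlewares. This is only a minimal illustration under stated assumptions: the class names, the USER_AGENTS list and the PROXIES list are hypothetical, and the addresses are placeholders rather than working proxies. The only Scrapy conventions relied on are that a downloader middleware may define process_request(request, spider), and that the built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].

```python
import random

# Hypothetical sample values: replace them with real User-Agent strings
# and with proxies that you have verified yourself.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
PROXIES = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RandomUserAgentMiddleware:
    """Set a randomly chosen User-Agent header on every outgoing request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request.
        return None


class RandomProxyMiddleware:
    """Route every outgoing request through a randomly chosen proxy."""

    def __init__(self, proxies=PROXIES):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # The built-in HttpProxyMiddleware picks the proxy up from this key.
        request.meta["proxy"] = random.choice(self.proxies)
        return None
```

To take effect, the classes would have to be registered in the DOWNLOADER_MIDDLEWARES setting of settings.py, with the proxy middleware given a lower order number than the built-in HttpProxyMiddleware so that it runs first. With fake-useragent installed, UserAgent().random could supply the header value instead of the hard-coded list. The remaining items map onto ordinary settings: DOWNLOAD_DELAY to limit the download speed, COOKIES_ENABLED = False to disable cookies, and the custom_settings class attribute of a spider for per-spider overrides.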

Deploying a Scrapy project

Use scrapyd.
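
scrapyd exposes a JSON API over HTTP, by default on port 6800, and its schedule.json endpoint starts a run of a spider from a project that has already been deployed. As a rough sketch, the request for the csdn spider from the example project could be built like this. The helper name build_schedule_request is hypothetical, the base URL assumes a scrapyd instance running locally with default settings, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(project, spider, base_url="http://localhost:6800"):
    """Build (but do not send) a POST request for scrapyd's schedule.json."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{base_url}/schedule.json", data=data, method="POST")


req = build_schedule_request("scrapy_test", "csdn")
print(req.full_url)       # http://localhost:6800/schedule.json
print(req.data.decode())  # project=scrapy_test&spider=csdn

# To actually schedule the run, pass req to urllib.request.urlopen()
# while scrapyd is running and the project has been deployed to it.
```

The project has to be uploaded to scrapyd before it can be scheduled, for example with the scrapyd-deploy command from the scrapyd-client package.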

posted on 2024-01-31 21:39 scrutiny-span
