Scrapy Crawler, Day 2: Running a Simple Spider
Configuring settings.py
Disable robots.txt compliance
ROBOTSTXT_OBEY = False
Set the User-Agent
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3294.99 Safari/537.36',
}
Add start.py
To run the spider conveniently from an IDE, create a start.py file in the same directory as the other spider components.
from scrapy import cmdline

cmdline.execute("scrapy crawl wx_spider".split())
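As an aside, `cmdline.execute` takes the command as an argv-style list of strings, and the `split()` call is simply doing that conversion. A minimal illustration, independent of Scrapy itself:

```python
# cmdline.execute expects an argv-style list, the same shape
# sys.argv would take for the command "scrapy crawl wx_spider".
# str.split() on whitespace produces exactly that list.
argv = "scrapy crawl wx_spider".split()
print(argv)  # ['scrapy', 'crawl', 'wx_spider']
```

Note that "wx_spider" here must match the spider's `name` attribute, not its file name.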
Directory tree
E:.
│  scrapy.cfg
│
└─BookSpider
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  start.py
    │  __init__.py
    │
    ├─spiders
    │  │  biqubao_spider.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │         biqubao_spider.cpython-36.pyc
    │         __init__.cpython-36.pyc
    │
    └─__pycache__
           settings.cpython-36.pyc
           __init__.cpython-36.pyc
Add the following code to the spider to print the page content:
# biqubao_spider.py
def parse(self, response):
    print("*" * 50)
    print(response.text)
    print("*" * 50)
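The `parse` override above just brackets the raw HTML between two 50-asterisk banners so the page body stands out in Scrapy's console log. A stand-in sketch of what gets printed, using a placeholder string in place of a real `response.text`:

```python
# Placeholder for response.text; a real run would print the page's HTML here.
body = "<html><body>demo page</body></html>"

banner = "*" * 50  # same banner the parse() method prints before and after
output = "\n".join([banner, body, banner])
print(output)
```

Seeing the banners in the console confirms that the request went through and `parse` was actually called.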