Scrapy Spiders: A Quick Introduction
1. Press Win+R, type cmd, and press Enter to open a command-line window, then change into the directory where you want to create the project:
cd <project directory>
2. Create the project:
scrapy startproject <project name>
cd <project name>
For example:
scrapy startproject quote
cd quote
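startproject generates a standard project skeleton. The layout looks roughly like this (the exact file set varies slightly between Scrapy versions):
quote/
    scrapy.cfg
    quote/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py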
3. Edit items.py (define the fields you want to scrape):
import scrapy

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
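An Item behaves much like a dict, which is what lets the pipeline in step 6 call dict(item). A quick check in a Python shell, with made-up values:
>>> from quote.items import QuoteItem
>>> item = QuoteItem(text='hello', author='me', tags='a/b')
>>> dict(item)
{'text': 'hello', 'author': 'me', 'tags': 'a/b'}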
4. Generate a spider file (run this inside the project directory):
scrapy genspider <spider name> <domain>
For example:
scrapy genspider myspider quotes.toscrape.com
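This creates quote/spiders/myspider.py from the basic template; depending on your Scrapy version it looks roughly like this, with an empty parse method for you to fill in:
import scrapy

class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass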
5. Edit quote/spiders/myspider.py:
import scrapy
from quote.items import QuoteItem

class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for each in response.xpath('//div[@class="quote"]'):
            item = QuoteItem()
            item['text'] = each.xpath('./span/text()').extract()[0]
            item['author'] = each.xpath('.//small/text()').extract()[0]
            # A list can't be written to MySQL directly, so join the tags into a str
            tags = each.xpath('.//a[@class="tag"]/text()').extract()
            item['tags'] = '/'.join(tags)
            yield item
        # The last page has no "next" link, so guard against an empty result
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
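Before settling on the XPath expressions, it helps to try them interactively with scrapy shell; the output below is abbreviated:
scrapy shell http://quotes.toscrape.com/
>>> response.xpath('//div[@class="quote"]/span/text()').extract_first()
'"The world as we have created it is a process of our thinking. ..."'
>>> response.xpath('//li[@class="next"]/a/@href').extract_first()
'/page/2/'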
6. Edit pipelines.py (if you want to store the items in a database):
import pymysql

class QuotePipeline(object):
    def __init__(self):
        # Open the MySQL connection once when the pipeline is created
        self.connect = pymysql.connect(
            host='localhost',
            user='root',
            password='',
            database='quotes',
            charset='utf8',
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        item = dict(item)
        sql = 'insert into quote(text,author,tags) values(%s,%s,%s)'
        self.cursor.execute(sql, (item['text'], item['author'], item['tags']))
        self.connect.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
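The insert statement assumes a quote table already exists in the quotes database. A minimal one-off setup script, where the column sizes are assumptions you may want to adjust:
import pymysql

# One-off setup: create the database and table the pipeline writes to
# (column sizes here are assumptions, not requirements)
connect = pymysql.connect(host='localhost', user='root', password='', charset='utf8')
cursor = connect.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS quotes CHARACTER SET utf8')
cursor.execute('''
    CREATE TABLE IF NOT EXISTS quotes.quote (
        id INT AUTO_INCREMENT PRIMARY KEY,
        text VARCHAR(500),
        author VARCHAR(100),
        tags VARCHAR(200)
    ) CHARACTER SET utf8
''')
connect.commit()
cursor.close()
connect.close()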
7. Edit settings.py. quotes.toscrape.com allows crawling in its robots.txt, so obeying it costs nothing:
ROBOTSTXT_OBEY = True
If you added the MySQL pipeline in step 6, you must also enable it here, or Scrapy will never call it; see the snippet below.
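The generated settings.py ships with ITEM_PIPELINES commented out; uncomment it and point it at the pipeline class (300 is the conventional priority; lower values run earlier when several pipelines are enabled):
ITEM_PIPELINES = {
    'quote.pipelines.QuotePipeline': 300,
}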
8. Run the spider (from the project root):
scrapy crawl myspider
9. Save the results as a CSV file. The argument to crawl is the spider's name attribute (myspider), not the project name, and the option is a plain hyphen:
scrapy crawl myspider -o quotes.csv
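The feed exporter infers the format from the file extension, so the same flag produces JSON as well:
scrapy crawl myspider -o quotes.json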