Steps:
In the target folder, Shift+right-click to open a console (on Win10, prefix each scrapy command with "python -m "):
① scrapy startproject project; then run the two commands it suggests, changing name (don't reuse the project name; also avoid words with unintended meanings such as test or fang for both the project and the spider name) and domain (some sites drop the www.);
② the main .py files to edit: a hand-created main, items, the spider file name under spiders, pipelines, settings;
③ scrapy crawl name -s JOBDIR=job666 (or drop the -s option and instead add one line to settings: JOBDIR='job666'). Run it from the project root, i.e. the folder that contains scrapy.cfg.
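Step ③ can also be wrapped in a hand-made main.py and run from PyCharm, which is the convention the examples below follow; a minimal sketch, with a hypothetical spider name myspider:
import os
# -s JOBDIR persists the scheduler state, so one Ctrl+C plus a rerun resumes the crawl;
# on Win10 prepend "python -m " to the scrapy command if needed
os.system('scrapy crawl myspider -s JOBDIR=job666')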
******************* Divider *******************
Miscellaneous:
The scrapy genspider name command suggested in step ① defaults to -t basic before the name; switch to -t crawl if you want a CrawlSpider whose rules regex-filter URLs site-wide. For XML pages you can also use -t xmlfeed, and for tabular feeds -t csvfeed.
-s settings for use with ③: stop after 9 requests with -s CLOSESPIDER_PAGECOUNT=9; similarly CLOSESPIDER_ITEMCOUNT.
The JOBDIR parameter lets one Ctrl+C in cmd pause the crawl; running the same command next time resumes from the breakpoint, while pressing Ctrl+C twice terminates it completely. After the first Ctrl+C some sites keep crawling for five or six seconds — wait it out instead of pressing a second time.
Also, some sites such as Lagou require cookie validation, and certain behaviour-analysis keys inside the cookie have an expiry time, so they may expire mid-crawl or before the next resumed run.
All of these cmd commands can also be run in PyCharm's Terminal. If you run via os.system(*) instead, the Pause button (double bars) only pauses the Run window's output — a fake interrupt, not the same as one Ctrl+C in a Terminal; the Stop button (red square) is equivalent to pressing Ctrl+C twice within 7 seconds, the requests.queue folder under job666 gets emptied, the next run reports "no more duplicates will be shown", and the only way out is to delete job666 and crawl from scratch.
If cmd or the Terminal hangs or gets closed by accident, that is actually harmless — it counts as one Ctrl+C, and rerunning the command still resumes from the breakpoint.
******************* Divider *******************
The main .py under the spiders folder:
key=scrapy.Field() — key must be accessed by subscript, i.e. item['key'] (with item=…Item()); attribute access like item.key is not supported.
Filtered offsite request to 'url': the URL conflicts with allowed_domains. Either fix allowed_domains, or add dont_filter=True to the yield of that URL.
A bare URL works in Fiddler but yield Request just hangs on it: a proxy was set and has expired; DOWNLOAD_TIMEOUT = 3 gives up after 3 seconds and frees capacity for other URLs.
Request takes two kinds of arguments: request-side url, method, body, headers, cookies; callback-side callback, meta, dont_filter.
response attributes: body, meta['*'], url, urljoin(*), status, request.headers, and headers.getlist('Set-Cookie')[num].decode().
A scrapy.Request or .FormRequest behaves like requests.Session() only between itself and its callback function; within one and the same function it behaves like plain requests. So if cookies must be passed along a chain of requests, each request needs its own callback function instead of cramming them all into one.
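A minimal sketch of that point — a hypothetical login flow in which every request gets its own callback, so cookies set by one response ride along into the next request (site, form fields and class name are made up for illustration):
import scrapy

class LoginFlowSpider(scrapy.Spider):  # hypothetical spider, illustration only
    name = 'login_flow'
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # cookies set by the login page's response are reused automatically
        # by requests yielded from this callback (same "session")
        yield scrapy.FormRequest.from_response(response, formdata={'user': 'u', 'pwd': 'p'},
                                               callback=self.after_login)

    def after_login(self, response):
        # again a separate callback, so the logged-in cookies keep following along
        yield scrapy.Request('http://example.com/profile', callback=self.parse_profile)

    def parse_profile(self, response):
        self.logger.info('profile page length: %d', len(response.body))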
Class attributes on the Spider (written like start_urls) can define higher-priority settings:
custom_settings=dict(ROBOTSTXT_OBEY=False,DOWNLOAD_DELAY=random.random()/5,
DEFAULT_REQUEST_HEADERS={'User-Agent':UserAgent().random})
If you override start_requests, the URLs in the class attribute start_urls are no longer requested. For example:
def start_requests(self):
    with open('E:/data.txt', encoding='utf8') as f:
        rows = f.readlines()
    for r in rows:
        url, id = r.replace('\n', '').split(';')
        yield Request(url, self.parse_xxx, meta={'id': id})
******************* Divider *******************
LinkExtractor:
from scrapy.linkextractors import LinkExtractor
① If the target url is the value of a src attribute, change the attrs parameter from its default to ('href','src'); ② if the matched link is written in JS, e.g. a href = "javascript: ggg('http://……/p/'); return false">, set the parameter process_value=self.pv, with pv like this:
import re

def pv(value):
    m = re.findall("http.+?p/", value)
    if m:
        return m[0]
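Tying the two parameters together, a sketch of how they might sit inside a CrawlSpider Rule (the spider, URLs and the tags value are assumptions added for illustration; pv is reused here as a module-level function rather than self.pv):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

def pv(value):
    # pull the real URL back out of a javascript: href, as described in ②
    m = re.findall("http.+?p/", value)
    if m:
        return m[0]

class DemoSpider(CrawlSpider):  # hypothetical
    name = 'demo'
    start_urls = ['http://example.com/']
    rules = (
        Rule(LinkExtractor(allow=('/p/',), tags=('a', 'area', 'img'),  # img added so src is scanned at all
                           attrs=('href', 'src'), process_value=pv),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}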
******************* Divider *******************
Locating tags & extracting content:
def parse…(self, response):
On the response parameter of parse(), css and xpath let you extract text or attributes right after locating a tag — locating and extracting are not written as separate steps. Without extraction they return a list of selectors, on which both methods can keep being chained. A trailing .extract()[0] returns a basic type such as str and is equivalent to .extract_first().
css & xpath can be mixed: response.xpath('//div[contains(@alt, "→")]').css('.a ::attr(src)').extract().
After locating, extraction in xpath uses //text() and //@alt; in css it is ::text and ::attr(alt).
Scraping XML and the selectors return nothing: call response.selector.remove_namespaces(); then locate as usual, e.g. response.css('a').
Selectors raise "Response no attribute body_as_unicode": the site's headers have no content-type field, so scrapy can't tell the page type of what it grabbed — replace response.text there with response.body.
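A compact sketch of the idioms above (selectors and URL patterns are illustrative, not from a real site):
def parse(self, response):
    rows = response.css('div.item')                       # no extract(): a list of selectors, still chainable
    for row in rows:
        title = row.css('a::text').extract_first()        # same as .extract()[0] when the list isn't empty
        link = row.xpath('.//a/@href').extract_first()    # xpath extraction via //text() and //@attr
        img = row.css('img::attr(src)').extract_first()   # css extraction via ::text and ::attr()
        yield {'title': title,
               'link': response.urljoin(link),            # join a relative href against response.url
               'img': img}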
******************* Divider *******************
scrapy shell url: works like a browser — no need to fake headers the way scrapy.Request(url, callback, headers=xx) does.
① Extract the "items per page" count under dataoke's coupon list:
scrapy shell www.dataoke.com/qlist/
response.css('b+ b::text').extract_first()
② Using re instead of css/xpath, match the link text in a local html file (gb2312-encoded):
python -m scrapy shell E:/index之日记本.html
response=response.replace(encoding='gb2312') #response.encoding
from scrapy.linkextractors import LinkExtractor
LinkExtractor(('Diary\?id=\w+',)).extract_links(response)[0].text #.url
*******************分割线*******************
scrapy默认的去重指纹是sha1(method+url+body+header),判重指纹偏多致过滤的url少,如有时间戳的,网址改了但网页没变依然会重复请求。这时得自定义判重规则,如只根据去除了时间戳的url判重。
settings所在目录新建个内容如下的customDupefilter.py,并在settings中写入DUPEFILTER_CLASS='项目名.customDupefilter.UrlDupeFilter':
import re
from scrapy.dupefilter import RFPDupeFilter

class UrlDupeFilter(RFPDupeFilter):
    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        pattern = re.compile('&?time=\d{10}&?')
        url = pattern.sub('&', request.url)
        if url not in self.urls_seen:
            self.urls_seen.add(url)
        else:
            return True  # already seen: tell the scheduler to drop this request
**************************************** Divider ****************************************
gerapy:
pip install gerapy on the local machine.
Prerequisite: the local machine and every worker need pip install scrapyd; run scrapyd in cmd and open http://127.0.0.1:6800/ in a browser (for a cloud server it's ip:6800, after first changing bind_address = 0.0.0.0 in the *scrapyd.conf file) to check that each instance starts normally.
1. In cmd, switch to the target directory: gerapy init → cd gerapy → gerapy migrate → gerapy runserver (on a cloud server append 0.0.0.0:8000);
2. Copy a scrapy project that already runs fine standalone into gerapy\projects;
3. Open http://127.0.0.1:8000 in a browser (ip:8000 on a cloud server): ① Host management: Create → name, 127.0.0.1 for the local machine, 6800, the cloud host's auth, Create; ② Project management: click Deploy at the end of the target project's row → enter a build description and pack → tick the checkboxes and batch-deploy; ③ back in Host management: Schedule → Run.
******************* Divider *******************
scrapy_redis:
settings.py:
REDIS_HOST and REDIS_PORT are the IP and port of the machine running Redis (the master); the master itself may omit them, but every worker must set them;
SCHEDULER_PERSIST=True is the equivalent of setting JOBDIR — crawled urls aren't deleted, so a resumed crawl can still dedup against them;
if the key in Redis is a set, add REDIS_START_URLS_AS_SET=True; for a list or other types keep the default;
switch the scheduler SCHEDULER and the dedup class DUPEFILTER_CLASS from scrapy's own to the scrapy_redis ones.
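Collected as code, these are the settings lines the list above describes (the same values reappear in the Douban 9-point example near the end of these notes):
REDIS_HOST = '127.0.0.1'        # IP of the machine running Redis; required on every worker
REDIS_PORT = 6379
SCHEDULER_PERSIST = True        # keep the request/dupefilter keys, same idea as JOBDIR
REDIS_START_URLS_AS_SET = True  # only when the start-urls key is a set
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'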
The spider .py:
change the default parent class scrapy.Spider to RedisSpider (or CrawlSpider to RedisCrawlSpider for whole-site crawls);
replace the start_urls line with a class variable redis_key whose value is arbitrary, e.g. tempStartUrls; after the spider starts, it keeps listening and waiting if that key doesn't exist in Redis yet.
Common commands in redis-cli.exe, the Redis client:
flushdb: clears the database so you can crawl from scratch; don't use it if you still want to resume later.
rpush name url: inserts into a list-typed key; in Python that's redis.Redis('master ip',6379).rpush(key,*urlList), and its counterpart lpush inserts on the left; for a set-typed key use .sadd(key,*urlList), for a sorted zset use .zadd(key,**urlDict).
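The same seeding written out as a small sketch (key name and URLs are placeholders):
from redis import Redis
r = Redis('127.0.0.1', 6379)            # host/port of the master's Redis
urls = ['http://example.com/page1', 'http://example.com/page2']
r.rpush('tempStartUrls', *urls)         # list key: keep REDIS_START_URLS_AS_SET at its default
# r.sadd('tempStartUrls', *urls)        # set key: then set REDIS_START_URLS_AS_SET=True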
Starting a redis crawl: double-click redis-server.exe to start the Redis server (or start it from the command line if the password configured in the .conf file must be loaded); if storage is MongoDB, also run net start mongodb; then execute the scrapy crawl xxx in main.py.
Configuring a Redis password:
① In redis.windows.conf (redis.conf on Linux), add a line requirepass 123456 below # requirepass foobared.
② Open a command window in the Redis install directory → redis-server redis.windows.conf. This window must stay open until the crawl and any other tasks finish. If the barrel ASCII art doesn't appear, kill redis-server.exe in Task Manager and run ② again.
③ Open the RDM GUI → Connect … → any name, host 127.0.0.1, port 6379, auth 123456 → Test … → OK.
******************* Divider *******************
Deploying a spider to Aliyun or another server with scrapyd:
① In the server console's inbound-rules tab, clone a default security rule, change the port range to 6800/6800, and leave the authorized object 0.0.0.0/0 etc. unchanged.
② In a virtual environment created by a recent PyCharm (every command below runs in that venv's Terminal): pip install scrapyd → scrapyd.
③ Visit server-public-ip:6800 in a browser and find it unreachable: option 1 is to change bind_address to 0.0.0.0 in the scrapyd package's default_scrapyd.conf; but scrapyd offers no HTTP login, so option 2 keeps bind_address and instead puts scrapyd behind an nginx reverse proxy: http://wsxiangchen.com/details/?id=14.
④ Edit scrapy.cfg in the project root: change [deploy] to [deploy:anything, e.g. the spider name], change localhost in url to the public IP, leave project alone, and append two lines username=user and password=passwd (user and passwd being the ones set in nginx in ③); upload the project to the server.
⑤ Deploy the spider to the server: click the green + at the top left of Terminal to open a new tab → cd to the project root (⑥ also runs from here) → scrapyd-deploy spidername -p projectname.
⑥ Run the spider: curl http://public-ip:6800/schedule.json -u username:password -d project=projectname -d spider=spidername. Once it's crawling, check progress at ip:6800 under Jobs and Logs.
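Step ⑥ can also be issued from Python instead of curl — a sketch with placeholder host and credentials (scrapyd's schedule.json takes project and spider as form data):
import requests
resp = requests.post('http://1.2.3.4:6800/schedule.json',        # replace with the server's public IP
                     auth=('username', 'password'),              # the nginx basic-auth pair from ③/④
                     data={'project': 'projectname', 'spider': 'spidername'})
print(resp.json())   # e.g. {'status': 'ok', 'jobid': '...'} once the spider is scheduled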
**************************************** Divider ****************************************
Example: images from xiaohuar.com, downloaded the traditional way with open():
1. Open a console in the target folder xiaohuar and run these 3 commands in order:
scrapy startproject xiaohuar; cd xiaohuar; scrapy genspider xhr www.xiaohuar.com
2. xhr.py under the spiders directory:
import scrapy, re, os
pattern = re.compile('alt="(.+?)"\s+?src="(/d/file/.+?)"')
noName = r'[\\/:*?"<>|]'
folder = 'E:/xiaohuar'
if not os.path.isdir(folder): os.makedirs(folder)

class XhrSpider(scrapy.Spider):
    name = 'xhr'
    allowed_domains = ['www.xiaohuar.com']
    spideredCount = 0

    # the parameter end is only supplied when the spider is launched, so it must be an
    # __init__() parameter of the class; self.end is then usable in the other methods
    def __init__(self, end=None, *args, **kwargs):
        super(XhrSpider, self).__init__(*args, **kwargs)
        self.start_urls = [f'http://www.xiaohuar.com/list-1-{x}.html' for x in range(0, int(end))]
        self.end = int(end)

    def parse(self, response):
        result = pattern.findall(response.text)
        for name, imgUrl in result:
            if not '.php' in imgUrl:  # some images are dead; response.urljoin(imgUrl) joins the domain and imgUrl
                self.spideredCount += 1
                if self.spideredCount > self.end: break  # exit() inside the loop doesn't work well
                yield scrapy.Request(response.urljoin(imgUrl), self.downLoad, meta={'name': name})

    def downLoad(self, response):
        fileName = response.meta['name'] + '.' + response.url.split('.')[-1]
        fileName = re.sub(noName, ' ', fileName).strip()  # swap the illegal filename characters for spaces, trim the ends
        with open(f'E:/xiaohuar/{fileName}', 'wb') as f:
            f.write(response.body)
# Run method ①: run a single scrapy Spider from one file, without creating a project
if __name__ == '__main__':
    from fake_useragent import UserAgent
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings  # import this project's settings
    # settings = dict({'User-Agent': UserAgent().random}, ROBOTSTXT_OBEY=False)
    settings = get_project_settings()
    process = CrawlerProcess(settings=settings)
    process.crawl(XhrSpider, end=5)  # process.crawl(S2)… one crawl() call per Spider class to add
    process.start()
**** Divider between run methods ① & ② ****
# Run method ②:
Resumable crawl in cmd or Terminal: scrapy crawl xhr -a end=5 -s JOBDIR=jobXiaohuar.
Or run inside PyCharm: in the project root (or a subdirectory), create a run file main.py for the project —
import os  # if PyCharm's output goes to the Run window rather than the Terminal, drop the resume flag -s
os.system('scrapy crawl xhr -a end=5')
***** Notes *****
① If running main.py in PyCharm reports "Unknown command: crawl": Run → Edit Configurations → double-click main in the left pane → set Working Directory to the project root.
② To export the results as csv (or json), append to python -m scrapy crawl xxx: -o "file://E:/xiaohuar.csv" -s FEED_EXPORT_ENCODING=gbk, or instead add one line FEED_EXPORT_ENCODING='gbk' to settings. If there are special characters, export the csv with the default utf8 after all and re-save it as utf8+BOM so Excel opens it correctly.
Blank lines between the rows scrapy exports to csv: in *packages\scrapy\exporters.py, inside the __init__ of the CsvItemExporter class, add newline='', (note the trailing English comma) between the file, line and the line_buffering* line.
**************************************** Divider ****************************************
Example: dataoke — crawl each item's name, commission, sales, stock and after-coupon price into a csv:
1. scrapy startproject dataoke;
cd dataoke; scrapy genspider dtk www.dataoke.com
main.py: the only hand-created .py file; it can sit in the project root or a subdirectory
import os  # if PyCharm's output goes to the Run window rather than the Terminal, drop the resume flag -s
os.system('scrapy crawl dtk -o "file://E:/da tao ke.csv" -s FEED_EXPORT_ENCODING=gbk')
******************* Divider *******************
2. items.py: define the fields to crawl —
name = scrapy.Field()
surplus = scrapy.Field()
currentSales = scrapy.Field()
commission = scrapy.Field()
price = scrapy.Field()
******************* Divider *******************
3. dtk.py:
import scrapy
from ..items import DataokeItem  # ..items is the same as dataoke.items
# it is named item here, so pipelines must call it item too; yield item comes before the
# Request, so the item= line can live outside the class
item = DataokeItem()
from fake_useragent import UserAgent

class DtkSpider(scrapy.Spider):
    name = 'dtk'
    allowed_domains = ['www.dataoke.com']
    start_urls = [f'http://www.dataoke.com/quan_list?page={n}' for n in range(1, 12)]

    def start_requests(self):
        for url in self.start_urls:
            # token=hashlib.md5(*random*.encode()).hexdigest(); once expired, update both token and random to the latest values
            h = {'Cookie': 'token=3c6deb668cfcdf659d52dadd17eb281a;random=2834',
                 'User-Agent': UserAgent().random, 'Referer': url}  # uncomment the COOKIES… line in settings
            yield scrapy.Request(url, self.parse, headers=h)

    def parse(self, response):
        for x in response.css('div.quan_goods'):
            item['name'] = x.css('.quan_title::text').extract_first().strip()
            item['surplus'] = x.css('[style*=f15482]::text').extract()[1]  # attribute values with no special characters such as 0 can skip the quotes
            item['currentSales'] = x.css('[style*="7DB0D0"]::text').extract_first()
            item['commission'] = x.css('[style*="20px"]::text').extract_first().strip().replace(' ', '')
            item['price'] = x.css('[style*="30px"]::text').extract_first().strip()
            yield item
******************* Divider *******************
5. settings.py:
import random
ROBOTSTXT_OBEY=False  # changed: True → False
CONCURRENT_REQUESTS=32  # uncommented: number of concurrent requests (the other examples raise it to 64)
REACTOR_THREADPOOL_MAXSIZE=20  # added: max thread-pool size
LOG_LEVEL = 'INFO'  # added: the default DEBUG suits development; lowering the log level in production saves CPU
REDIRECT_ENABLED=False  # added: disable redirects
RETRY_TIMES=3  # added: retry count; RETRY_ENABLED=False disables retries altogether
DOWNLOAD_TIMEOUT=4  # added: give up on a request after 4 seconds of no response
DOWNLOAD_DELAY=random.random()*5  # uncommented & changed: this site loads slowly, so delay longer
COOKIES_ENABLED=False  # uncommented: this example's scrapy.Request() headers carry their own Cookie
#JOBDIR='jobDaTaoKe'  # enable when running from the Terminal; Ctrl+C then resumes from the breakpoint
**************************************** Divider ****************************************
Example: wutuxs (a no-image novel site) — log in, then crawl every book on my bookshelf:
1. python -m scrapy startproject wtxs;
cd wtxs; python -m scrapy genspider -t crawl wutuxs www.wutuxs.com
main.py:
import os
os.system('python -m scrapy crawl wutuxs')
******************* Divider *******************
3. wutuxs.py:
Execution order of the Rule parameters: the start_url's response goes to link_extractor → the extracted urls go through process_request → each url's response goes to callback for custom parsing → each url's response keeps being followed back into link_extractor……
In this example start_requests hands on the logged-in session, and the start_url is requested inside its callback — any header changes would also be written in that callback; the start url's response, after the default full parse, is passed to the Rule and the journey begins:
the LE parameter first extracts the wanted urls; if those urls each need their own Referer etc. before they can be requested correctly, set that in the function named by process_request and return the reshaped new request; each full response then goes to callback for custom parsing.
If those responses should be followed as well, set follow=True and the whole LE → callback cycle runs again.
When a LinkExtractor parameter takes a function it is written normally as self.func, no different from scrapy.Request's callback; it is the Rule parameters — the usual callback, process_request and friends — that take a str such as 'func'.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from fake_useragent import UserAgent
import random

class WutuxsSpider(CrawlSpider):
    name = 'wutuxs'
    allowed_domains = ['www.wutuxs.com']
    custom_settings = dict(ROBOTSTXT_OBEY=False, DOWNLOAD_DELAY=random.random()/10,
                           DEFAULT_REQUEST_HEADERS={'User-Agent': UserAgent().random})
    rules = (Rule(LinkExtractor(allow='indexflag=1'),
                  process_request='addReferer', callback='parseNovel', follow=True),)

    def start_requests(self):
        loginUrl = 'http://www.wutuxs.com/login.php?do=submit'
        h = {'Content-Type': 'application/x-www-form-urlencoded'}
        fd = dict(username='tom_cheng', password='***',
                  usecookie='315360000', action='login', submit=' 登 录 ')
        yield scrapy.FormRequest(loginUrl, headers=h, formdata=fd, callback=self.afterLogin)

    def afterLogin(self, response):
        startUrl = 'http://www.wutuxs.com/modules/article/bookcase.php'
        yield scrapy.Request(startUrl)

    def addReferer(self, request):  # this site needs no dynamic Referer; it's just hard to find a better example
        print(request.headers)  # prints {}; set cookies etc. via request.replace, the same trick as replacing response.encoding
        request = request.replace(headers={'User-Agent': UserAgent().random,
                                           'Content-Type': 'application/x-www-form-urlencoded',
                                           'Referer': '&'.join(request.url.split('&')[:2])})
        return request

    def parseNovel(self, response):
        response = response.replace(encoding='gbk')
        chapters = response.css('#at a')
        with open('E:/我的书架.txt', 'a+', encoding='gbk') as f:
            for chapter in chapters:
                chapterTitle = chapter.css('::text').extract_first()
                chapterUrl = response.urljoin(chapter.css('::attr(href)').extract_first())
                f.write(chapterTitle + ':' + chapterUrl + '\n')
******************* Divider *******************
Example: using scrapy.Selector on its own to parse page source:
import requests
from scrapy import Selector

def novelChapters():
    indexUrl = 'http://www.wutuxs.com'
    html = requests.get(indexUrl + '/modules/article/reader.php?aid=6301')
    html.encoding = 'gbk'
    chapters = Selector(html).css('#at a')
    print('Total chapters: ' + str(len(chapters)))
    with open('E:/我的书架.txt', 'a+') as f:
        for chapter in chapters:
            chapterTitle = chapter.css('::text').extract_first()
            chapterUrl = indexUrl + chapter.css('::attr(href)').extract_first()
            f.write(chapterTitle + ':' + chapterUrl + '\n')

novelChapters()
**************************************** Divider ****************************************
Example: Douban movie reviews — log in via the browser to grab cookies, then crawl with CrawlSpider:
1. python -m scrapy startproject doubanMovie;
cd doubanMovie; scrapy genspider comments movie.douban.com
main.py:
import os  # some user or content values contain special characters, so don't export as gbk; re-save as utf8+BOM
os.system('python -m scrapy crawl comments -o "file://E:/dbMovie.csv"')
******************* Divider *******************
2. items.py:
import scrapy

class DoubanmovieItem(scrapy.Item):
    user = scrapy.Field()
    userlink = scrapy.Field()
    view = scrapy.Field()
    rate = scrapy.Field()
    time = scrapy.Field()
    votes = scrapy.Field()
    content = scrapy.Field()
******************* Divider *******************
3. comments.py:
Under CrawlSpider, no matter how many yield Request() calls start_requests() chains through, the last one must not use a callback, or it gets detached from the Rule(); parse_start_url() is the hook that lets the final Request()'s url still go through the Rule()'s callback.
import random, string
from selenium import webdriver
from fake_useragent import UserAgent as ua
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DoubanmovieItem

userName = '904477955@qq.com'
pwd = '***'

def loginByBrowser():
    options = webdriver.ChromeOptions()
    options.binary_location = 'D:/Program Files/Browser/CentBrowser/Application/chrome.exe'
    options.add_argument('disable-infobars')
    driver = webdriver.Chrome('C:/Program Files/Python36/chromedriver', 0, options)
    driver.get('https://www.douban.com/accounts/login?source=movie')
    css = driver.find_element_by_css_selector
    css('#email').send_keys(userName)
    css('#password').send_keys(pwd)
    css('.btn-submit').click()
    try:  # three login-page cases: a captcha is shown, a captcha pops up after clicking log in, or no captcha at all
        if css('#captcha_field'):  # when the tag is absent selenium raises an error, whereas scrapy's extract_first() would not
            input('Type the captcha in the browser, click log in, then press any letter here: ')
    except: pass
    cookies = {d['name']: d['value'].replace('"', '') for d in driver.get_cookies()}
    driver.quit()
    return cookies

cookies = loginByBrowser()
# cookies={'ue':'username','__yadk_uid':'salted MD5 of the password','bid':'an 11-char random str'}

class CommentsSpider(CrawlSpider):
    name = 'comments'
    allowed_domains = ['movie.douban.com']
    rules = (Rule(LinkExtractor(restrict_css='.next'), follow=True,
                  process_request='update_cookies', callback='parse_item'),)

    def start_requests(self):
        firstPage = 'https://movie.douban.com/subject/20495023/comments?start=0'
        headers = {'User-Agent': ua().random}
        # with the COOKIES… line in settings left commented out, cookies used once are
        # shared with the later requests automatically
        yield Request(firstPage, headers=headers, cookies=cookies)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def update_cookies(self, request):
        global cookies
        cookies['bid'] = ''.join(random.sample(string.ascii_letters + string.digits, 11))
        request = request.replace(cookies=cookies, headers={'User-Agent': ua().random})
        return request

    def parse_item(self, response):
        item = DoubanmovieItem()
        for x in response.css('.comment'):
            item['user'] = x.css('.comment-info> a::text').extract_first()
            item['userlink'] = x.css('.comment-info> a::attr(href)').extract_first()
            item['view'] = x.css('.comment-info span::text').extract_first()
            item['rate'] = x.css('.rating::attr(title)').extract_first('无')
            item['time'] = x.css('.comment-time::text').extract_first().strip()
            item['votes'] = x.css('.votes::text').extract_first()
            item['content'] = x.css('p::text').extract_first().strip()
            yield item
******************* Divider *******************
5. settings.py:
import random
from fake_useragent import UserAgent
ROBOTSTXT_OBEY=False  # changed
CONCURRENT_REQUESTS=64  # uncommented & changed
REACTOR_THREADPOOL_MAXSIZE=20  # added
RETRY_TIMES=3  # added
DOWNLOAD_TIMEOUT=4  # added
DOWNLOAD_DELAY=random.random()/5  # uncommented & changed
DEFAULT_REQUEST_HEADERS={'User-Agent':UserAgent().random}  # uncommented & changed
**************************************** Divider ****************************************
Example: Douban's 9-point book list — crawl with RedisSpider, store to Excel or MongoDB:
1. scrapy startproject doubanBook;
cd doubanBook; scrapy genspider dbbook www.douban.com
main.py:
import os
os.system('scrapy crawl dbbook')
******************* Divider *******************
2. items.py:
bookName = scrapy.Field()
rate = scrapy.Field()
author = scrapy.Field()
url=scrapy.Field()
******************* Divider *******************
3. dbbook.py:
This example uses Redis db3 rather than the default db0; redis-server.exe was also not started by double-click this time but with the password-aware command line redis-server redis.windows.conf, the conf file having gained the new line requirepass 123456.
from redis import Redis; r = Redis('127.0.0.1', 6379, 3, '123456')
from ..items import DoubanbookItem; item = DoubanbookItem()
from scrapy_redis.spiders import RedisSpider

class DbbookSpider(RedisSpider):
    name = 'dbbook'
    allowed_domains = ['www.douban.com']
    r.flushdb()  # empty the currently selected db3
    redis_key = 'tempStartUrls'
    urls = [f'https://www.douban.com/doulist/1264675/?start={x*25}' for x in range(2)]
    r.sadd(redis_key, *urls)  # .sadd() initialises redis_key as a set; .rpush would make it a list
    # exit()  # when picking another db and using a password, not only the Redis object needs them, scrapy's settings do too
    custom_settings = dict(REDIS_PARAMS={'db': 3, 'password': '123456'})

    def parse(self, response):
        books = response.css('.bd.doulist-subject')
        for book in books:
            item['bookName'] = book.css('.title a::text').extract_first().strip()
            item['rate'] = book.css('.rating_nums::text').extract_first()
            item['author'] = book.css('.abstract::text').extract_first().strip()
            item['url'] = book.css('.title a::attr(href)').extract_first()
            yield item
******************* Divider *******************
4. pipelines.py:
# Storage option 1: write to Excel
# from openpyxl import Workbook
# class DoubanbookPipeline(object):
#     wb = Workbook()
#     ws = wb.active
#     ws.append(['书名','评分','作者','网址'])
#
#     def process_item(self, item, spider):
#         row = [item['bookName'], item['rate'], item['author'], item['url']]
#         self.ws.append(row)
#         self.wb.save('E:\douban.xlsx')
#         return item
# Storage option 2: write to MongoDB; start it first with net start mongodb
from pymongo import MongoClient

class DoubanbookPipeline(object):
    def open_spider(self, spider):
        self.client = MongoClient('localhost', 27017)
        self.db = self.client.豆瓣读书
        self.table = self.db.九分以上榜单

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.table.insert_one(dict(item))
        return item
******************* Divider *******************
5. settings.py:
import random
from fake_useragent import UserAgent
ROBOTSTXT_OBEY=False  # changed
CONCURRENT_REQUESTS=64  # uncommented & changed
REACTOR_THREADPOOL_MAXSIZE=20  # added
RETRY_TIMES=3  # added
DOWNLOAD_TIMEOUT=4  # added
DOWNLOAD_DELAY=random.random()/5  # uncommented & changed
DEFAULT_REQUEST_HEADERS={'User-Agent':UserAgent().random}  # uncommented & changed
ITEM_PIPELINES={……}  # uncommented; needed when writing to a spreadsheet, a database, or downloading files
FEED_FORMAT='csv'
FEED_URI='file://E:/douban.csv'
FEED_EXPORT_ENCODING='gbk'
# the 6 configuration lines for the scrapy_redis cache:
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
SCHEDULER_PERSIST = True
REDIS_START_URLS_AS_SET=True
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
**************************************** Divider ****************************************
Example: Zhihu Live — each session's topic and its speaker:
1. scrapy startproject zhihuLive; cd zhihuLive; scrapy genspider zHLive api.zhihu.com
main.py:
import os
os.system('scrapy crawl zHLive')
******************* Divider *******************
2. items.py:
title=scrapy.Field()
speaker=scrapy.Field()
******************* Divider *******************
3. zHLive.py:
import scrapy, json
from ..items import ZhihuliveItem
item = ZhihuliveItem()

class ZhliveSpider(scrapy.Spider):  # a CrawlSpider would need a LinkExtractor whose tags and attrs parameters restrict it to the urls inside the json
    name = 'zHLive'
    allowed_domains = ['api.zhihu.com']
    start_urls = ['https://api.zhihu.com/lives/homefeed?limit=10&offset=10&includes=live']

    def parse(self, response):
        result = json.loads(response.text)
        if result['data']:  # pages whose content has run out still have ['paging']['next'] etc., but data is []
            for x in result['data']:
                item['title'] = x['live']['subject']
                item['speaker'] = x['live']['speaker']['member']['name']
                yield item
            nextPageUrl = result['paging']['next'] + '&includes=live'
            yield scrapy.Request(url=nextPageUrl, callback=self.parse)
******************* Divider *******************
4. Preamble — using SQLite inside PyCharm:
① Configure SQLite (MySQL is similar):
Database panel on the far right → + → Data Source → pick Sqlite (Xerial); the first time sqlite is used in PyCharm, install the driver first by clicking the Download link at the bottom left.
② Create the sqlite database and table:
Pick a path, e.g. livespider where this project lives, name the database zhihulive.db, click Apply and OK → on the right of the PyCharm main window click Database, zhihulive.db, main, then + and table, name the table LiveTable → click + three times to add the fields: id is INTEGER, primary key, auto-increment; title and speaker are both TEXT (apart from the data type, which is filled in when you click +, the other traits such as primary key or auto-increment pop up in full after typing a single letter like k or a) → click Execute.
Extra table operations, e.g. clearing it: right-click the table and choose Open Console → DELETE FROM LiveTable → click the green triangle at the top left.
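For comparison, the same table can be created and cleared in plain sqlite3 without the PyCharm GUI (the path is illustrative):
import sqlite3
conn = sqlite3.connect('E:/py/zhihulive.db')      # creates the file if it doesn't exist yet
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS LiveTable (
                   id      INTEGER PRIMARY KEY AUTOINCREMENT,
                   title   TEXT,
                   speaker TEXT)''')
cur.execute('DELETE FROM LiveTable')              # "clear the table", as in the console tip above
conn.commit()
conn.close()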
*************** Divider ***************
4. pipelines.py:
import sqlite3
import MySQLdb

class ZhihulivePipeline(object):
    def open_spider(self, spider):
        # sqlite needs an absolute path, otherwise you may get sqlite3.OperationalError: no such table: LiveTable
        # self.conn = sqlite3.connect('E:/py/zhihulive.sqlite')  # the sqlite db is created automatically if absent
        # connecting to mysql takes 4+2 arguments, while sqlite only needs the full path; same split as Navicat's connection dialogs for the two
        self.conn = MySQLdb.connect(host='localhost', port=3306, user='chengy', password='',
                                    db='novel', charset='utf8mb4')  # key naming the existing utf8 database: mysql calls it db, django calls it name
        self.cur = self.conn.cursor()
        # self.cur.execute('create table if not exists LiveTable(speaker varchar(19),title text)')

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        speaker = item['speaker']
        title = item['title']
        # the {} formatting style works in both Sqlite and MySQL; Sqlite also takes ?, MySQL also takes %s
        # sql = 'insert into LiveTable(speaker,title) values(?,?)'    # Sqlite placeholders
        # sql = 'insert into LiveTable(speaker,title) values(%s,%s)'  # MySQL placeholders
        # self.cur.executemany(sql, [(speaker, title), ])  # or .execute(sql, (speaker, title))
        sql = f'insert into LiveTable(speaker,title) values("{speaker}","{title}")'
        self.cur.execute(sql)
        self.conn.commit()
        return item
******************* Divider *******************
5. settings.py:
import random
from fake_useragent import UserAgent
ROBOTSTXT_OBEY=False  # changed
CONCURRENT_REQUESTS=64  # uncommented & changed
REACTOR_THREADPOOL_MAXSIZE=20  # added
RETRY_TIMES=3  # added
DOWNLOAD_TIMEOUT=4  # added
DOWNLOAD_DELAY=random.random()/5  # uncommented & changed
DEFAULT_REQUEST_HEADERS={'User-Agent':UserAgent().random}  # uncommented & changed; the anti-crawl Authorization header is no longer used
ITEM_PIPELINES={……}  # uncommented
#JOBDIR='jobZhihuLive'  # enable when running from the Terminal; Ctrl+C then resumes from the breakpoint
**************************************** Divider ****************************************
Example: crawling Baidu Baike entries across the whole site:
1. scrapy startproject baiduBaike; cd baiduBaike; scrapy genspider -t crawl baike baike.baidu.com
main.py:
import os
os.system('scrapy crawl baike')
******************* Divider *******************
2. items.py:
词条 = scrapy.Field()
网址 = scrapy.Field()
编辑 = scrapy.Field()
更新 = scrapy.Field()
创建者 = scrapy.Field()
******************* Divider *******************
3. baike.py:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import BaidubaikeItem
item = BaidubaikeItem()

class BaikeSpider(CrawlSpider):  # with scrapy.Spider this raises NotImplementedError
    name = 'baike'
    allowed_domains = ['baike.baidu.com']
    start_urls = ['https://baike.baidu.com']
    # start_requests() fetches the start_urls, then LinkExtractor's regex is matched against the returned response
    # Rule's first parameter is a LinkExtractor whose main arguments (allow etc.) may also be a tuple; the second, callback, is a str naming the function
    rules = (Rule(LinkExtractor(('/item/',)), 'parseList', follow=True),)

    def parseList(self, response):
        try:  # some entries are blank pages with no edit count, last-updated date, etc.
            item['词条'] = response.css('#query::attr(value)').extract_first()
            item['网址'] = response.url
            item['编辑'] = response.css('.description li:nth-child(2)::text').extract_first().split(':')[1]
            item['更新'] = response.css('.j-modified-time::text').extract_first()
            item['创建者'] = response.css('.description .show-userCard::text').extract_first()
            yield item
        except:
            pass
******************* Divider *******************
4. pipelines.py:
import sqlite3

class BaidubaikePipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect('E:/BaiduBaike.sqlite')
        self.cur = self.conn.cursor()
        self.cur.execute('create table if not exists baike(词条 varchar(19),\
            网址 varchar(90),编辑 varchar(7),更新 varchar(10),创建者 varchar(19))')

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        keys = ','.join(item.keys())
        values = ','.join(len(item) * '?')  # Sqlite uses ? rather than MySQL-style %s
        sql = f'insert into baike({keys}) values({values})'  # keys and values stay in matching order
        self.cur.executemany(sql, [tuple(item.values()), ])
        self.conn.commit()
        return item
******************* Divider *******************
5. settings.py:
import random
from fake_useragent import UserAgent
ROBOTSTXT_OBEY=False  # changed
CONCURRENT_REQUESTS=64  # uncommented & changed
REACTOR_THREADPOOL_MAXSIZE=20  # added
RETRY_TIMES=3  # added
DOWNLOAD_TIMEOUT=4  # added
DOWNLOAD_DELAY=random.random()/5  # uncommented & changed
DEFAULT_REQUEST_HEADERS={'User-Agent':UserAgent().random}  # uncommented & changed
ITEM_PIPELINES={……}  # uncommented
#JOBDIR='jobBaiduBaike'
**************************************** Divider ****************************************
Example: downloading mzitu's images site-wide:
1. scrapy startproject mzitu; cd mzitu; scrapy genspider -t crawl mv www.mzitu.com
main.py:
import os
os.system('scrapy crawl mv')
******************* Divider *******************
2. items.py:
name = scrapy.Field()
referer = scrapy.Field()
imgUrls = scrapy.Field()
******************* Divider *******************
3. mv.py:
Method ①: when the real image URLs follow a pattern, all of a model's image URLs can be computed directly —
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import MzituItem  # yield item comes after the Request here, so the item assignment must not be written outside the class functions
import re
noName = r'[\\/:*?"<>|]'
pattern = re.compile('\d+(?=.jpg)')  # lookahead; method ② doesn't need the pattern line
# to reuse the value of the nth () of the search pattern, PyCharm writes …$n…, re.sub writes r'…\n…'
# pattern = re.compile('\d+(.jpg)')  # with this pattern, the substitution below becomes .sub(r'%0.2d\1'

class MvSpider(CrawlSpider):
    name = 'mv'
    allowed_domains = ['www.mzitu.com']
    start_urls = ['http://www.mzitu.com/']
    rules = (Rule(LinkExtractor(('mzitu.com/\d{1,6}$',)), 'parseAlbum', follow=True),)

    def parseAlbum(self, response):
        item = MzituItem()  # needed for method ②: if the item line sat outside the function, some of A's images would land in E's folder, E's in W's, and so on
        # num = response.css('.pagenavi').xpath('a//text()').extract()[-2]
        # print('total pages:', num)  # in xpath / is child, // is descendant; a child tag of the current node has no leading /
        num = response.css('.pagenavi ::text').extract()[-3]
        num = int(num) + 1
        name = response.css('.main-title::text').extract_first()
        item['name'] = re.sub(noName, ' ', name).strip()  # swap the illegal filename characters for spaces, trim the ends
        item['referer'] = response.url
        # the code above is shared by methods ① and ②; they diverge from here:
        # the 2017 real image URLs follow a pattern, so no need to request each page; for the patternless earlier years use method ②
        realUrl = response.css('.main-image img::attr(src)').extract_first()
        if 'net/2017/' in realUrl:
            item['imgUrls'] = [pattern.sub('%0.2d' % page, realUrl) for page in range(1, num)]
            yield item
**** Divider between methods ① & ② ****
Method ②: when the real image URLs have no pattern, a model's image URLs can only be pulled out after requesting each of her pages one by one (these lines continue parseAlbum's body):
        # here the located image URL is left inside [ ] unextracted, to stay compatible with the loop in pipelines' get_media_requests()
        # on many sites (like this one's 2017 images) the URLs follow a pattern and can all be computed into the [ ] directly, no need to request each page
        item['imgUrls'] = response.css('.main-image img::attr(src)').extract()
        yield item  # this response.url equals page 1; starting the loop at 1 raises "no more duplicates", so page 1 is handled separately
        # or perhaps add a line HTTPERROR_ALLOWED_CODES=[301,404] to settings? verify when there's time
        for page in range(2, num):
            pageUrl = response.url + '/' + str(page)
            yield scrapy.Request(pageUrl, self.getRealUrl, meta={'item': item})

    def getRealUrl(self, response):
        item = response.meta['item']
        item['imgUrls'] = response.css('.main-image img::attr(src)').extract()
        yield item  # writing this right under yield Request would jumble the storage, so each image URL is sent for download as soon as it's parsed
******************* Divider *******************
4. pipelines.py:
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from fake_useragent import UserAgent

# change the parent of the auto-generated class to ImagesPipeline and lightly override two of its methods
class MzituPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # ① "cannot create weak reference to 'list' object" —
        # yield and a list comprehension can't be mixed: either mimic the parent's source with return + a comprehension, or use a for loop + yield
        # ② add a meta parameter path to the Request for file_path()'s request below to use; best to spell out the full path right here
        h = {'User-Agent': UserAgent().random, 'referer': item['referer']}
        return [Request(x, headers=h, meta={'path': item['name'] + '/' + x.split('/')[-1]})
                for x in item['imgUrls']]

    def file_path(self, request, response=None, info=None):
        # only if the Request above also had callback=self.file_path might response.meta['path'] work here
        return request.meta['path']
******************* Divider *******************
5. settings.py:
import random
from fake_useragent import UserAgent
ROBOTSTXT_OBEY=False  # changed
CONCURRENT_REQUESTS=64  # uncommented & changed
REACTOR_THREADPOOL_MAXSIZE=20  # added
RETRY_TIMES=3  # added
DOWNLOAD_TIMEOUT=4  # added
DOWNLOAD_DELAY=random.random()*3  # uncommented & changed; downloading images is slower than data, so delay longer
DEFAULT_REQUEST_HEADERS={'User-Agent':UserAgent().random}  # uncommented & changed
ITEM_PIPELINES={……}  # uncommented
#JOBDIR = 'jobMzitu'
IMAGES_STORE = 'F:/mzitu/'
IMAGES_MIN_WIDTH = 400; IMAGES_MIN_HEIGHT = 400
IMAGES_EXPIRES = 30  # images crawled within the past 30 days aren't re-downloaded?
**************************************** Divider ****************************************
Example: log in to Zhihu and crawl each user's homepage, answer count, followers etc., saving to Excel:
1. scrapy startproject zhihuLogin;
cd zhihuLogin;scrapy genspider zhihuUsers www.zhihu.com
main.py:
import os
os.system('scrapy crawl zhihuUsers')
******************* Divider *******************
2. items.py:
name = scrapy.Field()
gender = scrapy.Field()
answer_count = scrapy.Field()
articles_count = scrapy.Field()
follower_count = scrapy.Field()
following_count = scrapy.Field()
url_token = scrapy.Field()
******************* Divider *******************
3. zhihuUsers.py:
phone = '1388908**41'
password = '登录密码'  # your login password
topUsers = set()
domain = 'https://www.zhihu.com/'
firstUser = 'liaoxuefeng'
firstUrl = domain + 'api/v4/members/{0}/followees?include=data[*].\
answer_count,articles_count,follower_count,following_count'
from scrapy import Spider, Request, FormRequest
from ..items import ZhihuloginItem
from io import BytesIO
from PIL import Image
import json, time, urllib

class ZhihuusersSpider(Spider):
    name = 'zhihuUsers'
    allowed_domains = ['www.zhihu.com']
    # start_urls = ['https://www.zhihu.com/']

    def start_requests(self):  # COOKIES_ENABLED = False in settings stays at its default commented-out state
        # return [Request(domain, self.captcha, dont_filter=True)]
        yield Request(domain, self.captcha, dont_filter=True)

    def captcha(self, response):
        xsrf = response.css('[name=_xsrf]::attr(value)').extract_first()
        r = int(time.time() * 1000)
        captchaUrl = domain + f'captcha.gif?r={r}&type=login&lang=cn'
        yield Request(captchaUrl, self.getCaptcha, meta={'xsrf': xsrf})  # way ① of writing a Request
        # way ②: sometimes one Request has to be split over 3 statements, e.g. lagou.com adds another key
        # to the cookies fetched by an earlier request: request.cookies['LGUID']=……['user_trace_token'][0]
        # request=Request(captchaUrl,self.getCaptcha); request.meta['xsrf']=xsrf; yield request

    def getCaptcha(self, response):
        # cookies = response.request.headers.get(b'Cookie')
        # if cookies:  # the Cookie value is still bytes: decode to str, drop the spaces, then parse_qs() turns it into a dict
        #     xsrf = urllib.parse.parse_qs(cookies.decode().replace(' ', ''))['_xsrf'][0]
        # print('a parameter from the cookies:', response.meta['xsrf'], xsrf, type(xsrf), sep='\n')
        Image.open(BytesIO(response.body)).show()
        captcha = tuple(int(x)*23 for x in input('Enter the positions of the inverted characters, e.g. 1-3 (counting from 1, separated by -): ').split('-'))
        if len(captcha) == 2:  # the current captcha usually shows two inverted characters, occasionally one
            captcha = '{"img_size":[200,44],"input_points":[[%s,23],[%s,23]]}' % captcha
        elif len(captcha) == 1:
            captcha = '{"img_size":[200,44],"input_points":[[%s,23]]}' % captcha
        fd = {'captcha_type': 'cn', 'captcha': captcha, '_xsrf': response.meta['xsrf'],
              'phone_num': phone, 'password': password}
        yield FormRequest(domain + 'login/phone_num', self.login, formdata=fd)

    def login(self, response):
        loginResult = json.loads(response.text)
        if loginResult['r'] == 0:
            print('Logged in, starting to crawl user info…')
            yield Request(firstUrl.format(firstUser), self.followers)
        else:
            print('Login failed…', loginResult, sep='\n')

    def followers(self, response):
        item = ZhihuloginItem()
        result = json.loads(response.text)
        if result['data']:
            for user in result['data']:
                item['name'] = user['name']
                item['gender'] = '男' if user['gender'] else '女'
                item['answer_count'] = user['answer_count']
                item['articles_count'] = user['articles_count']
                item['follower_count'] = user['follower_count']
                item['following_count'] = user['following_count']
                item['url_token'] = 'https://www.zhihu.com/people/' + user['url_token']
                if user['follower_count'] > 10000: topUsers.add(user['url_token'])
                yield item
        if result['paging']['is_end'] == False:  # false in the page source is unquoted, i.e. a boolean
            nextPageUrl = result['paging']['next']
            yield Request(nextPageUrl, self.followers)
        else:
            nextTopUser = topUsers.pop()
            yield Request(firstUrl.format(nextTopUser), self.followers)
******************* Divider *******************
4. pipelines.py:
import time
from openpyxl import Workbook

class ZhihuloginPipeline(object):
    wb = Workbook()
    ws = wb.active
    ws.append(['姓名','性别','回答','文章','关注他','他关注','主页'])

    def process_item(self, item, spider):
        row = [item['name'], item['gender'], item['answer_count'], item['articles_count'],
               item['follower_count'], item['following_count'], item['url_token']]
        self.ws.append(row)
        path = time.strftime('zhihuUsers %x.xlsx', time.localtime()).replace('/', '-')
        self.wb.save(f'E:/{path}')  # the first argument of time.strftime() must not contain Chinese
        return item
******************* Divider *******************
5. settings.py:
import random
from fake_useragent import UserAgent
ROBOTSTXT_OBEY=False  # changed
CONCURRENT_REQUESTS=64  # uncommented & changed
REACTOR_THREADPOOL_MAXSIZE=20  # added
RETRY_TIMES=3  # added
DOWNLOAD_TIMEOUT=4  # added
DOWNLOAD_DELAY=random.random()/5  # uncommented & changed
DEFAULT_REQUEST_HEADERS={'User-Agent':UserAgent().random}  # uncommented & changed; the anti-crawl Authorization header is no longer used
ITEM_PIPELINES={……}  # uncommented
#JOBDIR='jobZhihuLogin'