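Setting up a cookie pool in Scrapy
The code below is fairly detailed and can be used more or less as is.
It covers:
- Fetching cookies from a web page
- Storing them in MongoDB
- Periodically deleting expired cookies
- A Scrapy middleware that draws cookies from the pool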
#!/usr/bin/python
#coding=utf-8
#__author__='dahu'
#data=2017-
#
# Note: this is a Python 2 script (cookielib, urllib2, print statements).
import requests
import time
from pymongo import MongoClient
import cookielib
import urllib2
from bson.objectid import ObjectId

url = 'https://www.so.com'
# url = 'https://cn.bing.com/translator'
client = MongoClient('localhost', 27017)
db = client['save_cookie']
collection = db['san60cookie']


def get_header():
    header = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Host": "www.so.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    }
    return header


def get_cookie_lib():
    # Open the page with a cookie-aware opener and collect whatever cookies it sets.
    cookie = cookielib.CookieJar()
    handler = urllib2.HTTPCookieProcessor(cookie)
    opener = urllib2.build_opener(handler)
    response = opener.open(url)
    # for item in cookie:
    #     print "%s : %s" % (item.name, item.value)
    cookie_dict = {}
    for cook in cookie:
        cookie_dict[cook.name] = cook.value
    return cookie_dict


def save_cookie_into_mongodb(cookie):
    print 'insert'
    insert_data = {}
    insert_data['cookie'] = cookie
    insert_data['insert_time'] = time.strftime('%Y-%m-%d %H:%M:%S')
    insert_data['request_url'] = url
    insert_data['insert_timestamp'] = time.time()
    # insert_one replaces insert(), which is deprecated since PyMongo 3.0.
    collection.insert_one(insert_data)


def delete_timeout_cookie(request_url):
    # Remove every cookie for this URL that is older than time_out seconds.
    time_out = 300
    for data in collection.find({'request_url': request_url}):
        if (time.time() - data.get('insert_timestamp')) > time_out:
            print 'delete: %s' % data.get('_id')
            # If this step is unclear, see
            # http://api.mongodb.com/python/current/tutorial.html#querying-by-objectid
            collection.delete_one({'_id': ObjectId(data.get('_id'))})


def get_cookie_from_mongodb():
    cookies = [data.get('cookie') for data in collection.find()]
    return cookies


if __name__ == '__main__':
    num = 0
    while 1:
        # Every third iteration, purge expired cookies; otherwise fetch and store a new one.
        if num == 2:
            print 'deleting'
            delete_timeout_cookie(url)
            num = 0
        else:
            cookie = get_cookie_lib()
            save_cookie_into_mongodb(cookie)
            num += 1
        time.sleep(5)
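The expiry check inside delete_timeout_cookie can also be pulled out into a small pure function, which makes it possible to test without a running MongoDB. What follows is only a minimal Python 3 sketch, not the original author's code: the names `expired_ids` and `now` are introduced here for illustration, and the documents are assumed to carry the same `_id` and `insert_timestamp` fields as the script above.

```python
import time

TIME_OUT = 300  # seconds; same threshold as delete_timeout_cookie


def expired_ids(docs, now=None, time_out=TIME_OUT):
    """Return the _id of every document older than time_out seconds.

    docs is any iterable of dicts with '_id' and 'insert_timestamp' keys,
    for example the result of collection.find({'request_url': url}).
    now is a hypothetical hook for tests; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    return [d['_id'] for d in docs if now - d['insert_timestamp'] > time_out]


if __name__ == '__main__':
    # In-memory stand-ins for MongoDB documents (illustrative values only).
    sample = [
        {'_id': 'a', 'insert_timestamp': 1000.0},
        {'_id': 'b', 'insert_timestamp': 1290.0},
    ]
    print(expired_ids(sample, now=1301.0))
```

With a real collection, the returned ids could then be removed in one call, for instance with `collection.delete_many({'_id': {'$in': ids}})`, instead of one `delete_one` per document.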
The corresponding middleware file can be written like this:
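import random

# get_cookie_from_mongodb is defined in the cookie-pool script above;
# import it from wherever that script is saved in your project.


class CookiesMiddleware(object):
    def process_request(self, request, spider):
        # Attach a randomly chosen cookie dict from the pool to each outgoing request.
        cookie = random.choice(get_cookie_from_mongodb())
        request.cookies = cookie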
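One edge case worth guarding against: if the pool is empty (for example right after all cookies have expired), `random.choice` raises `IndexError` on the empty list. The following Python 3 sketch shows one possible way to handle that; it is an illustration rather than the author's middleware. `pick_cookie`, `SafeCookiesMiddleware` and the `cookie_source` hook are hypothetical names, and the default source returns an empty list so the block runs without MongoDB.

```python
import random


def pick_cookie(cookies):
    """Return a random cookie dict from the pool, or None if the pool is empty."""
    return random.choice(cookies) if cookies else None


class SafeCookiesMiddleware:
    """Downloader-middleware sketch that tolerates an empty cookie pool."""

    # Hypothetical hook: a zero-argument callable returning a list of cookie
    # dicts. In the setup described here it would be get_cookie_from_mongodb.
    cookie_source = staticmethod(lambda: [])

    def process_request(self, request, spider):
        cookie = pick_cookie(self.cookie_source())
        if cookie is not None:
            request.cookies = cookie
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

To use it, the hook would be pointed at the real pool, e.g. `SafeCookiesMiddleware.cookie_source = staticmethod(get_cookie_from_mongodb)`, and the class registered under `DOWNLOADER_MIDDLEWARES` in the project's settings.py. Since Scrapy's built-in CookiesMiddleware sits at priority 700 by default, a lower number is probably needed so that this middleware sets `request.cookies` before the built-in one reads it.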