[python] MongoDB
Documentation:
http://api.mongodb.com/python/current/tutorial.html
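Installation:
Download the installer directly from the official website. On macOS the brew download is too slow, so I plan to install manually.
Usage:
Start the server:
mongod                      # start the server with the default configuration
mongod --dbpath <db path>   # specify the database file path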
Connect to the server:
mongo    # connect with the default configuration
mongo [options] [db address] [file names (ending in .js)]
GUI client:
https://www.robomongo.org/
shell:
> help
    db.help()                    help on db methods
    db.mycoll.help()             help on collection methods
    sh.help()                    sharding helpers
    rs.help()                    replica set helpers
    help admin                   administrative help
    help connect                 connecting to a db help
    help keys                    key shortcuts
    help misc                    misc things to know
    help mr                      mapreduce

    show dbs                     show database names
    show collections             show collections in current database
    show users                   show users in current database
    show profile                 show most recent system.profile entries with time >= 1ms
    show logs                    show the accessible logger names
    show log [name]              prints out the last segment of log in memory, 'global' is default
    use <db_name>                set current database
    db.foo.find()                list objects in collection foo
    db.foo.find( { a : 1 } )     list objects in foo where a == 1
    it                           result of the last line evaluated; use to further iterate
    DBQuery.shellBatchSize = x   set default number of items to display on shell
    exit                         quit the mongo shell
More help output:
> db.help()
DB methods:
    db.adminCommand(nameOrDocument) - switches to 'admin' db, and runs command [just calls db.runCommand(...)]
    db.aggregate([pipeline], {options}) - performs a collectionless aggregation on this database; returns a cursor
    db.auth(username, password)
    db.cloneDatabase(fromhost)
    db.commandHelp(name) returns the help for the command
    db.copyDatabase(fromdb, todb, fromhost)
    db.createCollection(name, {size: ..., capped: ..., max: ...})
    db.createView(name, viewOn, [{$operator: {...}}, ...], {viewOptions})
    db.createUser(userDocument)
    db.currentOp() displays currently executing operations in the db
    db.dropDatabase()
    db.eval() - deprecated
    db.fsyncLock() flush data to disk and lock server for backups
    db.fsyncUnlock() unlocks server following a db.fsyncLock()
    db.getCollection(cname) same as db['cname'] or db.cname
    db.getCollectionInfos([filter]) - returns a list that contains the names and options of the db's collections
    db.getCollectionNames()
    db.getLastError() - just returns the err msg string
    db.getLastErrorObj() - return full status object
    db.getLogComponents()
    db.getMongo() get the server connection object
    db.getMongo().setSlaveOk() allow queries on a replication slave server
    db.getName()
    db.getPrevError()
    db.getProfilingLevel() - deprecated
    db.getProfilingStatus() - returns if profiling is on and slow threshold
    db.getReplicationInfo()
    db.getSiblingDB(name) get the db at the same server as this one
    db.getWriteConcern() - returns the write concern used for any operations on this db, inherited from server object if set
    db.hostInfo() get details about the server's host
    db.isMaster() check replica primary status
    db.killOp(opid) kills the current operation in the db
    db.listCommands() lists all the db commands
    db.loadServerScripts() loads all the scripts in db.system.js
    db.logout()
    db.printCollectionStats()
    db.printReplicationInfo()
    db.printShardingStatus()
    db.printSlaveReplicationInfo()
    db.dropUser(username)
    db.repairDatabase()
    db.resetError()
    db.runCommand(cmdObj) run a database command. if cmdObj is a string, turns it into {cmdObj: 1}
    db.serverStatus()
    db.setLogLevel(level,<component>)
    db.setProfilingLevel(level,slowms) 0=off 1=slow 2=all
    db.setWriteConcern(<write concern doc>) - sets the write concern for writes to the db
    db.unsetWriteConcern(<write concern doc>) - unsets the write concern for writes to the db
    db.setVerboseShell(flag) display extra information in shell output
    db.shutdownServer()
    db.stats()
    db.version() current version of the server
>
> db.mycoll.help()
DBCollection help
    db.mycoll.find().help() - show DBCursor help
    db.mycoll.bulkWrite( operations, <optional params> ) - bulk execute write operations, optional parameters are: w, wtimeout, j
    db.mycoll.count( query = {}, <optional params> ) - count the number of documents that matches the query, optional parameters are: limit, skip, hint, maxTimeMS
    db.mycoll.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
    db.mycoll.convertToCapped(maxBytes) - calls {convertToCapped:'mycoll', size:maxBytes}} command
    db.mycoll.createIndex(keypattern[,options])
    db.mycoll.createIndexes([keypatterns], <options>)
    db.mycoll.dataSize()
    db.mycoll.deleteOne( filter, <optional params> ) - delete first matching document, optional parameters are: w, wtimeout, j
    db.mycoll.deleteMany( filter, <optional params> ) - delete all matching documents, optional parameters are: w, wtimeout, j
    db.mycoll.distinct( key, query, <optional params> ) - e.g. db.mycoll.distinct( 'x' ), optional parameters are: maxTimeMS
    db.mycoll.drop() drop the collection
    db.mycoll.dropIndex(index) - e.g. db.mycoll.dropIndex( "indexName" ) or db.mycoll.dropIndex( { "indexKey" : 1 } )
    db.mycoll.dropIndexes()
    db.mycoll.ensureIndex(keypattern[,options]) - DEPRECATED, use createIndex() instead
    db.mycoll.explain().help() - show explain help
    db.mycoll.reIndex()
    db.mycoll.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
                                       e.g. db.mycoll.find( {x:77} , {name:1, x:1} )
    db.mycoll.find(...).count()
    db.mycoll.find(...).limit(n)
    db.mycoll.find(...).skip(n)
    db.mycoll.find(...).sort(...)
    db.mycoll.findOne([query], [fields], [options], [readConcern])
    db.mycoll.findOneAndDelete( filter, <optional params> ) - delete first matching document, optional parameters are: projection, sort, maxTimeMS
    db.mycoll.findOneAndReplace( filter, replacement, <optional params> ) - replace first matching document, optional parameters are: projection, sort, maxTimeMS, upsert, returnNewDocument
    db.mycoll.findOneAndUpdate( filter, update, <optional params> ) - update first matching document, optional parameters are: projection, sort, maxTimeMS, upsert, returnNewDocument
    db.mycoll.getDB() get DB object associated with collection
    db.mycoll.getPlanCache() get query plan cache associated with collection
    db.mycoll.getIndexes()
    db.mycoll.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
    db.mycoll.insert(obj)
    db.mycoll.insertOne( obj, <optional params> ) - insert a document, optional parameters are: w, wtimeout, j
    db.mycoll.insertMany( [objects], <optional params> ) - insert multiple documents, optional parameters are: w, wtimeout, j
    db.mycoll.mapReduce( mapFunction , reduceFunction , <optional params> )
    db.mycoll.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
    db.mycoll.remove(query)
    db.mycoll.replaceOne( filter, replacement, <optional params> ) - replace the first matching document, optional parameters are: upsert, w, wtimeout, j
    db.mycoll.renameCollection( newName , <dropTarget> ) renames the collection.
    db.mycoll.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
    db.mycoll.save(obj)
    db.mycoll.stats({scale: N, indexDetails: true/false, indexDetailsKey: <index key>, indexDetailsName: <index name>})
    db.mycoll.storageSize() - includes free space allocated to this collection
    db.mycoll.totalIndexSize() - size in bytes of all the indexes
    db.mycoll.totalSize() - storage allocated for all data and indexes
    db.mycoll.update( query, object[, upsert_bool, multi_bool] ) - instead of two flags, you can pass an object with fields: upsert, multi
    db.mycoll.updateOne( filter, update, <optional params> ) - update the first matching document, optional parameters are: upsert, w, wtimeout, j
    db.mycoll.updateMany( filter, update, <optional params> ) - update all matching documents, optional parameters are: upsert, w, wtimeout, j
    db.mycoll.validate( <full> ) - SLOW
    db.mycoll.getShardVersion() - only for use with sharding
    db.mycoll.getShardDistribution() - prints statistics about data distribution in the cluster
    db.mycoll.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
    db.mycoll.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
    db.mycoll.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
    db.mycoll.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
    db.mycoll.latencyStats() - display operation latency histograms for this collection
>
> sh.help()
    sh.addShard( host )                       server:port OR setname/server:port
    sh.addShardToZone(shard,zone)             adds the shard to the zone
    sh.updateZoneKeyRange(fullName,min,max,zone)  assigns the specified range of the given collection to a zone
    sh.disableBalancing(coll)                 disable balancing on one collection
    sh.enableBalancing(coll)                  re-enable balancing on one collection
    sh.enableSharding(dbname)                 enables sharding on the database dbname
    sh.getBalancerState()                     returns whether the balancer is enabled
    sh.isBalancerRunning()                    return true if the balancer has work in progress on any mongos
    sh.moveChunk(fullName,find,to)            move the chunk where 'find' is to 'to' (name of shard)
    sh.removeShardFromZone(shard,zone)        removes the shard from zone
    sh.removeRangeFromZone(fullName,min,max)  removes the range of the given collection from any zone
    sh.shardCollection(fullName,key,unique,options)  shards the collection
    sh.splitAt(fullName,middle)               splits the chunk that middle is in at middle
    sh.splitFind(fullName,find)               splits the chunk that find is in at the median
    sh.startBalancer()                        starts the balancer so chunks are balanced automatically
    sh.status()                               prints a general overview of the cluster
    sh.stopBalancer()                         stops the balancer so chunks are not balanced automatically
    sh.disableAutoSplit()                     disable autoSplit on one collection
    sh.enableAutoSplit()                      re-enable autoSplit on one collection
    sh.getShouldAutoSplit()                   returns whether autosplit is enabled
>
> rs.help()
    rs.status()                                { replSetGetStatus : 1 } checks repl set status
    rs.initiate()                              { replSetInitiate : null } initiates set with default settings
    rs.initiate(cfg)                           { replSetInitiate : cfg } initiates set with configuration cfg
    rs.conf()                                  get the current configuration object from local.system.replset
    rs.reconfig(cfg)                           updates the configuration of a running replica set with cfg (disconnects)
    rs.add(hostportstr)                        add a new member to the set with default attributes (disconnects)
    rs.add(membercfgobj)                       add a new member to the set with extra attributes (disconnects)
    rs.addArb(hostportstr)                     add a new member which is arbiterOnly:true (disconnects)
    rs.stepDown([stepdownSecs, catchUpSecs])   step down as primary (disconnects)
    rs.syncFrom(hostportstr)                   make a secondary sync from the given member
    rs.freeze(secs)                            make a node ineligible to become primary for the time specified
    rs.remove(hostportstr)                     remove a host from the replica set (disconnects)
    rs.slaveOk()                               allow queries on secondary nodes

    rs.printReplicationInfo()                  check oplog size and time range
    rs.printSlaveReplicationInfo()             check replica set members and replication lag
    db.isMaster()                              check who is primary

    reconfiguration helpers disconnect from the database so the shell will display
    an error, even if the command succeeds.
>
> help admin
    ls([path])                      list files
    pwd()                           returns current directory
    listFiles([path])               returns file list
    hostname()                      returns name of this host
    cat(fname)                      returns contents of text file as a string
    removeFile(f)                   delete a file or directory
    load(jsfilename)                load and execute a .js file
    run(program[, args...])         spawn a program and wait for its completion
    runProgram(program[, args...])  same as run(), above
    sleep(m)                        sleep m milliseconds
    getMemInfo()                    diagnostic
>
> help connect

Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options.
Additional connections may be opened:

    var x = new Mongo('host[:port]');
    var mydb = x.getDB('mydb');
  or
    var mydb = connect('host[:port]/mydb');

Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.

>
> help keys
Tab completion and command history is available at the command prompt.

Some emacs keystrokes are available too:
    Ctrl-A  start of line
    Ctrl-E  end of line
    Ctrl-K  del to end of line

Multi-line commands
You can enter a multi line javascript expression. If parens, braces, etc. are not closed, you will see a new line
beginning with '...' characters. Type the rest of your expression. Press Ctrl-C to abort the data entry if you
get stuck.

>
> help misc
    b = new BinData(subtype,base64str)  create a BSON BinData value
    b.subtype()                         the BinData subtype (0..255)
    b.length()                          length of the BinData data in bytes
    b.hex()                             the data as a hex encoded string
    b.base64()                          the data as a base 64 encoded string
    b.toString()

    b = HexData(subtype,hexstr)         create a BSON BinData value from a hex string
    b = UUID(hexstr)                    create a BSON BinData value of UUID subtype
    b = MD5(hexstr)                     create a BSON BinData value of MD5 subtype
    "hexstr"                            string, sequence of hex characters (no 0x prefix)

    o = new ObjectId()                  create a new ObjectId
    o.getTimestamp()                    return timestamp derived from first 32 bits of the OID
    o.isObjectId
    o.toString()
    o.equals(otherid)

    d = ISODate()                       like Date() but behaves more intuitively when used
    d = ISODate('YYYY-MM-DD hh:mm:ss')  without an explicit "new " prefix on construction
>
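The o.getTimestamp() entry above says the timestamp is derived from the first 32 bits of the ObjectId. Here is a minimal Python sketch of that derivation, assuming the ObjectId is available as its 24-character hex string. The helper name objectid_timestamp and the sample id are illustrative only; they are not part of the shell or of pymongo.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Return the creation time encoded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) hold seconds since the Unix epoch.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


sample_oid = "507f1f77bcf86cd799439011"  # illustrative value, not from a real collection
created_at = objectid_timestamp(sample_oid)
print(created_at.isoformat())
```

With pymongo installed, bson.ObjectId(sample_oid).generation_time should give the same instant.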
> help mr

See also http://dochub.mongodb.org/core/mapreduce

function mapf() {
  // 'this' holds current document to inspect
  emit(key, value);
}

function reducef(key,value_array) {
  return reduced_value;
}

db.mycollection.mapReduce(mapf, reducef[, options])

options
{[query : <query filter object>]
 [, sort : <sort the query. useful for optimization>]
 [, limit : <number of objects to return from collection>]
 [, out : <output-collection name>]
 [, keeptemp: <true|false>]
 [, finalize : <finalizefunction>]
 [, scope : <object where fields go into javascript global scope >]
 [, verbose : true]}

>
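The mapf/reducef pair above can be hard to picture from the signature alone. The following is a rough pure-Python emulation of the idea: group the emitted values by key, then reduce each group. It is only a sketch and ignores server-side details such as re-reducing partial results. The names map_reduce, map_tags and reduce_count and the sample documents are made up for illustration.

```python
from collections import defaultdict


def map_reduce(documents, mapf, reducef):
    """Group the (key, value) pairs emitted by mapf, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        # mapf plays the role of the shell's map function; each yielded
        # pair corresponds to one emit(key, value) call.
        for key, value in mapf(doc):
            groups[key].append(value)
    return {key: reducef(key, values) for key, values in groups.items()}


def map_tags(doc):
    """Emit (tag, 1) for every tag in the document."""
    for tag in doc["tags"]:
        yield tag, 1


def reduce_count(key, values):
    """Sum the emitted counts for one key."""
    return sum(values)


docs = [{"tags": ["python", "mongodb"]}, {"tags": ["python"]}]
counts = map_reduce(docs, map_tags, reduce_count)
print(counts)
```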
Python driver:
pip install pymongo
scrapy:
settings.py
# ITEM_PIPELINES is a dict in current Scrapy; the integer sets the run order.
ITEM_PIPELINES = {'stack.pipelines.MongoDBPipeline': 300}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"
pipelines.py
import logging

import pymongo
from scrapy.exceptions import DropItem
# scrapy.conf and scrapy.log were removed from Scrapy; use these instead.
from scrapy.utils.project import get_project_settings

logger = logging.getLogger(__name__)
settings = get_project_settings()


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Iterating an item yields field names, so check the values explicitly.
        for field, value in dict(item).items():
            if not value:
                raise DropItem("Missing {0}!".format(field))
        # insert() is deprecated in pymongo; insert_one() is the replacement.
        self.collection.insert_one(dict(item))
        logger.debug("Question added to MongoDB database!")
        return item
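The pipeline above is meant to drop items that have empty fields. That check can be sketched on its own, using plain dicts in place of Scrapy items so that it runs without Scrapy or a MongoDB server. The helper name find_empty_fields and the sample data are hypothetical. Note that "not value" also treats 0 and False as empty, which may or may not be what you want.

```python
def find_empty_fields(item):
    """Return the names of fields whose value is missing or empty."""
    # Iterating a dict (or a dict-like Scrapy item) yields keys, not values,
    # so each value has to be looked up explicitly.
    return [field for field, value in dict(item).items() if not value]


# Hypothetical sample items.
complete = {"title": "How do I connect?", "url": "http://example.com/q/1"}
incomplete = {"title": "", "url": "http://example.com/q/2"}

print(find_empty_fields(complete))
print(find_empty_fields(incomplete))
```

A pipeline could raise DropItem whenever this helper returns a non-empty list.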
From the official Scrapy documentation (https://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb):
pipelines.py
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
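This pipeline reads MONGO_URI and MONGO_DATABASE from the crawler settings, which are different names from the MONGODB_* settings used earlier. A matching settings.py might look like the sketch below, assuming a local server on the default port. The module path myproject.pipelines and the priority 300 are placeholders, not values from the Scrapy documentation.

```python
# settings.py (sketch; adjust names to your own project)

ITEM_PIPELINES = {
    # Hypothetical module path; point this at wherever MongoPipeline lives.
    "myproject.pipelines.MongoPipeline": 300,
}

# Read by MongoPipeline.from_crawler() via crawler.settings.get(...).
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "items"
```

If MONGO_DATABASE is omitted, from_crawler falls back to 'items', as the code above shows.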