mrjob 使用 mongoldb 数据源【转】

最近做的事情是用mrjob写mapreduce程序,从mongo读取数据。我的做法很容易也很好懂,因为mrjob可以支持sys.stdin的读取,所以我考虑用一个python程序读mongo中的数据,然后同时让mrjob脚本接受输入,处理,输出。

 

具体方式:

readInMongoDB.py:

 

#coding:UTF-8
'''
Created on 2014年5月28日

@author: hao
'''
import pymongo
pyconn = pymongo.Connection(host,port=27017)
pycursor = pyconn.userid_cid_score.find().batch_size(30)
for i in pycursor:
    userId = i['userId']
    cid = i['cid']
    score = i['score']
#     temp = list()
#     temp.append(userId)
#     temp.append(cid)
#     temp.append(score)
    print str(userId)+','+str(cid)+','+str(score)

 

 

step1.py:

 

#coding:UTF-8
'''
Created on 2014年5月27日

@author: hao
'''
from mrjob.job import MRJob
# from mrjob import protocol
import pymongo
import logging
import simplejson as sj

class step(MRJob):
    '''
    '''
#     logging.c
    def parseMatrix(self, _, line):
        '''
        input one stdin for pymongo onetime search
        output contentId, (userId, rating)
        '''
        line = (str(line))
        line=line.split(',')
        userId = line[0]
#         print userId
        cid = line[1]
#         print cid
        score = float(line[2])
#         print score
        yield cid, (userId, float(score))        

    
    def scoreCombine(self, cid, userRating):
        '''
        将对同一个内容的(用户,评分)拼到一个list里
        '''
        userRatings = list()
        for i in userRating:
            userRatings.append(i)
        yield cid, userRatings
        
    def userBehavior(self, cid, userRatings):
        '''        
        '''
        scoreList = list()
        for doc in userRatings:
            # 每个combiner结果
            for i in doc:
                scoreList.append(i)
        for user1 in scoreList:
            for user2 in scoreList:
                if user1[0] == user2[0]:
                    continue
                yield (user1[0], user2[0]), (user1[1], user2[1])
    
    
    def steps(self):
        return [self.mr(mapper = self.parseMatrix,
                        reducer = self.scoreCombine),
                self.mr(reducer = self.userBehavior),]
    
    
if __name__=='__main__':
    
    fp = open('a.txt','w')
    fp.write('[')
    step.run()
    fp.write(']')
    fp.close()

 

 

然后执行脚本  python readInMongoDB.py | python step1.py >> out.txt 

这个方式在本地执行的非常好,没有任何问题(除开mrjob速度的问题,其实在本次应用中影响不大)

 

原文:http://blog.csdn.net/whzhcahzxh/article/details/29587059

posted @ 2015-11-05 14:30  高山流的不是水  阅读(326)  评论(0编辑  收藏  举报