Implementing PageRank with MapReduce (Adjacency-Matrix Approach)
Preface
I previously wrote up the sparse-graph implementation; this time I describe the version that stores the data as a matrix. If you know how to multiply two matrices, implementing this is quite simple. If anything is unclear, have a look at the two earlier posts on this topic first.
Algorithm Implementation
The job needs two input matrices, A and B. My first idea was to keep the two matrices in separate files and read them separately, but that turned out not to work: I had no way to tell which matrix a record came from, so I could not tag the values with A or B.
So instead I concatenate matrices A and B into a single file from the start and read everything in one pass.
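For reference, with the sizes used later in run.sh (a 4x4 matrix A and a 4x1 vector B), the merged input file would look something like the sketch below: the first row_a lines are the rows of A, tab-separated, and the last row_b lines are the entries of B, one per line. The numbers are made-up placeholders, not data from an actual run:

0       0.5     0.5     0
0.333   0       0       0.5
0.333   0.5     0       0.5
0.333   0       0.5     0
0.25
0.25
0.25
0.25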
map.py
#!/usr/bin/env python3
# Mapper: emits every (cell, tagged value) pair needed to compute A x B.
import os
import sys

flag = 0         # 0: currently reading matrix A, 1: currently reading vector B
current_row = 1  # row number within the matrix currently being read

def read_input():
    for lines in sys.stdin:
        yield lines

if __name__ == '__main__':
    # Matrix dimensions are passed in through -cmdenv in run.sh.
    row_a = int(os.environ.get('row_a'))
    col_a = int(os.environ.get('col_a'))
    row_b = int(os.environ.get('row_b'))
    col_b = int(os.environ.get('col_b'))
    for line in read_input():
        if line.count('\n') == len(line):  # skip blank lines
            continue
        data = line.strip().split('\t')
        if flag == 0:
            # Row current_row of A contributes to every cell (current_row, i) of the result.
            for i in range(col_b):
                for j in range(col_a):
                    print("%s,%s\tA:%s,%s" % (current_row, i+1, j+1, data[j]))
            current_row += 1
            if current_row > row_a:
                # All of A has been read; the remaining lines belong to B.
                flag = 1
                current_row = 1
        elif flag == 1:
            # Entry current_row of B contributes to every cell (i, j) of the result.
            for i in range(row_a):
                for j in range(col_b):
                    print("%s,%s\tB:%s,%s" % (i+1, j+1, current_row, data[j]))
            current_row += 1
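To make the key/value format concrete: the key is the coordinate of a cell in the result vector, and the value is tagged A or B so the reducer can tell the two matrices apart. Using the made-up input above, the line holding row 1 of A and the line holding the first entry of B would produce roughly:

1,1     A:1,0
1,1     A:2,0.5
1,1     A:3,0.5
1,1     A:4,0
...
1,1     B:1,0.25
2,1     B:1,0.25
3,1     B:1,0.25
4,1     B:1,0.25

After the shuffle, all records with the same key (here 1,1) land in the same reducer group, which is exactly the data needed for one dot product.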
reduce.py
#!/usr/bin/env python3
# Reducer: for each output cell, multiplies the matching A and B entries,
# sums them, and applies the damping factor.
import os
import sys
from itertools import groupby
from operator import itemgetter

def read_input(splitstr):
    for line in sys.stdin:
        line = line.strip()
        if len(line) == 0:
            continue
        yield line.split(splitstr)

if __name__ == '__main__':
    alpha = float(os.environ.get('alpha'))  # damping factor
    row_b = int(os.environ.get('row_b'))    # number of pages N
    data = read_input('\t')
    # Streaming sorts by key, so records for the same output cell arrive consecutively.
    lstg = groupby(data, itemgetter(0))
    try:
        for flag, group in lstg:
            matrix_a, matrix_b = {}, {}
            total = 0.0
            for element, g in group:
                # g looks like "A:3,0.5" -> matrix tag, index within the row, value
                matrix = g.split(':')[0]
                pos = g.split(':')[1].split(',')[0]
                value = g.split(',')[1]
                if matrix == 'A':
                    matrix_a[pos] = value
                else:
                    matrix_b[pos] = value
            # Dot product of one row of A with the rank vector B.
            for key in matrix_a:
                total += float(matrix_a[key]) * float(matrix_b[key])
            page_rank = alpha * total + (1.0 - alpha) / row_b
            print("%s" % page_rank)
    except Exception:
        pass
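Each reducer output line is one entry of the new rank vector, i.e. PR_i = alpha * sum_j A[i][j] * B[j] + (1 - alpha) / N, with N taken from row_b. Before going through Hadoop, the pair can be dry-run locally by imitating the streaming framework with a pipe and a sort; a minimal sketch, assuming the merged input above is saved as B.txt and both scripts are executable:

export row_a=4 col_a=4 row_b=4 col_b=1 alpha=0.8
python3 map.py < B.txt | sort | python3 reduce.py

The sort stands in for the shuffle: it brings identical keys together so that groupby in the reducer sees each cell's records consecutively.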
Running the Algorithm
Each iteration produces a new rank vector, and since I cannot read from two separate files, after every iteration I have to merge the link matrix with the new PageRank values and upload the combined file to HDFS again.
The following Python script does exactly that: it merges the files and logs the values from each iteration.
#!/usr/bin/env python3
# Merge script: appends the new rank vector to the input for the next iteration
# and logs it to result.txt under the current iteration number.
import sys

number = sys.stdin.readline().strip()  # iteration number, piped in by run.sh
f_src = open("tmp.txt", "r")           # reducer output fetched from HDFS
f_dst1 = open("result.txt", "a")       # human-readable log of every iteration
f_dst2 = open("B.txt", "a")            # next iteration's input (A was already copied in)
mat = "{:^30}\t"
f_dst1.write('\n' + number)
lines = f_src.readlines()
for line in lines:
    if line.count('\n') == len(line):  # skip blank lines
        continue
    line = line.strip()
    f_dst1.write(mat.format(line))
    f_dst2.write(line)
    f_dst2.write('\n')
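run.sh feeds this script the iteration number on stdin and expects tmp.txt (the part-00000 file fetched from HDFS) in the working directory. Run by hand it would look roughly like this, with 3 standing in for whichever iteration you are on and assuming you are in the directory that holds tmp.txt and B.txt:

echo 3 | ./slove.py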
Here is the driver script, run.sh:
#!/bin/bash
pos="/usr/local/hadoop"
max=10

for i in `seq 1 $max`
do
    # One PageRank iteration as a Hadoop streaming job.
    $pos/bin/hadoop jar $pos/hadoop-streaming-2.9.2.jar \
        -mapper $pos/mapper.py \
        -file $pos/mapper.py \
        -reducer $pos/reducer.py \
        -file $pos/reducer.py \
        -input B.txt \
        -output out \
        -cmdenv "row_a=4" \
        -cmdenv "col_a=4" \
        -cmdenv "row_b=4" \
        -cmdenv "col_b=1" \
        -cmdenv "alpha=0.8"

    # Rebuild the local input file: start again from the bare matrix A ...
    rm -r ~/Desktop/B.txt
    cp ~/Desktop/A.txt ~/Desktop/B.txt
    # ... then fetch the new rank vector and append it via the merge script.
    rm -r ~/Desktop/tmp.txt
    $pos/bin/hadoop fs -get out/part-00000 ~/Desktop/tmp.txt
    echo $i | ~/Desktop/slove.py

    # Replace the input on HDFS and clean up the output directory.
    $pos/bin/hadoop fs -rm B.txt
    $pos/bin/hadoop fs -rm -r -f out
    $pos/bin/hadoop fs -put ~/Desktop/B.txt B.txt
done
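One thing the loop assumes is that everything is already in place before the first iteration: A.txt holds the bare transition matrix, the local B.txt already has the initial rank vector appended, and a copy of B.txt exists on HDFS. A rough sketch of that one-time setup (my guess; the paths and the uniform 1/N start vector are assumptions, not taken from the original post):

# assumed one-time setup before the first run of run.sh
cp ~/Desktop/A.txt ~/Desktop/B.txt                            # start from the transition matrix
printf '0.25\n0.25\n0.25\n0.25\n' >> ~/Desktop/B.txt          # append the initial rank vector (1/N each)
/usr/local/hadoop/bin/hadoop fs -put ~/Desktop/B.txt B.txt    # upload the merged input to HDFS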
Here I simply ran it for 10 iterations.
My feeling is that MapReduce is really cumbersome for iterative computation like this, which is probably why open-source frameworks such as Twister, HaLoop, and Spark came about.