MinHash: Algorithm and Implementation

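References:

Slides on the principle: http://wenku.baidu.com/view/089e85c42cc58bd63186bdfc.html

Algorithm used for the implementation: http://fuliang.iteye.com/blog/1025638 (last section). Thanks to the original author. The principle of the algorithm and its mathematical proof are in the original post.

A simple Python implementation follows: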

def hash_func_demo1(x):
    return x % 5


def hash_func_demo2(x):
    return (2 * x + 1) % 5


# data: [C1, C2, ..., Cm], one list per column (set); each column
#       Ci = [a1, a2, ..., an] is a 0/1 vector over n rows, so the
#       characteristic matrix D is n x m.
# hash_funcs: [h1, h2, ..., hr]
# return: r x m signature matrix; rt[k][col] is the minimum of h_k over
#         the rows in which column col contains a 1.
def min_hash(data, hash_funcs):
    MAX = 100000000  # sentinel; must exceed every possible hash value
    M, N, R = len(data), len(data[0]), len(hash_funcs)

    # r x m matrix, every entry initialised to MAX
    rt = [[MAX] * M for _ in range(R)]

    for r in range(N):
        # hash the 1-based row number once per hash function
        hashes = [h(r + 1) for h in hash_funcs]
        for col in range(M):
            if data[col][r] == 0:
                continue
            for k in range(R):
                # index as [hash function][column] to match the r x m layout
                rt[k][col] = min(rt[k][col], hashes[k])

    return rt


if __name__ == "__main__":
    data = [[1, 0, 1, 1, 0],
            [0, 1, 1, 0, 1],
            ]

    hash_funcs = [hash_func_demo1, hash_func_demo2]
    rt = min_hash(data, hash_funcs)
    print(rt)
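
The signatures are normally used to estimate the Jaccard similarity of two columns: the fraction of hash functions on which two signatures agree approximates the fraction of rows the two sets share. The following is only a minimal sketch of that step, not part of the referenced post. The helper names (make_hash_funcs, signature, estimated_similarity, jaccard), the prime 10007 and the linear hash family (a*x + b) % prime are assumptions chosen for illustration; a linear family is only approximately min-wise independent, so the estimate is rough, especially for short columns.

```python
import random


def make_hash_funcs(count, prime=10007, seed=0):
    """Build `count` hash functions h(x) = (a*x + b) % prime (illustrative helper)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(count)]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]


def signature(column, hash_funcs):
    """Min-hash signature of one 0/1 column, hashing 1-based row numbers.

    Assumes the column contains at least one 1; an all-zero column
    has no rows to hash and min() would raise ValueError.
    """
    rows = [r + 1 for r, bit in enumerate(column) if bit]
    return [min(h(r) for r in rows) for h in hash_funcs]


def estimated_similarity(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def jaccard(col_a, col_b):
    """Exact Jaccard similarity of two 0/1 columns of equal length."""
    both = sum(1 for x, y in zip(col_a, col_b) if x and y)
    either = sum(1 for x, y in zip(col_a, col_b) if x or y)
    return both / either


if __name__ == "__main__":
    c1 = [1, 0, 1, 1, 0]
    c2 = [0, 1, 1, 0, 1]
    funcs = make_hash_funcs(200)
    s1, s2 = signature(c1, funcs), signature(c2, funcs)
    print("exact:", jaccard(c1, c2))
    print("estimate:", estimated_similarity(s1, s2))
```

More hash functions generally make the estimate tighter, at the cost of a longer signature per column.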



posted on 2012-02-27 22:04  foreveryl