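The PageRank Algorithm
Idea
On the web, a page that many other pages link to is widely recognized and trusted, so it ranks higher; that is, its PageRank is higher.
A page's importance is contributed by all the pages that link to it. The higher a page's PageRank, the more weight its "vote" carries, so it contributes more PageRank to each page it links to; conversely, a page with a low PageRank contributes little to the pages it links to.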
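As a small illustration of this contribution rule, the sketch below computes one page's score as \(\sum_{u \to v} PR(u) / \text{outdeg}(u)\), with no random jump involved. The three-page graph, the scores, and the helper name `contribution_score` are made up for this example.

```python
# Hypothetical toy data: page -> pages it links to.
links = {"A": ["C"], "B": ["A", "C"], "C": ["A"]}
# Assumed current scores (illustrative values only).
scores = {"A": 0.4, "B": 0.2, "C": 0.4}


def contribution_score(page, links, scores):
    """Sum the score each in-linking page passes on, split evenly over its out-links."""
    total = 0.0
    for source, targets in links.items():
        if page in targets:
            total += scores[source] / len(targets)
    return total


new_c = contribution_score("C", links, scores)
# A passes all of its 0.4 to C; B splits its 0.2 evenly between A and C.
print(round(new_c, 3))
```

Page B has no incoming links in this toy graph, so it receives nothing under this rule alone.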
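The web also contains dangling nodes (pages with no outgoing links) and rank sinks (groups of pages whose links never lead outside the group). Left unhandled, they drain PageRank from the rest of the network. To counter this, every node is given a fixed probability of jumping to an arbitrary node anywhere in the network, which keeps dangling nodes and rank sinks from absorbing all of the PageRank.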
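To make the leak concrete, here is a minimal sketch on a made-up three-page graph in which page 3 is dangling. Besides the random jump, it also treats the dangling page as if it linked to every page; that patch is a common convention and an assumption of this sketch, not something stated above. The names `H`, `S`, and `G` are chosen for the example.

```python
import numpy as np

# Hypothetical 3-page graph: 1 -> 2, 1 -> 3, 2 -> 3; page 3 has no out-links.
# Column u holds how page u splits its score, so the dangling page's column is all zeros.
H = np.array([
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
n = H.shape[0]
alpha = 0.85  # damping factor

# Without random jumps, the score that reaches page 3 is lost on the next step.
rank = np.full(n, 1.0 / n)
for _ in range(3):
    rank = H @ rank
leaked_total = rank.sum()

# Assumed remedy: treat a dangling page as linking to every page,
# then mix in the random jump with probability 1 - alpha.
S = H.copy()
S[:, H.sum(axis=0) == 0] = 1.0 / n
G = alpha * S + (1 - alpha) / n * np.ones((n, n))

rank = np.full(n, 1.0 / n)
for _ in range(3):
    rank = G @ rank
kept_total = rank.sum()

print(leaked_total, kept_total)
```

Every column of `G` sums to 1 here, so the total score is preserved from one step to the next.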
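Steps
- Build the graph of the network from the links between pages, then use each node's out-degree to form the hyperlink matrix \(H\), where \(H[i][j] = 1/\text{outdeg}(j)\) if page \(j\) links to page \(i\) and \(0\) otherwise;
- Choose a suitable damping factor \(\alpha\), usually \(\alpha = 0.85\), and turn the hyperlink matrix into the Google matrix \(G=\alpha H + \frac{1 - \alpha}{n} J\), where \(J\) is the \(n \times n\) all-ones matrix and \(n\) is the number of nodes in the graph;
- Choose an initial \(n\)-dimensional vector \(I\) whose entries sum to \(1\), that is, \(\sum_{j=1}^n I[j]=1\);
- Repeat the matrix-vector multiplication \(I \leftarrow GI\) until \(I\) barely changes between iterations, that is, stop once \(\max_{1 \le j \le n} |\Delta I[j]| < \epsilon\), where \(\Delta I[j]\) is the change in entry \(j\) from one iteration to the next and \(\epsilon\) is the largest change allowed.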
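The steps above can be sketched compactly as a single function. This is an illustrative version under stated assumptions: the function name `pagerank`, the uniform starting vector, and the `max_iter` safety cap are choices made for this sketch, and dangling pages are not patched, which mirrors the formula for \(G\) as written.

```python
import numpy as np


def pagerank(edges, n, alpha=0.85, eps=1e-5, max_iter=1000):
    """Power iteration on G = alpha * H + (1 - alpha) / n * J for 0-based edges (u, v)."""
    H = np.zeros((n, n))
    for u, v in edges:
        H[v, u] = 1.0  # column u marks the pages that u links to
    out_degree = H.sum(axis=0)
    linked = out_degree > 0
    H[:, linked] /= out_degree[linked]  # each column with links now sums to 1
    G = alpha * H + (1 - alpha) / n * np.ones((n, n))
    I = np.full(n, 1.0 / n)  # assumed start: uniform vector, entries sum to 1
    for _ in range(max_iter):
        new_I = G @ I
        if np.max(np.abs(new_I - I)) < eps:
            return new_I
        I = new_I
    return I


# Hypothetical 3-page graph: 0 -> 1, 1 -> 2, 2 -> 0, 0 -> 2.
ranks = pagerank([(0, 1), (1, 2), (2, 0), (0, 2)], n=3)
print(ranks, ranks.sum())
```

Because every page in this toy graph has at least one out-link, each column of `G` sums to 1 and the scores keep summing to 1 throughout the iteration.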
Code
import numpy


def readLineInt():
    # Read one line from stdin and parse it as a list of integers.
    return [int(token) for token in input().split()]


def isSmallEnough(S, T):
    # Converged when no entry changed by more than eps between iterations.
    eps = 1e-5
    return bool(numpy.max(numpy.abs(S - T)) <= eps)


def Print(nv, I):
    print("The importance of each page is:")
    for i in range(nv):
        print("%d: %.5f" % (i + 1, I[i][0]), end="\t\n"[i == nv - 1])
    res = []
    for i in range(nv):
        res.append({"pageNum": i + 1, "pageImportance": I[i][0]})
    # Sort ascending, then reverse, so the most important page comes first.
    res = sorted(res, key=lambda x: x["pageImportance"])
    res.reverse()
    print()
    print("The pages are ranked by importance as follows:")
    print("Rank\tNum")
    for i in range(nv):
        print("%d\t\t%d" % (i + 1, res[i]["pageNum"]))


def solve():
    # First input line: number of pages (nv) and number of links (ne).
    line = readLineInt()
    nv = line[0]
    ne = line[1]
    graph = [[0 for j in range(nv)] for i in range(nv)]
    # outDegree[u] counts the links leaving page u.
    outDegree = [0 for i in range(nv)]
    for i in range(ne):
        # Each line "u v" means page u links to page v (1-based in the input).
        line = readLineInt()
        u = line[0] - 1
        v = line[1] - 1
        graph[v][u] = 1
        outDegree[u] = outDegree[u] + 1
    # Vertices are numbered from 0 from here on.
    # S is the hyperlink matrix H: column i splits page i's score evenly over its out-links.
    S = numpy.zeros((nv, nv))
    for i in range(nv):
        for j in range(nv):
            if graph[j][i]:
                S[j][i] = 1.0 / outDegree[i]
    J = numpy.ones((nv, nv))
    alpha = 0.85
    G = S * alpha + J * ((1 - alpha) / nv)
    # Start with all of the score on page 1, so the entries of I sum to 1.
    I = numpy.zeros((nv, 1))
    I[0][0] = 1
    while True:
        I1 = I
        I = numpy.dot(G, I)
        if isSmallEnough(I1, I):
            break
    Print(nv, I)


if __name__ == '__main__':
    solve()
''' test case 1 '''
'''
6 12
1 2
1 3
1 4
2 3
2 4
2 6
3 4
4 3
4 6
5 6
6 2
6 4
'''
''' test case 2 '''
'''
8 16
1 2
1 3
3 2
2 4
4 2
3 5
4 5
4 6
5 6
5 7
7 5
5 8
6 8
8 6
7 8
8 7
'''
Results
Test1 input
6 12
1 2
1 3
1 4
2 3
2 4
2 6
3 4
4 3
4 6
5 6
6 2
6 4
Test1 output
The importance of each page is:
1: 0.02500 2: 0.13253 3: 0.22218 4: 0.35894 5: 0.02500 6: 0.23635
The pages are ranked by importance as follows:
Rank Num
1 4
2 6
3 3
4 2
5 5
6 1
Test2 input
8 16
1 2
1 3
3 2
2 4
4 2
3 5
4 5
4 6
5 6
5 7
7 5
5 8
6 8
8 6
7 8
8 7
Test2 output
The importance of each page is:
1: 0.01875 2: 0.05715 3: 0.02672 4: 0.06733 5: 0.12849 6: 0.20568 7: 0.18660 8: 0.30928
The pages are ranked by importance as follows:
Rank Num
1 8
2 6
3 7
4 5
5 4
6 2
7 3
8 1