The PageRank Algorithm

Idea

On the web, a page that is linked to by many other pages is widely recognized and trusted, so it ranks higher; that is, its PageRank is higher.

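The importance of each page is contributed by the pages that link to it. The larger a page's PageRank, the more weight its "voice" carries: when it links to another page, it contributes more to that page's PageRank. Conversely, a page with a low PageRank can contribute only a little to the pages it links to.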
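Written as a formula, the idea looks roughly as follows. This is a sketch of the undamped rule only, and the symbols \(PR\), \(B_v\) and \(L(u)\) are notation assumed here for illustration:

\[
PR(v) = \sum_{u \in B_v} \frac{PR(u)}{L(u)}
\]

Here \(B_v\) is the set of pages that link to \(v\) and \(L(u)\) is the number of out-links of \(u\). For example, a page with PageRank \(0.3\) and three out-links passes \(0.1\) to each page it links to, while a page with PageRank \(0.03\) and three out-links passes on only \(0.01\).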

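The web also contains dangling nodes (pages with no out-links) and rank sinks (groups of pages that link only among themselves). Left unhandled, they drain the PageRank of the whole network. To prevent this, every node is given some probability of jumping to an arbitrary node in the network, so that dangling nodes and rank sinks can no longer absorb the network's PageRank.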
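The leak is easy to see numerically. The sketch below is an illustration, not part of the original program: it assumes a hypothetical three-page graph in which the last page has no out-links, and a link matrix whose column j holds the shares that page j passes on. Following links alone loses all the rank within three steps, while adding random jumps keeps the total at 1. Treating a dangling page as one that always jumps to a uniformly random page is a common convention assumed here; the text above does not spell out this detail.

```python
import numpy as np

# Hypothetical 3-page graph (0-based): 0 -> 1, 0 -> 2, 1 -> 2; page 2 is dangling.
n = 3
H = np.zeros((n, n))
H[1, 0] = H[2, 0] = 0.5   # page 0 splits its rank between pages 1 and 2
H[2, 1] = 1.0             # page 1 gives all of its rank to page 2
# Column 2 stays all zero because page 2 links to nobody.

x = np.full(n, 1.0 / n)   # start with the rank spread evenly

# Following links only: whatever reaches page 2 disappears.
leaky = x.copy()
for _ in range(3):
    leaky = H @ leaky
leaky_total = leaky.sum()

# Assumed fix: a dangling page always jumps to a uniformly random page,
# and every page jumps with probability 1 - alpha.
alpha = 0.85
S = H.copy()
S[:, S.sum(axis=0) == 0] = 1.0 / n
G = alpha * S + (1 - alpha) / n * np.ones((n, n))

kept = x.copy()
for _ in range(3):
    kept = G @ kept
kept_total = kept.sum()

print(leaky_total, kept_total)
```

Every column of G sums to 1 in this sketch, which is why the total rank is preserved from one iteration to the next.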

Steps

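  1. From the link relations among the pages, build the corresponding directed graph, then use the out-degree of each node to build the hyperlink matrix \(H\): \(H[i][j] = 1/d_j\) if page \(j\) links to page \(i\), where \(d_j\) is the out-degree of page \(j\), and \(H[i][j] = 0\) otherwise;
  2. Choose a suitable damping factor \(\alpha\), usually \(\alpha = 0.85\), and turn the hyperlink matrix into the Google matrix \(G=\alpha H + \frac{1 - \alpha}{n} J\), where \(J\) is the \(n \times n\) all-ones matrix and \(n\) is the number of nodes in the graph;
  3. Choose an \(n\)-dimensional vector \(I\) whose entries sum to \(1\), that is, \(\sum_{j=1}^n I[j]=1\);
  4. Repeatedly apply the matrix-vector product \(I=GI\) until \(I\) barely changes between iterations, stopping when \(\max_{1 \le j \le n} |\Delta I[j]| < \epsilon\), where \(\epsilon\) is the largest change allowed between successive iterations.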
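The four steps can be sketched compactly with NumPy. This is a minimal illustration under stated assumptions, not the post's full program: the function name `pagerank`, the parameters `eps` and `max_iter`, the uniform starting vector, and the 1-based `(u, v)` edge list are all choices made here. Pages without out-links keep an all-zero column, exactly as the formula for \(G\) in step 2 implies.

```python
import numpy as np

def pagerank(n, edges, alpha=0.85, eps=1e-10, max_iter=1000):
    """Power iteration on the Google matrix; edges are 1-based (u, v) pairs meaning u -> v."""
    # Step 1: hyperlink matrix H; column u spreads page u's rank over its out-links.
    H = np.zeros((n, n))
    for u, v in edges:
        H[v - 1, u - 1] = 1.0
    out_degree = H.sum(axis=0)
    linked = out_degree > 0            # columns of pages that have out-links
    H[:, linked] /= out_degree[linked]

    # Step 2: Google matrix G = alpha * H + (1 - alpha) / n * J.
    G = alpha * H + (1 - alpha) / n * np.ones((n, n))

    # Step 3: initial vector whose entries sum to 1 (uniform here, by assumption).
    I = np.full(n, 1.0 / n)

    # Step 4: iterate I = G I until the largest change falls below eps.
    for _ in range(max_iter):
        I_next = G @ I
        if np.max(np.abs(I_next - I)) < eps:
            return I_next
        I = I_next
    return I

# The 6-page graph used as the first test case in this post.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 6),
         (3, 4), (4, 3), (4, 6), (5, 6), (6, 2), (6, 4)]
scores = pagerank(6, edges)
print(np.round(scores, 5))
```

Because every page in this graph has at least one out-link, the scores sum to 1, and they should agree with the output reported for the first test case up to rounding and the looser stopping tolerance used there.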

Code

import numpy

def readLineInt():
    # Read one line from stdin and parse it as a list of integers.
    return [int(x) for x in input().split()]

def isSmallEnough(S, T):
    # Converged when no entry changed by more than eps between iterations.
    eps = 1e-5
    return bool(numpy.max(numpy.abs(S - T)) <= eps)

def Print(nv, I):
    print("The importance of each page is:")
    for i in range(nv):
        print("%d: %.5f" % (i + 1, I[i][0]), end="\t\n"[i == nv - 1])

    res = []
    for i in range(nv):
        res.append({"pageNum":i+1, "pageImportance":I[i][0]})
    res = sorted(res, key=lambda x: x["pageImportance"])
    res.reverse()

    print()
    print("The pages are ranked by importance as follows:")
    print("Rank\tNum")
    for i in range(nv):
        print("%d\t\t%d" %(i+1, res[i]["pageNum"]))

def solve():
    line = readLineInt()
    nv = line[0]
    ne = line[1]

    graph = [[0 for j in range(nv)] for i in range(nv)]
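    # outDegree[u] counts the links leaving page u.
    outDegree = [0 for i in range(nv)]

    for i in range(ne):
        line = readLineInt()
        # Input vertices are 1-based; internally vertices are numbered from 0.
        u = line[0] - 1
        v = line[1] - 1

        # graph[v][u] == 1 means there is a link u -> v.
        graph[v][u] = 1

        outDegree[u] = outDegree[u] + 1

    # S is the hyperlink matrix H from the steps:
    # column i spreads page i's rank evenly over its out-links.
    S = numpy.zeros((nv, nv))

    for i in range(nv):
        for j in range(nv):
            if graph[j][i]:
                S[j][i] = 1.0 / outDegree[i]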

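    # J is the n x n all-ones matrix; G is the Google matrix.
    J = numpy.ones((nv, nv))
    alpha = 0.85

    G = S * alpha + J * ((1 - alpha) / nv)

    # Start with all of the rank on page 1, so the entries of I sum to 1.
    I = numpy.zeros((nv, 1))
    I[0][0] = 1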

    while True:
        I1 = I
        I = numpy.dot(G, I)
        if(isSmallEnough(I1, I)):
            break

    Print(nv, I)


if __name__=='__main__':
    solve()

''' test case 1 '''
'''
6 12
1 2
1 3
1 4
2 3
2 4
2 6
3 4
4 3
4 6
5 6
6 2
6 4
'''

''' test case 2 '''
'''
8 16
1 2
1 3
3 2
2 4
4 2
3 5
4 5
4 6
5 6
5 7
7 5
5 8
6 8
8 6
7 8
8 7
'''

Results

Test1 input

6 12
1 2
1 3
1 4
2 3
2 4
2 6
3 4
4 3
4 6
5 6
6 2
6 4

Test1 output

The importance of each page is:
1: 0.02500      2: 0.13253      3: 0.22218      4: 0.35894      5: 0.02500      6: 0.23635

The pages are ranked by importance as follows:
Rank    Num
1               4
2               6
3               3
4               2
5               5
6               1
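As a quick sanity check on these numbers, two entries can be recomputed by hand from the update \(I = GI\) with \(\alpha = 0.85\) and \(n = 6\). This is a worked illustration using the printed scores, where \(I[k]\) denotes the score of page \(k\). Page 1 has no in-links, so it receives only the random-jump share; page 2 is linked from page 1 (out-degree 3) and page 6 (out-degree 2):

\[
I[1] = \frac{1-\alpha}{n} = \frac{0.15}{6} = 0.025
\]

\[
I[2] = 0.025 + 0.85\left(\frac{0.02500}{3} + \frac{0.23635}{2}\right) \approx 0.13253
\]

Both values match the printed output.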

Test2 input

8 16
1 2
1 3
3 2
2 4
4 2
3 5
4 5
4 6
5 6
5 7
7 5
5 8
6 8
8 6
7 8
8 7

Test2 output

The importance of each page is:
1: 0.01875      2: 0.05715      3: 0.02672      4: 0.06733      5: 0.12849      6: 0.20568      7: 0.18660      8: 0.30928

The pages are ranked by importance as follows:
Rank    Num
1               8
2               6
3               7
4               5
5               4
6               2
7               3
8               1
Posted on 2021-10-01 17:16 by 牟翔宇