[IR] Link Analysis
网络信息的特点在于:
Query: "IBM" --> "Computer" --> documentIDs.
In degree i 正比于 1/iα , 例如: α = 2.1
即:i越大,量越少。
Query processing
§ First retrieve all pages meeting the text query (say venture capital).
§ Order these by their link popularity (either variant on the previous slide).
§ More nuanced – use link counts as a measure of static goodness (Lecture 7), combined with text match score.
link多,但不一定意味着都是重要的推荐(link).
可以让PageRank Scoring通过"Flow" Model来获得,即访问量。
-
方法一:
解方程得:
y+a+m = 1
y = 2/5, a = 2/5, m = 1/5
Gaussian elimina*on method works for small examples, but we need a better method for large graphs.
-
方法二:
利用Markov chains。xi = x0 Pi
初始值,可以假设是uniform distribution,最后也将达到稳定状态。
若干可能的问题:
- Spider traps
Sol: Random teleports - 随机瞬间移动,防止掉入死胡同
可见,1变为了7/11,但貌似并不是效果特别满意。
- Dead Ends
§ Follow random teleport links with probability 1.0 from dead-ends
§ Adjust matrix accordingly. How?
Sol:
0.8 * [0,0,0] 这里是触发条件。
发现,0.2*[1/3, 1/3, 1/3]这么下去,趋势必然为0。
那就,0.2这个随机处理去掉好了。[1/15,1/15,1/15] --> [1/3, 1/3, 1/3]