[IR] Link Analysis

A characteristic of web information is:

Query: "IBM" --> "Computer" --> documentIDs.

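The number of pages with in-degree i is proportional to 1/i^α, for example α = 2.1.

That is, the larger i is, the fewer pages have that in-degree.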
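To get a feel for how quickly the power law above falls off, here is a minimal sketch. It assumes the distribution is truncated at a hypothetical maximum in-degree (`max_degree`, not from the notes) so that it can be normalized; α = 2.1 is the value quoted above.

```python
ALPHA = 2.1  # exponent quoted in the notes


def in_degree_distribution(max_degree, alpha=ALPHA):
    """Return P(in-degree = i) for i = 1..max_degree, assuming P(i) ∝ 1 / i**alpha.

    max_degree is a hypothetical cut-off used only to normalize the weights.
    """
    weights = [1.0 / i ** alpha for i in range(1, max_degree + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = in_degree_distribution(1000)
for i in (1, 2, 10, 100):
    # The fraction shrinks rapidly as the in-degree grows.
    print(f"in-degree {i:>3}: fraction ~ {probs[i - 1]:.6f}")
```

Doubling the in-degree divides the fraction of pages by 2^2.1, roughly 4.3, regardless of the cut-off.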

Query processing

§  First retrieve all pages meeting the text query (say venture capital).
§  Order these by their link popularity (either variant on the previous slide).
§  More nuanced – use link counts as a measure of static goodness (Lecture 7), combined with text match score.
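The three steps above can be sketched as follows. Everything here is illustrative: the page records, the field names (`text`, `in_links`), the text-match formula, and the 0.5 weight are assumptions made for the example, not part of the lecture.

```python
def matching_pages(pages, query_terms):
    """Step 1: keep only pages whose text contains every query term."""
    terms = [t.lower() for t in query_terms]
    return [p for p in pages if all(t in p["text"].lower().split() for t in terms)]


def order_by_link_popularity(pages):
    """Step 2: order the matches by in-link count, most linked first."""
    return sorted(pages, key=lambda p: p["in_links"], reverse=True)


def order_by_combined_score(pages, query_terms, weight=0.5):
    """Step 3 (more nuanced): mix a text-match score with static goodness.

    Both the scoring formula and `weight` are hypothetical choices.
    """
    if not pages:
        return []
    terms = [t.lower() for t in query_terms]
    max_links = max(p["in_links"] for p in pages) or 1

    def score(p):
        words = p["text"].lower().split()
        text_match = sum(words.count(t) for t in terms) / len(words)
        static_goodness = p["in_links"] / max_links
        return weight * text_match + (1 - weight) * static_goodness

    return sorted(pages, key=score, reverse=True)


# Hypothetical toy collection.
PAGES = [
    {"id": "A", "text": "venture capital firms fund startups", "in_links": 12},
    {"id": "B", "text": "capital cities of europe", "in_links": 90},
    {"id": "C", "text": "a guide to venture capital", "in_links": 40},
]

hits = matching_pages(PAGES, ["venture", "capital"])
by_links = [p["id"] for p in order_by_link_popularity(hits)]
by_combined = [p["id"] for p in order_by_combined_score(hits, ["venture", "capital"])]
print(by_links, by_combined)
```

Page B has the most in-links but never reaches the ranking stage, because it fails the text query first.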

Having many links does not necessarily mean that all of them are important recommendations.

PageRank scores can be derived from the "flow" model, in which a page's score corresponds to the amount of traffic flowing into it.

Method 1:

Solving the equations gives:

y+a+m = 1
y = 2/5, a = 2/5, m = 1/5
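The flow equations themselves are not reproduced in these notes. As a sketch, assume the commonly used three-page example in which y links to y and a, a links to y and m, and m links to a; this assumed graph does reproduce the solution above. The small Gauss-Jordan solver below is written only for this illustration.

```python
from fractions import Fraction as F

# Assumed link structure (not shown in the notes):
#   y -> y, a    a -> y, m    m -> a
# Column j holds how page j splits its score over its out-links (order: y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


n = len(M)
# Flow equations (M - I) r = 0; the last one is redundant,
# so it is replaced by the normalization y + a + m = 1.
A = [[M[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
A[-1] = [F(1)] * n
b = [F(0)] * (n - 1) + [F(1)]

y, a, m = solve(A, b)
print(y, a, m)  # 2/5 2/5 1/5
```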

Gaussian elimination works for small examples, but we need a better method for large graphs.

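Method 2:

Use Markov chains: x_i = x₀Pⁱ, where x₀ is the initial distribution and P is the transition matrix.

The initial value can be assumed to be a uniform distribution; the chain eventually reaches a steady state.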
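A minimal power-iteration sketch of x_i = x₀Pⁱ follows. It assumes a three-page graph (y links to y and a, a links to y and m, m links to a), which is not shown in these notes, and uses a row-stochastic P so that x is a row vector; the tolerance and iteration cap are arbitrary choices.

```python
# Assumed graph: y -> y, a ; a -> y, m ; m -> a.
# P[i][j] is the probability of moving from page i to page j (order: y, a, m).
P = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]


def power_iteration(P, tol=1e-12, max_iters=1000):
    """Repeat x <- x P from the uniform distribution until x stops changing."""
    n = len(P)
    x = [1.0 / n] * n  # uniform initial distribution x_0
    for _ in range(max_iters):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(u - v) for u, v in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


steady = power_iteration(P)
print([round(v, 6) for v in steady])  # [0.4, 0.4, 0.2]
```

Each step only needs a matrix-vector product, which is why this scales to large graphs where elimination does not.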

Some possible problems:

Spider traps

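Sol: Random teleports, i.e. jump to a random page at any moment, so that the walker does not stay stuck.

As shown, the score drops from 1 to 7/11, but the result still does not look particularly satisfying.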
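The 1 versus 7/11 comparison can be reproduced with a short sketch. It assumes a three-page graph in which m links only to itself (y links to y and a, a links to y and m), and a follow-link probability β = 0.8, matching the 0.8 and 0.2 used in the dead-end discussion; the graph itself is not shown in these notes.

```python
# Assumed graph with a spider trap: y -> y, a ; a -> y, m ; m -> m.
# TRAP_P[i][j] is the probability of following a link from page i to page j.
TRAP_P = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]


def pagerank(P, beta, iterations=200):
    """Iterate x <- beta * x P + (1 - beta) / n from the uniform distribution.

    With probability beta the walker follows a link; otherwise it teleports
    to a page chosen uniformly at random. Assumes P has no dead ends.
    """
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = [beta * sum(x[i] * P[i][j] for i in range(n)) + (1.0 - beta) / n
             for j in range(n)]
    return x


no_teleport = pagerank(TRAP_P, beta=1.0)    # m absorbs everything
with_teleport = pagerank(TRAP_P, beta=0.8)  # m still dominates, but less so

print([round(v, 4) for v in no_teleport])    # [0.0, 0.0, 1.0]
print([round(v, 4) for v in with_teleport])  # [0.2121, 0.1515, 0.6364]
```

With teleports the scores are 7/33, 5/33 and 21/33 = 7/11, so the trap page still holds most of the score.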

Dead Ends


§  Follow random teleport links with probability 1.0 from dead-ends
§ Adjust matrix accordingly. How?

Sol:

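0.8 * [0, 0, 0]: an all-zero link vector is the trigger condition, since it means the page is a dead end.

Note that if only 0.2 * [1/3, 1/3, 1/3] is left for such a page, the scores inevitably tend to 0 as the iteration continues.

So simply drop the 0.2 factor for dead ends and teleport with probability 1: [1/15, 1/15, 1/15] --> [1/3, 1/3, 1/3]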
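The effect of this adjustment can be checked with a small sketch. It assumes a three-page graph in which m has no out-links (y links to y and a, a links to y and m), with the 0.8 / 0.2 split used above; the function names and the step count are made up for the illustration.

```python
BETA = 0.8  # probability of following a link; 1 - BETA is the teleport probability

# Assumed graph with a dead end: y -> y, a ; a -> y, m ; m has no out-links.
DEAD_END_P = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0],  # all zeros: the trigger condition for a dead end
]


def transition_matrix(P, beta=BETA, fix_dead_ends=True):
    """Blend link-following with teleports, row by row."""
    n = len(P)
    G = []
    for row in P:
        if fix_dead_ends and sum(row) == 0:
            # Dead end: teleport with probability 1.0 -> [1/n, ..., 1/n]
            G.append([1.0 / n] * n)
        else:
            # Unfixed dead ends end up as (1 - beta) / n = 1/15 per entry here.
            G.append([beta * p + (1.0 - beta) / n for p in row])
    return G


def iterate(G, steps=500):
    """Apply x <- x G repeatedly, starting from the uniform distribution."""
    n = len(G)
    x = [1.0 / n] * n
    for _ in range(steps):
        x = [sum(x[i] * G[i][j] for i in range(n)) for j in range(n)]
    return x


leaky = iterate(transition_matrix(DEAD_END_P, fix_dead_ends=False))
fixed = iterate(transition_matrix(DEAD_END_P, fix_dead_ends=True))

print(sum(leaky))  # total score leaks away toward 0
print(sum(fixed))  # total score stays at (approximately) 1
```

Without the fix, the dead-end row sums to 0.2 instead of 1, so 80% of whatever score reaches m is lost at every step.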

posted @ 2016-11-08 09:05 郝壹贰叁