[Paper Reading]--Fast top-k search in knowledge graphs
<<Fast top-k search in knowledge graphs>>
Publication: ICDE 2016
Authors: Shengqi Yang∗, Fangqiu Han∗, Yinghui Wu†, Xifeng Yan∗
Affiliation: UCSB, WSU
1. Short description:
The paper presents a fast top-k graph search framework. It has two components:
(1) a fast top-k algorithm for star queries, and (2) a general graph query algorithm that assembles the star-query results.
2. Focus: graph query, subgraph matching, top-k
3. Novelty: a general query is decomposed into star queries, whose results are combined by a fast top-k join?
4. Motivation:
Traditional approaches based on the threshold algorithm and on belief propagation are not fast enough for top-k queries on large knowledge graphs (but how exactly are they not fast enough?).
So the paper proposes a new framework for star queries and general queries.
In general, subgraph matching finds the subgraphs specified by a one-to-one matching function ϕ.
Recently proposed probabilistic approaches rank subgraph matches by aggregating the node and edge matching scores.
The matching score between the query graph and the data graph is defined by:
where $f_i(v,\phi(v))$ is the matching score under the i-th similarity measure.
In general, $F_V(v,\phi(v))$ and $F_E(e,\phi(e))$ are constrained to be above a certain threshold, so that an answer only contains good matches.
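Written out under the definitions above, the score would presumably take an additive form. The names $V_Q$ and $E_Q$ (query nodes and edges) and the weights $w_i$ are my notation, not copied from the paper, and $F_E$ is assumed to be built from edge similarity measures in the same way as $F_V$:

$$F(\phi(Q)) = \sum_{v \in V_Q} F_V(v,\phi(v)) + \sum_{e \in E_Q} F_E(e,\phi(e)), \qquad F_V(v,\phi(v)) = \sum_i w_i \, f_i(v,\phi(v))$$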
5. Threshold Algorithms:
The paper first reviews the popular threshold algorithm “graphTA”.
Then the authors point out the limitations of graphTA.
They are mainly: top answers cannot be identified from high node and edge matching scores alone, a tight upper bound is difficult to estimate, and each node expansion is expensive because it involves a subgraph isomorphism search.
But I don't quite understand the first limitation.
The paper says: “(1) Matches for nodes and edges with high matching score alone do not necessarily indicate top answers. For example, the top-1 answer is joined from a set of node and edge matches with quite low matching scores, if ranked independently (Figure 3)”
My understanding is that, according to this description, the matching score of an individual node is not reliable on its own.
Hence the combination of scores for subgraph matching is not reliable either.
So a match assembled from the higher-scoring pieces must be an inexact match, while a match with lower scores is probably a strict exact match, or an inexact match with fewer edges?
I don't know whether this is right or not.
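One other possible reading of the quoted limitation is purely combinatorial: the best individual node matches may simply not be connected to each other, so the best complete answer has to be built from pieces that rank lower on their own. The toy example below illustrates that reading. The candidate names, the scores, and the helper `rank_answers` are all made up for illustration and do not come from the paper.

```python
from itertools import product

def rank_answers(cands_u, cands_v, edge_scores):
    """Brute-force ranking of answers for a one-edge query (u)-(v).

    cands_u, cands_v: lists of (data_node, node_score).
    edge_scores: dict mapping (data_node_u, data_node_v) to an edge score;
                 a pair that is missing from the dict is not connected.
    """
    answers = []
    for (u, su), (v, sv) in product(cands_u, cands_v):
        if (u, v) in edge_scores:  # only connected pairs form an answer
            answers.append((su + sv + edge_scores[(u, v)], u, v))
    return sorted(answers, reverse=True)

# a1 and b1 are the best node matches, but they are not connected.
cands_u = [("a1", 9), ("a2", 4)]
cands_v = [("b1", 8), ("b2", 5)]
edge_scores = {("a2", "b2"): 7, ("a1", "b2"): 1}

print(rank_answers(cands_u, cands_v, edge_scores))
```

Here the top answer is built from `a2` and `b2`, which each rank second in their own candidate list.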
6. Proposed algorithm
The authors propose a new framework to find the top-k matches.
They first decompose the whole query graph into star queries, then find the matches of each star query, and then aggregate the star-query results into complete matches of the query graph, that is,
(1) query decomposition
(2) star querying
(3) top-k rank join
Star-query algorithm:
(1) stark algorithm, for exact matches
Example of top-3 star querying
The star-querying complexity is reduced to $O(m|V^*| + k\log k)$
How?
There is no need to find the top-k w.r.t. the score function F directly; it is only necessary to find k+s-1 numbers in the union of the lists.
$s$ could be $|V^*|$,
but I don’t fully understand the proof.
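To get a feeling for top-k star matching, here is a minimal sketch for a single pivot match. It is not the paper's stark algorithm and does not reach its complexity bound. It is the generic best-first search over sorted lists with a heap, and it ignores the one-to-one constraint between leaves. The function name `top_k_star` and the input layout are my own assumptions.

```python
import heapq

def top_k_star(pivot_score, leaf_lists, k):
    """Top-k leaf combinations for one pivot match.

    leaf_lists: one list per query leaf, each holding (score, data_node)
                pairs sorted by score in descending order.
    Returns up to k pairs (total_score, [chosen data_node per leaf]).
    """
    if any(not lst for lst in leaf_lists):
        return []  # some leaf has no candidate, so the star has no match
    m = len(leaf_lists)

    def total(idx):
        return pivot_score + sum(leaf_lists[i][j][0] for i, j in enumerate(idx))

    start = (0,) * m  # the best candidate of every leaf
    heap = [(-total(start), start)]
    seen = {start}
    out = []
    while heap and len(out) < k:
        neg, idx = heapq.heappop(heap)
        out.append((-neg, [leaf_lists[i][j][1] for i, j in enumerate(idx)]))
        # Successors: move one leaf to its next-best candidate.
        for i in range(m):
            if idx[i] + 1 < len(leaf_lists[i]):
                nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-total(nxt), nxt))
    return out

leaves = [[(9, "a"), (5, "b")], [(8, "c"), (7, "d")]]
print(top_k_star(10, leaves, 3))
```

Only the combinations popped from the heap and their direct successors are ever scored, which is the intuition behind not enumerating every combination.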
(2) stard algorithm, for inexact matches (d-bounded matching)
It proposes a message passing algorithm that iteratively propagates and aggregates messages carrying scores and path lengths, to obtain the top-k inexact matches.
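The message passing idea can be sketched roughly as follows: every candidate leaf match sends a message that travels at most d hops, and each node remembers the shortest hop count at which it heard from each leaf. This is only my simplified illustration, not the paper's stard algorithm. The function names and the `decay` penalty for longer paths are assumptions made for the example.

```python
def propagate(adj, leaf_scores, d):
    """Run d rounds of message passing from the candidate leaf matches.

    adj: dict node -> list of neighbours (undirected graph, every node is a key).
    leaf_scores: dict leaf_node -> node matching score.
    Returns dict node -> {leaf: hops}, with hops <= d the shortest hop count.
    """
    reach = {v: {} for v in adj}
    for leaf in leaf_scores:
        reach[leaf][leaf] = 0
    frontier = {leaf: [leaf] for leaf in leaf_scores}  # messages to forward
    for hop in range(1, d + 1):
        nxt = {}
        for u, leaves in frontier.items():
            for w in adj[u]:
                for leaf in leaves:
                    if leaf not in reach[w]:  # first arrival is the shortest
                        reach[w][leaf] = hop
                        nxt.setdefault(w, []).append(leaf)
        frontier = nxt
    return reach

def best_leaves(reach, leaf_scores, pivot, k, decay=0.5):
    """Top-k leaves for a pivot; each extra hop multiplies the score by decay."""
    scored = [(leaf, leaf_scores[leaf] * decay ** (hops - 1))
              for leaf, hops in reach[pivot].items() if hops > 0]
    return sorted(scored, key=lambda t: -t[1])[:k]

# p - x - l1 and p - l2: l1 scores higher but is two hops away from p.
adj = {"p": ["x", "l2"], "x": ["p", "l1"], "l1": ["x"], "l2": ["p"]}
leaf_scores = {"l1": 8.0, "l2": 5.0}
print(best_leaves(propagate(adj, leaf_scores, 2), leaf_scores, "p", 2))
```

With d = 1 the leaf `l1` is not reachable from `p` at all, so only `l2` is returned.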
(3) Top-k star join:
The star querying algorithms serve as a foundation for answering general graph queries.
There are two challenges to overcome:
(a) Query decomposition: how to decompose, and what is a good decomposition strategy?
(b) Top-k rank join: how to assemble the star matches? How to derive an upper bound for the unseen matches from the star-query matches fetched so far, so that the join is of good quality?
First, about starjoin:
starjoin works in a similar way to the hash rank join strategy (HRJN [21]).
Is there a difference between HRJN and starjoin?
How are they different?
The scores of the joint nodes shared by several stars are counted multiple times under the matching score function F? Hence, the paper introduces the rank join with the $\alpha$-scheme.
$\alpha$-scheme:
It uses a parameter $\alpha$ to eliminate the double counting of the common nodes when estimating valid upper bounds.
Also, the choice of $\alpha$ affects the number of matches that have to be fetched for assembling.
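To make the rank join step more concrete, below is a minimal two-input sketch in the spirit of HRJN, not the paper's starjoin. It pulls from two score-sorted lists in turn, joins on a shared key through hash tables, and emits a result once its score reaches the upper bound of everything still unseen. The input scores are assumed to be already adjusted so that a shared joint node is counted only once (for example by giving one star an $\alpha$ share of that node's score and the other star the rest; that split is my assumption about how the $\alpha$-scheme is used). The tuple layout and the name `rank_join` are hypothetical.

```python
import heapq
from math import inf

def rank_join(left, right, k):
    """Top-k join of two lists of (score, join_key, match), sorted descending."""
    if not left or not right:
        return []
    lists, pos = (left, right), [0, 0]
    seen = ({}, {})          # per side: join_key -> [(score, match)]
    cand, out, tie = [], [], 0
    while len(out) < k:
        # Alternate between the inputs while both still have tuples.
        if pos[0] < len(left) and (pos[0] <= pos[1] or pos[1] >= len(right)):
            side = 0
        elif pos[1] < len(right):
            side = 1
        else:
            side = None      # both inputs exhausted
        if side is not None:
            score, key, match = lists[side][pos[side]]
            pos[side] += 1
            seen[side].setdefault(key, []).append((score, match))
            for s2, m2 in seen[1 - side].get(key, []):
                pair = (match, m2) if side == 0 else (m2, match)
                heapq.heappush(cand, (-(score + s2), tie, pair))
                tie += 1
        # Upper bound on any result that uses a not-yet-seen tuple.
        bound = -inf
        for a in (0, 1):
            if pos[a] < len(lists[a]):
                last = lists[a][max(pos[a] - 1, 0)][0]
                bound = max(bound, last + lists[1 - a][0][0])
        while cand and len(out) < k and -cand[0][0] >= bound:
            neg, _, pair = heapq.heappop(cand)
            out.append((-neg, pair))
        if side is None:
            break
    return out

left = [(9, "a", "L1"), (7, "b", "L2"), (2, "c", "L3")]
right = [(8, "b", "R1"), (5, "a", "R2"), (1, "c", "R3")]
print(rank_join(left, right, 2))
```

In this run the two answers are emitted before the last tuple of `right` is ever read, which is the point of the bound: a tighter bound means fewer star matches have to be fetched.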
Second, about decomposing the query graph into star queries:
Important observations:
- A reasonable decomposition yields as few stars as possible, which intuitively reduces the number of joins.
- To make the upper bound estimation in Eq. 3 (Section VI-A) tighter, we shall make $F(\phi_i^{n_i})$ as small as possible. Therefore, a large score decrement for the matches in $L_i$ will likely lead to a small search depth.
- We observe that many real-world star queries share a similar distribution of match scores, with a long-tail effect.
Hence, the objective of the query decomposition is to derive a minimum number of stars with similar features, such that the score decrement of the matches for each star $Q^*$ can be maximized.
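As a rough illustration of the first observation only (few stars), here is a simple greedy decomposition that repeatedly picks the node covering the most uncovered query edges as the next pivot. This is a swapped-in heuristic, not the paper's decomposition objective, which also takes the score decrement into account. The function name is hypothetical and the query graph is assumed to have no self-loops.

```python
def greedy_star_decomposition(edges):
    """Cover all query edges with stars, choosing pivots by remaining degree.

    edges: iterable of (u, v) pairs of an undirected query graph.
    Returns a list of (pivot, sorted_leaves).
    """
    remaining = {frozenset(e) for e in edges}
    stars = []
    while remaining:
        degree = {}
        for e in remaining:
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        # Highest remaining degree; ties go to the smallest node name.
        pivot = max(sorted(degree), key=degree.get)
        leaves = sorted(next(iter(e - {pivot})) for e in remaining if pivot in e)
        stars.append((pivot, leaves))
        remaining = {e for e in remaining if pivot not in e}
    return stars

query_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")]
print(greedy_star_decomposition(query_edges))
```

In this example node `d` appears in both stars, so it is a joint node whose score would be counted twice without a correction such as the $\alpha$-scheme.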
The cons:
In my opinion only:
The paper builds efficient general queries on top of star queries, but the quality of the decomposition into star queries is not convincingly demonstrated. How a general query is decomposed into star queries would greatly affect the star queries.
Reference:
Yang, Shengqi, et al. "Fast top-k search in knowledge graphs." Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE, 2016.