[scalability] Find all documents that contain a list of words

Given a list of millions of documents, how would you find all documents that contain a list of words? The words do not need to appear in any particular order, but they must be complete words, that is, "book" does not match "bookkeeper".

Solution:

Firstly, we need a hashtable to store a mapping from a word to a list of documents. But considering that there are millions of documents, maybe we cannot store the whole hashtable on one machine. We need to split the table into several parts. Like below, we use another higher level lookup table to tell us which machine stroes which part of the whole table.

posted @ 2014-06-25 04:05  门对夕阳  阅读(334)  评论(0编辑  收藏  举报