How Lucene's MoreLikeThis works
MoreLikeThis can be used to implement "similar article" queries; its implementation is analyzed below.
MoreLikeThis lives under the Queries directory of Lucene's contrib modules. The original motivation for implementing the class is reproduced here:
Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the docFreq() of every term in the document is
probably too slow.
You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
or at all. Since you're trying to maximize a tfidf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.
It all depends on what you're trying to do. If you're trying to eek out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you're trying to
provide a "more like this" button on a search results page that does a decent job and has
good performance, such techniques might be useful.
An efficient, effective "more-like-this" query generator would be a great contribution, if
anyone's interested. I'd imagine that it would take a Reader or a String (the document's
text), an Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.
1) Call this method to generate the query
public Query like(Reader r, String fieldName) throws IOException {
return createQuery(retrieveTerms(r, fieldName));
}
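A minimal caller-side sketch of how this entry point is typically used may help (imports omitted, matching the excerpt style of this article; it assumes the Lucene 4.x-era API whose signatures appear here, and the variables reader, searcher, articleText as well as the field name "content" are hypothetical):
// minimal usage sketch, not part of MoreLikeThis itself
MoreLikeThis mlt = new MoreLikeThis(reader);               // reader: an already-open IndexReader (assumed)
mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_40));  // analyzer used to re-tokenize the source text
mlt.setFieldNames(new String[]{"content"});                // hypothetical field the query is built against
mlt.setMinTermFreq(2);                                     // thresholds discussed in the Tips below
mlt.setMinDocFreq(5);
Query query = mlt.like(new StringReader(articleText), "content"); // the entry point shown in step 1)
TopDocs similar = searcher.search(query, 10);              // searcher: an IndexSearcher over the same index (assumed)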
2) Retrieve the document's terms and compute their TF, IDF, and score
public PriorityQueue<Object[]> retrieveTerms(Reader r, String fieldName) throws IOException {
Map<String, Int> words = new HashMap<String, Int>();
addTermFrequencies(r, words, fieldName);
return createQueue(words);
}
3) Count each term's frequency in the document and return the counts in termFreqMap
private void addTermFrequencies(Reader r, Map<String, Int> termFreqMap, String fieldName)
throws IOException {
if (analyzer == null) {
throw new UnsupportedOperationException("To use MoreLikeThis without " +
"term vectors, you must provide an Analyzer");
}
TokenStream ts = analyzer.tokenStream(fieldName, r);
int tokenCount = 0;
// for every token
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
String word = termAtt.toString();
tokenCount++;
if (tokenCount > maxNumTokensParsed) {
break;
}
if (isNoiseWord(word)) {
continue;
}
// increment frequency
Int cnt = termFreqMap.get(word);
if (cnt == null) {
termFreqMap.put(word, new Int());
} else {
cnt.x++;
}
}
ts.end();
ts.close();
}
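Note the counting branch above: when a word is seen for the first time, a new Int is put into the map without cnt.x being set explicitly. That works because Int is a small private mutable counter class inside MoreLikeThis whose constructor initializes the count to 1 (used instead of Integer to avoid re-boxing); roughly:
// sketch of the private helper class referenced above
private static class Int {
    int x;
    Int() {
        x = 1;   // a newly seen term already has a frequency of 1
    }
}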
4) Compute each term's TF, IDF, and score, and order the terms by score
private PriorityQueue<Object[]> createQueue(Map<String, Int> words) throws IOException {
// have collected all words in doc and their freqs
int numDocs = ir.numDocs();
FreqQ res = new FreqQ(words.size()); // will order words by score
for (String word : words.keySet()) { // for every word
int tf = words.get(word).x; // term freq in the source doc
if (minTermFreq > 0 && tf < minTermFreq) {
continue; // filter out words that don't occur enough times in the source
}
// go through all the fields and find the largest document frequency
String topField = fieldNames[0];
int docFreq = 0;
for (String fieldName : fieldNames) {
int freq = ir.docFreq(new Term(fieldName, word));
topField = (freq > docFreq) ? fieldName : topField;
docFreq = (freq > docFreq) ? freq : docFreq;
}
if (minDocFreq > 0 && docFreq < minDocFreq) {
continue; // filter out words that don't occur in enough docs
}
if (docFreq > maxDocFreq) {
continue; // filter out words that occur in too many docs
}
if (docFreq == 0) {
continue; // index update problem?
}
float idf = similarity.idf(docFreq, numDocs);
float score = tf * idf;
// only really need 1st 3 entries, other ones are for troubleshooting
res.insertWithOverflow(new Object[]{word, // the word
topField, // the top field
score, // overall score
idf, // idf
docFreq, // freq in all docs
tf
});
}
return res;
}
Tips: several key instance fields are used in this method:
1. minTermFreq: if set (> 0), a term's frequency in the source document must be at least this value; otherwise the term is not used as a dimension of the document's term vector.
2. minDocFreq: if set (> 0), the term's document frequency in the index must be at least this value; otherwise the term is not used as a dimension of the document's term vector.
3. maxDocFreq: the reverse of minDocFreq: terms whose document frequency in the index exceeds this value are filtered out.
4. similarity: MoreLikeThis has a corresponding setter; by passing in a different similarity you can change how IDF is computed.
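FreqQ, the queue returned by createQueue, is a small subclass of org.apache.lucene.util.PriorityQueue that orders the Object[] entries by the score stored in slot 2, with the highest score popped first; this is why createQuery in the next step can take bestScore from the first element it pops. A sketch of it, roughly as it appears in Lucene's source:
// sketch of the queue used by createQueue: lessThan is inverted, so pop()
// returns the highest-scoring term first
private static class FreqQ extends org.apache.lucene.util.PriorityQueue<Object[]> {
    FreqQ(int s) {
        super(s);
    }
    @Override
    protected boolean lessThan(Object[] aa, Object[] bb) {
        Float fa = (Float) aa[2];  // score of entry a
        Float fb = (Float) bb[2];  // score of entry b
        return fa > fb;            // a higher score counts as "less", so it pops first
    }
}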
5) Build a BooleanQuery from the PriorityQueue
private Query createQuery(PriorityQueue<Object[]> q) {
BooleanQuery query = new BooleanQuery();
Object cur;
int qterms = 0;
float bestScore = 0;
while ((cur = q.pop()) != null) {
Object[] ar = (Object[]) cur;
TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));
if (boost) {
if (qterms == 0) {
bestScore = ((Float) ar[2]);
}
float myScore = ((Float) ar[2]);
tq.setBoost(boostFactor * myScore / bestScore);
}
try {
query.add(tq, BooleanClause.Occur.SHOULD);
}
catch (BooleanQuery.TooManyClauses ignore) {
break;
}
qterms++;
if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
break;
}
}
return query;
}
Tips: a couple of instance fields to note:
1. boost: if false, every TermQuery carries the same weight; if true, each term's weight is determined by its TF*IDF score.
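For example (hypothetical numbers): with boostFactor = 1.0, the highest-scoring term pops first, say with score 8.0, and its TermQuery gets boost 8.0 / 8.0 = 1.0; a later term with score 2.0 gets boost 2.0 / 8.0 = 0.25.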
Summary:
From the analysis above, Lucene ultimately reduces the similar-article computation to a single query, and that query is evaluated using the vector space model.
I looked into MoreLikeThis because I plan to run a similar-article recommendation experiment, in order to assess whether Lucene can handle fast processing of massive data sets (on the order of tens of millions of documents).