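Once the user supplies the search term queryString (for example, String queryString = "中国";) and suggestionsNumber, the maximum number of suggestions to return,
calling the suggestSimilar method returns the suggested words in ranked order: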
suggestions = spellChecker.suggestSimilar(queryString,suggestionsNumber);
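The rest of this post looks at the suggestSimilar() method in detail. It breaks down into the steps listed further below (I do not fully understand what some parts are doing yet, so only the clearer ones are listed). Its parameters are:
word is queryString and numSug is suggestionsNumber,
ir is the IndexReader of the user index (it may be null; see the field parameter),
field is the field of the user index (if this field is not null, the suggested words are restricted to words that exist in that field),
suggestMode is the suggest mode (if ir is null and/or field is null, it is overridden with SuggestMode.SUGGEST_ALWAYS),
accuracy is the minimum score a suggestion must have to be included in the results.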
public String[] suggestSimilar(String word, int numSug, IndexReader ir,
    String field, SuggestMode suggestMode, float accuracy) throws IOException {
  // obtainSearcher calls ensureOpen
  final IndexSearcher indexSearcher = obtainSearcher();
  try {
    if (ir == null || field == null) {
      suggestMode = SuggestMode.SUGGEST_ALWAYS; //1.
    }
    if (suggestMode == SuggestMode.SUGGEST_ALWAYS) {
      ir = null;
      field = null;
    }

    final int lengthWord = word.length();

    final int freq = (ir != null && field != null) ? ir.docFreq(new Term(field, word)) : 0;
    final int goalFreq = suggestMode == SuggestMode.SUGGEST_MORE_POPULAR ? freq : 0;
    // if the word exists in the real index and we don't care for word frequency, return the word itself
    if (suggestMode == SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && freq > 0) {
      return new String[] { word };
    }

    BooleanQuery query = new BooleanQuery();
    String[] grams;
    String key;

    for (int ng = getMin(lengthWord); ng <= getMax(lengthWord); ng++) { //2.
      key = "gram" + ng; // form key
      grams = formGrams(word, ng); // form word into ngrams (allow dups too)

      if (grams.length == 0) {
        continue; // hmm
      }

      if (bStart > 0) { // should we boost prefixes?
        add(query, "start" + ng, grams[0], bStart); // matches start of word
      }
      if (bEnd > 0) { // should we boost suffixes
        add(query, "end" + ng, grams[grams.length - 1], bEnd); // matches end of word
      }
      for (int i = 0; i < grams.length; i++) {
        add(query, key, grams[i]);
      }
    }

    int maxHits = 10 * numSug;

    ScoreDoc[] hits = indexSearcher.search(query, null, maxHits).scoreDocs; //3.
    SuggestWordQueue sugQueue = new SuggestWordQueue(numSug, comparator);

    // go thru more than 'maxr' matches in case the distance filter triggers
    int stop = Math.min(hits.length, maxHits);
    SuggestWord sugWord = new SuggestWord();
    for (int i = 0; i < stop; i++) {
      sugWord.string = indexSearcher.doc(hits[i].doc).get(F_WORD); // get orig word

      // don't suggest a word for itself, that would be silly
      if (sugWord.string.equals(word)) { //4.
        continue;
      }

      // edit distance
      sugWord.score = sd.getDistance(word, sugWord.string); //5.
      if (sugWord.score < accuracy) {
        continue;
      }

      if (ir != null && field != null) { // use the user index
        sugWord.freq = ir.docFreq(new Term(field, sugWord.string)); // freq in the index
        // don't suggest a word that is not present in the field
        if ((suggestMode == SuggestMode.SUGGEST_MORE_POPULAR && goalFreq > sugWord.freq) || sugWord.freq < 1) {
          continue;
        }
      }
      sugQueue.insertWithOverflow(sugWord);
      if (sugQueue.size() == numSug) {
        // if queue full, maintain the minScore score
        accuracy = sugQueue.top().score;
      }
      sugWord = new SuggestWord();
    }

    // convert to array string
    String[] list = new String[sugQueue.size()];
    for (int i = sugQueue.size() - 1; i >= 0; i--) {
      list[i] = sugQueue.pop().string;
    }

    return list;
  } finally {
    releaseSearcher(indexSearcher);
  }
}
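- Set the suggest mode (marked //1. in the code). There are three modes: SUGGEST_ALWAYS (always try to offer suggestions), SUGGEST_WHEN_NOT_IN_INDEX (only suggest when the term is not in the index; this is the default), and SUGGEST_MORE_POPULAR (only suggest words that are at least as frequent in the index as the query term). When ir or field is null, as in the two-argument call shown earlier, the mode is forced to SUGGEST_ALWAYS, and ir and field are both set to null.
- N-gram splitting (//2.). The word is split into n-grams of every length from getMin(lengthWord) to getMax(lengthWord), and each gram is added to a BooleanQuery on the field "gram" + ng. If bStart or bEnd is greater than 0, the first or last gram is also added on the "start" + ng or "end" + ng field with that boost.
- Run the search (//3.). The query is executed and at most 10 * numSug hits are fetched. By default Lucene orders hits by relevance score, which depends on how often the query terms occur in each document; to sort on a specific field instead, a Sort object can be constructed and passed to IndexSearcher's search method.
- Do not suggest the query word itself (//4.).
- Score each candidate (//5.). The SpellChecker class does not return the similarity between each suggestion and the user's original input, but the value can be recomputed by calling getDistance on a StringDistance instance. Despite the name, getDistance returns a similarity between 0 and 1, where 1 means the two strings are identical. If the score is lower than accuracy, the current loop iteration is skipped. The available measures are Levenshtein, Jaro-Winkler, or a custom StringDistance implementation. The getDistance used here belongs to the first kind (Lucene spells the class LevensteinDistance), which is what SpellChecker uses when no StringDistance is supplied.
- Finally, return the suggestions. The surviving candidates are kept in a SuggestWordQueue of size numSug, popped into a String array, and returned to the caller for output.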
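The n-gram, scoring, and top-N steps above can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions: SuggestSketch, Scored, and suggest are hypothetical names, the candidates come from an in-memory list rather than from the spell index, and similarity is a normalized Levenshtein score written in the spirit of LevensteinDistance, not copied from Lucene. It also skips the frequency check and does not raise accuracy once the queue is full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java sketch of steps 2, 4, 5 and 6. Hypothetical code, not Lucene's. */
final class SuggestSketch {

    /** A candidate word together with its similarity score (hypothetical helper). */
    static final class Scored {
        final String text;
        final float score;
        Scored(String text, float score) { this.text = text; this.score = score; }
    }

    /** Step 2: split text into overlapping n-grams of length ng (duplicates kept). */
    static String[] formGrams(String text, int ng) {
        int count = Math.max(0, text.length() - ng + 1);
        String[] grams = new String[count];
        for (int i = 0; i < count; i++) {
            grams[i] = text.substring(i, i + ng);
        }
        return grams;
    }

    /** Step 5: Levenshtein edit distance normalized to a similarity in [0, 1]. */
    static float similarity(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the latest row
            prev = curr;
            curr = tmp;
        }
        int distance = prev[b.length()];
        return 1.0f - (float) distance / Math.max(a.length(), b.length());
    }

    /** Steps 4 to 6: drop the word itself, filter by accuracy, keep the best numSug. */
    static List<String> suggest(String word, List<String> candidates, int numSug, float accuracy) {
        // Min-heap on score: the head is the weakest suggestion kept so far.
        PriorityQueue<Scored> queue =
                new PriorityQueue<>((x, y) -> Float.compare(x.score, y.score));
        for (String candidate : candidates) {
            if (candidate.equals(word)) {
                continue; // step 4: never suggest the query word itself
            }
            float score = similarity(word, candidate); // step 5
            if (score < accuracy) {
                continue;
            }
            queue.add(new Scored(candidate, score));
            if (queue.size() > numSug) {
                queue.poll(); // evict the weakest suggestion
            }
        }
        // Step 6: the heap pops weakest first, so reverse to get best first.
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            result.add(queue.poll().text);
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(formGrams("lucene", 3))); // prints [luc, uce, cen, ene]
        List<String> words = Arrays.asList("lucene", "lucent", "lucerne", "apple", "luce");
        System.out.println(suggest("lucene", words, 2, 0.5f)); // prints [lucerne, lucent]
    }
}
```

A min-heap is used here so that the weakest kept suggestion is always the one evicted, which is the same role SuggestWordQueue plays in the method above.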
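There are many more methods involved here that I will keep studying. If anything above is wrong, corrections from readers are welcome.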