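After the user supplies the search term queryString (for example, String queryString = "中国";) and suggestionsNumber, the maximum number of suggestions to return,

call the suggestSimilar method to get the ranked list of suggestions: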

suggestions = spellChecker.suggestSimilar(queryString,suggestionsNumber);

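The suggestSimilar() method is described in detail below. It breaks down into the main steps listed after the code (I do not yet fully understand some parts, so only the clearer ones are covered). Its parameters are: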

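word is queryString and numSug is suggestionsNumber;

ir is the IndexReader of the user index (may be null; see the field parameter);

field is the field of the user index (if it is not null, suggestions are restricted to words that occur in that field);

suggestMode is the suggestion mode (if ir == null and/or field == null, it is overridden with SuggestMode.SUGGEST_ALWAYS);

accuracy is the minimum score a suggestion must reach to be included in the results.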

  public String[] suggestSimilar(String word, int numSug, IndexReader ir, String field,
                                 SuggestMode suggestMode, float accuracy) throws IOException {
    // obtainSearcher calls ensureOpen
    final IndexSearcher indexSearcher = obtainSearcher();
    try {
      if (ir == null || field == null) {
        suggestMode = SuggestMode.SUGGEST_ALWAYS;//1.
      }
      if (suggestMode == SuggestMode.SUGGEST_ALWAYS) {
        ir = null;
        field = null;
      }

      final int lengthWord = word.length();
      final int freq = (ir != null && field != null) ? ir.docFreq(new Term(field, word)) : 0;
      final int goalFreq = suggestMode==SuggestMode.SUGGEST_MORE_POPULAR ? freq : 0;
      // if the word exists in the real index and we don't care for word frequency, return the word itself
      if (suggestMode==SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && freq > 0) {
        return new String[] { word };
      }

      BooleanQuery query = new BooleanQuery();
      String[] grams;
      String key;

      for (int ng = getMin(lengthWord); ng <= getMax(lengthWord); ng++) {//2.
        key = "gram" + ng; // form key
        grams = formGrams(word, ng); // form word into ngrams (allow dups too)
        if (grams.length == 0) {
          continue; // hmm
        }
        if (bStart > 0) { // should we boost prefixes?
          add(query, "start" + ng, grams[0], bStart); // matches start of word
        }
        if (bEnd > 0) { // should we boost suffixes
          add(query, "end" + ng, grams[grams.length - 1], bEnd); // matches end of word
        }
        for (int i = 0; i < grams.length; i++) {
          add(query, key, grams[i]);
        }
      }

      int maxHits = 10 * numSug;
      ScoreDoc[] hits = indexSearcher.search(query, null, maxHits).scoreDocs;//3.
      SuggestWordQueue sugQueue = new SuggestWordQueue(numSug, comparator);
      // go thru more than 'maxr' matches in case the distance filter triggers
      int stop = Math.min(hits.length, maxHits);
      SuggestWord sugWord = new SuggestWord();
      for (int i = 0; i < stop; i++) {
        sugWord.string = indexSearcher.doc(hits[i].doc).get(F_WORD); // get orig word

        // don't suggest a word for itself, that would be silly
        if (sugWord.string.equals(word)) {//4.
          continue;
        }

        // edit distance
        sugWord.score = sd.getDistance(word,sugWord.string);//5.

        if (sugWord.score < accuracy) {
          continue;
        }

        if (ir != null && field != null) { // use the user index
          sugWord.freq = ir.docFreq(new Term(field, sugWord.string)); // freq in the index
          // don't suggest a word that is not present in the field
          if ((suggestMode==SuggestMode.SUGGEST_MORE_POPULAR && goalFreq > sugWord.freq) || sugWord.freq < 1) {
            continue;
          }
        }
        sugQueue.insertWithOverflow(sugWord);
        if (sugQueue.size() == numSug) {
          // if queue full, maintain the minScore score
          accuracy = sugQueue.top().score;
        }
        sugWord = new SuggestWord();
      }

      // convert to array string
      String[] list = new String[sugQueue.size()];
      for (int i = sugQueue.size() - 1; i >= 0; i--) {
        list[i] = sugQueue.pop().string;
      }

      return list;
    } finally {
      releaseSearcher(indexSearcher);
    }
  }
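
The n-gram loop in the method above (marked //2.) is easier to follow with a small standalone example. The following is a minimal sketch, not Lucene's source: the class name NGramSketch is made up for illustration, and the gram sizes 2 and 3 are chosen arbitrarily, since the real method derives them from getMin() and getMax(), which are not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    // Split text into every contiguous substring of length n, keeping duplicates.
    // Returns an empty list when n is larger than the text, which is the case
    // the "grams.length == 0" check in the method above skips over.
    static List<String> formGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String word = "lucene";
        // Gram sizes 2..3 are an assumption for this example only.
        for (int n = 2; n <= 3; n++) {
            List<String> grams = formGrams(word, n);
            if (grams.isEmpty()) {
                continue;
            }
            // The real method adds these as query clauses on the fields
            // "gram" + n, "start" + n and "end" + n; here we only print them.
            System.out.println("gram" + n + "  = " + grams);
            System.out.println("start" + n + " = " + grams.get(0));
            System.out.println("end" + n + "   = " + grams.get(grams.size() - 1));
        }
    }
}
```

For "lucene" and n = 2 the grams are lu, uc, ce, en, ne. A candidate word in the spell index that shares many of these grams scores well in the search at //3., which is how the method narrows down candidates before it computes any string distance.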

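1. Set the suggestion mode. There are three modes: SUGGEST_ALWAYS (always try to offer suggestions), SUGGEST_WHEN_NOT_IN_INDEX (suggest only when the term is not in the index; this is the default), and SUGGEST_MORE_POPULAR (suggest only words that are at least as frequent in the index as the query word). In this call ir and field are null, so the mode is forced to SUGGEST_ALWAYS, and ir and field are then both set to null.
2. N-gram splitting. The word is cut into n-grams of several lengths, and each gram is added to a BooleanQuery on the field "gram" + n. When bStart or bEnd is greater than 0, the first and last grams are also added on the "start" + n and "end" + n fields with a boost.
3. Run the search. By default Lucene ranks hits by relevance score, which depends largely on how often the query terms occur in each document. To sort on a specific field instead, you can construct a Sort object and pass it to IndexSearcher's search method. Here the query runs without a filter and fetches at most 10 * numSug hits.
4. Do not suggest the query word itself.
5. Score each candidate. The SpellChecker class does not return the similarity between each suggestion and the user's original input, but it can be recomputed by calling getDistance on a StringDistance instance. getDistance returns a score between 0 and 1, where 1 means the strings are identical. If the score is below accuracy, the current iteration is skipped. The available distance measures are Levenshtein, JaroWinkler, and custom implementations; the getDistance used here belongs to the first kind, LevenshteinDistance.
6. Finally, return the suggestions and print them.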
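
To make the scoring step (marked //5. in the code) concrete, here is a minimal sketch of a Levenshtein-based similarity score with an accuracy cutoff. It is my own illustration and not Lucene's implementation: the class name SimilaritySketch, the candidate words, and the 0.5 threshold are all assumptions. It uses the common normalization of 1 minus the edit distance divided by the longer string's length, so that identical strings score 1.

```java
class SimilaritySketch {
    // Classic dynamic-programming Levenshtein edit distance using two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalize to a similarity in [0, 1]; 1 means the strings are identical.
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f;
        }
        return 1.0f - (float) editDistance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String word = "lucene";
        float accuracy = 0.5f; // hypothetical threshold for this example
        for (String candidate : new String[] {"lucent", "lucerne", "apple"}) {
            float score = similarity(word, candidate);
            if (score < accuracy) {
                continue; // same effect as the accuracy check in suggestSimilar
            }
            System.out.println(candidate + " scored " + score);
        }
    }
}
```

With this normalization "lucent" is one substitution away from "lucene" and scores 1 - 1/6, while "apple" falls below the 0.5 cutoff and is dropped. In suggestSimilar the cutoff also rises once the queue is full, so later candidates have to beat the weakest suggestion kept so far.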

There are many more methods involved here, and I will keep studying them. If you spot any mistakes, corrections are welcome.

posted on 2013-12-04 16:30 by 瞌睡的美人鱼
