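Implementing a Custom Tokenization Algorithm in Lucene

What tokenization does

       Tokenization splits a document into individual words, strips punctuation, and removes stop words (function words such as "and", "to", and "for") along with any other filtered words. What remains after this processing is the stream of tokens.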
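       The steps above can be sketched in plain Java without any Lucene classes. This is a minimal illustration, not how Lucene does it internally: the class name TokenizeSketch, the four-word stop list, and the choice to lowercase before the stop-word lookup are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenizeSketch {
    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("and", "to", "for", "the"));

    // Split on runs of characters that are neither letters nor digits
    // (this removes punctuation), lowercase each piece, drop stop words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.split("[^\\p{L}\\p{Nd}]+")) {
            String word = piece.toLowerCase();
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [tips, kitchen, pots, pans, knives]
        System.out.println(tokenize("Tips for the kitchen: pots, pans and knives."));
    }
}
```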

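Lucene's built-in analyzers

       StopAnalyzer: splits words at whitespace and symbols (more precisely, at any non-letter character, so digits are dropped as well), lowercases them, and removes stop words. Stop words include "is", "are", "in", "on", "the", and other words that carry no real meaning.

       StandardAnalyzer: mixed, grammar-based splitting that also removes stop words. It accepts Chinese text, but the results for Chinese are poor.

       WhitespaceAnalyzer: splits words at whitespace only.

       SimpleAnalyzer: splits words at whitespace and symbols (any non-letter character) and lowercases them.

       KeywordAnalyzer: treats the entire input as a single token. It is mainly used for key fields such as ID card numbers, postal codes, product codes, and order numbers.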
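       To make the differences concrete, the sketch below imitates the splitting rules of four of these analyzers in plain Java. It does not call Lucene; the class AnalyzerRulesSketch, its method names, and the tiny stop list are hypothetical, and the real analyzers handle details (buffering, offsets, larger stop lists) that are left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AnalyzerRulesSketch {
    // WhitespaceAnalyzer-like rule: split at whitespace only; case and symbols are kept.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<String>();
        for (String s : text.trim().split("\\s+")) {
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    // SimpleAnalyzer-like rule: keep runs of letters only, lowercased.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    // StopAnalyzer-like rule: the simple rule plus stop-word removal
    // (illustrative stop list, not Lucene's full default set).
    static List<String> stop(String text) {
        List<String> out = new ArrayList<String>(simple(text));
        out.removeAll(Arrays.asList("is", "are", "in", "on", "the"));
        return out;
    }

    // KeywordAnalyzer-like rule: the whole input is one token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        String text = "Order SN-2024 is ready";
        System.out.println(whitespace(text)); // [Order, SN-2024, is, ready]
        System.out.println(simple(text));     // [order, sn, is, ready]
        System.out.println(stop(text));       // [order, sn, ready]
        System.out.println(keyword(text));    // [Order SN-2024 is ready]
    }
}
```

       Note how the letter-only rules lose the digits in "SN-2024", which is the problem the custom tokenizer in this post sets out to solve.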

Customizing tokenization in Lucene

       [Figure: TokenStream inheritance diagram]

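       StopAnalyzer, StandardAnalyzer, WhitespaceAnalyzer, SimpleAnalyzer, and KeywordAnalyzer all inherit from the parent class Analyzer.

       So implementing the parent's abstract method tokenStream is enough to provide a custom analyzer.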

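       The splitting algorithm itself is implemented in a subclass of Tokenizer, by overriding the method

       public final boolean incrementToken() throws IOException

       So a custom tokenizer is written by subclassing Tokenizer and implementing its incrementToken algorithm.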
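       The calling pattern behind incrementToken is "pull" style: the consumer keeps calling it, and each call that returns true has made one more token available. The plain-Java sketch below only imitates that contract with a whitespace splitter. PullTokenStreamSketch and its accessor names are hypothetical and are not Lucene classes; in Lucene the term text and offsets are exposed through attributes rather than getters.

```java
public class PullTokenStreamSketch {
    private final String text;
    private int pos = 0;     // next character to examine
    private String term;     // text of the current token
    private int start;       // start offset of the current token (inclusive)
    private int end;         // end offset of the current token (exclusive)

    public PullTokenStreamSketch(String text) {
        this.text = text;
    }

    // Advance to the next token; return false once the input is exhausted.
    public boolean incrementToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++; // skip separators
        }
        if (pos >= text.length()) {
            return false;
        }
        start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++; // consume the token
        }
        end = pos;
        term = text.substring(start, end);
        return true;
    }

    public String term() { return term; }
    public int startOffset() { return start; }
    public int endOffset() { return end; }

    public static void main(String[] args) {
        PullTokenStreamSketch ts = new PullTokenStreamSketch("Wind Top  AE1920");
        // prints: Wind [0,4)   Top [5,8)   AE1920 [10,16)   (one per line)
        while (ts.incrementToken()) {
            System.out.println(ts.term() + " [" + ts.startOffset() + "," + ts.endOffset() + ")");
        }
    }
}
```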

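A demo tokenizer for Lucene

       Product names need a tokenizer that keeps digits as well as letters. Model numbers are a typical case: if digits are ignored during splitting, the resulting tokens can become ambiguous. A few symbols also need to be kept inside tokens, such as "-" and "#".

       Take the following product name as an example: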

       MSI Wind Top AE1920-001US 18.5" Intel Atom D525(1.80GHz) 2GB DDR3 250GB Intel GMA 3150 Windows 7 Home Premium 64-bit

       It contains the model number AE1920-001US. If digits are ignored, then when a user searches for AE1920-001US the system may fail to recognize the keyword.
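       The effect can be seen by splitting part of this product name under two rules: letters only, and letters plus digits plus a few extra symbols. This is a plain-Java sketch with a hypothetical class name (ModelNumberSketch); the extra characters '-', '#' and '.' are the ones the tokenizer in this post allows.

```java
import java.util.ArrayList;
import java.util.List;

public class ModelNumberSketch {
    // Characters kept inside a token in addition to letters and digits.
    private static final String EXTRA = "-#.";

    // Collect lowercased runs of token characters under the chosen rule.
    static List<String> split(String text, boolean keepDigitsAndSymbols) {
        List<String> out = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean tokenChar = keepDigitsAndSymbols
                    ? Character.isLetterOrDigit(c) || EXTRA.indexOf(c) >= 0
                    : Character.isLetter(c);
            if (tokenChar) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        String name = "MSI Wind Top AE1920-001US 18.5\" Intel Atom D525(1.80GHz)";
        // Letters only: [msi, wind, top, ae, us, intel, atom, d, ghz]
        System.out.println(split(name, false));
        // Letters, digits, '-', '#', '.':
        // [msi, wind, top, ae1920-001us, 18.5, intel, atom, d525, 1.80ghz]
        System.out.println(split(name, true));
    }
}
```

       With the letters-only rule the model number collapses to "ae" and "us", so a query for the full model number has nothing to match.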

    

    

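import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Uses the TermAttribute API (Lucene 2.9/3.x; later versions use CharTermAttribute instead).
public class ReviewTokenizer extends Tokenizer {
    private TermAttribute termAtt;
    private OffsetAttribute offsetAtt;
    private int offset = 0;      // characters consumed before the current ioBuffer contents
    private int bufferIndex = 0; // index of the next character to read in ioBuffer
    private int dataLen = 0;     // number of valid characters in ioBuffer
    private static final int MAX_WORD_LEN = 255;
    private static final int IO_BUFFER_SIZE = 1024;
    private char[] ioBuffer = new char[IO_BUFFER_SIZE];
    private char[] chars={'-','#','.'}; // extra characters allowed inside a token
    private Set<String> ignoreword=new HashSet<String>(); // stop words to drop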
    public ReviewTokenizer(Reader input)
    {
       super(input);
       initIgnoreWord();
        this.offsetAtt = ((OffsetAttribute)addAttribute(OffsetAttribute.class));
        this.termAtt = ((TermAttribute)addAttribute(TermAttribute.class));
    }
    protected void initIgnoreWord()
    {
        String[] STOP_WORDS ={"and", "are", "as", "at", "be", "but", "by",
         "for", "if", "in", "into", "is", "it",
         "no", "not", "of", "on", "or", "such",
         "that", "the", "their", "then", "there", "these",
         "they", "this", "to", "was", "will", "with"};
        for(String word:STOP_WORDS)
        {
           ignoreword.add(word); 
        }
    }
    // A token character is a letter, a digit, or one of the extra characters in chars
    protected boolean isTokenChar(char c)
    {
       for (char extra : chars)
       {
           if (c == extra)
              return true;
       }
       return Character.isLetterOrDigit(c);
    }
    // Normalize to lowercase
    protected char normalize(char c)
    {
       return Character.toLowerCase(c);
    }
    // Tokenization algorithm (does not support Chinese)
    public final boolean incrementToken() throws IOException
    {
       clearAttributes();
       int length = 0;
       int start = this.bufferIndex;
       char[] buffer = this.termAtt.termBuffer();
       
       while(true)
       {
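            if (bufferIndex >= dataLen) // ioBuffer is exhausted: read more data from input
            {
               offset += dataLen;
               dataLen = input.read(ioBuffer); // fill ioBuffer from input
               if (dataLen == -1)
               {
                   dataLen = 0;                            // so next offset += dataLen won't decrement offset
                   // end of input: emit the pending token unless it is a stop word
                   if (length > 0 && !ignoreword.contains(new String(buffer, 0, length)))
                       break;
                     else
                       return false;
                 }
                bufferIndex = 0;
            }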
            final char c = ioBuffer[bufferIndex++];
            if (isTokenChar(c)) // letter, digit, or one of the custom characters '-', '#', '.'
            {
               if (length == 0)
               {
                   start = offset + bufferIndex - 1;
                }
               else if (length == buffer.length) // term buffer is full: grow it
               {
                   buffer = termAtt.resizeTermBuffer(1+length);
               }
               buffer[length++]=normalize(c); // lowercase by default
               if (length == MAX_WORD_LEN)      // buffer overflow!
               {
                     break;
               }
             }
            else if (length > 0) // a space or symbol acts as a separator and ends the current token
            {
               String word = new String(buffer, 0, length);
               if (ignoreword.contains(word)) // skip common stop words such as "or"
               {
                   length = 0;
                   continue;
               }
               else
               {
                   break;
               }
            }
       }
       this.termAtt.setTermLength(length);
       this.offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
       return true;
    }
      public final void end() {
        // set final offset
        int finalOffset = correctOffset(offset);
        offsetAtt.setOffset(finalOffset, finalOffset);
      }
      public void reset(Reader input) throws IOException {
        super.reset(input);
        bufferIndex = 0;
        offset = 0;
        dataLen = 0;
      }
}
 



posted on 2012-03-22 16:44 by reck for zhou
