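Implementing a Custom Tokenization Algorithm in Lucene

What tokenization does

Tokenization splits a document into individual words, strips punctuation, and removes stop words (function words such as and, to, for) along with any other words that should be filtered out. The units that remain after this processing are the tokens.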
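The steps just described can be sketched in plain Java without Lucene. This is only an illustration of the idea, not Lucene's implementation: the class name SimpleTokenizeSketch and the short stop-word list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SimpleTokenizeSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("and", "to", "for", "the", "is"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on every run of characters that is neither a letter nor a digit,
        // which also discards punctuation.
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            String word = piece.toLowerCase();
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue; // skip empty pieces and stop words
            }
            tokens.add(word);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [search, best, laptop, compare, prices]
        System.out.println(tokenize("Search for the best laptop, and compare prices."));
    }
}
```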

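Lucene's built-in analyzers

StopAnalyzer: splits text at every non-letter character (whitespace, digits, punctuation), lowercases the result, and removes stop words. Stop words are words that carry little meaning on their own, such as is, are, in, on, the.

StandardAnalyzer: grammar-based splitting combined with lowercasing and stop-word removal. It accepts Chinese text, but the results for Chinese are poor.

WhitespaceAnalyzer: splits text at whitespace only.

SimpleAnalyzer: splits text at every non-letter character and lowercases the result.

KeywordAnalyzer: emits the whole input as a single token. It is mainly used for key fields such as ID numbers, postal codes, product codes, and order numbers.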
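To make the differences concrete, here is a rough plain-Java approximation of the splitting rules listed above. It does not call Lucene and is not how the analyzers are implemented internally; the class name AnalyzerComparisonSketch, the method names, and the stop-word list are assumptions for illustration. StandardAnalyzer is left out because its grammar is too involved for a short sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnalyzerComparisonSketch {
    // Hypothetical stop-word list, taken from the examples in the text.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("is", "are", "in", "on", "the"));

    // WhitespaceAnalyzer-like: split at whitespace only, keep case.
    static List<String> whitespace(String text) {
        return collect(text.split("\\s+"), false);
    }

    // SimpleAnalyzer-like: split at every non-letter and lowercase.
    static List<String> simple(String text) {
        return collect(text.split("[^\\p{L}]+"), true);
    }

    // StopAnalyzer-like: SimpleAnalyzer-like splitting, then drop stop words.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<String>(simple(text));
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    // KeywordAnalyzer-like: the whole input is one token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    private static List<String> collect(String[] pieces, boolean lowerCase) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : pieces) {
            if (!piece.isEmpty()) {
                tokens.add(lowerCase ? piece.toLowerCase() : piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Order AE1920-001US is in stock";
        System.out.println(whitespace(text)); // [Order, AE1920-001US, is, in, stock]
        System.out.println(simple(text));     // [order, ae, us, is, in, stock]
        System.out.println(stop(text));       // [order, ae, us, stock]
        System.out.println(keyword(text));    // [Order AE1920-001US is in stock]
    }
}
```

Note how the letter-based rules break the model number into ae and us, while the whitespace rule keeps it whole but does no lowercasing or stop-word removal.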

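Customizing tokenization in Lucene

In the TokenStream class hierarchy, Tokenizer and TokenFilter both extend TokenStream.

StopAnalyzer, StandardAnalyzer, WhitespaceAnalyzer, SimpleAnalyzer, and KeywordAnalyzer all extend the parent class Analyzer.

A custom analyzer therefore only needs to subclass Analyzer and implement its abstract tokenStream method.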

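The splitting algorithm itself lives in a Tokenizer subclass, in the method

public final boolean incrementToken() throws IOException

which is declared in TokenStream and returns false once the input is exhausted.

So a custom tokenizer is written by extending Tokenizer and implementing its incrementToken algorithm.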

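A demo tokenizer

For product names, tokens must cover digits as well as letters. Product model numbers are a typical case: if digits are dropped during splitting, the resulting terms can become ambiguous. A few extra characters also need to be kept inside tokens, such as '-' and '#'.

Take the following product name:

MSI Wind Top AE1920-001US 18.5" Intel Atom D525(1.80GHz) 2GB DDR3 250GB Intel GMA 3150 Windows 7 Home Premium 64-bit

It contains AE1920-001US. If digits were ignored, a user searching for AE1920-001US might get no match, because the index would not contain that keyword.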
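The effect of the character rule can be checked in isolation. The following is a minimal plain-Java sketch, assuming a hypothetical class TokenCharRuleSketch; it compares a letters-only rule with a rule that also accepts digits and the characters '-', '#', and '.', on a shortened version of the product name.

```java
import java.util.ArrayList;
import java.util.List;

class TokenCharRuleSketch {
    // extended == false: letters only.
    // extended == true: letters, digits, and the characters '-', '#', '.'.
    static boolean isTokenChar(char c, boolean extended) {
        if (!extended) {
            return Character.isLetter(c);
        }
        return Character.isLetterOrDigit(c) || c == '-' || c == '#' || c == '.';
    }

    static List<String> split(String text, boolean extended) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, extended)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // a separator ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        String name = "Wind Top AE1920-001US 18.5\" D525(1.80GHz)";
        System.out.println(split(name, false)); // [wind, top, ae, us, d, ghz]
        System.out.println(split(name, true));  // [wind, top, ae1920-001us, 18.5, d525, 1.80ghz]
    }
}
```

With the letters-only rule the model number, screen size, and clock speed are lost; with the extended rule they survive as searchable terms.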

import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Uses the TermAttribute API of Lucene 2.9/3.x (later superseded by CharTermAttribute).
public class ReviewTokenizer extends Tokenizer {
    private TermAttribute termAtt;
    private OffsetAttribute offsetAtt;
    private int offset = 0;      // absolute input offset of ioBuffer[0]
    private int bufferIndex = 0; // next position to read in ioBuffer
    private int dataLen = 0;     // number of valid chars in ioBuffer
    private static final int MAX_WORD_LEN = 255;
    private static final int IO_BUFFER_SIZE = 1024;
    private char[] ioBuffer = new char[IO_BUFFER_SIZE];
    private char[] chars={'-','#','.'}; // extra characters allowed inside a token
    private Set<String> ignoreword=new HashSet<String>(); // stop words to drop
    public ReviewTokenizer(Reader input)
    {
        super(input);
        initIgnoreWord();
        this.offsetAtt = ((OffsetAttribute)addAttribute(OffsetAttribute.class));
        this.termAtt = ((TermAttribute)addAttribute(TermAttribute.class));
    }
    // Fills the stop-word set
    protected void initIgnoreWord()
    {
        String[] STOP_WORDS ={"and", "are", "as", "at", "be", "but", "by",
         "for", "if", "in", "into", "is", "it",
         "no", "not", "of", "on", "or", "such",
         "that", "the", "their", "then", "there", "these",
         "they", "this", "to", "was", "will", "with"};
        for(String word:STOP_WORDS)
        {
           ignoreword.add(word);
        }
    }
    // Returns true for letters, digits, and the extra characters listed in chars
    protected boolean isTokenChar(char c)
    {
       int i=0;
       while(i<chars.length)
       {
           if(c==chars[i++])
              return true;
       }
       return Character.isLetterOrDigit(c);
    }
    // Lowercases a character
    protected char normalize(char c)
    {
       return Character.toLowerCase(c);
    }
    // Tokenization algorithm (Chinese is not supported)
    @Override
    public boolean incrementToken() throws IOException
    {
       clearAttributes();
       int length = 0;
       int start = this.bufferIndex;
       char[] buffer = this.termAtt.termBuffer();

       while(true)
       {
            if (bufferIndex >= dataLen) // ioBuffer is used up: read more data from input
            {
               offset += dataLen;
               dataLen = input.read(ioBuffer); // fill ioBuffer from input
               if (dataLen == -1)
               {
                   dataLen = 0; // so next offset += dataLen won't decrement offset
                   // emit the pending token unless it is a stop word
                   if (length > 0 && !ignoreword.contains(new String(buffer, 0, length)))
                       break;
                   else
                       return false;
               }
               bufferIndex = 0;
            }
            final char c = ioBuffer[bufferIndex++];
            if (isTokenChar(c)) // letters, digits, and the custom '-', '#', '.'
            {
               if (length == 0)
               {
                   start = offset + bufferIndex - 1; // first char of a new token
               }
               else if (length == buffer.length) // term buffer is full: grow it
               {
                   buffer = termAtt.resizeTermBuffer(1+length);
               }
               buffer[length++]=normalize(c); // store in lowercase
               if (length == MAX_WORD_LEN) // token length limit reached
               {
                   break;
               }
            }
            else if (length > 0) // whitespace or a symbol ends the current token
            {
               String word=new String(buffer, 0, length);
               if(ignoreword.contains(word)) // skip stop words such as "or"
               {
                   length=0;
                   continue;
               }
               else
               {
                   break;
               }
            }
       }
       this.termAtt.setTermLength(length);
       this.offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
       return true;
    }
    @Override
    public final void end() {
        // set final offset
        int finalOffset = correctOffset(offset);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }
    @Override
    public void reset(Reader input) throws IOException {
        super.reset(input);
        bufferIndex = 0;
        offset = 0;
        dataLen = 0;
    }
}
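
The buffer-refill and offset bookkeeping in incrementToken can be tried out without a Lucene dependency. The sketch below is a simplified stand-in, not the class above: ChunkedTokenizerSketch, next(), and tokenize() are hypothetical names, stop-word filtering and the length limit are left out, and a deliberately tiny I/O buffer is used so that a token spans several reads.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ChunkedTokenizerSketch {
    private final Reader input;
    private final char[] ioBuffer;
    private int offset = 0;      // absolute input offset of ioBuffer[0]
    private int bufferIndex = 0; // next position to read in ioBuffer
    private int dataLen = 0;     // number of valid chars in ioBuffer

    ChunkedTokenizerSketch(Reader input, int ioBufferSize) {
        this.input = input;
        this.ioBuffer = new char[ioBufferSize];
    }

    // Returns "term@start-end" for the next token, or null at end of input.
    String next() throws IOException {
        StringBuilder term = new StringBuilder();
        int start = 0;
        while (true) {
            if (bufferIndex >= dataLen) {      // buffer used up: refill
                offset += dataLen;
                dataLen = input.read(ioBuffer);
                if (dataLen == -1) {
                    dataLen = 0;               // keep offset stable on later calls
                    if (term.length() > 0) break;
                    return null;
                }
                bufferIndex = 0;
            }
            char c = ioBuffer[bufferIndex++];
            if (Character.isLetterOrDigit(c) || c == '-' || c == '#' || c == '.') {
                if (term.length() == 0) start = offset + bufferIndex - 1;
                term.append(Character.toLowerCase(c));
            } else if (term.length() > 0) {
                break;                         // separator ends the token
            }
        }
        return term + "@" + start + "-" + (start + term.length());
    }

    static List<String> tokenize(String text, int ioBufferSize) {
        ChunkedTokenizerSketch t =
                new ChunkedTokenizerSketch(new StringReader(text), ioBufferSize);
        List<String> out = new ArrayList<String>();
        try {
            for (String tok = t.next(); tok != null; tok = t.next()) out.add(tok);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // A 4-char buffer forces several refills inside the first token.
        System.out.println(tokenize("AE1920-001US 2GB", 4)); // [ae1920-001us@0-12, 2gb@13-16]
    }
}
```

Because offset is advanced by dataLen on every refill, the start and end offsets stay correct no matter where the buffer boundaries fall.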

posted on 2012-03-22 16:44 reck for zhou
