Lucene in Action Notes: Analysis

Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation, terms.

The analysis steps are not fixed; they generally include: extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into their basic form (lemmatization).

Choosing the right analyzer is a crucial development decision with Lucene. One size doesn't fit all when it comes to choosing an analyzer. Language is one factor in choosing an analyzer, because each has its own unique features. Another factor to consider in choosing an analyzer is the domain of the text being analyzed.

So how does Google analyze text? If you search Google for "to be or not to be" without quotes, only "not" is considered; the other words are dropped as common words. Yet Google does not discard these stop words at index time, because if you put quotes around the query you can still find documents containing the exact phrase. An interesting question: if such an enormous number of stop words is indexed, what does Google do to save space and avoid running out of storage? This is discussed later.

An example: analyzing the same text with different analyzers gives different results:

Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]

I. Using analyzers

1. Indexing analysis

At index time, an analyzer is used like this:

Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer, true);

You can also specify a particular analyzer for an individual document:

writer.addDocument(doc, analyzer);

2. QueryParser analysis

QueryParser also needs an analyzer when parsing the query expression:

QueryParser parser = new QueryParser("contents", analyzer);
query = parser.parse(expression);

Or:

Query query = QueryParser.parse(expression, "contents", analyzer);

II. Analyzing the analyzer
So what exactly does an analyzer do? As mentioned above, it turns text into a sequence of tokens, a TokenStream, so it exposes an interface like this:
public TokenStream tokenStream(String fieldName, Reader reader)

For the simplest one, SimpleAnalyzer, the implementation looks like this:
public final class SimpleAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseTokenizer(reader);
    }
}

1. What’s in a token?
A stream of tokens is the fundamental output of the analysis process.
Analysis essentially cuts the text into small pieces, i.e., tokenization; each piece is a token.
So what is the difference between a Token and a Term? Look at the token first.
A token consists of:
text value (the word itself)
start and end offsets in the original text
token type: the default is "word"
position increment: the position relative to the previous token, i.e., how far it is from the previous token; the default is 1

What is a term?
After text is analyzed during indexing, each token is posted to the index as a term.
A term keeps only the text value and the position increment; the rest of the token information is discarded.
Why is everything else discarded but the position increment kept?
Position increments factor directly into performing phrase queries (see section 3.4.5) and span queries (see section 5.4), which rely on knowing how far terms are from one another within a field.
So it is needed for phrase queries and span queries.
The position increment defaults to 1. When it is greater than 1, some words have been skipped, words that for some reason are not tokens, such as stop words.
When it equals 0, things get interesting: the token occupies the same position as the previous one. This can be used to add word aliases, as the SynonymAnalyzer below does.

2. TokenStreams
There are two different styles of TokenStreams: Tokenizer and TokenFilter.
What is the difference?
A Tokenizer takes raw text as input and produces the basic tokens; examples are CharTokenizer, WhitespaceTokenizer, and StandardTokenizer.
A TokenFilter takes another TokenStream as input, i.e., it post-processes an already generated token stream; it is the higher-level building block. Examples are LowerCaseFilter, StopFilter, and PorterStemFilter.

Take StopAnalyzer as an example:
public TokenStream tokenStream(String fieldName, Reader reader) {
    return new StopFilter(
        new LowerCaseTokenizer(reader), stopTable);
}
As you can see, StopAnalyzer first tokenizes with LowerCaseTokenizer and then filters out stop words with StopFilter, two steps.
Layering analyzers this way makes them flexible: you can compose analyzers for all kinds of purposes.

3. Visualizing analyzers
The author provides a small utility that prints an analyzer's output; with it you can get a much better feel for what each analyzer does.
Analyzing "The quick brown fox...." produces tokens like this:
1: [the:0->3:word]
2: [quick:4->9:word]
3: [brown:10->15:word]
4: [fox:16->19:word]
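
The book's AnalyzerUtils is not reproduced here, but a minimal token dumper in the same spirit (a sketch, assuming the Lucene 1.4-era API used throughout these notes, where TokenStream.next() returns a Token) could look like this:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDumper {
    // Prints each token as "position: [text:start->end:type]".
    public static void displayTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        int position = 0;
        for (Token token = stream.next(); token != null; token = stream.next()) {
            position += token.getPositionIncrement();
            System.out.println(position + ": [" + token.termText() + ":"
                + token.startOffset() + "->" + token.endOffset()
                + ":" + token.type() + "]");
        }
    }

    public static void main(String[] args) throws IOException {
        displayTokens(new SimpleAnalyzer(), "The quick brown fox....");
    }
}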

Earlier we said that when a token becomes a term, the start and end offsets and the token type are dropped. Does that mean they are useless?
No. For example, search engines now provide term highlighting, and that relies on the token's start and end offsets.
As for what the token type is good for, see the following example:
“I’ll e-mail you at xyz@example.com” with StandardAnalyzer:
1: [i'll:0->4:<APOSTROPHE>]
2: [e:5->6:<ALPHANUM>]
3: [mail:7->11:<ALPHANUM>]
4: [you:12->15:<ALPHANUM>]
5: [xyz@example.com:19->34:<EMAIL>]

4. Filtering order can be important
As mentioned, the layered analyzer design makes flexible composition easy, but the order of the filters matters.
For example, StopFilter is case sensitive, so if you do not lowercase before applying it you will get unexpected results, and so on; see the sketch below.
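
As an illustration (a hypothetical pair of analyzers, not from the book; it assumes the old StopFilter(TokenStream, String[]) constructor and the StopAnalyzer.ENGLISH_STOP_WORDS list), compare the two filter orders:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Correct order: lowercase first, then remove stop words.
class GoodOrderAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(new LowerCaseTokenizer(reader),
                              StopAnalyzer.ENGLISH_STOP_WORDS);
    }
}

// Wrong order: "The" is compared against the lowercase stop list before
// being lowercased, so it slips through and survives as the token [the].
class WrongOrderAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(
            new StopFilter(new WhitespaceTokenizer(reader),
                           StopAnalyzer.ENGLISH_STOP_WORDS));
    }
}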

III. Using the built-in analyzers
First, the following built-in analyzers:
WhitespaceAnalyzer: Splits tokens at whitespace
SimpleAnalyzer:     Divides text at nonletter characters and lowercases
StopAnalyzer:       Divides text at nonletter characters, lowercases, and removes stop words
StandardAnalyzer:   Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words
WhitespaceAnalyzer and SimpleAnalyzer are both trivial, so we skip them.

1. StopAnalyzer
There is not much to say here: StopAnalyzer is SimpleAnalyzer plus stop-word removal.
The stop-word list can be the default one or one you supply yourself, as in the sketch below.
At both index and query time all stop words are filtered out, so no matter how many stop words you put in a document, they are never indexed.
That makes the choice of stop words important: too large a list loses semantics, too small a list hurts efficiency.
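
A minimal sketch of both options (assuming the StopAnalyzer constructors of this Lucene version):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;

public class StopAnalyzerChoice {
    // Uses the built-in English stop-word list.
    static Analyzer defaultStops = new StopAnalyzer();

    // Uses an application-specific stop-word list instead.
    static Analyzer customStops =
        new StopAnalyzer(new String[] {"the", "a", "an", "und", "der"});
}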

2. StandardAnalyzer
This is Lucene's flagship analyzer and the most general-purpose one; it can intelligently recognize:
alphanumerics, acronyms, company names, e-mail addresses, computer host names, numbers, words with an interior apostrophe, serial numbers, IP addresses, and CJK (Chinese Japanese Korean) characters.
Quite powerful.

IV. Dealing with keyword fields
What is special about a Keyword field? It is not analyzed; the whole value is indexed directly as a single term.
So if you query like the code below, treating the whole value as one term, there is no problem and the document is found.
Document doc = new Document();
doc.add(Field.Keyword("partnum", "Q36"));
doc.add(Field.Text("description", "Illidium Space Modulator"));
writer.addDocument(doc);

Query query = new TermQuery(new Term("partnum", "Q36"));

But if you are unlucky enough to use QueryParser, like this:
Query query = QueryParser.parse("partnum:Q36 AND SPACE", "description", new SimpleAnalyzer());
the document is not found. Why?
Because QueryParser analyzes each term and phrase. It is very diligent: every term and phrase gets analyzed, here with SimpleAnalyzer.
And SimpleAnalyzer strips nonletter characters and lowercases, so Q36 becomes q.
This shows that indexing and analysis are intimately tied to searching; get one step wrong and you will find nothing at all.
Arguably this is a QueryParser issue: it insists on analyzing where analysis does not belong.
How to solve it:
a. Don't let users type free-form queries; make them choose through the UI. (This hardly counts as a solution.)
b. Find a way to do field-specific analysis.
c. Create a custom domain-specific analyzer, e.g. one that analyzes "Q36" and leaves it as "Q36" instead of turning it into "q".
d. Subclass QueryParser and override getFieldQuery to provide field-specific handling.
Here we take option (b) to solve the problem.
As mentioned in the indexing chapter, we can specify an analyzer per IndexWriter and per document, but ordinary analyzers and methods give no way to assign a special analyzer to a particular field.
So we need a special class: PerFieldAnalyzerWrapper. As the name says, per field.
The built-in PerFieldAnalyzerWrapper constructor requires the default analyzer as a parameter.
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());

To assign a different analyzer to a field, use the addAnalyzer method.
analyzer.addAnalyzer("partnum", new KeywordAnalyzer());

During tokenization, the analyzer specific to the field name is used; the default is used if no field-specific analyzer has been assigned.
Query query = QueryParser.parse("partnum:Q36 AND SPACE", "description", analyzer);
This solves the problem above. It uses a KeywordAnalyzer; here are two implementations.
The more complex, more general implementation:
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class KeywordAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done;
            private final char[] buffer = new char[1024];

            public Token next() throws IOException {
                if (!done) {
                    done = true;
                    // Read the entire field value and return it as one token.
                    StringBuffer buffer = new StringBuffer();
                    int length = 0;
                    while (true) {
                        length = reader.read(this.buffer);
                        if (length == -1) break;
                        buffer.append(this.buffer, 0, length);
                    }
                    String text = buffer.toString();
                    return new Token(text, 0, text.length());
                }
                // Second call: the single token has already been emitted.
                return null;
            }
        };
    }
}
In plain terms, it overrides the token iterator's next() method: the first call reads the entire content and returns it as one token, and the second call returns null.
Note that the 1024-char array is only the chunk size used while reading; the loop keeps appending to the StringBuffer, so the keyword length itself is effectively unlimited (unlike the CharTokenizer-based version below).

The simpler approach, though the keyword is limited to 255 characters:
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class SimpleKeywordAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Every character counts as a token character, so the whole value
        // becomes a single token (capped at CharTokenizer's 255-char limit).
        return new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return true;
            }
        };
    }
}
This uses CharTokenizer: as long as isTokenChar returns true it keeps reading, and only when it returns false does it end the current token.
But it has an upper bound: a token can be at most 255 characters. Here isTokenChar is always true, so it reads the whole value (up to that limit).
CharTokenizer is actually quite important: WhitespaceTokenizer and LetterTokenizer are both built on it. WhitespaceTokenizer returns false only for whitespace, while LetterTokenizer returns Character.isLetter(c), roughly as in the sketch below.
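
A rough sketch of those two tokenizers on top of CharTokenizer (renamed here to mark them as illustrations rather than the actual Lucene source):

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;

// Token boundary = whitespace: everything that is NOT whitespace is a token character.
class MyWhitespaceTokenizer extends CharTokenizer {
    public MyWhitespaceTokenizer(Reader in) { super(in); }
    protected boolean isTokenChar(char c) {
        return !Character.isWhitespace(c);
    }
}

// Token boundary = non-letter: only letters are token characters.
class MyLetterTokenizer extends CharTokenizer {
    public MyLetterTokenizer(Reader in) { super(in); }
    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }
}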

V. "Sounds like" querying
This one is fun. What does it mean? Straight to an example.
Use a Metaphone-based analyzer, which replaces each token with its phonetic value:
Analyzer analyzer = new MetaphoneReplacementAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();
doc.add(Field.Text("contents", "cool cat"));
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(directory);
Query query = QueryParser.parse("kool kat", "contents", analyzer);
The result: the document above is found. Why? Because "cool cat" and "kool kat" sound the same. The detailed implementation of MetaphoneReplacementAnalyzer is in the book, so it is not repeated here (a rough sketch follows the example below).
For example, after analysis these two sentences:
"The quick brown fox jumped over the lazy dogs"
"Tha quik brown phox jumpd ovvar tha lazi dogz"
both produce the same phonetic values:
[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]
[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]
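
The core of such an analyzer is a TokenFilter that swaps each token's text for its phonetic encoding. A minimal sketch, assuming Apache Commons Codec's Metaphone class (the book's actual MetaphoneReplacementFilter may differ in detail):

import java.io.IOException;

import org.apache.commons.codec.language.Metaphone;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class MetaphoneReplacementFilter extends TokenFilter {
    public static final String METAPHONE = "METAPHONE";
    private final Metaphone metaphoner = new Metaphone();

    public MetaphoneReplacementFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        // Replace the token text with its phonetic value, keeping the original offsets.
        return new Token(metaphoner.metaphone(t.termText()),
                         t.startOffset(), t.endOffset(), METAPHONE);
    }
}

An analyzer then simply wraps this filter around a basic tokenizer, e.g. new MetaphoneReplacementFilter(new LowerCaseTokenizer(reader)).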

What is this good for?
a sounds-like feature would be great for situations where a user misspelled every word and no documents were found, but alternative words could be suggested.
One implementation approach to this idea could be to run all text through a sounds-like analysis and build a cross-reference lookup to consult when a correction is needed.
It is quite useful: the suggestions a search engine gives when you misspell a query may well be implemented this way. Such correction can also be done with Bayesian classification and other machine-learning methods, though those seem more complex.

VI. Synonyms, aliases, and words that mean the same
Synonyms are also heavily used. An example, analyzing "jumps":
Token[] tokens = AnalyzerUtils.tokensFromAnalysis(synonymAnalyzer, "jumps");
Normally the only token would be "jumps", but now it is {"jumps", "hops", "leaps"}:
AnalyzerUtils.assertTokensEqual(tokens, new String[] {"jumps", "hops", "leaps"});
而且"hops", "leaps"的PositionIncrement为0, 这个在前面已经提到.
assertEquals("jumps", 1, tokens[0].getPositionIncrement());
assertEquals("hops", 0, tokens[1].getPositionIncrement());
assertEquals("leaps", 0, tokens[2].getPositionIncrement());

OK, so how is this synonymAnalyzer implemented?
TokenStream result = new SynonymFilter(
    new StopFilter(
        new LowerCaseFilter(
            new StandardFilter(
                new StandardTokenizer(reader))),
        StandardAnalyzer.STOP_WORDS),
    engine
);
This example shows the flexibility of Lucene analyzers nicely: starting from StandardTokenizer and wrapping filter after filter, we end with SynonymFilter, which consults the engine you supply and injects synonym tokens into the token stream.
The full SynonymFilter implementation is in the book and is quite simple: iterate over the tokens, look each one up in the engine, and for every synonym found wrap it in a token with a position increment of 0 and insert it:
String[] synonyms = engine.getSynonyms(token.termText());
if (synonyms == null) return;
for (int i = 0; i < synonyms.length; i++) {
    Token synToken = new Token(synonyms[i],
                               token.startOffset(),
                               token.endOffset(),
                               TOKEN_TYPE_SYNONYM);
    synToken.setPositionIncrement(0);
    // In the book's implementation the synonym token is buffered (pushed onto
    // a stack) and emitted by subsequent calls to next().
    synonymStack.push(synToken);
}
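
The engine here is just a lookup abstraction. The book defines a small SynonymEngine interface (assumed signature below); a throwaway, map-backed implementation is enough to reproduce the tests above:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// The engine consulted by SynonymFilter; a single-method lookup interface.
interface SynonymEngine {
    String[] getSynonyms(String word) throws IOException;
}

// A hard-coded engine, good enough for experiments; a real one would consult
// a thesaurus such as WordNet.
class MockSynonymEngine implements SynonymEngine {
    private static final Map SYNONYMS = new HashMap();
    static {
        SYNONYMS.put("jumps", new String[] {"hops", "leaps"});
        SYNONYMS.put("quick", new String[] {"fast", "speedy"});
    }
    public String[] getSynonyms(String word) {
        return (String[]) SYNONYMS.get(word);
    }
}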
The following example shows the effect of synonymAnalyzer:
IndexWriter writer = new IndexWriter(directory, synonymAnalyzer, true);
Document doc = new Document();
doc.add(Field.Text("content", "The quick brown fox jumps over the lazy dogs"));
writer.addDocument(doc);
writer.close();
Search for a synonym (jumps and hops are synonyms):
TermQuery tq = new TermQuery(new Term("content", "hops"));
Search for a phrase:
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("content", "fox"));
pq.add(new Term("content", "hops"));
Both match.

Again, if you are unlucky enough to use QueryParser, strange things can happen... Sigh, avoid QueryParser when you can:
public void testWithQueryParser() throws Exception {
    Query query = QueryParser.parse("\"fox jumps\"", "content", synonymAnalyzer);
    Hits hits = searcher.search(query);
    assertEquals("!!!! what?!", 0, hits.length());   // not found

    query = QueryParser.parse("\"fox jumps\"", "content", new StandardAnalyzer());
    hits = searcher.search(query);
    assertEquals("*whew*", 1, hits.length());        // found
}
Why does QueryParser, using the same synonymAnalyzer as indexing, fail to find the right document...?
Print the query and you will see why:
Query query = QueryParser.parse("\"fox jumps\"", "content", synonymAnalyzer);
System.out.println("\"fox jumps\" parses to " + query.toString("content"));
"fox jumps" parses to "fox jumps hops leaps"
it glues all terms from analysis together to form a PhraseQuery and ignores tokenposition increment information.
Ah, so as long as you use synonymAnalyzer it keeps inserting synonyms; that is why you must not use synonymAnalyzer for parsing the query here.
Is this another issue? The position increment is preserved for both tokens and terms, so why does QueryParser crudely glue all the words into one phrase here...?

You have another option with synonyms: expanding them into each query rather than indexing.
In other words, there are two approaches to synonyms. One is what we did above: inject the synonyms at index time. The other is to do nothing special at index time and instead expand each query with synonyms, so documents containing any of them are also found.
A somewhat clumsy way is to use PhrasePrefixQuery (see section 5.2).
Or it can be created through an overridden QueryParser.getFieldQuery method.
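
A very rough sketch of the query-time idea (a hypothetical helper, reusing the SynonymEngine sketched earlier and the old BooleanQuery.add(query, required, prohibited) signature of this Lucene version): expand a single word into an OR of its synonyms instead of touching the index.

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryTimeSynonyms {
    // Builds e.g. (content:jumps OR content:hops OR content:leaps) from one word.
    public static Query expand(String field, String word, SynonymEngine engine)
            throws IOException {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term(field, word)), false, false);  // optional clause
        String[] synonyms = engine.getSynonyms(word);
        if (synonyms != null) {
            for (int i = 0; i < synonyms.length; i++) {
                query.add(new TermQuery(new Term(field, synonyms[i])), false, false);
            }
        }
        return query;
    }
}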

VII. Stemming analysis
The goal is to reduce the various forms of a word to a common root form.
The book introduces an analyzer called PositionalPorterStopAnalyzer: it removes stop words, leaving positional holes where the words were removed, and also applies a stemming filter.
It is based on the Porter stemming algorithm, invented by Dr. Martin Porter; there are of course other stemming algorithms, such as the Snowball algorithm and KStem.
So this analyzer not only reduces derived words to a basic form but also removes stop words, leaving the positions where the stop words used to be empty, as holes.
For example, "The quick brown fox jumps over the lazy dogs" analyzes to:
2: [quick]
3: [brown]
4: [fox]
5: [jump]
6: [over]
8: [lazi]
9: [dog]
Positions 1 and 7 are empty.
This is implemented with a PositionalStopFilter:
public class PositionalStopFilter extends TokenFilter {
    private Set stopWords;
    public PositionalStopFilter(TokenStream in, Set stopWords) {
        super(in);
        this.stopWords = stopWords;
    }
    public final Token next() throws IOException {
        int increment = 0;
        for (Token token = input.next(); token != null; token = input.next()) {
            if (!stopWords.contains(token.termText())) {
                token.setPositionIncrement(token.getPositionIncrement() + increment);
                return token;
            }
            increment++;
        }
        return null;
    }
}
As you can see, the position increment is simply bumped by however many stop words were skipped.
And PositionalPorterStopAnalyzer is implemented as:
public TokenStream tokenStream(String fieldName, Reader reader) {
    return new PorterStemFilter(
        new PositionalStopFilter(new LowerCaseTokenizer(reader),stopWords)
    );
}
That is, a PorterStemFilter wrapped around the PositionalStopFilter.

...Misfortune strikes again: using PhraseQuery together with QueryParser causes problems. It seems every analyzer introduction ends with some issue...
IndexWriter writer = new IndexWriter(directory, porterAnalyzer, true);
Document doc = new Document();
doc.add(Field.Text("contents", "The quick brown fox jumps over the lazy dogs"));
writer.addDocument(doc);
writer.close();

Query query = QueryParser.parse("\"over the lazy\"", "contents", porterAnalyzer);
The result: nothing is found. The phrase is clearly there, so why can't it be found?
The difficulty lies deeper inside PhraseQuery and its current inability to deal with positional gaps.
In other words, PhraseQuery cannot handle positional gaps, so "the" is simply dropped, the phrase becomes "over lazy", and nothing matches.
Unless you set the slop, i.e., the allowed gap, by hand:
QueryParser parser = new QueryParser("contents", porterAnalyzer);
parser.setPhraseSlop(1);
Query query = parser.parse("\"over the lazy\"");
Now it is found. Sigh, what a tragedy...
So you can only search for inexact phrases like "over lazy". With stop-word removal in analysis, doing exact phrase matches is, by definition, not possible: the words removed aren't there, so you can't know what they were.
Right, the stop words are gone, so you don't even know what slop to set.
Apart from this small problem, the analyzer works well; for example, searching for "laziness" or the phrase "fox jumped" both find the document correctly.

VIII. Language analysis issues
I18N is always a headache for developers, especially in web development, where you run into everything...
As far as Lucene is concerned, Lucene stores all characters in the standard UTF-8 encoding.
So as a developer you must supply the correct encoding information for the documents you index, so Lucene can convert them to UTF-8 correctly; otherwise indexing either fails or produces garbage.
Beyond the built-in analyzers we’ve discussed, the core Lucene distribution provides two language-specific analyzers: GermanAnalyzer and RussianAnalyzer.
For Asian languages, this is the key point.
The only built-in analyzer capable of doing anything useful with Asian text is the StandardAnalyzer.
However, two analyzers in the Lucene Sandbox are suitable for Asian language analysis, CJKAnalyzer and ChineseAnalyzer.
The terms produced for "道德经" by different analyzers:
SimpleAnalyzer   [道德经]
StandardAnalyzer [道][德][经]
CJKAnalyzer      [道德][德经]
ChineseAnalyzer  [道][德][经]

The CJKAnalyzer pairs characters in overlapping windows of two characters each. Many CJK words are two characters. By pairing characters in this manner, words are likely to be kept together (as well as disconnected characters, increasing the index size).

IX. Nutch analysis
Finally, a brief look at Nutch's analysis. As noted at the start, Google does not throw stop words away at index time, because they are needed for exact phrase queries. But how should the huge volume of stop words be handled at index time?
How Google does it is unknown, but the open-source Nutch offers its own solution.
Nutch combines an index-time analysis bigram (grouping two consecutive words as a single token) technique with a query-time optimization of phrases.
For example, for "The quick brown…", a phrase query can use the bigram "the-quick" instead of the bare "the", because such bigrams occur far less often than the individual stop word.
1: [the:<WORD>] [the-quick:gram]
2: [quick:<WORD>]
3: [brown:<WORD>]
4: [fox:<WORD>]
This is what the indexed tokens look like: the bigram and the stop word occupy the same position. A simplified sketch of such a filter follows.
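
A simplified, hypothetical sketch of the index-time half of this idea (not Nutch's actual code; it assumes a lowercasing tokenizer upstream and the old Token API): when a stop word is followed by another word, also emit a "the-quick"-style gram token at the same position as the stop word.

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class StopBigramFilter extends TokenFilter {
    public static final String GRAM_TYPE = "gram";
    private final Set stopWords;
    private Token buffered;   // token read ahead but not yet returned
    private Token gram;       // pending bigram token

    public StopBigramFilter(TokenStream in, Set stopWords) {
        super(in);
        this.stopWords = stopWords;
    }

    public Token next() throws IOException {
        if (gram != null) {               // emit the pending bigram right after the stop word
            Token t = gram;
            gram = null;
            return t;
        }
        Token token = (buffered != null) ? buffered : input.next();
        buffered = null;
        if (token == null) return null;
        if (stopWords.contains(token.termText())) {
            buffered = input.next();      // peek at the following word
            if (buffered != null) {
                gram = new Token(token.termText() + "-" + buffered.termText(),
                                 token.startOffset(), buffered.endOffset(), GRAM_TYPE);
                gram.setPositionIncrement(0);   // same position as the stop word
            }
        }
        return token;
    }
}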
