Lucene学习总结之九:Lucene的查询对象(2)
5、SpanQuery
所谓SpanQuery也即在查询过程中需要考虑进Term的位置信息的查询对象。
SpanQuery中最基本的是SpanTermQuery,其只包含一个Term,与TermQuery所不同的是,其提供一个函数来得到位置信息:
public Spans getSpans(final IndexReader reader) throws IOException { return new TermSpans(reader.termPositions(term), term); } |
Spans有以下方法:
- next() 得到下一篇文档号,不同的SpanQuery此方法实现不同
- skipTo(int) 跳到指定的文档
- doc() 得到当前的文档号
- start() 得到起始位置,不同的SpanQuery此方法实现不同
- end() 得到结束位置,不同的SpanQuery此方法实现不同
- isPayloadAvailable() 是否有payload
- getPayload() 得到payload
SpanScorer的nextDoc函数如下:
public int nextDoc() throws IOException { if (!setFreqCurrentDoc()) { doc = NO_MORE_DOCS; } return doc; } |
protected boolean setFreqCurrentDoc() throws IOException { if (!more) { return false; } doc = spans.doc(); freq = 0.0f; do { //根据结束位置和起始位置来计算freq从而影响打分 int matchLength = spans.end() - spans.start(); freq += getSimilarity().sloppyFreq(matchLength); more = spans.next(); } while (more && (doc == spans.doc())); return true; } |
5.1、SpanFirstQuery
SpanFirstQuery仅取在开头部分包含查询词的文档,其包含如下成员变量:
- SpanQuery match; 需要满足的查询
- int end; 如何定义开头
其getSpans函数如下:
public Spans getSpans(final IndexReader reader) throws IOException { return new Spans() { private Spans spans = match.getSpans(reader); @Override public boolean next() throws IOException { while (spans.next()) { //仅查询词的位置在设定的end之前的文档才返回。 if (end() <= end) return true; } return false; } @Override public boolean skipTo(int target) throws IOException { if (!spans.skipTo(target)) return false; return spans.end() <= end || next(); } @Override public int doc() { return spans.doc(); } @Override public int start() { return spans.start(); } @Override public int end() { return spans.end(); } }; } |
5.2、SpanNearQuery
SpanNearQuery包含以下成员变量:
- List<SpanQuery> clauses; 一个列表的子SpanQuery
- int slop; 设定这些字SpanQuery之间的距离的最大值,大于此值则文档不返回。
- boolean inOrder; 是否按顺序计算子SpanQuery之间的距离
- String field; 域
- boolean collectPayloads; 是否收集payload
其getSpans函数如下:
public Spans getSpans(final IndexReader reader) throws IOException { if (clauses.size() == 0) return new SpanOrQuery(getClauses()).getSpans(reader); if (clauses.size() == 1) return clauses.get(0).getSpans(reader); return inOrder ? (Spans) new NearSpansOrdered(this, reader, collectPayloads) : (Spans) new NearSpansUnordered(this, reader); } |
是否inorder,举例如下:
假设索引了文档"apple boy cat",如果将SpanNearQuery的clauses依次设为"apple","cat","boy",如果inorder=true,则文档不会被搜索出来,即便slop设为很大,如果inorder=false,则文档会被搜出来,而且slop设为0就能被搜出来。
因为在NearSpansOrdered的next函数如下:
public boolean next() throws IOException { if (firstTime) { firstTime = false; for (int i = 0; i < subSpans.length; i++) { //每个子SpanQuery都取第一篇文档 if (! subSpans[i].next()) { more = false; return false; } } more = true; } if(collectPayloads) { matchPayload.clear(); } return advanceAfterOrdered(); } |
private boolean advanceAfterOrdered() throws IOException { //如果各子SpanQuery指向同一文档 while (more && (inSameDoc || toSameDoc())) { //stretchToOrder要保证各子SpanQuery一定是按照顺序排列的 //shrinkToAfterShortestMatch保证各子SpanQuery之间的距离不大于slop if (stretchToOrder() && shrinkToAfterShortestMatch()) { return true; } } return false; } |
private boolean stretchToOrder() throws IOException { matchDoc = subSpans[0].doc(); for (int i = 1; inSameDoc && (i < subSpans.length); i++) { //docSpansOrdered要保证第i-1个子SpanQuery的start和end都应在第i个之前,否则取下一篇文档。 while (! docSpansOrdered(subSpans[i-1], subSpans[i])) { if (! subSpans[i].next()) { inSameDoc = false; more = false; break; } else if (matchDoc != subSpans[i].doc()) { inSameDoc = false; break; } } } return inSameDoc; } |
static final boolean docSpansOrdered(Spans spans1, Spans spans2) { assert spans1.doc() == spans2.doc() : "doc1 " + spans1.doc() + " != doc2 " + spans2.doc(); int start1 = spans1.start(); int start2 = spans2.start(); return (start1 == start2) ? (spans1.end() < spans2.end()) : (start1 < start2); } |
private boolean shrinkToAfterShortestMatch() throws IOException { //从最后一个子SpanQuery开始 matchStart = subSpans[subSpans.length - 1].start(); matchEnd = subSpans[subSpans.length - 1].end(); int matchSlop = 0; int lastStart = matchStart; int lastEnd = matchEnd; for (int i = subSpans.length - 2; i >= 0; i—) { //不断的取前一个子SpanQuery Spans prevSpans = subSpans[i]; int prevStart = prevSpans.start(); int prevEnd = prevSpans.end(); while (true) { if (! prevSpans.next()) { inSameDoc = false; more = false; break; } else if (matchDoc != prevSpans.doc()) { inSameDoc = false; break; } else { int ppStart = prevSpans.start(); int ppEnd = prevSpans.end(); if (! docSpansOrdered(ppStart, ppEnd, lastStart, lastEnd)) { break; } else { prevStart = ppStart; prevEnd = ppEnd; } } } assert prevStart <= matchStart; if (matchStart > prevEnd) { //总是从下一个的开始位置,减去前一个的结束位置,所以上面的例子中,如果将SpanNearQuery的clauses依次设为"apple","boy","cat",inorder=true, slop=0,是能够搜索的出的。 matchSlop += (matchStart - prevEnd); } matchStart = prevStart; lastStart = prevStart; lastEnd = prevEnd; } boolean match = matchSlop <= allowedSlop; return match; } |
NearSpansUnordered的next函数如下:
public boolean next() throws IOException { if (firstTime) { //将一个Spans生成一个SpansCell,既放入链表中,也放入优先级队列中,在队列中按照第一篇文档号由小到大排列,若文档号相同,则按照位置顺序排列。 initList(true); listToQueue(); firstTime = false; } else if (more) { if (min().next()) { //最上面的取下一篇文档,并调整队列。 queue.updateTop(); } else { more = false; } } while (more) { boolean queueStale = false; if (min().doc() != max.doc()) { //如果队列中最小的文档号和最大的文档号不相同,将队列生成链表。 queueToList(); queueStale = true; } //应该不断的skip每个子SpanQuery直到最小的文档号和最大的文档号相同,不同的是在文档中的位置。 while (more && first.doc() < last.doc()) { more = first.skipTo(last.doc()); firstToLast(); queueStale = true; } if (!more) return false; //调整完毕后,将链表写回队列。 if (queueStale) { listToQueue(); queueStale = false; } //判断是否匹配 if (atMatch()) { return true; } more = min().next(); if (more) { queue.updateTop(); } } return false; } |
private boolean atMatch() { //匹配有两个条件,一个是最小和最大的文档号相同,一个是最大的结束位置减去最小的开始位置再减去最大和最小的自身的长度之和小于等于slop。 //在上面的例子中,如果将SpanNearQuery的clauses依次设为"cat","apple",inorder=false,则slop设为1可以搜索的出来。因为"cat".end = 3, "apple".start=0, totalLength = ("cat".end – "cat".start) + ("apple".end – "apple.start") = 2,所以slop=1即可。 return (min().doc() == max.doc()) && ((max.end() - min().start() - totalLength) <= slop); } |
5.3、SpanNotQuery
SpanNotQuery包含如下两个成员变量:
- SpanQuery include; 必须满足的SpanQuery
- SpanQuery exclude; 必须不能满足的SpanQuery
其next函数从include中取出文档号,如果exclude也包括此文档号,则过滤掉。
其getSpans函数如下:
public Spans getSpans(final IndexReader reader) throws IOException { return new Spans() { private Spans includeSpans = include.getSpans(reader); private boolean moreInclude = true; private Spans excludeSpans = exclude.getSpans(reader); private boolean moreExclude = excludeSpans.next(); @Override public boolean next() throws IOException { //得到下一个include的文档号 if (moreInclude) moreInclude = includeSpans.next(); //此循环查看此文档号是否被exclude,如果是则取下一个include的文档号。 while (moreInclude && moreExclude) { //将exclude跳到include文档号 if (includeSpans.doc() > excludeSpans.doc()) moreExclude = excludeSpans.skipTo(includeSpans.doc()); //当include和exclude文档号相同的时候,不断取得下一个exclude,如果exclude的end大于include的start,则说明当前文档号应该被exclude。 while (moreExclude && includeSpans.doc() == excludeSpans.doc() && excludeSpans.end() <= includeSpans.start()) { moreExclude = excludeSpans.next(); } //如果是因为没有exclude了,或者文档号不相同,或者include的end小于exclude的start,则当前文档不应该被exclude。 if (!moreExclude || includeSpans.doc() != excludeSpans.doc() || includeSpans.end() <= excludeSpans.start()) break; //否则此文档应该被exclude,include取下一篇文档号。 moreInclude = includeSpans.next(); } return moreInclude; } @Override public int doc() { return includeSpans.doc(); } @Override public int start() { return includeSpans.start(); } @Override public int end() { return includeSpans.end(); } }; } |
5.4、SpanOrQuery
SpanOrQuery包含一个列表的子SpanQuery,并对它们取OR的关系,用于满足"apple和boy临近或者cat和dog临近的文档"此类的查询。
其OR的合并算法同BooleanQuery的OR关系的算法DisjunctionSumScorer类似。
public boolean next() throws IOException { if (queue == null) { return initSpanQueue(-1); } if (queue.size() == 0) { return false; } //在优先级队列顶部取下一篇文档或者下一位置,并重新排列队列 if (top().next()) { queue.updateTop(); return true; } //如果最顶部的SpanQuery没有下一篇文档或者下一位置,则弹出 queue.pop(); return queue.size() != 0; } |
5.5、FieldMaskingSpanQuery
在SpanNearQuery中,需要进行位置比较,相互比较位置的Term必须要在同一个域中,否则报异常IllegalArgumentException("Clauses must have same field.").
然而有时候我们需要对不同的域中的位置进行比较,例如:
文档一:
teacherid: 1 studentfirstname: james studentsurname: jones |
我们建索引如下:
Document doc = new Document(); doc.add(new Field("teacherid", "1", Field.Store.YES, Field.Index.NOT_ANALYZED)); doc.add(new Field("studentfirstname", "james", Field.Store.YES, Field.Index.NOT_ANALYZED)); doc.add(new Field("studentsurname", "jones", Field.Store.YES, Field.Index.NOT_ANALYZED)); writer.addDocument(doc); |
文档二:
teacherid: 2 studenfirstname: james studentsurname: smith studentfirstname: sally studentsurname: jones |
我们建索引如下:
doc = new Document(); doc.add(new Field("teacherid", "2", Field.Store.YES, Field.Index.NOT_ANALYZED)); doc.add(new Field("studentfirstname", "james", Field.Store.YES, Field.Index.NOT_ANALYZED)); doc.add(new Field("studentsurname", "smith", Field.Store.YES, Field.Index.NOT_ANALYZED)); doc.add(new Field("studentfirstname", "sally", Field.Store.YES, Field.Index.NOT_ANALYZED)); doc.add(new Field("studentsurname", "jones", Field.Store.YES, Field.Index.NOT_ANALYZED)); writer.addDocument(doc); |
现在我们想找firstname是james,surname是jones的学生的老师,显然如果搜索"studenfirstname: james AND studentsurname: jones",显然上面两个老师都能够搜索出来,可以辨别james和jones属于同一学生的一种方法是位置信息,也即当james和jones处于两个域的同一位置的时候,其属于同一个学生。
这时我们如果声明两个SpanTermQuery:
SpanQuery q1 = new SpanTermQuery(new Term("studentfirstname", "james")); SpanQuery q2 = new SpanTermQuery(new Term("studentsurname", "jones")); |
然后构建SpanNearQuery,子SpanQuery为上述q1, q2,因为在同一位置inorder=false,slop设为-1,因为
"jones".end – "james".start – totallength = 1 – 0 – 2 = -1,这样就能够搜的出来。
然而在构建SpanNearQuery的时候,其构造函数如下:
public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder, boolean collectPayloads) { this.clauses = new ArrayList<SpanQuery>(clauses.length); for (int i = 0; i < clauses.length; i++) { SpanQuery clause = clauses[i]; if (i == 0) { field = clause.getField(); } else if (!clause.getField().equals(field)) { //要求所有的子SpanQuery都属于同一个域 throw new IllegalArgumentException("Clauses must have same field."); } this.clauses.add(clause); } this.collectPayloads = collectPayloads; this.slop = slop; this.inOrder = inOrder; } |
所以我们引入FieldMaskingSpanQuery,SpanQuery q2m = new FieldMaskingSpanQuery(q2, "studentfirstname");
FieldMaskingSpanQuery.getField()得到的是你指定的假的域信息"studentfirstname",从而通过了审核,就可以计算位置信息了。
我们的查询过程如下:
File indexDir = new File("TestFieldMaskingSpanQuery/index"); IndexReader reader = IndexReader.open(FSDirectory.open(indexDir)); IndexSearcher searcher = new IndexSearcher(reader); SpanQuery q1 = new SpanTermQuery(new Term("studentfirstname", "james")); SpanQuery q2 = new SpanTermQuery(new Term("studentsurname", "jones")); SpanQuery q2m = new FieldMaskingSpanQuery(q2, "studentfirstname"); Query query = new SpanNearQuery(new SpanQuery[]{q1, q2m}, -1, false); TopDocs docs = searcher.search(query, 50); for (ScoreDoc doc : docs.scoreDocs) { System.out.println("docid : " + doc.doc + " score : " + doc.score); } |
5.6、PayloadTermQuery及PayloadNearQuery
带Payload前缀的查询对象不会因为payload的存在而使得结果集发生改变,而仅仅改变其评分。
欲使用Payload系列的查询语句:
- 首先在索引阶段,要将payload存入到索引中去:PayloadAttribute..setPayload(new Payload(byte[] b));
- 其次是实现自己的Similarity,并实现其接口float scorePayload(int docId, String fieldName, int start, int end, byte [] payload, int offset, int length),可以指定如何根据读出的二进制payload计算payload的打分。
- 最后在构建PayloadTermQuery及PayloadNearQuery的时候传入PayloadFunction function
PayloadFunction需要实现两个接口:
- float currentScore(int docId, String field, int start, int end, int numPayloadsSeen, float currentScore, float currentPayloadScore)是在上一步用Similarity根据二进制payload计算出payload打分后,此打分作为currentPayloadScore传入,此次计算前的原分数作为currentScore传入,此处可以指定payload如何影响原来的打分。
- float docScore(int docId, String field, int numPayloadsSeen, float payloadScore)当所有的payload都被计算完毕后,如何调整最终的打分。
PayloadFunction有三种实现:
- AveragePayloadFunction,其在currentScore函数中,总是将payload的打分加到原分数中,currentPayloadScore + currentScore,然后在所有的payload都计算完毕后,在docScore函数中,对这些打分取平均值,return numPayloadsSeen > 0 ? (payloadScore / numPayloadsSeen) : 1
- MaxPayloadFunction,其在currentScore函数中,总是取两者的最大值Math.max(currentPayloadScore, currentScore),最后在docScore函数中将最大值返回,return numPayloadsSeen > 0 ? payloadScore : 1
- MinPayloadFunction,其在currentScore函数中,总是取两者的最小值Math.min(currentPayloadScore, currentScore),最后在docScore函数中将最小值返回,return numPayloadsSeen > 0 ? payloadScore : 1
对于PayloadTermQuery来讲,在其生成的PayloadTermSpanScorer中:
- 首先计算出payloadScore
payloadScore = function.currentScore(doc, term.field(), spans.start(), spans.end(), payloadsSeen, payloadScore, similarity.scorePayload(doc, term.field(), spans.start(), spans.end(), payload, 0, positions.getPayloadLength())); |
- 然后在score函数中调用getSpanScore() * getPayloadScore()
protected float getPayloadScore() { return function.docScore(doc, term.field(), payloadsSeen, payloadScore); } |
对于PayloadNearQuery来讲,在其生成的PayloadNearSpanScorer中:
- 首先计算出payloadScore
payloadScore = function.currentScore(doc, fieldName, start, end, payloadsSeen, payloadScore, similarity.scorePayload(doc, fieldName, spans.start(), spans.end(), thePayload, 0, thePayload.length) ); |
- 然后在score函数中
public float score() throws IOException { return super.score() * function.docScore(doc, fieldName, payloadsSeen, payloadScore); } |