huangfox

Three feet of ice does not form in a single cold day!


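When a field holds numeric values, Lucene does not behave perfectly and often produces unexpected "problems".

The analysis below covers three aspects: indexing, sorting, and range search (rangeQuery).

First, some preparation: build the index.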

RAMDirectory dir = new RAMDirectory();

	public void index() {
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
		try {
			IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
					Version.LUCENE_36, analyzer));
			Random random = new Random();
			Fieldable f0 = new Field("f0", "c", Store.YES, Index.NOT_ANALYZED);
			Fieldable f1 = new Field("f1", "", Store.YES, Index.NOT_ANALYZED);
			Fieldable f2 = new Field("f2", "", Store.YES, Index.NOT_ANALYZED);
			Fieldable f3 = new NumericField("f3", Store.YES, true);
			Fieldable f4 = new NumericField("f4", Store.YES, true);
			for (int i = 0; i < 20; i++) {
				int value = random.nextInt(100);
				((Field) f1).setValue(value + "");
				((Field) f2).setValue(value + random.nextFloat() + "");
				((NumericField) f3).setIntValue(value);
				((NumericField) f4).setFloatValue(value + random.nextFloat());
				Document doc = new Document();
				doc.add(f0);
				doc.add(f1);
				doc.add(f2);
				doc.add(f3);
				doc.add(f4);
				writer.addDocument(doc);
			}
			writer.close();
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (LockObtainFailedException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

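There are 5 fields in total:

f0: Field, always holds the constant "c", so that a single TermQuery on it matches every document;

f1: Field, holds the string value of an int;

f2: Field, holds the string value of a float;

f3: NumericField, holds an int;

f4: NumericField, holds a float.

There are 20 documents in total.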

 

Sorting

According to the Lucene API (SortField), the available sort types are:

Field Summary
static int        BYTE         Sort using term values as encoded Bytes.
static int        CUSTOM       Sort using a custom Comparator.
static int        DOC          Sort by document number (index order).
static int        DOUBLE       Sort using term values as encoded Doubles.
static SortField  FIELD_DOC    Represents sorting by document number (index order).
static SortField  FIELD_SCORE  Represents sorting by document score (relevance).
static int        FLOAT        Sort using term values as encoded Floats.
static int        INT          Sort using term values as encoded Integers.
static int        LONG         Sort using term values as encoded Longs.
static int        SCORE        Sort by document score (relevance).
static int        SHORT        Sort using term values as encoded Shorts.
static int        STRING       Sort using term values as Strings.
static int        STRING_VAL   Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons.

Here we only look at String, int, and float.

public void sort() {
		IndexReader reader;
		try {
			reader = IndexReader.open(dir);
			IndexSearcher searcher = new IndexSearcher(reader);
			TermQuery query = new TermQuery(new Term("f0", "c"));
			// f1: Field holding the string value of an int
			// SortField field = new SortField("f1", SortField.STRING);// problem
			// SortField field = new SortField("f1", SortField.INT);// OK
			// SortField field = new SortField("f1", SortField.FLOAT);// OK

			// f2: Field holding the string value of a float
			// SortField field = new SortField("f2", SortField.STRING);// problem
			// SortField field = new SortField("f2", SortField.INT);// problem
			// SortField field = new SortField("f2", SortField.FLOAT);// OK

			// f3: NumericField holding an int
			// SortField field = new SortField("f3", SortField.STRING);// problem
			// SortField field = new SortField("f3", SortField.INT);// OK
			// SortField field = new SortField("f3", SortField.FLOAT);// OK

			// f4: NumericField holding a float
			// SortField field = new SortField("f4", SortField.STRING);// OK
			// SortField field = new SortField("f4", SortField.INT);// OK
			SortField field = new SortField("f4", SortField.FLOAT);// OK
			Sort sort = new Sort(field);
			TopFieldDocs docs = searcher.search(query, 20, sort);
			ScoreDoc[] sds = docs.scoreDocs;
			for (ScoreDoc sd : sds) {
				Document doc = reader.document(sd.doc);
				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
						+ doc.get("f3") + "\t" + doc.get("f4"));
			}
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

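The tests above show the following.

If you index with the Field class, sorting works as long as you specify the "correct" data type. Sorting as String definitely does not work, and if the indexed text is the string value of a float, sorting with SortField.INT also fails, with the following exception: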

java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)

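The exception shows that when sorting, Lucene first converts the String into the specified numeric type; if the specified type is wrong (for example, converting 1.2 to an int), an exception is thrown. The message mentions prefixCoded because, as far as I can tell, once the plain parse fails the field cache falls back to decoding the term as a NumericField-encoded int, and that fails as well.

If you index with NumericField, sort with the same type you indexed. Working through the other combinations is not worth the trouble.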
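The two failure modes described above can be reproduced without Lucene. The following is only a plain-JDK sketch, not Lucene's implementation: the class and method names (`SortOrderDemo`, `sortAsStrings`, `sortAsInts`, `parsesAsInt`) are made up for illustration. It shows that character-order sorting puts "100" before "25", and that text such as "1.2" cannot be read as an int.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class, not part of Lucene.
final class SortOrderDemo {

    // Sorts by character order, comparable to sorting numeric text as STRING.
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy);
        return copy;
    }

    // Parses each value as an int before comparing, comparable to sorting with a matching numeric type.
    static List<String> sortAsInts(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy, new Comparator<String>() {
            public int compare(String a, String b) {
                return Integer.compare(Integer.parseInt(a), Integer.parseInt(b));
            }
        });
        return copy;
    }

    // Returns true if the text can be read as an int; the text of a float such as "1.2" cannot.
    static boolean parsesAsInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "10", "100", "25");
        System.out.println("as strings: " + sortAsStrings(values));
        System.out.println("as ints:    " + sortAsInts(values));
        System.out.println("\"1.2\" parses as int: " + parsesAsInt("1.2"));
    }
}
```

The string sort is internally consistent, it is just not the numeric order a reader expects, which is why the STRING variants look "wrong" in the output.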

 

Range search

public void rangeSearch() {
		IndexReader reader;
		try {
			reader = IndexReader.open(dir);
			IndexSearcher searcher = new IndexSearcher(reader);
			Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
			// Query query = new TermRangeQuery("f1", "30", "60", true,
			// true);// problem
			// Query query = NumericRangeQuery.newIntRange("f3", 30, 60,
			// true, true);// OK
			// Query query = new TermRangeQuery("f2", "30", "60", true,
			// true);// problem
			Query query = NumericRangeQuery.newFloatRange("f4", 30f, 60f, true,
					true);// OK
			TopDocs docs = searcher.search(query, 20);
			ScoreDoc[] sds = docs.scoreDocs;
			for (ScoreDoc sd : sds) {
				Document doc = reader.document(sd.doc);
				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
						+ doc.get("f3") + "\t" + doc.get("f4"));
			}
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
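The TermRangeQuery variants in the method above compare terms as text, which is presumably why they return unexpected documents for f1 and f2. The sketch below imitates that comparison with String.compareTo in plain Java; it is an illustration under that assumption, not Lucene code, and `TermRangeDemo`, `inStringRange`, and `inIntRange` are hypothetical names.

```java
// Hypothetical demo class, not part of Lucene.
final class TermRangeDemo {

    // Inclusive range check on the raw text, in the manner of a range over plain string terms.
    static boolean inStringRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Inclusive range check on the numeric value, in the manner of a numeric range.
    static boolean inIntRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        String[] terms = { "5", "6", "45" };
        for (String term : terms) {
            System.out.println(term
                    + "\tstring range [30, 60]: " + inStringRange(term, "30", "60")
                    + "\tint range [30, 60]: " + inIntRange(Integer.parseInt(term), 30, 60));
        }
    }
}
```

Here "5" and "6" fall inside the textual range "30" to "60" even though the numbers 5 and 6 are outside 30 to 60, while "45" is inside both.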

 

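At search time we usually use QueryParser, but QueryParser's range queries do not support numeric fields: Lucene does not record which fields are numeric, so QueryParser gives them no special handling when parsing.

In that case we can create a subclass of QueryParser, for example: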

public class NumericQueryParser extends QueryParser {

	public NumericQueryParser(Version matchVersion, String field, Analyzer a) {
		super(matchVersion, field, a);
	}

	@Override
	protected org.apache.lucene.search.Query getRangeQuery(String field,
			String part1, String part2, boolean inclusive)
			throws ParseException {
		TermRangeQuery query = (TermRangeQuery) super.getRangeQuery(field,
				part1, part2, inclusive);
		if ("f3".equals(field)) {
			return NumericRangeQuery.newIntRange(field,
					Integer.parseInt(query.getLowerTerm()),
					Integer.parseInt(query.getUpperTerm()),
					query.includesLower(), query.includesUpper());
		} else {
			return query;
		}
	}

}
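The subclass above hard-codes the field name "f3". Since Lucene does not record which fields are numeric, the application has to keep that mapping itself. One possible way, sketched here in plain Java under that assumption, is a small lookup table that a parser subclass could consult before building a numeric range. `NumericFieldRegistry`, `NumericType`, `register`, and `parseBound` are hypothetical names, not Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: remembers which fields were indexed as numbers.
final class NumericFieldRegistry {

    enum NumericType { INT, FLOAT }

    private final Map<String, NumericType> types = new HashMap<String, NumericType>();

    // Records that the given field was indexed as a numeric field of the given type.
    void register(String field, NumericType type) {
        types.put(field, type);
    }

    // Parses one range bound with the registered type.
    // Returns null for unregistered fields, so the caller can keep the plain text range.
    Number parseBound(String field, String text) {
        NumericType type = types.get(field);
        if (type == null) {
            return null;
        }
        switch (type) {
            case INT:
                return Integer.valueOf(Integer.parseInt(text));
            case FLOAT:
                return Float.valueOf(Float.parseFloat(text));
            default:
                throw new IllegalStateException("unhandled type " + type);
        }
    }

    public static void main(String[] args) {
        NumericFieldRegistry registry = new NumericFieldRegistry();
        registry.register("f3", NumericType.INT);
        registry.register("f4", NumericType.FLOAT);
        System.out.println(registry.parseBound("f3", "30"));
        System.out.println(registry.parseBound("f4", "30.5"));
        System.out.println(registry.parseBound("f1", "30"));
    }
}
```

With such a table, the `"f3".equals(field)` test in getRangeQuery could become a lookup, and a float field such as f4 could be routed to a float range in the same way.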

  

Using it for a range search:

public void rangeSearch() {
		IndexReader reader;
		try {
			reader = IndexReader.open(dir);
			IndexSearcher searcher = new IndexSearcher(reader);
			Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
			// QueryParser parser = new QueryParser(Version.LUCENE_36, "f0",
			// analyzer);// problem
			NumericQueryParser parser = new NumericQueryParser(
					Version.LUCENE_36, "f0", analyzer);
			Query query = parser.parse("f3:[30 TO 60]");
			TopDocs docs = searcher.search(query, 20);
			ScoreDoc[] sds = docs.scoreDocs;
			for (ScoreDoc sd : sds) {
				Document doc = reader.document(sd.doc);
				System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
						+ doc.get("f3") + "\t" + doc.get("f4"));
			}
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} catch (ParseException e) {
			e.printStackTrace();
		}
	}

  

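Notes to self:

1. Don't over-think some problems at the surface level. Take the sorting above: if an int was indexed, sorting as int certainly works, so there is no need to also try String or the other numeric types. It doesn't mean much!

2. To really think these problems through, start from the essence: start from the source code!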

 

 

posted on 2012-08-10 09:15  huangfox