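Indexing and Searching POI Data with Lucene
1. Introduction
I previously covered spatial search in "Spatial Search with Solr" (《使用Solr进行空间搜索》), which indexes and queries GIS data with Solr.
Solr and Elasticsearch are both built on Lucene, and both support spatial search. In some scenarios, though, we need to embed Lucene directly into an existing system to provide indexing and search. This article shows how to use Lucene to index POI records that carry latitude and longitude, and how to search them.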
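2. Environment and Data
Lucene version: 5.3.1
POI database: Base_Station test data. Each record mainly consists of an ID, a latitude and longitude, and an address.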
3. Implementation
Basic variable definitions. The address field is tokenized here, using SmartChineseAnalyzer from Lucene's bundled smartcn module.
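    import java.io.IOException;
    import java.nio.file.Paths;

    import com.spatial4j.core.context.SpatialContext;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.spatial.SpatialStrategy;
    import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
    import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
    import org.apache.lucene.spatial.prefix.tree.SpatialPrefixTree;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.SimpleFSDirectory;

    public class PoiIndexService {
        private String indexPath = "D:/IndexPoiData";
        private IndexWriter indexWriter = null;
        private SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(true);
        private IndexSearcher indexSearcher = null;

        // Field names
        private static final String IDFieldName = "id";
        private static final String AddressFieldName = "address";
        private static final String LatFieldName = "lat";
        private static final String LngFieldName = "lng";
        private static final String GeoFieldName = "geoField";

        // Spatial index and search
        private SpatialContext ctx;
        private SpatialStrategy strategy;

        public PoiIndexService() throws IOException {
            init();
        }

        public PoiIndexService(String indexPath) throws IOException {
            this.indexPath = indexPath;
            init();
        }

        protected void init() throws IOException {
            Directory directory = new SimpleFSDirectory(Paths.get(indexPath));
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            indexWriter = new IndexWriter(directory, config);
            // Commit once so that a brand-new, empty index already has a
            // segments file; otherwise opening the reader below fails.
            indexWriter.commit();
            DirectoryReader ireader = DirectoryReader.open(directory);
            indexSearcher = new IndexSearcher(ireader);

            // Typical geospatial context.
            // These can also be constructed from SpatialContextFactory.
            ctx = SpatialContext.GEO;
            int maxLevels = 11; // results in sub-meter precision for geohash
            // This can also be constructed from SpatialPrefixTreeFactory.
            SpatialPrefixTree grid = new GeohashPrefixTree(ctx, maxLevels);
            strategy = new RecursivePrefixTreeStrategy(grid, GeoFieldName);
        }

        // The index and search methods shown in the rest of this section
        // belong inside this class.
    }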
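To see why maxLevels = 11 is described as sub-meter precision, here is a rough back-of-the-envelope sketch. It assumes the standard base-32 geohash scheme (5 bits per character, bits alternating between longitude and latitude, longitude first) and an approximate 111,195 m per degree; I have not checked it against the internals of GeohashPrefixTree, and the class name GeohashCellSize is made up for illustration.

```java
// Sketch: approximate geohash cell size per prefix-tree level.
// Assumes standard geohash encoding; not taken from Lucene's source.
final class GeohashCellSize {
    // Assumed approximation of metres per degree along a great circle.
    static final double METERS_PER_DEGREE = 111_195.0;

    // Each level adds 5 bits; longitude gets the extra bit when the total is odd.
    static int lngBits(int levels) {
        return (5 * levels + 1) / 2;
    }

    static int latBits(int levels) {
        return (5 * levels) / 2;
    }

    // Cell width in degrees of longitude.
    static double cellWidthDegrees(int levels) {
        return 360.0 / Math.pow(2, lngBits(levels));
    }

    // Cell height in degrees of latitude.
    static double cellHeightDegrees(int levels) {
        return 180.0 / Math.pow(2, latBits(levels));
    }

    public static void main(String[] args) {
        for (int levels : new int[] {5, 8, 11}) {
            double widthM = cellWidthDegrees(levels) * METERS_PER_DEGREE;
            double heightM = cellHeightDegrees(levels) * METERS_PER_DEGREE;
            System.out.printf("levels=%d width~%.3f m (at the equator), height~%.3f m%n",
                    levels, widthM, heightM);
        }
    }
}
```

Under these assumptions a level-11 cell is well under one meter on each side, while lower levels give coarser cells. Fewer levels should mean fewer indexed terms per point, so the level is a trade-off between precision and index size.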
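Indexing the data:
    // Imports used by this method
    import java.util.ArrayList;
    import java.util.List;
    import com.spatial4j.core.shape.Point;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.DoubleField;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongField;
    import org.apache.lucene.document.TextField;

    public boolean indexPoiDataList(List<PoiData> dataList) {
        try {
            if (dataList != null && !dataList.isEmpty()) {
                List<Document> docs = new ArrayList<>();
                for (PoiData data : dataList) {
                    Document doc = new Document();
                    doc.add(new LongField(IDFieldName, data.getId(), Field.Store.YES));
                    doc.add(new DoubleField(LatFieldName, data.getLat(), Field.Store.YES));
                    doc.add(new DoubleField(LngFieldName, data.getLng(), Field.Store.YES));
                    doc.add(new TextField(AddressFieldName, data.getAddress(), Field.Store.YES));
                    // Spatial4j points take x (longitude) first, then y (latitude).
                    Point point = ctx.makePoint(data.getLng(), data.getLat());
                    for (Field f : strategy.createIndexableFields(point)) {
                        doc.add(f);
                    }
                    docs.add(doc);
                }
                indexWriter.addDocuments(docs);
                indexWriter.commit();
                // Reopen the reader so that later searches see the new documents;
                // the searcher created in init() only sees the index as it was then.
                DirectoryReader oldReader = (DirectoryReader) indexSearcher.getIndexReader();
                DirectoryReader newReader = DirectoryReader.openIfChanged(oldReader);
                if (newReader != null) {
                    indexSearcher = new IndexSearcher(newReader);
                    oldReader.close();
                }
                return true;
            }
            return false;
        } catch (Exception e) {
            log.error(e.toString());
            return false;
        }
    }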
PoiData here is just a plain POJO.
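The article does not show the class itself, so the following is only a guess at its shape, reconstructed from the getters and setters the surrounding code calls (getId, getLat, getLng, getAddress and the matching setters). The field types are assumptions inferred from LongField, DoubleField and the parse calls; the real class in the repository may differ.

```java
// Hypothetical reconstruction of PoiData; inferred from how it is used, not copied from the repo.
class PoiData {
    private long id;        // stored via LongField
    private double lat;     // latitude in degrees
    private double lng;     // longitude in degrees
    private String address; // free-text address, tokenized at index time

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }

    public double getLat() { return lat; }
    public void setLat(double lat) { this.lat = lat; }

    public double getLng() { return lng; }
    public void setLng(double lng) { this.lng = lng; }

    public String getAddress() { return address; }
    public void setAddress(String address) { this.address = address; }
}
```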
Searching within a circle, sorted by distance from nearest to farthest:
    // Imports used by these methods
    import com.spatial4j.core.distance.DistanceUtils;
    import com.spatial4j.core.shape.Shape;
    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.spatial.query.SpatialArgs;
    import org.apache.lucene.spatial.query.SpatialOperation;

    // radius is in kilometres; the circle itself is specified in degrees.
    public List<PoiData> searchPoiInCircle(double lng, double lat, double radius) {
        List<PoiData> results = new ArrayList<>();
        Shape circle = ctx.makeCircle(lng, lat,
                DistanceUtils.dist2Degrees(radius, DistanceUtils.EARTH_MEAN_RADIUS_KM));
        SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, circle);
        Query query = strategy.makeQuery(args);
        Point pt = ctx.makePoint(lng, lat);
        // The multiplier converts the computed distance from degrees to km.
        ValueSource valueSource = strategy.makeDistanceValueSource(pt, DistanceUtils.DEG_TO_KM);
        TopDocs docs = null;
        try {
            // false = ascending distance
            Sort distSort = new Sort(valueSource.getSortField(false)).rewrite(indexSearcher);
            docs = indexSearcher.search(query, 10, distSort); // top 10 hits only
        } catch (IOException e) {
            log.error(e.toString());
        }
        if (docs != null) {
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            printDocs(scoreDocs); // debugging helper, not shown in this article
            results = getPoiDatasFromDoc(scoreDocs);
        }
        return results;
    }

    // Converts search hits back into PoiData objects using the stored fields.
    private List<PoiData> getPoiDatasFromDoc(ScoreDoc[] scoreDocs) {
        List<PoiData> datas = new ArrayList<>();
        if (scoreDocs != null) {
            for (int i = 0; i < scoreDocs.length; i++) {
                try {
                    Document hitDoc = indexSearcher.doc(scoreDocs[i].doc);
                    PoiData data = new PoiData();
                    data.setId(Long.parseLong(hitDoc.get(IDFieldName)));
                    data.setLng(Double.parseDouble(hitDoc.get(LngFieldName)));
                    data.setLat(Double.parseDouble(hitDoc.get(LatFieldName)));
                    data.setAddress(hitDoc.get(AddressFieldName));
                    datas.add(data);
                } catch (IOException e) {
                    log.error(e.toString());
                }
            }
        }
        return datas;
    }
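The circle query takes its radius in degrees while the caller thinks in kilometres, which is easy to get wrong. The arithmetic behind the conversion can be sketched in plain Java as below. This is my own illustration rather than the spatial4j source: the class DistanceConversion is hypothetical, and the radius constant is assumed to match DistanceUtils.EARTH_MEAN_RADIUS_KM.

```java
// Sketch of the km <-> degrees conversion used for the circle query.
// The constant is an assumption meant to mirror spatial4j's EARTH_MEAN_RADIUS_KM.
final class DistanceConversion {
    static final double EARTH_MEAN_RADIUS_KM = 6371.0087714;

    // Arc length in km -> central angle in degrees (what a dist2Degrees-style call computes).
    static double kmToDegrees(double km) {
        return Math.toDegrees(km / EARTH_MEAN_RADIUS_KM);
    }

    // Central angle in degrees -> arc length in km (the role played by a DEG_TO_KM multiplier).
    static double degreesToKm(double degrees) {
        return Math.toRadians(degrees) * EARTH_MEAN_RADIUS_KM;
    }

    public static void main(String[] args) {
        for (double km : new double[] {0.5, 1.0, 5.0}) {
            System.out.printf("%.1f km ~ %.6f degrees%n", km, kmToDegrees(km));
        }
        System.out.printf("1 degree ~ %.3f km%n", degreesToKm(1.0));
    }
}
```

So a 1 km radius is roughly 0.009 degrees, and the distance sort multiplies degrees by about 111.2 to report kilometres.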
Searching within a rectangle:
    public List<PoiData> searchPoiInRectangle(double minLng, double minLat,
                                              double maxLng, double maxLat) {
        List<PoiData> results = new ArrayList<>();
        // The rectangle is defined by its lower-left and upper-right corners.
        Point lowerLeftPoint = ctx.makePoint(minLng, minLat);
        Point upperRightPoint = ctx.makePoint(maxLng, maxLat);
        Shape rect = ctx.makeRectangle(lowerLeftPoint, upperRightPoint);
        SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, rect);
        Query query = strategy.makeQuery(args);
        TopDocs docs = null;
        try {
            docs = indexSearcher.search(query, 10); // top 10 hits only
        } catch (IOException e) {
            log.error(e.toString());
        }
        if (docs != null) {
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            printDocs(scoreDocs); // debugging helper, not shown in this article
            results = getPoiDatasFromDoc(scoreDocs);
        }
        return results;
    }
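If you start from a center point and a distance instead of explicit corners, the four rectangle arguments have to be derived first. One rough way to do that is sketched below. The BoundingBox helper is hypothetical and not part of the original code; it assumes about 111.195 km per degree and ignores the poles and the 180° meridian, so treat it as an approximation only.

```java
// Hypothetical helper: approximate bounding box around a center point.
// Assumes ~111.195 km per degree; not valid near the poles or across the date line.
final class BoundingBox {
    static final double KM_PER_DEGREE = 111.195;

    // Returns {minLng, minLat, maxLng, maxLat} for a box extending radiusKm in each direction.
    static double[] around(double lng, double lat, double radiusKm) {
        double dLat = radiusKm / KM_PER_DEGREE;
        // A degree of longitude shrinks with cos(latitude), so more degrees are needed.
        double dLng = radiusKm / (KM_PER_DEGREE * Math.cos(Math.toRadians(lat)));
        return new double[] {lng - dLng, lat - dLat, lng + dLng, lat + dLat};
    }

    public static void main(String[] args) {
        double[] box = around(118.78, 32.05, 1.0);
        System.out.printf("minLng=%.5f minLat=%.5f maxLng=%.5f maxLat=%.5f%n",
                box[0], box[1], box[2], box[3]);
    }
}
```

The resulting array could then be passed on as searchPoiInRectangle(box[0], box[1], box[2], box[3]).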
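Searching within a given range, combined with address keywords:
    // Imports used by this method
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.util.QueryBuilder;

    // range is in kilometres.
    public List<PoiData> searchPoByRangeAndAddress(double lng, double lat,
                                                   double range, String address) {
        List<PoiData> results = new ArrayList<>();
        SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects,
                ctx.makeCircle(lng, lat,
                        DistanceUtils.dist2Degrees(range, DistanceUtils.EARTH_MEAN_RADIUS_KM)));
        Query geoQuery = strategy.makeQuery(args);

        // Analyze the address text with the same analyzer used at index time.
        QueryBuilder builder = new QueryBuilder(analyzer);
        Query addQuery = builder.createPhraseQuery(AddressFieldName, address);

        BooleanQuery.Builder boolBuilder = new BooleanQuery.Builder();
        // createPhraseQuery returns null when the text yields no tokens.
        if (addQuery != null) {
            // SHOULD: an address match is optional and only affects ranking.
            // Use Occur.MUST instead to require the address to match.
            boolBuilder.add(addQuery, Occur.SHOULD);
        }
        boolBuilder.add(geoQuery, Occur.MUST);
        Query query = boolBuilder.build();

        TopDocs docs = null;
        try {
            docs = indexSearcher.search(query, 10); // top 10 hits only
        } catch (IOException e) {
            log.error(e.toString());
        }
        if (docs != null) {
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            printDocs(scoreDocs); // debugging helper, not shown in this article
            results = getPoiDatasFromDoc(scoreDocs);
        }
        return results;
    }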
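4. About Tokenization
Both the address and the description attributes of a POI need to be tokenized for retrieval and search to work well.
Here is a quick comparison of the output of several analyzers.
Original text: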
这是一个lucene中文分词的例子,你可以直接运行它!Chinese Analyer can analysis english text too.中国农业银行(农行)和建设银行(建行),江苏南京江宁上元大街12号。东南大学是一所985高校。
Tokenization results:
smartcn SmartChineseAnalyzer:
这\是\一个\lucen\中文\分\词\的\例子\你\可以\直接\运行\它\chines\analy\can\analysi\english\text\too\中国\农业\银行\农行\和\建设\银行\建行\江苏\南京\江\宁\上\元\大街\12\号\东南\大学\是\一\所\985\高校\
MMSegAnalyzer ComplexAnalyzer:
这是\一个\lucene\中文\分词\的\例子\你\可以\直接\运行\它\chinese\analyer\can\analysis\english\text\too\中国农业\银行\农行\和\建设银行\建\行\江苏南京\江\宁\上\元\大街\12\号\东南大学\是一\所\985\高校\
IKAnalyzer:
这是\一个\lucene\中文\分词\的\例子\你\可以\直接\运行\它\chinese\analyer\can\analysis\english\text\too.\中国农业银行\农行\和\建设银行\建行\江苏\南京\江宁\上元\大街\12号\东南大学\是\一所\985\高校\
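Comparison of the results:
1) smartcn does not keep some English words intact (for example, "lucene" comes out as "lucen" and "chinese" as "chines"), and it splits some Chinese words into single characters.
2) MMSegAnalyzer segments both English and Chinese correctly, but it is not very accurate on place names such as "江宁" or abbreviations such as "建行". MMSegAnalyzer supports custom dictionaries, which can greatly improve segmentation accuracy.
3) IKAnalyzer segments both English and Chinese correctly and does quite well on Chinese, though it has a few small problems. For example, the word "too" and the period that follows it end up in one token. IKAnalyzer also supports custom dictionaries, but this requires extending some of its source code.
Summary: Lucene's powerful indexing and retrieval capabilities make it possible to add search to data that carries coordinates and needs tokenized text retrieval.
The code is hosted on GitHub: https://github.com/luxiaoxun/Code4Java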