Chinese Analyzers (Word Segmenters) for Lucene

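Lucene has no built-in Chinese word segmentation. That is hardly surprising: it is an open-source framework written abroad, so the needs of other languages were not its first concern. It does, however, support plugging in Chinese analyzers as extensions.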

Here is how several Chinese analyzers are described online:

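            paoding: "Paoding Analysis" (庖丁解牛), a Chinese analyzer for Lucene
            imdict: the intelligent Chinese segmenter used by the imdict smart dictionary
            mmseg4j: a Chinese analyzer that implements Chih-Hao Tsai's MMSeg algorithm
            IKAnalyzer: uses its own "forward-iteration finest-grained segmentation algorithm" with a multi-sub-processor analysis mode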
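To make the difference between "maximum matching" and "finest-grained" segmentation concrete, here is a toy sketch in plain Java. It is not the code of mmseg4j or IKAnalyzer: the class name `ToySegmenter`, the four-word dictionary, and the `maxLen` parameter are all made up for illustration. The real analyzers add large dictionaries, ambiguity rules, and handling for numbers and Latin text. `maxMatch` is simple forward maximum matching, the baseline that MMSeg refines, and `finestGrained` emits every dictionary word it finds while scanning forward.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy dictionary-based segmenters (illustration only, not a real analyzer). */
public class ToySegmenter {

    /** Forward maximum matching: at each position take the longest dictionary word. */
    public static List<String> maxMatch(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the window until it is a dictionary word or a single character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    /** Finest-grained: emit every dictionary word that starts at each position. */
    public static List<String> finestGrained(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            int limit = Math.min(text.length(), i + maxLen);
            for (int end = i + 1; end <= limit; end++) {
                String candidate = text.substring(i, end);
                if (dict.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical mini dictionary; real analyzers ship far larger ones.
        Set<String> dict = new HashSet<>(Arrays.asList("中文", "分词", "分词器", "中文分词"));
        String text = "中文分词器";
        System.out.println(maxMatch(text, dict, 4));      // [中文分词, 器]
        System.out.println(finestGrained(text, dict, 4)); // [中文, 中文分词, 分词, 分词器]
    }
}
```

The finest-grained output contains overlapping words, which tends to help recall at search time because a query for either 分词 or 分词器 can hit the same text. Maximum matching yields fewer, longer tokens.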

http://linliangyi2007.iteye.com/blog/501228 (blog of the IKAnalyzer author)

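Of these four analyzers, the first is no longer updated, and IKAnalyzer is probably the only one that supports Lucene 3.x. The latest IKAnalyzer release, 3.2.8, targets Lucene 3.x and is not compatible with other Lucene versions. I recommend using the latest IKAnalyzer.

Its author is linliangyi, and I thank him for keeping this Chinese analyzer updated for so long. Without his releases we might have had to write our own analyzer or patch the others. My thanks also go to everyone who provides open-source frameworks and tools. From what I have read, he is a search architect from Sohu who looks young and energetic. I hope he keeps IKAnalyzer updated, because many people working on vertical search or e-commerce still need his analyzer.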

The following uses the IK analyzer to build an index.

Creating a Chinese index:

    // Imports required by this method (Lucene 3.x and IK Analyzer 3.x)
    import java.io.File;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.Date;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    /**
     * Builds a Lucene index over the A_GoodsInfo table using the IK Chinese analyzer.
     * DBUtil is the project's own JDBC helper and is not shown here.
     */
    public void createDBTableIndexByIK() {
        // Directory that holds the Lucene index files
        String indexPath = "D:\\luceneIndex";
        String sql = "select GoodsId,StyleNo,GoodsName,IsNew,IsOffline,SellTotal,CreateDate,SalePrice,ActPrice from A_GoodsInfo";
        Connection conn = null;
        IndexWriter indexWriter = null;
        try {
            conn = DBUtil.getConnection();
            PreparedStatement pstmt = conn.prepareStatement(sql);
            // Run the query and get the result set
            ResultSet rs = pstmt.executeQuery();
            System.out.println("Connected to the database.");

            // Open the index directory and create a writer backed by IKAnalyzer
            Directory dir = FSDirectory.open(new File(indexPath));
            Analyzer analyzer = new IKAnalyzer();
            indexWriter = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.LIMITED);

            // Build one Lucene document per table row
            while (rs.next()) {
                int id = rs.getInt("GoodsId");
                String goodsName = rs.getString("GoodsName");
                String styleNo = rs.getString("StyleNo");
                int isNew = rs.getInt("IsNew");
                int sellTotal = rs.getInt("SellTotal");
                Date createDate = rs.getDate("CreateDate");
                float salePrice = rs.getFloat("SalePrice");
                float actPrice = rs.getFloat("ActPrice");

                // Use ActPrice when it is non-zero, otherwise fall back to SalePrice
                float price = (actPrice != 0.0f) ? actPrice : salePrice;

                // Stored flag values: "是" (yes) for new goods, "否" (no) otherwise
                String flag = (isNew == 1) ? "是" : "否";

                // The document that will be added to the index
                Document document = new Document();

                // Text fields are analyzed by IKAnalyzer; appending "" keeps a null
                // column value from throwing in the Field constructor
                document.add(new Field("id", id + "", Field.Store.YES, Field.Index.ANALYZED));
                document.add(new Field("IsNew", flag, Field.Store.YES, Field.Index.ANALYZED));
                document.add(new Field("GoodsName", goodsName + "", Field.Store.YES, Field.Index.ANALYZED));
                document.add(new Field("StyleNo", styleNo + "", Field.Store.YES, Field.Index.ANALYZED));
                // Numeric fields support range queries and numeric sorting
                document.add(new NumericField("sellTotal", Field.Store.YES, true).setIntValue(sellTotal));
                // Creation time is stored as epoch milliseconds and indexed as a single term
                document.add(new Field("time", createDate.getTime() + "", Field.Store.YES, Field.Index.NOT_ANALYZED));
                document.add(new NumericField("price", Field.Store.YES, true).setFloatValue(price));

                indexWriter.addDocument(document);
                System.out.println("Indexed name: " + goodsName + ", style no.: " + styleNo
                        + ", sales: " + sellTotal + ", created: " + createDate);
                System.out.println("Price: " + price);
            }
            // Merge index segments once all rows have been added
            indexWriter.optimize();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the index writer and the connection even when indexing fails
            try {
                if (indexWriter != null) indexWriter.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
            try {
                if (conn != null) conn.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
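The two small rules buried in the indexing loop (which price to index, and which flag string to store for new goods) can be pulled out into pure functions so they can be checked without a database or a Lucene index. This is only a sketch: the class name `GoodsFieldValues` and both method names are hypothetical, and the rules simply mirror the loop above.

```java
/** Field-value rules from the indexing loop, isolated for testing (hypothetical helper). */
public class GoodsFieldValues {

    /** Returns actPrice when it is non-zero, otherwise falls back to salePrice. */
    public static float effectivePrice(float actPrice, float salePrice) {
        return (actPrice != 0.0f) ? actPrice : salePrice;
    }

    /** Returns the stored flag: "是" (yes) when isNew is 1, otherwise "否" (no). */
    public static String newFlag(int isNew) {
        return (isNew == 1) ? "是" : "否";
    }

    public static void main(String[] args) {
        System.out.println(effectivePrice(0.0f, 9.5f)); // 9.5
        System.out.println(effectivePrice(8.0f, 9.5f)); // 8.0
        System.out.println(newFlag(1));                 // 是
    }
}
```

Because the flag is stored as a Chinese string, a query against the IsNew field has to use the same value ("是" or "否").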

Posted 2013-04-23 15:49 by Dream-Weaver
