Using Lucene.Net

 

Build the index first, then query it; searches are fast.

Indexing is the slow part, but the time is acceptable: a little over 1 minute for 200 MB of text, and a little over 4 hours for 40 GB.

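At first I used version 2.9 and chose to store the text in the index as well. The index took up more than twice the space of the original text.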

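I also found that an email address in the text could not be found unless the query included the part after the @. This is probably caused by the tokenizer in use, and I did not know how to define custom delimiters.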

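Later I switched to version 4.8. The index is now only slightly larger than the original text, and keywords without the @ part can be found as well.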
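On the question of custom delimiters: one possible approach in Lucene.Net 4.8 is to subclass `CharTokenizer` and wrap it in a small `Analyzer`. The sketch below is only an illustration under assumptions. The names `AtSignTokenizer`, `AtSignAnalyzer` and `AtSignDemo` are hypothetical, the Lucene.Net.Analysis.Common package is assumed to be referenced, and splitting on `@` and whitespace is just one choice of delimiters.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical tokenizer: every character except '@' and whitespace belongs to a token.
public sealed class AtSignTokenizer : CharTokenizer
{
    public AtSignTokenizer(LuceneVersion matchVersion, TextReader input)
        : base(matchVersion, input) { }

    protected override bool IsTokenChar(int c)
    {
        if (c == '@') return false;
        // c is a Unicode code point; only cast to char when it fits in one.
        return !(c <= char.MaxValue && char.IsWhiteSpace((char)c));
    }
}

// Hypothetical analyzer: the tokenizer above followed by lower-casing.
public sealed class AtSignAnalyzer : Analyzer
{
    private readonly LuceneVersion _matchVersion;

    public AtSignAnalyzer(LuceneVersion matchVersion)
    {
        _matchVersion = matchVersion;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new AtSignTokenizer(_matchVersion, reader);
        TokenStream result = new LowerCaseFilter(_matchVersion, source);
        return new TokenStreamComponents(source, result);
    }
}

public static class AtSignDemo
{
    // Runs the analyzer over a string and collects the produced terms.
    public static List<string> Tokenize(Analyzer analyzer, string text)
    {
        var tokens = new List<string>();
        using (TokenStream ts = analyzer.GetTokenStream("content", text))
        {
            var term = ts.AddAttribute<ICharTermAttribute>();
            ts.Reset();
            while (ts.IncrementToken())
                tokens.Add(term.ToString());
            ts.End();
        }
        return tokens;
    }
}
```

Calling `AtSignDemo.Tokenize(new AtSignAnalyzer(LuceneVersion.LUCENE_48), "Zhaoshu0997@163.com hello")` should yield the terms `zhaoshu0997`, `163.com` and `hello`. To try such an analyzer with the helper class in this post, the `StandardAnalyzer` fields and parameters would have to be typed as `Analyzer`, and the same analyzer must be used for both indexing and querying.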

One problem remains: Chinese text in the search results comes back garbled, and queries in Chinese do not work either.

 

 

Neither NLuke nor Luke was able to open the index files.

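The IndexWriter constructor has another overload whose third parameter is a bool. Passing true means "create the index if it does not exist, overwrite it if it does"; passing false means "fail if it does not exist, append if it does". Neither is convenient, because what we need is "create if it does not exist, append if it does". How can that be achieved? Simply omit the parameter, and that is the behavior you get.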
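In Lucene.Net 4.8 the same choice is expressed through `IndexWriterConfig.OpenMode`, whose default value is `OpenMode.CREATE_OR_APPEND`. The following is a minimal sketch that sets it explicitly, assuming Lucene.Net 4.8 and using an in-memory `RAMDirectory` so that it does not touch the disk. `OpenModeDemo` and `AppendLine` are hypothetical names introduced only for this illustration.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class OpenModeDemo
{
    private const LuceneVersion Version = LuceneVersion.LUCENE_48;

    // Adds one line as a document and returns the number of documents now in the index.
    public static int AppendLine(Lucene.Net.Store.Directory directory, string line)
    {
        var config = new IndexWriterConfig(Version, new StandardAnalyzer(Version))
        {
            // Create the index if it is missing, otherwise append to it.
            // This is also the default, so the assignment is only for clarity.
            OpenMode = OpenMode.CREATE_OR_APPEND
        };

        using (var writer = new IndexWriter(directory, config))
        {
            writer.AddDocument(new Document
            {
                new TextField("content", line, Field.Store.NO)
            });
            writer.Commit();
        }

        // Reopen the index read-only to count the documents.
        using (var reader = DirectoryReader.Open(directory))
        {
            return reader.NumDocs;
        }
    }
}
```

Calling `AppendLine` twice on the same `RAMDirectory` creates the index on the first call and appends on the second, so the returned counts are 1 and then 2. `OpenMode.CREATE` and `OpenMode.APPEND` correspond to the true and false cases of the old bool parameter.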
 
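// Required namespaces (in LINQPad, add them under Query Properties > Additional Namespace Imports):
// System.Diagnostics, System.Text, Lucene.Net.Analysis.Standard, Lucene.Net.Documents,
// Lucene.Net.Index, Lucene.Net.QueryParsers.Classic, Lucene.Net.Search, Lucene.Net.Store, Lucene.Net.Util
void Main()
{
    string idxpath = @"D:\data\DB\txt\index\";
    string dir = @"D:\data\DB\txt\search\tianya\";
    // TODO: how can the text be tokenized on '@'?
    string keyword = "zhaoshu0997";
    Utils.FullSearch.FileHelper filehelper = new Utils.FullSearch.FileHelper(idxpath);
    // Uncomment to (re)build the index before searching.
    //filehelper.BuildIndex(dir);
    Utils.FullSearch.SearchResults results = filehelper.Search(keyword);
    Console.WriteLine(keyword);
    results.Dump();
}

namespace Utils.FullSearch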
{
    public class SearchResults{
        public int TotalHits{get;set;}
        public List<Hit> SearchContents{get;set;}
    }
     
    public class Hit{
        public float Score{get;set;}
        public string Content{get;set;}
    }
     
    public class FileHelper
    {
        private const LuceneVersion MATCH_LUCENE_VERSION= LuceneVersion.LUCENE_48;
        private const string Field_Name= "content";
        private const int Results_Per_Page = 10;
        //private  IndexWriter writer;
        private  StandardAnalyzer analyzer;
        private  QueryParser queryParser;
        //private  SearcherManager searchManager;
        private string _indexPath;
         
        private StandardAnalyzer SetupAnalyzer() => new StandardAnalyzer(MATCH_LUCENE_VERSION);
        private QueryParser SetupQueryParser(StandardAnalyzer analyzer) => new QueryParser(MATCH_LUCENE_VERSION, Field_Name, analyzer);
         
        public FileHelper(string indexPath)
        {
            analyzer = SetupAnalyzer();
            queryParser = SetupQueryParser(analyzer);
            _indexPath = indexPath;
             
        }
         
        public void BuildIndex(string dir)
        {
            var watch = Stopwatch.StartNew();
            List<string> fpaths = FindFile(dir);

            // No OpenMode is set, so the default (create if missing, append otherwise) applies.
            // The using block disposes the writer, which releases the index write lock.
            using (IndexWriter writer = new IndexWriter(FSDirectory.Open(_indexPath), new IndexWriterConfig(MATCH_LUCENE_VERSION, analyzer)))
            {
                foreach (string fpath in fpaths)
                {
                    // Stream the file line by line instead of loading it whole;
                    // each line becomes one document.
                    foreach (string content in File.ReadLines(fpath, Encoding.UTF8))
                    {
                        Document doc = new Document
                        {
                            new TextField(Field_Name, content, Field.Store.YES)
                        };
                        writer.AddDocument(doc);
                    }
                    ($"elapsed after indexing {fpath}: {watch.ElapsedMilliseconds/1000.0} seconds").Dump();
                }

                // Commit flushes pending documents and makes them visible to new readers.
                writer.Commit();
            }
            watch.Stop();
            ($"index time for {dir}: {watch.ElapsedMilliseconds/1000.0} seconds").Dump();
        }
         
        // Returns the full paths of all files under sSourcePath, including subdirectories.
        public static List<string> FindFile(string sSourcePath)
        {
            List<string> list = new List<string>();
            DirectoryInfo theFolder = new DirectoryInfo(sSourcePath);
            // AllDirectories covers both the top-level files and every nested folder.
            foreach (FileInfo nextFile in theFolder.GetFiles("*.*", SearchOption.AllDirectories))
                list.Add(nextFile.FullName);
            return list;
        }
             
        public SearchResults Search(string queryString)
        {
            var watch = Stopwatch.StartNew();
            Query query = queryParser.Parse(queryString);

            // Searching only needs a read-only view of the index,
            // so no IndexWriter (and no write lock) is involved.
            using (FSDirectory directory = FSDirectory.Open(_indexPath))
            using (DirectoryReader reader = DirectoryReader.Open(directory))
            {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs topDocs = searcher.Search(query, Results_Per_Page);
                SearchResults searchResults = new SearchResults() { TotalHits = topDocs.TotalHits, SearchContents = new List<Hit>() };
                foreach (var result in topDocs.ScoreDocs)
                {
                    // Load the stored fields of the matching document.
                    Document document = searcher.Doc(result.Doc);
                    Hit searchResult = new Hit
                    {
                        Score = result.Score,
                        Content = document.GetField(Field_Name)?.GetStringValue()
                    };
                    searchResults.SearchContents.Add(searchResult);
                }
                ($"search time for {queryString}: {watch.ElapsedMilliseconds/1000.0} seconds").Dump();
                return searchResults;
            }
        }
    }
}

  

Posted by 白马酒凉
